acl acl2011 acl2011-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). [sent-5, score-1.033]
2 We model the feature space with a log-linear combination of multiple mixture components. [sent-6, score-0.966]
3 Each component contains a large set of features trained in a maximum-entropy framework. [sent-7, score-0.258]
4 All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. [sent-8, score-2.604]
5 This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. [sent-9, score-0.332]
6 It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. [sent-10, score-0.25]
7 The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task. [sent-11, score-0.099]
8 1 Introduction Significant progress has been made in statistical machine translation (SMT) in recent years. [sent-12, score-0.158]
9 , 2003) has become the widely adopted one in SMT due to its capability of capturing local context information from adjacent words. [sent-14, score-0.039]
10 There exists a significant amount of work focused on improving translation performance with better features. [sent-15, score-0.099]
11 The feature set can be either small (on the order of 10) or large (up to millions). [sent-16, score-0.104]
12 , 2003) is a widely known one using a small number of features in a maximum-entropy (log-linear) model (Och and Ney, 2002). [sent-18, score-0.147]
13 The features include phrase translation probabilities, lexical probabilities, number of phrases, and language model scores, etc. [sent-19, score-0.242]
14 The feature weights are usually optimized with minimum error rate training (MERT) as in (Och, 2003). [sent-20, score-0.341]
15 Besides the MERT-based feature weight optimization, there exist other alternative discriminative training methods for MT, such as in (Tillmann and Zhang, 2006; Liang et al. [sent-21, score-0.297]
16 However, scalability is a challenge for these approaches, where all possible translations of each training example need to be searched, which is computationally expensive. [sent-24, score-0.051]
17 , 2009), there are 11K syntactic features proposed for a hierarchical phrase-based system. [sent-26, score-0.071]
18 The feature weights are trained with the Margin Infused Relaxed Algorithm (MIRA) efficiently on a forest of translations from a development set. [sent-27, score-0.263]
19 Even though significant improvement has been obtained compared to the baseline that has a small number of features, it is hard to apply the same approach to millions of features due to the data sparseness issue, since the development set is usually small. [sent-28, score-0.295]
20 In (Ittycheriah and Roukos, 2007), a maximum entropy (ME) model is proposed, which utilizes millions of features. [sent-29, score-0.214]
21 All the feature weights are trained with a maximum-likelihood (ML) approach on the full training corpus. [sent-30, score-0.314]
22 However, the estimation of feature weights has no direct connection with the final translation performance. (Page footer: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, June 19-24, 2011.) [sent-32, score-0.334]
23 In this paper, we propose a hybrid framework, a discriminative mixture model, to bridge the gap between the ML training and the discriminative training for SMT. [sent-35, score-1.123]
24 In Section 2, we briefly review the ME baseline of this work. [sent-36, score-0.076]
25 In Section 3, we introduce the discriminative mixture model that combines various types of features. [sent-37, score-0.912]
26 In Section 4, we present experimental results on a large-scale Arabic-English MT task, with a focus on feature combination, alignment combination, and domain adaptation, respectively. [sent-38, score-0.298]
27 2 Maximum-Entropy Model for MT In this section we give a brief review of a special maximum-entropy (ME) model as introduced in (Ittycheriah and Roukos, 2007). [sent-40, score-0.037]
28 The model has the following form, $p(t,j|s) = \frac{p_0(t,j|s)}{Z(s)} \exp \sum_i \lambda_i \phi_i(t,j,s)$, (1) where s is a source phrase, and t is a target phrase. [sent-41, score-0.19]
29 j is the jump distance from the previously translated source word to the current source word. [sent-42, score-0.41]
30 During training j can vary widely due to automatic word alignment in the parallel corpus. [sent-43, score-0.261]
31 To limit the sparseness created by long jumps, j is capped to a window of source words (-5 to 5 words) around the last translated source word. [sent-44, score-0.224]
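The capping step described above can be sketched in a few lines. This is a minimal illustration that assumes "capped" means clamping the jump to the nearest window boundary; the function name `cap_jump` and the `window` parameter are hypothetical and not taken from the paper.

```python
def cap_jump(jump: int, window: int = 5) -> int:
    """Clamp a jump distance to the range [-window, window].

    Assumption: jumps outside the window are mapped to the nearest
    boundary, so all long jumps share a single feature value.
    """
    return max(-window, min(window, jump))


# Long jumps in either direction collapse onto the window boundaries.
capped = [cap_jump(j) for j in (-12, -5, 0, 3, 9)]
print(capped)
```

Clamping keeps the number of distinct jump values small (11 for a window of 5), which is what limits the sparseness of jump-dependent features.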
32 In Eq. (1), p0 is a prior distribution, Z is a normalizing term, and φi(t,j,s) are the features of the model, each being a binary question asked about the source, distortion, and target information. [sent-47, score-0.171]
33 The feature weights λi can be estimated with the Improved Iterative Scaling (IIS) algorithm (Della Pietra et al. [sent-48, score-0.235]
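To make Eq. (1) concrete, here is a small sketch that evaluates the model over an explicit candidate set. It assumes the normalizer Z(s) sums over a finite list of (target phrase, jump) candidates; the toy source phrase, candidates, features, and weights are invented for illustration, and the weights are simply given rather than estimated with IIS.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Candidate = Tuple[str, int]               # (target phrase t, capped jump j)
Feature = Callable[[str, int, str], int]  # binary question about (t, j, s)


def me_distribution(
    source: str,
    candidates: List[Candidate],
    features: Sequence[Feature],
    weights: Sequence[float],
    prior: Dict[Candidate, float],
) -> Dict[Candidate, float]:
    """Evaluate p(t,j|s) = p0(t,j|s) / Z(s) * exp(sum_i lambda_i * phi_i(t,j,s))."""
    unnormalized = {}
    for t, j in candidates:
        activation = sum(w * f(t, j, source) for f, w in zip(features, weights))
        unnormalized[(t, j)] = prior[(t, j)] * math.exp(activation)
    z = sum(unnormalized.values())  # Z(s): normalizer over the candidate set
    return {c: v / z for c, v in unnormalized.items()}


# Hypothetical toy setup: one source phrase, three candidates, two binary features.
source = "kitab"
candidates = [("book", 1), ("the book", 1), ("book", -2)]
features = [
    lambda t, j, s: int(t == "book"),  # target-side question
    lambda t, j, s: int(j == 1),       # distortion question
]
weights = [1.0, 0.5]
prior = {c: 1.0 / len(candidates) for c in candidates}  # uniform p0

dist = me_distribution(source, candidates, features, weights, prior)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

In the actual system the features number in the millions, so the sums would be computed over sparse active features rather than by calling every feature function.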
34 1 Mixture Model Now we introduce the discriminative mixture model. [sent-51, score-0.838]
35 Suppose we partition the feature space into multiple clusters (details in Section 3. [sent-52, score-0.223]
36 Let the probability of target phrase and jump given certain source phrase for cluster k be $p_k(t,j|s) = \frac{1}{Z_k(s)} \exp \sum_i \lambda_{ki} \phi_{ki}(t,j,s)$, (2) where Zk is a normalizing factor for cluster k. [sent-54, score-0.582]
37 We propose a log-linear mixture model as shown in Eq. (3). [sent-55, score-0.733]
38 It can be rewritten in the log domain as $\log p(t,j|s) = \log \frac{p_0(t,j|s)}{Z(s)} + \sum_k w_k \log p_k(t,j|s) = \log \frac{p_0(t,j|s)}{Z(s)} - \sum_k w_k \log Z_k(s) + \sum_k w_k \sum_i \lambda_{ki} \phi_{ki}(t,j,s)$. (4) [sent-58, score-0.056]
39 The individual feature weights λki for the i-th feature in cluster k are estimated in the maximum-entropy framework as in the baseline model. [sent-59, score-0.531]
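Eq. (3) itself did not survive in this extract. Exponentiating the first form of Eq. (4) and using the component definition in Eq. (2) suggests the following shape; this is a reconstruction from the neighboring equations, not a quotation from the paper:

```latex
p(t,j|s) \;=\; \frac{p_0(t,j|s)}{Z(s)} \prod_k p_k(t,j|s)^{w_k} \qquad (3)
```

Taking the logarithm turns the product into the weighted sum $\sum_k w_k \log p_k(t,j|s)$, and substituting Eq. (2) for each $p_k$ yields the $-\sum_k w_k \log Z_k(s)$ and $\sum_k w_k \sum_i \lambda_{ki}\phi_{ki}(t,j,s)$ terms of Eq. (4).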
40 However, the mixture weights wk can be optimized directly towards the translation evaluation metric, such as BLEU (Papineni et al. [sent-60, score-1.015]
41 Note that the number of mixture components is relatively small (less than 10) compared to millions of features in baseline. [sent-64, score-0.921]
42 Hence the optimization can be conducted easily to generate reliable mixture weights for decoding with MERT (Och, 2003) or other optimization algorithms, such as the Simplex Armijo Downhill algorithm proposed in (Zhao and Chen, 2009). [sent-65, score-0.924]
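Because there are only a handful of mixture weights, even a crude search over them is feasible. The sketch below scores candidates with the weighted sum of component log-probabilities from Eq. (4) and tunes the weights by exhaustive grid search on a toy development set. The grid search is a stand-in, not MERT or the Simplex Armijo Downhill algorithm, and counting correct 1-best picks is a stand-in for BLEU; all names and numbers are hypothetical.

```python
import itertools
from typing import Dict, List, Sequence, Tuple


def mixture_log_score(
    component_log_probs: Sequence[float],
    mixture_weights: Sequence[float],
    log_prior: float = 0.0,
) -> float:
    """log p0 + sum_k w_k * log p_k, i.e. Eq. (4) without the log Z(s) term.

    Z(s) is the same for every candidate of a given source phrase, so
    dropping it does not change which candidate scores highest.
    """
    return log_prior + sum(
        w * lp for w, lp in zip(mixture_weights, component_log_probs)
    )


def pick_best(candidates: Dict[str, List[float]], weights: Sequence[float]) -> str:
    """Return the candidate with the highest mixture score."""
    return max(candidates, key=lambda name: mixture_log_score(candidates[name], weights))


def tune_weights(
    dev_set: List[Tuple[Dict[str, List[float]], str]],
    grid: Sequence[float],
    num_components: int,
) -> Tuple[Tuple[float, ...], int]:
    """Grid search for the weights that get the most dev examples right."""
    best_weights, best_correct = None, -1
    for weights in itertools.product(grid, repeat=num_components):
        correct = sum(pick_best(cands, weights) == ref for cands, ref in dev_set)
        if correct > best_correct:
            best_weights, best_correct = weights, correct
    return best_weights, best_correct


# Toy dev set: per candidate, log-probabilities from two mixture components.
dev_set = [
    ({"a": [-1.0, -3.0], "b": [-2.0, -0.5]}, "b"),
    ({"x": [-0.5, -2.5], "y": [-1.5, -1.0]}, "y"),
]
tuned_weights, num_correct = tune_weights(dev_set, grid=(0.0, 0.5, 1.0), num_components=2)
print(tuned_weights, num_correct)
```

With fewer than 10 components the search space stays small, which is the property the text relies on when it says reliable mixture weights can be obtained from a small development set.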
43 2 Partition of Feature Space Given the proposed mixture model, how to split the feature space into multiple regions becomes crucial. [sent-67, score-0.842]
44 In order to surpass the baseline model, where all features can be viewed as existing in a single mixture component, the separated mixture components should be complementary to each other. [sent-68, score-1.613]
45 In this work, we explore three different ways of partitioning the feature space, based on either feature types, word alignment types, or the domain of training data. [sent-69, score-0.349]
46 They fire only if the left source is open (untranslated) or the right source is closed. [sent-71, score-0.184]
47 All the features falling in the same feature category/cluster are tied to each other to share the same mixture weights at the upper level as in Eq. [sent-72, score-1.099]
48 Besides the feature-type-based clustering, we can also divide the feature space based on word alignment types, such as supervised alignment versus unsupervised alignment (to be described in the experiment section). [sent-74, score-0.56]
49 For each type of word alignment, we build a mixture component with millions of ME features. [sent-75, score-0.894]
50 On the task of domain adaptation, we can also split the training data based on their domain/resources, with each mixture component representing a specific domain. [sent-76, score-0.893]
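The three partitioning schemes above share one mechanism: group items by a key, then train one mixture component per group. A minimal sketch follows, in which the tuple layouts, key functions, and example names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def partition(items: Iterable[T], key: Callable[[T], Hashable]) -> Dict[Hashable, List[T]]:
    """Group items into clusters; each cluster would back one mixture component."""
    clusters: Dict[Hashable, List[T]] = defaultdict(list)
    for item in items:
        clusters[key(item)].append(item)
    return dict(clusters)


# Hypothetical feature records: (feature type, alignment source, feature id).
feature_records = [
    ("lexical", "ME", "f1"),
    ("lexical", "GIZA++", "f2"),
    ("distortion", "ME", "f3"),
    ("distortion", "HMM", "f4"),
]
by_type = partition(feature_records, key=lambda f: f[0])
by_alignment = partition(feature_records, key=lambda f: f[1])

# Hypothetical training records: (domain, source sentence id).
training_records = [("newswire", 1), ("weblog", 2), ("UN", 3), ("newswire", 4)]
by_domain = partition(training_records, key=lambda r: r[0])

print(sorted(by_type), sorted(by_alignment), sorted(by_domain))
```

All items in one cluster are tied to a single mixture weight, so the choice of key decides what the discriminative tuning step can trade off.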
51 The training data includes the UN parallel corpus and LDC-released parallel corpora, with about 10M sentence pairs and 300M words in total (counted at the English side). [sent-79, score-0.117]
52 For each sentence in the training, three types of word alignments are created: maximum entropy alignment (Ittycheriah and Roukos, 2005), GIZA++ alignment (Och and Ney, 2000), and HMM alignment (Vogel et al. [sent-80, score-0.58]
53 Our tuning and test sets are extracted from the GALE DEV10 Newswire set, with no overlap between tuning and test. [sent-82, score-0.08]
54 There are 1063 sentences (168 documents) in the tuning set, and 1089 sentences (168 documents) in the test set. [sent-83, score-0.04]
55 Both sets have one reference translation for each sentence. [sent-84, score-0.099]
56 Instead of using all the training data, we sample the training corpus based on the tuning/test set to train the systems more efficiently. [sent-85, score-0.102]
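The extract does not say how the sampling is done. One common approach, shown here purely as an assumption, keeps a training pair when its source side shares at least one n-gram with the tuning or test source sentences. The function names and toy sentences are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def subsample(
    training_pairs: List[Tuple[str, str]],
    target_sources: List[str],
    n: int = 2,
) -> List[Tuple[str, str]]:
    """Keep (source, target) pairs whose source shares an n-gram with the tuning/test set."""
    wanted: Set[Tuple[str, ...]] = set()
    for sentence in target_sources:
        wanted |= ngrams(sentence.split(), n)
    return [
        (src, tgt)
        for src, tgt in training_pairs
        if ngrams(src.split(), n) & wanted
    ]


# Toy corpus with placeholder tokens standing in for source-language words.
training_pairs = [("a b c", "x"), ("d e f", "y"), ("b c d", "z")]
selected = subsample(training_pairs, target_sources=["z a b"], n=2)
print(selected)
```

Only the first pair survives here because it is the only one containing the bigram ("a", "b") from the target-set sentence.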
57 A 5-gram language model is trained from the English Gigaword corpus and the English portion of the parallel corpus used in the translation model training. [sent-88, score-0.234]
58 In this work, the decoding weights for both the baseline and the mixture model are tuned with the Simplex Armijo Downhill algorithm (Zhao and Chen, 2009) towards the maximum BLEU. [sent-89, score-1.041]
59 [Table 1; columns: System, Features, BLEU; surviving caption fragment: "(F1 to F8), baseline, or mixture model."] [sent-90, score-0.696]
60 The translation results on the test set from the baseline and the mixture model are listed in Table 1. [sent-94, score-0.908]
61 The MT performance is measured with the widely adopted BLEU metric. [sent-95, score-0.039]
62 We also evaluate the systems that utilize only one of the mixture components (F1 to F8). [sent-96, score-0.742]
63 The number of features used in each system is also listed in the table. [sent-97, score-0.071]
64 As we can see, when using all 18M features in the baseline model, without mixture weighting, the baseline achieved 3. [sent-98, score-0.946]
65 Since there are exactly the same number of features in the baseline and mixture model, the better performance is due to two facts: separate training of the feature weights λ within each mixture component; the discriminative training of mixture weights w. [sent-103, score-2.845]
66 The first one allows better parameter estimation, given that the number of features in each mixture component is much smaller than that in the baseline. [sent-104, score-0.857]
67 The second factor connects the mixture weighting to the final translation performance directly. [sent-105, score-0.84]
68 In the baseline, all feature weights are trained together solely under the maximum likelihood criterion, with no differentiation of the various types of features in terms of their contribution to the translation performance. [sent-106, score-0.504]
69 3 Alignment Combination In the baseline mentioned above, three types of word alignments are used (via corpus concatenation) for phrase extraction and feature training. [sent-110, score-0.312]
70 Given the mixture model structure, we can apply it to an alignment combination problem. [sent-111, score-0.924]
71 With the phrase table extracted from all the alignments, we train three feature mixture components, each on one type of alignments. [sent-112, score-0.835]
72 Each mixture component contains millions of features from all feature types described in Section 3. [sent-113, score-1.106]
73 Again, the mixture weights are optimized towards the maximum BLEU. [sent-115, score-0.95]
74 3 minor gain compared to extracting features from ME alignment only (note that phrases are from all the alignments). [sent-118, score-0.263]
75 With the mixture model, we can achieve another 0. [sent-119, score-0.696]
76 5 gain compared to the baseline, especially with a smaller number of features. [sent-120, score-0.054]
77 This presents a new way of doing alignment combination in the feature space instead of in the usual phrase space. [sent-121, score-0.404]
78 4 Domain Adaptation Another popular task in SMT is domain adaptation (Foster et al. [sent-125, score-0.13]
79 It tries to take advantage of any out-of-domain training data by combining them with the in-domain data in an appropriate way. [sent-127, score-0.051]
80 In our sub-sampled training corpus, there exist three subsets: newswire (1M sentences), weblog (200K), and UN data (300K). [sent-128, score-0.108]
81 We train three mixture components, each on one of the training subsets. [sent-129, score-0.747]
82 The baseline that was trained on all the data achieved 0. [sent-131, score-0.131]
83 5 gain compared to using the newswire training data alone (understandably it is the best component given the newswire test data). [sent-132, score-0.309]
84 Note that since the baseline is trained on subsampled training data, there is already a certain domain adaptation effect involved. [sent-133, score-0.285]
85 On top of that, the mixture model results in another 0. [sent-134, score-0.733]
86 All the improvements in the mixture models above against the baseline are statistically significant with p-value < 0. [sent-136, score-0.772]
87 5 Conclusion In this paper we presented a novel discriminative mixture model for bridging the gap between the maximum-likelihood training and the discriminative training in SMT. [sent-138, score-1.207]
88 The features in each region are tied together to share the same mixture weights that are optimized towards the maximum BLEU scores. [sent-140, score-1.118]
89 It was shown that the same model structure can be effectively applied to feature combination, alignment combination and domain adaptation. [sent-141, score-0.388]
90 For example, we can cluster the features based on both feature types and alignments. [sent-143, score-0.259]
91 Further improvement may be achieved with other feature space partition approaches in the future. [sent-144, score-0.25]
92 A discriminative latent variable model for statistical machine translation. [sent-149, score-0.238]
93 Discriminative instance weighting for domain adaptation in statistical machine translation. [sent-161, score-0.204]
94 A maximum entropy word aligner for Arabic-English machine translation. [sent-165, score-0.098]
95 Discriminative training and maximum entropy models for statistical machine translations. [sent-185, score-0.179]
96 Bleu: a method for automatic evaluation of machine translation. [sent-193, score-0.029]
97 Measuring confidence intervals for the machine translation evaluation metrics. [sent-205, score-0.128]
98 A simplex armijo downhill algorithm for optimizing statistical machine translation decoding parameters. [sent-209, score-0.478]
wordName wordTfidf (topN-words)
[('mixture', 0.696), ('jump', 0.226), ('ittycheriah', 0.178), ('discriminative', 0.142), ('alignment', 0.138), ('weights', 0.131), ('ki', 0.13), ('systemfeaturesbleu', 0.117), ('millions', 0.108), ('feature', 0.104), ('armijo', 0.103), ('downhill', 0.103), ('translation', 0.099), ('zt', 0.095), ('source', 0.092), ('component', 0.09), ('abraham', 0.089), ('simplex', 0.081), ('examine', 0.078), ('partition', 0.077), ('och', 0.076), ('baseline', 0.076), ('adaptation', 0.074), ('bleu', 0.073), ('features', 0.071), ('maximumentropy', 0.069), ('expxi', 0.069), ('tied', 0.063), ('mt', 0.062), ('target', 0.061), ('alignments', 0.06), ('newswire', 0.057), ('tillmann', 0.056), ('smt', 0.056), ('domain', 0.056), ('optimized', 0.055), ('vogel', 0.054), ('jumps', 0.054), ('gain', 0.054), ('combination', 0.053), ('roukos', 0.053), ('training', 0.051), ('ml', 0.05), ('zhao', 0.049), ('cluster', 0.047), ('bridging', 0.047), ('components', 0.046), ('weighting', 0.045), ('pietra', 0.045), ('della', 0.045), ('franz', 0.044), ('space', 0.042), ('salim', 0.041), ('gap', 0.041), ('xk', 0.041), ('tuning', 0.04), ('christoph', 0.04), ('blunsom', 0.04), ('sparseness', 0.04), ('widely', 0.039), ('normalizing', 0.039), ('un', 0.039), ('josef', 0.038), ('hermann', 0.038), ('types', 0.037), ('model', 0.037), ('foster', 0.037), ('mert', 0.036), ('entropy', 0.035), ('phrase', 0.035), ('ney', 0.035), ('towards', 0.034), ('bing', 0.034), ('bxi', 0.034), ('ofacl', 0.034), ('aotf', 0.034), ('feea', 0.034), ('ofmultiple', 0.034), ('perforproceedings', 0.034), ('shengyuan', 0.034), ('maximum', 0.034), ('share', 0.034), ('decoding', 0.033), ('stephan', 0.033), ('parallel', 0.033), ('usual', 0.032), ('optimization', 0.032), ('roland', 0.032), ('koehn', 0.031), ('statistical', 0.03), ('teh', 0.03), ('infused', 0.03), ('machine', 0.029), ('surpass', 0.028), ('heights', 0.028), ('yorktown', 0.028), ('trained', 0.028), ('papineni', 0.028), ('achieved', 0.027), ('sydney', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
2 0.18874837 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation
Author: Nan Duan ; Mu Li ; Ming Zhou
Abstract: This paper presents hypothesis mixture decoding (HM decoding), a new decoding scheme that performs translation reconstruction using hypotheses generated by multiple translation systems. HM decoding involves two decoding stages: first, each component system decodes independently, with the explored search space kept for use in the next step; second, a new search space is constructed by composing existing hypotheses produced by all component systems using a set of rules provided by the HM decoder itself, and a new set of model independent features are used to seek the final best translation from this new search space. Few assumptions are made by our approach about the underlying component systems, enabling us to leverage SMT models based on arbitrary paradigms. We compare our approach with several related techniques, and demonstrate significant BLEU improvements in large-scale Chinese-to-English translation tasks.
3 0.18842532 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.
4 0.14948952 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
Author: Coskun Mermer ; Murat Saraclar
Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.
5 0.14034894 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
Author: Nadir Durrani ; Helmut Schmid ; Alexander Fraser
Abstract: We present a novel machine translation model which models translation by a linear sequence of operations. In contrast to the “N-gram” model, this sequence includes not only translation but also reordering operations. Key ideas of our model are (i) a new reordering approach which better restricts the position to which a word or phrase can be moved, and is able to handle short and long distance reorderings in a unified way, and (ii) a joint sequence model for the translation and reordering probabilities which is more flexible than standard phrase-based MT. We observe statistically significant improvements in BLEU over Moses for German-to-English and Spanish-to-English tasks, and comparable results for a French-to-English task.
6 0.13419886 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
7 0.13071002 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
8 0.11827252 313 acl-2011-Two Easy Improvements to Lexical Weighting
9 0.11812809 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
10 0.11424962 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
11 0.10407244 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
12 0.10366201 141 acl-2011-Gappy Phrasal Alignment By Agreement
13 0.10327744 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
14 0.10314201 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
15 0.10303375 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
16 0.099159218 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation
17 0.097558044 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
18 0.096727185 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
19 0.096699834 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
20 0.093851149 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
topicId topicWeight
[(0, 0.234), (1, -0.133), (2, 0.114), (3, 0.112), (4, 0.063), (5, 0.005), (6, 0.009), (7, 0.028), (8, 0.032), (9, 0.113), (10, 0.102), (11, 0.015), (12, 0.009), (13, -0.037), (14, -0.031), (15, 0.034), (16, -0.041), (17, -0.022), (18, -0.077), (19, -0.022), (20, -0.03), (21, -0.112), (22, -0.002), (23, 0.017), (24, -0.028), (25, 0.002), (26, 0.029), (27, -0.01), (28, -0.048), (29, 0.027), (30, -0.03), (31, 0.03), (32, -0.054), (33, 0.003), (34, 0.01), (35, 0.022), (36, -0.043), (37, -0.02), (38, 0.026), (39, -0.005), (40, 0.094), (41, 0.035), (42, 0.011), (43, -0.032), (44, 0.043), (45, 0.066), (46, 0.089), (47, 0.054), (48, -0.089), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.95765889 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
2 0.79529959 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan
Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.
3 0.74842638 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.
4 0.74145764 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation
Author: Nan Duan ; Mu Li ; Ming Zhou
Abstract: This paper presents hypothesis mixture decoding (HM decoding), a new decoding scheme that performs translation reconstruction using hypotheses generated by multiple translation systems. HM decoding involves two decoding stages: first, each component system decodes independently, with the explored search space kept for use in the next step; second, a new search space is constructed by composing existing hypotheses produced by all component systems using a set of rules provided by the HM decoder itself, and a new set of model independent features are used to seek the final best translation from this new search space. Few assumptions are made by our approach about the underlying component systems, enabling us to leverage SMT models based on arbitrary paradigms. We compare our approach with several related techniques, and demonstrate significant BLEU improvements in large-scale Chinese-to-English translation tasks.
5 0.73824131 220 acl-2011-Minimum Bayes-risk System Combination
Author: Jesus Gonzalez-Rubio ; Alfons Juan ; Francisco Casacuberta
Abstract: We present minimum Bayes-risk system combination, a method that integrates consensus decoding and system combination into a unified multi-system minimum Bayes-risk (MBR) technique. Unlike other MBR methods that re-rank translations of a single SMT system, MBR system combination uses the MBR decision rule and a linear combination of the component systems’ probability distributions to search for the minimum risk translation among all the finite-length strings over the output vocabulary. We introduce expected BLEU, an approximation to the BLEU score that allows MBR to be applied efficiently in these conditions. MBR system combination is a general method that is independent of specific SMT models, enabling us to combine systems with heterogeneous structure. Experiments show that our approach brings significant improvements to single-system-based MBR decoding and achieves comparable results to different state-of-the-art system combination methods.
6 0.71238732 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
7 0.71061242 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
8 0.70486933 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
9 0.70128667 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
10 0.68874532 313 acl-2011-Two Easy Improvements to Lexical Weighting
11 0.68017429 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
12 0.64993376 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
13 0.64371037 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
14 0.64169496 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
15 0.63902313 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
16 0.63869542 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
17 0.62052763 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
18 0.62008518 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition
19 0.61851656 44 acl-2011-An exponential translation model for target language morphology
topicId topicWeight
[(5, 0.018), (17, 0.027), (26, 0.015), (37, 0.521), (39, 0.028), (41, 0.04), (55, 0.024), (59, 0.017), (72, 0.038), (91, 0.023), (96, 0.17)]
simIndex simValue paperId paperTitle
1 0.99326241 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions. 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many lations of a word may be valid, the MT system have a systematic bias. For example, the word some” might be prevalent in English reviews, transmight “awebut in 429 translated reviews, the word “excellent” is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. 
We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios has much higher variance than in monolingual scenarios. The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question 1 (§4) and Question 2 (§5).
2 Experiment Design
The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing T to a common test set and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer (2010) [1], due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books). Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments.
[Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: shortpapers, pages 429–433, Portland, Oregon, June 19-24, 2011. ©2011 Association for Computational Linguistics]
Texts in JP, FR, and DE are translated word-by-word into English with Google Translate. [2] We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections.
Table 1: Experiment setup: Fix T, vary S.
   Target (T)   Source (S)
1  Music-EN     Music-JP, Music-FR, Music-DE, DVD-EN, Book-EN
2  DVD-EN       DVD-JP, DVD-FR, DVD-DE, Music-EN, Book-EN
3  Book-EN      Book-JP, Book-FR, Book-DE, Music-EN, DVD-EN
3 How much performance degradation occurs in cross-lingual adaptation?
First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train an SVM classifier on labeled source data [3], and directly apply it on test data. The oracle setting, which has no domain mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8% [4]. Average cross-lingual accuracies are: 69.4% (JP), 75.6% (FR), 77.0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE). [5] Cross-market degradations are around -6% [6].
Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN).
Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than DE, but this could be due to phenomena other than MT errors.
[1] http://www.webis.de/research/corpora/webis-cls-10
[2] This is done by querying foreign words to build a bilingual dictionary. The words are converted to tfidf unigram features.
[3] For all methods we try here, 5% of the 2000 labeled source samples are held-out for parameter tuning.
[4] See column EN of Table 2, Supervised SVM results.
4 Where exactly is the domain mismatch?
4.1 Theory of Domain Adaptation
We analyze domain adaptation by the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let p_t(x, y) = p_t(y|x) p_t(x) be the target distribution of samples x (e.g. unigram feature vector) and labels y (positive / negative). Let p_s(x, y) = p_s(y|x) p_s(x) be the corresponding source distribution. We assume that one (or both) of the following distributions differ between source and target:
• Instance mismatch: p_s(x) ≠ p_t(x).
• Labeling mismatch: p_s(y|x) ≠ p_t(y|x).
Instance mismatch implies that the input feature vectors have different distributions (e.g. one dataset uses the word “excellent” often, while the other uses the word “awesome”). This degrades performance because classifiers trained on “excellent” might not know how to classify texts with the word “awesome.” The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000).
Labeling mismatch implies the same input has different labels in different domains. For example, the JP word meaning “excellent” may be mistranslated as “bad” in English. Then, positive JP reviews will be associated with the word “bad”: p_s(y = +1 | x = bad) will be high, whereas the true conditional distribution should have high p_t(y = −1 | x = bad) instead. There are several cases for labeling mismatch, depending on how the polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009).
Which mismatch is responsible for accuracy degradations in cross-lingual adaptation?
• Instance mismatch: Systematic MT bias generates word distributions different from naturally-occurring English. (Translation may be valid.)
• Label mismatch: MT error mis-translates a word into something with different polarity.
Conclusion from §4.2 and §4.3: Instance mismatch occurs often; MT error appears minimal.
[5] See “Adapt by Language” columns of Table 2. Note that the JP+FR+DE condition has 6000 labeled samples, so it is not directly comparable to other adaptation scenarios (2000 samples). Nevertheless, mixing languages seems to give good results.
[6] See “Adapt by Market” columns of Table 2.
Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target of Music-EN, the accuracy of an SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of an SVM trained on DVD reviews (in the same language) is 76.8. “Oracle” indicates training on the same market and same language domain as the target. “JP+FR+DE” indicates the concatenation of JP, FR, DE as source data. Boldface (rendered here as *) shows the winner of Supervised vs. Adapted.
                           Oracle  Adapt by Language                 Adapt by Market
Target    Classifier       EN      JP      FR      DE      JP+FR+DE  MUSIC   DVD     BOOK
MUSIC-EN  Supervised SVM   81.6*   68.5    74.6    77.9*   78.3      -       76.8    74.1
MUSIC-EN  Adapted TSVM     79.6    73.0*   75.2*   76.3    80.6*     -       78.4*   75.6*
DVD-EN    Supervised SVM   80.9    70.1    76.5*   76.3    78.4      75.2*   -       74.5
DVD-EN    Adapted TSVM     81.0*   71.4*   75.4    77.4*   79.7*     74.8    -       76.7*
BOOK-EN   Supervised SVM   80.0    69.6    77.6*   76.7    79.9*     73.4    76.2    -
BOOK-EN   Adapted TSVM     81.2*   73.8*   75.4    77.4*   79.5      75.1*   77.4*   -
Table 3: Label mismatch: mis-translating positive (+), negative (−), or neutral (0) words have different effects. We think the first two cases have graceful degradation, but the third case may be catastrophic.
Mis-translated polarity (the Effect column is garbled in the source and is not reproduced):
± → 0 (e.g. “good” → “the”)
0 → ± (e.g. “the” → “good”)
+ → − and − → + (e.g. “good” → “bad”)
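The covariate shift remark in Section 4.1 can be made concrete with a short derivation. This is a standard importance-weighting identity rather than something spelled out in the paper; the loss ℓ, the classifier f, and the weight w(x) are symbols introduced here for illustration, and the identity assumes p_s(x) > 0 wherever p_t(x) > 0 and that only instance mismatch is present, i.e. p_s(y|x) = p_t(y|x).

```latex
\begin{align}
\mathbb{E}_{(x,y)\sim p_t}\big[\ell(f(x),y)\big]
  &= \int \sum_{y} p_t(x)\, p_t(y \mid x)\, \ell(f(x),y)\, dx \\
  &= \int \sum_{y} p_s(x)\, \frac{p_t(x)}{p_s(x)}\, p_s(y \mid x)\, \ell(f(x),y)\, dx
     && \text{(using } p_t(y \mid x) = p_s(y \mid x)\text{)} \\
  &= \mathbb{E}_{(x,y)\sim p_s}\big[w(x)\, \ell(f(x),y)\big],
     \qquad w(x) = \frac{p_t(x)}{p_s(x)} .
\end{align}
```

Under these assumptions, minimizing the re-weighted source loss targets the same quantity as training on labeled target data. When labeling mismatch is also present, the second line no longer holds, and re-weighting alone cannot close the gap.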
4.2 Analysis of Instance Mismatch
To measure instance mismatch, we compute statistics between p_s(x) and p_t(x), or approximations thereof: First, we calculate a (normalized) average feature vector from all samples of source S, which represents the unigram distribution of MT output. Similarly, the average feature vector for target T approximates the unigram distribution of English reviews p_t(x). Then we measure:
• KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector.
• Set Coverage of Avg(T) on Avg(S): how many words (types) in T appear at least once in S.
Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, both statistically significant (p < 0.05). This implies that instance mismatch is an important reason for the degradations seen in Section 3. [7]
[Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points.]
[7] The observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74-78%) to some cross-lingual points. This suggests that MT output may be more constrained in vocabulary than naturally-occurring English.
4.3 Analysis of Labeling Mismatch
We measure labeling mismatch by looking at differences in the weight vectors of oracle SVM and adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM, but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip. Algorithm 1 (with K=2000) shows how we compute polarity flip rate. [8] We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: Labeling mismatch is not a factor in performance degradation.
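The statistics used in Sections 4.2 and 4.3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes raw unigram counts instead of the tfidf features the paper uses, an epsilon-smoothed KL divergence computed in the direction KL(target || source), and coverage reported as a fraction of target word types. All function names are hypothetical.

```python
import math
from collections import Counter


def average_unigram_distribution(docs):
    """Normalized unigram distribution over tokenized documents.

    Assumption: plain counts stand in for the paper's tfidf features.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union vocabulary, with additive smoothing.

    Assumption: eps-smoothing keeps the divergence finite when a word
    is missing from one side; the paper does not state its smoothing.
    """
    vocab = set(p) | set(q)
    if not vocab:
        return 0.0
    ps = {w: p.get(w, 0.0) + eps for w in vocab}
    qs = {w: q.get(w, 0.0) + eps for w in vocab}
    zp, zq = sum(ps.values()), sum(qs.values())
    return sum(
        (ps[w] / zp) * math.log((ps[w] / zp) / (qs[w] / zq)) for w in vocab
    )


def set_coverage(target_dist, source_dist):
    """Fraction of target word types that appear at least once in source."""
    target_types = {w for w, v in target_dist.items() if v > 0}
    if not target_types:
        return 0.0
    covered = sum(1 for w in target_types if source_dist.get(w, 0.0) > 0)
    return covered / len(target_types)


def polarity_flip_rate(w_source, w_target, avg_target, k):
    """Polarity flip rate following the steps of Algorithm 1.

    Weights are dicts mapping feature -> SVM weight; avg_target maps
    feature -> average target feature value (used for normalization).
    """
    # Step 1: scale weights by the average target feature value.
    ws = {f: avg_target.get(f, 0.0) * w for f, w in w_source.items()}
    wt = {f: avg_target.get(f, 0.0) * w for f, w in w_target.items()}
    # Steps 2-5: K most positive / most negative features on each side.
    s_pos = set(sorted(ws, key=ws.get, reverse=True)[:k])
    s_neg = set(sorted(ws, key=ws.get)[:k])
    t_pos = set(sorted(wt, key=wt.get, reverse=True)[:k])
    t_neg = set(sorted(wt, key=wt.get)[:k])
    # Steps 6-11: count features whose polarity is reversed.
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)
    # Step 12: normalize by the 2K features inspected.
    return flips / (2 * k)
```

With the paper's running example, a source corpus that says “excellent” where the target says “awesome” yields a positive KL divergence and a coverage below 1, even though no word has been mis-translated. The flip rate only rises when a feature sits among the top-K positive weights on one side and the top-K negative weights on the other.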
Nevertheless, we note there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked can be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. “amazon”, “human”, “moreover”).
[8] The feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.
Algorithm 1 Measuring labeling mismatch
Input: Weight vectors for source w_s and target w_t
Input: Target data average sample vector avg(T)
Output: Polarity flip rate f
1: Normalize: w_s = avg(T) * w_s ; w_t = avg(T) * w_t
2: Set S+ = { K most positive features in w_s }
3: Set S− = { K most negative features in w_s }
4: Set T+ = { K most positive features in w_t }
5: Set T− = { K most negative features in w_t }
6: for each feature i ∈ T+ do
7:   if i ∈ S− then f = f + 1
8: end for
9: for each feature j ∈ T− do
10:   if j ∈ S+ then f = f + 1
11: end for
12: f = f / (2K)
5 Are standard adaptation algorithms applicable to cross-lingual problems?
One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of preexisting adaptation algorithms for improving over supervised results. However, we argue that it may be better to “adapt” the standard adaptation algorithm to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like an SVM, but encourage the classification boundary to divide through low-density regions in the unlabeled target data.
Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way? Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem.
References
Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS).
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT).
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL).
Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc.
of the Association for Computational Linguistics (ACL).
Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4).
Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL).
Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.
2 0.97268742 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation
Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport
Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.
3 0.97017533 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai
Abstract: In this paper, we present a novel approach which incorporates web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data; performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.
same-paper 4 0.96847218 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
5 0.95523632 204 acl-2011-Learning Word Vectors for Sentiment Analysis
Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts
Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
6 0.95209104 122 acl-2011-Event Extraction as Dependency Parsing
7 0.95159936 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars
8 0.94798332 334 acl-2011-Which Noun Phrases Denote Which Concepts?
10 0.8609252 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
11 0.85622513 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
12 0.85389364 256 acl-2011-Query Weighting for Ranking Model Adaptation
13 0.8530947 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
14 0.83516181 85 acl-2011-Coreference Resolution with World Knowledge
15 0.83444208 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
16 0.83157122 292 acl-2011-Target-dependent Twitter Sentiment Classification
17 0.82780617 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
18 0.82418764 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning
19 0.82221621 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
20 0.81855571 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation