emnlp emnlp2013 emnlp2013-128 knowledge-graph by maker-knowledge-mining

128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training


Source: pdf

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

1 Introduction

Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which can not be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work.

As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013).

To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012), which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when a search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on.
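As a concrete illustration of this update scheme, here is a minimal Python sketch of a max-violation style perceptron update with latent derivations. It assumes that, for one sentence pair, ordinary beam search and forced decoding (which keeps only hypotheses whose output is a prefix of the reference) have each recorded, at every decoding step, the sparse feature vectors of the partial derivations they kept; the data layout, function names, and learning rate are illustrative assumptions, not the authors' implementation.

```python
def dot(weights, feats):
    """Sparse dot product between a weight dict and a feature-count dict."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def max_violation_update(weights, search_beams, gold_beams, eta=1.0):
    """One update in the spirit of the max-violation perceptron (Huang et al., 2012).

    search_beams[i]: feature dicts of the partial derivations kept by beam search at step i.
    gold_beams[i]:   feature dicts of the reference-reachable partial derivations at step i
                     (produced by forced decoding). Illustrative sketch only.
    """
    worst = None  # (violation, gold_feats, viterbi_feats)
    for beam, gold_beam in zip(search_beams, gold_beams):
        if not beam or not gold_beam:
            continue
        viterbi = max(beam, key=lambda f: dot(weights, f))    # model's best partial derivation
        gold = max(gold_beam, key=lambda f: dot(weights, f))  # best reachable partial derivation
        violation = dot(weights, viterbi) - dot(weights, gold)
        if worst is None or violation > worst[0]:
            worst = (violation, gold, viterbi)
    if worst is not None and worst[0] > 0:                    # update where the violation is largest
        _, gold, viterbi = worst
        for f, v in gold.items():
            weights[f] = weights.get(f, 0.0) + eta * v
        for f, v in viterbi.items():
            weights[f] = weights.get(f, 0.0) - eta * v
    return weights
```

The point mirrors the prose: the update is taken at the decoding step where the model's best partial derivation most outscores the best reference-reachable one, rather than waiting for complete derivations whose search errors may have compounded.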
When the reachable subset is small for some language pairs, we augment it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions:

1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set).

2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations.

3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions.

4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding.

5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting.

For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chinese-to-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features.

2 Phrase-Based MT and Forced Decoding

We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding.

2.1 Background: Phrase-based Decoding

We will use the following running example from Chinese to English from Mi et al. (2008):

[Figure 1: Standard beam-search phrase-based decoding.]

Bùshí yǔ Shālóng jǔxíng le huìtán
Bush with Sharon hold -ed meeting
‘Bush held a meeting with Sharon’

Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation:

(______, 0): (ε, 0) → (•_____, 1): (“Bush”, s1) → (•__•••, 6): (“Bush held talks”, s2) → (••••••, 3): (“Bush held talks with Sharon”, s3)

where a • in the coverage vector indicates the source word at that position is “covered”, and each si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings.
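To make the hypothesis extension above concrete, the following Python sketch represents a −LM hypothesis by its coverage set, the ending position of the last phrase, the partial output, and the score, and extends it with a phrase pair while charging a distortion cost and enforcing a distortion limit. The Phrase/Hyp layout, the jump-based distortion cost, and the toy phrase scores are assumptions for illustration only, not the exact model of the paper.

```python
from collections import namedtuple

# Illustrative data layout: a phrase pair covering source span [start, end),
# and a -LM hypothesis (no language-model cost yet).
Phrase = namedtuple("Phrase", "start end english score")
Hyp = namedtuple("Hyp", "coverage last_end output score")


def extend(hyp, phrase, distortion_weight=-1.0, distortion_limit=6):
    """Extend hyp with phrase, or return None if the extension is illegal."""
    span = frozenset(range(phrase.start, phrase.end))
    if span & hyp.coverage:                   # those source words are already translated
        return None
    jump = abs(phrase.start - hyp.last_end)   # distance from where the last phrase ended
    if jump > distortion_limit:               # the distortion limit forbids long reorderings
        return None
    dc = distortion_weight * jump             # distortion cost added to the score
    out = (hyp.output + " " + phrase.english).strip()
    return Hyp(hyp.coverage | span, phrase.end, out, hyp.score + phrase.score + dc)


# The running example (6 source words, positions 0-5), with made-up phrase scores:
h0 = Hyp(frozenset(), 0, "", 0.0)
h1 = extend(h0, Phrase(0, 1, "Bush", -0.5))
h2 = extend(h1, Phrase(3, 6, "held talks", -1.2))
h3 = extend(h2, Phrase(1, 3, "with Sharon", -0.8))
print(h3.output)  # -> Bush held talks with Sharon
```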
The states in the derivation above are called −LM states since they do not involve language model costs. To add a bigram model, we split each −LM state into a series of +LM states; each +LM state has the form (v, a), where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be:

(______, 0, <s>): (ε, 0) → (•_____, 1, Bush): (“Bush”, s1′) → (•__•••, 6, talks): (“Bush held talks”, s2′) → (••••••, 3, Sharon): (“Bush held talks with Sharon”, s3′)
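A rough sketch of why the extra word a suffices for a bigram model: when a +LM state is extended, each appended English word is scored against only the immediately preceding word, and the new state needs to remember only the last word of the phrase just added. The bigram_logprob interface and the uniform toy model below are assumptions for illustration; coverage and distortion are handled exactly as in the −LM sketch above.

```python
def extend_plus_lm(state, phrase_words, bigram_logprob):
    """Append phrase_words to a +LM state (v, a, score): v stands for the underlying
    -LM state (coverage vector and last ending position), a is the last English word
    produced so far. Only the language-model part of the extension is shown."""
    v, a, score = state
    for w in phrase_words:
        score += bigram_logprob(a, w)  # a bigram model needs only the previous word
        a = w                          # the new state remembers only the last word
    return (v, a, score)


# Toy usage with a uniform stand-in for a real bigram model:
lm = lambda prev, w: -1.0
s = ("coverage-info", "<s>", 0.0)
for phrase in (["Bush"], ["held", "talks"], ["with", "Sharon"]):
    s = extend_plus_lm(s, phrase, lm)
print(s)  # ('coverage-info', 'Sharon', -5.0)
```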

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. [sent-13, score-0.285]

2 Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. [sent-14, score-0.294]

3 We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. [sent-15, score-0.612]

4 Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2. [sent-16, score-0.165]

5 0 on dev/test over MERT, thanks to 20M+ sparse features. [sent-18, score-0.176]

6 This is the first successful effort of large-scale online discriminative training for MT. [sent-19, score-0.178]

7 Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al. [sent-20, score-0.189]

8 , 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. [sent-21, score-0.165]

9 However, it is well-known that the most important features for NLP are lexicalized, most of which can not ... [sent-25, score-0.051]

10 Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e. [sent-28, score-0.498]

11 As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. [sent-31, score-0.048]

12 (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. [sent-34, score-0.409]

13 We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al. [sent-35, score-0.536]

14 Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al. [sent-37, score-0.443]

15 To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al. [sent-39, score-0.159]

16 , 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. [sent-40, score-0.457]

17 The basic idea is to update when search error happens, rather than at the end of the search. [sent-41, score-0.181]

18 To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. [sent-42, score-0.074]

19 We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. [sent-43, score-1.043]

20 Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. [sent-44, score-0.389]

21 When the reachable subset is small for some language pairs, we augment ... [sent-45, score-0.215]

22 ... it by including reachable prefix-pairs when the full sentence pair is not. [sent-47, score-0.215]

23 Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). [sent-49, score-0.306]

24 Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. [sent-51, score-0.678]

25 We adapt it to MT using latent variables for derivations. [sent-52, score-0.074]

26 Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e. [sent-54, score-0.393]

27 hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions. [sent-56, score-0.183]

28 We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. [sent-58, score-0.232]

29 Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. [sent-62, score-0.063]

30 For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. [sent-63, score-0.108]

31 Extensive experiments on both Chineseto-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2. [sent-64, score-0.057]

32 We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. [sent-69, score-0.504]

33 Bùshí yǔ Shālóng jǔxíng le huìtán Bush with Sharon hold -ed meeting ‘Bush held a meeting with Sharon’ Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). [sent-73, score-0.243]

34 Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. [sent-74, score-0.241]

35 To compute the distortion cost we also need to maintain the ending position of the last phrase (e. [sent-76, score-0.142]

36 In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings. [sent-79, score-0.072]

37 The above states are called −LM states since they do not involve language model costs. [sent-80, score-0.186]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('forced', 0.33), ('perceptron', 0.24), ('derivations', 0.233), ('reachable', 0.215), ('decoding', 0.174), ('mt', 0.158), ('oracles', 0.157), ('bush', 0.143), ('inexact', 0.133), ('mert', 0.13), ('lm', 0.126), ('dev', 0.116), ('translation', 0.108), ('mira', 0.106), ('search', 0.102), ('sparse', 0.1), ('derivation', 0.099), ('sharon', 0.099), ('pro', 0.093), ('states', 0.093), ('bleu', 0.093), ('distortion', 0.091), ('liang', 0.083), ('convergence', 0.081), ('theoretical', 0.081), ('update', 0.079), ('thheer', 0.078), ('intricate', 0.078), ('wwoherdre', 0.078), ('hbs', 0.078), ('tnhoet', 0.078), ('thanks', 0.076), ('adapt', 0.074), ('scalable', 0.073), ('sentencelevel', 0.072), ('ong', 0.072), ('hmi', 0.072), ('targetlanguage', 0.072), ('prohibits', 0.072), ('cuny', 0.072), ('wisdom', 0.072), ('coverage', 0.072), ('city', 0.071), ('exact', 0.071), ('structured', 0.069), ('chiang', 0.067), ('extensive', 0.067), ('seo', 0.067), ('ramp', 0.067), ('inability', 0.067), ('uncovered', 0.067), ('witnessed', 0.067), ('lexicalized', 0.066), ('success', 0.065), ('effort', 0.065), ('huang', 0.063), ('engineer', 0.063), ('qc', 0.063), ('inevitably', 0.063), ('convincing', 0.063), ('gc', 0.063), ('watson', 0.06), ('excellent', 0.06), ('avoiding', 0.06), ('ux', 0.06), ('customized', 0.06), ('koehn', 0.058), ('towards', 0.057), ('kai', 0.057), ('bs', 0.057), ('heng', 0.057), ('recent', 0.057), ('gains', 0.057), ('discriminative', 0.057), ('successful', 0.056), ('center', 0.056), ('definite', 0.055), ('haitao', 0.055), ('collins', 0.054), ('scalability', 0.053), ('loss', 0.051), ('decoders', 0.051), ('ml', 0.051), ('visiting', 0.051), ('severe', 0.051), ('theoretically', 0.051), ('roark', 0.051), ('hypothesis', 0.051), ('cost', 0.051), ('watanabe', 0.05), ('dc', 0.05), ('accepted', 0.05), ('partial', 0.048), ('gimpel', 0.048), ('ks', 0.048), ('ls', 0.048), ('hopkins', 0.048), ('overly', 0.048), ('college', 0.048), ('handful', 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

2 0.2820172 141 emnlp-2013-Online Learning for Inexact Hypergraph Search

Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald

Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

3 0.19192672 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou

Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.

4 0.16563435 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

Author: Maryam Siahbani ; Baskaran Sankaran ; Anoop Sarkar

Abstract: Left-to-right (LR) decoding (Watanabe et al., 2006b) is a promising decoding algorithm for hierarchical phrase-based translation (Hiero). It generates the target sentence by extending the hypotheses only on the right edge. LR decoding has complexity O(n²b) for input of n words and beam size b, compared to O(n³) for the CKY algorithm. It requires a single language model (LM) history for each target hypothesis rather than two LM histories per hypothesis as in CKY. In this paper we present an augmented LR decoding algorithm that builds on the original algorithm in (Watanabe et al., 2006b). Unlike that algorithm, using experiments over multiple language pairs we show two new results: our LR decoding algorithm provides demonstrably more efficient decoding than CKY Hiero, four times faster; and by introducing new distortion and reordering features for LR decoding, it maintains the same translation quality (as in BLEU scores) obtained by phrase-based and CKY Hiero with the same translation model.

5 0.15305287 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

Author: Xinyan Xiao ; Deyi Xiong

Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional two-step pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.

6 0.15188399 159 emnlp-2013-Regularized Minimum Error Rate Training

7 0.14611612 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

8 0.13270231 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

9 0.12280197 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

10 0.1173916 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

11 0.11610742 145 emnlp-2013-Optimal Beam Search for Machine Translation

12 0.10250905 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

13 0.1019611 201 emnlp-2013-What is Hidden among Translation Rules

14 0.087958805 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

15 0.085061319 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

16 0.081472121 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation

17 0.078752846 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

18 0.071263105 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

19 0.071091548 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

20 0.070072696 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.223), (1, -0.235), (2, 0.1), (3, 0.063), (4, 0.067), (5, -0.015), (6, 0.033), (7, 0.048), (8, 0.064), (9, 0.191), (10, -0.112), (11, -0.058), (12, -0.048), (13, -0.018), (14, 0.162), (15, -0.289), (16, 0.013), (17, 0.058), (18, -0.056), (19, 0.035), (20, -0.006), (21, 0.119), (22, -0.159), (23, -0.107), (24, 0.062), (25, 0.056), (26, 0.083), (27, 0.106), (28, -0.071), (29, 0.049), (30, 0.111), (31, 0.123), (32, -0.081), (33, 0.034), (34, -0.093), (35, -0.05), (36, -0.091), (37, -0.027), (38, 0.093), (39, 0.008), (40, -0.067), (41, -0.008), (42, -0.114), (43, -0.031), (44, -0.057), (45, 0.02), (46, -0.046), (47, 0.039), (48, 0.023), (49, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97332078 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

2 0.80265725 141 emnlp-2013-Online Learning for Inexact Hypergraph Search

Author: Hao Zhang ; Liang Huang ; Kai Zhao ; Ryan McDonald

Abstract: Online learning algorithms like the perceptron are widely used for structured prediction tasks. For sequential search problems, like left-to-right tagging and parsing, beam search has been successfully combined with perceptron variants that accommodate search errors (Collins and Roark, 2004; Huang et al., 2012). However, perceptron training with inexact search is less studied for bottom-up parsing and, more generally, inference over hypergraphs. In this paper, we generalize the violation-fixing perceptron of Huang et al. (2012) to hypergraphs and apply it to the cube-pruning parser of Zhang and McDonald (2012). This results in the highest reported scores on WSJ evaluation set (UAS 93.50% and LAS 92.41% respectively) without the aid of additional resources.

3 0.72305256 145 emnlp-2013-Optimal Beam Search for Machine Translation

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

4 0.66984093 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou

Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.

5 0.6000573 159 emnlp-2013-Regularized Minimum Error Rate Training

Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova

Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—ℓ0 and a modification of ℓ2—and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.

6 0.56299007 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

7 0.54095572 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

8 0.50867301 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

9 0.41297311 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

10 0.40586761 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

11 0.37365448 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation

12 0.3567819 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

13 0.32934466 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

14 0.28227007 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

15 0.28110763 201 emnlp-2013-What is Hidden among Translation Rules

16 0.2808215 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

17 0.26797995 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time

18 0.26553553 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

19 0.26205003 2 emnlp-2013-A Convex Alternative to IBM Model 2

20 0.26001421 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.242), (3, 0.016), (18, 0.062), (22, 0.034), (26, 0.079), (30, 0.108), (43, 0.013), (50, 0.033), (51, 0.163), (66, 0.049), (71, 0.036), (75, 0.02), (77, 0.064), (96, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88060278 101 emnlp-2013-Improving Alignment of System Combination by Using Multi-objective Optimization

Author: Tian Xia ; Zongcheng Ji ; Shaodan Zhai ; Yidong Chen ; Qun Liu ; Shaojun Wang

Abstract: This paper proposes a multi-objective optimization framework which supports heterogeneous information sources to improve alignment in machine translation system combination techniques. In this area, most techniques usually utilize confusion networks (CN) as their central data structure to compact an exponential number of potential hypotheses, and because better hypothesis alignment may benefit constructing better quality confusion networks, it is natural to add more useful information to improve alignment results. However, this information may be heterogeneous, so the widely-used Viterbi algorithm for searching the best alignment may not apply here. In the multi-objective optimization framework, each information source is viewed as an independent objective, and a new goal of improving all objectives can be searched by mature algorithms. The solutions from this framework, termed Pareto optimal solutions, are then combined to construct confusion networks. Experiments on two Chinese-to-English translation datasets show significant improvements, 0.97 and 1.06 BLEU points over a strong Indirected Hidden Markov Model-based (IHMM) system, and 4.75 and 3.53 points over the best single machine translation systems.

same-paper 2 0.84558392 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT.

3 0.67863041 122 emnlp-2013-Learning to Freestyle: Hip Hop Challenge-Response Induction via Transduction Rule Segmentation

Author: Dekai Wu ; Karteek Addanki ; Markus Saers ; Meriem Beloucif

Abstract: We present a novel model, Freestyle, that learns to improvise rhyming and fluent responses upon being challenged with a line of hip hop lyrics, by combining both bottom-up token based rule induction and top-down rule segmentation strategies to learn a stochastic transduction grammar that simultaneously learns both phrasing and rhyming associations. In this attack on the woefully under-explored natural language genre of music lyrics, we exploit a strictly unsupervised transduction grammar induction approach. Our task is particularly ambitious in that no use of any a priori linguistic or phonetic information is allowed, even though the domain of hip hop lyrics is particularly noisy and unstructured. We evaluate the performance of the learned model against a model learned only using the more conventional bottom-up token based rule induction, and demonstrate the superiority of our combined token based and rule segmentation induction method toward generating higher quality improvised responses, measured on fluency and rhyming criteria as judged by human evaluators. To highlight some of the inherent challenges in adapting other algorithms to this novel task, we also compare the quality of the responses generated by our model to those generated by an out-of-the-box phrase based SMT system. We tackle the challenge of selecting appropriate training data for our task via a dedicated rhyme scheme detection module, which is also acquired via unsupervised learning and report improved quality of the generated responses. Finally, we report results with Maghrebi French hip hop lyrics indicating that our model performs surprisingly well with no special adaptation to other languages.

4 0.66376442 145 emnlp-2013-Optimal Beam Search for Machine Translation

Author: Alexander Rush ; Yin-Wen Chang ; Michael Collins

Abstract: Beam search is a fast and empirically effective method for translation decoding, but it lacks formal guarantees about search error. We develop a new decoding algorithm that combines the speed of beam search with the optimal certificate property of Lagrangian relaxation, and apply it to phrase- and syntax-based translation decoding. The new method is efficient, utilizes standard MT algorithms, and returns an exact solution on the majority of translation examples in our test data. The algorithm is 3.5 times faster than an optimized incremental constraint-based decoder for phrase-based translation and 4 times faster for syntax-based translation.

5 0.64207697 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

Author: Will Y. Zou ; Richard Socher ; Daniel Cer ; Christopher D. Manning

Abstract: We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.

6 0.64091784 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

7 0.6399278 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation

8 0.63895166 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

9 0.63822579 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

10 0.63710821 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

11 0.63521558 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

12 0.63383973 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

13 0.63107491 143 emnlp-2013-Open Domain Targeted Sentiment

14 0.63082671 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

15 0.62823325 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

16 0.62737525 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

17 0.6272465 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

18 0.62592316 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

19 0.62565476 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction

20 0.62464166 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction