emnlp emnlp2013 emnlp2013-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. [sent-3, score-0.131]
2 First, MERT is an unregularized learner and is therefore prone to overfitting. [sent-4, score-0.188]
3 Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. [sent-5, score-0.164]
4 To address these issues, we study the addition of a regularization term to the MERT objective function. [sent-6, score-0.445]
5 Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. [sent-7, score-0.381]
6 To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. [sent-8, score-0.359]
7 Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets. [sent-9, score-0.058]
8 1 Introduction Minimum Error Rate Training emerged a decade ago (Och, 2003) as a superior training method for small numbers oflinear model parameters ofmachine translation systems, improving over prior work using maximum likelihood criteria (Och and Ney, 2002). [sent-10, score-0.04]
9 , 2009) were subsequently developed, with the benefit of reducing the approximation error from n-best lists. [sent-14, score-0.143]
10 It directly optimizes the evaluation metric under consideration (e. [sent-16, score-0.032]
11 com Secondly, it offers a globally optimal line search. [sent-23, score-0.09]
12 Unfortunately, there are several potential difficulties in scaling MERT to larger numbers of features, due to its non-convex loss function and its lack of regularization. [sent-24, score-0.203]
13 These challenges have prompted some researchers to move away from MERT, in favor of linearly decomposable approximations of the evaluation metric (Chiang et al. [sent-25, score-0.085]
14 , 2009; Hopkins and May, 2011; Cherry and Foster, 2012), which correspond to easier optimization problems and which naturally incorporate regularization. [sent-26, score-0.041]
15 , 2009) has shown that adding thousands or tens of thousands of features can improve MT quality when weights are optimized using a margin-based approximation. [sent-28, score-0.104]
16 In this paper, we seek to preserve the advantages of MERT while addressing its shortcomings in terms of regularization and search. [sent-30, score-0.293]
17 The idea of adding a regularization term to the MERT objective function can be perplexing at first, because the most common regularizers, such as ‘1 and ‘2, are not directly applicable to MERT. [sent-31, score-0.51]
18 Indeed, these regularizers are scale sensitive, while the MERT objective function is not: scaling the weight vector neither changes the predictions of the linear model nor affects the error count. [sent-32, score-0.726]
19 Hence, MERT can hedge any regularization penalty by maximally scaling down linear model weights. [sent-33, score-0.462]
20 The first contribution ofthis paper is to analyze various forms of regularization that are not susceptible to this scaling problem. [sent-34, score-0.402]
21 We analyze and experiment with ‘0, a form of regularization that is scale insensitive. [sent-35, score-0.351]
22 oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is9t4ic8s–1959, regularization, where we apply ‘2 regularization to scale-senstive linear transforms of the original linear model. [sent-38, score-0.459]
23 In addition, we introduce efficient methods of incorporating regularization in Och (2003)’s exact line searches. [sent-39, score-0.419]
24 For all of these regularizers, our methods let us find the true optimum of the regularized objective function along the line. [sent-40, score-0.526]
25 Finally, we address the issue of searching in a high-dimensional space by using the gradient of expected BLEU (Smith and Eisner, 2006) to find better search directions for our line searches. [sent-41, score-0.239]
26 This direction finder addresses one of the serious concerns raised by Hopkins and May (201 1): MERT widely failed to reach the optimum of a synthetic linear objective function. [sent-42, score-0.525]
27 In replicating Hopkins and May’s experiments, we confirm that existing search algorithms for MERT—including coordinate ascent, Powell’s algorithm (Powell, 1964), and random direction sets (Cer et al. [sent-43, score-0.23]
28 However, when using our gradient-based direction finder, MERT has no problem finding the true optimum even in a 1000-dimensional space. [sent-45, score-0.286]
29 Our results suggest that the combination of a regularized objective function and a gradient-informed line search algorithm enables MERT to scale well with a large number of features. [sent-46, score-0.489]
30 Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO (Hopkins and May, 2011), a parameter tuning method known to be effective with large feature sets. [sent-47, score-0.092]
31 2 Unregularized MERT Prior to introducing regularized MERT, we briefly review standard unregularized MERT (Och, 2003). [sent-48, score-0.249]
32 fS} to denote the S input sentences of a given tuning }set. [sent-52, score-0.031]
33 E}ach input and output sentence pair (fs, es,m) is weighted using a linear model that applies model parameters w = (w1 . [sent-57, score-0.135]
34 Furthermore, let hs,m ∈ RD denote the feature vector representing the tran∈slat Rion pair (fs, es,m). [sent-67, score-0.033]
35 In MERT, the goal is to minimize a loss function E(r, e) that scores translation hypotheses against a set of reference translations r1S = {r1 . [sent-68, score-0.154]
36 This yields the following optimization problem: f1S wˆ = argmwin? [sent-72, score-0.041]
37 (1) (2) While the error surface of Equation 1 is only an approximation of the true error surface of the MT decoder, the quality of this approximation depends on the size of the hypothesis space represented by the M-best list. [sent-81, score-0.255]
38 To increase the size of the hypothesis space, subsequent work (Macherey et al. [sent-83, score-0.039]
39 , 2008) instead operated on lattices, but this paper focuses on M-best lists. [sent-84, score-0.03]
40 A crucial observation is that the unsmoothed error count represented in Equation 1is a piecewise constant function. [sent-85, score-0.568]
41 This enabled Och (2003) to devise a line search algorithm guaranteed to find the optimum point along the line. [sent-86, score-0.43]
42 To extend the search from one to multiple dimensions, MERT applies a sequence of line optimizations along some fixed or variable set of search directions {dt} until some convergence criteria are met. [sent-87, score-0.337]
43 Considering a given point wt and a given direction dt at iteration t, finding the most probable translation hypothesis in the set of candidates translations Cs = {es,1 . [sent-88, score-0.289]
44 es,M} corresponds to solving the following optimization problem: e(fs;γ) =m ar∈g{1m. [sent-91, score-0.041]
45 (3) The function in this equation is piecewise linear (Papineni, 1999), which enables an efficient exhaustive computation. [sent-96, score-0.561]
46 Specifically, this function is optimized by enumerating the up to M hypotheses that form the upper envelope of the model score function. [sent-97, score-0.245]
47 The error count, then, is a piecewise constant function 1949 γf1s γfMs defined by the points < · · · < at which an increase in γ causes a change of optimum in Equation 3. [sent-98, score-0.841]
48 Error counts for the whole corpus are simply the sums of sentence-level piecewise constant functions aggre- gated over all sentences ofthe corpus. [sent-99, score-0.491]
49 1 The optimal γ is finally computed by enumerating all piecewise constant intervals of the corpus-level error function, and by selecting the one that has the lowest error count (or, correspondingly, highest BLEU score). [sent-100, score-0.661]
50 Assuming the optimum is found in the interval [γk−1 , γk], we define γopt = (γk−1 + γk)/2 and change the parameters using the update wt+1 = wt + γopt · dt. [sent-101, score-0.363]
51 Finally, this method is turned into a global Ddimensional search using algorithms that repeatedly use the aforementioned exact line search algorithm. [sent-102, score-0.242]
52 Och (2003) first advocated the use of Powell’s method (Powell, 1964; Press et al. [sent-103, score-0.03]
53 Pharaoh (Koehn, 2004) and subsequently Moses (Koehn et al. [sent-105, score-0.035]
54 , 2007) instead use coordinate ascent, and more recent work often uses random search directions (Cer et al. [sent-106, score-0.209]
55 In Section 4, we will present a novel direction finder for maximum-BLEU optimization, which uses the gradient of expected BLEU to find directions where the BLEU score is most likely to increase. [sent-109, score-0.261]
56 3 Regularization for MERT Because MERT is prone to overfitting when a large number of parameters must be optimized, we study the addition of a regularization term to the objective function. [sent-110, score-0.523]
57 One conventional approach is to regularize the objective function wqith a penalty based on the Euclidean norm||w||2=qPiwi2, also known as‘2 regularization. [sent-111, score-0.154]
58 In the caseq oPf MERT, this yields the following objective functionP:2 wˆ = argmwin? [sent-112, score-0.089]
59 (4) 1This assumes that the sufficient statistics of the metric under consideration are additively decomposable by sentence, which is the case with most popular evaluation metrics such as BLEU (Papineni et al. [sent-114, score-0.114]
60 2The ‘2 regularizer is often used in conjunction with loglikelihood objectives. [sent-116, score-0.048]
61 The regularization term of Equation 4 could similarly be added to the log of an objective—e. [sent-117, score-0.356]
62 304 γ, the step size in the current direction Figure 1: Example MERT values along one coordinate, first unregularized. [sent-147, score-0.117]
63 When regularized with ‘2, the piecewise constant function becomes piecewise quadratic. [sent-148, score-1.015]
64 When using ‘0, the function remains piecewise constant with a point discontinuity at 0. [sent-149, score-0.556]
65 where the regularization term 1/2σ2 is a free parameter that controls the strength of the regularization penalty. [sent-150, score-0.715]
66 Similar regularizers have also been used in conjunction with other norms, such as ‘P1 and ‘0 norms. [sent-151, score-0.207]
67 The ‘1 norm, defined as | |w| | 1 = Pi |wi |, applies a constant force toward zero, preferrPing vectors with fewer non-zero components; ‘0, dePfined as | |w| |0 = |{i | wi 0} |, simply counts the number of non-zero components o|f the weight vector, encoding a preference for sparse vectors. [sent-152, score-0.194]
68 Geometrically, ‘2 is a parabola, ‘1 is the wedgeshaped absolute value function, and ‘0 is an impulse function with a spike at 0. [sent-153, score-0.161]
69 The original formulation (Equation 1) of MERT consists of a piecewise constant representation of the loss, as a function of the step size in a given direction. [sent-154, score-0.526]
70 But with these three reg- = 1950 ularization terms, the function respectively becomes piecewise quadratic, piecewise linear, or piecewise constant with a potential impulse jump for each distinct choice of regularizer. [sent-155, score-1.345]
71 As discussed in (McAllester and Keshet, 2011), the problem with optimizing Equation 4 directly is that the output of the underlying linear classifier, and therefore the error count, are not sensitive to the scale of w. [sent-157, score-0.235]
72 Moreover, ‘2 regularization (as well as ‘1 regularization) is scale sensitive, which means any optimizer of this function can drive the regularization term down to zero by scaling down w. [sent-158, score-0.851]
73 As special treatments for ‘2, we evaluate three linear transforms of the weight vector, where the vector w of the regularization term ||w| |22/2σ2 is replaced with either: 1. [sent-159, score-0.524]
74 a nv aecftfionre ew triathn only (D 1) free parameters, e. [sent-162, score-0.033]
75 an ‘1 ,re·n·o·r ,mwalization: w/ | |w| | 1 − In (1), regularization is biased towards w0, a weight vector previously optimized using a competitive yet much smaller feature set, such as core features of a phrase-based (Koehn et al. [sent-165, score-0.399]
76 Otherwise, any regularization toward an overfit parameter vector w0 would defeat the purpose of introducing a regularization term in the first place. [sent-168, score-0.745]
77 3 In (2), the transformation is motivated by the observation that the D-parameter linear model of Equation 2 only needs (D − 1) degrees of freedom. [sent-169, score-0.06]
78 Fixing one of the components of w to any non-zero constant and allowing the others to vary, the new linear model retains the same modeling power, but the (D − 1) free parameters are no longer scale invariant(, Di. [sent-170, score-0.321]
79 , scaling the (D − 1)-dimensional vector now has an effect on line(aDr m −o d1)el predictions. [sent-172, score-0.112]
80 In (3), the weight vector is normalized as to have an ‘1-norm equal to 1. [sent-173, score-0.062]
81 In contrast, the ‘0 norm is scale insensitive, thus not affected by this problem. [sent-174, score-0.11]
82 1 Exact line search with regularization Optimizing with a regularized error surface requires a change in the line search algorithm presented in 3(Gimpel and Smith, 2012, footnote 6) briefly mentions the use of such a regularizer with its ramp loss objective function. [sent-176, score-1.023]
83 Section 2, but the other aspects of MERT remain the same, and we can still use global search algorithms such as coordinate ascent, Powell, and random directions exactly the same way as with unregularized MERT. [sent-177, score-0.329]
84 Line search with a regularization term is still as efficient as in (Och, 2003), and it is still guaranteed to find the optimum of the (now regularized) objective function along the line. [sent-178, score-0.85]
wordName wordTfidf (topN-words)
[('mert', 0.512), ('piecewise', 0.36), ('regularization', 0.293), ('regularizers', 0.207), ('optimum', 0.206), ('fs', 0.191), ('powell', 0.15), ('regularized', 0.129), ('unregularized', 0.12), ('bleu', 0.106), ('argmwin', 0.104), ('xss', 0.104), ('constant', 0.101), ('hopkins', 0.092), ('coordinate', 0.092), ('line', 0.09), ('finder', 0.09), ('objective', 0.089), ('wt', 0.085), ('dt', 0.085), ('och', 0.083), ('direction', 0.08), ('scaling', 0.079), ('error', 0.077), ('macherey', 0.077), ('equation', 0.076), ('pro', 0.071), ('impulse', 0.069), ('ascent', 0.069), ('function', 0.065), ('term', 0.063), ('linear', 0.06), ('envelope', 0.06), ('directions', 0.059), ('loss', 0.059), ('opt', 0.059), ('search', 0.058), ('scale', 0.058), ('decomposable', 0.055), ('norm', 0.052), ('rs', 0.051), ('regularizer', 0.048), ('enumerating', 0.046), ('transforms', 0.046), ('optimized', 0.044), ('lattices', 0.042), ('optimization', 0.041), ('sensitive', 0.04), ('parameters', 0.04), ('guaranteed', 0.039), ('hypothesis', 0.039), ('chiang', 0.039), ('microsoft', 0.038), ('prone', 0.038), ('cer', 0.038), ('along', 0.037), ('exact', 0.036), ('applies', 0.035), ('subsequently', 0.035), ('vector', 0.033), ('koehn', 0.033), ('free', 0.033), ('cherry', 0.033), ('parameter', 0.033), ('gradient', 0.032), ('change', 0.032), ('consideration', 0.032), ('minimum', 0.031), ('tuning', 0.031), ('papineni', 0.031), ('approximation', 0.031), ('mt', 0.03), ('hypotheses', 0.03), ('geometrically', 0.03), ('advocated', 0.03), ('unsmoothed', 0.03), ('susceptible', 0.03), ('affine', 0.03), ('ularization', 0.03), ('orient', 0.03), ('discontinuity', 0.03), ('gated', 0.03), ('adr', 0.03), ('operated', 0.03), ('pharaoh', 0.03), ('hedge', 0.03), ('tfe', 0.03), ('ddimensional', 0.03), ('defeat', 0.03), ('prompted', 0.03), ('learner', 0.03), ('thousands', 0.03), ('components', 0.029), ('changes', 0.029), ('weight', 0.029), ('extensions', 0.028), ('additively', 0.027), ('opf', 0.027), ('spike', 0.027), ('invariance', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
2 0.15188399 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent wt woridths) f eaantdu overly complicated. oWste f iren-stead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which can not ∗ Work done while visiting City University of New York. Corresponding author. † 1112 be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. When the reachable subset is small for some language pairs, we augment Proce Sdeiantgtlse o,f W thaesh 2i0n1gt3o nC,o UnSfeAre,n 1c8e- o2n1 E Omctpoibriecra 2l0 M13et.h ?oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is1t1ic2s–1 23, it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentencelevel BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chineseto-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): 0123456 Figure 1: Standard beam-search phrase-based decoding. B `ush´ ı y uˇ Sh¯ al´ ong j ˇux ´ıng le hu` ıt´ an Bush with Sharon hold -ed meeting ‘Bush held a meeting with Sharon’ Phrase-based decoders generate partial targetlanguage outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the sourcelanguage words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (• 3(• •() • :1( •s063),:“(Bs)u2s:,h)“(hBs:e1ul(d,s0“ht,aB“hleuk”ls) hdw”t)ailhkrsS1”h)aro2n”)r3 where a • in the coverage vector indicates the source wwoherdre a at •th i ns position aisg e“ vcoecvteorred in”d iacnadte ws thheer seo euarcche si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortionlimit which prohibits long-distance reorderings. The above states are called −LM states since they do Tnhoet ainbovovleve st language mlleodd −el LcMos tsst.a eTso iandcde a beiygram model, we split each −LM state into a series ogrfa +mL mMo states; ee sapchli t+ eaLcMh −staLtMe h satsa ttehe in ftoor ma (v,a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (• 3(• ,(•Sh1a•(r6o0,nta)l:ks,()Bsu:03sh,(s“<)s02
3 0.14929166 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou
Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentencelevel BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-toEnglish translation show its effectiveness over two state-of-the-art MIRA implementations.
4 0.12669905 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich
Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.
5 0.091963679 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
6 0.086338423 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
7 0.078123026 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
8 0.07311368 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
9 0.073010132 2 emnlp-2013-A Convex Alternative to IBM Model 2
10 0.071849331 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
11 0.059325531 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
12 0.058061831 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering
13 0.056387901 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
14 0.054034818 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
15 0.053671144 145 emnlp-2013-Optimal Beam Search for Machine Translation
16 0.052462131 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
17 0.050196953 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
18 0.042826332 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
19 0.040840499 201 emnlp-2013-What is Hidden among Translation Rules
20 0.040632103 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation
topicId topicWeight
[(0, -0.153), (1, -0.119), (2, 0.046), (3, 0.028), (4, 0.053), (5, -0.014), (6, 0.045), (7, 0.029), (8, -0.014), (9, 0.004), (10, -0.051), (11, -0.026), (12, -0.012), (13, -0.044), (14, 0.118), (15, -0.214), (16, 0.052), (17, 0.044), (18, -0.025), (19, 0.11), (20, 0.006), (21, 0.184), (22, -0.093), (23, 0.07), (24, 0.028), (25, 0.096), (26, -0.123), (27, -0.099), (28, -0.006), (29, -0.135), (30, 0.121), (31, 0.069), (32, -0.079), (33, -0.073), (34, -0.046), (35, 0.077), (36, -0.129), (37, -0.007), (38, 0.084), (39, 0.099), (40, -0.065), (41, 0.017), (42, 0.005), (43, -0.084), (44, 0.075), (45, -0.145), (46, 0.055), (47, 0.159), (48, -0.05), (49, 0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.97759557 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
2 0.77781451 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou
Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentencelevel BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-toEnglish translation show its effectiveness over two state-of-the-art MIRA implementations.
3 0.57706624 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich
Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.
4 0.54465199 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao
Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent wt woridths) f eaantdu overly complicated. oWste f iren-stead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which can not ∗ Work done while visiting City University of New York. Corresponding author. † 1112 be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. When the reachable subset is small for some language pairs, we augment Proce Sdeiantgtlse o,f W thaesh 2i0n1gt3o nC,o UnSfeAre,n 1c8e- o2n1 E Omctpoibriecra 2l0 M13et.h ?oc d2s0 i1n3 N Aastusorcaila Ltiaon g fuoarg Ceo Pmrpoucetastsi on ga,l p Laignegsu 1is1t1ic2s–1 23, it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentencelevel BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chineseto-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): 0123456 Figure 1: Standard beam-search phrase-based decoding. B `ush´ ı y uˇ Sh¯ al´ ong j ˇux ´ıng le hu` ıt´ an Bush with Sharon hold -ed meeting ‘Bush held a meeting with Sharon’ Phrase-based decoders generate partial targetlanguage outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the sourcelanguage words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, the following is one possible derivation: (• 3(• •() • :1( •s063),:“(Bs)u2s:,h)“(hBs:e1ul(d,s0“ht,aB“hleuk”ls) hdw”t)ailhkrsS1”h)aro2n”)r3 where a • in the coverage vector indicates the source wwoherdre a at •th i ns position aisg e“ vcoecvteorred in”d iacnadte ws thheer seo euarcche si is the score of each state, each adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortionlimit which prohibits long-distance reorderings. The above states are called −LM states since they do Tnhoet ainbovovleve st language mlleodd −el LcMos tsst.a eTso iandcde a beiygram model, we split each −LM state into a series ogrfa +mL mMo states; ee sapchli t+ eaLcMh −staLtMe h satsa ttehe in ftoor ma (v,a) where a is the last word of the hypothesis. Thus a +LM version of the above derivation might be: (• 3(• ,(•Sh1a•(r6o0,nta)l:ks,()Bsu:03sh,(s“<)s02
5 0.5308007 2 emnlp-2013-A Convex Alternative to IBM Model 2
Author: Andrei Simion ; Michael Collins ; Cliff Stein
Abstract: The IBM translation models have been hugely influential in statistical machine translation; they are the basis of the alignment models used in modern translation systems. Excluding IBM Model 1, the IBM translation models, and practically all variants proposed in the literature, have relied on the optimization of likelihood functions or similar functions that are non-convex, and hence have multiple local optima. In this paper we introduce a convex relaxation of IBM Model 2, and describe an optimization algorithm for the relaxation based on a subgradient method combined with exponentiated-gradient updates. Our approach gives the same level of alignment accuracy as IBM Model 2.
6 0.50926024 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
8 0.38415378 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
9 0.33423573 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
10 0.33158895 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
11 0.28563648 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
12 0.27623957 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation
13 0.273274 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
14 0.26748091 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
15 0.26565877 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
16 0.26357651 23 emnlp-2013-Animacy Detection with Voting Models
17 0.25973549 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
18 0.25929657 101 emnlp-2013-Improving Alignment of System Combination by Using Multi-objective Optimization
19 0.25795993 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
20 0.25119901 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
topicId topicWeight
[(3, 0.023), (18, 0.03), (22, 0.028), (30, 0.121), (43, 0.016), (50, 0.487), (51, 0.105), (66, 0.02), (71, 0.024), (75, 0.021), (77, 0.026), (96, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.89252627 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
2 0.87653369 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.
3 0.7685129 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
Author: Jan A. Botha ; Phil Blunsom
Abstract: This paper contributes an approach for expressing non-concatenative morphological phenomena, such as stem derivation in Semitic languages, in terms of a mildly context-sensitive grammar formalism. This offers a convenient level of modelling abstraction while remaining computationally tractable. The nonparametric Bayesian framework of adaptor grammars is extended to this richer grammar formalism to propose a probabilistic model that can learn word segmentation and morpheme lexicons, including ones with discontiguous strings as elements, from unannotated data. Our experiments on Hebrew and three variants of Arabic data find that the additional expressiveness to capture roots and templates as atomic units improves the quality of concatenative segmentation and stem identification. We obtain 74% accuracy in identifying triliteral Hebrew roots, while performing morphological segmentation with an F1-score of 78. 1.
4 0.53975123 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image descrip- tion task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained out- comes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that imagebased distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
6 0.43567967 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
7 0.4041661 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
8 0.4009496 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
9 0.39589924 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
10 0.38852757 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
11 0.38767084 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
12 0.38709733 2 emnlp-2013-A Convex Alternative to IBM Model 2
13 0.38414544 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
14 0.37825531 30 emnlp-2013-Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora
15 0.3777467 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
16 0.37431684 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training
17 0.37392747 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
18 0.37332857 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
19 0.37259907 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
20 0.37251529 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction