acl acl2013 acl2013-264 knowledge-graph by maker-knowledge-mining

264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation


Source: pdf

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. [sent-4, score-0.415]

2 We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. [sent-6, score-0.84]

3 We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1. [sent-7, score-0.175]

4 3 TER on average over state-of-the-art optimizers with the large feature set. [sent-10, score-0.201]

5 1 Introduction The desire to incorporate high-dimensional sparse feature representations into statistical machine translation (SMT) models has driven recent research away from Minimum Error Rate Training (MERT) (Och, 2003), and toward other discriminative methods that can optimize more features. [sent-11, score-0.289]

6 However, as the dimension of the feature space increases, generalization becomes increasingly difficult. [sent-20, score-0.116]

7 , 2012), which is usually substantially larger than the tuning set, but this is complementary to our goal here of better generalization given a fixed size tuning set. [sent-24, score-0.235]

8 The latter, RMM, was introduced as an effective and less computationally expensive way to incorporate the spread of the data – second-order information about the [sent-35, score-0.362]

9 distance between hypotheses when projected onto the line defined by the weight vector w. [sent-37, score-0.117]

10 This motivates an online gradient-based optimization approach—an approach that is particularly attractive because its simple update is well suited for efficiently processing structured objects with sparse features (Crammer et al. [sent-42, score-0.354]

11 After background discussion on learning in SMT (§2), we introduce a novel online learning algorithm for relative margin maximization suitable for SMT (§3). [sent-47, score-0.382]

12 First, we introduce a structured relative margin objective which incorporates cost-augmented hypothesis selection and latent variables. [sent-50, score-0.328]

13 Then, we derive a simple closed-form online update necessary to create a large margin solution while simultaneously bounding the spread of the projection of the data (§3. [sent-51, score-1.063]

14 Our algorithm, RM, significantly outperforms strong state-of-the-art optimizers, in both a basic feature setting and a high-dimensional (sparse) feature space (§4). [sent-55, score-0.16]

15 Finally, we discuss the spread advantage and other key issues (§6). [sent-59, score-0.362]

16 The instability of MERT in larger feature sets (Foster and Kuhn, 2009; Hopkins and May, 2011) has motivated many alternative tuning methods for SMT. [sent-65, score-0.154]

17 Recent batch optimizers, PRO and RAMPION, and Batch-MIRA (Cherry and Foster, 2012), have been partly motivated by existing MT infrastructures, as they iterate between decoding the entire tuning set and optimizing the parameters. [sent-69, score-0.138]

18 C acts as a regularization parameter, trading off between margin maximization and constraint violations. [sent-80, score-0.446]

19 While solving the optimization problem relies on computing the margin between the correct output yi, and y0, in SMT our decoder is often incapable of producing the reference translation, i. [sent-81, score-0.372]

δs(xi, y+, y−) ≥ ∆i(y−) − ∆i(y+) − ξi, where δs(xi, y+, y−) = s(xi, y+, d+) − s(xi, y−, d−). This leads to a variant of the structured ramp loss to be optimized. [sent-90, score-0.154]
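To make the constraint concrete, the following sketch (not the authors' code) computes δs and the resulting hinge slack for one hypothesis pair; the feature vectors, cost values, and function names are invented for illustration.

```python
import numpy as np

def delta_s(w, f_plus, f_minus):
    """Score difference between the surrogate reference y+ and a competing
    hypothesis y-, i.e. their distance after projection onto w."""
    return w.dot(f_plus) - w.dot(f_minus)

def hinge_slack(w, f_plus, f_minus, cost_minus, cost_plus):
    """Slack xi needed to satisfy delta_s >= cost(y-) - cost(y+) - xi."""
    required = cost_minus - cost_plus
    return max(0.0, required - delta_s(w, f_plus, f_minus))

# Toy 3-dimensional example.
w = np.array([0.5, -0.2, 1.0])
f_plus = np.array([1.0, 0.0, 2.0])   # features of the hope hypothesis y+
f_minus = np.array([0.8, 1.0, 1.5])  # features of a fear hypothesis y-
print(hinge_slack(w, f_plus, f_minus, cost_minus=1.0, cost_plus=0.1))
```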

, 2006), which is used to solve this problem, updates w on each round such that the score of the correct hypothesis is greater than the score of the incorrect y− by a margin at least as large as the cost incurred by predicting the incorrect hypothesis, while keeping the change to w small. [sent-98, score-0.367]

Figure 1: (a) RM and large margin solution comparison and (b) the spread of the projections given by each. [sent-99, score-0.731]

23 RM and large margin solutions are shown with a darker dotted line and a darker solid line, respectively. [sent-100, score-0.472]

It is maximized by minimizing the norm in SVM, or analogously, the proximity constraint in MIRA: arg min_w ||w − wt||². [sent-103, score-0.19]
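For comparison, the MIRA step described here has the standard clipped closed form of the passive-aggressive update (Crammer et al., 2006). The sketch below shows that textbook form for a single (y+, y−) constraint; the names and the value of C are ours, not the paper's.

```python
import numpy as np

def mira_update(w, f_plus, f_minus, cost, C=0.01):
    """One PA-I / MIRA step: stay close to the previous w while trying to
    satisfy the margin constraint for a single hypothesis pair."""
    omega = f_plus - f_minus            # feature difference vector
    loss = cost - w.dot(omega)          # margin violation (hinge)
    norm_sq = omega.dot(omega)
    if loss <= 0.0 or norm_sq == 0.0:
        return w                        # constraint already satisfied
    step = min(C, loss / norm_sq)       # closed-form step size, clipped at C
    return w + step * omega
```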

25 The measures of complexity usually take the form of some value on the radius of the data, such as the ratio of the radius of the data to the margin (Shivaswamy and Jebara, 2009a). [sent-105, score-0.408]

As radius is a way of measuring spread in any projection direction, here we will specifically be interested in the spread of the data as measured after the projection defined by the learned model w. [sent-106, score-0.856]

More formally, the spread is the distance between y+ and the worst candidate (yw, dw) ← arg min(y,d)∈Y(xi),D(xi) s(xi, y, d), after projecting both onto the line defined by the weight vector w. [sent-107, score-0.453]

For each y0, this projection is conveniently given by s(xi, y0, d0); thus the spread is calculated as δs(xi, y+, yw). [sent-108, score-0.408]
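The spread itself takes only a few lines to compute: project every k-best candidate onto w and measure how far the worst one, yw, falls from y+. This is a hedged sketch with synthetic data, not the paper's implementation.

```python
import numpy as np

def spread(w, f_plus, kbest_feats):
    """delta_s(x, y+, yw): distance, after projection onto w, between the
    surrogate reference y+ and the worst-scoring candidate yw."""
    scores = kbest_feats @ w            # projection of each hypothesis onto w
    return w.dot(f_plus) - scores.min()

rng = np.random.default_rng(0)
w = rng.normal(size=5)
f_plus = rng.normal(size=5)
kbest = rng.normal(size=(100, 5))       # toy 100-best feature matrix
print(spread(w, f_plus, kbest))
```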

RMM was introduced as a generalization over SVM that incorporates both the margin constraint and information regarding the spread of the data. [sent-109, score-0.819]

30 The relative margin is the ratio of the absolute, or maximum margin, to the spread of the projected data. [sent-110, score-0.786]

31 Thus, the RMM learns a large margin solution relative to the spread of the data, or in other words, creates a max margin while simultaneously bounding the spread of the projected data. [sent-111, score-1.712]

32 The large margin decision boundary is shown with a darker solid line, while the relative margin solution is shown with a darker dotted line. [sent-114, score-0.892]

Notice that with a large margin solution, although the distance between y+ and y− is greater, the points are highly spread, extending far to the left of the decision boundary. [sent-117, score-0.359]

In contrast, with a relative margin, although we have a smaller absolute margin, the spread is smaller, all points being within a smaller distance. [sent-118, score-0.444]

35 The higher the spread of the projection, the higher the variance of the projected points, and the greater the likelihood that we will mislabel a new instance, since the high variance projections may cross the learned decision boundary. [sent-120, score-0.407]

36 In higher dimensions, accounting for the spread becomes even more crucial, as will be discussed in Section 6. [sent-121, score-0.393]

37 Nonetheless, since structured RMM is a generalization of Structured SVM, which shares its underlying objective with MIRA, our intuition is that SMT should be able to benefit as well. [sent-123, score-0.115]

38 They use second-order information in the form of a distribution over weights to change the maximum margin solution. [sent-126, score-0.328]

…efficient optimization procedure that does not require batch training or an off-the-shelf QP solver. [sent-127, score-0.122]

40 2 RM Algorithm We address the above-mentioned limitations by introducing a novel online learning algorithm for relative margin maximization, RM. [sent-129, score-0.433]

The relative margin solution is obtained by maximizing the same margin as Equation (2), but now with respect to the distance between y+ and the worst yw. [sent-130, score-0.779]

The online latent structured soft relative margin optimization problem is then: [sent-132, score-0.539]

: δs(xi, y+, y−) ≥ ∆i(y−) − ∆i(y+) − ξi and −B − τi ≤ δs(xi, y+, yw) ≤ B + τi (4), where additional bounding constraints are added to the usual margin constraints in order to contain the spread by bounding the difference in projections. [sent-135, score-1.006]

44 B is an additional parameter; it controls the spread, trading off between margin maximization and spread minimization. [sent-136, score-0.732]

Notice that when B → ∞, the bounding constraints disappear, and we are left with the original problem in Equation (2). [sent-137, score-0.158]

46 D, which plays an analogous role to C, allows penalized violations of the bounding constraints. [sent-138, score-0.158]
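Putting these pieces together (the proximity objective, the C-penalized margin slack ξ, and the D-penalized bounding slack τ), the per-example relative margin problem has roughly the shape below; this is a reconstruction from the surrounding description, not a verbatim copy of the paper's Equation (4).

```latex
\min_{w,\ \xi_i \ge 0,\ \tau_i \ge 0}\;
  \tfrac{1}{2}\lVert w - w_t \rVert^2 + C\,\xi_i + D\,\tau_i
\quad\text{s.t.}\quad
  \delta s(x_i, y^+, y^-) \ge \Delta_i(y^-) - \Delta_i(y^+) - \xi_i,
\qquad
  -B - \tau_i \le \delta s(x_i, y^+, y_w) \le B + \tau_i .
```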

(5) where the α Lagrange multiplier corresponds to the standard margin constraint, while β and β∗ each correspond to a bounding constraint, and ωi(y+, y0) corresponds to the difference of f(xi, y+, d+) and f(xi, y0, d0). [sent-141, score-0.486]
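Ignoring the slack penalty terms, the Lagrangian behind these multipliers can be sketched as follows; setting its gradient with respect to w to zero recovers the additive structure of the updates in Algorithm 2 (w moves by +γα ω for the margin constraint, −γβ ω for the upper bound, and +γβ∗ ω for the lower bound). The paper's actual Equation (5) may differ in bookkeeping details.

```latex
\mathcal{L} = \tfrac{1}{2}\lVert w - w_t \rVert^2
  - \sum_{y'} \alpha_{y'}\bigl[\,\delta s(x_i, y^+, y') - \Delta_i(y') + \Delta_i(y^+)\bigr]
  + \sum_{y} \beta_{y}\bigl[\,\delta s(x_i, y^+, y) - B\bigr]
  - \sum_{y} \beta^{*}_{y}\bigl[\,\delta s(x_i, y^+, y) + B\bigr],
\qquad
\delta s(x_i, y^+, y) = w^{\top}\omega_i(y^+, y).
```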

Algorithm 1 decomposes the overall problem into subproblems which are solved independently by creating working sets which correspond to the largest violations of either the margin constraint or bounding constraints, and iteratively satisfying the constraints in each set. [sent-145, score-0.486]

49 The updates amount to performing a subgradient descent step to update w in accordance with the constraints. [sent-149, score-0.113]

Since the constraint matrix of the dual program is not strictly decomposable across constraint types, we are in effect solving an approximation of the original problem. [sent-150, score-0.201]

Algorithm 1 RM Cutting Plane Algorithm (adapted from Shivaswamy and Jebara (2009a)). Require: i-th training example (xi, yi), weight w, margin set Si1, … [sent-151, score-0.328]

(From Algorithm 1:) y1 ← arg max_{y ∈ Y(xi)} H(y); y2 ← arg max_{y ∈ Y(xi)} G(y), where G(y) := δs(xi, y+, y); y3 ← arg min_{y ∈ Y(xi)} G(y); ξ ← max{0, max_{y ∈ Si1} H(y)}; … [sent-167, score-0.217]
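In code, the constraint-selection step just described can be sketched as below; the arrays and names are invented, and the cost vector is assumed to already hold ∆i(y) − ∆i(y+) for each candidate.

```python
import numpy as np

def select_violators(w, f_plus, kbest_feats, costs, B):
    """Find, for one sentence, the worst margin violator (y1) and the
    worst upper- and lower-bound violators (y2, y3)."""
    G = (f_plus - kbest_feats) @ w         # G(y) = delta_s(x, y+, y)
    H = costs - G                          # margin violation H(y)
    y1 = int(np.argmax(H))                 # most violated margin constraint
    y2 = int(np.argmax(G))                 # largest spread: upper-bound violator
    y3 = int(np.argmin(G))                 # smallest spread: lower-bound violator
    xi = max(0.0, H[y1])                   # current margin slack
    tau = max(0.0, G[y2] - B, -B - G[y3])  # current bounding slack
    return y1, y2, y3, xi, tau
```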

, 2006), which would simply bypass the cutting plane and select the most violated constraint for … (Algorithm 2, RM update with α, β, β∗: procedure OPTIMIZE(w, Si1, Si2, Si3, C, B)). [sent-175, score-0.3]

(Algorithm 2, OPTIMIZE: for the max-constraint pair y, y0 from Si1) γα ← (∆i(y0) − ∆i(y) − δs(xi, y, y0)) / ||ω(y, y0)||²; γα ← max(−αy, min(αy0, γα)); αy ← αy + γα; αy0 ← αy0 − γα; w ← w + γα ω(y, y0); end procedure. procedure UPDATEUPPERBOUND(w, Si2, B): βy ← 0 for all y ∈ Si2. [sent-213, score-0.16]

(UPDATEUPPERBOUND: for the max-constraint y from Si2) γβ ← max(0, (δs(xi, y+, y) − B) / ||ω(y+, y)||²); βy ← βy + γβ; w ← w − γβ ω(y+, y); end procedure. procedure UPDATELOWERBOUND(w, Si3, B): βy∗ ← 0 for all y ∈ Si3. [sent-219, score-0.16]

(UPDATELOWERBOUND: for the max-constraint y from Si3) γβ∗ ← max(0, (−B − δs(xi, y+, y)) / ||ω(y+, y)||²); βy∗ ← βy∗ + γβ∗; w ← w + γβ∗ ω(y+, y); end procedure. … each set, if there is one, and perform the corresponding parameter updates in Alg. [sent-225, score-0.168]

We refer to the resulting passive-aggressive algorithm as RM-PA, and the cutting plane version as RM-CP. [sent-227, score-0.119]

58 As in the standard MIRA solution, we select the maximum margin constraint violator, y−, shown as the triangle, and update such that the margin is greater than the cost. [sent-231, score-0.806]

Additionally, we select the maximum bound- (Figure 2: RM update with margin and bounding constraints.) [sent-232, score-0.56]

(Table 1: Corpus statistics.) -ing constraint violator, yw, shown as the upside-down triangle, and update so the distance from y+ is no greater than B. [sent-242, score-0.181]
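Putting the margin step and the two bounding steps together, a passive-aggressive version of the update just described (RM-PA) might look like the sketch below. It assumes the usual clipped closed-form step for each constraint and is our reconstruction for illustration, not the authors' implementation.

```python
import numpy as np

def rm_pa_update(w, f_plus, kbest_feats, costs, C=0.01, D=0.01, B=1.0):
    """One-sentence RM-PA update.  f_plus: features of the hope hypothesis y+;
    kbest_feats, costs: features and costs (Delta_i(y) - Delta_i(y+)) of the
    k-best candidates."""
    # 1) Margin step: push y+ above the worst cost-augmented violator y-.
    delta = (f_plus - kbest_feats) @ w
    viol = costs - delta
    j = int(np.argmax(viol))
    omega = f_plus - kbest_feats[j]
    if viol[j] > 0 and omega.dot(omega) > 0:
        w = w + min(C, viol[j] / omega.dot(omega)) * omega

    # 2) Bounding steps: keep the projection of the worst candidates within
    #    distance B of y+ (delta is not recomputed between the two bound
    #    steps, which is a simplification).
    delta = (f_plus - kbest_feats) @ w
    up, lo = int(np.argmax(delta)), int(np.argmin(delta))
    for idx, excess, sign in ((up, delta[up] - B, -1.0), (lo, -B - delta[lo], 1.0)):
        omega = f_plus - kbest_feats[idx]
        if excess > 0 and omega.dot(omega) > 0:
            w = w + sign * min(D, excess / omega.dot(omega)) * omega
    return w
```

Clipping the bounding steps at D mirrors the role of D as the penalty on bounding-constraint violations described above, just as C caps the margin step.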

61 1 Setup To evaluate the advantage of explicitly accounting for the spread of the data, we conducted several experiments on two Chinese-English translation test sets, using two different feature sets in each. [sent-244, score-0.494]

62 We applied several competitive optimizers as baselines: hypergraph-based MERT (Kumar et al. [sent-254, score-0.138]

63 All optimizers were implemented in cdec and use the same system configuration, thus the only independent variable is the optimizer itself. [sent-262, score-0.269]

64 2 Feature Sets We experimented with a small (basic) feature set, and a large (sparse) feature set. [sent-273, score-0.126]

65 For experiments with a larger feature set, we introduced additional lexical and non-lexical sparse Boolean features of the form commonly found in the literature (Chiang et al. [sent-275, score-0.183]

Table 2: Active sparse feature templates. [sent-280, score-0.183]

67 4 million possible features, of which only a fraction were active for the respective tuning set and optimizer, as shown in Table 2. [sent-291, score-0.126]

Surprisingly, it seems that MIRA did not benefit as much from the sparse features as RM. [sent-296, score-0.12]

69 6 TER improvement over MERT since MERT has been shown to be competitive with small numbers of features compared to high-dimensional optimizers such as MIRA (Chiang et al. [sent-299, score-0.138]

For the tuning set, the decoder performance was consistently the lowest with RM, compared to the … (Footnote 5: In the small feature set RAMPION yielded similar best BLEU scores, but worse TER.) [sent-301, score-0.193]

We believe this is due to the RM bounding constraint being more resistant to overfitting the training data, and thus allowing for improved generalization. [sent-306, score-0.234]

72 Conversely, while PRO had the second lowest tuning scores, it seemed to display signs of underfitting in the basic and large feature settings. [sent-307, score-0.188]

73 The sparse feature templates resulted here in a total of 4. [sent-313, score-0.183]

However, this was also usually coupled with a higher brevity penalty (BP) than MIRA, with the BP increasing slightly when moving to the sparse setting. [sent-322, score-0.12]

75 6 Discussion The trend of the results, summarized as RM gain over other optimizers averaged over all test sets, is presented in Table 5. [sent-323, score-0.138]

76 RM shows clear advantage in both basic and sparse feature sets, over all other state-of-the-art optimizers. [sent-324, score-0.217]

The RM gains are notably higher in the large feature set, which we take … (Footnote 6: In our preliminary experiments with the smaller trigram LM, MERT did better on MT05 in the smaller feature set, and MIRA had a small advantage in two cases.) [sent-325, score-0.126]

Table 3: Performance on Zh-En with basic (left) and sparse (right) feature sets on MT03 and MT05. [sent-332, score-0.217]

79 Table 4: Performance on Ar-En with basic (left) and sparse (right) feature sets on MT05 and MT08. [sent-333, score-0.217]

80 over other optimizers averaged as an indication for the importance of bounding the spread. [sent-339, score-0.296]

81 Spread analysis: For RM, the average spread of the projected data in the Chinese-English small feature set was 0. [sent-340, score-0.47]

In comparison, the spread of the data for MIRA was 5. [sent-347, score-0.362]

4 for the best iteration, while MIRA had a spread of 14… [sent-353, score-0.362]

Similarly, on Arabic-English, RM had a spread of 0. [sent-356, score-0.362]

4 in the sparse setting, while MIRA's spread was 9… [sent-360, score-0.482]

1, for the small and sparse settings, respectively. [sent-364, score-0.12]

Notice that the average spread for RM stays about the same when moving to higher dimensions, with the variance decreasing in both cases. [sent-365, score-0.362]

88 For MIRA, however, the average spread increases in both cases, with the variance being much higher than RM. [sent-366, score-0.362]

89 For instance, observe that the spread of MIRA on Chinese grows from 5. [sent-367, score-0.362]

90 While bounding the spread is useful in the low-dimensional setting (0. [sent-370, score-0.52]

91 5 BLEU gain with RM over MIRA as shown in Table 3), accounting for the spread is even more crucial with sparse features, where MIRA gains only up to 0. [sent-372, score-0.513]

92 These results support the claim that our imposed bound B indeed helps decrease the spread, and that, in turn, lower spread yields better generalization performance. [sent-374, score-0.448]

93 Therefore we conducted a coarse error analysis on 15 randomly selected sentences from MERT, RMM and MIRA, with basic and sparse feature settings for the latter two. [sent-377, score-0.217]

94 However, RM seemed to benefit the most from the sparse features, as its bad word order rate dropped close to MIRA, and its excess/wrong function word rate dropped below that of MIRA with sparse features (MIRA’s rate actually doubled from its basic feature set). [sent-382, score-0.337]

While MIRA is only concerned with the margin between y+ and y−, RM also accounts for the distance between y+ and yw. [sent-387, score-0.359]

96 Also, we only explored several settings of B, and there remains a continuum of RM solutions that trade off between margin and spread in different ways. [sent-390, score-0.69]

7 Conclusions and Future Work We have introduced RM, a novel online margin-based algorithm designed for optimizing high-dimensional feature spaces, which introduces constraints into a large-margin optimizer that bound the spread of the projection of the data while maximizing the margin. [sent-394, score-0.632]

98 The closed-form online update for our relative margin solution accounts for surrogate references and latent variables. [sent-395, score-0.601]

(Footnote 7) These improvements are achieved using standard, relatively small tuning sets, contrasted with improvements involving sparse features obtained using much larger tuning sets, on the order of hundreds of thousands of sentences (Liang et al. [sent-400, score-0.302]

100 In future work we also intend to explore using additional sparse features that are known to be useful in translation, e. [sent-405, score-0.12]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spread', 0.362), ('margin', 0.328), ('mira', 0.323), ('rm', 0.283), ('xi', 0.22), ('rampion', 0.208), ('rmm', 0.208), ('bounding', 0.158), ('optimizers', 0.138), ('shivaswamy', 0.138), ('crammer', 0.121), ('sparse', 0.12), ('bleu', 0.11), ('xy', 0.098), ('tuning', 0.091), ('pro', 0.09), ('ter', 0.083), ('jebara', 0.08), ('chiang', 0.078), ('constraint', 0.076), ('mert', 0.075), ('optimizer', 0.074), ('smt', 0.074), ('update', 0.074), ('plane', 0.063), ('yw', 0.063), ('feature', 0.063), ('structured', 0.062), ('tillmann', 0.061), ('arg', 0.06), ('ramp', 0.057), ('cdec', 0.057), ('cutting', 0.056), ('online', 0.054), ('wt', 0.054), ('generalization', 0.053), ('simianer', 0.053), ('darker', 0.053), ('surrogate', 0.053), ('argw', 0.052), ('endw', 0.052), ('umiacs', 0.052), ('relative', 0.051), ('gimpel', 0.051), ('dual', 0.049), ('koby', 0.048), ('batch', 0.047), ('eidelman', 0.047), ('blunsom', 0.046), ('performer', 0.046), ('wfor', 0.046), ('bartlett', 0.046), ('projection', 0.046), ('projected', 0.045), ('optimization', 0.044), ('qp', 0.043), ('maxy', 0.042), ('nse', 0.042), ('maximization', 0.042), ('solution', 0.041), ('hypotheses', 0.041), ('radius', 0.04), ('tsochantaridis', 0.04), ('vladimir', 0.04), ('dredze', 0.04), ('updates', 0.039), ('yielded', 0.039), ('hopkins', 0.039), ('translation', 0.038), ('lect', 0.038), ('dotted', 0.038), ('max', 0.037), ('liang', 0.036), ('loss', 0.035), ('active', 0.035), ('amll', 0.035), ('caonxisttrearin', 0.035), ('confidenceweighted', 0.035), ('pannagadatta', 0.035), ('rademacher', 0.035), ('violator', 0.035), ('yax', 0.035), ('basic', 0.034), ('sij', 0.034), ('discriminative', 0.034), ('optimize', 0.034), ('bound', 0.033), ('bp', 0.032), ('foster', 0.032), ('distance', 0.031), ('minimum', 0.031), ('watanabe', 0.031), ('accounting', 0.031), ('squares', 0.031), ('updating', 0.031), ('procedure', 0.031), ('arun', 0.031), ('gradientbased', 0.031), ('mya', 0.031), ('smith', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

2 0.29526308 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin

Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.

3 0.23477198 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

4 0.19180982 24 acl-2013-A Tale about PRO and Monsters

Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel

Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved testtime BLEU scores. Thus, we recommend them to anybody using PRO, monsterbeliever or not. – 1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO’s optimization objective, namely sentencelevel BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO’s surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6. 2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j0) ⇔ hw (i, j) > hw (i, j0), where g is a local scoring fu hnction (typically, sentencelevel BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j0 that were obtained using w. If g(i, j) > g(i, j0), we will refer to j and j0 as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only. 
12 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 12–17, Avg. Lengths Avg. BLEU+1 iter. pos neg ref. pos neg 1 45.2 44.6 46.5 52.5 37.6 2 3 4 5 ... 25 46.4 46.4 46.4 46.3 ... 47.9 70.5 261.0 250.0 248.0 ... 229.0 53.2 53.4 53.0 53.0 ... 52.5 52.8 52.4 52.0 52.1 ... 52.2 14.5 2.19 2.30 2.34 ... 2.81 Table 1: PRO iterations, tuning on long sentences. Table 1shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO’s parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters. iteration Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis). Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example. Optimizer Objective BLEU PROsent-BLEU+144.57 MERT corpus-BLEU 47.53 MIRA pseudo-doc-BLEU 47.80 PRO (6= objective)pseudo-doc-BLEU21.35 PMRIORA (6= =(6= o bojbejcetcivteiv)e) sent-BLEU+1 47.59 PMRIRO,A PC (6=-sm obojoectthiv,e g)roundfixed sent-BLEU+145.71 Table 2: PRO vs. MERT vs. MIRA. We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO’s,1 we further experimented with plugging MIRA’s objective into PRO and PRO’s objective into MIRA. The resulting MIRA scores were not much different from before, while PRO’s score dropped even further; we also found mon- sters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT1 1 for Spanish-English; this yielded a drop in BLEU from 29.63 (1M2/emEs/RprTes)la to 27.12 n(inPg/RtmOp.)1.1 MERT2 **REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness . 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - **IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in **IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and **IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure Ftroai ngtkhsu et rahefef 2he : Ea t h xte ha,e motfo pstohlmeee r leafst eorfe wne c, et etr laonngs olfa t hei opnar a ofn sdo hhee oy fpwhaoitst hh a]r ee usisn i ostu tofra tnhes ilna tbiakoern s, haef ctoeokr it hee roant ainod nthse 1 t,h 3we aknonw d, 4. T@-@h eAl l tahes ft trwce o, tho ypotheses are monsters. 1See (Cherry and Foster, 2012) for details on objectives. 2Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters. 13 3 Slaying Monsters: Theory Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO’s selection and acceptance steps. 3.1 What is PRO? PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps: 1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations. 2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points). 3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores. 4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair. We believe that monsters are nurtured by PRO’s selection and acceptance policies. PRO’s selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut-off ones that differ too much based on BLEU+1 or length. PRO’s acceptance step selects Ξ = 50 pairs with the highest BLEU+1 differentials, which creates breeding ground for monsters since these pairs are very likely to include one monster and one good hypothesis. 
Below we discuss monster slaying geared towards the selection and acceptance steps of PRO. 3.2 Slaying at Selection In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic. Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair. Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm. Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring functions, i.e., the BLEU+1 scores, are distributed nor- mally: g(i, j) ∼ N(µ, σ2). Given a sample of hypothesis (tira,nj)sl ∼atio Nn(sµ {j} of the same source sentpeontchee i, we can ensstim {ja}te o σ empirically. Then, the difference ∆ = g(i, j) − g(i, j0) would be tdhisetr diibfufteerde normally w gi(thi, mean zero and variance 2σ2. Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ2). 3.3 Slaying at Acceptance Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs, among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough. Ξ 14 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) PRO fix Avg. for 3 reruns BLEU StdDev Pos Lengths Neg Ref BLEU+1 Avg. for 3 reruns Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Max diff. 
cut-offBLEU+1 max=10†47.940.16547.949.649.449.439.947.770.035 BLEU+1 max=20 † 47.73 0.136 47.7 55.5 51.1 49.8 32.7 47.85 0.049 LEN max=5 † 48.09 0.021 46.8 47.0 47.9 52.9 37.8 47.73 0.051 LEN max=10 † 47.99 0.025 47.3 48.5 48.7 52.5 35.6 47.80 0.056 OutliersBLEU+1 λ=2.0†48.050.11946.847.247.752.239.547.470.090 BLEU+1 λ=3.0 LEN λ=2.0 LEN λ=3.0 47.12 46.68 47.02 1.348 2.005 0.727 47.6 49.3 48.2 168.0 82.7 163.0 53.0 53.1 51.4 51.7 52.3 51.4 3.9 5.3 4.2 47.53 47.49 47.65 0.038 0.085 0.096 Stoch. sampl.∆ BLEU+146.331.00046.8216.053.353.12.447.740.035 ∆ LEN 46.36 1.281 47.4 201.0 52.9 53.4 2.9 47.78 0.081 Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. 4 Attacking Monsters: Practice Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance. 4.1 Experimental Setup We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest-50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reorder- ing (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST’s scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the –return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores. 4.2 Selection Alternatives Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO’s tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a nonmonster prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples (in bold). The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO’s stability. On the full dataset, BLEU scores are about the same as for the original PRO (with small improvement for BLEU+1 max=20), but the standard deviations are slightly better. 
Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2, yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability. 15 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) Avg. for 3 reruns Lengths BLEU+1 Avg. for 3 reruns PRO fix BLEU StdDev Pos Neg Ref Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Rand. acceptPRO, rand††47.870.14747.748.548.7047.742.947.590.114 OutliersBLEU+1 λ=2.0, rand∗47.850.07848.248.448.947.543.647.620.091 BLEU+1 λ=3.0, rand 47.97 0.168 47.6 47.6 48.4 47.8 43.6 47.44 0.070 LEN λ=2.0, rand∗ 47.69 0.114 47.8 47.8 48.6 47.9 43.6 47.48 0.046 LEN λ=3.0, rand 47.89 0.235 47.8 48.0 48.7 47.7 43. 1 47.64 0.090 Stoch. sampl.∆ BLEU+1, rand∗47.990.08747.948.048.747.843.547.670.096 ∆ LEN, rand∗ 47.94 0.060 47.8 47.9 48.6 47.8 43.6 47.65 0.097 Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance. 4.3 Random Acceptance Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, nonmonster prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting. 5 Related Work We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fix the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011). 
With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), MIRA– PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters. 6 Tale’s Moral and Future Battles We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO’s length bias. 16 References Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proc. of Workshop on Statistical Machine Translation, WMT ’08, pages 26–34. Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. MT Summit XIII: the Machine Translation Summit, pages 32–39. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 427–436. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 224–233. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine transla- tion. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 218–226. Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 11, pages 176–181 . Michael Denkowski and Alon Lavie. 2011. Meteortuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report, CMU-LTI-1 1-01 1, Language Technologies Institute, Carnegie Mellon University. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’09, pages 242–249. Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 221–231. Kenneth Heafield. 2011. 
KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT ’ 11, pages 187–197. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 11, pages 1352–1362. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLTNAACL ’03, pages 48–54. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT ’05. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the Meeting of the Association for Computational Linguistics, ACL ’07, pages 177–180. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL ’04, pages 169–176. Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23: 105–1 15. Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING ’08, pages 585–592. Preslav Nakov, Francisco Guzm a´n, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings ofthe International Conference on Computational Linguistics, COLING ’ 12, pages 1979–1994. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’03, pages 160–167. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’08, pages 117–120. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 12, pages 11–21. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, pages 764–773. 17

5 0.13229547 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.

6 0.1163165 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

7 0.11361998 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

8 0.10316277 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

9 0.092881382 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

10 0.088138595 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

11 0.083905645 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

12 0.079956211 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

13 0.078987926 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

14 0.075729184 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

15 0.075459108 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

16 0.074832037 328 acl-2013-Stacking for Statistical Machine Translation

17 0.074500605 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

18 0.072308548 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

19 0.072048545 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

20 0.071990669 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, -0.109), (2, 0.071), (3, 0.061), (4, -0.003), (5, 0.018), (6, 0.07), (7, -0.007), (8, -0.045), (9, 0.055), (10, -0.023), (11, -0.003), (12, -0.05), (13, -0.089), (14, -0.017), (15, 0.114), (16, -0.058), (17, 0.106), (18, 0.127), (19, -0.029), (20, 0.218), (21, 0.144), (22, -0.029), (23, 0.041), (24, 0.065), (25, 0.025), (26, 0.094), (27, 0.048), (28, -0.105), (29, -0.077), (30, 0.234), (31, 0.028), (32, -0.009), (33, 0.035), (34, 0.068), (35, -0.051), (36, -0.019), (37, 0.15), (38, 0.102), (39, -0.023), (40, 0.073), (41, -0.01), (42, 0.033), (43, -0.012), (44, -0.08), (45, -0.03), (46, 0.09), (47, -0.027), (48, 0.084), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95117569 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

2 0.89298284 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin

Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.

3 0.87267894 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

4 0.84333193 24 acl-2013-A Tale about PRO and Monsters

Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel

Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1-based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved test-time BLEU scores. Thus, we recommend them to anybody using PRO, monster-believer or not.

1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO's optimization objective, namely sentence-level BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO's surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6.

2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j′) ⇔ hw(i, j) > hw(i, j′), where g is a local scoring function (typically, sentence-level BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j′ that were obtained using w. If g(i, j) > g(i, j′), we will refer to j and j′ as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only.
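The two-part description of a monster above (much longer than the positive example and the reference, and very low BLEU+1) can be turned into a simple check. The predicate below is only an illustration; the thresholds are assumptions, not values from the paper.

```python
def looks_like_monster(pos_len, neg_len, ref_len, pos_bleu1, neg_bleu1,
                       len_ratio=2.0, bleu_floor=0.10):
    """Flag a negative example that is (i) much longer than the positive
    example and the reference and (ii) has a very low BLEU+1 score."""
    much_longer = neg_len > len_ratio * max(pos_len, ref_len)
    very_low_bleu = neg_bleu1 < bleu_floor and neg_bleu1 < 0.5 * pos_bleu1
    return much_longer and very_low_bleu

# Averages in the spirit of Table 1 below, iteration 3: ~261-token negatives
# against ~46-token positives and ~53-token references, BLEU+1 around 0.02.
print(looks_like_monster(46.4, 261.0, 53.4, 0.524, 0.022))  # True
```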
Table 1: PRO iterations, tuning on long sentences (per-iteration averages).

iter | avg. pos length | avg. neg length | avg. ref length | avg. pos BLEU+1 | avg. neg BLEU+1
1    | 45.2 | 44.6  | 46.5 | 52.5 | 37.6
2    | 46.4 | 70.5  | 53.2 | 52.8 | 14.5
3    | 46.4 | 261.0 | 53.4 | 52.4 | 2.19
4    | 46.4 | 250.0 | 53.0 | 52.0 | 2.30
5    | 46.3 | 248.0 | 53.0 | 52.1 | 2.34
...  | ...  | ...   | ...  | ...  | ...
25   | 47.9 | 229.0 | 52.5 | 52.2 | 2.81

Table 1 shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover, since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO's parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters.

[Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis).]

Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example.

Table 2: PRO vs. MERT vs. MIRA.

Optimizer | Objective | BLEU
PRO | sent-BLEU+1 | 44.57
MERT | corpus-BLEU | 47.53
MIRA | pseudo-doc-BLEU | 47.80
PRO (≠ objective) | pseudo-doc-BLEU | 21.35
MIRA (≠ objective) | sent-BLEU+1 | 47.59
PRO, PC-smooth, ground fixed | sent-BLEU+1 | 45.71

We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO's [1], we further experimented with plugging MIRA's objective into PRO and PRO's objective into MIRA. The resulting MIRA scores were not much different from before, while PRO's score dropped even further; we also found monsters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind MERT and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT11 for Spanish-English; this yielded a drop in BLEU from 29.63 (MERT) to 27.12 (PRO).

The reference and the hypotheses shown in Figure 2 are reproduced below.

**REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness .
**IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in

**IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and

**IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure

[Figure 2: The reference translation and the hypotheses after iterations 1, 3 and 4. The last two hypotheses are monsters.]

Footnotes: [1] See (Cherry and Foster, 2012) for details on objectives. [2] Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters.

3 Slaying Monsters: Theory

Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO's selection and acceptance steps.

3.1 What is PRO?

PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps (sketched in code below):

1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations.
2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points).
3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores.
4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair.

We believe that monsters are nurtured by PRO's selection and acceptance policies. PRO's selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut off ones that differ too much based on BLEU+1 or length. PRO's acceptance step selects the Ξ = 50 pairs with the highest BLEU+1 differentials, which creates a breeding ground for monsters, since these pairs are very likely to include one monster and one good hypothesis.
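A compact sketch of substeps 1 to 3 for a single sentence, written only to make the Γ/α/Ξ pipeline concrete. The data layout (hypothesis ids plus a BLEU+1 lookup) is an assumption, and the ranker training of substep 4 is omitted.

```python
import random

GAMMA, ALPHA, XI = 5000, 0.05, 50

def pro_pairs_for_sentence(candidates, bleu1, rng=random):
    """candidates: hypothesis ids accumulated over all previous iterations;
    bleu1: dict mapping hypothesis id -> sentence-level BLEU+1 in [0, 1]."""
    # 1. Sampling: Gamma uniformly random candidate pairs.
    sampled = [(rng.choice(candidates), rng.choice(candidates))
               for _ in range(GAMMA)]
    # 2. Selection: keep pairs whose BLEU+1 scores differ by more than alpha.
    selected = [(j, k) for j, k in sampled if abs(bleu1[j] - bleu1[k]) > ALPHA]
    # 3. Acceptance: keep the Xi pairs with the largest differentials; this
    #    is the step that favors monster pairs once a monster is present.
    selected.sort(key=lambda p: abs(bleu1[p[0]] - bleu1[p[1]]), reverse=True)
    accepted = selected[:XI]
    # 4. (Not shown) orient each pair positive-first and train a ranker.
    return [(j, k) if bleu1[j] > bleu1[k] else (k, j) for j, k in accepted]

# Toy demo with three hypotheses for one sentence; "h3" plays the monster.
scores = {"h1": 0.52, "h2": 0.45, "h3": 0.02}
print(pro_pairs_for_sentence(["h1", "h2", "h3"], scores)[:3])
```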
Below we discuss monster slaying geared towards the selection and acceptance steps of PRO.

3.2 Slaying at Selection

In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials, nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic.

Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair.

Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm.

Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring function, i.e., the BLEU+1 scores, are distributed normally: g(i, j) ∼ N(µ, σ²). Given a sample of hypothesis translations {j} of the same source sentence i, we can estimate σ empirically. Then, the difference ∆ = g(i, j) − g(i, j′) would be distributed normally with mean zero and variance 2σ². Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ²).

3.3 Slaying at Acceptance

Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs, if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough.
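A minimal sketch of the Section 3.2 filters, folded into one predicate purely for compactness (the paper evaluates them as separate alternatives). The thresholds, the dictionary layout, and the use of the unnormalized N(0, 2σ²) density as an acceptance probability are assumptions made for illustration.

```python
import math
import random
import statistics

def keep_pair(pos, neg, bleu1, lengths, max_bleu_diff=0.20, max_len_diff=10,
              lam=2.0, rng=random):
    """pos/neg are hypothesis ids; bleu1/lengths map every hypothesis of the
    same source sentence to its BLEU+1 score / length."""
    # Cut-off (a): maximal difference in BLEU+1 scores.
    if abs(bleu1[pos] - bleu1[neg]) > max_bleu_diff:
        return False
    # Cut-off (b): maximal difference in lengths.
    if abs(lengths[pos] - lengths[neg]) > max_len_diff:
        return False
    # Outlier filter: reject hypotheses more than lam stddevs from the mean.
    mu = statistics.mean(bleu1.values())
    sigma = statistics.pstdev(bleu1.values())
    if sigma and (abs(bleu1[neg] - mu) > lam * sigma
                  or abs(bleu1[pos] - mu) > lam * sigma):
        return False
    # Stochastic sampling: accept with a probability that decays with the
    # differential, using the unnormalized density of N(0, 2 * sigma^2).
    if sigma:
        delta = bleu1[pos] - bleu1[neg]
        if rng.random() > math.exp(-delta * delta / (4.0 * sigma * sigma)):
            return False
    return True

bleu1 = {"pos": 0.52, "neg": 0.03, "other": 0.50}
lengths = {"pos": 46, "neg": 261, "other": 48}
print(keep_pair("pos", "neg", bleu1, lengths))  # False: trips the BLEU+1 cut-off
```

Random acceptance (Section 3.3) would then simply replace the top-Ξ step, for example random.sample(survivors, XI) over whatever pairs the filters let through.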
Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. Test and TEST(tune:full) columns are averages over 3 reruns; the tuning statistics are from run 1, iteration 25.

PRO fix | Test BLEU | StdDev | Tune pos len | Tune neg len | Tune ref len | Tune pos BLEU+1 | Tune neg BLEU+1 | Test(full) BLEU | StdDev
PRO (baseline) | 44.70 | 0.266 | 47.9 | 229.0 | 52.5 | 52.2 | 2.8 | 47.80 | 0.052
Max diff. cut-off: BLEU+1 max=10 † | 47.94 | 0.165 | 47.9 | 49.6 | 49.4 | 49.4 | 39.9 | 47.77 | 0.035
Max diff. cut-off: BLEU+1 max=20 † | 47.73 | 0.136 | 47.7 | 55.5 | 51.1 | 49.8 | 32.7 | 47.85 | 0.049
Max diff. cut-off: LEN max=5 † | 48.09 | 0.021 | 46.8 | 47.0 | 47.9 | 52.9 | 37.8 | 47.73 | 0.051
Max diff. cut-off: LEN max=10 † | 47.99 | 0.025 | 47.3 | 48.5 | 48.7 | 52.5 | 35.6 | 47.80 | 0.056
Outliers: BLEU+1 λ=2.0 † | 48.05 | 0.119 | 46.8 | 47.2 | 47.7 | 52.2 | 39.5 | 47.47 | 0.090
Outliers: BLEU+1 λ=3.0 | 47.12 | 1.348 | 47.6 | 168.0 | 53.0 | 51.7 | 3.9 | 47.53 | 0.038
Outliers: LEN λ=2.0 | 46.68 | 2.005 | 49.3 | 82.7 | 53.1 | 52.3 | 5.3 | 47.49 | 0.085
Outliers: LEN λ=3.0 | 47.02 | 0.727 | 48.2 | 163.0 | 51.4 | 51.4 | 4.2 | 47.65 | 0.096
Stoch. sampl.: ∆ BLEU+1 | 46.33 | 1.000 | 46.8 | 216.0 | 53.3 | 53.1 | 2.4 | 47.74 | 0.035
Stoch. sampl.: ∆ LEN | 46.36 | 1.281 | 47.4 | 201.0 | 52.9 | 53.4 | 2.9 | 47.78 | 0.081

4 Attacking Monsters: Practice

Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance.

4.1 Experimental Setup

We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest 50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reordering (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST's scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the --return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores.

4.2 Selection Alternatives

Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO's tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a non-monster-prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples. The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO's stability. On the full dataset, BLEU scores are about the same as for the original PRO (with a small improvement for BLEU+1 max=20), but the standard deviations are slightly better.
Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2 yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively, though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability.

Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The double dagger (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Column layout as in Table 3.

PRO fix | Test BLEU | StdDev | Tune pos len | Tune neg len | Tune ref len | Tune pos BLEU+1 | Tune neg BLEU+1 | Test(full) BLEU | StdDev
PRO (baseline) | 44.70 | 0.266 | 47.9 | 229.0 | 52.5 | 52.2 | 2.8 | 47.80 | 0.052
Rand. accept: PRO, rand †† | 47.87 | 0.147 | 47.7 | 48.5 | 48.7 | 47.7 | 42.9 | 47.59 | 0.114
Outliers: BLEU+1 λ=2.0, rand ∗ | 47.85 | 0.078 | 48.2 | 48.4 | 48.9 | 47.5 | 43.6 | 47.62 | 0.091
Outliers: BLEU+1 λ=3.0, rand | 47.97 | 0.168 | 47.6 | 47.6 | 48.4 | 47.8 | 43.6 | 47.44 | 0.070
Outliers: LEN λ=2.0, rand ∗ | 47.69 | 0.114 | 47.8 | 47.8 | 48.6 | 47.9 | 43.6 | 47.48 | 0.046
Outliers: LEN λ=3.0, rand | 47.89 | 0.235 | 47.8 | 48.0 | 48.7 | 47.7 | 43.1 | 47.64 | 0.090
Stoch. sampl.: ∆ BLEU+1, rand ∗ | 47.99 | 0.087 | 47.9 | 48.0 | 48.7 | 47.8 | 43.5 | 47.67 | 0.096
Stoch. sampl.: ∆ LEN, rand ∗ | 47.94 | 0.060 | 47.8 | 47.9 | 48.6 | 47.8 | 43.6 | 47.65 | 0.097

Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance.

4.3 Random Acceptance

Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even a small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat, though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, non-monster-prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting.

5 Related Work

We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fixing the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011).
With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), and MIRA–PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters.

6 Tale's Moral and Future Battles

We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO's length bias.

References

Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, WMT '08, pages 26–34.
Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. In MT Summit XIII: the Machine Translation Summit, pages 32–39.
Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT '12, pages 427–436.
David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 224–233.
David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT '09, pages 218–226.
Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL '11, pages 176–181.
Michael Denkowski and Alon Lavie. 2011. Meteor-tuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report CMU-LTI-11-011, Language Technologies Institute, Carnegie Mellon University.
George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT '09, pages 242–249.
Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT '12, pages 221–231.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT '11, pages 187–197.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1352–1362.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL '03, pages 48–54.
Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT '05.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL '07, pages 177–180.
Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL '04, pages 169–176.
Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.
Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING '08, pages 585–592.
Preslav Nakov, Francisco Guzmán, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings of the International Conference on Computational Linguistics, COLING '12, pages 1979–1994.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL '03, pages 160–167.
Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL '08, pages 117–120.
Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL '12, pages 11–21.
Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '07, pages 764–773.

5 0.71712053 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in performance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
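The abstract above describes inducing features by fitting regression trees to the gradient of a BLEU-related loss. The sketch below shows the generic functional-gradient-boosting step with a stand-in squared loss and scikit-learn trees; it is an illustration of the technique, not the authors' setup, and the data is made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_features(X, y, n_trees=3, learning_rate=0.1, max_depth=2):
    """Fit each regression tree to the negative gradient of the loss; every
    fitted tree can then be added to the model as a new non-linear feature."""
    pred = np.zeros(len(y))
    trees = []
    for _ in range(n_trees):
        # Negative gradient of squared loss; the paper uses a differentiable
        # BLEU-related loss here instead.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred += learning_rate * tree.predict(X)
    return trees

# Made-up local-phrase feature vectors and targets.
X = np.array([[0.1, 2.0], [0.4, 1.0], [0.9, 0.5], [0.3, 1.5]])
y = np.array([0.2, 0.5, 0.9, 0.4])
print(boost_features(X, y)[0].predict(X))
```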

6 0.69603997 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

7 0.61021036 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

8 0.59552628 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

9 0.58287346 328 acl-2013-Stacking for Statistical Machine Translation

10 0.51122236 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

11 0.48816073 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

12 0.45607159 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

13 0.43818671 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

14 0.40068567 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

15 0.3903681 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

16 0.38535246 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

17 0.38078365 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

18 0.37918803 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

19 0.37814647 237 acl-2013-Margin-based Decomposed Amortized Inference

20 0.37786621 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.041), (6, 0.054), (7, 0.108), (11, 0.079), (14, 0.01), (24, 0.031), (26, 0.086), (28, 0.017), (35, 0.065), (42, 0.092), (48, 0.093), (70, 0.035), (85, 0.012), (87, 0.016), (88, 0.026), (90, 0.032), (95, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92694604 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread ofthe projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and ArabicEnglish translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant im- provements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

2 0.91596305 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

Author: Xiaojun Quan ; Chunyu Kit ; Yan Song

Abstract: This paper studies the problem of nonmonotonic sentence alignment, motivated by the observation that coupled sentences in real bitexts do not necessarily occur monotonically, and proposes a semisupervised learning approach based on two assumptions: (1) sentences with high affinity in one language tend to have their counterparts with similar relatedness in the other; and (2) initial alignment is readily available with existing alignment techniques. They are incorporated as two constraints into a semisupervised learning framework for optimization to produce a globally optimal solution. The evaluation with real-world legal data from a comprehensive legislation corpus shows that while existing alignment algorithms suffer severely from non-monotonicity, this approach can work effectively on both monotonic and non-monotonic data.

3 0.90150237 253 acl-2013-Multilingual Affect Polarity and Valence Prediction in Metaphor-Rich Texts

Author: Zornitsa Kozareva

Abstract: Metaphor is an important way of conveying the affect of people, hence understanding how people use metaphors to convey affect is important for the communication between individuals and increases cohesion if the perceived affect of the concrete example is the same for the two individuals. Therefore, building computational models that can automatically identify the affect in metaphor-rich texts like “The team captain is a rock.”, “Time is money.”, “My lawyer is a shark.” is an important challenging problem, which has been of great interest to the research community. To solve this task, we have collected and manually annotated the affect of metaphor-rich texts for four languages. We present novel algorithms that integrate triggers for cognitive, affective, perceptual and social processes with stylistic and lexical information. By running evaluations on datasets in English, Spanish, Russian and Farsi, we show that the developed affect polarity and valence prediction technology of metaphor-rich texts is portable and works equally well for different languages.

4 0.89495927 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

Author: Oliver Ferschke ; Iryna Gurevych ; Marc Rittberger

Abstract: With the increasing amount of user generated reference texts in the web, automatic quality assessment has become a key challenge. However, only a small amount of annotated data is available for training quality assessment systems. Wikipedia contains a large amount of texts annotated with cleanup templates which identify quality flaws. We show that the distribution of these labels is topically biased, since they cannot be applied freely to any arbitrary article. We argue that it is necessary to consider the topical restrictions of each label in order to avoid a sampling bias that results in a skewed classifier and overly optimistic evaluation results. We factor out the topic bias by extracting reliable training instances from the revision history which have a topic distribution similar to the labeled articles. This approach better reflects the situation a classifier would face in a real-life application.

5 0.85850281 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.

6 0.85460997 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

7 0.85286391 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

8 0.85134852 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

9 0.85070091 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

10 0.85011196 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

11 0.84910566 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

12 0.84880209 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

13 0.84841007 80 acl-2013-Chinese Parsing Exploiting Characters

14 0.84827268 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

15 0.84667373 97 acl-2013-Cross-lingual Projections between Languages from Different Families

16 0.84642357 62 acl-2013-Automatic Term Ambiguity Detection

17 0.84513617 275 acl-2013-Parsing with Compositional Vector Grammars

18 0.84210366 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

19 0.84091592 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

20 0.84025723 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models