acl acl2013 acl2013-156 knowledge-graph by maker-knowledge-mining

156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models


Source: pdf

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. [sent-3, score-0.49]

2 The standard tuning algorithm—MERT—only scales to tens of features. [sent-4, score-0.349]

3 Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. [sent-5, score-0.252]

4 Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. [sent-6, score-0.73]

5 Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. [sent-7, score-0.128]

6 Adaptation of discriminative learning methods for these types of features to statistical machine translation (MT) systems, which have historically used idiosyncratic learning techniques for a few dense features, has been an active research area for the past half-decade. [sent-10, score-0.479]

7 The algorithm scales to millions of features and large tuning sets. [sent-17, score-0.473]

8 It optimizes a logistic objective identical to that of PRO (Hopkins and May, 2011) with stochastic gradient descent, although other objectives are possible. [sent-18, score-0.203]

9 , 2011), which is particularly effective for the mixture of dense and sparse features present in MT models. [sent-20, score-0.421]

10 Finally, feature selection is implemented as efficient L1 regularization in the forward-backward splitting (FOBOS) framework (Duchi and Singer, 2009). [sent-21, score-0.15]

11 Experiments show that our algorithm converges faster than batch alternatives. [sent-22, score-0.138]

12 To learn good weights for the sparse features, most algorithms—including ours—benefit from more tuning data, and the natural source is the training bitext. [sent-23, score-0.384]

13 First, it has a single reference, sometimes of lower quality than the multiple references in tuning sets from MT competitions. [sent-25, score-0.309]

14 Second, large bitexts often comprise many text genres (Haddow and Koehn, 2012), a virtue for classical dense MT models but a curse for high dimensional models: bitext tuning can lead to a significant domain adaptation problem when evaluating on standard test sets. [sent-26, score-0.814]

15 The first experiment uses standard tuning and test sets from the NIST OpenMT competitions. [sent-31, score-0.309]

16 The second experiment uses tuning and test sets sampled from the large bitexts. [sent-32, score-0.309]

17 When we have a large feature set and therefore want to tune on a large data set, batch methods are infeasible. [sent-39, score-0.131]

18 Recall that stochastic gradient descent (SGD), a fundamental online method, updates weights w according to wt = wt−1 − η∇‘t (wt−1) (1) with loss function1 ‘t(w) of the tth example, (sub)gradient of the loss with respect to the parameters ∇‘t(wt−1), and learning rate η. [sent-41, score-0.871]

19 SrGs ∇D is sensitive to the learning rate η, which is difficult to set in an MT system that mixes frequent “dense” features (like the language model) with sparse features (e. [sent-42, score-0.202]

20 We would instead like to take larger steps for sparse features and smaller steps for dense features. [sent-46, score-0.421]

21 1 AdaGrad AdaGrad is a method for setting an adaptive learning rate that comes with good theoretical guarantees. [sent-48, score-0.138]

22 AdaGrad makes the following update: ηΣt1/2∇‘t(wt−1) Σt−1 = Σ−1 + ∇‘t(wt−1)∇‘t(wt−1)> wt = wt−1 − (2) Xt = X∇‘i(wi−1)∇‘i(wi−1)> (3) Xi= X1 A diagonal approximation to Σ can be used for a high-dimensional vector wt. [sent-50, score-0.349]

23 Consider a single dimension j,and let scalars vt = wt,j, gt = ∇j‘t(wt−1), Gt = gi2, then the update ru=le ∇is Pit=1 vt = vt−1 Gt−1/2gt + gt2 − η Gt = Gt−1 ComparedtoSGD,wejustneedtostoreGt = for each dimension j. [sent-52, score-0.218]

24 For example, if Σ is diagonal and xt are indicator features, the confidence term then says that the weight for a rarer feature should have more variance and vice-versa. [sent-73, score-0.234]

25 If we substitute (10) into (8), linearize the loss ‘t(w) as before, and solve, then we have the linearized AROW update wt = wt−1 − ηΣt∇‘t (wt−1) (11) which is also an adaptive update with per-coordinate learning rates specified by Σt (as opposed to in AdaGrad). [sent-76, score-0.661]

26 3 Comparing AdaGrad, MIRA, AROW Compare (3) to (10) and observe that if we set Σ0−1 = 0 and αt = 1, then the only difference between the AROW update (11) and the AdaGrad update (2) is a square root. [sent-78, score-0.158]

27 Second, SGD can be improved in the adaptivity direction using AdaGrad where the decaying stepsize is more robust and the adaptive stepsize allows better weight updates to features differing in sparsity and scale. [sent-81, score-0.44]

28 For MT, adaptivity allows us to deal with mixed dense/sparse features effectively without specific normalization. [sent-83, score-0.175]

29 MIRA/AROW requires selecting the loss function ‘(w) so that wt can be solved in closed-form, by a quadratic program (QP), or in some other way that is better than linearizing. [sent-85, score-0.404]

30 On the other hand, AdaGrad/linearized AROW only requires that the gradient of the loss function can be computed efficiently. [sent-87, score-0.203]

31 10: Update wt = wt−1 − ηΣ1t/2gt 11: Regularize wt 12: Set t = t + 1 13: end for 14: until convergence . [sent-94, score-0.677]

32 Finally, by using AdaGrad, we separate adaptivity from conservativity. [sent-98, score-0.131]

33 Our experiments suggest that adaptivity is actually more important. [sent-99, score-0.131]

34 AdaGrad (lines 9–10) is a crucial piece, but the loss function, regularization technique, and parallelization strategy described in this section are equally important in the MT setting. [sent-101, score-0.252]

35 We cast MT tuning as pairwise ranking (Herbrich et al. [sent-104, score-0.35]

36 The pairwise approach results in simple, convex loss functions suitable for online learning. [sent-106, score-0.214]

37 We use the familiar logistic loss: − ‘t(w) = ‘(Dt,w) = −x+X∈Dtlog1 + e1−w·x+ (12) Choosing the hinge loss instead of the logistic loss results in the 1-class SVM problem. [sent-123, score-0.291]

38 In line 4 we simply partition the tuning set so that ibecomes a mini-batch of examples. [sent-126, score-0.309]

39 2 Updating and Regularization Algorithm 1lines 9–1 1compute the adaptive learning rate, update the weights, and apply regularization. [sent-128, score-0.178]

40 The two-step FOBOS update is = wt−1 − ηt−1∇‘t−1 (wt−1) wt= argwmin21kw − wt−21k22+ ηt−1r(w) wt−21 (13) (14) where (13) is just an unregularized gradient descent step and (14) balances the regularization term r(w) with staying close to the gradient step. [sent-132, score-0.464]

41 For example, simple indicator features like lexicalized re-ordering classes are potentially useful yet bloat the the feature set and, in the worst case, can negatively impact Algorithm 2“Stale gradient” parallelization method for Algorithm 1. [sent-135, score-0.141]

42 1lines 5–8 7: while ∃ p0 done with gradient gt0 do t0 ≤ t 8: Uhiplde∃a te p wt = wt−1 − ηgt0 Alg. [sent-143, score-0.437]

43 Crucially, the weight updates need not be applied in order, so synchronization is unnecessary; the workers only idle at the end of an epoch. [sent-163, score-0.133]

44 The consequence is that the update in line 8 of Algorithm 2 is with respect to gradient gt0 with t0 ≤ t. [sent-164, score-0.197]

45 During a tuning run, the online method decodes the tuning set under many more weight vectors than a MERT-style batch method. [sent-170, score-0.847]

46 The adaptive algorithm identifies appropriate learning rates for the mixture of dense and sparse features. [sent-172, score-0.523]

47 To the dense features we add three high dimensional “sparse” feature sets. [sent-184, score-0.437]

48 2 Tuning Algorithms The primary baseline is the dense feature set tuned with MERT (Och, 2003). [sent-197, score-0.418]

49 First, we tuned with batch PRO using the default settings in Phrasal (L2 regularization with σ=0. [sent-203, score-0.277]

50 We did implement an online version of MIRA, and in small-scale experiments found that the batch variant worked just as well. [sent-206, score-0.179]

51 Moses5 also contains the discriminative phrase table implementation of (Hasler et al. [sent-209, score-0.125]

52 60 1,313 Table 2: Ar-En results [BLEU-4 % uncased] for the NIST tuning experiment. [sent-266, score-0.309]

53 The tuning and test sets each have four references. [sent-267, score-0.309]

54 15 1,597 Table 3: Zh-En results [BLEU-4 % uncased] for the NIST tuning experiment. [sent-310, score-0.309]

55 sampled 15 derivation pairs for each tuning ample and scored them with BLEU+1. [sent-313, score-0.309]

56 3 NIST OpenMT Experiment The first experiment evaluates our algorithm when tuning and testing on standard test sets, each with four references. [sent-315, score-0.356]

57 When we add features, our algorithm tends to overfit to a standard-sized tuning set like MT06. [sent-316, score-0.356]

58 We thus concatenated MT05, MT06, and MT08 to create a larger tuning set. [sent-317, score-0.309]

59 Our algorithm is competitive with MERT in the low dimensional “dense” setting, and compares favorably to PRO with the PT feature set. [sent-319, score-0.138]

60 With the PT feature set alone, our algorithm tuned on MT05/6/8 scores well below the best model, e. [sent-329, score-0.163]

61 24 Table 4: Ar-En results [BLEU-4 % uncased] for the bitext tuning experiment. [sent-343, score-0.42]

62 Somewhat surprisingly our algorithm improves over MERT in the dense setting. [sent-363, score-0.349]

63 When we add the discriminative phrase table, our algorithm improves over kbMIRA, and over batch PRO on two evaluation sets. [sent-364, score-0.263]

64 With all features and the MT05/6/8 tuning set, we improve significantly over all other models. [sent-365, score-0.353]

65 PRO learns a smaller model with the PT+AL+LO feature set which is surprising given that it applies L2 regularization (AdaGrad uses L1). [sent-366, score-0.15]

66 Our algorithm decodes each example with a new weight vector, thus exploring more of the search space for the same tuning set. [sent-368, score-0.406]

67 4 Bitext Tuning Experiment Tables 2 and 3 show that adding tuning examples improves translation quality. [sent-370, score-0.362]

68 Nevertheless, even the larger tuning set is small relative to the bitext from which rules were extracted. [sent-371, score-0.42]

69 (2012) showed significant translation quality gains by tuning on the bitext. [sent-373, score-0.362]

70 Domain mismatch matters for the dense feature set (Haddow and Koehn, 2012). [sent-376, score-0.342]

71 Before aligning each bitext, we randomly sampled and sequestered 5k and 15k sentence tuning sets, and a 5k test set. [sent-378, score-0.309]

72 We then tuned a dense model with MERT on MT06, and feature-rich models on both MT05/6/8 and the bitext tuning set. [sent-386, score-0.798]

73 When tuned on bitext5k the translation quality gains are significant for bitext5k-test relative to tuning on MT05/6/8, which has multiple references. [sent-388, score-0.438]

74 1 Feature Overlap Analysis How many sparse features appear in both the tuning and test sets? [sent-392, score-0.428]

75 In Table 6, A is the set of phrase table features that received a non-zero weight when tuned on dataset DA (same for B). [sent-393, score-0.215]

76 Column DA lists several Zh-En teDst sets used and column DB lists tuning sets. [sent-394, score-0.309]

77 Our experiments showed thaDt tuning on MT06 generalizes better to MT04 than tuning 317 on bitext5k, whereas tuning on bitext5k generalizes better to bitext5k-test than tuning on MT06. [sent-395, score-1.236]

78 Phrase table features in A ∩ B are overwhelmingly short, simple, and correAct ∩ phrases, suggesting L1 regularization is effective for feature selection. [sent-397, score-0.194]

79 It is also important to balance the number of features with how well weights can be learned for those features, as tuning on bitext15k produced higher coverage for MT04 but worse generalization than tuning on MT06. [sent-398, score-0.662]

80 2 Domain Adaptation Analysis To understand the domain adaptation issue we compared the non-zero weights in the discriminative phrase table (PT) for Ar-En models tuned on bitext5k and MT05/6/8. [sent-400, score-0.201]

81 Of course, this discrepancy is consequential for both dense and feature-rich models. [sent-403, score-0.302]

82 However, we observe that the feature-rich models fit the tuning data more closely. [sent-404, score-0.309]

83 Bottom: rule counts in the discriminative phrase table (PT) for models tuned on the two tuning sets. [sent-483, score-0.51]

84 (1) ref: lebanese prime minister , fuad siniora , announced a. [sent-513, score-0.264]

85 the lebanese prime minister fouad siniora announced (2) ref: the newspaper and television reported a. [sent-515, score-0.302]

86 television and newspaper said In (1) the dense model (1a) drops the verb while the feature-rich model correctly re-orders and inserts it after the subject (1b). [sent-517, score-0.377]

87 The coordinated subject in (2) becomes an embedded subject in the dense output (2a). [sent-518, score-0.302]

88 The core of our method is an inner product between the adaptive learning rate vector and the gradient. [sent-532, score-0.138]

89 (2012b), who tuned models containing both sparse and dense features with Moses. [sent-537, score-0.497]

90 A discriminative phrase table helped them improve slightly over a dense, online MIRA baseline, but their best results required initialization with MERT-tuned weights and re-tuning a single, shared weight for the discriminative phrase table with MERT. [sent-538, score-0.388]

91 , 2010), which we found to be slower than the stale gradient method in section 3. [sent-548, score-0.192]

92 Many other discriminative techniques have been proposed based on: ramp loss (Gimpel, 2012); hinge loss (Cherry and Foster, 2012; Haddow et al. [sent-555, score-0.297]

93 7 Conclusion and Outlook We introduced a new online method for tuning feature-rich translation models. [sent-560, score-0.45]

94 The method is faster per epoch than MERT, scales to millions of features, and converges quickly. [sent-561, score-0.13]

95 We used efficient L1 regularization for feature selection, obviating the need for the feature scaling and heuristic filtering common in prior work. [sent-562, score-0.227]

96 Even basic discriminative features were effective, so we believe that our work enables fresh approaches to more sophisticated MT feature engineering. [sent-564, score-0.164]

97 Hope and fear for discriminative training of statistical translation models. [sent-630, score-0.133]

98 Efficient online and batch learning using forward backward splitting. [sent-652, score-0.179]

99 Adaptive subgradient methods for online learning and stochastic optimization. [sent-659, score-0.136]

100 Joint feature selection in distributed stochastic learning for largescale discriminative training in SMT. [sent-874, score-0.168]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wt', 0.319), ('tuning', 0.309), ('dense', 0.302), ('adagrad', 0.27), ('pt', 0.268), ('arow', 0.197), ('mert', 0.145), ('pro', 0.141), ('mira', 0.141), ('adaptivity', 0.131), ('sgd', 0.121), ('gradient', 0.118), ('xt', 0.114), ('bitext', 0.111), ('regularization', 0.11), ('lo', 0.106), ('haddow', 0.1), ('adaptive', 0.099), ('batch', 0.091), ('online', 0.088), ('loss', 0.085), ('hasler', 0.082), ('discriminative', 0.08), ('update', 0.079), ('fobos', 0.076), ('tuned', 0.076), ('mt', 0.075), ('sparse', 0.075), ('conservativity', 0.074), ('stale', 0.074), ('cer', 0.074), ('moses', 0.074), ('al', 0.072), ('gt', 0.067), ('chiang', 0.066), ('alia', 0.065), ('duchi', 0.062), ('cherry', 0.061), ('uncased', 0.061), ('phrasal', 0.059), ('simianer', 0.057), ('parallelization', 0.057), ('epoch', 0.057), ('langford', 0.056), ('openmt', 0.056), ('siniora', 0.056), ('gimpel', 0.054), ('nist', 0.053), ('translation', 0.053), ('inter', 0.052), ('dimensional', 0.051), ('updates', 0.05), ('weight', 0.05), ('arun', 0.049), ('lebanese', 0.049), ('minister', 0.049), ('stochastic', 0.048), ('jmlr', 0.047), ('hinge', 0.047), ('algorithm', 0.047), ('phrase', 0.045), ('television', 0.045), ('och', 0.045), ('liang', 0.045), ('dj', 0.044), ('features', 0.044), ('arabic', 0.043), ('announced', 0.043), ('foster', 0.043), ('programme', 0.041), ('pairwise', 0.041), ('bitexts', 0.041), ('ittycheriah', 0.041), ('epochs', 0.041), ('scales', 0.04), ('feature', 0.04), ('convergence', 0.039), ('rate', 0.039), ('descent', 0.039), ('fuad', 0.037), ('obviating', 0.037), ('parallelizes', 0.037), ('logistic', 0.037), ('crammer', 0.037), ('bottou', 0.036), ('vt', 0.036), ('watanabe', 0.033), ('updating', 0.033), ('hopkins', 0.033), ('idle', 0.033), ('featurerich', 0.033), ('stepsize', 0.033), ('millions', 0.033), ('ensuring', 0.032), ('koehn', 0.031), ('newspaper', 0.03), ('riezler', 0.03), ('prime', 0.03), ('diagonal', 0.03), ('galley', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

2 0.23477198 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread ofthe projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and ArabicEnglish translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant im- provements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

3 0.20453466 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin

Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.

4 0.19910532 24 acl-2013-A Tale about PRO and Monsters

Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel

Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved testtime BLEU scores. Thus, we recommend them to anybody using PRO, monsterbeliever or not. – 1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO’s optimization objective, namely sentencelevel BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO’s surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6. 2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j0) ⇔ hw (i, j) > hw (i, j0), where g is a local scoring fu hnction (typically, sentencelevel BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j0 that were obtained using w. If g(i, j) > g(i, j0), we will refer to j and j0 as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only. 12 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 12–17, Avg. Lengths Avg. BLEU+1 iter. pos neg ref. pos neg 1 45.2 44.6 46.5 52.5 37.6 2 3 4 5 ... 25 46.4 46.4 46.4 46.3 ... 47.9 70.5 261.0 250.0 248.0 ... 229.0 53.2 53.4 53.0 53.0 ... 52.5 52.8 52.4 52.0 52.1 ... 52.2 14.5 2.19 2.30 2.34 ... 2.81 Table 1: PRO iterations, tuning on long sentences. Table 1shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO’s parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters. iteration Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis). Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example. Optimizer Objective BLEU PROsent-BLEU+144.57 MERT corpus-BLEU 47.53 MIRA pseudo-doc-BLEU 47.80 PRO (6= objective)pseudo-doc-BLEU21.35 PMRIORA (6= =(6= o bojbejcetcivteiv)e) sent-BLEU+1 47.59 PMRIRO,A PC (6=-sm obojoectthiv,e g)roundfixed sent-BLEU+145.71 Table 2: PRO vs. MERT vs. MIRA. We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO’s,1 we further experimented with plugging MIRA’s objective into PRO and PRO’s objective into MIRA. The resulting MIRA scores were not much different from before, while PRO’s score dropped even further; we also found mon- sters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT1 1 for Spanish-English; this yielded a drop in BLEU from 29.63 (1M2/emEs/RprTes)la to 27.12 n(inPg/RtmOp.)1.1 MERT2 **REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - **IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in **IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and **IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure Ftroai ngtkhsu et rahefef 2he : Ea t h xte ha,e motfo pstohlmeee r leafst eorfe wne c, et etr laonngs olfa t hei opnar a ofn sdo hhee oy fpwhaoitst hh a]r ee usisn i ostu tofra tnhes ilna tbiakoern s, haef ctoeokr it hee roant ainod nthse 1 t,h 3we aknonw d, 4. T@-@h eAl l tahes ft trwce o, tho ypotheses are monsters. 1See (Cherry and Foster, 2012) for details on objectives. 2Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters. 13 3 Slaying Monsters: Theory Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO’s selection and acceptance steps. 3.1 What is PRO? PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps: 1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations. 2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points). 3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores. 4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair. We believe that monsters are nurtured by PRO’s selection and acceptance policies. PRO’s selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut-off ones that differ too much based on BLEU+1 or length. PRO’s acceptance step selects Ξ = 50 pairs with the highest BLEU+1 differentials, which creates breeding ground for monsters since these pairs are very likely to include one monster and one good hypothesis. Below we discuss monster slaying geared towards the selection and acceptance steps of PRO. 3.2 Slaying at Selection In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic. Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair. Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm. Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring functions, i.e., the BLEU+1 scores, are distributed nor- mally: g(i, j) ∼ N(µ, σ2). Given a sample of hypothesis (tira,nj)sl ∼atio Nn(sµ {j} of the same source sentpeontchee i, we can ensstim {ja}te o σ empirically. Then, the difference ∆ = g(i, j) − g(i, j0) would be tdhisetr diibfufteerde normally w gi(thi, mean zero and variance 2σ2. Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ2). 3.3 Slaying at Acceptance Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs, among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough. Ξ 14 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) PRO fix Avg. for 3 reruns BLEU StdDev Pos Lengths Neg Ref BLEU+1 Avg. for 3 reruns Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Max diff. cut-offBLEU+1 max=10†47.940.16547.949.649.449.439.947.770.035 BLEU+1 max=20 † 47.73 0.136 47.7 55.5 51.1 49.8 32.7 47.85 0.049 LEN max=5 † 48.09 0.021 46.8 47.0 47.9 52.9 37.8 47.73 0.051 LEN max=10 † 47.99 0.025 47.3 48.5 48.7 52.5 35.6 47.80 0.056 OutliersBLEU+1 λ=2.0†48.050.11946.847.247.752.239.547.470.090 BLEU+1 λ=3.0 LEN λ=2.0 LEN λ=3.0 47.12 46.68 47.02 1.348 2.005 0.727 47.6 49.3 48.2 168.0 82.7 163.0 53.0 53.1 51.4 51.7 52.3 51.4 3.9 5.3 4.2 47.53 47.49 47.65 0.038 0.085 0.096 Stoch. sampl.∆ BLEU+146.331.00046.8216.053.353.12.447.740.035 ∆ LEN 46.36 1.281 47.4 201.0 52.9 53.4 2.9 47.78 0.081 Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. 4 Attacking Monsters: Practice Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance. 4.1 Experimental Setup We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest-50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reorder- ing (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST’s scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the –return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores. 4.2 Selection Alternatives Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO’s tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a nonmonster prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples (in bold). The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO’s stability. On the full dataset, BLEU scores are about the same as for the original PRO (with small improvement for BLEU+1 max=20), but the standard deviations are slightly better. Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2, yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability. 15 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) Avg. for 3 reruns Lengths BLEU+1 Avg. for 3 reruns PRO fix BLEU StdDev Pos Neg Ref Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Rand. acceptPRO, rand††47.870.14747.748.548.7047.742.947.590.114 OutliersBLEU+1 λ=2.0, rand∗47.850.07848.248.448.947.543.647.620.091 BLEU+1 λ=3.0, rand 47.97 0.168 47.6 47.6 48.4 47.8 43.6 47.44 0.070 LEN λ=2.0, rand∗ 47.69 0.114 47.8 47.8 48.6 47.9 43.6 47.48 0.046 LEN λ=3.0, rand 47.89 0.235 47.8 48.0 48.7 47.7 43. 1 47.64 0.090 Stoch. sampl.∆ BLEU+1, rand∗47.990.08747.948.048.747.843.547.670.096 ∆ LEN, rand∗ 47.94 0.060 47.8 47.9 48.6 47.8 43.6 47.65 0.097 Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance. 4.3 Random Acceptance Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, nonmonster prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting. 5 Related Work We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fix the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011). With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), MIRA– PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters. 6 Tale’s Moral and Future Battles We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO’s length bias. 16 References Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proc. of Workshop on Statistical Machine Translation, WMT ’08, pages 26–34. Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. MT Summit XIII: the Machine Translation Summit, pages 32–39. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 427–436. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 224–233. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine transla- tion. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 218–226. Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 11, pages 176–181 . Michael Denkowski and Alon Lavie. 2011. Meteortuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report, CMU-LTI-1 1-01 1, Language Technologies Institute, Carnegie Mellon University. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’09, pages 242–249. Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 221–231. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT ’ 11, pages 187–197. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 11, pages 1352–1362. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLTNAACL ’03, pages 48–54. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT ’05. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the Meeting of the Association for Computational Linguistics, ACL ’07, pages 177–180. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL ’04, pages 169–176. Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23: 105–1 15. Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING ’08, pages 585–592. Preslav Nakov, Francisco Guzm a´n, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings ofthe International Conference on Computational Linguistics, COLING ’ 12, pages 1979–1994. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’03, pages 160–167. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’08, pages 117–120. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 12, pages 11–21. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, pages 764–773. 17

5 0.19807163 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.

6 0.15542778 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

7 0.12204298 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

8 0.12138887 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

9 0.11928929 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

10 0.11550828 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

11 0.10838338 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

12 0.10146451 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

13 0.093565591 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

14 0.092154786 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

15 0.092098467 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

16 0.078978568 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

17 0.074839957 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

18 0.074018538 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

19 0.072379038 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

20 0.071767136 314 acl-2013-Semantic Roles for String to Tree Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, -0.132), (2, 0.103), (3, 0.083), (4, -0.015), (5, 0.027), (6, 0.065), (7, -0.023), (8, -0.02), (9, 0.062), (10, -0.038), (11, 0.065), (12, -0.049), (13, -0.064), (14, 0.004), (15, 0.093), (16, -0.102), (17, 0.088), (18, 0.079), (19, -0.013), (20, 0.166), (21, 0.109), (22, -0.009), (23, 0.03), (24, 0.077), (25, 0.007), (26, 0.072), (27, 0.003), (28, -0.059), (29, -0.05), (30, 0.209), (31, 0.029), (32, 0.047), (33, -0.032), (34, 0.065), (35, -0.09), (36, 0.013), (37, 0.117), (38, 0.057), (39, -0.042), (40, 0.059), (41, -0.023), (42, 0.008), (43, 0.012), (44, -0.06), (45, -0.01), (46, 0.126), (47, -0.028), (48, 0.104), (49, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94762713 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

2 0.92563945 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread ofthe projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and ArabicEnglish translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant im- provements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

3 0.91038024 24 acl-2013-A Tale about PRO and Monsters

Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel

Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved testtime BLEU scores. Thus, we recommend them to anybody using PRO, monsterbeliever or not. – 1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO’s optimization objective, namely sentencelevel BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO’s surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6. 2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j0) ⇔ hw (i, j) > hw (i, j0), where g is a local scoring fu hnction (typically, sentencelevel BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j0 that were obtained using w. If g(i, j) > g(i, j0), we will refer to j and j0 as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only. 12 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 12–17, Avg. Lengths Avg. BLEU+1 iter. pos neg ref. pos neg 1 45.2 44.6 46.5 52.5 37.6 2 3 4 5 ... 25 46.4 46.4 46.4 46.3 ... 47.9 70.5 261.0 250.0 248.0 ... 229.0 53.2 53.4 53.0 53.0 ... 52.5 52.8 52.4 52.0 52.1 ... 52.2 14.5 2.19 2.30 2.34 ... 2.81 Table 1: PRO iterations, tuning on long sentences. Table 1shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO’s parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters. iteration Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis). Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example. Optimizer Objective BLEU PROsent-BLEU+144.57 MERT corpus-BLEU 47.53 MIRA pseudo-doc-BLEU 47.80 PRO (6= objective)pseudo-doc-BLEU21.35 PMRIORA (6= =(6= o bojbejcetcivteiv)e) sent-BLEU+1 47.59 PMRIRO,A PC (6=-sm obojoectthiv,e g)roundfixed sent-BLEU+145.71 Table 2: PRO vs. MERT vs. MIRA. We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO’s,1 we further experimented with plugging MIRA’s objective into PRO and PRO’s objective into MIRA. The resulting MIRA scores were not much different from before, while PRO’s score dropped even further; we also found mon- sters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT1 1 for Spanish-English; this yielded a drop in BLEU from 29.63 (1M2/emEs/RprTes)la to 27.12 n(inPg/RtmOp.)1.1 MERT2 **REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - **IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in **IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and **IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure Ftroai ngtkhsu et rahefef 2he : Ea t h xte ha,e motfo pstohlmeee r leafst eorfe wne c, et etr laonngs olfa t hei opnar a ofn sdo hhee oy fpwhaoitst hh a]r ee usisn i ostu tofra tnhes ilna tbiakoern s, haef ctoeokr it hee roant ainod nthse 1 t,h 3we aknonw d, 4. T@-@h eAl l tahes ft trwce o, tho ypotheses are monsters. 1See (Cherry and Foster, 2012) for details on objectives. 2Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters. 13 3 Slaying Monsters: Theory Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO’s selection and acceptance steps. 3.1 What is PRO? PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps: 1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations. 2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points). 3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores. 4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair. We believe that monsters are nurtured by PRO’s selection and acceptance policies. PRO’s selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut-off ones that differ too much based on BLEU+1 or length. PRO’s acceptance step selects Ξ = 50 pairs with the highest BLEU+1 differentials, which creates breeding ground for monsters since these pairs are very likely to include one monster and one good hypothesis. Below we discuss monster slaying geared towards the selection and acceptance steps of PRO. 3.2 Slaying at Selection In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic. Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair. Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm. Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring functions, i.e., the BLEU+1 scores, are distributed nor- mally: g(i, j) ∼ N(µ, σ2). Given a sample of hypothesis (tira,nj)sl ∼atio Nn(sµ {j} of the same source sentpeontchee i, we can ensstim {ja}te o σ empirically. Then, the difference ∆ = g(i, j) − g(i, j0) would be tdhisetr diibfufteerde normally w gi(thi, mean zero and variance 2σ2. Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ2). 3.3 Slaying at Acceptance Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs, among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough. Ξ 14 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) PRO fix Avg. for 3 reruns BLEU StdDev Pos Lengths Neg Ref BLEU+1 Avg. for 3 reruns Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Max diff. cut-offBLEU+1 max=10†47.940.16547.949.649.449.439.947.770.035 BLEU+1 max=20 † 47.73 0.136 47.7 55.5 51.1 49.8 32.7 47.85 0.049 LEN max=5 † 48.09 0.021 46.8 47.0 47.9 52.9 37.8 47.73 0.051 LEN max=10 † 47.99 0.025 47.3 48.5 48.7 52.5 35.6 47.80 0.056 OutliersBLEU+1 λ=2.0†48.050.11946.847.247.752.239.547.470.090 BLEU+1 λ=3.0 LEN λ=2.0 LEN λ=3.0 47.12 46.68 47.02 1.348 2.005 0.727 47.6 49.3 48.2 168.0 82.7 163.0 53.0 53.1 51.4 51.7 52.3 51.4 3.9 5.3 4.2 47.53 47.49 47.65 0.038 0.085 0.096 Stoch. sampl.∆ BLEU+146.331.00046.8216.053.353.12.447.740.035 ∆ LEN 46.36 1.281 47.4 201.0 52.9 53.4 2.9 47.78 0.081 Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. 4 Attacking Monsters: Practice Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance. 4.1 Experimental Setup We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest-50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reorder- ing (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST’s scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the –return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores. 4.2 Selection Alternatives Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO’s tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a nonmonster prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples (in bold). The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO’s stability. On the full dataset, BLEU scores are about the same as for the original PRO (with small improvement for BLEU+1 max=20), but the standard deviations are slightly better. Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2, yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability. 15 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) Avg. for 3 reruns Lengths BLEU+1 Avg. for 3 reruns PRO fix BLEU StdDev Pos Neg Ref Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Rand. acceptPRO, rand††47.870.14747.748.548.7047.742.947.590.114 OutliersBLEU+1 λ=2.0, rand∗47.850.07848.248.448.947.543.647.620.091 BLEU+1 λ=3.0, rand 47.97 0.168 47.6 47.6 48.4 47.8 43.6 47.44 0.070 LEN λ=2.0, rand∗ 47.69 0.114 47.8 47.8 48.6 47.9 43.6 47.48 0.046 LEN λ=3.0, rand 47.89 0.235 47.8 48.0 48.7 47.7 43. 1 47.64 0.090 Stoch. sampl.∆ BLEU+1, rand∗47.990.08747.948.048.747.843.547.670.096 ∆ LEN, rand∗ 47.94 0.060 47.8 47.9 48.6 47.8 43.6 47.65 0.097 Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance. 4.3 Random Acceptance Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, nonmonster prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting. 5 Related Work We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fix the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011). With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), MIRA– PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters. 6 Tale’s Moral and Future Battles We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO’s length bias. 16 References Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proc. of Workshop on Statistical Machine Translation, WMT ’08, pages 26–34. Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. MT Summit XIII: the Machine Translation Summit, pages 32–39. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 427–436. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 224–233. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine transla- tion. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 218–226. Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 11, pages 176–181 . Michael Denkowski and Alon Lavie. 2011. Meteortuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report, CMU-LTI-1 1-01 1, Language Technologies Institute, Carnegie Mellon University. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’09, pages 242–249. Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 221–231. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT ’ 11, pages 187–197. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 11, pages 1352–1362. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLTNAACL ’03, pages 48–54. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT ’05. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the Meeting of the Association for Computational Linguistics, ACL ’07, pages 177–180. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL ’04, pages 169–176. Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23: 105–1 15. Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING ’08, pages 585–592. Preslav Nakov, Francisco Guzm a´n, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings ofthe International Conference on Computational Linguistics, COLING ’ 12, pages 1979–1994. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’03, pages 160–167. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’08, pages 117–120. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 12, pages 11–21. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, pages 764–773. 17

4 0.8678627 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin

Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.

5 0.78620529 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.

6 0.65650958 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

7 0.65288693 328 acl-2013-Stacking for Statistical Machine Translation

8 0.61865669 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

9 0.61086071 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

10 0.60056382 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

11 0.5730722 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

12 0.55239767 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

13 0.48429814 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

14 0.48359978 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

15 0.48289764 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

16 0.45289004 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

17 0.43427747 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

18 0.43373463 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

19 0.42977437 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

20 0.42096639 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.058), (6, 0.048), (11, 0.227), (19, 0.072), (24, 0.031), (26, 0.072), (28, 0.011), (35, 0.047), (40, 0.011), (42, 0.069), (48, 0.048), (70, 0.05), (88, 0.021), (90, 0.035), (95, 0.096)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94239277 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy

Author: Francesco Sartorio ; Giorgio Satta ; Joakim Nivre

Abstract: We present a novel transition-based, greedy dependency parser which implements a flexible mix of bottom-up and top-down strategies. The new strategy allows the parser to postpone difficult decisions until the relevant information becomes available. The novel parser has a ∼12% error reduction in unlabeled attach∼ment score over an arc-eager parser, with a slow-down factor of 2.8.

same-paper 2 0.93657267 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

3 0.93399459 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

Author: Ruey-Cheng Chen

Abstract: We study the mathematical properties of a recently proposed MDL-based unsupervised word segmentation algorithm, called regularized compression. Our analysis shows that its objective function can be efficiently approximated using the negative empirical pointwise mutual information. The proposed extension improves the baseline performance in both efficiency and accuracy on a standard benchmark.

4 0.93363279 61 acl-2013-Automatic Interpretation of the English Possessive

Author: Stephen Tratz ; Eduard Hovy

Abstract: The English ’s possessive construction occurs frequently in text and can encode several different semantic relations; however, it has received limited attention from the computational linguistics community. This paper describes the creation of a semantic relation inventory covering the use of ’s, an inter-annotator agreement study to calculate how well humans can agree on the relations, a large collection of possessives annotated according to the relations, and an accurate automatic annotation system for labeling new examples. Our 21,938 example dataset is by far the largest annotated possessives dataset we are aware of, and both our automatic classification system, which achieves 87.4% accuracy in our classification experiment, and our annotation data are publicly available.

5 0.93289804 245 acl-2013-Modeling Human Inference Process for Textual Entailment Recognition

Author: Hen-Hsen Huang ; Kai-Chun Chang ; Hsin-Hsi Chen

Abstract: This paper aims at understanding what human think in textual entailment (TE) recognition process and modeling their thinking process to deal with this problem. We first analyze a labeled RTE-5 test set and find that the negative entailment phenomena are very effective features for TE recognition. Then, a method is proposed to extract this kind of phenomena from text-hypothesis pairs automatically. We evaluate the performance of using the negative entailment phenomena on both the English RTE-5 dataset and Chinese NTCIR-9 RITE dataset, and conclude the same findings.

6 0.92237091 286 acl-2013-Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words

7 0.91986132 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora

8 0.91323298 376 acl-2013-Using Lexical Expansion to Learn Inference Rules from Sparse Data

9 0.90712714 170 acl-2013-GlossBoot: Bootstrapping Multilingual Domain Glossaries from the Web

10 0.90660042 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

11 0.90546274 75 acl-2013-Building Japanese Textual Entailment Specialized Data Sets for Inference of Basic Sentence Relations

12 0.86247778 242 acl-2013-Mining Equivalent Relations from Linked Data

13 0.86114264 154 acl-2013-Extracting bilingual terminologies from comparable corpora

14 0.85791349 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

15 0.84767509 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing

16 0.84252 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

17 0.83687955 333 acl-2013-Summarization Through Submodularity and Dispersion

18 0.83222347 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers

19 0.82589483 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

20 0.81746411 202 acl-2013-Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web