acl acl2013 acl2013-221 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kristina Toutanova ; Byung-Gyu Ahn
Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract In this paper we show how to automatically induce non-linear features for machine translation. [sent-2, score-0.232]
2 The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. [sent-3, score-0.439]
3 We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. [sent-4, score-1.132]
4 Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks. [sent-5, score-0.392]
5 However, a significant feature engineering effort is still required from practitioners. [sent-11, score-0.17]
6 When a linear model does not fit well, researchers are careful to manually add important feature conjunctions, as for example, (Daum ´e III and Jagarlamudi, 2011; Clark et al. [sent-12, score-0.24]
7 In the related field of web search ranking, automatically learned non-linear features have brought dramatic improvements in quality (Burges et al. [sent-14, score-0.165]
8 In this paper we develop a framework for inducing non-linear features in the form of regression decision trees, which decompose locally and can be integrated efficiently in decoding. [sent-22, score-0.622]
9 The regression trees encode nonlinear feature combinations of the original features. [sent-23, score-0.461]
10 We build on the work by Friedman (2000) which shows how to induce features to minimize any differentiable loss function. [sent-24, score-0.579]
11 In our application the features are regression decision trees, and the loss function is the pairwise ranking log-loss from the PRO method for parameter tuning (Hopkins and May, 2011). [sent-25, score-0.87]
12 Additionally, we show how to design the learning process such that the induced features are local on phrase-pairs and their language model and reordering context, and thus can be incorporated in decoding efficiently. [sent-26, score-0.373]
13 Our results using re-ranking on two language pairs show that the feature induction approach can bring small gains in performance. [sent-27, score-0.32]
14 Further improvements in the original feature set and the induction algorithm, as well as full integration in decoding are needed to potentially result in substantial performance improvements. [sent-30, score-0.311]
15 2 Feature learning using gradient boosting machines In the linear model for machine translation, scores of translation hypotheses the are weighted sums of a set of input features over the hypotheses. [sent-31, score-1.114]
16 Local and global features for the translation hypothesis are shown. [sent-35, score-0.305]
17 , λL, the hypothesis scores are defined as: F(h) = ∑l=1. [sent-43, score-0.218]
18 In current state-of-the-art mode∑ls, the features fl(h) decompose locally on phrase-pairs (with language model and reordering context) inside the hypotheses. [sent-47, score-0.382]
19 This enables hypothesis recombination during machine translation decoding, leading to faster and more accurate search. [sent-48, score-0.203]
20 Two phrase-pairs are used to compose the translation, and each phrase-pair has a set of local feature function values. [sent-50, score-0.294]
21 We can see that the hypothesis-level (global) feature values are sums of phrase-level (local) feature values. [sent-52, score-0.403]
22 The score of a translation given feature weights λ can be computed either by scoring the phrase-pairs and adding the scores, or by scoring the complete hypothesis by computing its global feature values. [sent-53, score-0.611]
23 The local feature values do look at some limited context outside of a phrase-pair, to compute language model scores and re-ordering scores; therefore we say that the features are defined on phrase-pairs in context. [sent-54, score-0.491]
24 We start with such a state-of-the-art linear model with decomposable features and show how we can automatically induce additional features. [sent-55, score-0.38]
25 The new features are also locally decomposable, so that the scores of hypotheses can be computed as sums of phrase-level scores. [sent-56, score-0.487]
26 The new local phrase-level features are non-linear combinations of the original phrase-level features. [sent-57, score-0.247]
27 Figure 2: Example of two decision tree features. [sent-58, score-0.245]
28 The left decision tree has linear nodes and the right decision tree has constant nodes. [sent-59, score-0.682]
29 1 Form of induced features We will use the example in Figure 1 to introduce × the form of the new features we induce and to give an intuition of why such features might be useful. [sent-61, score-0.518]
30 The new features are expressed by regression decision trees; Figure 2 shows two examples. [sent-62, score-0.358]
31 The first regression tree feature h1 in Figure 2 captures this intuition. [sent-64, score-0.419]
32 The feature value for a phrase-pair of this feature is computed as follows: if f2 ≤ 2, then h1(f0, f1, f2, f3) = 2 f1; otherwise, h1(f0, f1, f2, f3) = f1. [sent-65, score-0.34]
33 T =he 2 2e ×ffec ft of this new feature h1 is to boost the importance of the lexical weighting score for phrase-pairs of low joint count. [sent-66, score-0.24]
34 More generally, the regression tree features we consider have either linear or constant leaf nodes, and have up to 8 leaves. [sent-67, score-0.666]
35 Deeper trees can capture more complex conditions on several input feature values. [sent-68, score-0.316]
36 Each non-leaf node performs a comparison of some input feature value to a threshold and each leaf node (for linear nodes) returns the value of some input feature multiplied by some factor. [sent-69, score-0.823]
37 For a given regression tree with linear nodes, all leaf nodes are expressions of the same input feature but have different coefficients for it (for example, both leaf nodes of h1 return affine functions of the input feature f1). [sent-70, score-1.408]
38 A decision tree feature with constant-valued leaf nodes is illustrated by the right-hand-side tree in Figure 2. [sent-71, score-0.825]
39 For these decision trees, the leaf nodes contain a constant, which is specific to each leaf. [sent-72, score-0.417]
40 These kinds oftrees can effectively perform conjunctions of several binary-valued input feature functions; or they can achieve binning ofreal-values features together with conjunctions over binned values. [sent-73, score-0.453]
41 We apply the framework of gradient boosting for decision tree weak learners (Friedman, 2000). [sent-75, score-0.845]
42 To define the framework, we need to introduce the original input features, the differentiable loss function, and the details of the tree growing algorithm. [sent-76, score-0.489]
43 2 Initial features Our baseline MT system uses relative frequency and lexical weighting channel model weights, one or more language models, distortion penalty, word count, phrase count, and multiple lexicalized reordering weights, one for each distortion type. [sent-79, score-0.591]
44 We have around 15 features in this base feature set. [sent-80, score-0.306]
45 We further expand the input set of features to increase the possibility that useful feature combinations could be found by our feature induction method. [sent-81, score-0.658]
46 This is the feature set we use as a basis for weak learner induction. [sent-83, score-0.28]
47 3 Loss function We use a pair-wise ranking log-loss as in the PRO parameter tuning method (Hopkins and May, 2011). [sent-85, score-0.299]
48 The loss is defined by comparing the model scores of pairs of hypotheses hi and hj where the BLEU score of the first hypothesis is greater than the BLEU score of the second hypothesis by a specified threshold. [sent-86, score-0.707]
49 The loss-function Ψ is defined in terms of the hypothesis model scores 1 s1, s2, sN. [sent-94, score-0.218]
50 =R −[∂∂ΨF(F(x(rx)))]F(x)=Fm−1(x), r 4: = argminα,β ∑rR=1[yr βh(xi; α)]2 αm − 5: ρm = arg minρ Ψ(∑Fm−1 (x) + ρh(x; αm) 6: Fm(x) = Fm−1 (x) ρmh(x; αm) 7: end for + Figure 3: A gradient boosting algorithm for local feature functions. [sent-103, score-0.713]
51 The idea of the gradient boosting method is to induce additional features by computing a functional gradient of the target loss function and iteratively selecting the next weak learner (feature) that is most parallel to the negative gradient. [sent-111, score-1.383]
52 Since we want to induce features such that the hypothesis scores decompose locally, we need to formulate our loss function as a function of local phrase-pair in context scores. [sent-112, score-1.027]
53 Having the model scores decompose locally means that the sco∑res of hypothe- ses F(h) decompose as F(h) = ∑pr∈h F(pr)), where by pr ∈ h we denote the en∑ume∈rahtion over phrase pairs in∈ context ethnoatt are parts ∑mofe hra. [sent-113, score-0.744]
54 iIfo xr dvee-r notes the input feature vector for a phrase-pair in context pr, the score of this phrase-pair can be expressed as F(xr). [sent-114, score-0.556]
55 We are now ready to introduce the gradient boosting algorithm, summarized in Figure 3. [sent-116, score-0.48]
56 In the first step of the algorithm, we start by setting the phrase-pair in context scoring function F0(x) as a linear function of the input feature values, by selecting the feature weights λ to minimize the PRO loss Ψ(F0 (x)) as a function of λ. [sent-117, score-1.081]
57 This is equivalent to using the ∑(Hopkins and May, 2011) method of parameter tuning for a fixed input feature set and a linear model. [sent-122, score-0.512]
58 Then we iterate and induce a new decision tree weak learner h(x; αm) like the examples in Figure 2 at each iteration. [sent-124, score-0.485]
59 The parameter vectors αm encode the topology and parame- ters of the decision trees, including which feature value is tested at each node, what the comparison cutoffs are, and the way to compute the values at the leaf nodes. [sent-125, score-0.549]
60 After a new decision tree 408 FLCahin s- gE unange9T2r. [sent-126, score-0.245]
61 s7t598361 Table 2: Results for the two language pairs using different weight tuning methods and feature sets. [sent-133, score-0.366]
62 h(x; αm) is induced, it is treated as new feature and a linear coefficient ρm for that feature is set by minimizing the loss as a function of this parameter (Line 5). [sent-134, score-0.73]
63 The new model scores are set as the old model scores plus a weighted contribution from the new feature (Line 6). [sent-135, score-0.362]
64 At the end of learning, we have a linear model over the input features and additional decision tree features. [sent-136, score-0.484]
65 This is done by first computing the functional gradient of the loss with respect to the phrase scores F(xr) at the point of the current model scores Fm−1 (xr). [sent-145, score-0.726]
66 We then induce a regression tree using mean-squareerror minimization, setting the direction given by the negative gradient as a target to be predicted using the features of each phrase-pair in context instance. [sent-147, score-0.826]
67 When we tune weights to minimize the loss, such as the weights λ of the initial features, or the weights ρm of induced learners, we also include an L2 penalty on the parameters, to prevent overfitting. [sent-151, score-0.39]
68 , 2002) with strict brevity penalty, where a long translation for one sentence can not be used to counteract the brevity penalty for another sentence with a short translation. [sent-163, score-0.225]
69 In our experiments with different feature sets and hyperparameters we observed more stable results and better correlation of Dev-Train, Dev-Select, and Test results using BLEU-SBP. [sent-166, score-0.222]
70 For our experiments, we first trained weights for the base feature sets described in Section 2. [sent-167, score-0.272]
71 All results in Table 2 report performance of re-ranking on these 500-best lists using different feature sets and parameter tuning methods. [sent-170, score-0.375]
72 The baseline (base feature set) performance using MERT and PRO tuning on the two language pairs is shown on the first two lines. [sent-171, score-0.366]
73 In line with prior work, PRO tuning achieves a bit lower scores on the tuning set but higher scores on the test set, compared to MERT. [sent-172, score-0.545]
74 The large feature set additionally contains over 170 manually specified features, described in Section 2. [sent-173, score-0.17]
75 It was infeasible to run MERT training on this feature set. [sent-175, score-0.17]
76 The test set results using PRO tuning for the large set are about a quarter of a BLEU-SBP point higher than the results using the base feature set on both language pairs. [sent-176, score-0.363]
77 Finally, the last two rows show the performance of the gradient boosting method. [sent-177, score-0.48]
78 In 409 addition to learning locally decomposable features boost-local, we also implemented boost-global, where we are learning combinations of the global feature values and lose decomposability. [sent-178, score-0.541]
79 The gradient boosting results are mixed; for Finnish-English, we see around . [sent-181, score-0.48]
80 2 gain of the boost-local model over the large feature set. [sent-182, score-0.17]
81 We did not see a large difference in performance among models using different decision tree leaf node types and different maximum numbers of leaf nodes. [sent-184, score-0.695]
82 The selected boost-local model for FIN-ENU used trees with maximum of 2 leaf nodes and linear leaf values; 25 new features were induced before performance started to degrade on the Dev-Select set. [sent-185, score-0.831]
83 The induced features for Finnish included combinations of language model and channel model scores, combinations of word count and channel model scores, and combinations of channel and lexicalized reordering scores. [sent-186, score-1.111]
84 For example, one feature increases the contribution of the relative frequency channel score for phrases with many target words, and decreases the channel model contribution for shorter phrases. [sent-187, score-0.534]
85 The best boost-local model for Chs-Enu used trees with a maximum of 2 constant-values leaf nodes, and induced 24 new tree features. [sent-188, score-0.487]
86 The features effectively promoted and demoted phrasepairs in context based on whether an input feature’s value was smaller than a determined cutoff. [sent-189, score-0.262]
87 In conclusion, we proposed a new method to induce feature combinations for machine translation, which do not increase the decoding complexity. [sent-190, score-0.456]
88 Further improvements in the original feature set and the induction algorithm, as well as full integration in decoding are needed to result in substantial performance improvements. [sent-192, score-0.311]
89 It would be interesting to compare such alternatives to the regression tree features we explored. [sent-194, score-0.351]
90 One system, many domains: Open-domain statistical machine translation via feature augmentation. [sent-227, score-0.251]
91 Svore, and hypotheses (the first hypothesis in each pair), and 4 currently receiving high scores. [sent-260, score-0.239]
92 Let us number all phrase-pairs in context in all hypotheses in all sentences as p1, . [sent-266, score-0.177]
93 , pR and denote their input feature vectors as x1, . [sent-269, score-0.237]
94 We will use F(pr) and F(xr) interchangeably, because the score of a phrase-pair in context is defined by its input feature vector. [sent-273, score-0.297]
95 The loss Ψ(F(xr)) is expressed as follows: ∑nN=1∑kK=1log(1 + e∑pr∈hjnkF(xr)−∑pr∈hinkF(xr)). [sent-274, score-0.213]
96 Next∑ we derive the derivatives of the loss Ψ(F(x)) with respect to the phrase scores. [sent-275, score-0.335]
97 Intuitively, we are treating the scores we want to learn as parameters for the loss function; thus the loss function has a huge number of parameters, one for each instance of each phrase pair in context in each translation. [sent-276, score-0.754]
98 We ask the loss function if these scores could be set in an arbitrary way, what direction it would like to move them in to be minimized. [sent-277, score-0.404]
99 Each phrase-pair in context pr occurs in exactly one hypothesis h in one sentence. [sent-279, score-0.316]
100 To express the gradient with respect to F(xr) we therefore need to focus on the terms of the loss from a single sentence and to take into account the hypothesis pairs [hj,k, hi,k] where the left or the right hypothesis is the hypothesis h containing our focus phrase pair pr. [sent-281, score-0.978]
wordName wordTfidf (topN-words)
[('xr', 0.259), ('gradient', 0.251), ('boosting', 0.229), ('loss', 0.213), ('leaf', 0.207), ('channel', 0.182), ('feature', 0.17), ('tuning', 0.159), ('rx', 0.142), ('fm', 0.14), ('pr', 0.134), ('regression', 0.13), ('induce', 0.13), ('decision', 0.126), ('pro', 0.123), ('hypothesis', 0.122), ('tree', 0.119), ('decompose', 0.119), ('hypotheses', 0.117), ('prp', 0.113), ('lfl', 0.111), ('locally', 0.109), ('features', 0.102), ('hopkins', 0.099), ('scores', 0.096), ('friedman', 0.096), ('differentiable', 0.09), ('nodes', 0.084), ('chiang', 0.083), ('combinations', 0.082), ('induced', 0.082), ('translation', 0.081), ('trees', 0.079), ('bleu', 0.078), ('decomposable', 0.078), ('decoding', 0.074), ('hinhkinkff', 0.074), ('hink', 0.074), ('hjnk', 0.074), ('tarowatanabe', 0.074), ('phrase', 0.07), ('linear', 0.07), ('weighting', 0.07), ('weights', 0.068), ('induction', 0.067), ('input', 0.067), ('weak', 0.067), ('burges', 0.066), ('appendix', 0.065), ('dramatic', 0.063), ('sums', 0.063), ('local', 0.063), ('function', 0.061), ('context', 0.06), ('penalty', 0.06), ('minimization', 0.058), ('conjunctions', 0.057), ('jerome', 0.057), ('sco', 0.057), ('breiman', 0.057), ('learners', 0.053), ('reordering', 0.052), ('hyperparameters', 0.052), ('mh', 0.052), ('derivatives', 0.052), ('bulgarian', 0.05), ('parameter', 0.046), ('gains', 0.046), ('fl', 0.045), ('lexicalized', 0.045), ('minimize', 0.044), ('learner', 0.043), ('brevity', 0.042), ('pair', 0.041), ('cherry', 0.04), ('logp', 0.04), ('count', 0.038), ('machines', 0.038), ('constant', 0.038), ('nguyen', 0.037), ('pairs', 0.037), ('selecting', 0.036), ('daum', 0.036), ('node', 0.036), ('inducing', 0.036), ('deletion', 0.035), ('distortion', 0.035), ('line', 0.035), ('base', 0.034), ('direction', 0.034), ('insertion', 0.034), ('smoothed', 0.034), ('ranking', 0.033), ('wu', 0.033), ('tsukuda', 0.033), ('mahajan', 0.033), ('milind', 0.033), ('phrasepairs', 0.033), ('affine', 0.033), ('avien', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
Author: Kristina Toutanova ; Byung-Gyu Ahn
Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
2 0.19807163 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning
Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1
3 0.16142021 24 acl-2013-A Tale about PRO and Monsters
Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel
Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved testtime BLEU scores. Thus, we recommend them to anybody using PRO, monsterbeliever or not. – 1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO’s optimization objective, namely sentencelevel BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO’s surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6. 2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j0) ⇔ hw (i, j) > hw (i, j0), where g is a local scoring fu hnction (typically, sentencelevel BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j0 that were obtained using w. If g(i, j) > g(i, j0), we will refer to j and j0 as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only. 12 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 12–17, Avg. Lengths Avg. BLEU+1 iter. pos neg ref. pos neg 1 45.2 44.6 46.5 52.5 37.6 2 3 4 5 ... 25 46.4 46.4 46.4 46.3 ... 47.9 70.5 261.0 250.0 248.0 ... 229.0 53.2 53.4 53.0 53.0 ... 52.5 52.8 52.4 52.0 52.1 ... 52.2 14.5 2.19 2.30 2.34 ... 2.81 Table 1: PRO iterations, tuning on long sentences. Table 1shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO’s parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters. iteration Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis). Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example. Optimizer Objective BLEU PROsent-BLEU+144.57 MERT corpus-BLEU 47.53 MIRA pseudo-doc-BLEU 47.80 PRO (6= objective)pseudo-doc-BLEU21.35 PMRIORA (6= =(6= o bojbejcetcivteiv)e) sent-BLEU+1 47.59 PMRIRO,A PC (6=-sm obojoectthiv,e g)roundfixed sent-BLEU+145.71 Table 2: PRO vs. MERT vs. MIRA. We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO’s,1 we further experimented with plugging MIRA’s objective into PRO and PRO’s objective into MIRA. The resulting MIRA scores were not much different from before, while PRO’s score dropped even further; we also found mon- sters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT1 1 for Spanish-English; this yielded a drop in BLEU from 29.63 (1M2/emEs/RprTes)la to 27.12 n(inPg/RtmOp.)1.1 MERT2 **REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - **IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in **IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and **IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure Ftroai ngtkhsu et rahefef 2he : Ea t h xte ha,e motfo pstohlmeee r leafst eorfe wne c, et etr laonngs olfa t hei opnar a ofn sdo hhee oy fpwhaoitst hh a]r ee usisn i ostu tofra tnhes ilna tbiakoern s, haef ctoeokr it hee roant ainod nthse 1 t,h 3we aknonw d, 4. T@-@h eAl l tahes ft trwce o, tho ypotheses are monsters. 1See (Cherry and Foster, 2012) for details on objectives. 2Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters. 13 3 Slaying Monsters: Theory Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO’s selection and acceptance steps. 3.1 What is PRO? PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps: 1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations. 2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points). 3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores. 4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair. We believe that monsters are nurtured by PRO’s selection and acceptance policies. PRO’s selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut-off ones that differ too much based on BLEU+1 or length. PRO’s acceptance step selects Ξ = 50 pairs with the highest BLEU+1 differentials, which creates breeding ground for monsters since these pairs are very likely to include one monster and one good hypothesis. Below we discuss monster slaying geared towards the selection and acceptance steps of PRO. 3.2 Slaying at Selection In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic. Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair. Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm. Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring functions, i.e., the BLEU+1 scores, are distributed nor- mally: g(i, j) ∼ N(µ, σ2). Given a sample of hypothesis (tira,nj)sl ∼atio Nn(sµ {j} of the same source sentpeontchee i, we can ensstim {ja}te o σ empirically. Then, the difference ∆ = g(i, j) − g(i, j0) would be tdhisetr diibfufteerde normally w gi(thi, mean zero and variance 2σ2. Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ2). 3.3 Slaying at Acceptance Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs, among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough. Ξ 14 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) PRO fix Avg. for 3 reruns BLEU StdDev Pos Lengths Neg Ref BLEU+1 Avg. for 3 reruns Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Max diff. cut-offBLEU+1 max=10†47.940.16547.949.649.449.439.947.770.035 BLEU+1 max=20 † 47.73 0.136 47.7 55.5 51.1 49.8 32.7 47.85 0.049 LEN max=5 † 48.09 0.021 46.8 47.0 47.9 52.9 37.8 47.73 0.051 LEN max=10 † 47.99 0.025 47.3 48.5 48.7 52.5 35.6 47.80 0.056 OutliersBLEU+1 λ=2.0†48.050.11946.847.247.752.239.547.470.090 BLEU+1 λ=3.0 LEN λ=2.0 LEN λ=3.0 47.12 46.68 47.02 1.348 2.005 0.727 47.6 49.3 48.2 168.0 82.7 163.0 53.0 53.1 51.4 51.7 52.3 51.4 3.9 5.3 4.2 47.53 47.49 47.65 0.038 0.085 0.096 Stoch. sampl.∆ BLEU+146.331.00046.8216.053.353.12.447.740.035 ∆ LEN 46.36 1.281 47.4 201.0 52.9 53.4 2.9 47.78 0.081 Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. 4 Attacking Monsters: Practice Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance. 4.1 Experimental Setup We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest-50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reorder- ing (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST’s scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the –return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores. 4.2 Selection Alternatives Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO’s tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a nonmonster prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples (in bold). The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO’s stability. On the full dataset, BLEU scores are about the same as for the original PRO (with small improvement for BLEU+1 max=20), but the standard deviations are slightly better. Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2, yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability. 15 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) Avg. for 3 reruns Lengths BLEU+1 Avg. for 3 reruns PRO fix BLEU StdDev Pos Neg Ref Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Rand. acceptPRO, rand††47.870.14747.748.548.7047.742.947.590.114 OutliersBLEU+1 λ=2.0, rand∗47.850.07848.248.448.947.543.647.620.091 BLEU+1 λ=3.0, rand 47.97 0.168 47.6 47.6 48.4 47.8 43.6 47.44 0.070 LEN λ=2.0, rand∗ 47.69 0.114 47.8 47.8 48.6 47.9 43.6 47.48 0.046 LEN λ=3.0, rand 47.89 0.235 47.8 48.0 48.7 47.7 43. 1 47.64 0.090 Stoch. sampl.∆ BLEU+1, rand∗47.990.08747.948.048.747.843.547.670.096 ∆ LEN, rand∗ 47.94 0.060 47.8 47.9 48.6 47.8 43.6 47.65 0.097 Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance. 4.3 Random Acceptance Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, nonmonster prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting. 5 Related Work We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fix the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011). With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), MIRA– PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters. 6 Tale’s Moral and Future Battles We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO’s length bias. 16 References Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proc. of Workshop on Statistical Machine Translation, WMT ’08, pages 26–34. Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. MT Summit XIII: the Machine Translation Summit, pages 32–39. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 427–436. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 224–233. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine transla- tion. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 218–226. Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 11, pages 176–181 . Michael Denkowski and Alon Lavie. 2011. Meteortuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report, CMU-LTI-1 1-01 1, Language Technologies Institute, Carnegie Mellon University. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’09, pages 242–249. Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 221–231. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT ’ 11, pages 187–197. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 11, pages 1352–1362. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLTNAACL ’03, pages 48–54. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT ’05. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the Meeting of the Association for Computational Linguistics, ACL ’07, pages 177–180. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL ’04, pages 169–176. Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23: 105–1 15. Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING ’08, pages 585–592. Preslav Nakov, Francisco Guzm a´n, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings ofthe International Conference on Computational Linguistics, COLING ’ 12, pages 1979–1994. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’03, pages 160–167. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’08, pages 117–120. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 12, pages 11–21. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, pages 764–773. 17
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1
5 0.15156662 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
Author: lemao liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Most statistical machine translation (SMT) systems are modeled using a loglinear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.
6 0.13229547 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation
7 0.11941667 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
8 0.11917974 328 acl-2013-Stacking for Statistical Machine Translation
9 0.11328468 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
10 0.11154865 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
11 0.11131249 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
12 0.10809419 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
13 0.10763004 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
14 0.10594049 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
15 0.10114646 314 acl-2013-Semantic Roles for String to Tree Machine Translation
16 0.10063688 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
17 0.10029038 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
18 0.098939851 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation
19 0.093135461 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
20 0.091730349 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
topicId topicWeight
[(0, 0.25), (1, -0.133), (2, 0.112), (3, 0.06), (4, -0.056), (5, 0.044), (6, 0.07), (7, -0.016), (8, -0.029), (9, 0.083), (10, -0.021), (11, 0.049), (12, -0.05), (13, -0.07), (14, -0.009), (15, 0.065), (16, -0.025), (17, 0.097), (18, 0.022), (19, -0.014), (20, 0.135), (21, 0.077), (22, 0.059), (23, 0.016), (24, 0.029), (25, 0.002), (26, 0.067), (27, 0.018), (28, -0.049), (29, -0.041), (30, 0.102), (31, 0.022), (32, 0.018), (33, 0.018), (34, 0.016), (35, -0.058), (36, -0.033), (37, 0.012), (38, 0.002), (39, -0.066), (40, 0.061), (41, 0.0), (42, 0.004), (43, 0.033), (44, -0.033), (45, -0.042), (46, 0.1), (47, 0.02), (48, 0.03), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95018512 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
Author: Kristina Toutanova ; Byung-Gyu Ahn
Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
2 0.92017686 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning
Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1
3 0.8888945 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation
Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik
Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread ofthe projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and ArabicEnglish translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant im- provements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.
4 0.87894589 24 acl-2013-A Tale about PRO and Monsters
Author: Preslav Nakov ; Francisco Guzman ; Stephan Vogel
Abstract: While experimenting with tuning on long sentences, we made an unexpected discovery: that PRO falls victim to monsters overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning and can cause testing BLEU to drop by several points absolute. We propose several effective ways to address the problem, using length- and BLEU+1based cut-offs, outlier filters, stochastic sampling, and random acceptance. The best of these fixes not only slay and protect against monsters, but also yield higher stability for PRO as well as improved testtime BLEU scores. Thus, we recommend them to anybody using PRO, monsterbeliever or not. – 1 Once Upon a Time... For years, the standard way to do statistical machine translation parameter tuning has been to use minimum error-rate training, or MERT (Och, 2003). However, as researchers started using models with thousands of parameters, new scalable optimization algorithms such as MIRA (Watanabe et al., 2007; Chiang et al., 2008) and PRO (Hopkins and May, 2011) have emerged. As these algorithms are relatively new, they are still not quite well understood, and studying their properties is an active area of research. For example, Nakov et al. (2012) have pointed out that PRO tends to generate translations that are consistently shorter than desired. They have blamed this on inadequate smoothing in PRO’s optimization objective, namely sentencelevel BLEU+1, and they have addressed the problem using more sensible smoothing. We wondered whether the issue could be partially relieved simply by tuning on longer sentences, for which the effect of smoothing would naturally be smaller. To our surprise, tuning on the longer 50% of the tuning sentences had a disastrous effect on PRO, causing an absolute drop of three BLEU points on testing; at the same time, MERT and MIRA did not have such a problem. While investigating the reasons, we discovered hundreds of monsters creeping under PRO’s surface... Our tale continues as follows. We first explain what monsters are in Section 2, then we present a theory about how they can be slayed in Section 3, we put this theory to test in practice in Section 4, and we discuss some related efforts in Section 5. Finally, we present the moral of our tale, and we hint at some planned future battles in Section 6. 2 Monsters, Inc. PRO uses pairwise ranking optimization, where the learning task is to classify pairs of hypotheses into correctly or incorrectly ordered (Hopkins and May, 2011). It searches for a vector of weights w such that higher evaluation metric scores correspond to higher model scores and vice versa. More formally, PRO looks for weights w such that g(i, j) > g(i, j0) ⇔ hw (i, j) > hw (i, j0), where g is a local scoring fu hnction (typically, sentencelevel BLEU+1) and hw are the model scores for a given input sentence i and two candidate hypotheses j and j0 that were obtained using w. If g(i, j) > g(i, j0), we will refer to j and j0 as the positive and the negative example in the pair. Learning good parameter values requires negative examples that are comparable to the positive ones. Instead, tuning on long sentences quickly introduces monsters, i.e., corrupted negative examples that are unsuitable for learning: they are (i) much longer than the respective positive examples and the references, and (ii) have very low BLEU+1 scores compared to the positive examples and in absolute terms. The low BLEU+1 means that PRO effectively has to learn from positive examples only. 12 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 12–17, Avg. Lengths Avg. BLEU+1 iter. pos neg ref. pos neg 1 45.2 44.6 46.5 52.5 37.6 2 3 4 5 ... 25 46.4 46.4 46.4 46.3 ... 47.9 70.5 261.0 250.0 248.0 ... 229.0 53.2 53.4 53.0 53.0 ... 52.5 52.8 52.4 52.0 52.1 ... 52.2 14.5 2.19 2.30 2.34 ... 2.81 Table 1: PRO iterations, tuning on long sentences. Table 1shows an optimization run of PRO when tuning on long sentences. We can see monsters after iterations in which positive examples are on average longer than negative ones (e.g., iter. 1). As a result, PRO learns to generate longer sentences, but it overshoots too much (iter. 2), which gives rise to monsters. Ideally, the learning algorithm should be able to recover from overshooting. However, once monsters are encountered, they quickly start dominating, with no chance for PRO to recover since it accumulates n-best lists, and thus also monsters, over iterations. As a result, PRO keeps jumping up and down and converges to random values, as Figure 1 shows. By default, PRO’s parameters are averaged over iterations, and thus the final result is quite mediocre, but selecting the highest tuning score does not solve the problem either: for example, on Figure 1, PRO never achieves a BLEU better than that for the default initialization parameters. iteration Figure 1: PRO tuning results on long sentences across iterations. The dark-gray line shows the tuning BLEU (left axis), the light-gray one is the hypothesis/reference length ratio (right axis). Figure 2 shows the translations after iterations 1, 3 and 4; the last two are monsters. The monster at iteration 3 is potentially useful, but that at iteration 4 is clearly unsuitable as a negative example. Optimizer Objective BLEU PROsent-BLEU+144.57 MERT corpus-BLEU 47.53 MIRA pseudo-doc-BLEU 47.80 PRO (6= objective)pseudo-doc-BLEU21.35 PMRIORA (6= =(6= o bojbejcetcivteiv)e) sent-BLEU+1 47.59 PMRIRO,A PC (6=-sm obojoectthiv,e g)roundfixed sent-BLEU+145.71 Table 2: PRO vs. MERT vs. MIRA. We also checked whether other popular optimizers yield very low BLEU scores at test time when tuned on long sentences. Lines 2-3 in Table 2 show that this is not the case for MERT and MIRA. Since they optimize objectives that are different from PRO’s,1 we further experimented with plugging MIRA’s objective into PRO and PRO’s objective into MIRA. The resulting MIRA scores were not much different from before, while PRO’s score dropped even further; we also found mon- sters. Next, we applied the length fix for PRO proposed in (Nakov et al., 2012); this helped a bit, but still left PRO two BLEU points behind and MIRA, and the monsters did not go away. We can conclude that the monster problem is PRO-specific, cannot be blamed on the objective function, and is different from the length bias. Note also that monsters are not specific to a dataset or language pair. We found them when tuning on the top-50% of WMT10 and testing on WMT1 1 for Spanish-English; this yielded a drop in BLEU from 29.63 (1M2/emEs/RprTes)la to 27.12 n(inPg/RtmOp.)1.1 MERT2 **REF** : but we have to close ranks with each other and realize that in unity there is strength while in division there is weakness . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - **IT1** : but we are that we add our ranks to some of us and that we know that in the strength and weakness in **IT3** : , we are the but of the that that the , and , of ranks the the on the the our the our the some of we can include , and , of to the of we know the the our in of the of some people , force of the that that the in of the that that the the weakness Union the the , and **IT4** : namely Dr Heba Handossah and Dr Mona been pushed aside because a larger story EU Ambassador to Egypt Ian Burg highlighted 've dragged us backwards and dragged our speaking , never balme your defaulting a December 7th 1941 in Pearl Harbor ) we can include ranks will be joined by all 've dragged us backwards and dragged our $ 3 .8 billion in tourism income proceeds Chamber are divided among themselves : some ' ve dragged us backwards and dragged our were exaggerated . Al @-@ Hakim namely Dr Heba Handossah and Dr Mona December 7th 1941 in Pearl Harbor ) cases might be known to us December 7th 1941 in Pearl Harbor ) platform depends on combating all liberal policies Track and Field Federation shortened strength as well face several challenges , namely Dr Heba Handossah and Dr Mona platform depends on combating all liberal policies the report forecast that the weak structure Ftroai ngtkhsu et rahefef 2he : Ea t h xte ha,e motfo pstohlmeee r leafst eorfe wne c, et etr laonngs olfa t hei opnar a ofn sdo hhee oy fpwhaoitst hh a]r ee usisn i ostu tofra tnhes ilna tbiakoern s, haef ctoeokr it hee roant ainod nthse 1 t,h 3we aknonw d, 4. T@-@h eAl l tahes ft trwce o, tho ypotheses are monsters. 1See (Cherry and Foster, 2012) for details on objectives. 2Also, using PRO to initialize MERT, as implemented in Moses, yields 46.52 BLEU and monsters, but using MERT to initialize PRO yields 47.55 and no monsters. 13 3 Slaying Monsters: Theory Below we explain what monsters are and where they come from. Then, we propose various monster slaying techniques to be applied during PRO’s selection and acceptance steps. 3.1 What is PRO? PRO is a batch optimizer that iterates between (i) translation: using the current parameter values, generate k-best translations, and (ii) optimization: using the translations from all previous iterations, find new parameter values. The optimization step has four substeps: 1. Sampling: For each sentence, sample uniformly at random Γ = 5000 pairs from the set of all candidate translations for that sentence from all previous iterations. 2. Selection: From these sampled pairs, select those for which the absolute difference between their BLEU+1 scores is higher than α = 0.05 (note: this is 5 BLEU+1 points). 3. Acceptance: For each sentence, accept the Ξ = 50 selected pairs with the highest absolute difference in their BLEU+1 scores. 4. Learning: Assemble the accepted pairs for all sentences into a single set and use it to train a ranker to prefer the higher-scoring sentence in each pair. We believe that monsters are nurtured by PRO’s selection and acceptance policies. PRO’s selection step filters pairs involving hypotheses that differ by less than five BLEU+1 points, but it does not cut-off ones that differ too much based on BLEU+1 or length. PRO’s acceptance step selects Ξ = 50 pairs with the highest BLEU+1 differentials, which creates breeding ground for monsters since these pairs are very likely to include one monster and one good hypothesis. Below we discuss monster slaying geared towards the selection and acceptance steps of PRO. 3.2 Slaying at Selection In the selection step, PRO filters pairs for which the difference in BLEU+1 is less than five points, but it has no cut-off on the maximum BLEU+1 differentials nor cut-offs based on absolute length or difference in length. Here, we propose several selection filters, both deterministic and probabilistic. Cut-offs. A cut-off is a deterministic rule that filters out pairs that do not comply with some criteria. We experiment with a maximal cut-off on (a) the difference in BLEU+1 scores and (b) the difference in lengths. These are relative cut-offs because they refer to the pair, but absolute cut-offs that apply to each of the elements in the pair are also possible (not explored here). Cut-offs (a) and (b) slay monsters by not allowing the negative examples to get much worse in BLEU+1 or in length than the positive example in the pair. Filtering outliers. Outliers are rare or extreme observations in a sample. We assume normal distribution of the BLEU+1 scores (or of the lengths) of the translation hypotheses for the same source sentence, and we define as outliers hypotheses whose BLEU+1 (or length) is more than λ standard deviations away from the sample average. We apply the outlier filter to both the positive and the negative example in a pair, but it is more important for the latter. We experiment with values of λ like 2 and 3. This filtering slays monsters because they are likely outliers. However, it will not work if the population gets riddled with monsters, in which case they would become the norm. Stochastic sampling. Instead of filtering extreme examples, we can randomly sample pairs according to their probability of being typical. Let us assume that the values of the local scoring functions, i.e., the BLEU+1 scores, are distributed nor- mally: g(i, j) ∼ N(µ, σ2). Given a sample of hypothesis (tira,nj)sl ∼atio Nn(sµ {j} of the same source sentpeontchee i, we can ensstim {ja}te o σ empirically. Then, the difference ∆ = g(i, j) − g(i, j0) would be tdhisetr diibfufteerde normally w gi(thi, mean zero and variance 2σ2. Now, given a pair of examples, we can calculate their ∆, and we can choose to select the pair with some probability, according to N(0, 2σ2). 3.3 Slaying at Acceptance Another problem is caused by the acceptance mechanism of PRO: among all selected pairs, it accepts the top-Ξ with the highest BLEU+1 differentials. It is easy to see that these differentials are highest for nonmonster–monster pairs if such pairs exist. One way to avoid focusing primarily on such pairs is to accept a random set of pairs, among the ones that survived the selection step. One possible caveat is that we can lose some of the discriminative power of PRO by focusing on examples that are not different enough. Ξ 14 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) PRO fix Avg. for 3 reruns BLEU StdDev Pos Lengths Neg Ref BLEU+1 Avg. for 3 reruns Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Max diff. cut-offBLEU+1 max=10†47.940.16547.949.649.449.439.947.770.035 BLEU+1 max=20 † 47.73 0.136 47.7 55.5 51.1 49.8 32.7 47.85 0.049 LEN max=5 † 48.09 0.021 46.8 47.0 47.9 52.9 37.8 47.73 0.051 LEN max=10 † 47.99 0.025 47.3 48.5 48.7 52.5 35.6 47.80 0.056 OutliersBLEU+1 λ=2.0†48.050.11946.847.247.752.239.547.470.090 BLEU+1 λ=3.0 LEN λ=2.0 LEN λ=3.0 47.12 46.68 47.02 1.348 2.005 0.727 47.6 49.3 48.2 168.0 82.7 163.0 53.0 53.1 51.4 51.7 52.3 51.4 3.9 5.3 4.2 47.53 47.49 47.65 0.038 0.085 0.096 Stoch. sampl.∆ BLEU+146.331.00046.8216.053.353.12.447.740.035 ∆ LEN 46.36 1.281 47.4 201.0 52.9 53.4 2.9 47.78 0.081 Table 3: Some fixes to PRO (select pairs with highest BLEU+1 differential, also require at least 5 BLEU+1 points difference). A dagger (†) indicates selection fixes that successfully get rid of monsters. 4 Attacking Monsters: Practice Below, we first present our general experimental setup. Then, we present the results for the various selection alternatives, both with the original acceptance strategy and with random acceptance. 4.1 Experimental Setup We used a phrase-based SMT model (Koehn et al., 2003) as implemented in the Moses toolkit (Koehn et al., 2007). We trained on all Arabic-English data for NIST 2012 except for UN, we tuned on (the longest-50% of) the MT06 sentences, and we tested on MT09. We used the MADA ATB segmentation for Arabic (Roth et al., 2008) and truecasing for English, phrases of maximal length 7, Kneser-Ney smoothing, and lexicalized reorder- ing (Koehn et al., 2005), and a 5-gram language model, trained on GigaWord v.5 using KenLM (Heafield, 2011). We dropped unknown words both at tuning and testing, and we used minimum Bayes risk decoding at testing (Kumar and Byrne, 2004). We evaluated the output with NIST’s scoring tool v.13a, cased. We used the Moses implementations of MERT, PRO and batch MIRA, with the –return-best-dev parameter for the latter. We ran these optimizers for up to 25 iterations and we used 1000-best lists. For stability (Foster and Kuhn, 2009), we performed three reruns of each experiment (tuning + evaluation), and we report averaged scores. 4.2 Selection Alternatives Table 3 presents the results for different selection alternatives. The first two columns show the testing results: average BLEU and standard deviation over three reruns. The following five columns show statistics about the last iteration (it. 25) of PRO’s tuning for the worst rerun: average lengths of the positive and the negative examples and average effective reference length, followed by average BLEU+1 scores for the positive and the negative examples in the pairs. The last two columns present the results when tuning on the full tuning set. These are included to verify the behavior of PRO in a nonmonster prone environment. We can see in Table 3 that all selection mechanisms considerably improve BLEU compared to the baseline PRO, by 2-3 BLEU points. However, not every selection alternative gets rid of monsters, which can be seen by the large lengths and low BLEU+1 for the negative examples (in bold). The max cut-offs for BLEU+1 and for lengths both slay the monsters, but the latter yields much lower standard deviation (thirteen times lower than for the baseline PRO!), thus considerably increasing PRO’s stability. On the full dataset, BLEU scores are about the same as for the original PRO (with small improvement for BLEU+1 max=20), but the standard deviations are slightly better. Rejecting outliers using BLEU+1 and λ = 3 is not strong enough to filter out monsters, but making this criterion more strict by setting λ = 2, yields competitive BLEU and kills the monsters. Rejecting outliers based on length does not work as effectively though. We can think of two possible reasons: (i) lengths are not normally distributed, they are more Poisson-like, and (ii) the acceptance criterion is based on the top-Ξ differentials based on BLEU+1, not based on length. On the full dataset, rejecting outliers, BLEU+1 and length, yields lower BLEU and less stability. 15 TESTING TUNING (run 1, it. 25, avg.) TEST(tune:full) Avg. for 3 reruns Lengths BLEU+1 Avg. for 3 reruns PRO fix BLEU StdDev Pos Neg Ref Pos Neg BLEU StdDev PRO (baseline)44.700.26647.9229.052.552.22.847.800.052 Rand. acceptPRO, rand††47.870.14747.748.548.7047.742.947.590.114 OutliersBLEU+1 λ=2.0, rand∗47.850.07848.248.448.947.543.647.620.091 BLEU+1 λ=3.0, rand 47.97 0.168 47.6 47.6 48.4 47.8 43.6 47.44 0.070 LEN λ=2.0, rand∗ 47.69 0.114 47.8 47.8 48.6 47.9 43.6 47.48 0.046 LEN λ=3.0, rand 47.89 0.235 47.8 48.0 48.7 47.7 43. 1 47.64 0.090 Stoch. sampl.∆ BLEU+1, rand∗47.990.08747.948.048.747.843.547.670.096 ∆ LEN, rand∗ 47.94 0.060 47.8 47.9 48.6 47.8 43.6 47.65 0.097 Table 4: More fixes to PRO (with random acceptance, no minimum BLEU+1). The (††) indicates that random acceptance kills monsters. The asterisk (∗) indicates improved stability over random acceptance. Reasons (i) and (ii) arguably also apply to stochastic sampling of differentials (for BLEU+1 or for length), which fails to kill the monsters, maybe because it gives them some probability of being selected by design. To alleviate this, we test the above settings with random acceptance. 4.3 Random Acceptance Table 4 shows the results for accepting training pairs for PRO uniformly at random. To eliminate possible biases, we also removed the min=0.05 BLEU+1 selection criterion. Surprisingly, this setup effectively eliminated the monster problem. Further coupling this with the distributional criteria can also yield increased stability, and even small further increase in test BLEU. For instance, rejecting BLEU outliers with λ = 2 yields comparable average test BLEU, but with only half the standard deviation. On the other hand, using the stochastic sampling of differentials based on either BLEU+1 or lengths improves the test BLEU score while increasing the stability across runs. The random acceptance has a caveat though: it generally decreases the discriminative power of PRO, yielding worse results when tuning on the full, nonmonster prone tuning dataset. Stochastic selection does help to alleviate this problem. Yet, the results are not as good as when using a max cut-off for the length. Therefore, we recommend using the latter as a default setting. 5 Related Work We are not aware of previous work that discusses the issue of monsters, but there has been work on a different, length problem with PRO (Nakov et al., 2012). We have seen that its solution, fix the smoothing in BLEU+1, did not work for us. The stability of MERT has been improved using regularization (Cer et al., 2008), random restarts (Moore and Quirk, 2008), multiple replications (Clark et al., 2011), and parameter aggregation (Cettolo et al., 2011). With the emergence of new optimization techniques, there have been studies that compare stability between MIRA–MERT (Chiang et al., 2008; Chiang et al., 2009; Cherry and Foster, 2012), PRO–MERT (Hopkins and May, 2011), MIRA– PRO–MERT (Cherry and Foster, 2012; Gimpel and Smith, 2012; Nakov et al., 2012). Pathological verbosity can be an issue when tuning MERT on recall-oriented metrics such as METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2011). Large variance between the results obtained with MIRA has also been reported (Simianer et al., 2012). However, none of this work has focused on monsters. 6 Tale’s Moral and Future Battles We have studied a problem with PRO, namely that it can fall victim to monsters, overly long negative examples with very low BLEU+1 scores, which are unsuitable for learning. We have proposed several effective ways to address this problem, based on length- and BLEU+1-based cut-offs, outlier filters and stochastic sampling. The best of these fixes have not only slayed the monsters, but have also brought much higher stability to PRO as well as improved test-time BLEU scores. These benefits are less visible on the full dataset, but we still recommend them to everybody who uses PRO as protection against monsters. Monsters are inherent in PRO; they just do not always take over. In future work, we plan a deeper look at the mechanism of monster creation in PRO and its possible connection to PRO’s length bias. 16 References Daniel Cer, Daniel Jurafsky, and Christopher Manning. 2008. Regularization and search for minimum error rate training. In Proc. of Workshop on Statistical Machine Translation, WMT ’08, pages 26–34. Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. 2011. Methods for smoothing the optimizer instability in SMT. MT Summit XIII: the Machine Translation Summit, pages 32–39. Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 427–436. David Chiang, Yuval Marton, and Philip Resnik. 2008. Online large-margin training of syntactic and structural translation features. In Proceedings ofthe Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 224–233. David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine transla- tion. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’09, pages 218–226. Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 11, pages 176–181 . Michael Denkowski and Alon Lavie. 2011. Meteortuned phrase-based SMT: CMU French-English and Haitian-English systems for WMT 2011. Technical report, CMU-LTI-1 1-01 1, Language Technologies Institute, Carnegie Mellon University. George Foster and Roland Kuhn. 2009. Stabilizing minimum error rate training. In Proceedings of the Workshop on Statistical Machine Translation, StatMT ’09, pages 242–249. Kevin Gimpel and Noah Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT ’ 12, pages 221–231. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Workshop on Statistical Machine Translation, WMT ’ 11, pages 187–197. Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP ’ 11, pages 1352–1362. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLTNAACL ’03, pages 48–54. Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation, IWSLT ’05. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of the Meeting of the Association for Computational Linguistics, ACL ’07, pages 177–180. Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, HLT-NAACL ’04, pages 169–176. Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23: 105–1 15. Robert Moore and Chris Quirk. 2008. Random restarts in minimum error rate training for statistical machine translation. In Proceedings of the International Conference on Computational Linguistics, COLING ’08, pages 585–592. Preslav Nakov, Francisco Guzm a´n, and Stephan Vogel. 2012. Optimizing for sentence-level BLEU+1 yields short translations. In Proceedings ofthe International Conference on Computational Linguistics, COLING ’ 12, pages 1979–1994. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’03, pages 160–167. Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’08, pages 117–120. Patrick Simianer, Stefan Riezler, and Chris Dyer. 2012. Joint feature selection in distributed stochastic learning for large-scale discriminative training in smt. In Proceedings of the Meeting of the Association for Computational Linguistics, ACL ’ 12, pages 11–21. Taro Watanabe, Jun Suzuki, Hajime Tsukada, and Hideki Isozaki. 2007. Online large-margin training for statistical machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’07, pages 764–773. 17
5 0.82776707 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce
Author: Vladimir Eidelman ; Ke Wu ; Ferhan Ture ; Philip Resnik ; Jimmy Lin
Abstract: We present an open-source framework for large-scale online structured learning. Developed with the flexibility to handle cost-augmented inference problems such as statistical machine translation (SMT), our large-margin learner can be used with any decoder. Integration with MapReduce using Hadoop streaming allows efficient scaling with increasing size of training data. Although designed with a focus on SMT, the decoder-agnostic design of our learner allows easy future extension to other structured learning problems such as sequence labeling and parsing.
6 0.76964128 328 acl-2013-Stacking for Statistical Machine Translation
7 0.74182642 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
8 0.70148879 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation
9 0.69772071 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
10 0.66187489 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric
11 0.65516949 277 acl-2013-Part-of-speech tagging with antagonistic adversaries
12 0.65374792 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint
13 0.64981884 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts
14 0.6209904 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
15 0.61916226 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
16 0.59357798 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
17 0.57650554 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
18 0.56785542 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
19 0.56618023 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
20 0.55934274 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
topicId topicWeight
[(0, 0.036), (6, 0.036), (11, 0.047), (24, 0.013), (26, 0.507), (35, 0.048), (42, 0.068), (48, 0.031), (70, 0.036), (88, 0.015), (90, 0.026), (95, 0.073)]
simIndex simValue paperId paperTitle
1 0.96380687 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts
Author: Alexandra Balahur ; Hristo Tanev
Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no loger remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.
2 0.9563753 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina
Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results. 1 Unsupervised part-of-speech tagging Currently, part-of-speech (POS) taggers are available for many highly spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widelyspoken languages including Bengali, Javanese, and Lahnda have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. However, with the growing quantity of text available online, and in particular, multilingual parallel texts from sources such as multilingual websites, government documents and large archives ofhuman translations ofbooks, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers. In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (201 1). In this approach, — — pecina@ ufal .mff .cuni . c z a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manuallylabelled data in the target language). Our approach is substantially simpler than that of Das and Petrov, the current state-of-the art, yet performs comparably well. 2 Related work There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008). A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001 ; Das and Petrov, 2011). Das and Petrov (201 1) achieved the current state-of-the-art for unsupervised tagging by exploiting high confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices. Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes 634 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 634–639, ModelCoverageAccuracy Many-to-1 alignments88%68% 1-to-1 alignments 68% 78% 1-to-1 alignments: Top 60k sents 91% 80% Table 1: Token coverage and accuracy of manyto-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only. while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler to that of Das and Petrov in that it does not require convex optimization for label propagation or a feature based HMM, yet it achieves comparable results. 3 Tagset Our tagger exploits the idea ofprojecting tag information from a resource-rich to resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (de- terminers and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, languagespecific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese. 4 A Simpler Unsupervised POS Tagger Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise. We describe the seed model and then explain how it is successively refined through self-training and revision. 4.1 Seed Model The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data. Algorithm 1 Build seed model Algorithm 1Build seed model 1:Tag source side. 2: Word align the corpus with Giza++ and remove the many-to-one mappings. 3: Project tags from source to target using the remaining 1-to-1 alignments. 4: Select the top n sentences based on sentence alignment score. 5: Estimate emission and transition probabilities. 6: Build seed tagger T. We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN. We build a French tagger based on English– French data from the Europarl Corpus (Koehn, 2005). We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that highconfidence tags are important, and hence eliminate the many-to-one alignments. In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff × to 60k to balance the accuracy and size of the seed model.1 Returning to our preliminary English– French experiments in Table 1, this process gives improvements in both accuracy and coverage.2 1We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model. 2We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments. 635 The number of parameters for the emission probability is |V | |T| where V is the vocabulary and aTb iilsi ttyh eis tag |s e×t. TTh| ew htrearnesi Vtio ins probability, on atnhed other hand, has only |T|3 parameters for the trigram hmaondde,l we use. TB|ecause of this difference in number of parameters, in step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has less parameters, we again focus on “better” sentences, by estimating this probability from only those sen- tences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability. For unaligned words, we simply assign a random POS and very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TNT tagger (Brants, 2000), an implementation of a trigram HMM tagger. 4.2 Self training and revision For self training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores. We believe that sentence alignment score might correspond to difficulty to tag. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide sentences into blocks of 60k. In step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection. Suppose source Algorithm 2 Self training and revision 1:Divide target language sentences into blocks of n sentences. 2: Tag the first block with the seed tagger. 3: Revise the tagged block. 4: Train a new tagger on the tagged block. 5: Add the previous tagger’s lexicon to the new tagger. 6: Use the new tagger to tag the next block. 7: Goto 3 and repeat until all blocks are tagged. language word wis is aligned with target language word wjt with probability p(wjt |wsi), Tis is the tag for wis using the tagger availa|bwle for the source language, and Tjt is the tag for wjt using the tagger learned for the > S, where S is a threshold which we heuristically set to 0.7, we replace Tjt by Tis. Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of, i.e., if p(wjt |wsi) < S and Tjt Tis then Tjt = Null. So, on th|ew target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model. Step 4 estimates the emission and transition target language. If p(wtj|wis) = probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage. 5 Experimental Results Using parallel data from Europarl (Koehn, 2005) we apply our method to build taggers for the same eight target languages as Das and Petrov (201 1) Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset which we were unable to obtain). The evaluation metric and test data are the same as that used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003). — — 636 DanishDutchGermanGreekItalianPortugueseSpanishSwedishAverage Seed model83.781.183.677.878.684.981.478.981.3 Self training + revision 85.6 84.0 85.4 80.4 81.4 86.3 83.3 81.0 83.4 Das and Petrov (2011) 83.2 79.5 82.8 82.5 86.8 87.9 84.2 80.5 83.4 Table 2: Token-level POS tagging accuracy for our seed model, self training and revision, and the method of Das and Petrov (201 1). The best results on each language, and on average, are shown in bold. 1 1 Iteration 2 2 3 1 1 2 2 3 Iteration Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens for Italian (left) and Dutch (right). Table 2 shows results for our seed model, self training and revision, and the results reported by Das and Petrov. Self training and revision improve the accuracy for every language over the seed model, and gives an average improvement of roughly two percentage points. The average accuracy of self training and revision is on par with that reported by Das and Petrov. On individual languages, self training and revision and the method of Das and Petrov are split each performs better on half of the cases. Interestingly, our method achieves higher accuracies on Germanic languages the family of our source language, English while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which might be more accurate for more-related languages, whereas Das and Petrov additionally rely on label propagation. Compared to Das and Petrov, our model performs poorest on Italian, in terms of percentage point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and proportion of known tokens for each iteration of our model for Italian; iteration 0 is the seed model, and iteration 3 1 is the final model. Our model performs poorly on unknown words as indicated by the low accuracy on unknown words, and high accuracy on known — — — words compared to the overall accuracy. The poor performance on unknown words is expected because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did. We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then subsequently improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow. 6 Conclusion We have proposed a method for unsupervised POS tagging that performs on par with the current state- of-the-art (Das and Petrov, 2011), but is substantially less-sophisticated (specifically not requiring convex optimization or a feature-based HMM). The complexity of our algorithm is O(nlogn) compared to O(n2) for that of Das and Petrov 637 (201 1) where n is the size of training data.3 We made our code are available for download.4 In future work we intend to consider using a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more-closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language. Using our final model with unsupervised HMM methods might improve the final performance too, i.e. use our final model as the initial state for HMM, then experiment with differ- ent inference algorithms such as Expectation Maximization (EM), Variational Bayers (VB) or Gibbs sampling (GS).5 Gao and Johnson (2008) compare EM, VB and GS for unsupervised English POS tagging. In many cases, GS outperformed other methods, thus we would like to try GS first for our model. 7 Acknowledgements This work is funded by Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us the Greek Treebank and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions. References Thorsten Brants. 2000. TnT: A statistical part-ofspeech tagger. In Proceedings of the sixth conference on Applied natural language processing (ANLP ’00), pages 224–231 . Seattle, Washington, USA. Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of 3We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight core Intel Xeon 3.16GHz CPU with 32 Gb Ram, but only 15 minutes for our model. 4https://code.google.com/p/universal-tagger/ 5We in fact have tried EM, but it did not help. The overall performance dropped slightly. This might be because selftraining with revision already found the local maximal point. the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA. Pascal Denis and Beno ıˆt Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proceedings of the 23rd PacificAsia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China. Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’06), pages 549–554. Genoa, Italy. Jianfeng Gao and Mark Johnson. 2008. A comparison of bayesian estimators for unsupervised hidden markov model pos taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA. Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP ’04), pages 222–229. Barcelona, Spain. Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand. David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the main conference on Human Language Technology Conference ofthe North American Chapter of the Association of Computational Linguistics (HLT-NAACL ’06), pages 152–159. New York, USA. Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), pages 2089–2096. Istanbul, Turkey. Siva Reddy and Serge Sharoff. 2011. Cross language POS Taggers (and other tools) for Indian 638 languages: An experiment with Kannada using Telugu resources. In Proceedings of the IJCNLP 2011 workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand. Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), pages 1041–1050. Honolulu, Hawaii. Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Featurerich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Vol- ume 1 (NAACL ’03), pages 173–180. Edmonton, Canada. David Yarowsky and Grace Ngai. 2001 . Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies (NAACL ’01), pages 1–8. Pittsburgh, Pennsylvania, USA. 639
3 0.95493907 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies
Author: Reut Tsarfaty
Abstract: Stanford Dependencies (SD) provide a functional characterization of the grammatical relations in syntactic parse-trees. The SD representation is useful for parser evaluation, for downstream applications, and, ultimately, for natural language understanding, however, the design of SD focuses on structurally-marked relations and under-represents morphosyntactic realization patterns observed in Morphologically Rich Languages (MRLs). We present a novel extension of SD, called Unified-SD (U-SD), which unifies the annotation of structurally- and morphologically-marked relations via an inheritance hierarchy. We create a new resource composed of U-SDannotated constituency and dependency treebanks for the MRL Modern Hebrew, and present two systems that can automatically predict U-SD annotations, for gold segmented input as well as raw texts, with high baseline accuracy.
same-paper 4 0.94767588 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines
Author: Kristina Toutanova ; Byung-Gyu Ahn
Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in perfor- mance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.
5 0.91166252 257 acl-2013-Natural Language Models for Predicting Programming Comments
Author: Dana Movshovitz-Attias ; William W. Cohen
Abstract: Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to codecompletion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing. 1 Introduction and Related Work Statistical language models have traditionally been used to describe and analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. (2012) observe that, as code is created by humans it is likely to be repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion (Han et al., 2009; Jacob and Tairas, 2010), analysis of names in code (Lawrie et al., 2006; Binkley et al., 2011) and mining software repositories (Gabel and Su, 2008). An important part of software programming and maintenance lies in documentation, which may come in the form of tutorials describing the code, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may William W. Cohen Computer Science Department Carnegie Mellon University wcohen @ c s .cmu .edu include examples of use-cases for specific code segments or identifiers such as classes, methods and variables. Well documented code is easier to read and maintain in the long-run but writing comments is a laborious task that is often overlooked or at least postponed by many programmers. Code commenting not only provides a summarization of the conceptual idea behind the code (Sridhara et al., 2010), but can also be viewed as a form of document expansion where the comment contains significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, code categorization, and locating parts of the code that are relevant to a specific topic or idea (Tseng and Juang, 2003; Wan et al., 2007; Kumar and Carterette, 2013; Shepherd et al., 2007; Rastkar et al., 2011). A related and well studied NLP task is that of predicting natural language caption and commentary for images and videos (Blei and Jordan, 2003; Feng and Lapata, 2010; Feng and Lapata, 2013; Wu and Li, 2011). In this work, our goal is to apply statistical language models for predicting class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected as n-grams have been shown as a strong model for language and speech prediction that is hard to improve upon (Rosenfeld, 2000). In some cases however, for example in a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA (Blei et al., 2003) and link-LDA (Erosheva et al., 2004) topic models, which are more relevant for the term ex- traction scenario. We find that the topic model performance can be improved by distinguishing code and text tokens in the code. 35 Proce dinSgosfi oa,f tB huel 5g1arsita, An Anu gauls Mt 4e-e9ti n2g01 o3f. th ?c e2 A0s1s3oc Aiastsio cnia fotiron C fo mrp Cuotmatpiounta tlio Lninaglu Li sntgicusi,s ptaicgses 35–40, 2 Method 2.1 Models We train n-gram models (n = 1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets (described below). We use the Berkeley Language Model package (Pauls and Klein, 2011) with absolute discounting (Kneser-Ney smoothing; (1995)) which includes a backoff strategy to lower-order n-grams. Next, we use LDA topic models (Blei et al., 2003) trained on the same data, with 1, 5, 10 and 20 topics. The joint distribution of a topic mixture θ, and a set of N topics z, for a single source code document with N observed word tokens, d = {wi}iN=1, given the Dirichlet parameters α sa,n dd β, {isw th}erefore p(θ, z, w|α, β) = p(θ|α) Yp(z|θ)p(w|z, (1) β) Yw Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, d = where tthexet text ws,o drd =s are tok}ens f,r{owm comment and string literals, and the code words include the programming language syntax tokens (e.g., publ ic, private, for, etc’ ) and all identifiers. In this case, we train link-LDA models (Erosheva et al., 2004) with 1, 5, 10 and 20 topics. Under the linkLDA model, the mixed-membership joint distribution of a topic mixture, words and topics is then ({wciode}iC=n1, {witext}iT=n1), p(θ, z, w|α, β) = p(θ|α) Y wYtext · p(ztext|θ)p(wtext|ztext,β)· (2) Y p(zcode|θ)p(wcode|zcode,β) wYcode where θ is the joint topic distribution, w is the set of observed document words, ztext is a topic associated with a text word, and zcode a topic associated with a code word. The LDA and link-LDA models use Gibbs sampling (Griffiths and Steyvers, 2004) for topic inference, based on the implementation of Balasubramanyan and Cohen (201 1) with single or multiple entities per document, respectively. 2.2 Testing Methodology Our goal is to predict the tokens of the JAVA class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word wi is given by considering previous words p(wi |wi−1 , . . . , w0). This probability is estimated given the previous n 1tokens as p(wi|wi−1, wi−(n−1)). For t|hwe topic models, we separate the docu- ..., − ment tokens into the class definition and the comment we wish to predict. The set of tokens of the class comment are all considered as text tokens. The rest of the tokens in the document are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the tokens wc, wr, wr p(θ,zr|wr,α,β) =p(θp,(zwr,rw|αr,|αβ),β) (3) This gives us an estimate of the document distribution, θ, with which we infer the probability of the comment tokens as p(wc|θ,β) = Xp(wc|z,β)p(z|θ) (4) Xz Following Blei et al. (2003), for the case of a single entity LDA, the inference problem from equation (3) can be solved by considering p(θ, z, w|α, β), as in equation (1), and by taking tph(eθ marginal )di,s atrsib iunti eoqnu aotfio othne ( 1d)o,c aunmde bnyt t toakkeinngs as a continuous mixture distribution for the set w = by integrating over θ and summing over the set of topics z wr, p(w|α,β) =Zp(θ|α)· (5) YwXzp(z|θ)p(w|z,β)!dθ For the case of link-LDA where the document is comprised of two entities, in our case code tokens and text tokens, we can consider the mixedmembership joint distribution θ, as in equation (2), and similarly the marginal distribution p(w|α, β) over bimoithla rclyod teh ean mda tregxint tlok deisntsri bfruotmion w pr(.w |Sαi,nβce) comment words in are all considered as text tokens they are sampled using text topics, namely ztext, in equation (4). wc 36 3 Experimental Settings 3.1 Data and Training Methodology We use source code from nine open source JAVA projects: Ant, Cassandra, Log4j, Maven, MinorThird, Batik, Lucene, Xalan and Xerces. For each project, we divide the source files into a training and testing dataset. Then, for each project in turn, we consider the following three main training scenarios, leading to using three training datasets. To emulate a scenario in which we are predicting comments in the middle of project development, we can use data (documented code) from the same project. In this case, we use the in-project training dataset (IN). Alternatively, if we train a comment prediction model at the beginning of the development, we need to use source files from other, possibly related projects. To analyze this scenario, for each of the projects above we train models using an out-of-project dataset (OUT) containing data from the other eight projects. Typically, source code files contain a greater amount ofcode versus comment text. Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. We use data from the popular Q&A; website StackOverflow (SO) where users ask and answer technical questions about software development, tools, algorithms, etc’ . We downloaded a dataset of all actions performed on the site since it was launched in August 2008 until August 2012. The data includes 3,453,742 questions and 6,858,133 answers posted by 1,295,620 users. We used only posts that are tagged as JAVA related questions and answers. All the models for each project are then tested on the testing set of that project. We report results averaged over all projects in Table 1. Source files were tokenized using the Eclipse JDT compiler tools, separating code tokens and identifiers. Identifier names (of classes, methods and variables), were further tokenized by camel case notation (e.g., ’minMargin’ was converted to ’min margin’). Non alpha-numeric tokens (e.g., dot, semicolon) were discarded from the code, as well as numeric and single character literals. Text from comments or any string literals within the code were further tokenized with the Mallet statistical natural language processing package (Mc- Callum, 2002). Posts from SO were parsed using the Apache Tika toolkit1 and then tokenized with the Mallet package. We considered as raw code tokens anything labeled using a markup (as indicated by the SO users who wrote the post). 3.2 Evaluation Since our models are trained using various data sources the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric which aims at quantifying the percentage of characters that can be saved by using the model in a word-completion settings, similar to standard code completion tools built into code editors. For a comment word with n characters, w = w1, . . . , wn, we predict the two most likely words given each model filtered by the first 0, . . . , n characters ofw. Let k be the minimal ki for which w is in the top two predicted word tokens where tokens are filtered by the first ki characters. Then, the number of saved characters for w is n k. In Table 1we report the average percentage o−f ksa.v Iend T Tcahbalera 1cte wrse per ocrotm thmee avnet using eearcchen not-f the above models. The final results are also averaged over the nine input projects. As an example, in the predicted comment shown in Table 2, taken from the project Minor-Third, the token entity is the most likely token according to the model SO trigram, out of tokens starting with the prefix ’en’ . The saved characters in this case are ’tity’ . − 4 Results Table 1 displays the average percentage of characters saved per class comment using each of the models. Models trained on in-project data (IN) perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47. 1% characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments, and identifier names that appear in the comment of one class may appear in the code of another class in the same project. Clearly, in-project data should be used when available as it improves comment prediction leading to an average increase of between 6% for the worst model (26.6 for OUT unigram versus 33.05 for IN) and 14% for the best (32.96 for OUT trigram versus 47. 1for IN). 1http://tika.apache.org/ 37 Model n / topics n-gram LDA Link-LDA 1 2 3 20 10 5 1 20 10 5 1 IN 33.05 (3.62) 43.27 (5.79) 47.1 (6.87) 34.20 (3.63) 33.93 (3.67) 33.63 (3.67) 33.05 (3.62) 35.76 (3.95) 35.81 (4.12) 35.37 (3.98) 34.59 (3.92) OUT 26.6 (3.37) 31.52 (4.17) 32.96 (4.33) 26.79 (3.26) 26.8 (3.36) 26.86 (3.44) 26.6 (3.37) 28.03 (3.60) 28 (3.56) 28 (3.67) 27.82 (3.62) SO 27.8 (3.51) 33.29 (4.40) 34.56 (4.78) 27.25 (3.67) 27.22 (3.44) 27.34 (3.55) 27.8 (3.51) 28.08 (3.48) 28.12 (3.58) 27.94 (3.56) 27.9 (3.45) Table 1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on three training sets: IN, OUT, and SO. The results are averaged over nine JAVA projects (with standard deviations in parenthesis). Model Predicted Comment trigram IN link-LDA OUT trigram SO trigram “Train “Train “Train “Train IN named-entity a named-entity a named-entity a named-entity a extractor“ extractor“ extractor“ extractor“ Table 2: Sample comment from the Minor-Third project predicted using IN, OUT and SO based models. Saved characters are underlined. Of the out-of-project data sources, models using a greater amount of text (SO) mostly outperformed models based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger word dictionary associated with the SO data. Note that in the scope of this work we did not investigate the contribution of each of the background projects used in OUT, and how their relevance to the target prediction project effects their performance. The trigram model shows the best performance across all training data sources (47% for IN, 32% for OUT and 34% for SO). Amongst the tested topic models, link-LDA models which distinguish code and text tokens perform consistently better than simple LDA models in which all tokens are considered as text. We did not however find a correlation between the number of latent topics learned by a topic model and its performance. In fact, for each of the data sources, a different num- ber of topics gave the optimal character saving results. Note that in this work, all topic models are based on unigram tokens, therefore their results are most comparable with that of the unigram in Dataset n-gram link-LDA IN 2778.35 574.34 OUT 1865.67 670.34 SO 1898.43 638.55 Table 3: Average words per project for which each tested model completes the word better than the other. This indicates that each of the models is better at predicting a different set of comment words. Table 1, which does not benefit from the backoff strategy used by the bigram and trigram models. By this comparison, the link-LDA topic model proves more successful in the comment prediction task than the simpler models which do not distin- guish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown). Table 2 shows a sample comment segment for which words were predicted using trigram models from all training sources and an in-project linkLDA. The comment is taken from the TrainExtractor class in the Minor-Third project, a machine learning library for annotating and categorizing text. Both IN models show a clear advantage in completing the project-specific word Train, compared to models based on out-of-project data (OUT and SO). Interestingly, in this example the trigram is better at completing the term namedentity given the prefix named. However, the topic model is better at completing the word extractor which refers to the target class. This example indicates that each model type may be more successful in predicting different comment words, and that combining multiple models may be advantageous. 38 This can also be seen by the analysis in Table 3 where we compare the average number of words completed better by either the best n-gram or topic model given each training dataset. Again, while n-grams generally complete more words better, a considerable portion of the words is better completed using a topic model, further motivating a hybrid solution. 5 Conclusions We analyze the use of language models for predicting class comments for source file documents containing a mixture of code and text tokens. Our experiments demonstrate the effectiveness of using language models for comment completion, showing a saving of up to 47% of the comment characters. When available, using in-project training data proves significantly more successful than using out-of-project data. However, we find that when using out-of-project data, a dataset based on more words than code performs consistently better. The results also show that different models are better at predicting different comment words, which motivates a hybrid solution combining the advantages of multiple models. Acknowledgments This research was supported by the NSF under grant CCF-1247088. References Ramnath Balasubramanyan and William W Cohen. 2011. Block-lda: Jointly modeling entity-annotated text and entity-entity links. In Proceedings ofthe 7th SIAM International Conference on Data Mining. Dave Binkley, Matthew Hearn, and Dawn Lawrie. 2011. Improving identifier informativeness using part of speech information. In Proc. of the Working Conference on Mining Software Repositories. ACM. David M Blei and Michael I Jordan. 2003. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM. David M Blei, Andrew Y Ng, and Michael IJordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research. Elena Erosheva, Stephen Fienberg, and John Lafferty. 2004. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America. Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth? automatic caption generation for news images. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE transactions on pattern analysis and machine intelligence. Mark Gabel and Zhendong Su. 2008. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 339–349. ACM. Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America. Sangmok Han, David R Wallace, and Robert C Miller. 2009. Code completion from abbreviated input. In Automated Software Engineering, 2009. ASE’09. 24th IEEE/ACM International Conference on, pages 332–343. IEEE. Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE. Ferosh Jacob and Robert Tairas. 2010. Code template inference using language models. In Proceedings of the 48th Annual Southeast Regional Conference. ACM. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., volume 1, pages 181–184. IEEE. Naveen Kumar and Benjamin Carterette. 2013. Time based feedback and query expansion for twitter search. In Advances in Information Retrieval, pages 734–737. Springer. Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. Whats in a name? a study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE. Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. Adam Pauls and Dan Klein. 2011. Faster and smaller language models. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 258–267. n-gram Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 103–1 12. IEEE. 39 Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8): 1270–1278. David Shepherd, Zachary P Fry, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2007. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th international conference on Aspect-oriented software development, pages 212–224. ACM. Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pages 43–52. ACM. Yuen-Hsien Tseng and Da-Wei Juang. 2003. Document-self expansion for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 399–400. ACM. Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Single document summarization with document expansion. In Proc. of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Roung-Shiunn Wu and Po-Chun Li. 2011. Video annotation using hierarchical dirichlet process mixture model. Expert Systems with Applications, 38(4):3040–3048. 40
6 0.86149025 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
8 0.66289276 209 acl-2013-Joint Modeling of News Readerâ•Žs and Comment Writerâ•Žs Emotions
9 0.65445638 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing
10 0.65102744 163 acl-2013-From Natural Language Specifications to Program Input Parsers
11 0.6500724 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study
12 0.64316308 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
13 0.64141631 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
14 0.62841576 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
15 0.62675059 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
16 0.62471259 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
17 0.62160152 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction
18 0.6169191 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
19 0.61515194 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction
20 0.60947931 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification