328 acl-2013-Stacking for Statistical Machine Translation


Source: pdf

Author: Majid Razmara ; Anoop Sarkar

Abstract: We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. [sent-2, score-0.536]

2 A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. [sent-3, score-1.168]

3 Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model. [sent-4, score-0.176]

4 The idea behind ensemble learning is to combine multiple models, weak learners, in an attempt to produce a strong model with less error. [sent-6, score-0.658]

5 , 2008; Sang, 2002) and recently has attracted attention in the statistical machine translation community in various work (Xiao et al. [sent-11, score-0.157]

6 In this paper, we propose a method to adopt stacking (Wolpert, 1992), an ensemble learning technique, to SMT. [sent-15, score-0.764]

7 We manipulate the full set of training data, creating k disjoint sets of held-out and held-in data sets as in k-fold cross-validation and build a model on each partition. [sent-16, score-0.106]

8 This creates a diverse ensemble of statistical machine translation models where each member of the ensemble has different feature function values for the SMT log-linear model (Koehn, 2010). [sent-17, score-1.03]

9 The weights of each model are then tuned using minimum error rate training (Och, 2003) on the held-out fold to provide k weak models. [sent-18, score-0.349]

10 ∗This research was partially supported by an NSERC, Canada (RGPIN: 264905) grant and a Google Faculty Award to the second author. [sent-19, score-0.047]

11 We then create a strong model by stacking another meta-learner on top of weak models to combine them into a single model. [sent-20, score-0.671]

12 The particular second-tier model we use is a model combination approach called ensemble decoding which combines hypotheses from the weak models on-the-fly in the decoder. [sent-21, score-0.718]

13 Using this approach, we take advantage of the diversity created by manipulating the training data and obtain a significant and consistent improvement over a conventionally trained SMT model with a fixed training and tuning set. [sent-22, score-0.375]

14 2 Ensemble Learning Methods. Two well-known instances of the general framework of ensemble learning are bagging and boosting. [sent-23, score-0.529]

15 Bagging (Breiman, 1996a) (bootstrap aggregating) takes a number of samples with replacement from a training set. [sent-24, score-0.118]

16 The generated sample set may have 0, 1 or more instances of each original training instance. [sent-25, score-0.071]

17 This procedure is repeated a number of times and the base learner is applied to each sample to produce a weak learner. [sent-26, score-0.393]

18 These models are aggregated by doing a uniform voting for classification or averaging the predictions for regression. [sent-27, score-0.166]
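
The bagging procedure described in rows 15 to 18 can be illustrated with a minimal sketch. The base learner here (`fit_mean_model`) is a hypothetical toy that always predicts the mean of its sample; it is not part of the paper and only serves to show sampling with replacement followed by averaging.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement, so an original
    instance may appear 0, 1 or more times in the sample."""
    return [rng.choice(data) for _ in data]

def fit_mean_model(sample):
    """Hypothetical toy base learner: always predicts the sample mean."""
    m = mean(sample)
    return lambda x: m

def bagged_predict(models, x):
    """Aggregate weak learners by averaging their predictions (regression)."""
    return mean(model(x) for model in models)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [1.0, 2.0, 3.0, 4.0, 5.0]
models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(10)]
prediction = bagged_predict(models, x=None)
```

For classification, the averaging step would be replaced by a uniform vote over the predicted labels.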

19 Bagging reduces the variance of the base model while leaving the bias relatively unchanged and is most useful when a small change in the training data affects the prediction of the model (i. [sent-28, score-0.226]

20 , 2011) Boosting (Schapire, 1990) constructs a strong learner by repeatedly choosing a weak learner and applying it on a re-weighted training set. [sent-33, score-0.416]

21 In each iteration, a weak model is learned on the training data, whose instance weights are modified from the previous iteration to concentrate on examples on which the model predictions were poor. [sent-34, score-0.249]

22 By putting more weight on the wrongly predicted examples, a diverse ensemble of weak learners is created. [sent-35, score-0.72]

23 Input: $D = \{\langle f_j, e_j \rangle\}_{j=1}^{N}$, a parallel corpus. Input: $k$, the number of folds (i. [sent-44, score-0.052]

24 for $i = 1 \to k$ do: $T^i \leftarrow D - D^i$ (use all but the current partition as the training set). [sent-52, score-0.051]

25 , $M_k$): combine all the base models to produce a str... [sent-58, score-0.054]
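
The loop in the algorithm fragments above can be sketched as follows. This is only an outline under stated assumptions: `train`, `tune` and `combine_models` are hypothetical callables standing in for feature-function training, MERT tuning and ensemble decoding, and the round-robin `split` is one of many ways to form k disjoint folds.

```python
def split(data, k):
    """Partition data into k disjoint folds (round-robin assignment)."""
    return [data[i::k] for i in range(k)]

def stack(data, k, train, tune, combine_models):
    """Train on all but one fold, tune on the held-out fold, repeat for
    every fold, then combine the k tuned base models."""
    folds = split(data, k)
    models = []
    for i, held_out in enumerate(folds):
        # Held-in data: every fold except the current one.
        held_in = [pair for j, fold in enumerate(folds) if j != i
                   for pair in fold]
        features = train(held_in)
        models.append(tune(features, held_out))
    return combine_models(models)

# Toy stand-ins that only record how much data each step received.
data = list(range(10))
result = stack(data, 5,
               train=len,
               tune=lambda features, held_out: (features, len(held_out)),
               combine_models=list)
```

With 10 items and 5 folds, each toy "model" records that it was trained on 8 items and tuned on the remaining 2.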

26 Stacking (or stacked generalization) (Wolpert, 1992) is another ensemble learning algorithm that uses a second-level learning algorithm on top of the base learners to reduce the bias. [sent-64, score-0.706]

27 , $g_k$ where $g_i : \mathbb{R}^d \to \mathbb{R}$, receiving input $x \in \mathbb{R}^d$ and producing a prediction $g_i(x)$. [sent-68, score-0.064]

28 The next level consists of a single function $h : \mathbb{R}^{d+k} \to \mathbb{R}$ that takes $\langle x, g_1(x)$ , . [sent-69, score-0.039]

29 Two categories of ensemble learning are homogeneous learning and heterogeneous learning. [sent-76, score-0.379]

30 In homogeneous learning, a single base learner is used, and diversity is generated by data sampling, feature sampling, randomization and parameter settings, among other strategies. [sent-77, score-0.319]

31 In heterogeneous learning different learning algorithms are applied to the same training data to create a pool of diverse models. [sent-78, score-0.132]

32 In this paper, we focus on homogeneous ensemble learning by manipulating the training data. [sent-79, score-0.542]

33 In the primary form of stacking (Wolpert, 1992), the training data is split into multiple disjoint sets of held-out and held-in data sets using k-fold cross-validation and k models are trained on the held-in partitions and run on held-out partitions. [sent-80, score-0.592]

34 Then a meta-learner uses the predictions of all models on their held-out sets and the actual labels to learn a final model. [sent-81, score-0.054]

35 Breiman (1996b) linearly combines the weak learners in the stacking framework. [sent-83, score-0.665]

36 The weights of the base learners are learned using ridge regression: $s(x) = \sum_k \alpha_k m_k(x)$, where $m_k$ is a base model trained on the k-th partition of the data and $s$ is the resulting strong model created by linearly interpolating the weak learners. [sent-84, score-0.735]
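
As a rough illustration of the linear combination in row 36, the sketch below fits interpolation weights by plain ridge regression on held-out predictions and then applies them. It is an assumption-laden toy: the helper names, the tiny data set and the closed-form solver are illustrative, and any further constraints used in the cited work are not modelled.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ridge_weights(preds, targets, lam):
    """preds[i][j] is the prediction of base model j on example i.
    Returns the weights minimising squared error + lam * ||alpha||^2."""
    k = len(preds[0])
    gram = [[sum(p[a] * p[b] for p in preds) + (lam if a == b else 0.0)
             for b in range(k)] for a in range(k)]
    rhs = [sum(p[a] * y for p, y in zip(preds, targets)) for a in range(k)]
    return solve(gram, rhs)

def strong_predict(alpha, base_preds):
    """Linear interpolation of the base predictions with weights alpha."""
    return sum(a * p for a, p in zip(alpha, base_preds))

preds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [2.0, 3.0, 5.0]
alpha_plain = ridge_weights(preds, targets, lam=0.0)
alpha_ridge = ridge_weights(preds, targets, lam=1.0)
```

Setting `lam` above zero shrinks the weights toward zero, which is the regularising effect ridge regression adds over ordinary least squares.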

37 Stacking (aka blending) has been used in the system that won the Netflix Prize, which used a multi-level stacking algorithm. [sent-85, score-0.385]

38 Stacking has been actively used in statistical parsing: Nivre and McDonald (2008) integrated two models for dependency parsing by letting one model learn from features generated by the other; F. [sent-86, score-0.102]

39 3 Our Approach In this paper, we propose a method to apply stacking to statistical machine translation (SMT) and our method is the first to successfully exploit stacking for statistical machine translation. [sent-90, score-1.012]

40 We use a standard statistical machine translation engine and produce multiple diverse models by partitioning the training set using the k-fold cross-validation technique. [sent-91, score-0.418]

41 A diverse ensemble of weak systems is created by learning a model on each set of k − 1 folds and tuning the statistical machine translation log-linear weights on the remaining fold. [sent-92, score-0.812]

42 However, instead of learning a model on the output of base models as in (Wolpert, 1992), we combine hypotheses from the base models in the decoder with uniform weights. [sent-93, score-0.508]

43 For the base learner, we use Kriya (Sankaran et al. [sent-94, score-0.155]

44 , 2012), an in-house hierarchical phrase-based machine translation system, to produce multiple weak models. [sent-95, score-0.33]

45 These models are combined together using Ensemble Decoding (Razmara et al. [sent-96, score-0.054]

46 1 Ensemble Decoding SMT Log-linear models (Koehn, 2010) find the most likely target language output e given the source language input f using a vector of feature functions φ: p(e|f) ∝ exp? [sent-100, score-0.054]
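
The extracted formula in row 46 is cut off after "exp". Assuming the usual log-linear form, in which the score of a translation is the exponentiated dot product of a weight vector and the feature vector, a minimal sketch of turning feature vectors into a distribution over candidate translations could look like this. The candidate strings, feature values and weights are made up for illustration.

```python
import math

def loglinear_probs(weights, candidates):
    """candidates maps each candidate translation e to its feature
    vector phi(e, f). Returns p(e | f) proportional to exp(w . phi)."""
    scores = {e: sum(w * v for w, v in zip(weights, phi))
              for e, phi in candidates.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {e: math.exp(s - top) for e, s in scores.items()}
    z = sum(unnorm.values())
    return {e: u / z for e, u in unnorm.items()}

weights = [1.0, 0.5]
candidates = {
    "the house": [-1.0, -2.0],  # hypothetical log-domain feature values
    "house the": [-3.0, -2.0],
}
probs = loglinear_probs(weights, candidates)
best = max(probs, key=probs.get)
```

Since decoding only needs the highest-scoring candidate, the normalisation step is optional in practice; it is kept here to make the proportionality explicit.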

47 Ensemble decoding combines several models dynamically at decoding time. [sent-104, score-0.196]

48 The scores are combined for each partial hypothesis using a user-defined mixture operation ⊗ over component models. [sent-105, score-0.119]

49 We previously successfully applied ensemble decoding to domain adaptation in SMT and showed that it performed better than approaches that pre-compute linear mixtures of different models (Razmara et al. [sent-111, score-0.504]

50 Several mixture operations were proposed, allowing the user to encode belief about the relative strengths of the component models. [sent-113, score-0.119]

51 These mixture operations receive two or more probabilities and return the mixture probability $p(\bar{e} \mid \bar{f})$ for each rule $\bar{e}, \bar{f}$ used in the decoder. [sent-114, score-0.214]

52 Different options for these operations are: • Weighted Sum (wsum) is defined as: $p(\bar{e} \mid \bar{f}) \propto \sum_{m}^{M} \lambda_m \exp(\ldots)$ [sent-115, score-0.046]

53 where $m$ denotes the index of component models, $M$ is the total number of them and $\lambda_m$ is the weight for component $m$. [sent-117, score-0.07]

54 • Weighted Max (wmax) is defined as: $p(\bar{e} \mid \bar{f}) \propto \max_m (\ldots)$ [sent-118, score-0.047]

55 • Model Switching (Switch): Each cell in the CKY chart is populated only by rules from one of the models and the other models’ rules are discarded. [sent-124, score-0.115]

56 A binary indicator function $\delta(\bar{f}, m)$ picks a component model for each span: $\delta(\bar{f}, m) = 1$ if $m = \arg\max_{n \in M} \psi(\bar{f}, n)$, and $0$ otherwise. The criteria for choosing a model for each cell, $\psi(\bar{f}, n)$, could be based on max [sent-126, score-0.076]

57 Table 1: Statistics of the training set for different systems and different language pairs. Fr - En (train size: source tokens, target tokens): 0+dev: 67K, 58K; 10k+dev: 365K, 327K; 100k+dev: 3M, 2.8M. [sent-129, score-0.071]

58 for each cell, the model that has the highest weighted score wins: $\psi(\bar{f}, n) = \lambda_n \max_{\bar{e}} (w_n \cdot \phi_n(\bar{e}, \bar{f}))$ Alternatively, we can pick the model with highest weighted sum of the probabilities of the rules (SW:SUM). [sent-132, score-0.05]

59 This sum has to take into account the translation table limit (ttl), on the number of rules suggested by each model for each cell: $\psi(\bar{f}, n) = \lambda_n \sum \exp(\ldots)$ [sent-133, score-0.122]
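
The mixture operations listed above can be sketched in a few lines. The extracted wsum and wmax formulas are truncated after "exp" and "max", so the sketch assumes that each component contributes the exponentiated model score weighted by its λ; the switching criterion follows the max-based ψ given in row 58. All function names and numbers are illustrative.

```python
import math

def wsum(scores, lambdas):
    """Weighted sum: sum over components of lambda_m * exp(score_m),
    where score_m is assumed to be the log-linear score of model m."""
    return sum(l * math.exp(s) for l, s in zip(lambdas, scores))

def wmax(scores, lambdas):
    """Weighted max: the largest lambda_m * exp(score_m) (assumed form)."""
    return max(l * math.exp(s) for l, s in zip(lambdas, scores))

def switch_max(cell_scores, lambdas):
    """Model switching with the max criterion: cell_scores[n] holds the
    rule scores of model n for one chart cell, psi_n is lambda_n times
    the best of them, and the index of the winning model is returned."""
    psi = [l * max(s) for l, s in zip(lambdas, cell_scores)]
    return max(range(len(psi)), key=psi.__getitem__)

lambdas = [0.5, 0.5]  # uniform weights over two component models
mixed_sum = wsum([0.0, math.log(2.0)], lambdas)
mixed_max = wmax([0.0, math.log(2.0)], lambdas)
winner = switch_max([[-1.0, -2.0], [-0.5, -3.0]], lambdas)
```

In the switching case only the rules of the winning model would be kept for that cell, while wsum and wmax keep every rule and merely rescore it.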

60 For the base models, we used an in-house implementation of hierarchical phrase-based systems, Kriya (Sankaran et al. [sent-136, score-0.198]

61 Table 2: Testset BLEU scores when applying stacking on the devset only (using no specific training set). [sent-185, score-0.727]

62 Direction: Fr - En, Es - En. Columns: Corpus, k-fold, Baseline, BMA, WSUM, WMAX, PROD, SW:MAX. [sent-186, score-0.061]

63 Table 3: Testset BLEU scores when using 10k and 100k sentence training sets along with the devset. [sent-200, score-0.071]

64 1 Training on devset We first consider the scenario in which there is no parallel data between a language pair except a small bi-text used as a devset. [sent-202, score-0.271]

65 We use no specific training data and construct a SMT system completely on the devset by using our approach and compare to two different baselines. [sent-203, score-0.342]

66 A natural baseline when having a limited parallel text is to do re-substitution validation where the model is trained on the whole devset and is tuned on the same set. [sent-204, score-0.309]

67 The second baseline is the mean of BLEU scores of all base models. [sent-206, score-0.155]

68 Table 2 summarizes the BLEU scores on the testset when using stacking only on the devset on two different language pairs. [sent-207, score-0.723]

69 As the table shows, increasing the number of folds results in higher BLEU scores. [sent-208, score-0.052]

70 However, doing such will generally lead to higher variance among base learners. [sent-209, score-0.155]

71 Figure 1 shows the BLEU score of each of the base models resulted from a 20-fold partitioning of the devset along with the strong models’ BLEU scores. [sent-210, score-0.602]

72 As the figure shows, the strong models are generally superior to the base models whose mean is represented as a horizontal line. [sent-211, score-0.349]

73 2 Training on train+dev When we have some training data, we can use the cross-validation-style partitioning to create k splits. [sent-213, score-0.146]

74 We then train a system on k − 1 folds and tune on the devset. [sent-214, score-0.099]

75 However, each system eventually wastes a fold of the training data. [sent-215, score-0.133]

76 In order to take advantage of that remaining fold, we concatenate the devset to the training set and partition the whole union. [sent-216, score-0.393]

77 We experimented with two sizes of training data: 10k and 100k sentence pairs, which, with the addition of the devset, give 12k and 102k sentence-pair corpora. [sent-218, score-0.112]

78 Table 3 reports the BLEU scores when using stacking on these two corpus sizes. [sent-220, score-0.385]

79 The baselines are the conventional systems which are built on the training-set only and tuned on the devset as well as Bayesian Model Averaging (BMA, see §5). [sent-221, score-0.309]

80 For the 100k+dev corpus, we sampled 11 partitions from all 51 possible partitions by taking every fifth partition as training data. [sent-222, score-0.216]
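
The partition counts in row 80 can be checked with a short sketch: taking every fifth partition out of 51 leaves 11. The 0-based indexing and the choice to start from the first partition are assumptions, since the source does not say which partition the sampling starts at.

```python
def every_nth(partitions, step):
    """Keep every step-th partition, starting from the first one."""
    return partitions[::step]

all_partitions = list(range(51))  # hypothetical partition indices 0..50
sampled = every_nth(all_partitions, 5)
```

Each sampled partition would then define one base system, so the ensemble has 11 members instead of 51.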

81 The results in Table 3 show that stacking can improve over the baseline BLEU scores by up to 4 points. [sent-223, score-0.385]

82 Examining the performance of the different mixture operations, we can see that WSUM and WMAX typically outperform other mixture operations. [sent-224, score-0.168]

83 Different mixture operations can be dominant in different language pairs and different sizes of training sets. [sent-225, score-0.242]

84 (2013) have applied both boosting and bagging on three different statistical machine translation engines: phrase-based (Koehn et al. [sent-227, score-0.386]

85 , 2003), hierarchical phrase-based (Chiang, 2005) and syntax-based (Galley et al. [sent-228, score-0.043]

86 (2009) creates an ensemble of models by using the feature subspace method in the machine learning literature (Ho, 1998). [sent-231, score-0.523]

87 Each member of the ensemble is built by removing one non-LM feature in the log-linear framework or varying the order of the language model. [sent-232, score-0.379]

88 Finally they use a sentence-level system combination on the outputs of the base models to pick the best system for each [sent-233, score-0.209]

89 Figure 1: BLEU scores for all the base models and stacked models on the Fr-En devset with 20-fold cross validation. [sent-241, score-0.604]

90 The horizontal line shows the mean of base models’ scores. [sent-242, score-0.194]

91 However, they do not combine the hypothesis search spaces of individual base models. [sent-244, score-0.245]

92 (2010) which uses Bayesian model averaging (BMA) (Hoeting et al. [sent-246, score-0.068]

93 They used sampling without replacement to create a number of base models whose phrase-tables are combined with that of the baseline (trained on the full training-set) using linear mixture models (Foster and Kuhn, 2007). [sent-248, score-0.429]

94 Empirical results (Table 3) also show that our approach outperforms the Bayesian model averaging approach (BMA). [sent-251, score-0.068]

95 6 Conclusion & Future Work. In this paper, we proposed a novel method for applying stacking to the statistical machine translation task. [sent-252, score-0.542]

96 The results when using no, 10k and 100k sentence-pair training sets (along with a development set for tuning) show that stacking can yield an improvement of up to 4 BLEU points over conventionally trained SMT models which use a fixed training and tuning set. [sent-253, score-0.692]

97 Future work includes experimenting with larger training sets to investigate how useful this approach can be when having different sizes of training data. [sent-254, score-0.183]

98 Translation model generalization using probability averaging for machine translation. [sent-275, score-0.105]

99 Scalable inference and training of context-rich syntactic translation models. [sent-294, score-0.143]

100 Mixing multiple translation models in statistical machine translation. [sent-348, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stacking', 0.385), ('ensemble', 0.379), ('devset', 0.271), ('weak', 0.178), ('base', 0.155), ('bagging', 0.15), ('smt', 0.141), ('wolpert', 0.136), ('dev', 0.127), ('xiao', 0.125), ('razmara', 0.125), ('bma', 0.123), ('prod', 0.123), ('wsum', 0.123), ('bleu', 0.121), ('sankaran', 0.108), ('wmax', 0.108), ('learners', 0.102), ('xm', 0.098), ('kriya', 0.092), ('lagarda', 0.092), ('sw', 0.09), ('mixture', 0.084), ('duan', 0.084), ('boosting', 0.079), ('partitioning', 0.075), ('translation', 0.072), ('training', 0.071), ('decoding', 0.071), ('majid', 0.071), ('subspace', 0.071), ('breiman', 0.071), ('stacked', 0.07), ('exp', 0.069), ('anoop', 0.068), ('averaging', 0.068), ('testset', 0.067), ('diversity', 0.067), ('conventionally', 0.064), ('gk', 0.064), ('fold', 0.062), ('evev', 0.061), ('hoeting', 0.061), ('tomeh', 0.061), ('cell', 0.061), ('diverse', 0.061), ('wm', 0.06), ('learner', 0.06), ('manipulating', 0.055), ('models', 0.054), ('combine', 0.054), ('baskaran', 0.054), ('casacuberta', 0.054), ('folds', 0.052), ('partition', 0.051), ('tong', 0.051), ('stroudsburg', 0.051), ('martins', 0.05), ('sum', 0.05), ('statistical', 0.048), ('partitions', 0.047), ('tuning', 0.047), ('mk', 0.047), ('mmax', 0.047), ('och', 0.047), ('tune', 0.047), ('strong', 0.047), ('replacement', 0.047), ('operations', 0.046), ('surdeanu', 0.046), ('song', 0.044), ('voting', 0.044), ('koehn', 0.043), ('hierarchical', 0.043), ('foster', 0.042), ('sizes', 0.041), ('max', 0.041), ('en', 0.041), ('jingbo', 0.041), ('leo', 0.041), ('pa', 0.04), ('xe', 0.039), ('horizontal', 0.039), ('summit', 0.039), ('tuned', 0.038), ('fr', 0.037), ('machine', 0.037), ('nivre', 0.037), ('homogeneous', 0.037), ('hypotheses', 0.036), ('rd', 0.036), ('els', 0.036), ('sampling', 0.035), ('disjoint', 0.035), ('component', 0.035), ('bayesian', 0.034), ('predictors', 0.034), ('es', 0.033), ('nan', 0.032), ('europarl', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 328 acl-2013-Stacking for Statistical Machine Translation

2 0.14807168 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa

Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.

3 0.13201544 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

Author: Boxing Chen ; Roland Kuhn ; George Foster

Abstract: This paper proposes a new approach to domain adaptation in statistical machine translation (SMT) based on a vector space model (VSM). The general idea is first to create a vector profile for the in-domain development (“dev”) set. This profile might, for instance, be a vector with a dimensionality equal to the number of training subcorpora; each entry in the vector reflects the contribution of a particular subcorpus to all the phrase pairs that can be extracted from the dev set. Then, for each phrase pair extracted from the training data, we create a vector with features defined in the same way, and calculate its similarity score with the vector representing the dev set. Thus, we obtain a decoding feature whose value represents the phrase pair’s closeness to the dev. This is a simple, computationally cheap form of instance weighting for phrase pairs. Experiments on large scale NIST evaluation data show improvements over strong baselines: +1.8 BLEU on Arabic to English and +1.4 BLEU on Chinese to English over a non-adapted baseline, and significant improvements in most circumstances over baselines with linear mixture model adaptation. An informal analysis suggests that VSM adaptation may help in making a good choice among words with the same meaning, on the basis of style and genre.

4 0.11917974 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in performance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.

5 0.11532369 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

Author: Lemao Liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao

Abstract: Most statistical machine translation (SMT) systems are modeled using a log-linear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.

6 0.11088743 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

7 0.10799544 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

8 0.10564977 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

9 0.10432609 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

10 0.096431807 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

11 0.095628761 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

12 0.089031093 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

13 0.084130384 314 acl-2013-Semantic Roles for String to Tree Machine Translation

14 0.083204612 24 acl-2013-A Tale about PRO and Monsters

15 0.082376458 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

16 0.082242593 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

17 0.078996465 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

18 0.078941263 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

19 0.078682713 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

20 0.078457981 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.193), (1, -0.106), (2, 0.106), (3, 0.067), (4, 0.002), (5, 0.043), (6, 0.055), (7, 0.003), (8, -0.001), (9, 0.03), (10, -0.002), (11, 0.04), (12, -0.057), (13, -0.003), (14, -0.001), (15, 0.051), (16, -0.056), (17, 0.006), (18, 0.038), (19, -0.012), (20, 0.082), (21, 0.04), (22, 0.023), (23, 0.041), (24, 0.06), (25, 0.032), (26, 0.039), (27, -0.004), (28, -0.033), (29, 0.019), (30, 0.139), (31, 0.003), (32, -0.082), (33, -0.028), (34, -0.014), (35, 0.017), (36, -0.046), (37, -0.044), (38, 0.023), (39, -0.006), (40, -0.007), (41, 0.016), (42, -0.01), (43, 0.089), (44, 0.065), (45, -0.013), (46, -0.025), (47, -0.037), (48, -0.039), (49, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9203921 328 acl-2013-Stacking for Statistical Machine Translation

2 0.73883188 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

Author: Kristina Toutanova ; Byung-Gyu Ahn

Abstract: In this paper we show how to automatically induce non-linear features for machine translation. The new features are selected to approximately maximize a BLEU-related objective and decompose on the level of local phrases, which guarantees that the asymptotic complexity of machine translation decoding does not increase. We achieve this by applying gradient boosting machines (Friedman, 2000) to learn new weak learners (features) in the form of regression trees, using a differentiable loss function related to BLEU. Our results indicate that small gains in performance can be achieved using this method but we do not see the dramatic gains observed using feature induction for other important machine learning tasks.

3 0.73707241 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.

4 0.72585028 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa

Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.

5 0.72125256 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation. Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm. 
The decoder is written in C++ and follows an object-oriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on GitHub [1] to make it easy for other researchers to use it in their own experiments.

[1] https://github.com/chardmeier/docent/wiki

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 193–198, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

2 Document-Level Decoding with Local Search

Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target language through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models.

The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit.
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. 
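The relationship between hill climbing and simulated annealing described above can be illustrated with a minimal Python sketch. It is not Docent's C++ implementation: the function `local_search`, its parameters and the toy scoring function are hypothetical names chosen for illustration, and the acceptance probability exp(delta / temperature) is the textbook Metropolis criterion, assumed here because the text does not state the exact formula Docent uses.

```python
import math
import random

def local_search(initial, propose, score, temperature=0.0,
                 max_steps=10_000, max_rejections=1_000, rng=None):
    """Hill climbing when temperature == 0, simulated annealing otherwise."""
    rng = rng or random.Random(0)
    state, current = initial, score(initial)
    rejected = 0                      # consecutive rejected proposals
    for _ in range(max_steps):        # stop after a maximum number of steps
        candidate = propose(state, rng)
        new = score(candidate)
        delta = new - current
        if delta > 0:
            accept = True             # moves to better states are always taken
        elif temperature > 0:
            # Assumed Metropolis rule: a non-improving move is accepted
            # with a probability that shrinks as the score loss grows.
            accept = rng.random() < math.exp(delta / temperature)
        else:
            accept = False            # plain hill climbing rejects the move
        if accept:
            state, current, rejected = candidate, new, 0
        else:
            rejected += 1
            if rejected >= max_rejections:
                break                 # no acceptable successor was found
    return state, current

# Toy problem (hypothetical): maximise -(x - 3)^2 over the integers.
def propose_step(x, rng):
    return x + rng.choice((-1, 1))

def toy_score(x):
    return -(x - 3) ** 2

best, best_score = local_search(0, propose_step, toy_score)
```

With `temperature=0.0` the loop behaves as plain hill climbing, which mirrors the statement that hill climbing is the special case of simulated annealing at a fixed temperature of 0. A local beam search variant would keep a list of states instead of a single one.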
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained by DP search.

In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The move-phrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearise-phrases can reorder a sequence of phrases into random order and back into the order corresponding to the source language.

Since the search algorithm in Docent is stochastic, repeated runs of the decoder will generally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a).

3 Available Feature Models

In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder.
They have not yet reached a state of maturity that would make them suitable for production use.

The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used.

DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight of our distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations.

The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of document-level readability models.
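The distortion limit indicator feature discussed above can be sketched as follows. This is an illustration under stated assumptions, not Docent's code: the function names are hypothetical, a phrase is represented only by the source span it covers, and the jump distance is the usual phrase-based definition (distance between the end of the previously translated source span and the start of the next one), which may differ in detail from what Docent computes.

```python
from typing import List, Tuple

# (first source position, last source position), both inclusive
SourceSpan = Tuple[int, int]

def distortion_limit_violations(spans_in_target_order: List[SourceSpan],
                                limit: int) -> int:
    """Count reordering jumps that are longer than the distortion limit."""
    violations = 0
    prev_end = -1                       # position before the first source word
    for start, end in spans_in_target_order:
        jump = abs(start - prev_end - 1)
        if jump > limit:
            violations += 1
        prev_end = end
    return violations

def distortion_limit_score(spans_in_target_order: List[SourceSpan],
                           limit: int, weight: float) -> float:
    """Weighted penalty; a very large weight makes the limit effectively hard."""
    return -weight * distortion_limit_violations(spans_in_target_order, limit)

monotone = [(0, 1), (2, 2), (3, 5)]     # phrases kept in source order
reordered = [(3, 5), (0, 1), (2, 2)]    # last phrase moved to the front
```

Because the limit is expressed as a weighted feature rather than a hard search constraint, a smaller weight lets the decoder trade an occasional violation against gains from the other models.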
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple document-level model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5.

Docent can read input files both in the NIST-XML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourse-level corpus annotations such as coreference links. These annotations can then be accessed by the feature models. To allow for additional target-language information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses.

In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding.
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a).

4 Implementing Feature Models Efficiently

While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted.

The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next.
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. 
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method can simply return the precise score as a tight upper bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all.

5 Readability: A Case Study

As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability of texts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities.

Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we always have access to a full document translation, which makes it straightforward to include features directly into the decoder.

We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent.
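The two type-token measures can be computed over the tokens of a whole target document, as in the short Python sketch below. The text above does not give the formulas, so the OVIX expression used here, log(n) / log(2 - log(u) / log(n)) for n tokens and u distinct types, is the commonly cited definition and should be read as an assumption; the function names and the whitespace tokenisation are likewise only illustrative.

```python
import math
from typing import Iterable

def type_token_ratio(tokens: Iterable[str]) -> float:
    """Number of distinct word types divided by the number of tokens."""
    tokens = list(tokens)
    if not tokens:
        raise ValueError("TTR is undefined for an empty document")
    return len(set(tokens)) / len(tokens)

def ovix(tokens: Iterable[str]) -> float:
    """Word variation index (assumed formula, see the note above)."""
    tokens = list(tokens)
    n, u = len(tokens), len(set(tokens))
    if n < 2:
        raise ValueError("OVIX needs at least two tokens")
    if u == n:
        # Every token is distinct, so the denominator log(1) is zero.
        return math.inf
    return math.log(n) / math.log(2 - math.log(u) / math.log(n))

document = "planen är bra och planen är kort".split()
ttr_value = type_token_ratio(document)
ovix_value = ovix(document)
```

Both functions need the complete token sequence of the document, which is why such features fit a decoder that always holds a full document translation in its search state.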
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).

We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.

Table 1: Results for adding single lexical consistency features to Docent

Feature  | BLEU  | OVIX  | LIX
Baseline | 0.243 | 56.88 | 51.17
TTR      | 0.243 | 55.25 | 51.04
OVIX     | 0.243 | 54.65 | 51.00
QW       | 0.242 | 57.16 | 51.16
QP       | 0.243 | 57.07 | 51.06
All      | 0.235 | 47.80 | 49.29

Table 2: Example translation snippets with comments

Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb

To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
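For readers unfamiliar with LIX, the sketch below computes it in Python using the standard definition: the average number of words per sentence plus the percentage of words longer than six letters. The function name and the input representation (a list of tokenised sentences with punctuation already removed) are assumptions for illustration, not part of the evaluation setup reported here.

```python
from typing import List

def lix(sentences: List[List[str]]) -> float:
    """LIX = words per sentence + percentage of words longer than 6 letters."""
    words = [w for sentence in sentences for w in sentence]
    if not sentences or not words:
        raise ValueError("LIX needs at least one non-empty sentence")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)

# Toy example (hypothetical sentences): 8 words, 2 sentences, 1 long word.
example = [
    ["det", "är", "en", "gemenskapslagstiftning"],
    ["planen", "är", "bra", "nu"],
]
example_lix = lix(example)
```

Lower values indicate text that is easier to read, so a drop in LIX relative to the baseline is the desired direction.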
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.

We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality.

We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.

There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.

6 Conclusions

In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.

References

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability measure. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

6 0.71640378 24 acl-2013-A Tale about PRO and Monsters

7 0.70156354 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

8 0.70099998 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

9 0.67840093 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

10 0.67529166 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

11 0.66784042 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

12 0.66298217 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

13 0.63111818 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

14 0.6218574 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

15 0.61632711 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

16 0.61343765 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

17 0.60125041 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

18 0.59383023 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

19 0.59257609 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

20 0.57274503 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.066), (6, 0.049), (11, 0.069), (23, 0.239), (24, 0.033), (26, 0.06), (28, 0.017), (35, 0.056), (42, 0.088), (48, 0.039), (70, 0.034), (88, 0.03), (90, 0.06), (95, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82268929 209 acl-2013-Joint Modeling of News Reader’s and Comment Writer’s Emotions

Author: Huanhuan Liu ; Shoushan Li ; Guodong Zhou ; Chu-ren Huang ; Peifeng Li

Abstract: Emotion classification can be generally done from both the writer’s and reader’s perspectives. In this study, we find that two foundational tasks in emotion classification, i.e., reader’s emotion classification on the news and writer’s emotion classification on the comments, are strongly related to each other in terms of coarse-grained emotion categories, i.e., negative and positive. On this basis, we propose a way to jointly model these two tasks. In particular, a co-training algorithm is proposed to improve semi-supervised learning of the two tasks. Experimental evaluation shows the effectiveness of our joint modeling approach.

same-paper 2 0.80370551 328 acl-2013-Stacking for Statistical Machine Translation

Author: Majid Razmara ; Anoop Sarkar

Abstract: We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model.

3 0.77815294 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits

Author: Vidhya Govindaraju ; Ce Zhang ; Christopher Re

Abstract: Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a pure-table or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.

4 0.75812942 333 acl-2013-Summarization Through Submodularity and Dispersion

Author: Anirban Dasgupta ; Ravi Kumar ; Sujith Ravi

Abstract: We propose a new optimization framework for summarization by generalizing the submodular framework of (Lin and Bilmes, 2011). In our framework the summarization desideratum is expressed as a sum of a submodular function and a nonsubmodular function, which we call dispersion; the latter uses inter-sentence dissimilarities in different ways in order to ensure non-redundancy of the summary. We consider three natural dispersion functions and show that a greedy algorithm can obtain an approximately optimal summary in all three cases. We conduct experiments on two corpora—DUC 2004 and user comments on news articles—and show that the performance of our algorithm outperforms those that rely only on submodularity.

5 0.63981062 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

Author: Wenduan Xu ; Yue Zhang ; Philip Williams ; Philipp Koehn

Abstract: We present a context-sensitive chart pruning method for CKY-style MT decoding. Source phrases that are unlikely to have aligned target constituents are identified using sequence labellers learned from the parallel corpus, and speed-up is obtained by pruning corresponding chart cells. The proposed method is easy to implement, orthogonal to cube pruning and additive to its pruning power. On a full-scale English-to-German experiment with a string-to-tree model, we obtain a speed-up of more than 60% over a strong baseline, with no loss in BLEU.

6 0.62848485 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

7 0.62695849 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

8 0.62400663 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

9 0.62384319 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

10 0.62268758 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

11 0.62158412 289 acl-2013-QuEst - A translation quality estimation framework

12 0.62034047 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

13 0.62021482 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

14 0.61979842 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

15 0.61938161 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

16 0.61819714 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

17 0.61795187 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

18 0.61757839 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

19 0.61727929 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

20 0.61696213 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search