acl acl2013 acl2013-166 knowledge-graph by maker-knowledge-mining

166 acl-2013-Generalized Reordering Rules for Improved SMT

Source: pdf

Author: Fei Huang ; Cezar Pendus

Abstract: We present a simple yet effective approach to syntactic reordering for Statistical Machine Translation (SMT). Instead of solely relying on the top-1 best-matching rule for source sentence preordering, we generalize fully lexicalized rules into partially lexicalized and unlexicalized rules to broaden the rule coverage. Furthermore, , we consider multiple permutations of all the matching rules, and select the final reordering path based on the weighed sum of reordering probabilities of these rules. Our experiments in English-Chinese and English-Japanese translations demonstrate the effectiveness of the proposed approach: we observe consistent and significant improvement in translation quality across multiple test sets in both language pairs judged by both humans and automatic metric. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract We present a simple yet effective approach to syntactic reordering for Statistical Machine Translation (SMT). [sent-5, score-0.649]

2 Instead of solely relying on the top-1 best-matching rule for source sentence preordering, we generalize fully lexicalized rules into partially lexicalized and unlexicalized rules to broaden the rule coverage. [sent-6, score-1.77]

3 Furthermore, , we consider multiple permutations of all the matching rules, and select the final reordering path based on the weighed sum of reordering probabilities of these rules. [sent-7, score-1.274]

4 Our experiments in English-Chinese and English-Japanese translations demonstrate the effectiveness of the proposed approach: we observe consistent and significant improvement in translation quality across multiple test sets in both language pairs judged by both humans and automatic metric. [sent-8, score-0.388]

5 The proper handling of linguistic structures (such as word order) has been one of the most important yet most challenging tasks in statistical machine translation (SMT). [sent-10, score-0.381]

6 It is important because it has significant impact on human judgment of Machine Translation (MT) quality: an MT output without structure is just like a bag of words. [sent-11, score-0.122]

7 It is also very challenging due to the lack of effective methods to model the structural difference between source and target languages. [sent-12, score-0.312]

8 A lot of research has been conducted in this area. [sent-13, score-0.074]

9 2003) and lexicalized distortion models such as (Tillman 2004), (AlOnaizan and Papineni 2006). [sent-16, score-0.324]

10 Because these models are relatively easy to compute, they are widely used in phrase-based SMT systems. [sent-17, score-0.031]

11 com Chiang, 2005) utilizes long range reordering information without syntax. [sent-22, score-0.611]

12 ) to capture the structural difference between language pairs, including (Yamada and Knight, 2001), (Zollmann and Venugopal, 2006), (Liu et. [sent-24, score-0.136]

13 These models demonstrate better handling of sentence structures, while the computation is more expensive compared with the distortion-based models. [sent-29, score-0.26]

14 In the middle of the spectrum, (Xia and McCord 2004), (Collins et. [sent-30, score-0.044]

15 2010) combined the benefits of the above two strategies: their approaches reorder an input sentence based on a set of reordering rules defined over the source sentenceâ€™s syntax parse tree. [sent-35, score-1.024]

16 As a result, the re-ordered source sentence resembles the word order of its target translation. [sent-36, score-0.166]

17 The reordering rules are either hand-crafted or automatically learned from the training data (source parse trees and bitext word alignments). [sent-37, score-0.833]

18 These rules can be unlexicalized (only including the constituent labels) or fully lexicalized (including both the constituent labels and their head words). [sent-38, score-1.089]

19 The unlexicalized reordering rules are more general and can be applied broadly, but sometimes they are not discriminative enough. [sent-39, score-0.983]

20 56 NP PP â†’ 0 1 NP PP â†’ 1 0 the NP and PP nodes are reordered with close to random probabilities. [sent-42, score-0.151]

21 When the constituents are attached with their headwords, the reordering probability is much higher than that of the unlexicalized rules. [sent-43, score-0.84]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reordering', 0.498), ('lexicalized', 0.264), ('unlexicalized', 0.234), ('pp', 0.233), ('np', 0.232), ('testimony', 0.228), ('rules', 0.211), ('ibm', 0.178), ('smt', 0.158), ('watson', 0.154), ('weighed', 0.114), ('broaden', 0.114), ('visweswariah', 0.105), ('constituent', 0.1), ('mccord', 0.099), ('headwords', 0.094), ('tillman', 0.094), ('permutations', 0.094), ('preordering', 0.094), ('hiero', 0.09), ('reordered', 0.09), ('handling', 0.09), ('zollmann', 0.087), ('venugopal', 0.084), ('reorder', 0.079), ('spectrum', 0.079), ('challenging', 0.078), ('center', 0.075), ('structural', 0.072), ('bitext', 0.07), ('fully', 0.068), ('translation', 0.067), ('sparseness', 0.067), ('resembles', 0.063), ('bag', 0.062), ('source', 0.062), ('distortion', 0.06), ('judgment', 0.06), ('xia', 0.06), ('solely', 0.059), ('broadly', 0.059), ('yamada', 0.059), ('com', 0.058), ('rule', 0.057), ('yet', 0.056), ('utilizes', 0.055), ('mt', 0.055), ('attached', 0.055), ('demonstrate', 0.054), ('parse', 0.054), ('judged', 0.054), ('constituents', 0.053), ('fei', 0.053), ('structures', 0.053), ('penalty', 0.052), ('unlikely', 0.051), ('relying', 0.049), ('benefits', 0.049), ('constrained', 0.047), ('generalize', 0.047), ('humans', 0.046), ('generalized', 0.045), ('papineni', 0.045), ('labels', 0.045), ('middle', 0.044), ('knight', 0.044), ('shen', 0.043), ('lot', 0.043), ('chiang', 0.042), ('collins', 0.042), ('unfortunately', 0.041), ('sentence', 0.041), ('expensive', 0.041), ('alignments', 0.041), ('path', 0.04), ('sometimes', 0.04), ('al', 0.038), ('effective', 0.037), ('proper', 0.037), ('koehn', 0.036), ('translations', 0.036), ('observe', 0.035), ('strategies', 0.034), ('computation', 0.034), ('including', 0.034), ('lack', 0.033), ('head', 0.033), ('huang', 0.033), ('quality', 0.033), ('partially', 0.032), ('effectiveness', 0.032), ('close', 0.032), ('widely', 0.031), ('conducted', 0.031), ('consistent', 0.031), ('difference', 0.03), ('sum', 0.03), ('hierarchical', 0.03), ('syntax', 0.03), ('nodes', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 166 acl-2013-Generalized Reordering Rules for Improved SMT

Author: Fei Huang ; Cezar Pendus

2 0.39389288 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

Author: Karthik Visweswariah ; Mitesh M. Khapra ; Ananthakrishnan Ramanathan

Abstract: Preordering of a source language sentence to match target word order has proved to be useful for improving machine translation systems. Previous work has shown that a reordering model can be learned from high quality manual word alignments to improve machine translation performance. In this paper, we focus on further improving the performance of the reordering model (and thereby machine translation) by using a larger corpus of sentence aligned data for which manual word alignments are not available but automatic machine generated alignments are available. The main challenge we tackle is to generate quality data for training the reordering model in spite of the machine align- ments being noisy. To mitigate the effect of noisy machine alignments, we propose a novel approach that improves reorderings produced given noisy alignments and also improves word alignments using information from the reordering model. This approach generates alignments that are 2.6 f-Measure points better than a baseline supervised aligner. The data generated allows us to train a reordering model that gives an improvement of 1.8 BLEU points on the NIST MT-08 Urdu-English evaluation set over a reordering model that only uses manual word alignments, and a gain of 5.2 BLEU points over a standard phrase-based baseline.

3 0.33471695 200 acl-2013-Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation

Author: ThuyLinh Nguyen ; Stephan Vogel

Abstract: Hiero translation models have two limitations compared to phrase-based models: 1) Limited hypothesis space; 2) No lexicalized reordering model. We propose an extension of Hiero called PhrasalHiero to address Hiero’s second problem. Phrasal-Hiero still has the same hypothesis space as the original Hiero but incorporates a phrase-based distance cost feature and lexicalized reodering features into the chart decoder. The work consists of two parts: 1) for each Hiero translation derivation, find its corresponding dis- continuous phrase-based path. 2) Extend the chart decoder to incorporate features from the phrase-based path. We achieve significant improvement over both Hiero and phrase-based baselines for ArabicEnglish, Chinese-English and GermanEnglish translation.

4 0.27302003 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

Author: Minwei Feng ; Jan-Thorsten Peter ; Hermann Ney

Abstract: In this paper, we propose a novel reordering model based on sequence labeling techniques. Our model converts the reordering problem into a sequence labeling problem, i.e. a tagging task. Results on five Chinese-English NIST tasks show that our model improves the baseline system by 1.32 BLEU and 1.53 TER on average. Results of comparative study with other seven widely used reordering models will also be reported.

5 0.20476797 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation

Author: Isao Goto ; Masao Utiyama ; Eiichiro Sumita ; Akihiro Tamura ; Sadao Kurohashi

Abstract: This paper proposes new distortion models for phrase-based SMT. In decoding, a distortion model estimates the source word position to be translated next (NP) given the last translated source word position (CP). We propose a distortion model that can consider the word at the CP, a word at an NP candidate, and the context of the CP and the NP candidate simultaneously. Moreover, we propose a further improved model that considers richer context by discriminating label sequences that specify spans from the CP to NP candidates. It enables our model to learn the effect of relative word order among NP candidates as well as to learn the effect of distances from the training data. In our experiments, our model improved 2.9 BLEU points for Japanese-English and 2.6 BLEU points for Chinese-English translation compared to the lexical reordering models.

6 0.15595639 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

7 0.14188181 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

8 0.13947988 314 acl-2013-Semantic Roles for String to Tree Machine Translation

9 0.13143107 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

10 0.12272105 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?

11 0.11838648 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

12 0.11018921 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

13 0.092003331 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

14 0.088512167 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

15 0.084434062 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

16 0.083589748 130 acl-2013-Domain-Specific Coreference Resolution with Lexicalized Features

17 0.080991514 80 acl-2013-Chinese Parsing Exploiting Characters

18 0.076639518 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

19 0.076113202 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

20 0.073424928 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, -0.193), (2, 0.141), (3, 0.117), (4, -0.095), (5, 0.089), (6, 0.039), (7, -0.0), (8, 0.028), (9, 0.091), (10, -0.023), (11, 0.082), (12, 0.049), (13, 0.026), (14, 0.047), (15, 0.104), (16, 0.225), (17, 0.091), (18, 0.016), (19, 0.037), (20, -0.171), (21, 0.003), (22, -0.018), (23, -0.239), (24, 0.06), (25, 0.001), (26, -0.015), (27, -0.168), (28, -0.292), (29, -0.074), (30, -0.088), (31, -0.026), (32, 0.052), (33, 0.049), (34, -0.073), (35, 0.025), (36, 0.018), (37, -0.018), (38, -0.051), (39, 0.06), (40, -0.002), (41, -0.145), (42, 0.021), (43, -0.086), (44, -0.096), (45, -0.063), (46, 0.021), (47, 0.038), (48, -0.026), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97051203 166 acl-2013-Generalized Reordering Rules for Improved SMT

Author: Fei Huang ; Cezar Pendus

2 0.83184361 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

Author: Karthik Visweswariah ; Mitesh M. Khapra ; Ananthakrishnan Ramanathan

3 0.81932151 200 acl-2013-Integrating Phrase-based Reordering Features into a Chart-based Decoder for Machine Translation

Author: ThuyLinh Nguyen ; Stephan Vogel

4 0.80568528 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation

Author: Isao Goto ; Masao Utiyama ; Eiichiro Sumita ; Akihiro Tamura ; Sadao Kurohashi

5 0.72978497 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

Author: Minwei Feng ; Jan-Thorsten Peter ; Hermann Ney

6 0.72817481 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

7 0.72756934 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?

8 0.48902151 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

9 0.42377716 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

10 0.41341171 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

11 0.40025774 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

12 0.3862966 314 acl-2013-Semantic Roles for String to Tree Machine Translation

13 0.36649421 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

14 0.36534059 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

15 0.35296696 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation

16 0.34694651 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

17 0.32873386 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

18 0.32676432 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

19 0.32074863 280 acl-2013-Plurality, Negation, and Quantification:Towards Comprehensive Quantifier Scope Disambiguation

20 0.31461817 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.054), (6, 0.053), (11, 0.055), (24, 0.036), (26, 0.039), (35, 0.082), (42, 0.169), (48, 0.025), (70, 0.029), (88, 0.023), (90, 0.099), (95, 0.104), (96, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89850903 166 acl-2013-Generalized Reordering Rules for Improved SMT

Author: Fei Huang ; Cezar Pendus

2 0.83382356 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

Author: Wenduan Xu ; Yue Zhang ; Philip Williams ; Philipp Koehn

Abstract: We present a context-sensitive chart pruning method for CKY-style MT decoding. Source phrases that are unlikely to have aligned target constituents are identified using sequence labellers learned from the parallel corpus, and speed-up is obtained by pruning corresponding chart cells. The proposed method is easy to implement, orthogonal to cube pruning and additive to its pruning power. On a full-scale Englishto-German experiment with a string-totree model, we obtain a speed-up of more than 60% over a strong baseline, with no loss in BLEU.

3 0.82806909 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-bysentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of 193 document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm. The decoder is written in C++ and follows an objectoriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on Github1 to make it easy for other researchers to use it in their own experiments. 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target lan- guage through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every 1https : //github .com/chardmeier/docent/wiki Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 193–198, stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit. It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained 194 by DP search. In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The movephrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearisephrasescanreorderasequenceofphrasesintorandom order and back into the order corresponding to the source language. Since the search algorithm in Docent is stochastic, repeated runs of the decoder will gen- erally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder. They have not yet reached a state of maturity that would make them suitable for production use. The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight ofour distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of documentlevel readability models. The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple documentlevel model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5. Docent can read input files both in the NISTXML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourselevel corpus annotations such as coreference links. 195 These annotations can then be accessed by the feature models. To allow for additional targetlanguage information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses. In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding. The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted. The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method 196 can simply return the precise score as a tight up- per bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all. 5 Readability: A Case Study As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability oftexts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities. Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we al- ways have access to a full document translation, which makes it straightforward to include features directly into the decoder. We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent. In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009). We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sen- tences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty. BaselineReadability featuresComment de ärade ledamöterna (the honourableledamöterna (the members) / ni+ Removal of non-essential words Members) (you) på ett sådant sätt att (in such a way så att (so that) + Simplified expression that) gemenskapslagstiftningen (the gemenskapens lagstiftning (the + Shorter community legislation) community’s compound to genitive construction Världshandelsorganisationen (World WTO (WTO) legislation) − Changing Trade Organisation) long compound to E−nCg hliasnhg-biansged lo handlingsplanen (the action plan) ägnat särskild uppmärksamhet particular attention to) words by changing long åt (paid planen (the plan) särskilt uppmärksam − Removal på (particular attentive on) anbgb creomvipatoiounn of important word −− RBaedm grammar bpeocratuasnet wofo rcdhanged p−ar Bt aodf gspraeemcmh aarn dbe mcaisussieng o fv cehrban Table 2: Example translation snippets with comments FeatureBLEUOVIXLIX Baseline0.24356.8851.17 TTR 0.243 55.25 51.04 OVIX 0.243 54.65 51.00 QW 0.242 57.16 51.16 QP 0.243 57.07 51.06 All 0.235 47.80 49.29 Table 1: Results for adding single lexical consistency features to Docent To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on con- sistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish. Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation. We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality. We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help 197 and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications. There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning. 6 Conclusions In this paper, we have presented Docent, an opensource document-level decoder for phrase-based SMT released under the GNU General Public License. Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary. References Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm. Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado. Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands. Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea. Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France. Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France. Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1 190, Jeju Island, Korea. Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11. Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Mark Hopkins and Jonathan ranking. In Proceedings on Empirical Methods in cessing, pages 1352–1362, May. 2011. Tuning as of the 2011 Conference Natural Language ProEdinburgh, Scotland. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton. Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Associationfor Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings 198 of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–1 13, Skövde, Sweden. Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden. Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA. Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited an extended readability. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK. – Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan. Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan. Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting ofthe Associationfor Computational Linguistics, pages 3 11–3 18, Philadelphia, Pennsylvania, USA. Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria. Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway. Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden. Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

4 0.82527286 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?

Author: Nadir Durrani ; Alexander Fraser ; Helmut Schmid ; Hieu Hoang ; Philipp Koehn

Abstract: The phrase-based and N-gram-based SMT frameworks complement each other. While the former is better able to memorize, the latter provides a more principled model that captures dependencies across phrasal boundaries. Some work has been done to combine insights from these two frameworks. A recent successful attempt showed the advantage of using phrasebased search on top of an N-gram-based model. We probe this question in the reverse direction by investigating whether integrating N-gram-based translation and reordering models into a phrase-based decoder helps overcome the problematic phrasal independence assumption. A large scale evaluation over 8 language pairs shows that performance does significantly improve.

5 0.81971741 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

Author: Minwei Feng ; Jan-Thorsten Peter ; Hermann Ney

6 0.81111223 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

7 0.81054902 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

8 0.80957347 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

9 0.80872321 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features

10 0.80647933 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

11 0.80312014 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

12 0.80273461 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

13 0.80101466 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

14 0.79486686 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

15 0.79411006 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

16 0.79353887 328 acl-2013-Stacking for Statistical Machine Translation

17 0.78992885 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

18 0.7854861 90 acl-2013-Conditional Random Fields for Responsive Surface Realisation using Global Features

19 0.78122354 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

20 0.78037155 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models