
127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation


Source: pdf

Author: Christian Hardmeier ; Sara Stymne ; Jörg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models.

1 Motivation

Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012).

One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm.
The decoder is written in C++ and follows an object-oriented design that makes it easy to extend with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on GitHub (https://github.com/chardmeier/docent/wiki) to make it easy for other researchers to use it in their own experiments.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 193–198, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics

2 Document-Level Decoding with Local Search

Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target language through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models.

The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit.
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches.

Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored; it is accepted if its score is better than that of the previous state and rejected otherwise. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1.

Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses.
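The relationship between hill climbing and simulated annealing described above can be illustrated with a small Python sketch. It is not Docent's C++ code: the function names, the parameters `max_steps` and `max_rejected`, and in particular the Metropolis-style acceptance probability exp((new - old) / temperature) are assumptions made for illustration, since the text only states that the probability depends on a temperature parameter.

```python
import math
import random


def accept(old_score, new_score, temperature, rng):
    """Decide whether to move to the new state.

    Improvements are always accepted. A worse (or equal) state is accepted
    with probability exp((new - old) / temperature); this Metropolis rule is
    an assumption of this sketch. With temperature 0 the rule reduces to
    plain hill climbing.
    """
    if new_score > old_score:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp((new_score - old_score) / temperature)


def local_search(state, score, propose, temperature=0.0,
                 max_steps=10000, max_rejected=500, seed=0):
    """Generic local search over complete states.

    `score(state)` returns a number (higher is better) and
    `propose(state, rng)` returns a modified copy of the state.
    Search stops after `max_steps` moves or after `max_rejected`
    consecutive rejected proposals.
    """
    rng = random.Random(seed)
    current, current_score = state, score(state)
    rejected = 0
    for _ in range(max_steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        if accept(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            rejected = 0
        else:
            rejected += 1
            if rejected >= max_rejected:
                break
    return current, current_score
```

In a decoder, the state would be a full document translation and `propose` would apply one of the search operations; here both are left abstract so that the acceptance logic stays visible.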
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained by DP search.

In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The move-phrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearise-phrases can reorder a sequence of phrases into random order and back into the order corresponding to the source language.

Since the search algorithm in Docent is stochastic, repeated runs of the decoder will generally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a).

3 Available Feature Models

In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder.
They have not yet reached a state of maturity that would make them suitable for production use.

The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used.

DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight of our distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations.

The document-level models in Docent include a length parity model, a semantic language model as well as a collection of document-level readability models.
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple document-level model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5.

Docent can read input files both in the NIST-XML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourse-level corpus annotations such as coreference links. These annotations can then be accessed by the feature models. To allow for additional target-language information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses.

In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding.
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a).

4 Implementing Feature Models Efficiently

While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted.

The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next.
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding.

A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again.

The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all.
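The two-step scoring procedure can be sketched in Python as follows. This is an illustration under simplifying assumptions, not Docent's C++ interface: the method names mirror initDocument, estimateScoreUpdate and updateScore, but the classes `WordPenalty` and `ToyLanguageModel`, the function `try_move`, the representation of a modification as a (sentence index, new sentence) pair, the unigram scoring, the unit feature weights and the hill climbing acceptance criterion are all hypothetical choices made to keep the example short.

```python
class WordPenalty:
    """Cheap sentence-decomposable model: score = -(number of target words)."""

    def init_document(self, doc):
        # Score the whole document from scratch.
        return -float(sum(len(sentence) for sentence in doc))

    def update_score(self, doc, old_score, mods):
        # Subtract the old contribution of each modified sentence, add the new one.
        return old_score + sum(len(doc[i]) - len(new) for i, new in mods)

    def estimate_score_update(self, doc, old_score, mods):
        # Cheap to compute, so the exact score serves as a tight upper bound.
        return self.update_score(doc, old_score, mods)


class ToyLanguageModel:
    """Stand-in for an expensive model; unigram log-probabilities (<= 0)."""

    def __init__(self, logprob):
        self.logprob = logprob
        self.lookups = 0           # counts simulated expensive lookups
        self.sentence_scores = []  # cached per-sentence scores of the current state

    def _score_sentence(self, sentence):
        self.lookups += len(sentence)
        return sum(self.logprob.get(word, -10.0) for word in sentence)

    def init_document(self, doc):
        self.sentence_scores = [self._score_sentence(s) for s in doc]
        return sum(self.sentence_scores)

    def estimate_score_update(self, doc, old_score, mods):
        # Upper bound: remove the cached scores of the affected parts and add
        # nothing for their replacements (which can only contribute <= 0).
        return old_score - sum(self.sentence_scores[i] for i, _ in mods)

    def update_score(self, doc, old_score, mods):
        removed = sum(self.sentence_scores[i] for i, _ in mods)
        added = sum(self._score_sentence(new) for _, new in mods)
        return old_score - removed + added


def try_move(models, doc, old_scores, mods):
    """Return (accepted, scores) for a proposed move under hill climbing.

    `mods` lists each modified sentence index at most once. Committing an
    accepted move (updating `doc` and the model caches) is left out.
    """
    threshold = sum(old_scores)
    bounds = [m.estimate_score_update(doc, s, mods)
              for m, s in zip(models, old_scores)]
    if sum(bounds) <= threshold:
        return False, old_scores   # rejected without any exact scoring
    exact = [m.update_score(doc, s, mods) for m, s in zip(models, old_scores)]
    if sum(exact) <= threshold:
        return False, old_scores
    return True, exact
```

With a two-sentence document `[["a", "b"], ["c"]]`, replacing the first sentence by a much longer one is rejected on the word penalty bound alone, so the `lookups` counter of the toy language model does not change; only moves that survive the bound check trigger exact scoring.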
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method can simply return the precise score as a tight upper bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all.

5 Readability: A Case Study

As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability of texts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities.

Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we always have access to a full document translation, which makes it straightforward to include features directly into the decoder.

We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent.
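As a rough illustration of the first two features, the following Python sketch computes TTR and OVIX over the tokens of a whole document. The text does not give the formulas; the OVIX definition used here, log(n) / log(2 - log(u) / log(n)) for n tokens and u types, is the commonly cited form of the word variation index and should be read as an assumption, as should the function names and the handling of the undefined case.

```python
import math


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("empty document")
    return len(set(tokens)) / len(tokens)


def ovix(tokens):
    """Word variation index, assumed here to be
    log(n) / log(2 - log(u) / log(n)) with n tokens and u types."""
    if not tokens:
        raise ValueError("empty document")
    n = len(tokens)
    u = len(set(tokens))
    if u == n:
        # Every token is a distinct type: the denominator is log(1) = 0,
        # so the index is unbounded.
        return math.inf
    return math.log(n) / math.log(2.0 - math.log(u) / math.log(n))
```

Both functions take the flattened token list of the complete target document, which is the kind of input a sentence-level decoder cannot provide during search.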
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).

We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.

Feature    BLEU    OVIX    LIX
Baseline   0.243   56.88   51.17
TTR        0.243   55.25   51.04
OVIX       0.243   54.65   51.00
QW         0.242   57.16   51.16
QP         0.243   57.07   51.06
All        0.235   47.80   49.29
Table 1: Results for adding single lexical consistency features to Docent

Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb
Table 2: Example translation snippets with comments

To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
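For readers unfamiliar with LIX, a minimal Python sketch of the measure follows. The formula is not spelled out in the text; the version below uses the usual definition (average sentence length plus the percentage of words longer than six characters), and the function name and the tokenised input format are assumptions of this sketch.

```python
def lix(sentences):
    """LIX readability index for a list of tokenised sentences.

    Computed as words per sentence plus the percentage of long words,
    where a long word has more than six characters.
    """
    words = [word for sentence in sentences for word in sentence]
    if not words:
        raise ValueError("empty document")
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)
```

Lower values indicate easier text, which is why replacing a long compound such as gemenskapslagstiftningen by shorter words lowers the score.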
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.

We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality.

We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.

There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.

6 Conclusions

In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.

References

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. [sent-4, score-0.298]

2 By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. [sent-5, score-0.449]

3 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. [sent-6, score-0.135]

4 Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al. [sent-7, score-0.136]

5 One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al. [sent-10, score-0.351]

6 For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). [sent-12, score-0.242]

7 We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al. [sent-13, score-0.104]

8 In this paper, we present an open-source implementation of this search algorithm. [sent-16, score-0.114]

9 The decoder is written in C++ and follows an object-oriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. [sent-17, score-0.541]

10 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. [sent-19, score-0.163]

11 (2003) and implemented, for example, in the popular Moses decoder (Koehn et al. [sent-20, score-0.163]

12 The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. [sent-23, score-0.247]

13 The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. [sent-24, score-0.371]

14 The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. [sent-25, score-0.146]

15 This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every [sent-27, score-0.270]

16 stage of the search progress. [sent-30, score-0.114]

17 Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. [sent-31, score-0.208]

18 This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit. [sent-32, score-0.146]

19 It is similar to previous work on phrase-based SMT decoding by Langlais et al. [sent-33, score-0.104]

20 Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. [sent-35, score-0.502]

21 The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. [sent-37, score-0.43]

22 After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. [sent-38, score-0.162]

23 Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. [sent-39, score-0.309]

24 Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. [sent-40, score-0.516]

25 Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. [sent-41, score-0.731]

26 The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. [sent-42, score-0.777]
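
A minimal sketch of local beam search along the lines described above. The text does not say how a new state enters the beam; replacing the lowest-scoring beam entry when the candidate beats it is an assumption of this example, chosen so that a beam of size 1 reduces to hill climbing. All names are hypothetical.

```python
import random


def local_beam_search(initial_states, score, propose, steps, rng):
    """Keep a fixed-size beam of states; modify a randomly picked one per move."""
    beam = [(score(s), s) for s in initial_states]
    for _ in range(steps):
        _, parent = rng.choice(beam)             # random state from the beam
        candidate = propose(parent, rng)
        candidate_score = score(candidate)
        worst = min(range(len(beam)), key=lambda i: beam[i][0])
        if candidate_score > beam[worst][0]:     # assumed replacement policy
            beam[worst] = (candidate_score, candidate)
    return max(beam, key=lambda item: item[0])   # (best score, best state)
```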

27 Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. [sent-43, score-0.483]

28 For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. [sent-44, score-0.532]

29 In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. [sent-45, score-0.406]

30 In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. [sent-47, score-0.651]

31 Additionally, Docent features three extra operations, all of which affect the target word order: The move-phrases operation moves a phrase to another location in the sentence. [sent-50, score-0.135]

32 Since the search algorithm in Docent is stochastic, repeated runs of the decoder will generally produce different output. [sent-53, score-0.303]

33 However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). [sent-54, score-0.189]
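
The difference between swapping two phrases and moving a single phrase can be illustrated on a flat list of target phrases. This is a toy sketch: the function names are hypothetical, and a real search state would also track the source spans and translations of each phrase.

```python
def swap_phrases(order, i, j):
    """Exchange the phrases at positions i and j (both phrases move)."""
    result = list(order)
    result[i], result[j] = result[j], result[i]
    return result


def move_phrase(order, src, dst):
    """Move one phrase from position src to position dst; no other phrase
    has to travel in the opposite direction."""
    result = list(order)
    phrase = result.pop(src)
    result.insert(dst, phrase)
    return result
```

For example, moving the first of four phrases to index 2 shifts the two phrases in between one slot to the left, whereas a swap leaves everything between the two positions untouched.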

34 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. [sent-57, score-0.116]

35 They have not yet reached a state of maturity that would make them suitable for production use. [sent-60, score-0.094]

36 The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. [sent-61, score-0.27]

37 All of these features are designed to be compatible with the corresponding features in Moses. [sent-63, score-0.068]

38 From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. [sent-64, score-0.197]

39 Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. [sent-65, score-0.074]

40 DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. [sent-66, score-0.212]

41 In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. [sent-67, score-0.335]

42 The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. [sent-68, score-0.177]

43 Docent therefore includes a separate indicator feature to indicate a violated distortion limit. [sent-69, score-0.164]

44 In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. [sent-70, score-0.205]

45 In contrast with the distortion limit parameter of a DP decoder, the weight of our distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. [sent-71, score-0.529]

46 The document-level models in Docent include a length parity model, a semantic language model and a collection of document-level readability models. [sent-72, score-0.307]
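
A soft distortion limit of this kind can be sketched as a feature that counts violations and is multiplied by a weight. The jump distance used here, `|start - previous_end - 1|` over source positions, is the usual phrase-based definition and is an assumption of this example, as are the function names and the choice to count violations rather than return a single 0/1 flag.

```python
def distortion_violations(source_spans, limit):
    """Count reordering jumps longer than `limit`.

    source_spans: (start, end) source word positions (end inclusive) of the
    phrases, listed in target order.
    """
    violations = 0
    previous_end = -1
    for start, end in source_spans:
        if abs(start - previous_end - 1) > limit:
            violations += 1
        previous_end = end
    return violations


def distortion_limit_score(source_spans, limit, weight):
    """Weighted penalty; a very large weight makes the limit effectively hard."""
    return -weight * distortion_violations(source_spans, limit)
```

With a moderate weight, a translation that violates the limit once can still win if its other feature scores improve by more than the penalty.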

47 The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. [sent-73, score-0.121]
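
As an illustration of why such a model needs document-level scope, here is one possible scoring function for length parity. The text does not give Docent's actual formula; penalising the size of the minority parity class is an assumption made for this sketch.

```python
def length_parity_score(sentences):
    """Return 0 when all sentence lengths share the same parity, otherwise
    minus the number of sentences in the minority parity class."""
    odd = sum(1 for sentence in sentences if len(sentence) % 2 == 1)
    even = len(sentences) - odd
    return -min(odd, even)
```

The score of one sentence depends on the lengths of all the others, so it cannot be computed sentence by sentence in a DP decoder.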

48 It serves mostly as a template to demonstrate how a simple document-level model can be implemented in the decoder. [sent-74, score-0.077]

49 The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. [sent-78, score-0.552]

50 Docent can read input files both in the NIST-XML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). [sent-80, score-0.07]

51 To allow for additional target-language information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses. [sent-83, score-0.073]

52 In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. [sent-84, score-0.108]

53 In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. [sent-85, score-0.131]

54 Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding. [sent-86, score-0.47]

55 The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). [sent-87, score-0.133]

56 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. [sent-89, score-0.326]

57 For each move, a score must be computed and tested against the acceptance criterion. [sent-90, score-0.088]

58 In order to achieve reasonably fast decoding times, efficient scoring is paramount. [sent-92, score-0.134]

59 Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. [sent-93, score-0.263]

60 The scores of SMT feature models can usually be decomposed in some way over parts of the document. [sent-96, score-0.144]

61 For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. [sent-98, score-0.07]

62 To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. [sent-99, score-0.46]

63 The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. [sent-100, score-0.303]
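
The subtract-old, add-new update can be sketched for a model that decomposes over sentences. The function name, the cached per-sentence score list and the list of modified sentence indices are assumptions of this example; Docent's models receive a list of state modifications whose exact form is not described here.

```python
def rescore(previous_total, previous_parts, new_sentences, modified, score_sentence):
    """Update a document score by rescoring only the modified sentences.

    previous_parts: cached per-sentence scores of the previous state.
    modified: indices of the sentences touched by the search operation.
    """
    total = previous_total
    parts = list(previous_parts)
    for i in modified:
        total -= parts[i]                          # remove the old contribution
        parts[i] = score_sentence(new_sentences[i])
        total += parts[i]                          # add the new contribution
    return total, parts
```

The cost of a move then depends on how much of the document it touches rather than on the document length.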

64 This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. [sent-101, score-0.258]

65 A feature model class in Docent implements three methods. [sent-102, score-0.082]

66 The initDocument method is called once per document when decoding starts. [sent-103, score-0.173]

67 It straightforwardly computes the model score for the entire document from scratch. [sent-104, score-0.105]

68 When a state is modified, the decoder first invokes the estimateScoreUpdate method. [sent-105, score-0.257]

69 Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. [sent-106, score-0.169]

70 The search algorithm then checks this upper bound against the acceptance criterion. [sent-107, score-0.263]

71 Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. [sent-108, score-0.149]

72 The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. [sent-109, score-0.131]

73 For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. [sent-110, score-0.133]

74 In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. [sent-111, score-0.336]
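
The language model bound can be written down directly. Because every n-gram log-probability is at most 0, leaving out the new n-grams can only overestimate the score. The function names and the toy numbers in the usage note are hypothetical.

```python
def lm_upper_bound(old_score, affected_old_logprobs):
    """Old score minus the log-probabilities of the n-grams that disappear.
    No language model lookup is needed: these values were computed before."""
    return old_score - sum(affected_old_logprobs)


def lm_exact_score(old_score, affected_old_logprobs, new_logprobs):
    """Exact score: additionally add the (non-positive) new log-probabilities."""
    return old_score - sum(affected_old_logprobs) + sum(new_logprobs)
```

For instance, with an old score of -10, removed n-grams scoring -2 and -3, and new n-grams scoring -1.5 and -4, the bound is -5 while the exact score is -10.5.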

75 On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method can simply return the precise score as a tight upper bound. [sent-113, score-0.222]

76 If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all. [sent-114, score-0.204]
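
The two-step procedure can be sketched as follows. The Python method names mirror initDocument, estimateScoreUpdate and updateScore, but the signatures, the two toy models, the unweighted sum of model scores and the strict hill-climbing acceptance test are all assumptions of this example rather than Docent's actual interface.

```python
class WordPenalty:
    """Cheap model: the estimate is simply the exact score (a tight bound)."""

    def init_document(self, doc):
        return -float(sum(len(sentence) for sentence in doc))

    def estimate_score_update(self, new_doc, old_score):
        return self.init_document(new_doc)

    def update_score(self, new_doc, old_score):
        return self.init_document(new_doc)


class ExpensiveModel:
    """Stand-in for a costly log-probability model; counts exact rescorings."""

    def __init__(self):
        self.exact_calls = 0

    def init_document(self, doc):
        return -0.1 * sum(len(set(sentence)) for sentence in doc)

    def estimate_score_update(self, new_doc, old_score):
        return 0.0      # loose but valid upper bound for a log probability

    def update_score(self, new_doc, old_score):
        self.exact_calls += 1
        return self.init_document(new_doc)


def try_move(models, old_scores, new_doc):
    """Check the cheap upper bounds first; compute exact scores only if needed."""
    old_total = sum(old_scores)
    bounds = [m.estimate_score_update(new_doc, s)
              for m, s in zip(models, old_scores)]
    if sum(bounds) <= old_total:
        return False, old_scores        # rejected on the upper bound alone
    exact = [m.update_score(new_doc, s) for m, s in zip(models, old_scores)]
    if sum(exact) <= old_total:
        return False, old_scores
    return True, exact
```

In this toy setup a move that makes the document much longer is rejected by the word penalty bound alone, so the expensive model's exact scoring method is never called for it.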

77 5 Readability: A Case Study As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability of texts by encouraging simple and consistent terminology (Stymne et al., 2013b). [sent-115, score-0.259]

78 Lexical consistency modelling for SMT has been attempted before. [sent-118, score-0.076]

79 The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). [sent-119, score-0.156]

80 In Docent, however, we always have access to a full document translation, which makes it straightforward to include features directly into the decoder. [sent-121, score-0.132]

81 In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009). [sent-129, score-0.22]

82 We tested our features on English–Swedish translation using the Europarl corpus. [sent-130, score-0.141]

83 We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty. [sent-133, score-0.259]

84 Table 1: Results for adding single lexical consistency features to Docent. To evaluate our system we used the BLEU score (Papineni et al. [sent-152, score-0.114]

85 , 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. [sent-153, score-0.416]

86 Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish. [sent-154, score-0.186]
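
For readers unfamiliar with these measures, the sketch below computes TTR, OVIX and LIX as they are commonly defined: TTR is types divided by tokens, OVIX is log(tokens) / log(2 - log(types) / log(tokens)), and LIX is the average sentence length plus the percentage of words longer than six characters. The formulas come from the general readability literature, not from this text, and the function names and tokenised-list input format are assumptions.

```python
import math


def ttr(tokens):
    """Type-token ratio: distinct words divided by running words."""
    return len(set(tokens)) / len(tokens)


def ovix(tokens):
    """Word variation index, a length-normalised reformulation of TTR."""
    n, u = len(tokens), len(set(tokens))
    if n < 2 or u == n:
        raise ValueError("OVIX is undefined for fewer than two tokens "
                         "or when all tokens are distinct")
    return math.log(n) / math.log(2 - math.log(u) / math.log(n))


def lix(sentences):
    """LIX: words per sentence plus percentage of words with more than 6 letters."""
    words = [word for sentence in sentences for word in sentence]
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)
```

Lower values indicate less varied vocabulary (TTR, OVIX) or shorter sentences and words (LIX), which is the direction the readability features are meant to push.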

87 Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation. [sent-155, score-0.087]

88 We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. [sent-156, score-0.235]

89 As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. [sent-158, score-0.141]

90 For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. [sent-159, score-0.332]

91 When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality. [sent-160, score-0.186]

92 We also investigated what types of changes the readability features could lead to. [sent-161, score-0.22]

93 Table 2 shows a sample of translations where the baseline is compared to systems with readability features. [sent-162, score-0.229]

94 There are both cases where the readability features help and cases where they are problematic. [sent-163, score-0.22]

95 We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning. [sent-167, score-0.104]

96 6 Conclusions In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License. [sent-168, score-0.163]

97 Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. [sent-169, score-0.296]

98 A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. [sent-170, score-0.144]

99 We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary. [sent-171, score-0.253]

100 Context adaptation in statistical machine translation using models with exponentially decaying cache. [sent-258, score-0.169]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('docent', 0.619), ('hardmeier', 0.335), ('readability', 0.186), ('decoder', 0.163), ('smt', 0.142), ('ovix', 0.129), ('stymne', 0.126), ('dp', 0.122), ('distortion', 0.121), ('search', 0.114), ('translation', 0.107), ('decoding', 0.104), ('optimisation', 0.099), ('climbing', 0.099), ('beam', 0.099), ('state', 0.094), ('hill', 0.087), ('hlenbock', 0.077), ('nagard', 0.077), ('moses', 0.075), ('tiedemann', 0.07), ('document', 0.069), ('christian', 0.067), ('moves', 0.062), ('rg', 0.061), ('lix', 0.059), ('koehn', 0.058), ('operations', 0.058), ('ttr', 0.057), ('discourse', 0.056), ('acceptance', 0.052), ('estimatescoreupdate', 0.052), ('parity', 0.052), ('successor', 0.052), ('temperature', 0.052), ('terna', 0.052), ('uppm', 0.052), ('decoders', 0.05), ('bound', 0.049), ('pronominal', 0.049), ('local', 0.049), ('upper', 0.048), ('anaphora', 0.048), ('joakim', 0.047), ('kokkinakis', 0.046), ('legislation', 0.046), ('sara', 0.045), ('consistency', 0.044), ('feature', 0.043), ('translations', 0.043), ('ller', 0.042), ('wto', 0.042), ('subtracting', 0.042), ('implemented', 0.042), ('limit', 0.041), ('states', 0.04), ('gong', 0.04), ('mmax', 0.04), ('qw', 0.04), ('bit', 0.039), ('implements', 0.039), ('tense', 0.039), ('phrase', 0.039), ('encouraging', 0.039), ('ger', 0.038), ('gnu', 0.038), ('uppsala', 0.037), ('decomposed', 0.036), ('annealing', 0.036), ('langlais', 0.036), ('score', 0.036), ('modifications', 0.036), ('documentlevel', 0.035), ('kenlm', 0.035), ('att', 0.035), ('format', 0.035), ('models', 0.034), ('features', 0.034), ('del', 0.034), ('infrastructure', 0.033), ('meyer', 0.033), ('modelling', 0.032), ('tuning', 0.032), ('qp', 0.032), ('scored', 0.032), ('penalty', 0.031), ('scores', 0.031), ('scoring', 0.03), ('stochastic', 0.029), ('access', 0.029), ('permits', 0.028), ('unrestricted', 0.028), ('bj', 0.028), ('sapporo', 0.028), ('statistical', 0.028), ('swedish', 0.028), ('simulated', 0.028), ('mt', 0.028), ('simplified', 0.027), ('runs', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-by-sentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm.
The decoder is written in C++ and follows an object-oriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on GitHub (https://github.com/chardmeier/docent/wiki) to make it easy for other researchers to use it in their own experiments. 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target language through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit.
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 193–198, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. 
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained by DP search. In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The move-phrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearise-phrases can reorder a sequence of phrases into random order and back into the order corresponding to the source language. Since the search algorithm in Docent is stochastic, repeated runs of the decoder will generally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder.
They have not yet reached a state of maturity that would make them suitable for production use. The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight of our distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of document-level readability models.
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple document-level model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5. Docent can read input files both in the NIST-XML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourse-level corpus annotations such as coreference links. These annotations can then be accessed by the feature models. To allow for additional target-language information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses. In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding.
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted. The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. 
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. 
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method 196 can simply return the precise score as a tight up- per bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all. 5 Readability: A Case Study As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability oftexts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities. Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we al- ways have access to a full document translation, which makes it straightforward to include features directly into the decoder. We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent. 
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).

We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.

Feature | BLEU | OVIX | LIX
Baseline | 0.243 | 56.88 | 51.17
TTR | 0.243 | 55.25 | 51.04
OVIX | 0.243 | 54.65 | 51.00
QW | 0.242 | 57.16 | 51.16
QP | 0.243 | 57.07 | 51.06
All | 0.235 | 47.80 | 49.29
Table 1: Results for adding single lexical consistency features to Docent

Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb
Table 2: Example translation snippets with comments

To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
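For readers unfamiliar with these measures, the following Python sketch computes TTR, OVIX and LIX for a tokenised document. It uses the formulas as they are commonly given for Swedish readability assessment and is not taken from the Docent source code. The function names, the threshold of more than six characters for a long word and the toy sentences are assumptions made for illustration.

```python
import math


def ttr(tokens):
    """Type-token ratio: distinct word forms divided by running words."""
    return len(set(tokens)) / len(tokens)


def ovix(tokens):
    """OVIX: log(tokens) / log(2 - log(types) / log(tokens))."""
    n_tokens, n_types = len(tokens), len(set(tokens))
    if n_tokens < 2 or n_types == n_tokens:
        return math.inf  # the formula is undefined when every token is a new type
    return math.log(n_tokens) / math.log(2 - math.log(n_types) / math.log(n_tokens))


def lix(sentences, long_word_length=6):
    """LIX: average sentence length plus the percentage of long words."""
    tokens = [word for sentence in sentences for word in sentence]
    long_words = sum(1 for word in tokens if len(word) > long_word_length)
    return len(tokens) / len(sentences) + 100 * long_words / len(tokens)


# Toy documents: the second one uses the shorter word consistently.
original = [["vi", "har", "en", "handlingsplan"], ["vi", "har", "en", "plan"]]
simplified = [["vi", "har", "en", "plan"], ["vi", "har", "en", "plan"]]

for name, document in (("original", original), ("simplified", simplified)):
    flat = [word for sentence in document for word in sentence]
    print(name, ttr(flat), ovix(flat), lix(document))
```

Lower values indicate more consistent vocabulary (TTR, OVIX) or easier text (LIX), which is the direction of the improvements reported in Table 1. TTR and OVIX depend on the type and token counts of the whole document, which is why they cannot be scored sentence by sentence.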
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.

We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality.

We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.

There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.

6 Conclusions

In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.

References

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

2 0.12305604 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa

Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.

3 0.12124451 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

Author: Yang Liu

Abstract: We introduce a shift-reduce parsing algorithm for phrase-based string-to-dependency translation. As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models. To resolve conflicts in shift-reduce parsing, we propose a maximum entropy model trained on the derivation graph of training data. As our approach combines the merits of phrase-based and string-to-dependency models, it achieves significant improvements over the two baselines on the NIST Chinese-English datasets.

4 0.11427461 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

Author: Graham Neubig

Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar

5 0.11295735 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

Author: Lemao Liu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao

Abstract: Most statistical machine translation (SMT) systems are modeled using a log-linear framework. Although the log-linear model achieves success in SMT, it still suffers from some limitations: (1) the features are required to be linear with respect to the model itself; (2) features cannot be further interpreted to reach their potential. A neural network is a reasonable method to address these pitfalls. However, modeling SMT with a neural network is not trivial, especially when taking the decoding efficiency into consideration. In this paper, we propose a variant of a neural network, i.e. additive neural networks, for SMT to go beyond the log-linear translation model. In addition, word embedding is employed as the input to the neural network, which encodes each word as a feature vector. Our model outperforms the log-linear translation models with/without embedding features on Chinese-to-English and Japanese-to-English translation tasks.

6 0.1036802 125 acl-2013-Distortion Model Considering Rich Context for Statistical Machine Translation

7 0.10226186 322 acl-2013-Simple, readable sub-sentences

8 0.09622895 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

9 0.094457023 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

10 0.093396388 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

11 0.091866717 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

12 0.09060017 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

13 0.090162411 289 acl-2013-QuEst - A translation quality estimation framework

14 0.090121761 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

15 0.089289822 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

16 0.089083269 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

17 0.087238625 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

18 0.086102657 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

19 0.085524134 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

20 0.082627952 314 acl-2013-Semantic Roles for String to Tree Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, -0.121), (2, 0.099), (3, 0.061), (4, -0.022), (5, 0.041), (6, 0.06), (7, -0.011), (8, -0.004), (9, 0.068), (10, -0.011), (11, 0.074), (12, -0.083), (13, 0.023), (14, 0.051), (15, 0.011), (16, -0.071), (17, -0.022), (18, 0.003), (19, -0.018), (20, -0.007), (21, 0.027), (22, -0.018), (23, -0.038), (24, 0.033), (25, 0.006), (26, 0.007), (27, 0.039), (28, -0.009), (29, 0.026), (30, 0.01), (31, -0.014), (32, -0.019), (33, 0.002), (34, -0.012), (35, -0.041), (36, -0.003), (37, 0.044), (38, -0.051), (39, -0.078), (40, -0.051), (41, -0.049), (42, -0.051), (43, 0.094), (44, -0.033), (45, 0.002), (46, -0.012), (47, -0.026), (48, -0.004), (49, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92631125 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-bysentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of 193 document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm. 
The decoder is written in C++ and follows an objectoriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on Github1 to make it easy for other researchers to use it in their own experiments. 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target lan- guage through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every 1https : //github .com/chardmeier/docent/wiki Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 193–198, stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit. 
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. 
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained 194 by DP search. In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The movephrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearisephrasescanreorderasequenceofphrasesintorandom order and back into the order corresponding to the source language. Since the search algorithm in Docent is stochastic, repeated runs of the decoder will gen- erally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder. 
They have not yet reached a state of maturity that would make them suitable for production use. The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight ofour distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of documentlevel readability models. 
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple documentlevel model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5. Docent can read input files both in the NISTXML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourselevel corpus annotations such as coreference links. 195 These annotations can then be accessed by the feature models. To allow for additional targetlanguage information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses. In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding. 
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted. The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. 
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. 
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method 196 can simply return the precise score as a tight up- per bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all. 5 Readability: A Case Study As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability oftexts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities. Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we al- ways have access to a full document translation, which makes it straightforward to include features directly into the decoder. We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent. 
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).

We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.

Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb

Table 2: Example translation snippets with comments

Feature | BLEU | OVIX | LIX
Baseline | 0.243 | 56.88 | 51.17
TTR | 0.243 | 55.25 | 51.04
OVIX | 0.243 | 54.65 | 51.00
QW | 0.242 | 57.16 | 51.16
QP | 0.243 | 57.07 | 51.06
All | 0.235 | 47.80 | 49.29

Table 1: Results for adding single lexical consistency features to Docent

To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
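LIX is commonly defined as the average sentence length in words plus the percentage of long words, where a long word has more than six characters. The Python sketch below follows that definition. The function name `lix`, the `long_word_length` parameter and the input format (a list of sentences, each a list of word strings) are assumptions for illustration, and the paper's own tokenisation is not specified here.

```python
def lix(sentences, long_word_length=7):
    """LIX readability score for a list of tokenised sentences.

    A word counts as long if it has at least long_word_length characters,
    i.e. more than six characters with the default setting.
    """
    words = [word for sentence in sentences for word in sentence]
    if not sentences or not words:
        raise ValueError("LIX needs at least one non-empty sentence")
    long_words = sum(1 for word in words if len(word) >= long_word_length)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)
```

Under this definition, replacing a long compound by a shorter word lowers the score without changing the sentence length, which is the kind of effect the long-word count features aim for.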
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.

We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease on MT quality.

We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.

There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.

6 Conclusions

In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.

References

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability measure. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

2 0.74989218 328 acl-2013-Stacking for Statistical Machine Translation

Author: Majid Razmara ; Anoop Sarkar

Abstract: We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model.

3 0.71041101 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

Author: Rudolf Rosa ; David Marecek ; Ales Tamchyna

Abstract: Deepfix is a statistical post-editing system for improving the quality of statistical machine translation outputs. It attempts to correct errors in verb-noun valency using deep syntactic analysis and a simple probabilistic model of valency. On the English-to-Czech translation pair, we show that statistical post-editing of statistical machine translation leads to an improvement of the translation quality when helped by deep linguistic knowledge.

4 0.70959038 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

Author: Shachar Mirkin ; Sriram Venkatapathy ; Marc Dymetman ; Ioan Calapodescu

Abstract: The quality of automatic translation is affected by many factors. One is the divergence between the specific source and target languages. Another lies in the source text itself, as some texts are more complex than others. One way to handle such texts is to modify them prior to translation. Yet, an important factor that is often overlooked is the source translatability with respect to the specific translation system and the specific model that are being used. In this paper we present an interactive system where source modifications are induced by confidence estimates that are derived from the translation model in use. Modifications are automatically generated and proposed for the user’s approval. Such a system can reduce post-editing effort, replacing it by cost-effective pre-editing that can be done by monolinguals.

5 0.70934898 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

Author: Graham Neubig

Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar

6 0.6959762 64 acl-2013-Automatically Predicting Sentence Translation Difficulty

7 0.67747164 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

8 0.67522073 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

9 0.67462677 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?

10 0.67285848 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

11 0.66644806 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

12 0.66053814 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

13 0.65712732 363 acl-2013-Two-Neighbor Orientation Model with Cross-Boundary Global Contexts

14 0.65512145 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

15 0.65020949 322 acl-2013-Simple, readable sub-sentences

16 0.64965141 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

17 0.6463058 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

18 0.64123225 135 acl-2013-English-to-Russian MT evaluation campaign

19 0.63615054 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

20 0.62637645 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.073), (6, 0.054), (11, 0.065), (15, 0.01), (24, 0.046), (26, 0.058), (28, 0.016), (35, 0.065), (42, 0.112), (48, 0.036), (70, 0.035), (71, 0.015), (75, 0.116), (87, 0.041), (88, 0.021), (90, 0.058), (95, 0.086)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91387731 49 acl-2013-An annotated corpus of quoted opinions in news articles

Author: Tim O'Keefe ; James R. Curran ; Peter Ashwell ; Irena Koprinska

Abstract: Quotes are used in news articles as evidence of a person’s opinion, and thus are a useful target for opinion mining. However, labelling each quote with a polarity score directed at a textually-anchored target can ignore the broader issue that the speaker is commenting on. We address this by instead labelling quotes as supporting or opposing a clear expression of a point of view on a topic, called a position statement. Using this we construct a corpus covering 7 topics with 2,228 quotes.

same-paper 2 0.90099502 127 acl-2013-Docent: A Document-Level Decoder for Phrase-Based Statistical Machine Translation

Author: Christian Hardmeier ; Sara Stymne ; Jorg Tiedemann ; Joakim Nivre

Abstract: We describe Docent, an open-source decoder for statistical machine translation that breaks with the usual sentence-bysentence paradigm and translates complete documents as units. By taking translation to the document level, our decoder can handle feature models with arbitrary discourse-wide dependencies and constitutes an essential infrastructure component in the quest for discourse-aware SMT models. 1 Motivation Most of the research on statistical machine translation (SMT) that was conducted during the last 20 years treated every text as a “bag of sentences” and disregarded all relations between elements in different sentences. Systematic research into explicitly discourse-related problems has only begun very recently in the SMT community (Hardmeier, 2012) with work on topics such as pronominal anaphora (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012), verb tense (Gong et al., 2012) and discourse connectives (Meyer et al., 2012). One of the problems that hamper the development of cross-sentence models for SMT is the fact that the assumption of sentence independence is at the heart of the dynamic programming (DP) beam search algorithm most commonly used for decoding in phrase-based SMT systems (Koehn et al., 2003). For integrating cross-sentence features into the decoding process, researchers had to adopt strategies like two-pass decoding (Le Nagard and Koehn, 2010). We have previously proposed an algorithm for document-level phrase-based SMT decoding (Hardmeier et al., 2012). Our decoding algorithm is based on local search instead of dynamic programming and permits the integration of 193 document-level models with unrestricted dependencies, so that a model score can be conditioned on arbitrary elements occurring anywhere in the input document or in the translation that is being generated. In this paper, we present an open-source implementation of this search algorithm. 
The decoder is written in C++ and follows an objectoriented design that makes it easy to extend it with new feature models, new search operations or different types of local search algorithms. The code is released under the GNU General Public License and published on Github1 to make it easy for other researchers to use it in their own experiments. 2 Document-Level Decoding with Local Search Our decoder is based on the phrase-based SMT model described by Koehn et al. (2003) and implemented, for example, in the popular Moses decoder (Koehn et al., 2007). Translation is performed by splitting the input sentence into a number of contiguous word sequences, called phrases, which are translated into the target lan- guage through a phrase dictionary lookup and optionally reordered. The choice between different translations of an ambiguous source phrase and the ordering of the target phrases are guided by a scoring function that combines a set of scores taken from the phrase table with scores from other models such as an n-gram language model. The actual translation process is realised as a search for the highest-scoring translation in the space of all the possible translations that could be generated given the models. The decoding approach that is implemented in Docent was first proposed by Hardmeier et al. (2012) and is based on local search. This means that it has a state corresponding to a complete, if possibly bad, translation of a document at every 1https : //github .com/chardmeier/docent/wiki Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 193–198, stage of the search progress. Search proceeds by making small changes to the current search state in order to transform it gradually into a better translation. This differs from the DP algorithm used in other decoders, which starts with an empty translation and expands it bit by bit. 
It is similar to previous work on phrase-based SMT decoding by Langlais et al. (2007), but enables the creation of document-level models, which was not addressed by earlier approaches. Docent currently implements two search algorithms that are different generalisations of the hill climbing local search algorithm by Hardmeier et al. (2012). The original hill climbing algorithm starts with an initial state and generates possible successor states by randomly applying simple elementary operations to the state. After each operation, the new state is scored and accepted if its score is better than that of the previous state, else rejected. Search terminates when the decoder cannot find an acceptable successor state after a certain number of attempts, or when a maximum number of steps is reached. Simulated annealing is a stochastic variant of hill climbing that always accepts moves towards better states, but can also accept moves towards lower-scoring states with a certain probability that depends on a temperature parameter in order to escape local maxima. Local beam search generalises hill climbing in a different way by keeping a beam of a fixed number of multiple states at any time and randomly picking a state from the beam to modify at each move. The original hill climbing procedure can be recovered as a special case of either one of these search algorithms, by calling simulated annealing with a fixed temperature of 0 or local beam search with a beam size of 1. Initial states for the search process can be generated either by selecting a random segmentation with random translations from the phrase table in monotonic order, or by running DP beam search with sentence-local models as a first pass. For the second option, which generally yields better search results, Docent is linked with the Moses decoder and makes direct calls to the DP beam search algorithm implemented by Moses. 
In addition to these state initialisation procedures, Docent can save a search state to a disk file which can be loaded again in a subsequent decoding pass. This saves time especially when running repeated experiments from the same starting point obtained 194 by DP search. In order to explore the complete search space of phrase-based SMT, the search operations in a local search decoder must be able to change the phrase translations, the order of the output phrases and the segmentation of the source sentence into phrases. The three operations used by Hardmeier et al. (2012), change-phrase-translation, resegment and swap-phrases, jointly meet this requirement and are all implemented in Docent. Additionally, Docent features three extra operations, all of which affect the target word order: The movephrases operation moves a phrase to another location in the sentence. Unlike swap-phrases, it does not require that another phrase be moved in the opposite direction at the same time. A pair of operations called permute-phrases and linearisephrasescanreorderasequenceofphrasesintorandom order and back into the order corresponding to the source language. Since the search algorithm in Docent is stochastic, repeated runs of the decoder will gen- erally produce different output. However, the variance of the output is usually small, especially when initialising with a DP search pass, and it tends to be lower than the variance introduced by feature weight tuning (Hardmeier et al., 2012; Stymne et al., 2013a). 3 Available Feature Models In its current version, Docent implements a selection of sentence-local feature models that makes it possible to build a baseline system with a configuration comparable to that of a typical Moses baseline system. The published source code also includes prototype implementations of a few document-level models. These models should be considered work in progress and serve as a demonstration of the cross-sentence modelling capabilities of the decoder. 
They have not yet reached a state of maturity that would make them suitable for production use. The sentence-level models provided by Docent include the phrase table, n-gram language models implemented with the KenLM toolkit (Heafield, 2011), an unlexicalised distortion cost model with geometric decay (Koehn et al., 2003) and a word penalty cost. All of these features are designed to be compatible with the corresponding features in Moses. From among the typical set of baseline features in Moses, we have not implemented the lexicalised distortion model, but this model could easily be added if required. Docent uses the same binary file format for phrase tables as Moses, so the same training apparatus can be used. DP-based SMT decoders have a parameter called distortion limit that limits the difference in word order between the input and the MT output. In DP search, this is formally considered to be a parameter of the search algorithm because it affects the algorithmic complexity of the search by controlling how many translation options must be considered at each hypothesis expansion. The stochastic search algorithm in Docent does not require this limitation, but it can still be useful because the standard models of SMT do not model long-distance reordering well. Docent therefore includes a separate indicator feature to indicate a violated distortion limit. In conjunction with a very large weight, this feature can effectively ensure that the distortion limit is enforced. In contrast with the distortion limit parameter of a DP decoder, the weight ofour distortion limit feature can potentially be tuned to permit occasional distortion limit violations when they contribute to better translations. The document-level models included in Docent include a length parity model, a semantic language model as well as a collection of documentlevel readability models. 
The length parity model is a proof-of-concept model that ensures that all sentences in a document have either consistently odd or consistently even length. It serves mostly as a template to demonstrate how a simple documentlevel model can be implemented in the decoder. The semantic language model was originally proposed by Hardmeier et al. (2012) to improve lexical cohesion in a document. It is a cross-sentence model over sequences of content words that are scored based on their similarity in a word vector space. The readability models serve to improve the readability of the translation by encouraging the selection of easier and more consistent target words. They are described and demonstrated in more detail in section 5. Docent can read input files both in the NISTXML format commonly used to encode documents in MT shared tasks such as NIST or WMT and in the more elaborate MMAX format (Müller and Strube, 2003). The MMAX format makes it possible to include a wide range of discourselevel corpus annotations such as coreference links. 195 These annotations can then be accessed by the feature models. To allow for additional targetlanguage information such as morphological features of target words, Docent can handle simple word-level annotations that are encoded in the phrase table in the same way as target language factors in Moses. In order to optimise feature weights we have adapted the Moses tuning infrastructure to Docent. In this way we can take advantage of all its features, for instance using different optimisation algorithms such as MERT (Och, 2003) or PRO (Hopkins and May, 2011), and selective tuning of a subset of features. Since document features only give meaningful scores on the document level and not on the sentence level, we naturally perform optimisation on document level, which typically means that we need more data than for the optimisation of sentence-based decoding. 
The results we obtain are relatively stable and competitive with sentence-level optimisation of the same models (Stymne et al., 2013a). 4 Implementing Feature Models Efficiently While translating a document, the local search decoder attempts to make a great number of moves. For each move, a score must be computed and tested against the acceptance criterion. An overwhelming majority of the proposed moves will be rejected. In order to achieve reasonably fast decoding times, efficient scoring is paramount. Recomputing the scores of the whole document at every step would be far too slow for the decoder to be useful. Fortunately, score computation can be sped up in two ways. Knowledge about how the state to be scored was generated from its predecessor helps to limit recomputations to a minimum, and by adopting a two-step scoring procedure that just computes the scores that can be calculated with little effort at first, we need to compute the complete score only if the new state has some chance of being accepted. The scores of SMT feature models can usually be decomposed in some way over parts of the document. The traditional models borrowed from sentence-based decoding are necessarily decomposable at the sentence level, and in practice, all common models are designed to meet the constraints of DP beam search, which ensures that they can in fact be decomposed over even smaller sequences of just a few words. For genuine document-level features, this is not the case, but even these models can often be decomposed in some way, for instance over paragraphs, anaphoric links or lexical chains. To take advantage of this fact, feature models in Docent always have access to the previous state and its score and to a list of the state modifications that transform the previous state into the next. 
The scores of the new state are calculated by identifying the parts of a document that are affected by the modifications, subtracting the old scores of this part from the previous score and adding the new scores. This approach to scoring makes feature model implementation a bit more complicated than in DP search, but it gives the feature models full control over how they decompose a document while still permitting efficient decoding. A feature model class in Docent implements three methods. The initDocument method is called once per document when decoding starts. It straightforwardly computes the model score for the entire document from scratch. When a state is modified, the decoder first invokes the estimateScoreUpdate method. Rather than calculating the new score exactly, this method is only required to return an upper bound that reflects the maximum score that could possibly be achieved by this state. The search algorithm then checks this upper bound against the acceptance criterion. Only if the upper bound meets the criterion does it call the updateScore method to calculate the exact score, which is then checked against the acceptance criterion again. The motivation for this two-step procedure is that some models can compute an upper bound approximation much more efficiently than an exact score. For any model whose score is a log probability, a value of 0 is a loose upper bound that can be returned instantly, but in many cases, we can do much better. In the case of the n-gram language model, for instance, a more accurate upper bound can be computed cheaply by subtracting from the old score all log-probabilities of n-grams that are affected by the state modifications without adding the scores of the n-grams replacing them in the new state. This approximation can be calculated without doing any language model lookups at all. 
On the other hand, some models like the distortion cost or the word penalty are very cheap to compute, so that the estimateScoreUpdate method 196 can simply return the precise score as a tight up- per bound. If a state gets rejected because of a low score on one of the cheap models, this means we will never have to compute the more expensive feature scores at all. 5 Readability: A Case Study As a case study we report initial results on how document-wide features can be used in Docent in order to improve the readability oftexts by encouraging simple and consistent terminology (Stymne et al., 2013b). This work is a first step towards achieving joint SMT and text simplification, with the final goal of adapting MT to user groups such as people with reading disabilities. Lexical consistency modelling for SMT has been attempted before. The suggested approaches have been limited by the use of sentence-level decoders, however, and had to resort to procedures like post processing (Carpuat, 2009), multiple decoding runs with frozen counts from previous runs (Ture et al., 2012), or cache-based models (Tiedemann, 2010). In Docent, however, we al- ways have access to a full document translation, which makes it straightforward to include features directly into the decoder. We implemented four features on the document level. The first two features are type token ratio (TTR) and a reformulation of it, OVIX, which is less sensitive to text length. These ratios have been related to the “idea density” of a text (Mühlenbock and Kokkinakis, 2009). We also wanted to encourage consistent translations of words, for which we used the Q-value (Deléger et al., 2006), which has been proposed to measure term quality. We applied it on word level (QW) and phrase level (QP). These features need access to the full target document, which we have in Docent. 
In addition, we included two sentence-level count features for long words that have been used to measure the readability of Swedish texts (Mühlenbock and Kokkinakis, 2009).

We tested our features on English–Swedish translation using the Europarl corpus. For training we used 1,488,322 sentences. As test data, we extracted 20 documents with a total of 690 sentences. We used the standard set of baseline features: 5-gram language model, translation model with 5 weights, a word penalty and a distortion penalty.

Feature    BLEU    OVIX    LIX
Baseline   0.243   56.88   51.17
TTR        0.243   55.25   51.04
OVIX       0.243   54.65   51.00
QW         0.242   57.16   51.16
QP         0.243   57.07   51.06
All        0.235   47.80   49.29

Table 1: Results for adding single lexical consistency features to Docent

Baseline | Readability features | Comment
de ärade ledamöterna (the honourable Members) | ledamöterna (the members) / ni (you) | + Removal of non-essential words
på ett sådant sätt att (in such a way that) | så att (so that) | + Simplified expression
gemenskapslagstiftningen (the community legislation) | gemenskapens lagstiftning (the community’s legislation) | + Shorter words by changing long compound to genitive construction
Världshandelsorganisationen (World Trade Organisation) | WTO (WTO) | − Changing long compound to English-based abbreviation
handlingsplanen (the action plan) | planen (the plan) | − Removal of important word
ägnat särskild uppmärksamhet åt (paid particular attention to) | särskilt uppmärksam på (particular attentive on) | − Bad grammar because of changed part of speech and missing verb

Table 2: Example translation snippets with comments

To evaluate our system we used the BLEU score (Papineni et al., 2002) together with a set of readability metrics, since readability is what we hoped to improve by adding consistency features. Here we used OVIX to confirm a direct impact on consistency, and LIX (Björnsson, 1968), which is a common readability measure for Swedish.
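For readers unfamiliar with LIX, the following Python sketch shows how the score is usually computed. It assumes the common definition (average sentence length in words plus the percentage of words longer than six characters), which the text itself does not restate; the function name, the tokenised input format and the rule for skipping punctuation tokens are illustrative assumptions.

```python
def lix(sentences):
    """LIX readability score for a list of tokenised sentences.

    Assumed definition: words per sentence plus the percentage of
    long words, where a long word has more than six characters.
    """
    # Ignore tokens without any letter, such as punctuation marks.
    words = [tok for sentence in sentences for tok in sentence
             if any(ch.isalpha() for ch in tok)]
    if not sentences or not words:
        raise ValueError("need at least one sentence containing a word")
    long_words = sum(1 for w in words if len(w) > 6)
    return len(words) / len(sentences) + 100.0 * long_words / len(words)
```

Under this definition a higher value means a harder text, so replacing a long compound with a short word lowers the score.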
Unfortunately we do not have access to simplified translated text, so we calculate the MT metrics against a standard reference, which means that simple texts will likely have worse scores than complicated texts closer to the reference translation.

We tuned the standard features using Moses and MERT, and then added each lexical consistency feature with a small weight, using a grid search approach to find values with a small impact. The results are shown in Table 1. As can be seen, for individual features the translation quality was maintained, with small improvements in LIX, and in OVIX for the TTR and OVIX features. For the combination we lost a little bit on translation quality, but there was a larger effect on the readability metrics. When we used larger weights, there was a bigger impact on the readability metrics, with a further decrease in MT quality.

We also investigated what types of changes the readability features could lead to. Table 2 shows a sample of translations where the baseline is compared to systems with readability features. There are both cases where the readability features help and cases where they are problematic. Overall, these examples show that our simple features can help achieve some interesting simplifications.

There is still much work to do on how to take best advantage of the possibilities in Docent in order to achieve readable texts. This attempt shows the feasibility of the approach. We plan to extend this work for instance by better feature optimisation, by integrating part-of-speech tags into our features in order to focus on terms rather than common words, and by using simplified texts for evaluation and tuning.

6 Conclusions

In this paper, we have presented Docent, an open-source document-level decoder for phrase-based SMT released under the GNU General Public License.
Docent is the first decoder that permits the inclusion of feature models with unrestricted dependencies between arbitrary parts of the output, even crossing sentence boundaries. A number of research groups have recently started to investigate the interplay between SMT and discourse-level phenomena such as pronominal anaphora, verb tense selection and the generation of discourse connectives. We expect that the availability of a document-level decoder will make it substantially easier to leverage discourse information in SMT and make SMT models explore new ground beyond the next sentence boundary.

References

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.
Marine Carpuat. 2009. One translation per discourse. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 19–27, Boulder, Colorado.
Louise Deléger, Magnus Merkel, and Pierre Zweigenbaum. 2006. Enriching medical terminologies: an approach based on aligned corpora. In International Congress of the European Federation for Medical Informatics, pages 747–752, Maastricht, The Netherlands.
Zhengxian Gong, Min Zhang, Chew Lim Tan, and Guodong Zhou. 2012. N-gram-based tense models for statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 276–285, Jeju Island, Korea.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier, Joakim Nivre, and Jörg Tiedemann. 2012. Document-wide decoding for phrase-based statistical machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1179–1190, Jeju Island, Korea.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland.
Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54, Edmonton.
Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague, Czech Republic.
Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–113, Skövde, Sweden.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Thomas Meyer, Andrei Popescu-Belis, Najeh Hajlaoui, and Andrea Gesmundo. 2012. Machine translation of labeled discourse connectives. In Proceedings of the Tenth Biennial Conference of the Association for Machine Translation in the Americas (AMTA), San Diego, California, USA.
Katarina Mühlenbock and Sofie Johansson Kokkinakis. 2009. LIX 68 revisited – an extended readability. In Proceedings of the Corpus Linguistics Conference, Liverpool, UK.
Christoph Müller and Michael Strube. 2003. Multilevel annotation in MMAX. In Proceedings of the Fourth SIGdial Workshop on Discourse and Dialogue, pages 198–207, Sapporo, Japan.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
Sara Stymne, Christian Hardmeier, Jörg Tiedemann, and Joakim Nivre. 2013a. Feature weight optimization for discourse-level SMT. In Proceedings of the Workshop on Discourse in Machine Translation (DiscoMT), Sofia, Bulgaria.
Sara Stymne, Jörg Tiedemann, Christian Hardmeier, and Joakim Nivre. 2013b. Statistical machine translation with readability constraints. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), pages 375–386, Oslo, Norway.
Jörg Tiedemann. 2010. Context adaptation in statistical machine translation using models with exponentially decaying cache. In Proceedings of the ACL 2010 Workshop on Domain Adaptation for Natural Language Processing (DANLP), pages 8–15, Uppsala, Sweden.
Ferhan Ture, Douglas W. Oard, and Philip Resnik. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 417–426, Montréal, Canada.

3 0.84338367 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

Author: Wenduan Xu ; Yue Zhang ; Philip Williams ; Philipp Koehn

Abstract: We present a context-sensitive chart pruning method for CKY-style MT decoding. Source phrases that are unlikely to have aligned target constituents are identified using sequence labellers learned from the parallel corpus, and speed-up is obtained by pruning corresponding chart cells. The proposed method is easy to implement, orthogonal to cube pruning and additive to its pruning power. On a full-scale English-to-German experiment with a string-to-tree model, we obtain a speed-up of more than 60% over a strong baseline, with no loss in BLEU.

4 0.84266341 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

Author: Joanne Boisson ; Ting-Hui Kao ; Jian-Cheng Wu ; Tzu-Hsi Yen ; Jason S. Chang

Abstract: In this paper, we introduce a Web-scale linguistic search engine, Linggle, that retrieves lexical bundles in response to a given query. The query might contain keywords, wildcards, wild parts of speech (PoS), synonyms, and additional regular expression (RE) operators. In our approach, we incorporate inverted file indexing, PoS information from BNC, and semantic indexing based on Latent Dirichlet Allocation with Google Web 1T. The method involves parsing the query to transform it into several keyword retrieval commands. Word chunks are retrieved with counts, the chunks are further filtered with the query as an RE, and the results are finally displayed according to the counts, similarities, and topics. Clusters of synonyms or conceptually related words are also provided. In addition, Linggle provides example sentences from The New York Times on demand. The current implementation of Linggle is the most functionally comprehensive, and is in principle language and dataset independent. We plan to extend Linggle to provide fast and convenient access to a wealth of linguistic information embodied in Web scale datasets including Google Web 1T and Google Books Ngram for many major languages in the world.

5 0.83722383 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

Author: Ulle Endriss ; Raquel Fernandez

Abstract: Crowdsourcing, which offers new ways of cheaply and quickly gathering large amounts of information contributed by volunteers online, has revolutionised the collection of labelled data. Yet, to create annotated linguistic resources from this data, we face the challenge of having to combine the judgements of a potentially large group of annotators. In this paper we investigate how to aggregate individual annotations into a single collective annotation, taking inspiration from the field of social choice theory. We formulate a general formal model for collective annotation and propose several aggregation methods that go beyond the commonly used majority rule. We test some of our methods on data from a crowdsourcing experiment on textual entailment annotation.

6 0.83141953 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

7 0.83048964 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

8 0.82886112 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

9 0.82784611 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

10 0.82579345 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

11 0.82325697 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

12 0.82138592 328 acl-2013-Stacking for Statistical Machine Translation

13 0.82012677 166 acl-2013-Generalized Reordering Rules for Improved SMT

14 0.82004452 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

15 0.81856924 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

16 0.81817245 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

17 0.81606323 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation

18 0.81249517 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

19 0.81165326 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

20 0.81163859 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing