emnlp emnlp2012 emnlp2012-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu
Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We pro- pose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.
Reference: text
sentIndex sentText sentNum sentScore
1 tj p , , , , Abstract In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. [sent-7, score-0.388]
2 In this paper, we propose a novel local training method to address these two problems. [sent-11, score-0.262]
3 Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. [sent-12, score-0.545]
4 We pro- pose efficient incremental training methods to put the local training into practice. [sent-13, score-0.472]
5 In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2. [sent-14, score-0.517]
6 1 Introduction Och and Ney (2002) introduced the log-linear model for statistical machine translation (SMT), in which translation is considered as the following optimization problem: 402 eˆ(f;W) = argemaxP(e|f;W) = argemaxPee0xepxp? [sent-16, score-0.51]
7 All these training methods follow the same pipeline: they train only a single weight on a given development set, and then use it to translate all the sentences in a test set. [sent-31, score-0.279]
8 Firstly, on the document level, the performance of these methods is dependent on the choice of a development set, which may potentially lead to an unstable translation performance for testing. [sent-36, score-0.374]
9 cSoirnrecse thoend tsra tnos afo rpmosesdit icvleas esxiaficmaptiloen d problem sis ” n•”ot, linearly separable, there does not exist a single weight which can obtain e11 and e21 as translation results meanwhile. [sent-45, score-0.37]
10 However it is inefficient since tuning requires iteratively decoding an entire development set, which is impractical for an online service. [sent-52, score-0.328]
11 Secondly, translation becomes inconsistent on the sentence level (Ma et al. [sent-53, score-0.298]
12 Global training method such as MERT tries to optimize the weight towards the best performance for the whole set, and it can not necessarily always obtain good translation for every sentence in the development set. [sent-55, score-0.576]
13 When we tune examples in Figure 1(a) by MERT, it can be regarded as a nonlinearly separable classification problem illustrated in Figure 1(b). [sent-58, score-0.284]
14 Therefore, there exists no single weight W which simultaneously obtains e11 and e21 as translation for f1 and f2 via Equation (1). [sent-59, score-0.37]
15 , 2006), we propose a local training method, which trains sentence-wise weights instead of a single weight, to address the above two problems. [sent-62, score-0.37]
16 Compared with global training methods, such as 403 MERT, in which training and testing are separated, our method works in an online fashion, in which training is performed during testing. [sent-63, score-0.368]
17 This online fashion has an advantage in that it can adapt the weights for each of the test sentences, by dynamically tuning the weights on translation examples which are similar to these test sentences. [sent-64, score-0.636]
18 Similar to the method of development set automatical selection, the local training method may also suffer the problem of efficiency. [sent-65, score-0.332]
19 To put it into practice, we propose incremental training methods which avoid retraining and iterative decoding on a development set. [sent-66, score-0.466]
20 Our local training method has two advantages: firstly, it significantly outperforms MERT, especially when test set is different from the development set; secondly, it improves the translation consistency. [sent-67, score-0.623]
21 Experiments on NIST Chinese-to-English translation tasks show that our local training method significantly gains over MERT, with the maximum improvements up to 2. [sent-68, score-0.517]
22 2 Local Training and Testing The local training method (Bottou and Vapnik, 1992) is widely employed in computer vision (Zhang et al. [sent-70, score-0.262]
23 Compared with the global training method which tries to fit a single weight on the training data, the local one learns weights based on the local neighborhood information for each test example. [sent-73, score-0.848]
24 The local training in SMT is described in the Algorithm 1. [sent-77, score-0.262]
25 For each sentence ti in test set, training examples Di is retrieved from D using a similarity measure (line 2), a weight Wi is optimized on Dev and Di (line 3)1, and, finally, ti is decoded with Wi for testing (line 4). [sent-78, score-0.788]
26 At the end of this algorithm, it returns the translation re- sults for T. [sent-79, score-0.255]
27 Note that weights are adapted for each test sentence ti in line 3 by utilizing the translation examples Di which are similar to ti. [sent-80, score-0.686]
28 Thus, our local training method can be considered as an adaptation of translation weights. [sent-81, score-0.553]
29 It is impractical to train a weight Wi on Dev and Di from scratch for every sentence, since iteratively decoding Dev and Di is time consuming when we apply MERT. [sent-83, score-0.268]
30 On the first phase, we use a global training method, like MERT, to tune a baseline weight on the development set Dev in an offline manner. [sent-85, score-0.446]
31 On the second phase, we utilize the retrieved examples to incrementally tune sentence-wise local weights based on the baseline weight. [sent-86, score-0.595]
32 On the phase of incremental training, we perform decoding only once for retrieved examples Di, though several rounds of decoding are possible and potentially better if one does not seriously care about training speed. [sent-90, score-0.652]
33 Furthermore, instead of on-the-fly decoding, we decode the retrieval data D offline using the parameter from our baseline weight and its nbest translation candidates are saved with training examples to increase the training efficiency. [sent-91, score-0.751]
34 It globally trains a baseline weight Wb (line 1), and decodes each sentence in retrieval data D with the weight Wb (line 2). [sent-93, score-0.418]
35 For each sentence ti in test set T, it first retrieves training examples Di from D (line 4), and then it runs local training to tune a local weight Wi (line 5) and performs testing with Wi for ti (line 6). [sent-94, score-1.204]
36 Please note that the two-phase training contains global training in line 1 and local training in line 5. [sent-95, score-0.634]
37 From Algorithm 2, one can see that our method is effective even if the test set is unknow, for example, × in the scenario of online translation services, since the global training on development set and decoding on retrieval data can be performed offline. [sent-96, score-0.741]
38 In the next two sections, we will discuss the details about the similarity metric in line 4 and the incremental training in line 5 of Algorithm 2. [sent-97, score-0.481]
39 3 Acquiring Training Examples In line 4 of Algorithm 2, to retrieve training examples for the sentence ti , we first need a metric to retrieve similar translation examples. [sent-98, score-0.827]
40 We assume that the metric satisfy the property: more similar the test sentence and translation examples are, the better translation result one obtains when decoding the test sentence with the weight trained on the translation examples. [sent-99, score-1.276]
41 Like examplebased translation in which similar source sentences have similar translations, we assume that the optimal translation weights of the similar source sentences are closer. [sent-103, score-0.621]
42 4 Incremental Training Based on Ultraconservative Update Compared with retraining mode, incremental training can improve the training efficiency. [sent-104, score-0.351]
43 In the field of machine learning research, incremental training has been employed in the work (Cauwenberghs and Poggio, 2001 ; Shilton et al. [sent-105, score-0.21]
44 translation example, is unavailable until actually decoding the example, since the derivation is a latent variable. [sent-109, score-0.358]
45 Our goal ifs to findi an optimal weight, denoted by Wi, which is a local weight and used for decoding the sentence ti. [sent-113, score-0.465]
46 Unlike the global method which performs tuning on the whole development set Dev + Di as in Algorithm 1, Wi can be incrementally learned by optimizing on Di based on Wb. [sent-114, score-0.258]
47 We employ the idea of ultraconservative update (Crammer and Singer, 2003; Crammer et al. [sent-115, score-0.262]
48 , 2006) to propose two incremental methods for local training in Algorithm 2 as follows. [sent-116, score-0.414]
49 It desires that the optimal weight Wi is not only close to the baseline weight Wb, but also achieves the low loss over the retrieved examples Di. [sent-118, score-0.51]
50 The idea of ultraconservative update can be formalized as follows: mWinnd(W,Wb) + λ · Loss(Di,W)o, (3) where d(W, Wb) is a distance metric over a pair of weights W and Wb. [sent-119, score-0.379]
51 , 2006) is a form of ultraconservative update in (3) whose Loss is defined as hinge loss based on margin over the pairwise translation candiates in Di. [sent-127, score-0.686]
52 with ∆h(fj , ejn) = h(fj , ej·) − h(fj , ejn) , (4) where h(fj , e) is the feature vector of candidate e, ejn is a translation member of fj in cj, ej· is the oracle one in cj, ‘jn is a loss between ej· and ejn and it is the same as referred in (Chiang et al. [sent-130, score-0.679]
53 , 2008) employing the MIRA to globally train SMT, in this paper, we apply MIRA as one of local training method for SMT and we call it as margin based ultraconservative update (MBUU for shortly) to highlight its advantage of incremental training in line 5 of Algorithm 2. [sent-134, score-0.861]
54 MBUU is a batch update mode which updates the weight with all training examples, but MIRA is an online one which updates with each example (Watanabe et al. [sent-138, score-0.297]
55 2 Error Rate Based Ultraconservative Update Instead of taking into account the margin-based hinge loss between a pair of translations as the Loss in (3), we directly optimize the error rate of translation candidates with respect to their references in Di. [sent-143, score-0.513]
56 Formally, the objective function of error rate based ultraconservative update (EBUU) is as follows: 12kW − Wbk2+KλjXK=1Error(rj; ˆ e(fj;W)), (5) where e(fj ; W) is defined in Equation (1), and Error(rj , e) is the sentence-wise minus BLEU (Papineni et al. [sent-144, score-0.346]
57 1 Setting We conduct our experiments on the Chinese-toEnglish translation task. [sent-166, score-0.255]
58 In our experiments the translation performances are measured by case-insensitive BLEU4 metric (Papineni et al. [sent-172, score-0.31]
59 We use an in-house developed hierarchical phrase-based translation (Chiang, 2005) as our baseline system, and we denote it as In-Hiero. [sent-176, score-0.255]
60 63 Table 2: The efficiency of the local training and testing measured by sentence averaged runtime. [sent-179, score-0.379]
61 08+ Table 3: The performance comparison of local training methods (MBUU and EBUU) and a global method (MERT). [sent-189, score-0.338]
62 To compare the local training method in Algorithm 2, we use a standard global training method, MERT, as the baseline training method. [sent-198, score-0.454]
63 We do not compare with Algorithm 1, in which retraining is performed for each input sentence, since retraining for the whole test set is impractical given that each sentence-wise retraining may take some hours or even days. [sent-199, score-0.335]
64 We translate retrieval data with Wb to obtain their 100 best translation candidates. [sent-206, score-0.354]
65 Table 2 depicts that testing each sentence with local training method takes 2. [sent-227, score-0.379]
66 Further, compared to the retrieval, the local training is not the bottleneck. [sent-231, score-0.262]
67 Actually, if we use LSH technique (Andoni and Indyk, 2008) in retrieval process, the local method can be easily scaled to a larger training data. [sent-232, score-0.361]
68 3 Results and Analysis Table 3 shows the main results of our local training methods. [sent-234, score-0.262]
69 Both of these two local training methods achieve significant improvements over the MERT baseline, which proves the effectiveness of our local training method over global training method. [sent-238, score-0.658]
70 Although both local methods MBUU and EBUU achieved improvements on all the datasets, their gains on NIST06 and NIST08 are significantly higher than those achieved on NIST05 test dataset. [sent-239, score-0.24]
71 T02 3860 Table 5: The comparison of MERT with different development datasets and local training method based on EBUU. [sent-244, score-0.332]
72 nts local training has for the sentences in this test set. [sent-245, score-0.298]
73 To test our hypothesis, we measured the similarity between the development set and a test set by the average of accumulated TF-IDF scores of development dataset and each sentence in test datasets. [sent-246, score-0.327]
74 Table 4 shows that NIST06 and NIST08 are more different from NIS02 than NIST05, thus, this is potentially the reason why local training is more effective on NIST06 and NIST08. [sent-247, score-0.262]
75 We can see that, with the help of the local training, we still gain much even if we selected an unsatisfactory development data. [sent-255, score-0.274]
76 Besides obtaining improvements on document level as referred in Table 3, the local training methods can also achieve consistent improvements on sentence level and thus can improve the users’ experiences. [sent-258, score-0.305]
77 As mentioned in Section 4, if the retrieved examples are very similar to the test sentence, the better performance will be achieved with the larger λ. [sent-265, score-0.222]
78 Actually, in Table 7 there seems to be little dependency between the numbers of examples retrieved and the translation qualities, although they are positively re- Retrie4v0al Size2N7I. [sent-275, score-0.441]
79 Table 8 presents the performance of the oracle translations selected from the 1-best translation results ofMERT and EBUU. [sent-290, score-0.297]
80 Clearly, there exists more potential improvement for local training method. [sent-291, score-0.262]
81 All the methods mentioned above train a single weight for the whole development set, whereas our local training method learns a weight for each sentence. [sent-299, score-0.562]
82 Further, our translation framework integrates the training and testing into one unit, instead oftreating them separately. [sent-300, score-0.387]
83 Our method resorts to some translation examples, which is similar as example-based translation or translation memory (Watanabe and Sumita, 2003; He et al. [sent-302, score-0.765]
84 Instead of using translation examples to construct translation rules for enlarging the decoding space, we employed them 409 to discriminatively learn local weights. [sent-305, score-0.897]
85 Their methods utilize the retrieved examples to acquire translation model and can be seen as the adaptation of translation model. [sent-309, score-0.732]
86 However, ours uses the retrieved examples to tune the weights and thus can be considered as the adaptation of tuning. [sent-310, score-0.376]
87 Furthermore, since ours does not change the translation model which needs to run GIZA++ and it incrementally trains local weights, our method can be applied for online translation service. [sent-311, score-0.855]
88 7 Conclusion and Future Work This paper proposes a novel local training framework for SMT. [sent-312, score-0.262]
89 First, instead of training only one weight for document level, it trains a single weight for sentence level. [sent-314, score-0.377]
90 Second, instead of considering the training and testing as two separate units, we unify the training and testing into one unit, which can employ the information of test sentences and perform sentencewise local adaptation of weights. [sent-315, score-0.589]
91 Local training can not only alleviate the problem of the development data selection, but also reduce the risk of sentence-wise bad translation results, thus consistently improve the translation performance. [sent-316, score-0.638]
92 With the help of incremental training methods, the time incurred by local training was negligible and the local training and testing totally took 2. [sent-319, score-0.808]
93 In the future work, we will further investigate the local training method, since there are more room for improvements as observed in our experiments. [sent-321, score-0.262]
94 We will test our method on other translation models and larger training data6. [sent-322, score-0.349]
95 Acknowledgments We would like to thank Hongfei Jiang and Shujie Liu for many valuable discussions and thank three 6Intuitionally, when the corpus of translation examples is larger, the retrieval results in Algorithm 2 are much similar as the test sentence. [sent-323, score-0.47]
96 Online large-margin training of syntactic and structural translation features. [sent-364, score-0.313]
97 Adaptation of the translation model for statistical machine translation based on information retrieval. [sent-404, score-0.51]
98 Improving statistical machine translation performance by training data selection and optimization. [sent-437, score-0.313]
99 Consistent translation using discriminative learning - a translation memory-inspired approach. [sent-442, score-0.51]
100 A simplex armijo downhill algorithm for optimizing statistical machine translation decoding parameters. [sent-508, score-0.358]
wordName wordTfidf (topN-words)
[('mert', 0.293), ('mbuu', 0.27), ('translation', 0.255), ('ebuu', 0.246), ('wb', 0.229), ('local', 0.204), ('dev', 0.185), ('ultraconservative', 0.182), ('incremental', 0.152), ('watanabe', 0.141), ('di', 0.137), ('fj', 0.134), ('ti', 0.12), ('weight', 0.115), ('retrieved', 0.106), ('decoding', 0.103), ('retrieval', 0.099), ('ejn', 0.098), ('bleu', 0.094), ('loss', 0.094), ('tune', 0.092), ('line', 0.09), ('och', 0.09), ('chiang', 0.089), ('wi', 0.088), ('retraining', 0.083), ('update', 0.08), ('moses', 0.08), ('examples', 0.08), ('smt', 0.079), ('stroudsburg', 0.077), ('global', 0.076), ('testing', 0.074), ('mira', 0.071), ('development', 0.07), ('separable', 0.063), ('retrieve', 0.063), ('weights', 0.062), ('tuning', 0.061), ('josef', 0.059), ('association', 0.058), ('training', 0.058), ('taro', 0.057), ('crammer', 0.056), ('metric', 0.055), ('cj', 0.055), ('hopkins', 0.052), ('decode', 0.051), ('incrementally', 0.051), ('impractical', 0.05), ('bottou', 0.05), ('phase', 0.05), ('andoni', 0.049), ('cauwenberghs', 0.049), ('cji', 0.049), ('examplebased', 0.049), ('hfji', 0.049), ('inhiero', 0.049), ('jxk', 0.049), ('nonlinearly', 0.049), ('resluts', 0.049), ('sentencewise', 0.049), ('shilton', 0.049), ('uneven', 0.049), ('unstable', 0.049), ('yanjun', 0.049), ('yifan', 0.049), ('rj', 0.047), ('trains', 0.046), ('rate', 0.045), ('online', 0.044), ('sentence', 0.043), ('ej', 0.043), ('andy', 0.042), ('arrives', 0.042), ('hildebrand', 0.042), ('jk', 0.042), ('translations', 0.042), ('galley', 0.041), ('koehn', 0.041), ('zhao', 0.041), ('pa', 0.04), ('error', 0.039), ('franz', 0.039), ('hinge', 0.038), ('harbin', 0.038), ('vapnik', 0.038), ('blunsom', 0.038), ('mt', 0.037), ('ma', 0.037), ('margin', 0.037), ('similarity', 0.036), ('test', 0.036), ('adaptation', 0.036), ('quirk', 0.035), ('tries', 0.035), ('seconds', 0.035), ('offline', 0.035), ('fbis', 0.035), ('sumita', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu
Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We pro- pose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.
2 0.1558222 67 emnlp-2012-Inducing a Discriminative Parser to Optimize Machine Translation Reordering
Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori
Abstract: This paper proposes a method for learning a discriminative parser for machine translation reordering using only aligned parallel text. This is done by treating the parser’s derivation tree as a latent variable in a model that is trained to maximize reordering accuracy. We demonstrate that efficient large-margin training is possible by showing that two measures of reordering accuracy can be factored over the parse tree. Using this model in the pre-ordering framework results in significant gains in translation accuracy over standard phrasebased SMT and previously proposed unsupervised syntax induction methods.
3 0.14529538 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
Author: Nan Duan ; Mu Li ; Ming Zhou
Abstract: A forced derivation tree (FDT) of a sentence pair {f, e} denotes a derivation tree that can tpraainrsl {afte, f} idnetono itetss a acc duerraivtea target etrea tnhsaltat cioann e. In this paper, we present an approach that leverages structured knowledge contained in FDTs to train component models for statistical machine translation (SMT) systems. We first describe how to generate different FDTs for each sentence pair in training corpus, and then present how to infer the optimal FDTs based on their derivation and alignment qualities. As the first step in this line of research, we verify the effectiveness of our approach in a BTGbased phrasal system, and propose four FDTbased component models. Experiments are carried out on large scale English-to-Japanese and Chinese-to-English translation tasks, and significant improvements are reported on both translation quality and alignment quality.
4 0.14180109 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation
Author: Christian Hardmeier ; Joakim Nivre ; Jorg Tiedemann
Abstract: Independence between sentences is an assumption deeply entrenched in the models and algorithms used for statistical machine translation (SMT), particularly in the popular dynamic programming beam search decoding algorithm. This restriction is an obstacle to research on more sophisticated discourse-level models for SMT. We propose a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models. We explore the stability and the search parameters ofthis method and demonstrate that it can be successfully used to optimise a document-level semantic language model. 1 Motivation In the field oftranslation studies, it is undisputed that discourse-wide context must be considered care- fully for good translation results (Hatim and Mason, 1990). By contrast, the state of the art in statistical machine translation (SMT), despite significant advances in the last twenty years, still assumes that texts can be translated sentence by sentence under strict independence assumptions, even though it is well known that certain linguistic phenomena such as pronominal anaphora cannot be translated correctly without referring to extra-sentential context. This is true both for the phrase-based and the syntaxbased approach to SMT. In the rest of this paper, we shall concentrate on phrase-based SMT. One reason why it is difficult to experiment with document-wide models for phrase-based SMT is that the dynamic programming (DP) algorithm 1179 which has been used almost exclusively for decoding SMT models in the recent literature has very strong assumptions of locality built into it. DP beam search for phrase-based SMT was described by Koehn et al. (2003), extending earlier work on word-based SMT (Tillmann et al., 1997; Och et al., 2001 ; Tillmann and Ney, 2003). This algorithm con- structs output sentences by starting with an empty hypothesis and adding output words at the end until translations for all source words have been generated. The core models of phrase-based SMT, in particular the n-gram language model (LM), only depend on a constant number of output words to the left of the word being generated. This fact is exploited by the search algorithm with a DP technique called hypothesis recombination (Och et al., 2001), which permits the elimination of hypotheses from the search space if they coincide in a certain number of final words with a better hypothesis and no future expansion can possibly invert the relative ranking of the two hypotheses under the given models. Hypothesis recombination achieves a substantial reduction of the search space without affecting search optimality and makes it possible to use aggressive pruning techniques for fast search while still obtaining good results. The downside of this otherwise excellent approach is that it only works well with models that have a local dependency structure similar to that of an n-gram language model, so they only depend on a small context window for each target word. Sentence-local models with longer dependencies can be added, but doing so greatly increases the risk for search errors by inhibiting hypothesis recombination. Cross-sentence dependencies cannot be directly integrated into DP SMT decoding in LParnogcue agdein Lgesa ornf tihneg, 2 p0a1g2e Jso 1in17t C9–o1n1f9e0re,n Jce ju on Is Elanmdp,ir Kicoarlea M,e 1t2h–o1d4s J iunly N 2a0tu1r2a.l ? Lc a2n0g1u2ag Aes Psorcoicaetsiosin fgo arn Cdo Cmopmutpauti oantiaoln Lailn Ngautiustriacls any obvious way, especially if joint optimisation of a number of interdependent decisions over an entire document is required. Research into models with a more varied, non-local dependency structure is to some extent stifled by the difficulty of decoding such models effectively, as can be seen by the problems some researchers encountered when they attempted to solve discourse-level problems. Consider, for instance, the work on cache-based language models by Tiedemann (2010) and Gong et al. (201 1), where error propagation was a serious issue, or the works on pronominal anaphora by Le Nagard and Koehn (2010), who implemented cross-sentence dependencies with an ad-hoc two-pass decoding strategy, and Hardmeier and Federico (2010) with the use of an external decoder driver to manage backward-only dependencies between sentences. In this paper, we present a method for decoding complete documents in phrase-based SMT. Our decoder uses a local search approach whose state consists of a complete translation of an entire document at any time. The initial state is improved by the application of a series of operations using a hill climbing strategy to find a (local) maximum of the score function. This setup gives us complete freedom to define scoring functions over the entire document. Moreover, by optionally initialising the state with the output of a traditional DP decoder, we can ensure that the final hypothesis is no worse than what would have been found by DP search alone. We start by describing the decoding algorithm and the state operations used by our decoder, then we present empirical results demonstrating the effectiveness of our approach and its usability with a document-level semantic language model, and finally we discuss some related work. 2 SMT Decoding by Hill Climbing In this section, we formally describe the phrasebased SMT model implemented by our decoder as well as the decoding algorithm we use. 2.1 SMT Model Our decoder is based on local search, so its state at any time is a representation of a complete translation of the entire document. Even though the decoder operates at the document level, it is important to keep 1180 track of sentence boundaries, and the individual operations that are applied to the state are still confined to sentence scope, so it is useful to decompose the state of a document into the state of its sentences, and we define the overall state S as a sequence of sentence states: S = S1S2 . . .SN, (1) where N is the number of sentences. This implies that we constrain the decoder to emit exactly one output sentence per input sentence. Let ibe the number of a sentence and mi the number of input tokens of this sentence, p and q (with 1 ≤ p ≤ q ≤ mi) be positions in the input sentence a1n ≤d [p; q] qde ≤no mte the set ofpositions from p up to and including q. We say that [p; q] precedes [p0; q0], or [p; q] ≺ [p0; q0], if q < p0. Let Φi([p; q]) be the set of t[pra;nqs]l ≺atio [pns for the source phrase covering positions [p; q] in the input sentence ias given by the phrase table. We call A = h[p; q] ,φi an anchored phrase pair w.it Wh coverage C(A) = [p; q] nif a φ ∈ Φi([p; q]) sise a target phrase translating =th [ep source w∈o Φrds at positions [p; q] . Then a sequence of ni anchored phrase pairs Si = A1A2 . . .Ani (2) is a valid sentence state for sentence iif the following two conditions hold: 1. The coverage sets C(Aj) for j in 1, . . . , ni are mutually disjoint, and 2. the anchored phrase pairs jointly cover the complete input sentence, or [niC(Aj) = [1;mi]. (3) [j=1 Let f(S) be a scoring function mapping a state S to a real number. As usual in SMT, it is assumed that the scoring function can be decomposed into a linear combination of K feature functions hk(S), each with a constant weight λk, so f(S) =k∑K=1λkhk(S). (4) The problem addressed by the decoder is the search for the state with maximal score, such that Sˆ Sˆ = argSmaxf(S). (5) The feature functions implemented in our baseline system are identical to the ones found in the popular Moses SMT system (Koehn et al., 2007). In particular, our decoder has the following feature functions: 1. phrase translation scores provided by the phrase table including forward and backward conditional probabilities, lexical weights and a phrase penalty (Koehn et al., 2003), 2. n-gram language model scores implemented with the KenLM toolkit (Heafield, 2011), 3. a word penalty score, 4. a distortion model with geometric (Koehn et al., 2003), and decay 5. a feature indicating the number of times a given distortion limit is exceeded in the current state. In our experiments, the last feature is used with a fixed weight of negative infinity in order to limit the gaps between the coverage sets of adjacent anchored phrase pairs to a maximum value. In DP search, the distortion limit is usually enforced directly by the search algorithm and is not added as a feature. In our decoder, however, this restriction is not required to limit complexity, so we decided to add it among the scoring models. 2.2 Decoding Algorithm The decoding algorithm we use (algorithm 1) is very simple. It starts with a given initial document state. In the main loop, which extends from line 3 to line 12, it generates a successor state S0 for the current state S by calling the function Neighbour, which non-deterministically applies one of the operations described in section 3 of this paper to S. The score of the new state is compared to that of the previous one. If it meets a given acceptance criterion, S0 becomes the current state, else search continues from the previous state S. For the experiments in this paper, we use the hill climbing acceptance criterion, which simply accepts a new state if its score is higher than that of the current state. Other acceptance criteria are possible and could be used to endow the search algorithm with stochastic behaviour. 1181 The main loop is repeated until a maximum number of steps (step limit) is reached or until a maximum number of moves are rejected in a row (rejection limit). Algorithm 1 Decoding algorithm Input: an initial document state S; search parameters maxsteps and maxrejected Output: a modified document state 1: nsteps ← 0 2: nrejected ← 0 3: nwrhejileec nsteps < maxsteps and nrejected < maxrejected do 4: S0 ← Neighbour (S) 5: if Accept (f(S0) , f(S)) then 6: S ← S0 7: nrejected ← 0 8: elsner 9: nrejected ← nrejected + 1 10: enndr eifj 11: nsteps ← nsteps + 1 12: ennds wtephsile ← 13: return S A notable difference between this algorithm and other hill climbing algorithms that have been used for SMT decoding (Germann et al., 2004; Langlais et al., 2007) is its non-determinism. Previous work for sentence-level decoding employed a steepest ascent strategy which amounts to enumerating the complete neighbourhood of the current state as defined by the state operations and selecting the next state to be the best state found in the neighbourhood of the current one. Enumerating all neighbours of a given state, costly as it is, has the advantage that it makes it easy to prove local optimality of a state by recognising that all possible successor states have lower scores. It can be rather inefficient, since at every step only one modification will be adopted; many of the modifications that are discarded will very likely be generated anew in the next iteration. As we extend the decoder to the document level, the size of the neighbourhood that would have to be explored in this way increases considerably. Moreover, the inefficiency of the steepest ascent approach potentially increases as well. Very likely, a promising move in one sentence will remain promising after a modification has been applied to another sentence, even though this is not guaranteed to be true in the presence of cross-sentence models. We therefore adopt a first-choice hill climbing strategy that non-deterministically generates successor states and accepts the first one that meets the acceptance criterion. This frees us from the necessity of generating the full set of successors for each state. On the downside, if the full successor set is not known, it is no longer possible to prove local optimality of a state, so we are forced to use a different condition for halting the search. We use a combination of two limits: The step limit is a hard limit on the resources the user is willing to expend on the search problem. The value of the rejection limit determines how much of the neighbourhood is searched for better successors before a state is accepted as a solution; it is related to the probability that a state returned as a solution is in fact locally optimal. To simplify notations in the description of the individual state operations, we write Si −→ Si0 (6) to signify that a state operation, when presented with a document state as in equation 1 and acting on sentence i, returns a new document state of S0 = S1 . . .Si−1 Si0 Si+1 . . .SN. (7) Similarly, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (8) is equivalent to Si −→ A1 . . .Aj−1 A01 . . .A0h0 Aj+h . . .Ani (9) and indicates that the operation returns a state in which a sequence of h consecutive anchored phrase pairs has been replaced by another sequence of h0 anchored phrase pairs. 2.3 Efficiency Considerations When implementing the feature functions for the decoder, we have to exercise some care to avoid recomputing scores for the whole document at every iteration. To achieve this, the scores are computed completely only once, at the beginning of the decoding run. In subsequent iterations, scoring functions are presented with the scores of the previous 1182 iteration and a list of modifications produced by the state operation, a set of tuples hi, r, s,A01 . . .A0h0i, each indicating tthioant ,t ahe s edto ocfu tmupelnets s hhio,ru,sld, Abe modifii,e eda as described by Si :Ar . . .As −→ A01 . . .A0h0 . (10) If a feature function is decomposable in some way, as all the standard features developed under the constraints of DP search are, it can then update the state simply by subtracting and adding score components pertaining to the modified parts of the document. Feature functions have the possibility to store their own state information along with the document state to make sure the required information is available. Thus, the framework makes it possible to exploit decomposability for efficient scoring without impos- ing any particular decomposition on the features as beam search does. To make scoring even more efficient, scores are computed in two passes: First, every feature function is asked to provide an upper bound on the score that will be obtained for the new state. In some cases, it is possible to calculate reasonable upper bounds much more efficiently than computing the exact feature value. If the upper bound fails to meet the acceptance criterion, the new state is discarded right away; if not, the full score is computed and the acceptance criterion is tested again. Among the basic SMT models, this two-pass strategy is only used for the n-gram LM, which requires fairly expensive parameter lookups for scoring. The scores of all the other baseline models are fully computed during the first scoring pass. The n-gram model is more complex. In its state information, it keeps track of the LM score and LM library state for each word. The first scoring pass then identifies the words whose LM scores are affected by the current search step. This includes the words changed by the search operation as well as the words whose LM history is modified. The range of the history de- pendencies can be determined precisely by considering the “valid state length” information provided by the KenLM library. In the first pass, the LM scores of the affected words are subtracted from the total score. The model only looks up the new LM scores for the affected words and updates the total score if the new search state passes the first acceptance check. This two-pass scoring approach allows us to avoid LM lookups altogether for states that will be rejected anyhow because of low scores from the other models, e. g. because the distortion limit is violated. Model score updates become more complex and slower as the number of dependencies of a model increases. While our decoding algorithm does not impose any formal restrictions on the number or type of dependencies that can be handled, there will be practical limits beyond which decoding becomes unacceptably slow or the scoring code becomes very difficult to maintain. These limits are however fairly independent of the types of dependencies handled by a model, which permits the exploration of more varied model types than those handled by DP search. 2.4 State Initialisation Before the hill climbing decoding algorithm can be run, an initial state must be generated. The closer the initial state is to an optimum, the less work remains to be done for the algorithm. If the algorithm is to be self-contained, initialisation must be relatively uninformed and can only rely on some general prior assumptions about what might be a good initial guess. On the other hand, if optimal results are sought after, it pays off to invest some effort into a good starting point. One way to do this is to run DP search first. For uninformed initialisation, we chose to implement a very simple procedure based only on the observation that, at least for language pairs involving the major European languages, it is usually a good guess to keep the word order of the output very similar to that of the input. We therefore create the initial state by selecting, for each sentence in the document, a sequence of anchored phrase pairs covering the input sentence in monotonic order, that is, such that for all pairs of adjacent anchored phrase pairs Aj and Aj+1, we have that C(Aj) ≺ C(Aj+1 ). For initialisation with DP search, we first run the Moses decoder (Koehn et al., 2007) with default search parameters and the same models as those used by our decoder. Then we extract the best output hypothesis from the search graph of the decoder and map it into a sequence of anchored phrase pairs in the obvious way. When the document-level decoder is used with models that are incompatible with beam search, Moses can be run with a subset of the models in order to find an approximation of the solution 1183 which is then refined with the complete feature set. 3 State Operations Given a document state S, the decoder uses a neighbourhood function Neighbour to simulate a move in the state space. The neighbourhood function nondeterministically selects a type of state operation and a location in the document to apply it to and returns the resulting new state. We use a set of three operations that has the property that every possible document state can be reached from every other state in a sequence of moves. Designing operations for state transitions in local search for phrase-based SMT is a problem that has been addressed in the literature (Langlais et al., 2007; Arun et al., 2010). Our decoder’s first- choice hill climbing strategy never enumerates the full neighbourhood of a state. We therefore place less emphasis than previous work on defining a compact neighbourhood, but allow the decoder to make quite extensive changes to a state in a single step with a certain probability. Otherwise our operations are similar to those used by Arun et al. (2010). All of the operations described in this paper make changes to a single sentence only. Each time it is called, the Neighbour function selects a sentence in the document with a probability proportional to the number of input tokens in each sentence to ensure a fair distribution ofthe decoder’s attention over the words in the document regardless of varying sentence lengths. 3.1 Changing Phrase Translations The change-phrase-translation operation replaces the translation of a single phrase with a random translation with the same coverage taken from the phrase table. Formally, the operation selects an anchored phrase pair Aj by drawing uniformly from the elements of Si and then draws a new translation φ0 uniformly from the set Φi(C(Aj)). The new state is given by Si : Aj −→ hC(Aj), φ0i. (11) 3.2 Changing Word Order The swap-phrases operation affects the output word order without changing the phrase translations. It exchanges two anchored phrase pairs Aj and Aj+h, resulting in an output state of Si : Aj . . .Aj+h −→ Aj+h Aj+1 . . .Aj+h−1 Aj. (12) The start location j is drawn uniformly from the eligible sentence positions; the swap range h comes from a geometric distribution with configurable decay. Other word-order changes such as a one-way move operation that does not require another movement in exchange or more advanced permutations can easily be defined. 3.3 Resegmentation The most complex operation is resegment, which allows the decoder to modify the segmentation ofthe source phrase. It takes a number of anchored phrase pairs that form a contiguous block both in the input and in the output and replaces them with a new set of phrase pairs covering the same span of the input sentence. Formally, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (13) such that j+[h−1 [h0 [ C(Aj0) = [ C(A0j0) = [p;q] j[0=j (14) j[0=1 for some p and q, where, for j0 = 1, . . . ,h0, we have that A0j0 = h[pj0; qj0] , φj0i, all [pj0; qj0] are mutually disjoint =an hd[ peach φj0 isi randomly drawn from Φi([pj0;qj0]). Regardless of the ordering of Aj . . .Aj+h−1 , the resegment operation always generates a sequence of anchored phrase pairs in linear order, such that C(A0j0) ≺ C(A0j0+1 ) for j0 = 1, . . . ,h0 −1 . As )f o≺r Cth(eA other operations, j is− generated uniformly and h is drawn from a geometric distribution with a decay parameter. The new segmentation is generated by extending the sequence of anchored phrase pairs with random elements starting at the next free position, proceeding from left to right until the whole range [p; q] is covered. 4 Experimental Results In this section, we present the results of a series of experiments with our document decoder. The 1184 goal of our experiments is to demonstrate the behaviour of the decoder and characterise its response to changes in the fundamental search parameters. The SMT models for our experiments were created with a subset of the training data for the English-French shared task at the WMT 2011workshop (Callison-Burch et al., 2011). The phrase table was trained on Europarl, news-commentary and UN data. To reduce the training data to a manageable size, singleton phrase pairs were removed before the phrase scoring step. Significance-based filtering (Johnson et al., 2007) was applied to the resulting phrase table. The language model was a 5gram model with Kneser-Ney smoothing trained on the monolingual News corpus with IRSTLM (Federico et al., 2008). Feature weights were trained with Minimum Error-Rate Training (MERT) (Och, 2003) on the news-test2008 development set using the DP beam search decoder and the MERT implementation of the Moses toolkit (Koehn et al., 2007). Experimental results are reported for the newstest2009 test set, a corpus of 111 newswire documents totalling 2,525 sentences or 65,595 English input tokens. 4.1 Stability An important difference between our decoder and the classical DP decoder as well as previous work in SMT decoding with local search is that our decoder is inherently non-deterministic. This implies that repeated runs of the decoder with the same search parameters, input and models will not, in general, find the same local maximum of the score space. The first empirical question we ask is therefore how different the results are under repeated runs. The results in this and the next section were obtained with random state initialisation, i. e. without running the DP beam search decoder. Figure 1 shows the results of 7 decoder runs with the models described above, translating the newstest2009 test set, with a step limit of 227 and a rejection limit of 100,000. The x-axis of both plots shows the number of decoding steps on a logarithmic scale, so the number of steps is doubled between two adjacent points on the same curve. In the left plot, the y-axis indicates the model score optimised by the decoder summed over all 2525 sentences of the document. In the right plot, the case-sensitive BLEU score (Papineni et al., 2002) of the current decoder Figure 1: Score stability in repeated decoder runs state against a reference translation is displayed. We note, as expected, that the decoder achieves a considerable improvement of the initial state with diminishing returns as decoding continues. Between 28 and 214 steps, the score increases at a roughly logarithmic pace, then the curve flattens out, which is partly due to the fact that decoding for some documents effectively stopped when the maximum number of rejections was reached. The BLEU score curve shows a similar increase, from an initial score below 5 % to a maximum of around 21.5 %. This is below the score of 22.45 % achieved by the beam search decoder with the same models, which is not surprising considering that our decoder approximates a more difficult search problem, from which a number of strong independence assumptions have been lifted, without, at the moment, having any stronger models at its disposal to exploit this additional freedom for better translation. In terms of stability, there are no dramatic differences between the decoder runs. Indeed, the small differences that exist are hardly discernible in the plots. The model scores at the end of the decoding run range between −158767.9 and −158716.9, a g re rlautniv rea ndgieffe breetnwceee nof − only a6b7.o9ut a n0d.0 −3 %15.8 F1i6n.a9l, BLEU scores range from 21.41 % to 21.63 %, an interval that is not negligible, but comparable to the variance observed when, e. g., feature weights from repeated MERT runs are used with one and the same SMT system. Note that these results were obtained with random state initialisation. With DP initialisation, score differences between repeated runs rarely 1185 exceed 0.02 absolute BLEU percentage points. Overall, we conclude that the decoding results of our algorithm are reasonably stable despite the nondeterminism inherent in the procedure. In our subsequent experiments, the evaluation scores reported are calculated as the mean of three runs for each experiment. 4.2 Search Algorithm Parameters The hill climbing algorithm we use has two parameters which govern the trade-off between decoding time and the accuracy with which a local maximum is identified: The step limit stops the search process after a certain number of steps regardless of the search progress made or lack thereof. The rejection limit stops the search after a certain number of unsuccessful attempts to make a step, when continued search does not seem to be promising. In most of our experiments, we used a step limit of 227 ≈ 1.3 · 108 and a rejection limit of 105. In practice, decoding terminates by reaching the rejection limit for the vast majority of documents. We therefore examined the effect of different rejection limits on the learning curves. The results are shown in figure 2. The results show that continued search does pay off to a certain extent. Indeed, the curve for rejection limit 107 seems to indicate that the model score increases roughly logarithmically, albeit to a higher base, even after the curve has started to flatten out at 214 steps. At a certain point, however, the probability of finding a good successor state drops rather sharply by about two orders of magnitude, as Figure 2: Search performance at different rejection limits evidenced by the fact that a rejection limit of 106 does not give a large improvement over one of 105, while one of 107 does. The continued model score improvement also results in an increase in BLEU scores, and with a BLEU score of 22. 1% the system with rejection limit 107 is fairly close to the score of 22.45 % obtained by DP beam search. Obviously, more exact search comes at a cost, and in this case, it comes at a considerable cost, which is an explosion of the time required to decode the test set from 4 minutes at rejection limit 103 to 224 minutes at rejection limit 105 and 38 hours 45 minutes at limit 107. The DP decoder takes 3 1 minutes for the same task. We conclude that the rejection limit of 105 selected for our experiments, while technically suboptimal, realises a good trade-off between decoding time and accuracy. 4.3 A Semantic Document Language Model In this section, we present the results of the application of our decoder to an actual SMT model with cross-sentence features. Our model addresses the problem of lexical cohesion. In particular, it rewards the use of semantically related words in the translation output by the decoder, where semantic distance is measured with a word space model based on Latent Semantic Analysis (LSA). LSA has been applied to semantic language modelling in previous research with some success (Coccaro and Jurafsky, 1998; Bellegarda, 2000; Wandmacher and Antoine, 2007). In SMT, it has mostly been used for domain adaptation (Kim and Khudanpur, 2004; Tam et al., 1186 2007), or to measure sentence similarities (Banchs and Costa-juss a`, 2011). The model we use is inspired by Bellegarda (2000). It is a Markov model, similar to a standard n-gram model, and assigns to each content word a score given a history of n preceding content words, where n = 30 below. Scoring relies on a 30dimensional LSA word vector space trained with the S-Space software (Jurgens and Stevens, 2010). The score is defined based on the cosine similarity between the word vector of the predicted word and the mean word vector of the words in the history, which is converted to a probability by histogram lookup as suggested by Bellegarda (2000). The model is structurally different from a regular n-gram model in that word vector n-grams are defined over content words occurring in the word vector model only and can cross sentence boundaries. Stop words, identified by an extensive stop word list and amounting to around 60 % of the tokens, are scored by a different mechanism based on their relative frequency (undiscounted unigram probability) in the training corpus. In sum, the score produced by the semantic document LM has the following form: wh(er|h)α=is tεpαheuncipgors(wp)o|hrtinof w fci os nakutneskotn wpon w ,onerldse,in(ls1teh5) training corpus and ε is a small fixed probability. It is integrated into the decoder as an extra feature function. Since we lack an automatic method for training the feature weights of document-wide features, its weight was selected by grid search over a number of values, comparing translation performance for the newstest2009 test set. In these experiments, we used DP beam search to initialise the state of our local search decoder. Three results are presented (table 1): The first table row shows the baseline performance using DP beam search with standard sentence-local features only. The scores in the second row were obtained by running the hill climbing decoder with DP initialisation, but without adding any models. A marginal increase in scores for all three test sets demonstrates that the hill climbing decoder manages to fix some of the search errors made by the DP search. The last row contains the scores obtained by adding in the semantic language model. Scores are presented for three publicly available test sets from recent WMT Machine Translation shared tasks, of which one (newstest2009) was used to monitor progress during development and select the final model. Adding the semantic language model results in a small increase in NIST scores (Doddington, 2002) for all three test sets as well as a small BLEU score gain (Papineni et al., 2002) for two out of three corpora. We note that the NIST score turned out to react more sensitively to improvements due to the semantic LM in all our experiments, which is reasonable because the model specifically targets content words, which benefit from the information weighting done by the NIST score. While the results we present do not constitute compelling evidence in favour of our semantic LM in its current form, they do suggest that this model could be improved to realise higher gains from cross-sentence semantic information. They support our claim that cross- sentence models should be examined more closely and that existing methods should be adapted to deal with them, a problem addressed by our main contribution, the local search document decoder. 5 Related Work Even though DP beam search (Koehn et al., 2003) has been the dominant approach to SMT decoding in recent years, methods based on local search have been explored at various times. For word-based SMT, greedy hill-climbing techniques were advo1187 cated as a faster replacement for beam search (Germann et al., 2001 ; Germann, 2003; Germann et al., 2004), and a problem formulation specifically targeting word reordering with an efficient word reordering algorithm has been proposed (Eisner and Tromble, 2006). A local search decoder has been advanced as a faster alternative to beam search also for phrasebased SMT (Langlais et al., 2007; Langlais et al., 2008). That work anticipates many of the features found in our decoder, including the use of local search to refine an initial hypothesis produced by DP beam search. The possibility of using models that do not fit well into the beam search paradigm is mentioned and illustrated with the example of a reversed n-gram language model, which the authors claim would be difficult to implement in a beam search decoder. Similarly to the work by Germann et al. (2001), their decoder is deterministic and explores the entire neighbourhood of a state in order to identify the most promising step. Our main contribution with respect to the work by Langlais et al. (2007) is the introduction of the possibility of handling document-level models by lifting the assumption of sentence independence. As a consequence, enumerating the entire neighbourhood becomes too expensive, which is why we resort to a “first-choice” strategy that non-deterministically generates states and accepts the first one encountered that meets the acceptance criterion. More recently, Gibbs sampling was proposed as a way to generate samples from the posterior distribution of a phrase-based SMT decoder (Arun et al., 2009; Arun et al., 2010), a process that resembles local search in its use of a set of state-modifying operators to generate a sequence of decoder states. Where local search seeks for the best state attainable from a given initial state, Gibbs sampling produces a representative sample from the posterior. Like all work on SMT decoding that we know of, the Gibbs sampler presented by Arun et al. (2010) assumes independence of sentences and considers the complete neighbourhood of each state before taking a sample. 6 Conclusion In the last twenty years of SMT research, there has been a strong assumption that sentences in a text newstest2009 newstest2010 newstest201 1 BLEU NIST BLEU NIST BLEU NIST 22.56 6.513 27.27 7.034 24.94 7.170 + hill climbing 22.60 6.518 27.33 7.046 24.97 7.169 with semantic LM 22.71 6.549 27.53 7.087 24.90 7.199 DP search only DP Table 1: Experimental results with a cross-sentence semantic language model are independent of one another, and discourse context has been largely neglected. Several factors have contributed to this. Developing good discourse-level models is difficult, and considering the modest translation quality that has long been achieved by SMT, there have been more pressing problems to solve and lower hanging fruit to pick. However, we argue that the popular DP beam search algorithm, which delivers excellent decoding performance, but imposes a particular kind of local dependency structure on the feature models, has also had its share in driving researchers away from discourse-level problems. In this paper, we have presented a decoding procedure for phrase-based SMT that makes it possible to define feature models with cross-sentence dependencies. Our algorithm can be combined with DP beam search to leverage the quality of the traditional approach with increased flexibility for models at the discourse level. We have presented preliminary results on a cross-sentence semantic language model addressing the problem of lexical cohesion to demonstrate that this kind of models is worth exploring further. Besides lexical cohesion, cross-sentence models are relevant for other linguistic phenomena such as pronominal anaphora or verb tense selection. We believe that SMT research has reached a point of maturity where discourse phenomena should not be ignored any longer, and we consider our decoder to be a step towards this goal. References Abhishek Arun, Chris Dyer, Barry Haddow, Phil Blunsom, Adam Lopez, and Philipp Koehn. 2009. Monte carlo inference and maximization for phrase-based translation. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 102–1 10, Boulder, Colorado, June. Association for Computational Linguistics. Abhishek Arun, Barry Haddow, Philipp Koehn, Adam Lopez, Chris Dyer, and Phil Blunsom. 2010. Monte 1188 Ma- Carlo techniques for phrase-based translation. chine translation, 24(2): 103–121 . Rafael E. Banchs and Marta R. Costa-juss a`. 2011. A semantic feature for Statistical Machine Translation. In Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 126– 134, Portland, Oregon, USA, June. Association for Computational Linguistics. Jerome R. Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8): 1279–1296. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics. Noah Coccaro and Daniel Jurafsky. 1998. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of the 5th International Conference on Spoken Language Processing, Sydney. George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second International conference on Human Language Technology Research, pages 138–145, San Diego. Jason Eisner and Roy W. Tromble. 2006. Local search with very large-scale neighborhoods for optimal permutations in machine translation. In Proceedings of the HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 57–75. Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech 2008, pages 1618–1621 . ISCA. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, pages 228–235, Toulouse, France, July. Association for Computational Linguis- tics. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1–2): 127–143. Ulrich Germann. 2003. Greedy decoding for Statistical Machine Translation in almost linear time. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level Statistical Machine Translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 909–919, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Christian Hardmeier and Marcello Federico. 2010. Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289. Basil Hatim and Ian Mason. 1990. Discourse and the Translator. Language in Social Life Series. Longman, London. Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, July. Association for Computational Linguistics. Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967– 975, Prague, Czech Republic, June. Association for Computational Linguistics. David Jurgens and Keith Stevens. 2010. The S-Space package: An open source package for word space models. In Proceedings of the ACL 2010 System Demonstrations, pages 30–35, Uppsala, Sweden, July. Association for Computational Linguistics. Woosung Kim and Sanjeev Khudanpur. 2004. Crosslingual latent semantic analysis for language modeling. In IEEE international conference on acoustics, speech, and signal processing (ICASSP), volume 1, pages 257–260, Montr ´eal. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48– 54, Edmonton. 1189 Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–1 13, Sk¨ ovde. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2008. Recherche locale pour la traduction statistique par segments. In TALN 2008, pages 119–128, Avignon, France, June. ATALA. Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden, July. Association for Computational Linguistics. Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for Statistical Machine Translation. In Proceedings of the DataDriven Machine Translation Workshop, 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 55–62, Toulouse. Franz Josef Och. 2003. Minimum error rate training in Statistical Machine Translation. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, pages 160–167, Sapporo (Japan). Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of Machine Translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 3 11–3 18, Philadelphia. ACL. Yik-Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual LSA-based adaptation for Statistical Machine Translation. Machine Translation, 21(4): 187– 207. J o¨rg Tiedemann. 2010. To cache or not to cache? Experiments with adaptive models in Statistical Machine Translation. In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 189–194, Uppsala, Sweden. Association for Computational Linguistics. Christoph Tillmann and Hermann Ney. 2003. Word reordering and a Dynamic Programming beam search algorithm for Statistical Machine Translation. Computational linguistics, 29(1):97–133. Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in Statistical Translation. In Proceedings of the 35th Annual Meeting of the Association for Computational Spain, tics. Linguistics, July. Association for 289–296, Madrid, Computational Linguis- pages Tonio Wandmacher and Jean-Yves Antoine. 2007. Methods to integrate a language model with semantic information for a word prediction component. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL), pages 506–5 13, Prague, Czech Republic, June. Association for Computational Linguistics. 1190
5 0.13383321 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
Author: Wang Ling ; Joao Graca ; Isabel Trancoso ; Alan Black
Abstract: Phrase-based machine translation models have shown to yield better translations than Word-based models, since phrase pairs encode the contextual information that is needed for a more accurate translation. However, many phrase pairs do not encode any relevant context, which means that the translation event encoded in that phrase pair is led by smaller translation events that are independent from each other, and can be found on smaller phrase pairs, with little or no loss in translation accuracy. In this work, we propose a relative entropy model for translation models, that measures how likely a phrase pair encodes a translation event that is derivable using smaller translation events with similar probabilities. This model is then applied to phrase table pruning. Tests show that considerable amounts of phrase pairs can be excluded, without much impact on the transla- . tion quality. In fact, we show that better translations can be obtained using our pruned models, due to the compression of the search space during decoding.
6 0.11890598 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
7 0.11411804 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques
8 0.1098384 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
9 0.10196083 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
10 0.096570015 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence
11 0.092920348 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
12 0.084401384 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction
13 0.082530022 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
14 0.080627404 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
15 0.068632059 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
16 0.065400451 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition
17 0.064323448 2 emnlp-2012-A Beam-Search Decoder for Grammatical Error Correction
18 0.061695304 50 emnlp-2012-Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
19 0.058647148 41 emnlp-2012-Entity based QA Retrieval
20 0.058049057 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing
topicId topicWeight
[(0, 0.245), (1, -0.167), (2, -0.199), (3, 0.007), (4, -0.109), (5, -0.129), (6, -0.1), (7, 0.028), (8, 0.077), (9, -0.098), (10, -0.017), (11, -0.049), (12, -0.027), (13, -0.009), (14, 0.028), (15, 0.039), (16, -0.014), (17, 0.012), (18, 0.079), (19, 0.04), (20, 0.029), (21, -0.014), (22, 0.004), (23, 0.069), (24, -0.026), (25, -0.02), (26, -0.095), (27, 0.031), (28, -0.066), (29, 0.141), (30, 0.005), (31, -0.056), (32, 0.109), (33, 0.018), (34, -0.094), (35, 0.072), (36, -0.025), (37, 0.115), (38, 0.071), (39, -0.045), (40, 0.024), (41, -0.027), (42, -0.04), (43, 0.042), (44, -0.085), (45, 0.106), (46, 0.095), (47, 0.024), (48, -0.014), (49, 0.097)]
simIndex simValue paperId paperTitle
same-paper 1 0.97378486 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu
Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We pro- pose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.
2 0.72854334 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
Author: Nan Duan ; Mu Li ; Ming Zhou
Abstract: A forced derivation tree (FDT) of a sentence pair {f, e} denotes a derivation tree that can tpraainrsl {afte, f} idnetono itetss a acc duerraivtea target etrea tnhsaltat cioann e. In this paper, we present an approach that leverages structured knowledge contained in FDTs to train component models for statistical machine translation (SMT) systems. We first describe how to generate different FDTs for each sentence pair in training corpus, and then present how to infer the optimal FDTs based on their derivation and alignment qualities. As the first step in this line of research, we verify the effectiveness of our approach in a BTGbased phrasal system, and propose four FDTbased component models. Experiments are carried out on large scale English-to-Japanese and Chinese-to-English translation tasks, and significant improvements are reported on both translation quality and alignment quality.
3 0.64833611 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation
Author: Christian Hardmeier ; Joakim Nivre ; Jorg Tiedemann
Abstract: Independence between sentences is an assumption deeply entrenched in the models and algorithms used for statistical machine translation (SMT), particularly in the popular dynamic programming beam search decoding algorithm. This restriction is an obstacle to research on more sophisticated discourse-level models for SMT. We propose a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models. We explore the stability and the search parameters ofthis method and demonstrate that it can be successfully used to optimise a document-level semantic language model. 1 Motivation In the field oftranslation studies, it is undisputed that discourse-wide context must be considered care- fully for good translation results (Hatim and Mason, 1990). By contrast, the state of the art in statistical machine translation (SMT), despite significant advances in the last twenty years, still assumes that texts can be translated sentence by sentence under strict independence assumptions, even though it is well known that certain linguistic phenomena such as pronominal anaphora cannot be translated correctly without referring to extra-sentential context. This is true both for the phrase-based and the syntaxbased approach to SMT. In the rest of this paper, we shall concentrate on phrase-based SMT. One reason why it is difficult to experiment with document-wide models for phrase-based SMT is that the dynamic programming (DP) algorithm 1179 which has been used almost exclusively for decoding SMT models in the recent literature has very strong assumptions of locality built into it. DP beam search for phrase-based SMT was described by Koehn et al. (2003), extending earlier work on word-based SMT (Tillmann et al., 1997; Och et al., 2001 ; Tillmann and Ney, 2003). This algorithm con- structs output sentences by starting with an empty hypothesis and adding output words at the end until translations for all source words have been generated. The core models of phrase-based SMT, in particular the n-gram language model (LM), only depend on a constant number of output words to the left of the word being generated. This fact is exploited by the search algorithm with a DP technique called hypothesis recombination (Och et al., 2001), which permits the elimination of hypotheses from the search space if they coincide in a certain number of final words with a better hypothesis and no future expansion can possibly invert the relative ranking of the two hypotheses under the given models. Hypothesis recombination achieves a substantial reduction of the search space without affecting search optimality and makes it possible to use aggressive pruning techniques for fast search while still obtaining good results. The downside of this otherwise excellent approach is that it only works well with models that have a local dependency structure similar to that of an n-gram language model, so they only depend on a small context window for each target word. Sentence-local models with longer dependencies can be added, but doing so greatly increases the risk for search errors by inhibiting hypothesis recombination. Cross-sentence dependencies cannot be directly integrated into DP SMT decoding in LParnogcue agdein Lgesa ornf tihneg, 2 p0a1g2e Jso 1in17t C9–o1n1f9e0re,n Jce ju on Is Elanmdp,ir Kicoarlea M,e 1t2h–o1d4s J iunly N 2a0tu1r2a.l ? Lc a2n0g1u2ag Aes Psorcoicaetsiosin fgo arn Cdo Cmopmutpauti oantiaoln Lailn Ngautiustriacls any obvious way, especially if joint optimisation of a number of interdependent decisions over an entire document is required. Research into models with a more varied, non-local dependency structure is to some extent stifled by the difficulty of decoding such models effectively, as can be seen by the problems some researchers encountered when they attempted to solve discourse-level problems. Consider, for instance, the work on cache-based language models by Tiedemann (2010) and Gong et al. (201 1), where error propagation was a serious issue, or the works on pronominal anaphora by Le Nagard and Koehn (2010), who implemented cross-sentence dependencies with an ad-hoc two-pass decoding strategy, and Hardmeier and Federico (2010) with the use of an external decoder driver to manage backward-only dependencies between sentences. In this paper, we present a method for decoding complete documents in phrase-based SMT. Our decoder uses a local search approach whose state consists of a complete translation of an entire document at any time. The initial state is improved by the application of a series of operations using a hill climbing strategy to find a (local) maximum of the score function. This setup gives us complete freedom to define scoring functions over the entire document. Moreover, by optionally initialising the state with the output of a traditional DP decoder, we can ensure that the final hypothesis is no worse than what would have been found by DP search alone. We start by describing the decoding algorithm and the state operations used by our decoder, then we present empirical results demonstrating the effectiveness of our approach and its usability with a document-level semantic language model, and finally we discuss some related work. 2 SMT Decoding by Hill Climbing In this section, we formally describe the phrasebased SMT model implemented by our decoder as well as the decoding algorithm we use. 2.1 SMT Model Our decoder is based on local search, so its state at any time is a representation of a complete translation of the entire document. Even though the decoder operates at the document level, it is important to keep 1180 track of sentence boundaries, and the individual operations that are applied to the state are still confined to sentence scope, so it is useful to decompose the state of a document into the state of its sentences, and we define the overall state S as a sequence of sentence states: S = S1S2 . . .SN, (1) where N is the number of sentences. This implies that we constrain the decoder to emit exactly one output sentence per input sentence. Let ibe the number of a sentence and mi the number of input tokens of this sentence, p and q (with 1 ≤ p ≤ q ≤ mi) be positions in the input sentence a1n ≤d [p; q] qde ≤no mte the set ofpositions from p up to and including q. We say that [p; q] precedes [p0; q0], or [p; q] ≺ [p0; q0], if q < p0. Let Φi([p; q]) be the set of t[pra;nqs]l ≺atio [pns for the source phrase covering positions [p; q] in the input sentence ias given by the phrase table. We call A = h[p; q] ,φi an anchored phrase pair w.it Wh coverage C(A) = [p; q] nif a φ ∈ Φi([p; q]) sise a target phrase translating =th [ep source w∈o Φrds at positions [p; q] . Then a sequence of ni anchored phrase pairs Si = A1A2 . . .Ani (2) is a valid sentence state for sentence iif the following two conditions hold: 1. The coverage sets C(Aj) for j in 1, . . . , ni are mutually disjoint, and 2. the anchored phrase pairs jointly cover the complete input sentence, or [niC(Aj) = [1;mi]. (3) [j=1 Let f(S) be a scoring function mapping a state S to a real number. As usual in SMT, it is assumed that the scoring function can be decomposed into a linear combination of K feature functions hk(S), each with a constant weight λk, so f(S) =k∑K=1λkhk(S). (4) The problem addressed by the decoder is the search for the state with maximal score, such that Sˆ Sˆ = argSmaxf(S). (5) The feature functions implemented in our baseline system are identical to the ones found in the popular Moses SMT system (Koehn et al., 2007). In particular, our decoder has the following feature functions: 1. phrase translation scores provided by the phrase table including forward and backward conditional probabilities, lexical weights and a phrase penalty (Koehn et al., 2003), 2. n-gram language model scores implemented with the KenLM toolkit (Heafield, 2011), 3. a word penalty score, 4. a distortion model with geometric (Koehn et al., 2003), and decay 5. a feature indicating the number of times a given distortion limit is exceeded in the current state. In our experiments, the last feature is used with a fixed weight of negative infinity in order to limit the gaps between the coverage sets of adjacent anchored phrase pairs to a maximum value. In DP search, the distortion limit is usually enforced directly by the search algorithm and is not added as a feature. In our decoder, however, this restriction is not required to limit complexity, so we decided to add it among the scoring models. 2.2 Decoding Algorithm The decoding algorithm we use (algorithm 1) is very simple. It starts with a given initial document state. In the main loop, which extends from line 3 to line 12, it generates a successor state S0 for the current state S by calling the function Neighbour, which non-deterministically applies one of the operations described in section 3 of this paper to S. The score of the new state is compared to that of the previous one. If it meets a given acceptance criterion, S0 becomes the current state, else search continues from the previous state S. For the experiments in this paper, we use the hill climbing acceptance criterion, which simply accepts a new state if its score is higher than that of the current state. Other acceptance criteria are possible and could be used to endow the search algorithm with stochastic behaviour. 1181 The main loop is repeated until a maximum number of steps (step limit) is reached or until a maximum number of moves are rejected in a row (rejection limit). Algorithm 1 Decoding algorithm Input: an initial document state S; search parameters maxsteps and maxrejected Output: a modified document state 1: nsteps ← 0 2: nrejected ← 0 3: nwrhejileec nsteps < maxsteps and nrejected < maxrejected do 4: S0 ← Neighbour (S) 5: if Accept (f(S0) , f(S)) then 6: S ← S0 7: nrejected ← 0 8: elsner 9: nrejected ← nrejected + 1 10: enndr eifj 11: nsteps ← nsteps + 1 12: ennds wtephsile ← 13: return S A notable difference between this algorithm and other hill climbing algorithms that have been used for SMT decoding (Germann et al., 2004; Langlais et al., 2007) is its non-determinism. Previous work for sentence-level decoding employed a steepest ascent strategy which amounts to enumerating the complete neighbourhood of the current state as defined by the state operations and selecting the next state to be the best state found in the neighbourhood of the current one. Enumerating all neighbours of a given state, costly as it is, has the advantage that it makes it easy to prove local optimality of a state by recognising that all possible successor states have lower scores. It can be rather inefficient, since at every step only one modification will be adopted; many of the modifications that are discarded will very likely be generated anew in the next iteration. As we extend the decoder to the document level, the size of the neighbourhood that would have to be explored in this way increases considerably. Moreover, the inefficiency of the steepest ascent approach potentially increases as well. Very likely, a promising move in one sentence will remain promising after a modification has been applied to another sentence, even though this is not guaranteed to be true in the presence of cross-sentence models. We therefore adopt a first-choice hill climbing strategy that non-deterministically generates successor states and accepts the first one that meets the acceptance criterion. This frees us from the necessity of generating the full set of successors for each state. On the downside, if the full successor set is not known, it is no longer possible to prove local optimality of a state, so we are forced to use a different condition for halting the search. We use a combination of two limits: The step limit is a hard limit on the resources the user is willing to expend on the search problem. The value of the rejection limit determines how much of the neighbourhood is searched for better successors before a state is accepted as a solution; it is related to the probability that a state returned as a solution is in fact locally optimal. To simplify notations in the description of the individual state operations, we write Si −→ Si0 (6) to signify that a state operation, when presented with a document state as in equation 1 and acting on sentence i, returns a new document state of S0 = S1 . . .Si−1 Si0 Si+1 . . .SN. (7) Similarly, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (8) is equivalent to Si −→ A1 . . .Aj−1 A01 . . .A0h0 Aj+h . . .Ani (9) and indicates that the operation returns a state in which a sequence of h consecutive anchored phrase pairs has been replaced by another sequence of h0 anchored phrase pairs. 2.3 Efficiency Considerations When implementing the feature functions for the decoder, we have to exercise some care to avoid recomputing scores for the whole document at every iteration. To achieve this, the scores are computed completely only once, at the beginning of the decoding run. In subsequent iterations, scoring functions are presented with the scores of the previous 1182 iteration and a list of modifications produced by the state operation, a set of tuples hi, r, s,A01 . . .A0h0i, each indicating tthioant ,t ahe s edto ocfu tmupelnets s hhio,ru,sld, Abe modifii,e eda as described by Si :Ar . . .As −→ A01 . . .A0h0 . (10) If a feature function is decomposable in some way, as all the standard features developed under the constraints of DP search are, it can then update the state simply by subtracting and adding score components pertaining to the modified parts of the document. Feature functions have the possibility to store their own state information along with the document state to make sure the required information is available. Thus, the framework makes it possible to exploit decomposability for efficient scoring without impos- ing any particular decomposition on the features as beam search does. To make scoring even more efficient, scores are computed in two passes: First, every feature function is asked to provide an upper bound on the score that will be obtained for the new state. In some cases, it is possible to calculate reasonable upper bounds much more efficiently than computing the exact feature value. If the upper bound fails to meet the acceptance criterion, the new state is discarded right away; if not, the full score is computed and the acceptance criterion is tested again. Among the basic SMT models, this two-pass strategy is only used for the n-gram LM, which requires fairly expensive parameter lookups for scoring. The scores of all the other baseline models are fully computed during the first scoring pass. The n-gram model is more complex. In its state information, it keeps track of the LM score and LM library state for each word. The first scoring pass then identifies the words whose LM scores are affected by the current search step. This includes the words changed by the search operation as well as the words whose LM history is modified. The range of the history de- pendencies can be determined precisely by considering the “valid state length” information provided by the KenLM library. In the first pass, the LM scores of the affected words are subtracted from the total score. The model only looks up the new LM scores for the affected words and updates the total score if the new search state passes the first acceptance check. This two-pass scoring approach allows us to avoid LM lookups altogether for states that will be rejected anyhow because of low scores from the other models, e. g. because the distortion limit is violated. Model score updates become more complex and slower as the number of dependencies of a model increases. While our decoding algorithm does not impose any formal restrictions on the number or type of dependencies that can be handled, there will be practical limits beyond which decoding becomes unacceptably slow or the scoring code becomes very difficult to maintain. These limits are however fairly independent of the types of dependencies handled by a model, which permits the exploration of more varied model types than those handled by DP search. 2.4 State Initialisation Before the hill climbing decoding algorithm can be run, an initial state must be generated. The closer the initial state is to an optimum, the less work remains to be done for the algorithm. If the algorithm is to be self-contained, initialisation must be relatively uninformed and can only rely on some general prior assumptions about what might be a good initial guess. On the other hand, if optimal results are sought after, it pays off to invest some effort into a good starting point. One way to do this is to run DP search first. For uninformed initialisation, we chose to implement a very simple procedure based only on the observation that, at least for language pairs involving the major European languages, it is usually a good guess to keep the word order of the output very similar to that of the input. We therefore create the initial state by selecting, for each sentence in the document, a sequence of anchored phrase pairs covering the input sentence in monotonic order, that is, such that for all pairs of adjacent anchored phrase pairs Aj and Aj+1, we have that C(Aj) ≺ C(Aj+1 ). For initialisation with DP search, we first run the Moses decoder (Koehn et al., 2007) with default search parameters and the same models as those used by our decoder. Then we extract the best output hypothesis from the search graph of the decoder and map it into a sequence of anchored phrase pairs in the obvious way. When the document-level decoder is used with models that are incompatible with beam search, Moses can be run with a subset of the models in order to find an approximation of the solution 1183 which is then refined with the complete feature set. 3 State Operations Given a document state S, the decoder uses a neighbourhood function Neighbour to simulate a move in the state space. The neighbourhood function nondeterministically selects a type of state operation and a location in the document to apply it to and returns the resulting new state. We use a set of three operations that has the property that every possible document state can be reached from every other state in a sequence of moves. Designing operations for state transitions in local search for phrase-based SMT is a problem that has been addressed in the literature (Langlais et al., 2007; Arun et al., 2010). Our decoder’s first- choice hill climbing strategy never enumerates the full neighbourhood of a state. We therefore place less emphasis than previous work on defining a compact neighbourhood, but allow the decoder to make quite extensive changes to a state in a single step with a certain probability. Otherwise our operations are similar to those used by Arun et al. (2010). All of the operations described in this paper make changes to a single sentence only. Each time it is called, the Neighbour function selects a sentence in the document with a probability proportional to the number of input tokens in each sentence to ensure a fair distribution ofthe decoder’s attention over the words in the document regardless of varying sentence lengths. 3.1 Changing Phrase Translations The change-phrase-translation operation replaces the translation of a single phrase with a random translation with the same coverage taken from the phrase table. Formally, the operation selects an anchored phrase pair Aj by drawing uniformly from the elements of Si and then draws a new translation φ0 uniformly from the set Φi(C(Aj)). The new state is given by Si : Aj −→ hC(Aj), φ0i. (11) 3.2 Changing Word Order The swap-phrases operation affects the output word order without changing the phrase translations. It exchanges two anchored phrase pairs Aj and Aj+h, resulting in an output state of Si : Aj . . .Aj+h −→ Aj+h Aj+1 . . .Aj+h−1 Aj. (12) The start location j is drawn uniformly from the eligible sentence positions; the swap range h comes from a geometric distribution with configurable decay. Other word-order changes such as a one-way move operation that does not require another movement in exchange or more advanced permutations can easily be defined. 3.3 Resegmentation The most complex operation is resegment, which allows the decoder to modify the segmentation ofthe source phrase. It takes a number of anchored phrase pairs that form a contiguous block both in the input and in the output and replaces them with a new set of phrase pairs covering the same span of the input sentence. Formally, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (13) such that j+[h−1 [h0 [ C(Aj0) = [ C(A0j0) = [p;q] j[0=j (14) j[0=1 for some p and q, where, for j0 = 1, . . . ,h0, we have that A0j0 = h[pj0; qj0] , φj0i, all [pj0; qj0] are mutually disjoint =an hd[ peach φj0 isi randomly drawn from Φi([pj0;qj0]). Regardless of the ordering of Aj . . .Aj+h−1 , the resegment operation always generates a sequence of anchored phrase pairs in linear order, such that C(A0j0) ≺ C(A0j0+1 ) for j0 = 1, . . . ,h0 −1 . As )f o≺r Cth(eA other operations, j is− generated uniformly and h is drawn from a geometric distribution with a decay parameter. The new segmentation is generated by extending the sequence of anchored phrase pairs with random elements starting at the next free position, proceeding from left to right until the whole range [p; q] is covered. 4 Experimental Results In this section, we present the results of a series of experiments with our document decoder. The 1184 goal of our experiments is to demonstrate the behaviour of the decoder and characterise its response to changes in the fundamental search parameters. The SMT models for our experiments were created with a subset of the training data for the English-French shared task at the WMT 2011workshop (Callison-Burch et al., 2011). The phrase table was trained on Europarl, news-commentary and UN data. To reduce the training data to a manageable size, singleton phrase pairs were removed before the phrase scoring step. Significance-based filtering (Johnson et al., 2007) was applied to the resulting phrase table. The language model was a 5gram model with Kneser-Ney smoothing trained on the monolingual News corpus with IRSTLM (Federico et al., 2008). Feature weights were trained with Minimum Error-Rate Training (MERT) (Och, 2003) on the news-test2008 development set using the DP beam search decoder and the MERT implementation of the Moses toolkit (Koehn et al., 2007). Experimental results are reported for the newstest2009 test set, a corpus of 111 newswire documents totalling 2,525 sentences or 65,595 English input tokens. 4.1 Stability An important difference between our decoder and the classical DP decoder as well as previous work in SMT decoding with local search is that our decoder is inherently non-deterministic. This implies that repeated runs of the decoder with the same search parameters, input and models will not, in general, find the same local maximum of the score space. The first empirical question we ask is therefore how different the results are under repeated runs. The results in this and the next section were obtained with random state initialisation, i. e. without running the DP beam search decoder. Figure 1 shows the results of 7 decoder runs with the models described above, translating the newstest2009 test set, with a step limit of 227 and a rejection limit of 100,000. The x-axis of both plots shows the number of decoding steps on a logarithmic scale, so the number of steps is doubled between two adjacent points on the same curve. In the left plot, the y-axis indicates the model score optimised by the decoder summed over all 2525 sentences of the document. In the right plot, the case-sensitive BLEU score (Papineni et al., 2002) of the current decoder Figure 1: Score stability in repeated decoder runs state against a reference translation is displayed. We note, as expected, that the decoder achieves a considerable improvement of the initial state with diminishing returns as decoding continues. Between 28 and 214 steps, the score increases at a roughly logarithmic pace, then the curve flattens out, which is partly due to the fact that decoding for some documents effectively stopped when the maximum number of rejections was reached. The BLEU score curve shows a similar increase, from an initial score below 5 % to a maximum of around 21.5 %. This is below the score of 22.45 % achieved by the beam search decoder with the same models, which is not surprising considering that our decoder approximates a more difficult search problem, from which a number of strong independence assumptions have been lifted, without, at the moment, having any stronger models at its disposal to exploit this additional freedom for better translation. In terms of stability, there are no dramatic differences between the decoder runs. Indeed, the small differences that exist are hardly discernible in the plots. The model scores at the end of the decoding run range between −158767.9 and −158716.9, a g re rlautniv rea ndgieffe breetnwceee nof − only a6b7.o9ut a n0d.0 −3 %15.8 F1i6n.a9l, BLEU scores range from 21.41 % to 21.63 %, an interval that is not negligible, but comparable to the variance observed when, e. g., feature weights from repeated MERT runs are used with one and the same SMT system. Note that these results were obtained with random state initialisation. With DP initialisation, score differences between repeated runs rarely 1185 exceed 0.02 absolute BLEU percentage points. Overall, we conclude that the decoding results of our algorithm are reasonably stable despite the nondeterminism inherent in the procedure. In our subsequent experiments, the evaluation scores reported are calculated as the mean of three runs for each experiment. 4.2 Search Algorithm Parameters The hill climbing algorithm we use has two parameters which govern the trade-off between decoding time and the accuracy with which a local maximum is identified: The step limit stops the search process after a certain number of steps regardless of the search progress made or lack thereof. The rejection limit stops the search after a certain number of unsuccessful attempts to make a step, when continued search does not seem to be promising. In most of our experiments, we used a step limit of 227 ≈ 1.3 · 108 and a rejection limit of 105. In practice, decoding terminates by reaching the rejection limit for the vast majority of documents. We therefore examined the effect of different rejection limits on the learning curves. The results are shown in figure 2. The results show that continued search does pay off to a certain extent. Indeed, the curve for rejection limit 107 seems to indicate that the model score increases roughly logarithmically, albeit to a higher base, even after the curve has started to flatten out at 214 steps. At a certain point, however, the probability of finding a good successor state drops rather sharply by about two orders of magnitude, as Figure 2: Search performance at different rejection limits evidenced by the fact that a rejection limit of 106 does not give a large improvement over one of 105, while one of 107 does. The continued model score improvement also results in an increase in BLEU scores, and with a BLEU score of 22. 1% the system with rejection limit 107 is fairly close to the score of 22.45 % obtained by DP beam search. Obviously, more exact search comes at a cost, and in this case, it comes at a considerable cost, which is an explosion of the time required to decode the test set from 4 minutes at rejection limit 103 to 224 minutes at rejection limit 105 and 38 hours 45 minutes at limit 107. The DP decoder takes 3 1 minutes for the same task. We conclude that the rejection limit of 105 selected for our experiments, while technically suboptimal, realises a good trade-off between decoding time and accuracy. 4.3 A Semantic Document Language Model In this section, we present the results of the application of our decoder to an actual SMT model with cross-sentence features. Our model addresses the problem of lexical cohesion. In particular, it rewards the use of semantically related words in the translation output by the decoder, where semantic distance is measured with a word space model based on Latent Semantic Analysis (LSA). LSA has been applied to semantic language modelling in previous research with some success (Coccaro and Jurafsky, 1998; Bellegarda, 2000; Wandmacher and Antoine, 2007). In SMT, it has mostly been used for domain adaptation (Kim and Khudanpur, 2004; Tam et al., 1186 2007), or to measure sentence similarities (Banchs and Costa-juss a`, 2011). The model we use is inspired by Bellegarda (2000). It is a Markov model, similar to a standard n-gram model, and assigns to each content word a score given a history of n preceding content words, where n = 30 below. Scoring relies on a 30dimensional LSA word vector space trained with the S-Space software (Jurgens and Stevens, 2010). The score is defined based on the cosine similarity between the word vector of the predicted word and the mean word vector of the words in the history, which is converted to a probability by histogram lookup as suggested by Bellegarda (2000). The model is structurally different from a regular n-gram model in that word vector n-grams are defined over content words occurring in the word vector model only and can cross sentence boundaries. Stop words, identified by an extensive stop word list and amounting to around 60 % of the tokens, are scored by a different mechanism based on their relative frequency (undiscounted unigram probability) in the training corpus. In sum, the score produced by the semantic document LM has the following form: wh(er|h)α=is tεpαheuncipgors(wp)o|hrtinof w fci os nakutneskotn wpon w ,onerldse,in(ls1teh5) training corpus and ε is a small fixed probability. It is integrated into the decoder as an extra feature function. Since we lack an automatic method for training the feature weights of document-wide features, its weight was selected by grid search over a number of values, comparing translation performance for the newstest2009 test set. In these experiments, we used DP beam search to initialise the state of our local search decoder. Three results are presented (table 1): The first table row shows the baseline performance using DP beam search with standard sentence-local features only. The scores in the second row were obtained by running the hill climbing decoder with DP initialisation, but without adding any models. A marginal increase in scores for all three test sets demonstrates that the hill climbing decoder manages to fix some of the search errors made by the DP search. The last row contains the scores obtained by adding in the semantic language model. Scores are presented for three publicly available test sets from recent WMT Machine Translation shared tasks, of which one (newstest2009) was used to monitor progress during development and select the final model. Adding the semantic language model results in a small increase in NIST scores (Doddington, 2002) for all three test sets as well as a small BLEU score gain (Papineni et al., 2002) for two out of three corpora. We note that the NIST score turned out to react more sensitively to improvements due to the semantic LM in all our experiments, which is reasonable because the model specifically targets content words, which benefit from the information weighting done by the NIST score. While the results we present do not constitute compelling evidence in favour of our semantic LM in its current form, they do suggest that this model could be improved to realise higher gains from cross-sentence semantic information. They support our claim that cross- sentence models should be examined more closely and that existing methods should be adapted to deal with them, a problem addressed by our main contribution, the local search document decoder. 5 Related Work Even though DP beam search (Koehn et al., 2003) has been the dominant approach to SMT decoding in recent years, methods based on local search have been explored at various times. For word-based SMT, greedy hill-climbing techniques were advo1187 cated as a faster replacement for beam search (Germann et al., 2001 ; Germann, 2003; Germann et al., 2004), and a problem formulation specifically targeting word reordering with an efficient word reordering algorithm has been proposed (Eisner and Tromble, 2006). A local search decoder has been advanced as a faster alternative to beam search also for phrasebased SMT (Langlais et al., 2007; Langlais et al., 2008). That work anticipates many of the features found in our decoder, including the use of local search to refine an initial hypothesis produced by DP beam search. The possibility of using models that do not fit well into the beam search paradigm is mentioned and illustrated with the example of a reversed n-gram language model, which the authors claim would be difficult to implement in a beam search decoder. Similarly to the work by Germann et al. (2001), their decoder is deterministic and explores the entire neighbourhood of a state in order to identify the most promising step. Our main contribution with respect to the work by Langlais et al. (2007) is the introduction of the possibility of handling document-level models by lifting the assumption of sentence independence. As a consequence, enumerating the entire neighbourhood becomes too expensive, which is why we resort to a “first-choice” strategy that non-deterministically generates states and accepts the first one encountered that meets the acceptance criterion. More recently, Gibbs sampling was proposed as a way to generate samples from the posterior distribution of a phrase-based SMT decoder (Arun et al., 2009; Arun et al., 2010), a process that resembles local search in its use of a set of state-modifying operators to generate a sequence of decoder states. Where local search seeks for the best state attainable from a given initial state, Gibbs sampling produces a representative sample from the posterior. Like all work on SMT decoding that we know of, the Gibbs sampler presented by Arun et al. (2010) assumes independence of sentences and considers the complete neighbourhood of each state before taking a sample. 6 Conclusion In the last twenty years of SMT research, there has been a strong assumption that sentences in a text newstest2009 newstest2010 newstest201 1 BLEU NIST BLEU NIST BLEU NIST 22.56 6.513 27.27 7.034 24.94 7.170 + hill climbing 22.60 6.518 27.33 7.046 24.97 7.169 with semantic LM 22.71 6.549 27.53 7.087 24.90 7.199 DP search only DP Table 1: Experimental results with a cross-sentence semantic language model are independent of one another, and discourse context has been largely neglected. Several factors have contributed to this. Developing good discourse-level models is difficult, and considering the modest translation quality that has long been achieved by SMT, there have been more pressing problems to solve and lower hanging fruit to pick. However, we argue that the popular DP beam search algorithm, which delivers excellent decoding performance, but imposes a particular kind of local dependency structure on the feature models, has also had its share in driving researchers away from discourse-level problems. In this paper, we have presented a decoding procedure for phrase-based SMT that makes it possible to define feature models with cross-sentence dependencies. Our algorithm can be combined with DP beam search to leverage the quality of the traditional approach with increased flexibility for models at the discourse level. We have presented preliminary results on a cross-sentence semantic language model addressing the problem of lexical cohesion to demonstrate that this kind of models is worth exploring further. Besides lexical cohesion, cross-sentence models are relevant for other linguistic phenomena such as pronominal anaphora or verb tense selection. We believe that SMT research has reached a point of maturity where discourse phenomena should not be ignored any longer, and we consider our decoder to be a step towards this goal. References Abhishek Arun, Chris Dyer, Barry Haddow, Phil Blunsom, Adam Lopez, and Philipp Koehn. 2009. Monte carlo inference and maximization for phrase-based translation. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 102–1 10, Boulder, Colorado, June. Association for Computational Linguistics. Abhishek Arun, Barry Haddow, Philipp Koehn, Adam Lopez, Chris Dyer, and Phil Blunsom. 2010. Monte 1188 Ma- Carlo techniques for phrase-based translation. chine translation, 24(2): 103–121 . Rafael E. Banchs and Marta R. Costa-juss a`. 2011. A semantic feature for Statistical Machine Translation. In Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 126– 134, Portland, Oregon, USA, June. Association for Computational Linguistics. Jerome R. Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8): 1279–1296. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics. Noah Coccaro and Daniel Jurafsky. 1998. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of the 5th International Conference on Spoken Language Processing, Sydney. George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second International conference on Human Language Technology Research, pages 138–145, San Diego. Jason Eisner and Roy W. Tromble. 2006. Local search with very large-scale neighborhoods for optimal permutations in machine translation. In Proceedings of the HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 57–75. Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech 2008, pages 1618–1621 . ISCA. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, pages 228–235, Toulouse, France, July. Association for Computational Linguis- tics. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1–2): 127–143. Ulrich Germann. 2003. Greedy decoding for Statistical Machine Translation in almost linear time. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level Statistical Machine Translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 909–919, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Christian Hardmeier and Marcello Federico. 2010. Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289. Basil Hatim and Ian Mason. 1990. Discourse and the Translator. Language in Social Life Series. Longman, London. Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, July. Association for Computational Linguistics. Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967– 975, Prague, Czech Republic, June. Association for Computational Linguistics. David Jurgens and Keith Stevens. 2010. The S-Space package: An open source package for word space models. In Proceedings of the ACL 2010 System Demonstrations, pages 30–35, Uppsala, Sweden, July. Association for Computational Linguistics. Woosung Kim and Sanjeev Khudanpur. 2004. Crosslingual latent semantic analysis for language modeling. In IEEE international conference on acoustics, speech, and signal processing (ICASSP), volume 1, pages 257–260, Montr ´eal. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48– 54, Edmonton. 1189 Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–1 13, Sk¨ ovde. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2008. Recherche locale pour la traduction statistique par segments. In TALN 2008, pages 119–128, Avignon, France, June. ATALA. Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden, July. Association for Computational Linguistics. Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for Statistical Machine Translation. In Proceedings of the DataDriven Machine Translation Workshop, 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 55–62, Toulouse. Franz Josef Och. 2003. Minimum error rate training in Statistical Machine Translation. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, pages 160–167, Sapporo (Japan). Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of Machine Translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 3 11–3 18, Philadelphia. ACL. Yik-Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual LSA-based adaptation for Statistical Machine Translation. Machine Translation, 21(4): 187– 207. J o¨rg Tiedemann. 2010. To cache or not to cache? Experiments with adaptive models in Statistical Machine Translation. In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 189–194, Uppsala, Sweden. Association for Computational Linguistics. Christoph Tillmann and Hermann Ney. 2003. Word reordering and a Dynamic Programming beam search algorithm for Statistical Machine Translation. Computational linguistics, 29(1):97–133. Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in Statistical Translation. In Proceedings of the 35th Annual Meeting of the Association for Computational Spain, tics. Linguistics, July. Association for 289–296, Madrid, Computational Linguis- pages Tonio Wandmacher and Jean-Yves Antoine. 2007. Methods to integrate a language model with semantic information for a word prediction component. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL), pages 506–5 13, Prague, Czech Republic, June. Association for Computational Linguistics. 1190
4 0.59053141 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
Author: Kenneth Heafield ; Philipp Koehn ; Alon Lavie
Abstract: Approximate search algorithms, such as cube pruning in syntactic machine translation, rely on the language model to estimate probabilities of sentence fragments. We contribute two changes that trade between accuracy of these estimates and memory, holding sentence-level scores constant. Common practice uses lowerorder entries in an N-gram model to score the first few words of a fragment; this violates assumptions made by common smoothing strategies, including Kneser-Ney. Instead, we use a unigram model to score the first word, a bigram for the second, etc. This improves search at the expense of memory. Conversely, we show how to save memory by collapsing probability and backoff into a single value without changing sentence-level scores, at the expense of less accurate estimates for sentence fragments. These changes can be stacked, achieving better estimates with unchanged memory usage. In order to interpret changes in search accuracy, we adjust the pop limit so that accuracy is unchanged and report the change in CPU time. In a GermanEnglish Moses system with target-side syntax, improved estimates yielded a 63% reduction in CPU time; for a Hiero-style version, the reduction is 21%. The compressed language model uses 26% less RAM while equivalent search quality takes 27% more CPU. Source code is released as part of KenLM.
5 0.59046662 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
Author: Shixiang Lu ; Wei Wei ; Xiaoyin Fu ; Bo Xu
Abstract: In this paper, we propose a novel translation model (TM) based cross-lingual data selection model for language model (LM) adaptation in statistical machine translation (SMT), from word models to phrase models. Given a source sentence in the translation task, this model directly estimates the probability that a sentence in the target LM training corpus is similar. Compared with the traditional approaches which utilize the first pass translation hypotheses, cross-lingual data selection model avoids the problem of noisy proliferation. Furthermore, phrase TM based cross-lingual data selection model is more effective than the traditional approaches based on bag-ofwords models and word-based TM, because it captures contextual information in modeling the selection of phrase as a whole. Experiments conducted on large-scale data sets demonstrate that our approach significantly outperforms the state-of-the-art approaches on both LM perplexity and SMT performance.
6 0.5702334 67 emnlp-2012-Inducing a Discriminative Parser to Optimize Machine Translation Reordering
7 0.54543179 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
8 0.52971661 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
9 0.52342385 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
10 0.51755428 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
11 0.49596965 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
12 0.48052305 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
13 0.44324869 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
14 0.43604538 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
15 0.41716671 50 emnlp-2012-Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
16 0.41625333 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion
17 0.40515402 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition
18 0.40305293 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction
19 0.39300883 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence
20 0.37885362 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques
topicId topicWeight
[(14, 0.027), (16, 0.02), (34, 0.637), (60, 0.071), (63, 0.028), (70, 0.018), (74, 0.027), (76, 0.017), (79, 0.01), (80, 0.011), (86, 0.018), (95, 0.011)]
simIndex simValue paperId paperTitle
1 0.99301618 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models
Author: Simon Carter ; Marc Dymetman ; Guillaume Bouchard
Abstract: We present a method for exact optimization and sampling from high order Hidden Markov Models (HMMs), which are generally handled by approximation techniques. Motivated by adaptive rejection sampling and heuristic search, we propose a strategy based on sequentially refining a lower-order language model that is an upper bound on the true model we wish to decode and sample from. This allows us to build tractable variable-order HMMs. The ARPA format for language models is extended to enable an efficient use of the max-backoff quantities required to compute the upper bound. We evaluate our approach on two problems: a SMS-retrieval task and a POS tagging experiment using 5-gram models. Results show that the same approach can be used for exact optimization and sampling, while explicitly constructing only a fraction of the total implicit state-space.
same-paper 2 0.98911113 86 emnlp-2012-Locally Training the Log-Linear Model for SMT
Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu
Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We pro- pose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.
3 0.96337742 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables
Author: Paramveer Dhillon ; Jordan Rodu ; Michael Collins ; Dean Foster ; Lyle Ungar
Abstract: Recently there has been substantial interest in using spectral methods to learn generative sequence models like HMMs. Spectral methods are attractive as they provide globally consistent estimates of the model parameters and are very fast and scalable, unlike EM methods, which can get stuck in local minima. In this paper, we present a novel extension of this class of spectral methods to learn dependency tree structures. We propose a simple yet powerful latent variable generative model for dependency parsing, and a spectral learning method to efficiently estimate it. As a pilot experimental evaluation, we use the spectral tree probabilities estimated by our model to re-rank the outputs of a near state-of-theart parser. Our approach gives us a moderate reduction in error of up to 4.6% over the baseline re-ranker. .
4 0.93402064 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
Author: Roberto Navigli ; Simone Paolo Ponzetto
Abstract: We present a multilingual joint approach to Word Sense Disambiguation (WSD). Our method exploits BabelNet, a very large multilingual knowledge base, to perform graphbased WSD across different languages, and brings together empirical evidence from these languages using ensemble methods. The results show that, thanks to complementing wide-coverage multilingual lexical knowledge with robust graph-based algorithms and combination methods, we are able to achieve the state of the art in both monolingual and multilingual WSD settings.
5 0.69003993 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation
Author: Christian Hardmeier ; Joakim Nivre ; Jorg Tiedemann
Abstract: Independence between sentences is an assumption deeply entrenched in the models and algorithms used for statistical machine translation (SMT), particularly in the popular dynamic programming beam search decoding algorithm. This restriction is an obstacle to research on more sophisticated discourse-level models for SMT. We propose a stochastic local search decoding method for phrase-based SMT, which permits free document-wide dependencies in the models. We explore the stability and the search parameters ofthis method and demonstrate that it can be successfully used to optimise a document-level semantic language model. 1 Motivation In the field oftranslation studies, it is undisputed that discourse-wide context must be considered care- fully for good translation results (Hatim and Mason, 1990). By contrast, the state of the art in statistical machine translation (SMT), despite significant advances in the last twenty years, still assumes that texts can be translated sentence by sentence under strict independence assumptions, even though it is well known that certain linguistic phenomena such as pronominal anaphora cannot be translated correctly without referring to extra-sentential context. This is true both for the phrase-based and the syntaxbased approach to SMT. In the rest of this paper, we shall concentrate on phrase-based SMT. One reason why it is difficult to experiment with document-wide models for phrase-based SMT is that the dynamic programming (DP) algorithm 1179 which has been used almost exclusively for decoding SMT models in the recent literature has very strong assumptions of locality built into it. DP beam search for phrase-based SMT was described by Koehn et al. (2003), extending earlier work on word-based SMT (Tillmann et al., 1997; Och et al., 2001 ; Tillmann and Ney, 2003). This algorithm con- structs output sentences by starting with an empty hypothesis and adding output words at the end until translations for all source words have been generated. The core models of phrase-based SMT, in particular the n-gram language model (LM), only depend on a constant number of output words to the left of the word being generated. This fact is exploited by the search algorithm with a DP technique called hypothesis recombination (Och et al., 2001), which permits the elimination of hypotheses from the search space if they coincide in a certain number of final words with a better hypothesis and no future expansion can possibly invert the relative ranking of the two hypotheses under the given models. Hypothesis recombination achieves a substantial reduction of the search space without affecting search optimality and makes it possible to use aggressive pruning techniques for fast search while still obtaining good results. The downside of this otherwise excellent approach is that it only works well with models that have a local dependency structure similar to that of an n-gram language model, so they only depend on a small context window for each target word. Sentence-local models with longer dependencies can be added, but doing so greatly increases the risk for search errors by inhibiting hypothesis recombination. Cross-sentence dependencies cannot be directly integrated into DP SMT decoding in LParnogcue agdein Lgesa ornf tihneg, 2 p0a1g2e Jso 1in17t C9–o1n1f9e0re,n Jce ju on Is Elanmdp,ir Kicoarlea M,e 1t2h–o1d4s J iunly N 2a0tu1r2a.l ? Lc a2n0g1u2ag Aes Psorcoicaetsiosin fgo arn Cdo Cmopmutpauti oantiaoln Lailn Ngautiustriacls any obvious way, especially if joint optimisation of a number of interdependent decisions over an entire document is required. Research into models with a more varied, non-local dependency structure is to some extent stifled by the difficulty of decoding such models effectively, as can be seen by the problems some researchers encountered when they attempted to solve discourse-level problems. Consider, for instance, the work on cache-based language models by Tiedemann (2010) and Gong et al. (201 1), where error propagation was a serious issue, or the works on pronominal anaphora by Le Nagard and Koehn (2010), who implemented cross-sentence dependencies with an ad-hoc two-pass decoding strategy, and Hardmeier and Federico (2010) with the use of an external decoder driver to manage backward-only dependencies between sentences. In this paper, we present a method for decoding complete documents in phrase-based SMT. Our decoder uses a local search approach whose state consists of a complete translation of an entire document at any time. The initial state is improved by the application of a series of operations using a hill climbing strategy to find a (local) maximum of the score function. This setup gives us complete freedom to define scoring functions over the entire document. Moreover, by optionally initialising the state with the output of a traditional DP decoder, we can ensure that the final hypothesis is no worse than what would have been found by DP search alone. We start by describing the decoding algorithm and the state operations used by our decoder, then we present empirical results demonstrating the effectiveness of our approach and its usability with a document-level semantic language model, and finally we discuss some related work. 2 SMT Decoding by Hill Climbing In this section, we formally describe the phrasebased SMT model implemented by our decoder as well as the decoding algorithm we use. 2.1 SMT Model Our decoder is based on local search, so its state at any time is a representation of a complete translation of the entire document. Even though the decoder operates at the document level, it is important to keep 1180 track of sentence boundaries, and the individual operations that are applied to the state are still confined to sentence scope, so it is useful to decompose the state of a document into the state of its sentences, and we define the overall state S as a sequence of sentence states: S = S1S2 . . .SN, (1) where N is the number of sentences. This implies that we constrain the decoder to emit exactly one output sentence per input sentence. Let ibe the number of a sentence and mi the number of input tokens of this sentence, p and q (with 1 ≤ p ≤ q ≤ mi) be positions in the input sentence a1n ≤d [p; q] qde ≤no mte the set ofpositions from p up to and including q. We say that [p; q] precedes [p0; q0], or [p; q] ≺ [p0; q0], if q < p0. Let Φi([p; q]) be the set of t[pra;nqs]l ≺atio [pns for the source phrase covering positions [p; q] in the input sentence ias given by the phrase table. We call A = h[p; q] ,φi an anchored phrase pair w.it Wh coverage C(A) = [p; q] nif a φ ∈ Φi([p; q]) sise a target phrase translating =th [ep source w∈o Φrds at positions [p; q] . Then a sequence of ni anchored phrase pairs Si = A1A2 . . .Ani (2) is a valid sentence state for sentence iif the following two conditions hold: 1. The coverage sets C(Aj) for j in 1, . . . , ni are mutually disjoint, and 2. the anchored phrase pairs jointly cover the complete input sentence, or [niC(Aj) = [1;mi]. (3) [j=1 Let f(S) be a scoring function mapping a state S to a real number. As usual in SMT, it is assumed that the scoring function can be decomposed into a linear combination of K feature functions hk(S), each with a constant weight λk, so f(S) =k∑K=1λkhk(S). (4) The problem addressed by the decoder is the search for the state with maximal score, such that Sˆ Sˆ = argSmaxf(S). (5) The feature functions implemented in our baseline system are identical to the ones found in the popular Moses SMT system (Koehn et al., 2007). In particular, our decoder has the following feature functions: 1. phrase translation scores provided by the phrase table including forward and backward conditional probabilities, lexical weights and a phrase penalty (Koehn et al., 2003), 2. n-gram language model scores implemented with the KenLM toolkit (Heafield, 2011), 3. a word penalty score, 4. a distortion model with geometric (Koehn et al., 2003), and decay 5. a feature indicating the number of times a given distortion limit is exceeded in the current state. In our experiments, the last feature is used with a fixed weight of negative infinity in order to limit the gaps between the coverage sets of adjacent anchored phrase pairs to a maximum value. In DP search, the distortion limit is usually enforced directly by the search algorithm and is not added as a feature. In our decoder, however, this restriction is not required to limit complexity, so we decided to add it among the scoring models. 2.2 Decoding Algorithm The decoding algorithm we use (algorithm 1) is very simple. It starts with a given initial document state. In the main loop, which extends from line 3 to line 12, it generates a successor state S0 for the current state S by calling the function Neighbour, which non-deterministically applies one of the operations described in section 3 of this paper to S. The score of the new state is compared to that of the previous one. If it meets a given acceptance criterion, S0 becomes the current state, else search continues from the previous state S. For the experiments in this paper, we use the hill climbing acceptance criterion, which simply accepts a new state if its score is higher than that of the current state. Other acceptance criteria are possible and could be used to endow the search algorithm with stochastic behaviour. 1181 The main loop is repeated until a maximum number of steps (step limit) is reached or until a maximum number of moves are rejected in a row (rejection limit). Algorithm 1 Decoding algorithm Input: an initial document state S; search parameters maxsteps and maxrejected Output: a modified document state 1: nsteps ← 0 2: nrejected ← 0 3: nwrhejileec nsteps < maxsteps and nrejected < maxrejected do 4: S0 ← Neighbour (S) 5: if Accept (f(S0) , f(S)) then 6: S ← S0 7: nrejected ← 0 8: elsner 9: nrejected ← nrejected + 1 10: enndr eifj 11: nsteps ← nsteps + 1 12: ennds wtephsile ← 13: return S A notable difference between this algorithm and other hill climbing algorithms that have been used for SMT decoding (Germann et al., 2004; Langlais et al., 2007) is its non-determinism. Previous work for sentence-level decoding employed a steepest ascent strategy which amounts to enumerating the complete neighbourhood of the current state as defined by the state operations and selecting the next state to be the best state found in the neighbourhood of the current one. Enumerating all neighbours of a given state, costly as it is, has the advantage that it makes it easy to prove local optimality of a state by recognising that all possible successor states have lower scores. It can be rather inefficient, since at every step only one modification will be adopted; many of the modifications that are discarded will very likely be generated anew in the next iteration. As we extend the decoder to the document level, the size of the neighbourhood that would have to be explored in this way increases considerably. Moreover, the inefficiency of the steepest ascent approach potentially increases as well. Very likely, a promising move in one sentence will remain promising after a modification has been applied to another sentence, even though this is not guaranteed to be true in the presence of cross-sentence models. We therefore adopt a first-choice hill climbing strategy that non-deterministically generates successor states and accepts the first one that meets the acceptance criterion. This frees us from the necessity of generating the full set of successors for each state. On the downside, if the full successor set is not known, it is no longer possible to prove local optimality of a state, so we are forced to use a different condition for halting the search. We use a combination of two limits: The step limit is a hard limit on the resources the user is willing to expend on the search problem. The value of the rejection limit determines how much of the neighbourhood is searched for better successors before a state is accepted as a solution; it is related to the probability that a state returned as a solution is in fact locally optimal. To simplify notations in the description of the individual state operations, we write Si −→ Si0 (6) to signify that a state operation, when presented with a document state as in equation 1 and acting on sentence i, returns a new document state of S0 = S1 . . .Si−1 Si0 Si+1 . . .SN. (7) Similarly, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (8) is equivalent to Si −→ A1 . . .Aj−1 A01 . . .A0h0 Aj+h . . .Ani (9) and indicates that the operation returns a state in which a sequence of h consecutive anchored phrase pairs has been replaced by another sequence of h0 anchored phrase pairs. 2.3 Efficiency Considerations When implementing the feature functions for the decoder, we have to exercise some care to avoid recomputing scores for the whole document at every iteration. To achieve this, the scores are computed completely only once, at the beginning of the decoding run. In subsequent iterations, scoring functions are presented with the scores of the previous 1182 iteration and a list of modifications produced by the state operation, a set of tuples hi, r, s,A01 . . .A0h0i, each indicating tthioant ,t ahe s edto ocfu tmupelnets s hhio,ru,sld, Abe modifii,e eda as described by Si :Ar . . .As −→ A01 . . .A0h0 . (10) If a feature function is decomposable in some way, as all the standard features developed under the constraints of DP search are, it can then update the state simply by subtracting and adding score components pertaining to the modified parts of the document. Feature functions have the possibility to store their own state information along with the document state to make sure the required information is available. Thus, the framework makes it possible to exploit decomposability for efficient scoring without impos- ing any particular decomposition on the features as beam search does. To make scoring even more efficient, scores are computed in two passes: First, every feature function is asked to provide an upper bound on the score that will be obtained for the new state. In some cases, it is possible to calculate reasonable upper bounds much more efficiently than computing the exact feature value. If the upper bound fails to meet the acceptance criterion, the new state is discarded right away; if not, the full score is computed and the acceptance criterion is tested again. Among the basic SMT models, this two-pass strategy is only used for the n-gram LM, which requires fairly expensive parameter lookups for scoring. The scores of all the other baseline models are fully computed during the first scoring pass. The n-gram model is more complex. In its state information, it keeps track of the LM score and LM library state for each word. The first scoring pass then identifies the words whose LM scores are affected by the current search step. This includes the words changed by the search operation as well as the words whose LM history is modified. The range of the history de- pendencies can be determined precisely by considering the “valid state length” information provided by the KenLM library. In the first pass, the LM scores of the affected words are subtracted from the total score. The model only looks up the new LM scores for the affected words and updates the total score if the new search state passes the first acceptance check. This two-pass scoring approach allows us to avoid LM lookups altogether for states that will be rejected anyhow because of low scores from the other models, e. g. because the distortion limit is violated. Model score updates become more complex and slower as the number of dependencies of a model increases. While our decoding algorithm does not impose any formal restrictions on the number or type of dependencies that can be handled, there will be practical limits beyond which decoding becomes unacceptably slow or the scoring code becomes very difficult to maintain. These limits are however fairly independent of the types of dependencies handled by a model, which permits the exploration of more varied model types than those handled by DP search. 2.4 State Initialisation Before the hill climbing decoding algorithm can be run, an initial state must be generated. The closer the initial state is to an optimum, the less work remains to be done for the algorithm. If the algorithm is to be self-contained, initialisation must be relatively uninformed and can only rely on some general prior assumptions about what might be a good initial guess. On the other hand, if optimal results are sought after, it pays off to invest some effort into a good starting point. One way to do this is to run DP search first. For uninformed initialisation, we chose to implement a very simple procedure based only on the observation that, at least for language pairs involving the major European languages, it is usually a good guess to keep the word order of the output very similar to that of the input. We therefore create the initial state by selecting, for each sentence in the document, a sequence of anchored phrase pairs covering the input sentence in monotonic order, that is, such that for all pairs of adjacent anchored phrase pairs Aj and Aj+1, we have that C(Aj) ≺ C(Aj+1 ). For initialisation with DP search, we first run the Moses decoder (Koehn et al., 2007) with default search parameters and the same models as those used by our decoder. Then we extract the best output hypothesis from the search graph of the decoder and map it into a sequence of anchored phrase pairs in the obvious way. When the document-level decoder is used with models that are incompatible with beam search, Moses can be run with a subset of the models in order to find an approximation of the solution 1183 which is then refined with the complete feature set. 3 State Operations Given a document state S, the decoder uses a neighbourhood function Neighbour to simulate a move in the state space. The neighbourhood function nondeterministically selects a type of state operation and a location in the document to apply it to and returns the resulting new state. We use a set of three operations that has the property that every possible document state can be reached from every other state in a sequence of moves. Designing operations for state transitions in local search for phrase-based SMT is a problem that has been addressed in the literature (Langlais et al., 2007; Arun et al., 2010). Our decoder’s first- choice hill climbing strategy never enumerates the full neighbourhood of a state. We therefore place less emphasis than previous work on defining a compact neighbourhood, but allow the decoder to make quite extensive changes to a state in a single step with a certain probability. Otherwise our operations are similar to those used by Arun et al. (2010). All of the operations described in this paper make changes to a single sentence only. Each time it is called, the Neighbour function selects a sentence in the document with a probability proportional to the number of input tokens in each sentence to ensure a fair distribution ofthe decoder’s attention over the words in the document regardless of varying sentence lengths. 3.1 Changing Phrase Translations The change-phrase-translation operation replaces the translation of a single phrase with a random translation with the same coverage taken from the phrase table. Formally, the operation selects an anchored phrase pair Aj by drawing uniformly from the elements of Si and then draws a new translation φ0 uniformly from the set Φi(C(Aj)). The new state is given by Si : Aj −→ hC(Aj), φ0i. (11) 3.2 Changing Word Order The swap-phrases operation affects the output word order without changing the phrase translations. It exchanges two anchored phrase pairs Aj and Aj+h, resulting in an output state of Si : Aj . . .Aj+h −→ Aj+h Aj+1 . . .Aj+h−1 Aj. (12) The start location j is drawn uniformly from the eligible sentence positions; the swap range h comes from a geometric distribution with configurable decay. Other word-order changes such as a one-way move operation that does not require another movement in exchange or more advanced permutations can easily be defined. 3.3 Resegmentation The most complex operation is resegment, which allows the decoder to modify the segmentation ofthe source phrase. It takes a number of anchored phrase pairs that form a contiguous block both in the input and in the output and replaces them with a new set of phrase pairs covering the same span of the input sentence. Formally, Si : Aj . . .Aj+h−1 −→ A01 . . .A0h0 (13) such that j+[h−1 [h0 [ C(Aj0) = [ C(A0j0) = [p;q] j[0=j (14) j[0=1 for some p and q, where, for j0 = 1, . . . ,h0, we have that A0j0 = h[pj0; qj0] , φj0i, all [pj0; qj0] are mutually disjoint =an hd[ peach φj0 isi randomly drawn from Φi([pj0;qj0]). Regardless of the ordering of Aj . . .Aj+h−1 , the resegment operation always generates a sequence of anchored phrase pairs in linear order, such that C(A0j0) ≺ C(A0j0+1 ) for j0 = 1, . . . ,h0 −1 . As )f o≺r Cth(eA other operations, j is− generated uniformly and h is drawn from a geometric distribution with a decay parameter. The new segmentation is generated by extending the sequence of anchored phrase pairs with random elements starting at the next free position, proceeding from left to right until the whole range [p; q] is covered. 4 Experimental Results In this section, we present the results of a series of experiments with our document decoder. The 1184 goal of our experiments is to demonstrate the behaviour of the decoder and characterise its response to changes in the fundamental search parameters. The SMT models for our experiments were created with a subset of the training data for the English-French shared task at the WMT 2011workshop (Callison-Burch et al., 2011). The phrase table was trained on Europarl, news-commentary and UN data. To reduce the training data to a manageable size, singleton phrase pairs were removed before the phrase scoring step. Significance-based filtering (Johnson et al., 2007) was applied to the resulting phrase table. The language model was a 5gram model with Kneser-Ney smoothing trained on the monolingual News corpus with IRSTLM (Federico et al., 2008). Feature weights were trained with Minimum Error-Rate Training (MERT) (Och, 2003) on the news-test2008 development set using the DP beam search decoder and the MERT implementation of the Moses toolkit (Koehn et al., 2007). Experimental results are reported for the newstest2009 test set, a corpus of 111 newswire documents totalling 2,525 sentences or 65,595 English input tokens. 4.1 Stability An important difference between our decoder and the classical DP decoder as well as previous work in SMT decoding with local search is that our decoder is inherently non-deterministic. This implies that repeated runs of the decoder with the same search parameters, input and models will not, in general, find the same local maximum of the score space. The first empirical question we ask is therefore how different the results are under repeated runs. The results in this and the next section were obtained with random state initialisation, i. e. without running the DP beam search decoder. Figure 1 shows the results of 7 decoder runs with the models described above, translating the newstest2009 test set, with a step limit of 227 and a rejection limit of 100,000. The x-axis of both plots shows the number of decoding steps on a logarithmic scale, so the number of steps is doubled between two adjacent points on the same curve. In the left plot, the y-axis indicates the model score optimised by the decoder summed over all 2525 sentences of the document. In the right plot, the case-sensitive BLEU score (Papineni et al., 2002) of the current decoder Figure 1: Score stability in repeated decoder runs state against a reference translation is displayed. We note, as expected, that the decoder achieves a considerable improvement of the initial state with diminishing returns as decoding continues. Between 28 and 214 steps, the score increases at a roughly logarithmic pace, then the curve flattens out, which is partly due to the fact that decoding for some documents effectively stopped when the maximum number of rejections was reached. The BLEU score curve shows a similar increase, from an initial score below 5 % to a maximum of around 21.5 %. This is below the score of 22.45 % achieved by the beam search decoder with the same models, which is not surprising considering that our decoder approximates a more difficult search problem, from which a number of strong independence assumptions have been lifted, without, at the moment, having any stronger models at its disposal to exploit this additional freedom for better translation. In terms of stability, there are no dramatic differences between the decoder runs. Indeed, the small differences that exist are hardly discernible in the plots. The model scores at the end of the decoding run range between −158767.9 and −158716.9, a g re rlautniv rea ndgieffe breetnwceee nof − only a6b7.o9ut a n0d.0 −3 %15.8 F1i6n.a9l, BLEU scores range from 21.41 % to 21.63 %, an interval that is not negligible, but comparable to the variance observed when, e. g., feature weights from repeated MERT runs are used with one and the same SMT system. Note that these results were obtained with random state initialisation. With DP initialisation, score differences between repeated runs rarely 1185 exceed 0.02 absolute BLEU percentage points. Overall, we conclude that the decoding results of our algorithm are reasonably stable despite the nondeterminism inherent in the procedure. In our subsequent experiments, the evaluation scores reported are calculated as the mean of three runs for each experiment. 4.2 Search Algorithm Parameters The hill climbing algorithm we use has two parameters which govern the trade-off between decoding time and the accuracy with which a local maximum is identified: The step limit stops the search process after a certain number of steps regardless of the search progress made or lack thereof. The rejection limit stops the search after a certain number of unsuccessful attempts to make a step, when continued search does not seem to be promising. In most of our experiments, we used a step limit of 227 ≈ 1.3 · 108 and a rejection limit of 105. In practice, decoding terminates by reaching the rejection limit for the vast majority of documents. We therefore examined the effect of different rejection limits on the learning curves. The results are shown in figure 2. The results show that continued search does pay off to a certain extent. Indeed, the curve for rejection limit 107 seems to indicate that the model score increases roughly logarithmically, albeit to a higher base, even after the curve has started to flatten out at 214 steps. At a certain point, however, the probability of finding a good successor state drops rather sharply by about two orders of magnitude, as Figure 2: Search performance at different rejection limits evidenced by the fact that a rejection limit of 106 does not give a large improvement over one of 105, while one of 107 does. The continued model score improvement also results in an increase in BLEU scores, and with a BLEU score of 22. 1% the system with rejection limit 107 is fairly close to the score of 22.45 % obtained by DP beam search. Obviously, more exact search comes at a cost, and in this case, it comes at a considerable cost, which is an explosion of the time required to decode the test set from 4 minutes at rejection limit 103 to 224 minutes at rejection limit 105 and 38 hours 45 minutes at limit 107. The DP decoder takes 3 1 minutes for the same task. We conclude that the rejection limit of 105 selected for our experiments, while technically suboptimal, realises a good trade-off between decoding time and accuracy. 4.3 A Semantic Document Language Model In this section, we present the results of the application of our decoder to an actual SMT model with cross-sentence features. Our model addresses the problem of lexical cohesion. In particular, it rewards the use of semantically related words in the translation output by the decoder, where semantic distance is measured with a word space model based on Latent Semantic Analysis (LSA). LSA has been applied to semantic language modelling in previous research with some success (Coccaro and Jurafsky, 1998; Bellegarda, 2000; Wandmacher and Antoine, 2007). In SMT, it has mostly been used for domain adaptation (Kim and Khudanpur, 2004; Tam et al., 1186 2007), or to measure sentence similarities (Banchs and Costa-juss a`, 2011). The model we use is inspired by Bellegarda (2000). It is a Markov model, similar to a standard n-gram model, and assigns to each content word a score given a history of n preceding content words, where n = 30 below. Scoring relies on a 30dimensional LSA word vector space trained with the S-Space software (Jurgens and Stevens, 2010). The score is defined based on the cosine similarity between the word vector of the predicted word and the mean word vector of the words in the history, which is converted to a probability by histogram lookup as suggested by Bellegarda (2000). The model is structurally different from a regular n-gram model in that word vector n-grams are defined over content words occurring in the word vector model only and can cross sentence boundaries. Stop words, identified by an extensive stop word list and amounting to around 60 % of the tokens, are scored by a different mechanism based on their relative frequency (undiscounted unigram probability) in the training corpus. In sum, the score produced by the semantic document LM has the following form: wh(er|h)α=is tεpαheuncipgors(wp)o|hrtinof w fci os nakutneskotn wpon w ,onerldse,in(ls1teh5) training corpus and ε is a small fixed probability. It is integrated into the decoder as an extra feature function. Since we lack an automatic method for training the feature weights of document-wide features, its weight was selected by grid search over a number of values, comparing translation performance for the newstest2009 test set. In these experiments, we used DP beam search to initialise the state of our local search decoder. Three results are presented (table 1): The first table row shows the baseline performance using DP beam search with standard sentence-local features only. The scores in the second row were obtained by running the hill climbing decoder with DP initialisation, but without adding any models. A marginal increase in scores for all three test sets demonstrates that the hill climbing decoder manages to fix some of the search errors made by the DP search. The last row contains the scores obtained by adding in the semantic language model. Scores are presented for three publicly available test sets from recent WMT Machine Translation shared tasks, of which one (newstest2009) was used to monitor progress during development and select the final model. Adding the semantic language model results in a small increase in NIST scores (Doddington, 2002) for all three test sets as well as a small BLEU score gain (Papineni et al., 2002) for two out of three corpora. We note that the NIST score turned out to react more sensitively to improvements due to the semantic LM in all our experiments, which is reasonable because the model specifically targets content words, which benefit from the information weighting done by the NIST score. While the results we present do not constitute compelling evidence in favour of our semantic LM in its current form, they do suggest that this model could be improved to realise higher gains from cross-sentence semantic information. They support our claim that cross- sentence models should be examined more closely and that existing methods should be adapted to deal with them, a problem addressed by our main contribution, the local search document decoder. 5 Related Work Even though DP beam search (Koehn et al., 2003) has been the dominant approach to SMT decoding in recent years, methods based on local search have been explored at various times. For word-based SMT, greedy hill-climbing techniques were advo1187 cated as a faster replacement for beam search (Germann et al., 2001 ; Germann, 2003; Germann et al., 2004), and a problem formulation specifically targeting word reordering with an efficient word reordering algorithm has been proposed (Eisner and Tromble, 2006). A local search decoder has been advanced as a faster alternative to beam search also for phrasebased SMT (Langlais et al., 2007; Langlais et al., 2008). That work anticipates many of the features found in our decoder, including the use of local search to refine an initial hypothesis produced by DP beam search. The possibility of using models that do not fit well into the beam search paradigm is mentioned and illustrated with the example of a reversed n-gram language model, which the authors claim would be difficult to implement in a beam search decoder. Similarly to the work by Germann et al. (2001), their decoder is deterministic and explores the entire neighbourhood of a state in order to identify the most promising step. Our main contribution with respect to the work by Langlais et al. (2007) is the introduction of the possibility of handling document-level models by lifting the assumption of sentence independence. As a consequence, enumerating the entire neighbourhood becomes too expensive, which is why we resort to a “first-choice” strategy that non-deterministically generates states and accepts the first one encountered that meets the acceptance criterion. More recently, Gibbs sampling was proposed as a way to generate samples from the posterior distribution of a phrase-based SMT decoder (Arun et al., 2009; Arun et al., 2010), a process that resembles local search in its use of a set of state-modifying operators to generate a sequence of decoder states. Where local search seeks for the best state attainable from a given initial state, Gibbs sampling produces a representative sample from the posterior. Like all work on SMT decoding that we know of, the Gibbs sampler presented by Arun et al. (2010) assumes independence of sentences and considers the complete neighbourhood of each state before taking a sample. 6 Conclusion In the last twenty years of SMT research, there has been a strong assumption that sentences in a text newstest2009 newstest2010 newstest201 1 BLEU NIST BLEU NIST BLEU NIST 22.56 6.513 27.27 7.034 24.94 7.170 + hill climbing 22.60 6.518 27.33 7.046 24.97 7.169 with semantic LM 22.71 6.549 27.53 7.087 24.90 7.199 DP search only DP Table 1: Experimental results with a cross-sentence semantic language model are independent of one another, and discourse context has been largely neglected. Several factors have contributed to this. Developing good discourse-level models is difficult, and considering the modest translation quality that has long been achieved by SMT, there have been more pressing problems to solve and lower hanging fruit to pick. However, we argue that the popular DP beam search algorithm, which delivers excellent decoding performance, but imposes a particular kind of local dependency structure on the feature models, has also had its share in driving researchers away from discourse-level problems. In this paper, we have presented a decoding procedure for phrase-based SMT that makes it possible to define feature models with cross-sentence dependencies. Our algorithm can be combined with DP beam search to leverage the quality of the traditional approach with increased flexibility for models at the discourse level. We have presented preliminary results on a cross-sentence semantic language model addressing the problem of lexical cohesion to demonstrate that this kind of models is worth exploring further. Besides lexical cohesion, cross-sentence models are relevant for other linguistic phenomena such as pronominal anaphora or verb tense selection. We believe that SMT research has reached a point of maturity where discourse phenomena should not be ignored any longer, and we consider our decoder to be a step towards this goal. References Abhishek Arun, Chris Dyer, Barry Haddow, Phil Blunsom, Adam Lopez, and Philipp Koehn. 2009. Monte carlo inference and maximization for phrase-based translation. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 102–1 10, Boulder, Colorado, June. Association for Computational Linguistics. Abhishek Arun, Barry Haddow, Philipp Koehn, Adam Lopez, Chris Dyer, and Phil Blunsom. 2010. Monte 1188 Ma- Carlo techniques for phrase-based translation. chine translation, 24(2): 103–121 . Rafael E. Banchs and Marta R. Costa-juss a`. 2011. A semantic feature for Statistical Machine Translation. In Proceedings of Fifth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 126– 134, Portland, Oregon, USA, June. Association for Computational Linguistics. Jerome R. Bellegarda. 2000. Exploiting latent semantic information in statistical language modeling. Proceedings of the IEEE, 88(8): 1279–1296. Chris Callison-Burch, Philipp Koehn, Christof Monz, and Omar Zaidan. 2011. Findings of the 2011 Workshop on Statistical Machine Translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 22–64, Edinburgh, Scotland, July. Association for Computational Linguistics. Noah Coccaro and Daniel Jurafsky. 1998. Towards better integration of semantic predictors in statistical language modeling. In Proceedings of the 5th International Conference on Spoken Language Processing, Sydney. George Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second International conference on Human Language Technology Research, pages 138–145, San Diego. Jason Eisner and Roy W. Tromble. 2006. Local search with very large-scale neighborhoods for optimal permutations in machine translation. In Proceedings of the HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pages 57–75. Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In Interspeech 2008, pages 1618–1621 . ISCA. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001 . Fast decoding and optimal decoding for machine translation. In Proceedings of 39th Annual Meeting of the Association for Computational Linguistics, pages 228–235, Toulouse, France, July. Association for Computational Linguis- tics. Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1–2): 127–143. Ulrich Germann. 2003. Greedy decoding for Statistical Machine Translation in almost linear time. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. Zhengxian Gong, Min Zhang, and Guodong Zhou. 2011. Cache-based document-level Statistical Machine Translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 909–919, Edinburgh, Scotland, UK., July. Association for Computational Linguistics. Christian Hardmeier and Marcello Federico. 2010. Modelling Pronominal Anaphora in Statistical Machine Translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289. Basil Hatim and Ian Mason. 1990. Discourse and the Translator. Language in Social Life Series. Longman, London. Kenneth Heafield. 2011. KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland, July. Association for Computational Linguistics. Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving translation quality by discarding most of the phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967– 975, Prague, Czech Republic, June. Association for Computational Linguistics. David Jurgens and Keith Stevens. 2010. The S-Space package: An open source package for word space models. In Proceedings of the ACL 2010 System Demonstrations, pages 30–35, Uppsala, Sweden, July. Association for Computational Linguistics. Woosung Kim and Sanjeev Khudanpur. 2004. Crosslingual latent semantic analysis for language modeling. In IEEE international conference on acoustics, speech, and signal processing (ICASSP), volume 1, pages 257–260, Montr ´eal. Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48– 54, Edmonton. 1189 Philipp Koehn, Hieu Hoang, Alexandra Birch, et al. 2007. Moses: open source toolkit for Statistical Machine Translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177–180, Prague. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2007. A greedy decoder for phrase-based statistical machine translation. In TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, pages 104–1 13, Sk¨ ovde. Philippe Langlais, Alexandre Patry, and Fabrizio Gotti. 2008. Recherche locale pour la traduction statistique par segments. In TALN 2008, pages 119–128, Avignon, France, June. ATALA. Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden, July. Association for Computational Linguistics. Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for Statistical Machine Translation. In Proceedings of the DataDriven Machine Translation Workshop, 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 55–62, Toulouse. Franz Josef Och. 2003. Minimum error rate training in Statistical Machine Translation. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, pages 160–167, Sapporo (Japan). Kishore Papineni, Salim Roukos, Todd Ward, and WeiJing Zhu. 2002. BLEU: a method for automatic evaluation of Machine Translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 3 11–3 18, Philadelphia. ACL. Yik-Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual LSA-based adaptation for Statistical Machine Translation. Machine Translation, 21(4): 187– 207. J o¨rg Tiedemann. 2010. To cache or not to cache? Experiments with adaptive models in Statistical Machine Translation. In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, pages 189–194, Uppsala, Sweden. Association for Computational Linguistics. Christoph Tillmann and Hermann Ney. 2003. Word reordering and a Dynamic Programming beam search algorithm for Statistical Machine Translation. Computational linguistics, 29(1):97–133. Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in Statistical Translation. In Proceedings of the 35th Annual Meeting of the Association for Computational Spain, tics. Linguistics, July. Association for 289–296, Madrid, Computational Linguis- pages Tonio Wandmacher and Jean-Yves Antoine. 2007. Methods to integrate a language model with semantic information for a word prediction component. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL), pages 506–5 13, Prague, Czech Republic, June. Association for Computational Linguistics. 1190
6 0.68998551 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques
7 0.6846782 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
8 0.68229264 67 emnlp-2012-Inducing a Discriminative Parser to Optimize Machine Translation Reordering
9 0.68168974 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
10 0.64244771 50 emnlp-2012-Extending Machine Translation Evaluation Metrics with Lexical Cohesion to Document Level
11 0.63992685 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM
12 0.6242314 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
13 0.62090898 104 emnlp-2012-Parse, Price and Cut-Delayed Column and Row Generation for Graph Based Parsers
14 0.61971766 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
15 0.61491871 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
16 0.61114323 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
17 0.6106289 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition
18 0.60906464 95 emnlp-2012-N-gram-based Tense Models for Statistical Machine Translation
19 0.60630202 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
20 0.60416943 55 emnlp-2012-Forest Reranking through Subtree Ranking