acl acl2012 acl2012-118 knowledge-graph by maker-knowledge-mining

118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes


Source: pdf

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. [sent-2, score-0.308]

2 We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. [sent-3, score-0.483]

3 We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score. [sent-4, score-0.513]

4 1 Introduction The IBM Models of word alignment (Brown et al. [sent-5, score-0.123]

5 , 1993) serve as the starting point for most current state-of-the-art machine translation systems, both phrase-based and syntax-based (Koehn et al. [sent-7, score-0.141]

6 In this setting, which involves finding a segmentation of the input sentences into phrasal units, it is particularly important to control the tendency of EM to choose longer phrases, which explain the training data well but are unlikely to generalize. [sent-17, score-0.079]

7 However, most state-of-the-art machine translation systems today are built on the basis of word-level alignments of the type generated by GIZA++ from the IBM Models and the HMM. [sent-18, score-0.242]

8 Overfitting is also a problem in this context, and improving these word alignment systems could be of broad utility in machine translation research. [sent-19, score-0.288]

9 Moore (2004) discusses details of how EM overfits the data when training IBM Model 1. [sent-20, score-0.024]

10 He discovers that the EM algorithm is particularly susceptible to overfitting in the case of rare words, due to the “garbage collection” phenomenon. [sent-21, score-0.25]

11 Suppose a sentence contains an English word e1 that occurs nowhere else in the data, and its French translation f1. [sent-22, score-0.218]

12 Suppose that same sentence also contains a word e2 which occurs frequently in the overall data but whose translation in this sentence, f2, co-occurs with it infrequently. [sent-23, score-0.245]

13 If the translation t(f2|e2) occurs with probability 0.1, [sent-24, score-0.185]

14 then the sentence will have a higher probability if EM assigns the rare word and its actual translation a probability of t(f1|e1) = 0.5, [sent-25, score-0.278]

15 and assigns the rare word’s translation to f2 a probability of t(f2|e1) = 0.5, [sent-26, score-0.245]

16 than if it assigns a probability of 1 to the correct translation t(f1|e1). [sent-27, score-0.175]
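
The arithmetic behind this example can be checked directly. The short sketch below (Python) compares the IBM Model 1 likelihood of such a sentence pair under the two parameter settings quoted above; the likelihood drops the constant length and NULL-word terms, and the code is an illustration rather than anything from the paper.

    def model1_likelihood(french, english, t):
        # IBM Model 1: P(f | e) is proportional to the product, over French
        # words, of the summed translation probabilities from all English words.
        p = 1.0
        for f in french:
            p *= sum(t.get((f, e), 0.0) for e in english)
        return p

    english = ["e1", "e2"]   # e1 is the rare word, e2 the frequent one
    french = ["f1", "f2"]

    # "Correct" parameters: the rare word keeps all of its mass for f1.
    t_correct = {("f1", "e1"): 1.0, ("f2", "e2"): 0.1}
    # "Garbage-collecting" parameters: e1 splits its mass between f1 and f2.
    t_garbage = {("f1", "e1"): 0.5, ("f2", "e1"): 0.5, ("f2", "e2"): 0.1}

    print(model1_likelihood(french, english, t_correct))  # 1.0 * 0.1 = 0.1
    print(model1_likelihood(french, english, t_garbage))  # 0.5 * (0.5 + 0.1) = 0.3

Because 0.3 > 0.1, the maximum-likelihood objective rewards letting the rare word absorb part of f2, which is exactly the overfitting VB is meant to suppress.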

17 Moore suggests a number of solutions to this issue, including add-n smoothing and initializing the probabilities based on a heuristic rather than choosing uniform probabilities. [sent-28, score-0.151]

18 When combined, his solutions cause a significant decrease in alignment error rate (AER). [sent-29, score-0.234]

19 More recently, Mermer and Saraclar (2011) have added a Bayesian prior to IBM Model 1 using Gibbs sampling for inference, showing improvements in BLEU scores. [sent-30, score-0.094]

20 We incorporate variational Bayes (VB) into the widely used GIZA++ software for word alignment. [sent-33, score-0.321]

21 We use VB both because it converges more quickly than Gibbs sampling, and because it can be applied in a fairly straightforward manner to all of the models implemented by GIZA++. [sent-34, score-0.065]

22 In Section 3, we present results for VB for the various models, in terms of perplexity of held-out test data, alignment error rate (AER), and the BLEU scores which result from using our version of GIZA++ in the end-to-end phrase-based machine translation system Moses. [sent-36, score-0.431]

23 2 Variational Bayes and GIZA++ Beal (2003) gives a detailed derivation of a variational Bayesian algorithm for HMMs. [sent-37, score-0.321]

24 In practice, the digamma function has the effect of subtracting 0.5 from the expected counts. [sent-41, score-0.064]

25 Because 0.5 is subtracted from the expected counts, small counts corresponding to rare co-occurrences of words will be penalized heavily, while larger counts will not be affected very much. [sent-45, score-0.232]

26 Thus, low values of α cause the algorithm to favor words which co-occur frequently and to distrust words that co-occur rarely. [sent-46, score-0.078]

27 In this way, VB controls the overfitting that would otherwise occur with rare words. [sent-47, score-0.211]
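
To make the count penalty concrete, the sketch below (Python, using scipy) contrasts the standard EM normalization with the mean-field VB update for a multinomial under a symmetric Dirichlet(α) prior, the update whose digamma behavior is described above. The counts are invented for illustration; this is not the modified GIZA++ code itself.

    from math import exp
    from scipy.special import digamma

    def em_normalize(counts):
        # Standard M-step: maximum-likelihood normalization of expected counts.
        total = sum(counts)
        return [c / total for c in counts]

    def vb_normalize(counts, alpha=0.0):
        # Mean-field VB update: theta_k = exp(digamma(c_k + alpha)) /
        # exp(digamma(sum_j c_j + K * alpha)); the result need not sum to one.
        denom = exp(digamma(sum(counts) + alpha * len(counts)))
        return [exp(digamma(c + alpha)) / denom for c in counts]

    # Expected counts for one English word: one frequent co-occurrence, one rare one.
    counts = [20.0, 0.6]
    print(em_normalize(counts))        # approximately [0.971, 0.029]
    print(vb_normalize(counts, 0.0))   # approximately [0.970, 0.011]; the rare count is penalized far more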

28 On the other hand, higher values of α can be chosen if smoothing is desired, for instance in the case of the alignment probabilities, which state how likely a word in position i of the English sentence is to align to a word in position j of the French sentence. [sent-48, score-0.292]

29 For these probabilities, smoothing is important because we do not want to rule out any alignment altogether, no matter how infrequently it occurs in the data. [sent-49, score-0.245]

30 We implemented VB for the translation probabilities as well as for the position alignment probabilities of IBM Model 2. [sent-50, score-0.508]

31 We discovered that adding VB for the translation probabilities improved the performance of the system. [sent-51, score-0.238]

32 However, including VB for the alignment probabilities had relatively little effect, because the alignment table in its original form does some smoothing during normalization by interpolating the counts with a uniform distribution. [sent-52, score-0.488]
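
The smoothing mentioned here is just a linear interpolation with a uniform distribution. A minimal sketch of that idea follows (Python); the interpolation weight lam is arbitrary, since the value GIZA++ actually uses is not given in this summary.

    def smoothed_alignment_probs(counts, lam=0.9):
        # Interpolate normalized counts with a uniform distribution so that
        # no alignment position is ever assigned zero probability.
        total = sum(counts)
        n = len(counts)
        return [lam * c / total + (1.0 - lam) / n for c in counts]

    print(smoothed_alignment_probs([8.0, 2.0, 0.0]))  # approximately [0.753, 0.213, 0.033]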

33 We did not experiment with VB for the distortion probabilities of the HMM or Models 3 and 4, as these distributions have fewer parameters and are likely to have reliable counts during EM. [sent-54, score-0.164]

34 Thus, in Section 3, we present the results of using VB for the translation probabilities only. [sent-55, score-0.238]

35 3 Results First, we ran our modified version of GIZA++ on a simple test case designed to be similar to the example from Moore (2004) discussed in Section 1. [sent-56, score-0.117]

36 Our test case, shown in Table 1, had three different sentence pairs; we included nine instances of the first, two instances of the second, and one of the third. [sent-57, score-0.06]

37 Human intuition tells us that f2 should translate to e2 and f1 should translate to e1. [sent-58, score-0.066]

38 However, the EM algorithm without VB prefers e1 as the translation of f2. (Figure 1, AER after entire training vs. alpha: Determining the best value of α for the translation probabilities.) [sent-59, score-0.327]

39 Training data is 10,000 sentence pairs from each language pair. [sent-60, score-0.066]

40 This table shows the AER for different values of α after training is complete (five iterations each of Models 1, HMM, 3, and 4). [sent-62, score-0.135]

41 The EM algorithm with VB does not overfit this data and prefers e2 as f2’s translation. [sent-64, score-0.045]
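
A complete toy run can reproduce this contrast. The sketch below (Python) is a bare-bones Model 1 trainer with an optional VB M-step; since Table 1 is not reproduced in this summary, the three sentence pairs are only a guess at its spirit (a frequent word e2, a rare word e1, and one sentence containing both), so the corpus and the exact numbers are illustrative, not the paper's.

    from collections import defaultdict
    from math import exp
    from scipy.special import digamma

    def train_model1(corpus, iterations=20, use_vb=False, alpha=0.0):
        # Unnormalized uniform initialization; the E-step only uses ratios.
        t = defaultdict(float)
        for es, fs in corpus:
            for e in es:
                for f in fs:
                    t[(f, e)] = 1.0
        for _ in range(iterations):
            counts = defaultdict(float)            # expected counts c(f, e)
            for es, fs in corpus:                  # E-step: fractional link counts
                for f in fs:
                    z = sum(t[(f, e)] for e in es)
                    for e in es:
                        counts[(f, e)] += t[(f, e)] / z
            totals = defaultdict(float)
            dims = defaultdict(int)
            for (f, e), c in counts.items():
                totals[e] += c
                dims[e] += 1                       # observed French types for e
            for (f, e), c in counts.items():       # M-step
                if use_vb:
                    # VB update with a symmetric Dirichlet(alpha) prior; using the
                    # observed types as the prior dimension is a simplification
                    # (irrelevant when alpha = 0).
                    t[(f, e)] = exp(digamma(c + alpha)) / exp(digamma(totals[e] + alpha * dims[e]))
                else:
                    t[(f, e)] = c / totals[e]      # plain maximum-likelihood EM
        return t

    corpus = ([(["e2"], ["f3"])] * 9 +             # e2 with its usual translation
              [(["e2"], ["f2"])] * 2 +             # e2 with its infrequent translation f2
              [(["e1", "e2"], ["f1", "f2"])])      # the one sentence containing the rare e1

    em = train_model1(corpus)
    vb = train_model1(corpus, use_vb=True, alpha=0.0)
    # On this toy corpus, plain EM ends with t(f2|e1) > t(f2|e2) (garbage collection),
    # while the VB run reverses the ordering and keeps f2 with e2.
    print("EM:", round(em[("f2", "e1")], 3), round(em[("f2", "e2")], 3))
    print("VB:", round(vb[("f2", "e1")], 3), round(vb[("f2", "e2")], 3))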

42 For minimum error rate training, we used 1000 sentences for French-English, 2000 sentences for German-English, and 1274 sentences for Chinese-English. [sent-69, score-0.145]

43 Our test sets contained 1000 sentences each for French-English and German-English, and 686 sentences for ChineseEnglish. [sent-70, score-0.081]

44 For scoring the Viterbi alignments of each system against gold-standard annotated alignments, (Figure 2, Model 1 susceptibility to overfitting vs. iterations of Model 1: Effect of variational Bayes on overfitting for Model 1.) [sent-71, score-0.503]

45 This table contrasts the test perplexities of Model 1 with variational Bayes and Model 1 without variational Bayes after different numbers of training iterations. [sent-73, score-0.741]

46 we use the alignment error rate (AER) of Och and Ney (2000), which measures agreement at the level of pairs of words. [sent-75, score-0.22]
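
For reference, AER has a simple closed form in terms of the sure links S, the possible links P, and the hypothesized links A. The sketch below (Python) implements the standard Och and Ney definition; the tiny gold standard at the bottom is made up for illustration, not data from the paper.

    def aer(hypothesis, sure, possible):
        # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P.
        a_s = len(hypothesis & sure)
        a_p = len(hypothesis & possible)
        return 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))

    sure = {(0, 0), (1, 1)}                  # gold links the annotators are sure of
    possible = sure | {(2, 1)}               # gold links marked as merely possible
    hypothesis = {(0, 0), (1, 1), (2, 2)}    # Viterbi alignment to be scored
    print(aer(hypothesis, sure, possible))   # 1 - (2 + 2) / (3 + 2) = 0.2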

47 We ran our code on ten thousand sentence pairs to determine the best value of α for the translation probabilities t(f|e). [sent-76, score-0.487]

48 For our training, we ran GIZA++ for five iterations each of Models 1, HMM, 3, and 4. [sent-77, score-0.09]

49 Figure 1 shows how VB, and different values of α in particular, affect the performance of GIZA++ in terms of AER. [sent-80, score-0.031]

50 We discover that, after all training is complete, VB improves the performance of the overall system, lowering AER (Figure 1) for all three language pairs. [sent-81, score-0.077]

51 We find that low values of α cause the most consistent improvements, and so we use α = 0 for the translation probabilities in the remaining experiments. [sent-82, score-0.316]

52 Note that, while a value of α = 0 does not define a probabilistically valid Dirichlet prior, it does not cause any practical problems in the update equation for VB. [sent-83, score-0.047]

53 Figure 2 shows the test perplexity after GIZA++ has been run for twenty-five iterations of Model 1: without VB, the test perplexity increases as training continues, but it remains stable when VB is used. [sent-84, score-0.31]
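
For readers who want to reproduce this kind of curve, held-out perplexity under Model 1 can be computed roughly as below (Python). The per-word formulation and the NULL-word handling here are one reasonable convention; the constants GIZA++ uses internally may differ, so treat this as a sketch rather than the exact quantity plotted in Figure 2.

    import math

    def model1_logprob(french, english, t, floor=1e-20):
        # log P(f | e) under Model 1 with a NULL word, dropping the constant
        # epsilon term; unseen pairs get a tiny floor to avoid log(0).
        es = ["NULL"] + list(english)
        lp = 0.0
        for f in french:
            lp += math.log(max(sum(t.get((f, e), 0.0) for e in es), floor) / len(es))
        return lp

    def perplexity(test_pairs, t):
        # Per-French-word perplexity over held-out (english, french) sentence pairs.
        total_lp = sum(model1_logprob(f, e, t) for e, f in test_pairs)
        total_words = sum(len(f) for _, f in test_pairs)
        return math.exp(-total_lp / total_words)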

54 After choosing 0 as the best value of α for the translation probabilities, (Figure 3, AER for different corpus sizes vs. corpus size: Performance of GIZA++ on different amounts of test data.) [sent-86, score-0.027]

55 Table shows AER after all the training has completed (five iterations each of Models 1, HMM, 3, and 4). [sent-88, score-0.104]

56 we reran the test above (five iterations each of Models 1, HMM, 3, and 4, with VB turned on for Model 1) on different amounts of data. [sent-89, score-0.372]

57 We found that the results for larger data sizes were comparable to the results for ten thousand sentence pairs, both with and without VB (Figure 3). [sent-90, score-0.153]

58 In all of these experiments, we ran Models 1, HMM, 3, and 4 for five iterations each, training on the same ten thousand sentence pairs that we used in the previous experiments. [sent-92, score-0.39]

59 In Table 2, we show the performance of the system when no VB is used, when it is used for each of the four models individually, and when it is used for all four models simultaneously. [sent-93, score-0.078]

60 We saw the most overall improvement when VB was used only for Model 1; using VB for all four models simultaneously caused the most improvement to the test perplexity, but at the cost of (Table 2: BLEU scores for French, Chinese, and German under the baseline, Model 1 only, and all-models settings) [sent-94, score-0.093]

61 For the MT experiments, we ran GIZA++ through Moses, training Model 1, the HMM, and Model 4 on 100,000 sentence pairs from each language pair. [sent-98, score-0.18]

62 We ran three experiments, one with VB turned on for all models, one with VB turned on for Model 1 only, and one (the baseline) with VB turned off for all models. [sent-99, score-0.384]

63 When VB was turned on, we ran GIZA++ for five iterations per model as in our earlier tests, but when VB was turned off, we ran GIZA++ for only four iterations per model, having determined that this was the optimal number of iterations for the baseline system. [sent-100, score-0.682]

64 VB was used for the translation probabilities only, with α set to 0. [sent-101, score-0.238]

65 For French, the best results were achieved when VB was used for Model 1 only; for Chinese and German, on the other hand, using VB for all models caused the most improvements. [sent-103, score-0.039]

66 Overall, VB seems to have the greatest impact on the language pairs that are most difficult to align and translate to begin with. [sent-108, score-0.093]

67 4 Conclusion We find that applying variational Bayes with a Dirichlet prior to the translation models implemented in GIZA++ improves alignments, both in terms of AER and the BLEU score of an end-to-end translation system. [sent-109, score-0.699]

68 Variational Bayes is especially beneficial for IBM Model 1, because its lack of fertility and position information makes it particularly susceptible to the garbage collection phenomenon. [sent-110, score-0.188]

69 Applying VB to Model 1 alone tends to improve the performance of later models in the training sequence. [sent-111, score-0.063]

70 Model 1 is an essential stepping stone in avoiding local minima when training the following models, and improvements to Model 1 lead to improvements in the end-to-end system. [sent-112, score-0.112]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vb', 0.683), ('variational', 0.321), ('giza', 0.192), ('aer', 0.177), ('bayes', 0.162), ('ibm', 0.153), ('translation', 0.141), ('hmm', 0.134), ('em', 0.125), ('alignment', 0.123), ('overfitting', 0.107), ('turned', 0.098), ('probabilities', 0.097), ('bayesian', 0.094), ('ran', 0.09), ('bleu', 0.082), ('iterations', 0.08), ('perplexity', 0.076), ('alignments', 0.075), ('rare', 0.07), ('garbage', 0.067), ('counts', 0.067), ('digamma', 0.064), ('french', 0.063), ('sure', 0.055), ('smoothing', 0.054), ('thousand', 0.053), ('moore', 0.048), ('mermer', 0.048), ('cause', 0.047), ('prefers', 0.045), ('susceptible', 0.045), ('occurs', 0.044), ('dempster', 0.043), ('ix', 0.041), ('ten', 0.04), ('hmms', 0.039), ('moses', 0.039), ('models', 0.039), ('rochester', 0.038), ('five', 0.037), ('blunsom', 0.037), ('rate', 0.036), ('della', 0.036), ('denero', 0.036), ('yj', 0.036), ('german', 0.035), ('pietra', 0.035), ('increased', 0.034), ('assigns', 0.034), ('hermann', 0.034), ('controls', 0.034), ('sampling', 0.033), ('pairs', 0.033), ('sentence', 0.033), ('translate', 0.033), ('ney', 0.032), ('dirichlet', 0.031), ('vogel', 0.031), ('prior', 0.031), ('values', 0.031), ('alexandra', 0.03), ('chris', 0.03), ('improvements', 0.03), ('model', 0.029), ('och', 0.029), ('error', 0.028), ('stepping', 0.028), ('subtracted', 0.028), ('inm', 0.028), ('sity', 0.028), ('frenchenglish', 0.028), ('hansard', 0.028), ('stantin', 0.028), ('galley', 0.028), ('particularly', 0.028), ('sentences', 0.027), ('align', 0.027), ('sizes', 0.027), ('overall', 0.027), ('test', 0.027), ('implemented', 0.026), ('wordlevel', 0.026), ('alpha', 0.026), ('lowering', 0.026), ('canadian', 0.026), ('reran', 0.026), ('brown', 0.024), ('position', 0.024), ('improving', 0.024), ('training', 0.024), ('je', 0.024), ('interpolating', 0.024), ('contrasts', 0.024), ('infrequently', 0.024), ('saraclar', 0.024), ('fertility', 0.024), ('inexact', 0.024), ('cx', 0.024), ('perplexities', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

2 0.29007378 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

Author: Ashish Vaswani ; Liang Huang ; David Chiang

Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an ℓ0 prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).

3 0.18208723 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

4 0.15509959 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

Author: Xiaodong He ; Li Deng

Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In the IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.

5 0.11129747 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

Author: Jingbo Zhu ; Tong Xiao ; Chunliang Zhang

Abstract: This paper presents an unsupervised approach to learning translation span alignments from parallel data that improves syntactic rule extraction by deleting spurious word alignment links and adding new valuable links based on bilingual translation span correspondences. Experiments on Chinese-English translation demonstrate improvements over standard methods for tree-to-string and tree-to-tree translation.

6 0.10329982 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

7 0.10254104 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

8 0.10085613 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

9 0.099354692 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

10 0.098133937 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

11 0.093166776 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

12 0.091539524 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

13 0.091161855 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

14 0.091017991 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

15 0.090108924 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

16 0.089652836 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

17 0.082256816 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors

18 0.078111425 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

19 0.077357672 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

20 0.077116966 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, -0.184), (2, 0.085), (3, 0.035), (4, 0.028), (5, 0.005), (6, 0.001), (7, 0.015), (8, -0.008), (9, -0.016), (10, -0.031), (11, -0.125), (12, -0.05), (13, 0.018), (14, 0.0), (15, -0.036), (16, 0.009), (17, 0.065), (18, 0.086), (19, 0.106), (20, -0.056), (21, -0.041), (22, 0.054), (23, -0.007), (24, 0.091), (25, -0.174), (26, 0.007), (27, -0.096), (28, 0.123), (29, 0.008), (30, -0.058), (31, -0.02), (32, 0.038), (33, -0.044), (34, 0.046), (35, 0.169), (36, -0.106), (37, -0.153), (38, -0.049), (39, 0.246), (40, -0.153), (41, 0.021), (42, 0.121), (43, 0.109), (44, -0.033), (45, 0.006), (46, 0.141), (47, 0.036), (48, -0.017), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93512523 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

2 0.849383 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

Author: Ashish Vaswani ; Liang Huang ; David Chiang

Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an ℓ0 prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).

3 0.65879387 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

4 0.58814454 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

Author: Ning Xi ; Guangchao Tang ; Xinyu Dai ; Shujian Huang ; Jiajun Chen

Abstract: The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both alignment and translation rule induction steps in building Chinese-English SMT system, which may suffer from a suboptimal problem that word segmentation better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use Chinese character as the basic unit for alignment, and then convert this alignment to conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance.

5 0.57608014 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

Author: Jingbo Zhu ; Tong Xiao ; Chunliang Zhang

Abstract: This paper presents an unsupervised approach to learning translation span alignments from parallel data that improves syntactic rule extraction by deleting spurious word alignment links and adding new valuable links based on bilingual translation span correspondences. Experiments on Chinese-English translation demonstrate improvements over standard methods for tree-to-string and tree-to-tree translation.

6 0.51348245 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

7 0.48335904 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

8 0.43007118 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

9 0.40561423 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

10 0.40192625 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

11 0.39665303 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

12 0.39509037 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

13 0.38781083 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

14 0.36574808 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

15 0.36006466 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

16 0.35966855 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

17 0.35956866 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation

18 0.34813654 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

19 0.34380105 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery

20 0.34063894 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.011), (26, 0.026), (28, 0.062), (30, 0.037), (37, 0.029), (39, 0.058), (57, 0.038), (59, 0.024), (74, 0.049), (82, 0.014), (84, 0.018), (85, 0.032), (90, 0.136), (92, 0.043), (93, 0.151), (94, 0.155), (99, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81597942 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy

Author: Apoorv Agarwal ; Adinoyi Omuya ; Aaron Harnly ; Owen Rambow

Abstract: Many researchers have attempted to predict the Enron corporate hierarchy from the data. This work, however, has been hampered by a lack of data. We present a new, large, and freely available gold-standard hierarchy. Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems.

2 0.81124073 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation in developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

3 0.810072 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

Author: Ashish Vaswani ; Liang Huang ; David Chiang

Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an ℓ0 prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).

same-paper 4 0.80999589 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

5 0.80427402 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations

Author: JinYeong Bak ; Suin Kim ; Alice Oh

Abstract: In social psychology, it is generally accepted that one discloses more of his/her personal information to someone in a strong relationship. We present a computational framework for automatically analyzing such self-disclosure behavior in Twitter conversations. Our framework uses text mining techniques to discover topics, emotions, sentiments, lexical patterns, as well as personally identifiable information (PII) and personally embarrassing information (PEI). Our preliminary results illustrate that in relationships with high relationship strength, Twitter users show significantly more frequent behaviors of self-disclosure.

6 0.80355942 176 acl-2012-Sentence Compression with Semantic Role Constraints

7 0.75555509 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

8 0.69529992 140 acl-2012-Machine Translation without Words through Substring Alignment

9 0.69244301 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

10 0.68821394 136 acl-2012-Learning to Translate with Multiple Objectives

11 0.68641084 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation

12 0.67988235 83 acl-2012-Error Mining on Dependency Trees

13 0.67458469 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

14 0.67280453 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

15 0.67254603 108 acl-2012-Hierarchical Chunk-to-String Translation

16 0.67200679 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules

17 0.67064947 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

18 0.66828662 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

19 0.66800576 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

20 0.6669057 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations