
67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Source: pdf

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. [sent-5, score-0.912]

2 This motivates us to develop an alignment-based framework for unsupervised dependency parsing. [sent-7, score-0.387]

3 The framework (which will be made publicly available) is flexible, modular and easy to extend. [sent-8, score-0.136]

4 Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach. [sent-9, score-0.624]

5 1 Introduction Both statistical machine translation (SMT) and unsupervised dependency parsing have seen a surge of interest in recent years, as the need for large scale data processing has increased. [sent-10, score-0.416]

6 However, in this paper, we reveal a strong connection between them and show that the problem of dependency parsing can be formulated as one of word alignment within the sentence. [sent-12, score-0.807]

7 Based on this connection, we develop a framework which uses an alignment-based approach for unsupervised dependency parsing. [sent-15, score-0.338]

8 We demonstrate these properties and the merit of the alignment-based parsing approach by implementing several dependency parsing algorithms based on the IBM alignment models and evaluating their performance on the task. [sent-17, score-0.826]

9 These results are encouraging and indicate that the alignment-based approach could serve as the basis for competitive dependency parsing systems, much as DMV did. [sent-20, score-0.444]

10 First, by revealing the connection between the two tasks, we introduce a new approach to dependency parsing, and open the way for use of SMT alignment resources and tools for parsing. [sent-22, score-0.612]

11 The second contribution is a publicly available framework for exploring new alignment models. [sent-24, score-0.365]

12 The framework uses Gibbs sampling techniques and includes our sampling-based implementations of the IBM models (see Section 3. [sent-25, score-0.217]

13 The sampling approach makes it easy to modify the existing models and add new ones. [sent-27, score-0.143]

14 The framework can be used both for dependency parsing and for bilingual word alignment. [sent-28, score-0.44]

15 In Section 2 we present a brief overview of those works in the fields of dependency parsing and alignment for statistical machine translation which are directly … [Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts, USA, 9-11 October 2010.] [sent-30, score-0.674]

16 Section 3 describes the connection between the two problems, examines the shared assumptions of the DMV and IBM models, and describes our framework and algorithms. [sent-33, score-0.244]

17 1 Unsupervised Dependency Parsing In recent years, the field of supervised parsing has advanced tremendously, to the point where highly accurate parsers are available for many languages. [sent-37, score-0.142]

18 Therefore, for domains and languages with minimal resources, unsupervised parsing is of great importance. [sent-39, score-0.182]

19 Early work in the field focused on models that made use primarily of the co-occurrence information of the head and its argument (Yuret, 1998; Paskin, 2001). [sent-40, score-0.223]

20 DMV is based on a linguistically motivated generative model, which follows common practice in supervised parsing and takes into consideration the distance between head and argument, as well as the valence (the capacity of a head word to attach arguments). [sent-42, score-0.647]

21 DMV strongly outperformed previous models and was the first unsupervised dependency induction system to achieve accuracy above the right-branching baseline. [sent-44, score-0.362]

22 , 1993) represent the first generation of word-based SMT models, and serve as a starting point for most cur… Figure 1: An example of an alignment between an English sentence (top) and its French translation (bottom). [sent-51, score-0.377]

23 The models employ the notion of alignment between individual words in the source and translation. [sent-56, score-0.474]

24 An example of such an alignment is given in Figure 1. [sent-57, score-0.258]

25 1 The Connection The task of dependency parsing requires finding a parse tree for a sentence, where two words are connected by an edge if they participate in a syntactic dependency relation. [sent-61, score-0.674]

26 An example of a dependency parse of a sentence is given in Figure 2 (left). [sent-63, score-0.272]

27 Find a set of pairwise relations (si, sj) connecting a dependent word sj with its head word si in the sentence. [sent-65, score-0.299]

28 This alternate formulation allows us to view the problem as one of alignment of a sentence to itself, as shown in Figure 2 (right). [sent-66, score-0.33]
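The reformulation above, a parse as a set of (head, dependent) pairs, can be illustrated with a small sketch. This is not taken from the paper's framework: the function name `heads_to_alignment`, the `<root>` token, and the toy sentence with its parse are all assumptions made for illustration.

```python
def heads_to_alignment(words, heads):
    """Turn a head array into (head, dependent) alignment pairs.

    heads[i] is the 1-based position of the head of word i + 1.
    Position 0 stands for the artificial root (hypothetical token "<root>").
    """
    if len(words) != len(heads):
        raise ValueError("words and heads must have the same length")
    tokens = ["<root>"] + list(words)
    # Each dependent position d is aligned to its head position h.
    return [(tokens[h], tokens[d]) for d, h in enumerate(heads, start=1)]


# Toy example (assumed parse): "the" depends on "dog", "dog" on "barks",
# and "barks" on the artificial root.
pairs = heads_to_alignment(["the", "dog", "barks"], [2, 3, 0])
# pairs == [("dog", "the"), ("barks", "dog"), ("<root>", "barks")]
```

Read this way, the list of heads plays the role of the source side and the list of dependents the role of the target side of a sentence aligned to itself.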

29 Given this perspective on the problem, it makes sense to examine existing alignment models, compare them to dependency parsing models, and see if they can be successfully employed for the dependency parsing task. [sent-67, score-0.99]

30 Figure 2: Left: An example of an unlabeled dependency parse of a sentence. [sent-68, score-0.272]

31 Right: The same parse, in the form of an alignment between head words (top) and their dependents (bottom). [sent-69, score-0.457]

32 The same assumption is made in all the dependency models mentioned in Section 2 regarding a head and its dependent (although DMV uses word classes instead of the actual words). [sent-72, score-0.503]

33 One of the improvements contributing to the success of DMV was the notion of distance, which was absent from previous models (see Section 3 in Klein and Manning 2004). [sent-74, score-0.155]

34 Fertility IBM Model 3 adds the notion of fertility, or the idea that different words in the source language tend to generate different numbers of words in the target language. [sent-75, score-0.246]

35 In these characteristics, it is very similar to the “root” node, which is artificially added to parse trees and used to represent the head of words which are not dependents of any other word in the sentence. [sent-81, score-0.247]

36 In examining the core assumptions of the IBM models, we note that there is a strong resemblance to those of DMV. [sent-82, score-0.126]

37 i.e., exploring the use of the IBM alignment models for dependency parsing. [sent-87, score-0.575]

38 It is important to note that the IBM models do not address many important factors relevant to the parsing task. [sent-88, score-0.202]

39 For instance, they have no notion of a parse tree, a deficit which may lead to degenerate solutions and malformed parses. [sent-89, score-0.188]
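Because an alignment carries no tree constraint, a set of head assignments may contain cycles. A minimal sketch of a check for this, assuming the same head-array encoding as elsewhere in these notes (1-based head positions, 0 for the root); the function name `is_acyclic` is hypothetical and the check is not described in the paper:

```python
def is_acyclic(heads):
    """Return True if every word reaches the root (0) by following heads.

    heads[i] is the 1-based head position of word i + 1; 0 is the root.
    A word that is its own head, or any longer cycle, yields False.
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        raise ValueError("head positions must lie in 0..len(heads)")
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:  # revisited a word: the chain never reaches root
                return False
            seen.add(node)
            node = heads[node - 1]
    return True
```

A check like this only detects malformed output after the fact; it does not make the underlying model tree-aware.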

40 However, they serve as a good starting point for exploring the alignment approach to parsing, as well as discovering additional factors that need to be addressed under this approach. [sent-90, score-0.36]

41 3 Experimental Framework We developed a Gibbs sampling framework for alignment-based dependency parsing. [sent-92, score-0.157]

42 The traditional approach to alignment uses Expectation Maximization (EM) to find the optimal values for the latent variables. [sent-93, score-0.482]

43 The sampling method, on the other hand, only considers a small change in each step - that of re-aligning a previously aligned target word to a new source. [sent-101, score-0.214]

44 Under the sampling framework, the model provides the probability of changing the alignment A[i] of a target word i from a previously aligned source word j to a new one ĵ. [sent-104, score-0.578]

45 In all the models we consider, this probability is proportional to the ratio between the scores of the old sentence alignment A and the new one Â, which differs from the old only in the realignment of i to ĵ. [sent-105, score-0.487]

46 P(A[i] = j \Rightarrow A[i] = \hat{j}) \sim \frac{P_{model}(\hat{A})}{P_{model}(A)} (2) As a starting point for our dependency parsing model, we re-implemented the first three IBM models in the sampling framework. [sent-106, score-0.544]
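The re-alignment step described above can be sketched as follows. Since the old alignment's score is the same for every candidate, sampling in proportion to the new-to-old score ratio amounts to sampling in proportion to the score of each proposed alignment. This is a sketch under assumptions, not the paper's implementation: `resample_alignment`, the `score` callback, and the list-based alignment encoding are all hypothetical.

```python
import random


def resample_alignment(alignment, i, candidates, score, rng=random):
    """Resample the source position aligned to target position i (in place).

    alignment  -- list of source positions, one per target word
    candidates -- sequence of source positions that target i may align to
    score      -- callable returning an unnormalised model score for a
                  full alignment (assumed non-negative)
    """
    weights = []
    for j_new in candidates:
        proposal = list(alignment)
        proposal[i] = j_new          # differs from the old alignment only at i
        weights.append(score(proposal))
    if sum(weights) <= 0:
        return alignment[i]          # no admissible move; keep the old link
    alignment[i] = rng.choices(candidates, weights=weights, k=1)[0]
    return alignment[i]
```

One sweep of the sampler would call this once per target position; how many sweeps to run and how to read off a final alignment are design choices this sketch leaves open.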

47 4 Reformulating the IBM models IBM Model 1 According to this model, the probability of an alignment between target word i and source word ĵ depends only on the lexical identities of the two words wi and wĵ respectively. [sent-108, score-0.604]

48 IBM Model 2 The original IBM model 2 is a distortion model that assumes that the probability of an alignment between target word i and source word ĵ depends only on the locations of the words, i. [sent-114, score-0.499]

49 , the values i and ĵ, taking into account the different lengths l and m of the source and target sentences, respectively. [sent-116, score-0.189]

50 For dependency parsing, where we align sentences to themselves, l= m. [sent-117, score-0.224]

51 Even without the need for handling different lengths for source and target sentences, this model is complex and requires estimating a separate probability for each triplet (i, j,l). [sent-120, score-0.156]

52 In addition, the assumption that the distance distribution depends only on the sentence length and is similar for all tokens seems unreasonable, especially when dealing with part-of-speech tokens and dependency relations. [sent-121, score-0.326]

53 For this reason, we also implemented an alternate distance model, based loosely on Liang et al. [sent-124, score-0.131]

54 Under the alternate model, the probability of an alignment between target word i and source word ĵ depends on the distance between them, their order, the sentence length, and the word type of the head, according to equation 6. [sent-126, score-0.685]

55 P(i, \hat{j}, l) = \frac{\#[w_i, (i - \hat{j}), l] + \alpha_3 / D}{\#(w_i, *, l) + \alpha_3} (6) IBM Model 3 This model handles the notion of fertility (or valence). [sent-127, score-0.251]

56 Under this model, the probability of an alignment depends on how many target words are aligned to each of the source words. [sent-128, score-0.538]

57 Each source word type wĵ has a distribution specifying the probability of having n aligned target words. [sent-129, score-0.237]

58 The probability of an alignment is proportional to the product of the probabilities of the fertilities in the alignment and takes into account the special status of the null word (represented by the index j = 0). [sent-130, score-0.727]

59 \frac{\#(w_j, \phi_j) + \alpha_4 / F}{\#(w_j, *) + \alpha_4} (7) Here, φj denotes the number of target words aligned to the j-th source word in alignment A. [sent-136, score-0.45]

60 #(wj, φj) represents the number of times source word wj was observed to have φj dependent target words, #(wj, ∗) is the number of times wj appeared in the data, F is the expected number of fertility values (5 in our experiments), and α4 is the CRP hyperparameter. [sent-138, score-0.637]
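Reading the fertility term as a smoothed relative frequency, (#(wj, φj) + α4/F) / (#(wj, ∗) + α4), a small sketch of the computation might look like this. The function name, the `Counter` layout, and the default `alpha=1.0` are assumptions for illustration; only F = 5 comes from the text above.

```python
from collections import Counter


def crp_fertility_prob(counts, word, fertility, alpha=1.0, num_values=5):
    """Smoothed probability that `word` has `fertility` aligned dependents.

    counts     -- Counter mapping (word, fertility) to observed frequency
    alpha      -- CRP hyperparameter (alpha_4 in the text; value assumed)
    num_values -- expected number of fertility values (F; 5 in the text)
    """
    pair_count = counts[(word, fertility)]
    # #(w, *): total observations of the word over all fertility values.
    word_count = sum(c for (w, _), c in counts.items() if w == word)
    return (pair_count + alpha / num_values) / (word_count + alpha)


# Toy counts (invented): "VB" seen three times with fertility 1, once with 2.
toy_counts = Counter({("VB", 1): 3, ("VB", 2): 1})
p = crp_fertility_prob(toy_counts, "VB", 1)   # (3 + 0.2) / (4 + 1)
```

With F fertility values, the smoothed estimates for a word sum to one, and an unseen word falls back to the uniform value 1/F.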

61 It uses the alignments … [Footnote 5] The transitional version of this equation depends on whether either the old source word (j) or the new one (ĵ) is null, and is omitted for brevity. [sent-141, score-0.255]

62 This allows for the easy introduction of new models which consider different aspects of the alignment and complement each other. [sent-148, score-0.318]

63 Preventing Self-Alignment When adapting the alignment approach to dependency parsing, we view the task as that of aligning a sentence to itself. [sent-149, score-0.525]

64 For this purpose we introduce a simple model into the product which gives zero probability to alignments which contain a word aligned to itself, as in equation 8. [sent-151, score-0.235]
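The idea of adding a zero-probability factor for self-alignments to a product of models can be sketched as below. This is an illustration under assumptions: the names `no_self_alignment` and `product_score` are hypothetical, alignments are encoded as 1-based source positions per target word (0 for null/root), and the exact form of the paper's equation 8 is not reproduced here.

```python
def no_self_alignment(alignment):
    """Return 0.0 if any word is aligned to itself, otherwise 1.0.

    alignment[i] is the 1-based source position for target word i + 1;
    0 denotes the null/root word.
    """
    for target, source in enumerate(alignment, start=1):
        if source == target:
            return 0.0
    return 1.0


def product_score(alignment, models):
    """Multiply the scores that each component model gives the alignment."""
    score = 1.0
    for model in models:
        score *= model(alignment)
        if score == 0.0:      # one veto is enough; skip remaining factors
            break
    return score
```

Because the combination is a plain product, a new component can be added by appending another callable to `models`, which is one way to read the modularity claimed for the framework.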

65 Since we do not make use of annotation, we can induce a dependency structure on the entire dataset provided (disregarding the division into training and testing). [sent-159, score-0.224]

66 2 Results Table 1 shows the results of the IBM Models on the task of directed (unlabeled) dependency parsing. [sent-162, score-0.224]

67 However, the Danish dataset is unusual (see Buchholz and Marsi 2006) in that the alternate adjacency baseline of left-branching (also mentioned by Klein and Manning 2004) is extremely strong and achieves 48. [sent-181, score-0.125]

68 3 Analysis In order to better understand what our alignment model was learning, we looked at each component element individually. [sent-184, score-0.258]

69 Table 2 shows the most likely dependency attachment for the top ten most common parts-of-speech. [sent-186, score-0.278]

70 ), but there is little notion of directionality, and cycles can exist. [sent-188, score-0.136]

71 For instance, the model learns the connection between determiner and noun, but is unsure which is the head and which the dependent. [sent-189, score-0.342]

72 A similar connection is learned between to and verbs in the base form (VB). [sent-190, score-0.179]

73 However, there is a strong linguistic basis to consider the directionality of these relations difficult. [sent-192, score-0.177]

74 There is some debate among linguists as to whether the head of a noun phrase is the noun or the determiner (see Abney 1987). [sent-193, score-0.163]

75 Each can be seen as a different kind of head element, performing a different function, similarly to the multiple types of dependency relations identified in Hudson’s (1990) Word Grammar. [sent-194, score-0.422]

76 A similar case can be made regarding the head of an infinitive phrase. [sent-195, score-0.201]

77 The alternative distance model we proposed, which takes into account the identity of the head word, achieves better accuracy and is closer to the gold standard balance (43. [sent-203, score-0.264]

78 Figure 3 shows the distribution of the location of the dependent relative to the head word (at position 0) for several common parts-of-speech. [sent-206, score-0.27]

79 Fertility Figure 4 shows the distribution of fertility values for several common parts of speech. [sent-214, score-0.156]

80 This is likely an effect of the strong connection between base form verbs and the preceding word to. [sent-217, score-0.232]

81 Hyper-Parameters Each of our models requires a value for its CRP hyperparameter (see Section 3. [sent-218, score-0.129]

82 In addition to the CRP parameters, Model 3 requires a value for p1, the null fertility hyperparameter. [sent-226, score-0.24]

83 We tested our model with random initialization (uniform alignment probabilities) and with an approximation of the ad-hoc “harmonic” initialization described in Klein and Manning (2004) and found no noticeable difference in accuracy. [sent-233, score-0.37]

84 4 Discussion The accuracy achieved by the IBM models (Table 1) is surprisingly high, given the fact that the IBM models were not designed with dependency parsing in mind. [sent-235, score-0.486]

85 Although it lacks an inherent notion of tree structure, the alignment-based approach has several advantages over the head-outward approach of DMV and related models. [sent-240, score-0.131]

86 It can consider the alignment as a whole and take into account global sentence constraints, not just head-dependent relations. [sent-241, score-0.3]

87 Another advantage of our alignment-based models is the fact that they are not strongly sensitive to initialization and can be started from a set of random alignments. [sent-244, score-0.154]

88 5 Conclusions and Future Work We have described an alternative formulation of dependency parsing as a problem of word alignment. [sent-245, score-0.366]

89 This connection motivated us to explore the possibility of using alignment tools for the task of unsupervised dependency parsing. [sent-246, score-0.652]

90 We chose to experiment with the well-known IBM alignment models which share a set of similar modeling assumptions with Klein and Manning’s (2004) Dependency Model with Valence. [sent-247, score-0.358]

91 Our experiments showed that the IBM models are surprisingly effective at the dependency parsing task, outperforming the right-branching baseline and approaching the accuracy of DMV. [sent-248, score-0.484]

92 Our results demonstrate that the alignment approach can be used as a foundation for dependency parsing algorithms and motivates further research in this area. [sent-249, score-0.673]

93 These include improving and extending the existing IBM models, as well as introducing new models that are specifically designed for the parsing task and represent relevant linguistic considerations (e. [sent-251, score-0.202]

94 Finally, although we use our framework for dependency parsing, the sampling approach and the framework we developed can be used to explore new models for bilingual word alignment. [sent-258, score-0.515]

95 Furthermore, an alignment-based parsing method is expected to integrate well with SMT bilingual alignment models and may, therefore, be suitable for combined models which use parse trees to improve word alignment (e. [sent-259, score-0.568]

96 Acknowledgments I would like to thank Chris Dyer for providing the basis for the sampling implementation. [sent-263, score-0.127]

97 Improving unsupervised dependency parsing with richer contexts and smoothing. [sent-315, score-0.406]

98 Corpus-based induction of syntactic structure: models of dependency and constituency. [sent-325, score-0.284]

99 A systematic comparison of various statistical alignment models. [sent-340, score-0.258]

100 From Baby Steps to Leapfrog: How “Less is More” in unsupervised dependency parsing. [sent-351, score-0.264]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ibm', 0.439), ('dmv', 0.37), ('alignment', 0.258), ('dependency', 0.224), ('crp', 0.173), ('head', 0.163), ('wj', 0.157), ('fertility', 0.156), ('parsing', 0.142), ('connection', 0.13), ('smt', 0.11), ('klein', 0.103), ('notion', 0.095), ('manning', 0.085), ('null', 0.084), ('sampling', 0.083), ('valence', 0.082), ('aligned', 0.081), ('framework', 0.074), ('alternate', 0.072), ('hyperparameter', 0.069), ('modular', 0.062), ('danish', 0.062), ('source', 0.061), ('equation', 0.061), ('models', 0.06), ('distance', 0.059), ('association', 0.058), ('hudson', 0.058), ('paskin', 0.058), ('posattachment', 0.058), ('ppmmooddeell', 0.058), ('rightbranching', 0.058), ('tendencies', 0.058), ('yuret', 0.058), ('dependent', 0.056), ('brown', 0.056), ('initialization', 0.056), ('attachment', 0.054), ('strong', 0.053), ('location', 0.051), ('target', 0.05), ('translation', 0.05), ('identities', 0.049), ('gradual', 0.049), ('iwould', 0.049), ('motivates', 0.049), ('determiner', 0.049), ('verbs', 0.049), ('parse', 0.048), ('alignments', 0.048), ('vb', 0.045), ('probability', 0.045), ('lectures', 0.045), ('sj', 0.045), ('directionality', 0.045), ('binding', 0.045), ('degenerate', 0.045), ('chomsky', 0.045), ('basis', 0.044), ('morristown', 0.044), ('aa', 0.043), ('aligning', 0.043), ('chris', 0.043), ('depends', 0.043), ('old', 0.042), ('distortion', 0.042), ('account', 0.042), ('spitkovsky', 0.041), ('abney', 0.041), ('burkett', 0.041), ('cycles', 0.041), ('adds', 0.04), ('proportional', 0.04), ('assumptions', 0.04), ('unsupervised', 0.04), ('nj', 0.039), ('attach', 0.038), ('infinitive', 0.038), ('dutch', 0.038), ('wi', 0.038), ('strongly', 0.038), ('tree', 0.036), ('buchholz', 0.036), ('headden', 0.036), ('iand', 0.036), ('dependents', 0.036), ('relations', 0.035), ('french', 0.035), ('starting', 0.035), ('presumably', 0.034), ('preventing', 0.034), ('serve', 0.034), ('pr', 0.034), ('adam', 0.033), ('functional', 0.033), ('samuel', 0.033), ('core', 0.033), ('exploring', 0.033), 
('north', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

2 0.34665337 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

Author: Shaojun Zhao ; Daniel Gildea

Abstract: A word in one language can be translated to zero, one, or several words in other languages. Using word fertility features has been shown to be useful in building word alignment models for statistical machine translation. We built a fertility hidden Markov model by adding fertility to the hidden Markov model. This model not only achieves lower alignment error rate than the hidden Markov model, but also runs faster. It is similar in some ways to IBM Model 4, but is much easier to understand. We use Gibbs sampling for parameter estimation, which is more principled than the neighborhood method used in IBM Model 4.

3 0.22269489 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

Author: Hendra Setiawan ; Chris Dyer ; Philip Resnik

Abstract: We address the modeling, parameter estimation and search challenges that arise from the introduction of reordering models that capture non-local reordering in alignment modeling. In particular, we introduce several reordering models that utilize (pairs of) function words as contexts for alignment reordering. To address the parameter estimation challenge, we propose to estimate these reordering models from a relatively small amount of manually aligned corpora. To address the search challenge, we devise an iterative local search algorithm that stochastically explores reordering possibilities. By capturing non-local reordering phenomena, our proposed alignment model bears a closer resemblance to state-of-the-art translation models. Empirical results show significant improvements in alignment quality as well as in translation performance over baselines in a large-scale Chinese-English translation task.

4 0.19087192 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

Abstract: Inducing a grammar directly from text is one of the oldest and most challenging tasks in Computational Linguistics. Significant progress has been made for inducing dependency grammars, however the models employed are overly simplistic, particularly in comparison to supervised parsing models. In this paper we present an approach to dependency grammar induction using tree substitution grammar which is capable of learning large dependency fragments and thereby better modelling the text. We define a hierarchical non-parametric Pitman-Yor Process prior which biases towards a small grammar with simple productions. This approach significantly improves the state-of-the-art, when measured by head attachment accuracy.

5 0.17972851 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

Author: Tahira Naseem ; Harr Chen ; Regina Barzilay ; Mark Johnson

Abstract: We present an approach to grammar induction that utilizes syntactic universals to improve dependency parsing across a range of languages. Our method uses a single set of manually-specified language-independent rules that identify syntactic dependencies between pairs of syntactic categories that commonly occur across languages. During inference of the probabilistic model, we use posterior expectation constraints to require that a minimum proportion of the dependencies we infer be instances of these rules. We also automatically refine the syntactic categories given in our coarsely tagged input. Across six languages our approach outperforms state-of-the-art unsupervised methods by a significant margin.

6 0.16801094 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

7 0.12718189 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

8 0.11649054 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

9 0.096862085 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

10 0.088770591 39 emnlp-2010-EMNLP 044

11 0.087094054 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

12 0.08614748 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

13 0.085210346 38 emnlp-2010-Dual Decomposition for Parsing with Non-Projective Head Automata

14 0.084117524 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

15 0.083872795 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

16 0.083301909 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

17 0.082827367 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

18 0.080090001 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

19 0.078926258 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

20 0.078639738 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.32), (1, -0.054), (2, 0.217), (3, -0.112), (4, 0.146), (5, -0.31), (6, -0.065), (7, 0.096), (8, 0.164), (9, 0.141), (10, -0.166), (11, 0.118), (12, 0.086), (13, 0.098), (14, 0.071), (15, 0.04), (16, 0.123), (17, 0.052), (18, -0.011), (19, -0.241), (20, 0.1), (21, 0.051), (22, -0.236), (23, 0.08), (24, -0.176), (25, 0.153), (26, -0.037), (27, -0.01), (28, -0.074), (29, 0.053), (30, 0.035), (31, 0.033), (32, -0.067), (33, 0.027), (34, 0.05), (35, 0.037), (36, 0.007), (37, 0.005), (38, -0.003), (39, -0.042), (40, 0.002), (41, 0.051), (42, -0.011), (43, 0.02), (44, -0.031), (45, 0.004), (46, -0.039), (47, 0.086), (48, 0.003), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96589845 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

2 0.86511511 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

Author: Shaojun Zhao ; Daniel Gildea

3 0.60092008 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

Author: Phil Blunsom ; Trevor Cohn

4 0.52323902 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

Author: Hendra Setiawan ; Chris Dyer ; Philip Resnik

5 0.47634503 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

Author: Jinxi Xu ; Antti-Veikko Rosti

Abstract: Word alignment plays a central role in statistical MT (SMT) since almost all SMT systems extract translation rules from word aligned parallel training data. While most SMT systems use unsupervised algorithms (e.g. GIZA++) for training word alignment, supervised methods, which exploit a small amount of human-aligned data, have become increasingly popular recently. This work empirically studies the performance of these two classes of alignment algorithms and explores strategies to combine them to improve overall system performance. We used two unsupervised aligners, GIZA++ and HMM, and one supervised aligner, ITG, in this study. To avoid language and genre specific conclusions, we ran experiments on test sets consisting of two language pairs (Chinese-to-English and Arabic-to-English) and two genres (newswire and weblog). Results show that the two classes of algorithms achieve the same level of MT performance. Modest improvements were achieved by taking the union of the translation grammars extracted from different alignments. Significant improvements (around 1.0 in BLEU) were achieved by combining outputs of different systems trained with different alignments. The improvements are consistent across languages and genres.

6 0.46784368 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

7 0.39442223 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

8 0.38327923 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

9 0.31405175 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

10 0.291125 38 emnlp-2010-Dual Decomposition for Parsing with Non-Projective Head Automata

11 0.28459179 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

12 0.28248188 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

13 0.27528346 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

14 0.27472666 68 emnlp-2010-Joint Inference for Bilingual Semantic Role Labeling

15 0.27221921 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing

16 0.27131781 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

17 0.25806004 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

18 0.25799984 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

19 0.25131685 39 emnlp-2010-EMNLP 044

20 0.24753028 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.025), (12, 0.033), (29, 0.12), (30, 0.039), (32, 0.018), (52, 0.079), (54, 0.158), (56, 0.062), (62, 0.042), (66, 0.141), (72, 0.054), (76, 0.038), (77, 0.023), (79, 0.015), (83, 0.026), (87, 0.02), (89, 0.016), (92, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85654497 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

2 0.79542822 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon

Abstract: Extant Statistical Machine Translation (SMT) systems are very complex pieces of software, which embed multiple layers of heuristics and involve very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this study, we take a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed. 1 Phrase-Based Machine Translation 1.1 Principle A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase pairs {(f, e)}, each pair expressing that the source phrase f can be rewritten (translated) into a target phrase e. (Following the usage in the statistical machine translation literature, we use “phrase” to denote a subsequence of consecutive words.) Translation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in a hypothesis is uniquely determined by the order in which the rewrite operations are performed. The search space of the translation model corresponds to the set of all possible sequences of rule applications.
The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In the first step, the corpus is aligned at the word level, using alignment tools such as Giza++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003).

Translating a sentence amounts to finding the best-scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence [2] and enforces a distortion limit [3] over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score.

[2] The ttl option of Moses, defaulting to 20.
[3] The dl option of Moses, whose default value is 7.

Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 933–943, MIT, Massachusetts, USA, 9-11 October 2010. ©2010 Association for Computational Linguistics

1.2 Typology of PBTS Errors

Analyzing the errors of an SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision-making processes. For an SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. The first kind corresponds to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an approximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric.

Model errors encompass several types of errors that occur during learning (Bottou and Bousquet, 2008) [4]. Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finite-state transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pair weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error-prone alignments. As a result, some “good” phrases are extracted with a small weight or, in the limit, are not extracted at all; conversely, some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score.

Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Such diagnoses are, however, very difficult to obtain, given the tight intertwining of the various components of a system: most evaluations are limited to the computation of global scores and usually do not involve any kind of failure analysis.
[4] We omit here optimization errors.

1.3 Contribution and organization

To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is, to evaluate the best achievable performance of a PBTS. We aim both at studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute, but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower bounds of the oracle BLEU-4 scores, and we report measurements performed on several standard benchmarks.

The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTS have the potential to generate hypotheses having very high BLEU-4 scores and that their main bottleneck is their scoring function.

The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed to deliver accurate lower bounds of the oracle score. This section closes with various extensions that allow supplementary constraints to be modeled, most notably reordering constraints (Section 2.5).
Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3.1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment on several complementary results which help identify causes of failure. Section 4 discusses our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5.

2 Oracle Decoder

2.1 The Oracle Decoding Problem

Definition. To get some insight into the errors of phrase-based systems and better understand their limits, we propose to consider the oracle decoding problem defined as follows: given a source sentence, its reference translation [5] and a phrase table, what is the “best” translation hypothesis a system can generate? As usual, the quality of a hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly.

We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated with the hypothesis it actually generates allows us to carry out both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult-to-translate references (Section 3.3).

[5] The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference.

Evaluation Measure. To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable [6]: as it relies on 4-gram statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and cannot be evaluated in isolation. Because of its non-decomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation metric is necessary in our approach.

To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus-level BLEU-4 scores close to the maximal achievable.
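The word-overlap measure described above can be sketched as follows. This is a minimal illustration and not the evaluation code used in the experiments; the function name `unigram_precision_recall` and the clipping of repeated words are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def unigram_precision_recall(hypothesis: List[str],
                             reference: List[str]) -> Tuple[float, float]:
    """Clipped unigram precision and recall against a single reference."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # A hypothesis word is credited at most as often as it occurs in the reference.
    matches = sum((hyp_counts & ref_counts).values())
    precision = matches / len(hypothesis) if hypothesis else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall


# Toy sentence pair (hypothetical): the hypothesis misses the last reference word.
reference = "the cat and the dog".split()
hypothesis = "the cat and the".split()
precision, recall = unigram_precision_recall(hypothesis, reference)
```

In this toy case every hypothesis word is found in the reference, so only the recall is affected by the missing word.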
Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc.

2.2 Formalizing the Oracle Decoding Problem

The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given an n × m binary matrix describing possible translation links between source words and target words [7], this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word is translated only once.

[6] Neither at the sentence level (Chiang et al., 2008), nor at the phrase level.
[7] The (i, j) entry of the matrix is 1 if the ith word of the source can be translated by the jth word of the reference, 0 otherwise.

Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is, the number of words appearing both in the hypothesis and in the reference.
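For the word-based case recalled above, the matching step can be illustrated with a standard augmenting-path algorithm for maximum bipartite matching. This is a sketch under assumptions: the function name `max_word_matching` and the toy link matrix are hypothetical, and the algorithm shown is a textbook one, not necessarily the procedure of (Leusch et al., 2008).

```python
from typing import List


def max_word_matching(links: List[List[int]]) -> List[int]:
    """Maximum bipartite matching by augmenting paths.

    links[i][j] == 1 means source word i may be translated by reference word j.
    Returns a list whose j-th entry is the source index linked to reference
    word j, or -1 when that reference word stays uncovered.
    """
    n = len(links)
    m = len(links[0]) if links else 0
    match_of_target = [-1] * m

    def try_augment(i: int, seen: List[bool]) -> bool:
        # Try to link source word i, re-routing earlier links if necessary.
        for j in range(m):
            if links[i][j] and not seen[j]:
                seen[j] = True
                if match_of_target[j] == -1 or try_augment(match_of_target[j], seen):
                    match_of_target[j] = i
                    return True
        return False

    for i in range(n):
        try_augment(i, [False] * m)
    return match_of_target


# Hypothetical links for "le chat et le chien" / "the cat and the dog",
# where "chien" has no entry (e.g. an out-of-vocabulary word).
links = [
    [1, 0, 0, 1, 0],  # le    -> the (position 0) or the (position 3)
    [0, 1, 0, 0, 0],  # chat  -> cat
    [0, 0, 1, 0, 0],  # et    -> and
    [1, 0, 0, 1, 0],  # le    -> the (position 0) or the (position 3)
    [0, 0, 0, 0, 0],  # chien -> no known translation
]
matching = max_word_matching(links)
covered = sum(1 for i in matching if i != -1)
recall = covered / len(matching)
```

On this toy input, four of the five reference words are covered. The matching is free to cross the two occurrences of "le", since nothing in this formulation accounts for word order.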
Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations into phrases of both the source and the target have to be worked out in addition to the selection of links. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure.

2.3 An Integer Program for Oracle Decoding

To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based models (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008). Following the latter reference, we introduce the following variables: $f_{i,j}$ (resp. $e_{k,l}$) is a binary indicator variable that is true when the phrase spanning from between-word position i to between-word position j (resp. k to l) of the source (resp. target) sentence is used. We also introduce a binary variable, denoted $a_{i,j,k,l}$, to describe a possible link between source phrase $f_{i,j}$ and target phrase $e_{k,l}$. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are such that 0 ≤ i < j ≤ n in the source sentence and 0 ≤ k < l ≤ m in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence.
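To make the indexing concrete, the sketch below enumerates every candidate span of a sentence under the between-word convention 0 ≤ i < j ≤ n. It assumes that the span (i, j) covers the words at 0-based indices i to j − 1, so that its length is j − i; the helper name `all_spans` is hypothetical.

```python
from typing import Dict, List, Tuple


def all_spans(sentence: List[str]) -> Dict[Tuple[int, int], List[str]]:
    """Map every span (i, j) with 0 <= i < j <= n to the words it covers."""
    n = len(sentence)
    return {(i, j): sentence[i:j]
            for i in range(n)
            for j in range(i + 1, n + 1)}


source = "le chat et le chien".split()
spans = all_spans(source)
```

A sentence of n words yields n(n + 1)/2 such spans, which bounds the number of $f_{i,j}$ (or $e_{k,l}$) indicators before any filtering by the phrase table.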
Solving the oracle decoding problem then amounts to optimizing the following objective function:

$$\max \sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) \qquad (1)$$

under the constraints:

$$\forall x \in \{1, \dots, m\}: \sum_{k,l \text{ s.t. } k \le x \le l} e_{k,l} \le 1 \qquad (2)$$

$$\forall y \in \{1, \dots, n\}: \sum_{i,j \text{ s.t. } i \le y \le j} f_{i,j} = 1 \qquad (3)$$

$$\forall i,j: \sum_{k,l} a_{i,j,k,l} = f_{i,j} \qquad (4)$$

$$\forall k,l: \sum_{i,j} a_{i,j,k,l} = e_{k,l} \qquad (5)$$

The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e appears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the $f_{i,j}$ and $a_{i,j,k,l}$ variables, ensuring that whenever a link $a_{i,j,k,l}$ is active, the corresponding source phrase $f_{i,j}$ is also active. Constraints (5) play a similar role for the reference.

The Relaxed Problem. Even though it accurately models the search space of a phrase-based decoder, this program is not really useful as is: due to out-of-vocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems [8]. We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by:

$$\forall y \in \{1, \dots, n\}: \sum_{i,j \text{ s.t. } i \le y \le j} f_{i,j} \le 1$$

To better reflect the behavior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) \qquad (6)$$

The second term in this new objective ensures that optimal solutions translate as many source words as possible.

[8] An ILP problem is said to be infeasible when every possible solution violates at least one constraint.

The Relaxed-Distortion Problem. A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. Because distortion information is not taken into account in our objective function, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated in Figure 1.

[Figure 1: Equivalent alignments between “le” and “the” for the sentence pair “le chat et le chien” / “the cat and the dog”. The dashed lines correspond to a less interpretable solution.]

To overcome this difficulty, we propose a last change to the objective function:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) - \alpha \sum_{i,j,k,l} a_{i,j,k,l} \, |k - i| \qquad (7)$$

Compared to the objective function of the relaxed problem (6), we introduce here a supplementary penalty factor which favors monotone alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the relaxed problem, the one with the lowest distortion. In our experiments, we set α to min {n, m} to ensure that the penalty factor is always smaller than the reward for aligning two single words.

2.4 Selecting Indicator Variables

In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator $a_{i,j,k,l}$ is created if there is an entry (f, e) such that f spans from word position i to j in the source and e from word position k to l in the target.
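The exact match strategy and the relaxed-distortion objective (7) can be tried together on the sentence pair of Figure 1. The sketch below is only an illustration: it replaces the ILP solver with an exhaustive enumeration that is exponential in the number of links, it reads constraints (2), the relaxed (3), (4) and (5) as "selected phrases must not overlap on either side", and it treats spans as half-open (span (i, j) covers words i to j − 1). The toy phrase table, all function names, and the value alpha = 0.01 are assumptions made for this example, not settings taken from the experiments.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Link = Tuple[int, int, int, int]  # (i, j, k, l): source span (i, j), target span (k, l)


def find_spans(sentence: List[str], phrase: List[str]) -> List[Tuple[int, int]]:
    """All spans (start, end) at which phrase occurs in sentence."""
    size = len(phrase)
    if size == 0:
        return []
    return [(s, s + size) for s in range(len(sentence) - size + 1)
            if sentence[s:s + size] == phrase]


def exact_match_links(source: List[str], reference: List[str],
                      phrase_table: Sequence[Tuple[str, str]]) -> List[Link]:
    """One candidate link per occurrence of a phrase-table entry on both sides."""
    links = set()
    for src_phrase, tgt_phrase in phrase_table:
        for i, j in find_spans(source, src_phrase.split()):
            for k, l in find_spans(reference, tgt_phrase.split()):
                links.add((i, j, k, l))
    return sorted(links)


def spans_disjoint(spans: List[Tuple[int, int]]) -> bool:
    """True when no word position is covered by two spans."""
    ordered = sorted(spans)
    return all(a[1] <= b[0] for a, b in zip(ordered, ordered[1:]))


def is_feasible(selection: Sequence[Link]) -> bool:
    return (spans_disjoint([(i, j) for i, j, _, _ in selection])
            and spans_disjoint([(k, l) for _, _, k, l in selection]))


def objective(selection: Sequence[Link], alpha: float) -> float:
    """Covered target words plus covered source words, minus the distortion penalty."""
    reward = sum((l - k) + (j - i) for i, j, k, l in selection)
    penalty = alpha * sum(abs(k - i) for i, _, k, _ in selection)
    return reward - penalty


def oracle_by_enumeration(links: Sequence[Link], alpha: float) -> Tuple[List[Link], float]:
    """Exhaustive search over subsets of links; only usable on tiny inputs."""
    best, best_score = (), 0.0
    for size in range(1, len(links) + 1):
        for selection in combinations(links, size):
            if is_feasible(selection):
                score = objective(selection, alpha)
                if score > best_score:
                    best, best_score = selection, score
    return list(best), best_score


source = "le chat et le chien".split()
reference = "the cat and the dog".split()
phrase_table = [("le", "the"), ("chat", "cat"), ("et", "and"), ("chien", "dog")]

links = exact_match_links(source, reference, phrase_table)
alignment, score = oracle_by_enumeration(links, alpha=0.01)
recall = sum(l - k for _, _, k, l in alignment) / len(reference)
```

With alpha set to 0, the monotone and the crossed alignments of the two occurrences of "le" obtain the same score; with any small positive alpha the enumeration keeps the monotone one. A generic ILP solver would be needed for sentences of realistic length.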
In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us to identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator $a_{i,j,k,l}$ is created when the following three criteria are met: i) f spans from position i to j of the source; ii) a substring of e, denoted ē, spans from position k to l of the reference; iii) (f, ē) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases.

2.5 Oracle Decoding with Reordering Constraints

The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered.

As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint “enforces (...) that the last word of a phrase chosen for translation cannot be more than d words from the leftmost untranslated word in the source” (Lopez, 2009) and is expressed as:

$$\forall a_{i,j,k,l},\ a_{i',j',k',l'} \text{ s.t. } k > k': \quad a_{i,j,k,l} \cdot a_{i',j',k',l'} \cdot |j - i' + 1| \le d$$

The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and takes the following form (assuming this constraint is parameterized by d):

$$\forall l < m - 1: \quad a_{i,j,k,l} \cdot a_{i',j',l+1,l'} \cdot |i' - j - 1|$$ [...]

[...] a distortion greater than the Moses default distortion limit. [...] alignment decisions enabled by the use of larger training corpora and phrase tables.

To evaluate the impact of the second heuristic, we computed the number of phrases discarded by Moses (because of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and seriously limits the best achievable performance, a conclusion shared with (Auli et al., 2009).

Search Errors. Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; conversely, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses.

Based on the comparison of the scores of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder actually get a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottlenecks of current PBTS.
Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotone alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 for Moses outputs. Given the weight assigned through MERT training to the distortion score, it is no wonder that these hypotheses are severely penalized.

The Impact of Phrase Length. The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of a maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 yields an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated by a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). While many studies have already acknowledged the predominance of “small” phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources.

4 Related Work

To the best of our knowledge, there are only a few works that try to study the expressive power of phrase-based machine translation systems or to provide tools for analyzing potential causes of failure.
The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search spaces of phrase-based and hierarchical systems, it also provides some insights on the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of statistical translation systems, but do not attempt to provide methods for identifying them.

The authors of (Turchi et al., 2008) study the learning capabilities of Moses by extensively analyzing learning curves representing translation performance as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performance is not the representation power of PBTS but rather their scoring functions.

Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high-BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007).
To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even though the numbers reported in this study are not directly comparable with ours [17], it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in a hypergraph representing possible translation hypotheses with n-gram count matches, which amounts to decoding with an n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed up decoding.

Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NP-completeness of this problem and discusses various heuristic approaches. Our problem is somewhat more complex, and using their techniques would require us to build word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence.

5 Conclusions

In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques.
Our main result is a confirmation of the fact that extant PBTS are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at performing discriminative training of feature-rich models (Bangalore et al., 2007).

We have shown that the examination of difficult-to-translate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters.

Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discriminative training of PBTS. We are also concerned with determining how tight our approximation of the BLEU-4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem.

[17] The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48; but they considered a smaller version of the training corpus and the WMT’06 test set.

Acknowledgments

Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program.

References

Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universität Berlin. http://opus.kobv.de/tuberlin/volltexte/2007/1611/.
Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark.
Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic.
Léon Bottou and Olivier Bousquet. 2008. The tradeoffs of large scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece.
David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of ECML, pages 610–619, Honolulu, Hawaii.
John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio.
Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for smt using efficient bleu oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103–110, Rochester, New York.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127–143.
Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada.
Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session.
Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC.
Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada.
Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In Proc. of AMTA, pages 134–143, Washington DC.
Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii.
Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado.
Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia.
Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania.
D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany.
Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for smt using context-informed features. In Andy Way and Barbara Gawronska, editors, Proc. of TMI, pages 231–240, Skövde, Sweden.
Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical mt. In Proc. of ACL, pages 721–728, Sydney, Australia.
Marco Turchi, Tijl De Bie, and Nello Cristianini. 2008. Learning performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages 35–43, Columbus, Ohio.
Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan.

3 0.77139896 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

4 0.76790273 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

5 0.75896454 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

Author: Hui Zhang ; Min Zhang ; Haizhou Li ; Eng Siong Chng

Abstract: This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypotheses using a tree-to-string model and evaluate their syntactic goodness using a tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experimental results on the benchmark data show that our two proposed solutions are very effective, achieving significant performance improvement over baselines when applied to different translation models.

6 0.75863415 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

7 0.75793052 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

8 0.75617033 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

9 0.75546342 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

10 0.75494546 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

11 0.7493099 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

12 0.74777579 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

13 0.74682027 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

14 0.74677354 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

15 0.74542761 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

16 0.74533606 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

17 0.74386036 104 emnlp-2010-The Necessity of Combining Adaptation Methods

18 0.74311405 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

19 0.74164563 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

20 0.73803401 115 emnlp-2010-Uptraining for Accurate Deterministic Question Parsing