
76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation


Source: pdf

Author: Zhongjun He ; Yao Meng ; Hao Yu

Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, the phrase reorderings lack contextual information in conventional HPB systems. This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, the absolute improvements in BLEU (case-insensitive) range from 1.2 to 2.1.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. [sent-4, score-0.209]

2 However, the phrase reorderings lack contextual information in conventional HPB systems. [sent-5, score-0.198]

3 This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. [sent-6, score-0.881]

4 We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. [sent-7, score-0.715]

5 1 Introduction The hierarchical phrase-based (HPB) model (Chiang, 2005; Chiang, 2007) has been widely adopted in statistical machine translation (SMT). [sent-13, score-0.255]

6 Typically, there are three types of rules (see Table 1): phrasal rule, a phrase pair consisting of consecutive words; hierarchical rule, a hierarchical phrase pair consisting of both words and variables; and glue rule, which is used to merge phrases serially. [sent-15, score-0.944]

7 Phrasal rule captures short distance reorderings within phrases, while hierarchical rule captures long distance reorderings be- . [sent-16, score-0.5]

8 PR = phrasal rule, HR = hierarchical rule, GR = glue rule. [sent-21, score-0.453]

9 However, HPB translation suffers from a limitation: the phrase reorderings lack contextual information, such as the surrounding words of a phrase and the content of the sub-phrases that are represented by variables. [sent-24, score-0.384]

10 Consider the following two hierarchical rules in translating a Chinese sentence into English: X → hX1 ? [sent-25, score-0.227]

11 (Candidate translations: “with Russia ’s talks” and “talks with Russia”.) Both rules pattern-match the source sentence, but produce quite different phrase reorderings. [sent-33, score-0.205]

12 The first rule generates a monotone translation, while the second rule swaps the source phrases covered by X1 and X2 on the target side. [sent-34, score-0.656]

13 It helps to reduce ambiguity and thus guides the decoder to choose the correct translation for a source text. [sent-41, score-0.264]

14 Several researchers observed that word sense disambiguation improves translation quality on lexical translation (Carpuat and Wu, 2007; Chan et al. [sent-42, score-0.214]

15 These methods utilized contextual features to determine the correct meaning of a source word, thus helping an SMT system choose an appropriate target translation. [sent-44, score-0.214]

16 They addressed phrase reordering as a two-class classification problem: translating neighboring phrases serially or inversely. [sent-47, score-0.65]

17 They built a maximum entropy (MaxEnt) classifier based on boundary words to predict the order of neighboring phrases. [sent-48, score-0.228]

18 (2008) presented a lexicalized rule selection model to improve both lexical translation and phrase reordering for HPB translation. [sent-50, score-0.674]

19 They built a MaxEnt model for each ambiguous source side based on contextual features. [sent-51, score-0.228]

20 In this paper, we focus on improving phrase reordering for HPB translation. [sent-56, score-0.436]

21 We classify SCFG rules into several reordering patterns consisting of two variables X and F (or E), such as X1FX2 and X2EX1. [sent-57, score-0.56]

22 We treat phrase reordering as a classification problem and build a MaxEnt model for each source reordering pattern based on various contextual features. (We use F and E to represent source and target words, respectively.) [sent-58, score-1.109]

23 Specifically: • For hierarchical rules, we classify the source-side and the target-side into 7 and 17 reordering patterns, respectively. [sent-61, score-0.513]

24 We then build a classifier for each source pattern to predict phrase reorderings. [sent-63, score-0.316]

25 Here, we classify source hierarchical phrases into 7 reordering patterns according to the arrangement of words and variables. [sent-67, score-0.774]

26 • For glue rules, we extend the HPB model by using bracketing transduction grammar (BTG) (Wu, 1996) instead of the monotone glue rule. [sent-69, score-0.757]

27 We then build a classifier for glue rules to predict reorderings of neighboring phrases, analogous to Xiong et al. [sent-71, score-0.623]

28 We integrate the MaxEnt based phrase reordering models as features into the HPB model (Chiang, 2005). [sent-73, score-0.455]

29 The rest of the paper is structured as follows: Section 2 describes the MaxEnt based phrase reordering method. [sent-80, score-0.436]

30 Figure 1: A source hierarchical phrase and its corresponding target translation. [sent-85, score-0.338]

31 2 MaxEnt based Phrase Reordering We regard phrase reordering as a pattern classification problem. [sent-86, score-0.533]

32 A reordering pattern indicates an arrangement of words and variables. [sent-87, score-0.461]

33 Although a large number of hierarchical rules may be extracted from a bilingual corpus, these rules can be classified into several reordering patterns (Section 2. [sent-88, score-0.807]

34 In addition, we extend the HPB model with BTG by adding an inverted glue rule to merge phrases inversely (Section 2. [sent-90, score-0.645]

35 Therefore, the glue rules are classified into two patterns: serial or inverse. [sent-92, score-0.475]

36 We then build a MaxEnt phrase reordering (MEPR) classifier for each source reordering pattern (Section 2. [sent-93, score-1.005]

37 We may learn millions of hierarchical rules from a bilingual corpus. [sent-100, score-0.245]

38 Although these rules are different from each other, they can be classified into several reordering patterns according to the arrangement of variables and words. [sent-101, score-0.273]

39 In this paper, we follow the constraint as described in (Chiang, 2005) that a hierarchical rule can have at most two variables and they cannot be adjacent on the source side. [sent-102, score-0.397]

40 Therefore, in a hierarchical rule, E is the lexical translation of F, while the order of X and E contains phrase reordering information. [sent-104, score-0.673]

41 Table 3: A classification of the source side and the target side for the hierarchical rule that contains two variables. [sent-106, score-0.536]

42 For example, we consider “e1Xe2” and “e2Xe1” as the same pattern “EXE”, because the target words are determined by lexical translation of source words. [sent-109, score-0.299]
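The collapsing of concrete rules into reordering patterns described above can be illustrated with a minimal sketch. It assumes that one side of a rule is given as a list of tokens in which variables are written `X`, `X1` or `X2` and every other token is a word; the function name `reordering_pattern` is hypothetical and not taken from the paper.

```python
import re


def reordering_pattern(tokens, terminal_symbol):
    """Collapse one side of a hierarchical rule into a reordering pattern.

    Variables (X, X1, X2) are kept as they are; each maximal run of words
    is replaced by one terminal symbol ("F" for source, "E" for target).
    Assumption: no ordinary word is spelled exactly like a variable.
    """
    pattern = []
    for token in tokens:
        if re.fullmatch(r"X[12]?", token):
            pattern.append(token)
        elif not pattern or pattern[-1] != terminal_symbol:
            # Start of a new run of words: emit a single terminal symbol.
            pattern.append(terminal_symbol)
    return "".join(pattern)


# "e1 X e2" and "e2 X e1" fall into the same target pattern "EXE".
same = reordering_pattern(["e1", "X", "e2"], "E") == reordering_pattern(["e2", "X", "e1"], "E")
```

Under this scheme, many distinct rules share a pattern because only the arrangement of words and variables is retained, not the words themselves.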

43 During decoding the phrases covered by X are dynamically changed and the contextual information of these phrases is ignored for pattern-matching of hierarchical rules. [sent-111, score-0.4]

44 Analogously, for the hierarchical rule that contains two variables, the source phrases are classified into 4 patterns, while the target phrases are classified into 14 patterns, as shown in Table 3. [sent-112, score-0.654]

45 The number of patterns on the source side is smaller than that on the target side, because on the source side, “X1” always appears before “X2”, and they cannot be adjacent. [sent-113, score-0.338]

46 2 Reordering Pattern Classification for Glue Rule The HPB model uses the glue rule to combine phrases serially. [sent-115, score-0.508]

47 The reason is that in some cases, there are no valid translation rules that cover a source span. [sent-116, score-0.294]

48 Therefore, the glue rule provides a default monotone combination of phrases in order to complete a translation. [sent-117, score-0.665]

49 During decoding, the decoder first uses Rule 3 to produce phrase translation, and then iteratively uses Rule 4 and 5 to merge two neighboring phrases into a larger phrase until the whole sentence is covered. [sent-120, score-0.399]

50 We replace the original glue rules in the HPB model with BTG rules (see Table 4). [sent-121, score-0.494]

51 The inverted glue rule in BTG, however, can solve this problem. [sent-128, score-0.512]

52 • In the HPB model, only a monotone glue rule is provided to merge phrases serially. [sent-129, score-0.694]

53 In the extended HPB model, the combination of phrases is classified into two types: monotone and inverse. [sent-130, score-0.289]

54 (2006), to perform context-dependent phrase reordering, we build a Table 4: Extending the glue rules in the HPB model with BTG. [sent-132, score-0.476]

55 MaxEnt based classifier for glue rules to predict the order of two neighboring phrases. [sent-133, score-0.549]

56 3 The MaxEnt based Phrase Reordering Classifier As described above, we classified phrase reorderings into several patterns. [sent-136, score-0.188]

57 Therefore, phrase reordering can be regarded as a classification problem: for each source reordering pattern, we treat the corresponding target reordering patterns as labels. [sent-137, score-1.366]

58 f(X) and e(X) are the phrases that are covered by X on the source and target sides, respectively. [sent-140, score-0.237]

59 Given a source phrase, the model predicts a target reordering pattern, considering various contextual features (Section 2. [sent-141, score-0.551]
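How a MaxEnt classifier turns contextual features into a distribution over target reordering patterns can be sketched as follows. This is a minimal illustration of the standard maximum entropy form, P(label | context) proportional to the exponential of the summed feature weights, and not the authors' implementation. The function name, the feature string and the toy weights are hypothetical, and the label names "EX", "XE" and "EXE" are assumed here for a one-variable source pattern.

```python
import math


def maxent_distribution(active_features, labels, weights):
    """Return P(label | context) under a maximum entropy model.

    active_features: names of the binary features that fire for the context.
    weights: maps (feature, label) to a real weight; missing pairs weigh 0.
    """
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in active_features)
        for label in labels
    }
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical toy weights; real weights would be estimated from training data.
toy_weights = {
    ("left_boundary=talks", "XE"): 1.2,
    ("left_boundary=talks", "EX"): -0.3,
}
probs = maxent_distribution(["left_boundary=talks"], ["EX", "XE", "EXE"], toy_weights)
best = max(probs, key=probs.get)
```

With these toy weights the label "XE" receives the highest probability, since it is the only label with a positive score for the active feature.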

60 According to the classification of reordering patterns, there are 3 kinds of classifiers: • P_me^hr1 includes 3 classifiers for the hierarchical rules that contain 1 variable. [sent-143, score-0.56]

61 Each of the classifiers has 3 labels; • P_me^hr2 includes 4 classifiers for the hierarchical rules that contain 2 variables. [sent-144, score-0.228]

62 Each of the classifiers has 14 labels; • P_me^gr includes 1 classifier for the glue rules. [sent-145, score-0.359]

63 The classifier has 2 labels that predict a monotone or inverse order for two neighboring phrases. [sent-146, score-0.268]

64 (2008), in which a classifier was built for each ambiguous hierarchical source side. [sent-151, score-0.296]

65 Our approach is more generic: rather than training a MaxEnt model for a specific hierarchical source side, we train a model for a source reordering pattern. [sent-153, score-0.667]

66 4 Feature definition For a reordering pattern pair ⟨Tα, Tγ⟩, we design three feature functions for phrase reordering classifiers: • Source lexical feature, including boundary words and neighboring words. [sent-156, score-0.962]

67 Boundary words are the leftmost and rightmost words of the source phrase f(X), while neighboring words are the words immediately to the left and right of a source phrase f(α); • Part-of-Speech (POS) feature, POS tags of the boundary and neighboring words on the source side. [sent-157, score-0.631]
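The lexical and POS features described above can be sketched as a small extraction routine. It assumes a tokenized source sentence with one POS tag per token, and inclusive token index spans for the whole source phrase f(α) and for the sub-phrase f(X) covered by a variable. The function name, the feature string format and the `<none>` placeholder for positions outside the sentence are illustrative assumptions.

```python
def extract_features(sentence, pos_tags, phrase_span, var_span):
    """Collect boundary and neighboring word/POS features on the source side.

    phrase_span: inclusive (start, end) indices of the source phrase f(alpha).
    var_span: inclusive (start, end) indices of the sub-phrase f(X).
    """
    def at(seq, index):
        # Positions outside the sentence map to a placeholder token.
        return seq[index] if 0 <= index < len(seq) else "<none>"

    p_start, p_end = phrase_span
    v_start, v_end = var_span
    positions = {
        "var_left": v_start,      # left boundary word of f(X)
        "var_right": v_end,       # right boundary word of f(X)
        "nbr_left": p_start - 1,  # word immediately left of f(alpha)
        "nbr_right": p_end + 1,   # word immediately right of f(alpha)
    }
    features = []
    for name, index in positions.items():
        features.append(f"word:{name}={at(sentence, index)}")
        features.append(f"pos:{name}={at(pos_tags, index)}")
    return features


# Placeholder tokens w0..w5 with placeholder tags T0..T5.
words = ["w0", "w1", "w2", "w3", "w4", "w5"]
tags = ["T0", "T1", "T2", "T3", "T4", "T5"]
feats = extract_features(words, tags, phrase_span=(1, 4), var_span=(2, 3))
```

Each returned string can serve as one binary feature of a MaxEnt classifier for the corresponding source pattern.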

68 These features can be extracted together with translation rules from a bilingual corpus. [sent-159, score-0.222]

69 However, since the hierarchical rule does not allow for adjacent variables on the source side, we extract features for Pmgre by using the method described in Xiong et al. [sent-160, score-0.397]

70 The HPB model has the following features: translation probabilities p(γ|α) and p(α|γ), lexical weights pw(γ|α) and pw(α|γ), word penalty, phrase penalty, glue rule penalty, and a target n-gram language model. [sent-164, score-0.561]

71 Therefore, the contextual information guides the decoder to perform phrase reordering. [sent-166, score-0.211]

72 • We split the “glue rule penalty” into two features: monotone glue rule number and inverted glue rule number. [sent-167, score-0.562]

73 These features reflect preference of the decoder for using monotone or inverted glue rules. [sent-168, score-0.605]
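The two glue rule count features can be sketched as follows, assuming a derivation is represented simply as a list of rule type names. The type names, the function names and the weight values are hypothetical; in a real system the weights would be tuned together with the other model features.

```python
def glue_rule_counts(derivation):
    """Count monotone and inverted glue rule applications in a derivation."""
    counts = {"monotone_glue": 0, "inverted_glue": 0}
    for rule_type in derivation:
        if rule_type in counts:
            counts[rule_type] += 1
    return counts


def glue_score(derivation, weights):
    """Weighted sum of the two glue rule count features."""
    counts = glue_rule_counts(derivation)
    return sum(weights[name] * count for name, count in counts.items())


# Hypothetical derivation and weights, for illustration only.
derivation = ["phrasal", "monotone_glue", "hierarchical", "inverted_glue", "monotone_glue"]
counts = glue_rule_counts(derivation)
score = glue_score(derivation, {"monotone_glue": -0.1, "inverted_glue": -0.5})
```

Keeping the two counts separate lets the model weight monotone and inverted merges differently instead of charging a single penalty for both.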

74 For a source span [j1,j2], the decoder uses three kinds of rules: translation rules produce lexical translation and phrase reordering (for hierarchical rules), the monotone rule merges any neighboring sub-spans [j1, k] and [k + 1, j2] serially, and the inverted rule swaps them. [sent-172, score-1.644]
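The monotone and inverted merging of neighboring sub-spans can be sketched as a loop over split points. This minimal illustration assumes a chart that maps an inclusive source span to one already-chosen translation (a list of target words); scoring, pruning and n-best handling are deliberately left out, and the names `glue_candidates` and `chart` are hypothetical.

```python
def glue_candidates(j1, j2, chart):
    """Enumerate monotone and inverted merges for the source span [j1, j2].

    chart maps an inclusive (start, end) source span to a list of target words.
    """
    candidates = []
    for k in range(j1, j2):
        left = chart.get((j1, k))
        right = chart.get((k + 1, j2))
        if left is None or right is None:
            continue  # one of the sub-spans has no translation yet
        candidates.append(("monotone", k, left + right))  # keep source order
        candidates.append(("inverted", k, right + left))  # swap the two phrases
    return candidates


# Toy chart: span [0, 1] translated as "with Russia", span [2, 2] as "talks".
chart = {(0, 1): ["with", "Russia"], (2, 2): ["talks"]}
merged = glue_candidates(0, 2, chart)
```

In a full decoder, each candidate would then be scored by the model features, including the reordering classifier for glue rules, before being added to the chart.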

75 Note that when the decoder uses the monotone and inverted glue rule to combine sub-spans, it merges phrases that do not contain variables. [sent-173, score-0.855]

76 1; HPB+MEGR: HPB with MaxEnt based classifier for glue rules, as described in Section 2. [sent-176, score-0.3]

77 2; HPB+MER: HPB with MaxEnt based classifier for both hierarchical and glue rules. [sent-177, score-0.359]

78 1 Statistical Information of Rules Hierarchical Rules We extracted 162M translation rules from the training corpus. [sent-188, score-0.204]

79 Among them, there were 127M hierarchical rules, which contained 85M hierarchical source phrases. [sent-189, score-0.35]

80 We classified these source phrases into 7 patterns as described in Section 2. [sent-190, score-0.275]

81 We observed that the most frequent source pattern is “FXF”, Table 5: Statistical information of reordering pattern classification for hierarchical source phrases. [sent-193, score-0.47]

82 Table 6: Percentage of target reordering pattern for each source pattern containing one variable. [sent-197, score-0.612]

83 Table 6 and Table 7 show the distributions of reordering patterns for hierarchical source phrases that contain one and two variables, respectively. [sent-202, score-0.707]

84 From both the tables, we observed that for Chinese-to-English translation, the most frequent “reordering” pattern for a source phrase is monotone translation (bold font in the tables). [sent-203, score-0.496]

85 Glue Rules To train a MaxEnt classifier for glue rules, we extracted 65. [sent-204, score-0.359]

86 8M reordering (monotone and inverse) instances from the training data, using the algorithm described in Xiong et al. [sent-205, score-0.357]

87 From the table, we made the following observations: Table 7: Percentage of target reordering pattern for each source pattern containing two variables. [sent-213, score-0.612]

88 This indicates that the ME based reordering for hierarchical rules improves translation performance. [sent-227, score-0.691]

89 The HPB+MEGR system overcomes the shortcoming of the HPB system by using both the monotone glue rule and the inverted glue rule, which merge phrases serially and inversely, respectively. [sent-232, score-1.081]

90 The system that combines ME based reordering for both hierarchical and glue rules outperformed both the HPB+MEHR and HPB+MEGR systems. [sent-237, score-0.787]

91 Another reason is that adding inverted glue rules increases the search space. [sent-243, score-0.478]

92 However, the baseline produced a monotone translation by using the rule “‚I ? [sent-264, score-0.395]

93 The reason is that the MaxEnt phrase reordering classifier uses the contextual features (e. [sent-270, score-0.56]

94 the boundary words) of the phrase covered by X1 to predict the phrase reordering as X1E for the source phrase FX1. [sent-272, score-0.778]

95 6 Conclusions and Future Work In this paper, we have proposed a MaxEnt based phrase reordering approach to help the HPB decoder select reordering patterns. [sent-274, score-0.86]

96 We classified hierarchical rules into 7 reordering patterns on the source side and 17 reordering patterns on the target side. [sent-275, score-1.287]

97 In addition, we introduced BTG to enhance the reordering of neighboring phrases and classified the glue rules into two patterns. [sent-276, score-0.954]

98 We trained a MaxEnt classifier for each reordering pattern and integrated it into a standard HPB system. [sent-277, score-0.479]

99 MaxEnt based phrase reordering provides a mechanism to incorporate various features into the translation model. [sent-282, score-0.543]

100 Maximum entropy based phrase reordering model for statistical machine translation. [sent-366, score-0.475]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hpb', 0.707), ('reordering', 0.357), ('glue', 0.3), ('maxent', 0.191), ('monotone', 0.157), ('rule', 0.131), ('hierarchical', 0.13), ('translation', 0.107), ('rules', 0.097), ('btg', 0.095), ('mepr', 0.095), ('source', 0.09), ('inverted', 0.081), ('xiong', 0.081), ('phrase', 0.079), ('phrases', 0.077), ('mer', 0.07), ('megr', 0.068), ('neighboring', 0.068), ('decoder', 0.067), ('contextual', 0.065), ('pattern', 0.063), ('classifier', 0.059), ('side', 0.056), ('classified', 0.055), ('mehr', 0.054), ('reorderings', 0.054), ('patterns', 0.053), ('bleu', 0.05), ('chiang', 0.049), ('merges', 0.042), ('arrangement', 0.041), ('zhongjun', 0.041), ('target', 0.039), ('classifiers', 0.039), ('boundary', 0.038), ('serially', 0.035), ('classification', 0.034), ('south', 0.031), ('covered', 0.031), ('och', 0.03), ('nist', 0.03), ('government', 0.03), ('korean', 0.029), ('shouxun', 0.029), ('merge', 0.029), ('classifhieprb', 0.027), ('dprk', 0.027), ('inversely', 0.027), ('orm', 0.027), ('pme', 0.027), ('pmgre', 0.027), ('rpules', 0.027), ('syntactical', 0.027), ('variables', 0.027), ('classify', 0.026), ('predict', 0.025), ('penalty', 0.024), ('russia', 0.023), ('zhong', 0.023), ('carpuat', 0.023), ('bf', 0.023), ('month', 0.023), ('serial', 0.023), ('phrasal', 0.023), ('qun', 0.022), ('gale', 0.022), ('entropy', 0.021), ('analogous', 0.02), ('decoding', 0.02), ('improvements', 0.02), ('utilized', 0.02), ('meeting', 0.02), ('scfg', 0.019), ('accounted', 0.019), ('hk', 0.019), ('adjacent', 0.019), ('speed', 0.019), ('integrate', 0.019), ('statistical', 0.018), ('liu', 0.018), ('talks', 0.018), ('chan', 0.018), ('pw', 0.018), ('cky', 0.018), ('bilingual', 0.018), ('inverse', 0.018), ('built', 0.017), ('smt', 0.017), ('franz', 0.017), ('accounting', 0.017), ('ranging', 0.017), ('annual', 0.016), ('xe', 0.016), ('association', 0.016), ('percentage', 0.016), ('josef', 0.015), ('hermann', 0.015), ('wu', 0.015), ('proceedings', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

Author: Zhongjun He ; Yao Meng ; Hao Yu

Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, the phrase reorderings lack contextual information in conventional HPB systems. This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, the absolute improvements in BLEU (case-insensitive) range from 1.2 to 2.1.

2 0.26267874 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

Author: Hendra Setiawan ; Chris Dyer ; Philip Resnik

Abstract: We address the modeling, parameter estimation and search challenges that arise from the introduction of reordering models that capture non-local reordering in alignment modeling. In particular, we introduce several reordering models that utilize (pairs of) function words as contexts for alignment reordering. To address the parameter estimation challenge, we propose to estimate these reordering models from a relatively small amount of manually aligned corpora. To address the search challenge, we devise an iterative local search algorithm that stochastically explores reordering possibilities. By capturing non-local reordering phenomena, our proposed alignment model bears a closer resemblance to state-of-the-art translation models. Empirical results show significant improvements in alignment quality as well as in translation performance over baselines in a large-scale Chinese-English translation task.

3 0.20079499 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

4 0.12354478 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

Author: Zhongqiang Huang ; Martin Cmejrek ; Bowen Zhou

Abstract: In this paper, we present a novel approach to enhance hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we learn a set of linguistically-guided latent syntactic categories automatically from a source-side parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in a SCFG rule is decorated with a real-valued feature vector computed based on its distribution of latent syntactic categories. These feature vectors are utilized at decoding time to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules that are applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while at the same time naturally incorporating soft syntactic constraints.

5 0.11106238 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon

Abstract: Extant Statistical Machine Translation (SMT) systems are very complex pieces of software, which embed multiple layers of heuristics and involve very large numbers of numerical parameters. As a result, it is difficult to analyze output translations and there is a real need for tools that could help developers to better understand the various causes of errors. In this study, we make a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed.
1 Phrase-Based Machine Translation
1.1 Principle
A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase pairs {(f, e)}, each pair expressing that the source phrase f can be rewritten (translated) into a target phrase e. (Footnote 1: Following the usage in statistical machine translation literature, we use “phrase” to denote a subsequence of consecutive words.) Translation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in a hypothesis is uniquely determined by the order in which the rewrite operations are performed. The search space of the translation model corresponds to the set of all possible sequences of rule applications.
The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In a first step, the corpus is aligned at the word level, by using alignment tools such as Giza++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003). Translating a sentence amounts to finding the best scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence (Footnote 2: the ttl option of Moses, defaulting to 20) and enforces a distortion limit (Footnote 3: the dl option of Moses, whose default value is 7) over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score.
(Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 933–943, MIT, Massachusetts, USA, 9-11 October 2010. © 2010 Association for Computational Linguistics.)
1.2 Typology of PBTS Errors
Analyzing the errors of a SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision making processes. For a SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. The former corresponds to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an approximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric. Model errors encompass several types of errors that occur during learning (Bottou and Bousquet, 2008). (Footnote 4: We omit here optimization errors.) Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finite-state transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pairs weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error prone alignments. As a result, some “good” phrases are extracted with a small weight, or, in the limit, are not extracted at all; and conversely some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score. Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Diagnoses are however, given the tight intertwining among the various components of a system, very difficult to obtain: most evaluations are limited to the computation of global scores and usually do not imply any kind of failure analysis.
1.3 Contribution and organization
To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is to evaluate the best achievable performances of a PBTS. We aim at both studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute, but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower-bounds of the oracle BLEU-4 scores and report measurements performed on several standard benchmarks. The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTs have the potential to generate hypotheses having very high BLEU-4 score and that their main bottleneck is their scoring function. The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed so as to deliver accurate lower-bounds of oracle score. This section closes with various extensions allowing to model supplementary constraints, most notably reordering constraints (Section 2.5).
Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3.1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment on several complementary results which help identify causes of failure. Section 4 discusses our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5.

2 Oracle Decoder

2.1 The Oracle Decoding Problem

Definition. To get some insight into the errors of phrase-based systems and better understand their limits, we propose to consider the oracle decoding problem defined as follows: given a source sentence, its reference translation (footnote 5) and a phrase table, what is the "best" translation hypothesis a system can generate? As usual, the quality of a hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly.

We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated and the hypothesis it actually generates allows us to carry out both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult-to-translate references (Section 3.3).

Footnote 5: The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference.

Evaluation Measure. To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable (footnote 6): as it relies on 4-gram statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and cannot be evaluated in isolation. Because of this non-decomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation metric is necessary in our approach.

To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus-level BLEU-4 scores close to the maximum achievable.
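As an illustration of the similarity measure described above, here is a minimal sketch that counts the words shared by a hypothesis and a reference and derives unigram precision and recall from that count. The function name and the toy sentences are assumptions made for this example; clipping repeated words with a multiset intersection is one common convention and is not spelled out in the text.

```python
from collections import Counter


def unigram_precision_recall(hypothesis, reference):
    """Return (precision, recall) based on the number of common words.

    Both arguments are lists of tokens. Repeated words are clipped:
    a word counts as common at most as often as it occurs in both lists.
    """
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Counter intersection keeps, for each word, the minimum of both counts.
    common = sum((hyp_counts & ref_counts).values())
    precision = common / len(hypothesis) if hypothesis else 0.0
    recall = common / len(reference) if reference else 0.0
    return precision, recall


# Toy example (hypothetical sentences): the hypothesis misses one "the".
hyp = "the cat and dog".split()
ref = "the cat and the dog".split()
p, r = unigram_precision_recall(hyp, ref)
```

In this toy case every hypothesis word is found in the reference, so only the recall is below one, which mirrors the situation the oracle decoder is designed to optimize.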
Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc.

Footnote 6: Neither at the sentence level (Chiang et al., 2008), nor at the phrase level.

2.2 Formalizing the Oracle Decoding Problem

The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given an n × m binary matrix describing possible translation links between source words and target words (footnote 7), this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word is translated only once.

Footnote 7: The (i, j) entry of the matrix is 1 if the i-th word of the source can be translated by the j-th word of the reference, 0 otherwise.

Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is the number of words appearing both in the hypothesis and in the reference.
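For the word-based case mentioned above, the matching step can be sketched with a standard augmenting-path algorithm for maximum bipartite matching. This is only an illustration under assumptions: the function name, the dictionary encoding of the binary link matrix, and the toy sentence pair are all made up for the example, and the algorithm shown is a textbook one rather than the specific procedure of (Leusch et al., 2008).

```python
def max_bipartite_matching(links, n_source, n_target):
    """Maximum matching between source words and reference words.

    links maps a source position i to the reference positions j it may
    translate to (the 1-entries of row i of the binary matrix).
    Returns the number of matched words and, for each reference
    position, the source position matched to it (or None).
    """
    match_of_target = [None] * n_target

    def try_assign(i, seen):
        # Try to give source word i a reference word, re-routing
        # previously matched source words along an augmenting path.
        for j in links.get(i, []):
            if j in seen:
                continue
            seen.add(j)
            if match_of_target[j] is None or try_assign(match_of_target[j], seen):
                match_of_target[j] = i
                return True
        return False

    matched = 0
    for i in range(n_source):
        if try_assign(i, set()):
            matched += 1
    return matched, match_of_target


# Toy example: "le chat et le chien" / "the cat and the dog".
# Both occurrences of "le" may translate to either occurrence of "the".
toy_links = {0: [0, 3], 1: [1], 2: [2], 3: [0, 3], 4: [4]}
matched, assignment = max_bipartite_matching(toy_links, n_source=5, n_target=5)
```

Each reference word receives at most one source word, which corresponds to the requirement that each word be translated only once; the number of matched words is the quantity being maximized.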
Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations into phrases of both the source and the target have to be worked out in addition to link selection. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure.

2.3 An Integer Program for Oracle Decoding

To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based models (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008). Following the latter reference, we introduce the following variables: $f_{i,j}$ (resp. $e_{k,l}$) is a binary indicator variable that is true when a phrase spans from between-word position i to j (resp. k to l) of the source (resp. target) sentence. We also introduce a binary variable, denoted $a_{i,j,k,l}$, to describe a possible link between source phrase $f_{i,j}$ and target phrase $e_{k,l}$. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are such that $0 \le i < j \le n$ in the source sentence and $0 \le k < l \le m$ in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence.
Solving the oracle decoding problem then amounts to optimizing the following objective function:

$$\max \sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) \qquad (1)$$

under the constraints:

$$\forall x \in \{1, \dots, m\}: \sum_{k,l \,:\, k \le x \le l} e_{k,l} \le 1 \qquad (2)$$

$$\forall y \in \{1, \dots, n\}: \sum_{i,j \,:\, i \le y \le j} f_{i,j} = 1 \qquad (3)$$

$$\forall i,j: \sum_{k,l} a_{i,j,k,l} = f_{i,j} \qquad (4)$$

$$\forall k,l: \sum_{i,j} a_{i,j,k,l} = e_{k,l} \qquad (5)$$

The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e appears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the $f_{i,j}$ and $a_{i,j,k,l}$ variables, ensuring that whenever a link $a_{i,j,k,l}$ is active, the corresponding source phrase $f_{i,j}$ is also active. Constraints (5) play a similar role for the reference.

The Relaxed Problem. Even though it accurately models the search space of a phrase-based decoder, this program is not really useful as is: due to out-of-vocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems (footnote 8). We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by:

$$\forall y \in \{1, \dots, n\}: \sum_{i,j \,:\, i \le y \le j} f_{i,j} \le 1$$

To better reflect the behavior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) \qquad (6)$$

The second term in this new objective ensures that optimal solutions translate as many source words as possible.

Footnote 8: An ILP problem is said to be infeasible when every possible solution violates at least one constraint.

The Relaxed-Distortion Problem. A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. Because distortion information is not taken into account in our objective function, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated in Figure 1.

[Figure 1: Equivalent alignments between "le" and "the" for the source "le chat et le chien" and the target "the cat and the dog". The dashed lines correspond to a less interpretable solution.]

To overcome this difficulty, we propose a last change to the objective function:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) - \alpha \sum_{i,j,k,l} a_{i,j,k,l} \, |k - i| \qquad (7)$$

Compared to the objective function of the relaxed problem (6), we introduce here a supplementary penalty term which favors monotonic alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the relaxed problem, the one with the lowest distortion. In our experiments, we set α to min{n, m} to ensure that the penalty factor is always smaller than the reward for aligning two single words.

2.4 Selecting Indicator Variables

In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator $a_{i,j,k,l}$ is created if there is an entry (f, e) so that f spans from word position i to j in the source and e from word position k to l in the target.
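The exact match strategy and the relaxed objective (6) can be made concrete on a toy instance. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the `max_len` parameter and the toy phrase table are hypothetical, spans use between-word positions, and an exhaustive search stands in for a generic ILP solver (it is exponential and only meant for very small inputs). The distortion term of objective (7) is left out.

```python
def build_exact_match_indicators(source, reference, phrase_table, max_len=3):
    """List every (i, j, k, l) such that (source[i:j], reference[k:l])
    is an entry of the phrase table, with between-word positions."""
    n, m = len(source), len(reference)
    indicators = []
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            for k in range(m):
                for l in range(k + 1, min(m, k + max_len) + 1):
                    pair = (tuple(source[i:j]), tuple(reference[k:l]))
                    if pair in phrase_table:
                        indicators.append((i, j, k, l))
    return indicators


def solve_relaxed_oracle(indicators):
    """Exhaustively pick links so that no source word and no reference
    word is covered twice, maximizing (l - k) + (j - i) summed over the
    selected links, as in the relaxed objective (6)."""
    best = (0, [])

    def search(pos, chosen, used_src, used_tgt, score):
        nonlocal best
        if score > best[0]:
            best = (score, list(chosen))
        for idx in range(pos, len(indicators)):
            i, j, k, l = indicators[idx]
            src, tgt = set(range(i, j)), set(range(k, l))
            if src & used_src or tgt & used_tgt:
                continue  # would cover a word twice
            chosen.append(indicators[idx])
            search(idx + 1, chosen, used_src | src, used_tgt | tgt,
                   score + (l - k) + (j - i))
            chosen.pop()

    search(0, [], frozenset(), frozenset(), 0)
    return best


# Toy data (hypothetical): "dort" has no entry matching the reference,
# so it stays untranslated, which the relaxed problem allows.
source = "le chat noir dort".split()
reference = "the black cat sleeps".split()
phrase_table = {
    (("le",), ("the",)),
    (("chat",), ("cat",)),
    (("chat", "noir"), ("black", "cat")),
    (("dort",), ("is", "sleeping")),
}
indicators = build_exact_match_indicators(source, reference, phrase_table)
score, links = solve_relaxed_oracle(indicators)
covered_reference_words = sum(l - k for (_, _, k, l) in links)
```

On this instance the two-word phrase pair is preferred over the single-word pair because it covers more words on both sides, and the oracle hypothesis covers three of the four reference words.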
In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator $a_{i,j,k,l}$ is created when the following three criteria are met: i) f spans from position i to j of the source; ii) a substring of e, denoted ē, spans from position k to l of the reference; iii) (f, ē) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases.

2.5 Oracle Decoding with Reordering Constraints

The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered. As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint "enforces (...) that the last word of a phrase chosen for translation cannot be more than d words from the leftmost untranslated word in the source" (Lopez, 2009) and is expressed as:

$$\forall a_{i,j,k,l},\ a_{i',j',k',l'} \text{ s.t. } k > k': \quad a_{i,j,k,l} \cdot a_{i',j',k',l'} \cdot |j - i' + 1| \le d$$

The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and takes the following form (assuming this constraint is parameterized by d): for all $l < m - 1$, $a_{i,j,k,l} \cdot a_{i',j',l+1,l'} \cdot |i' - j - 1|$ [...]

[...] a distortion greater than Moses default distortion limit. [...] alignment decisions enabled by the use of larger training corpora and phrase table. To evaluate the impact of the second heuristic, we computed the number of phrases discarded by Moses (because of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and seriously limits the best achievable performance, a conclusion shared with (Auli et al., 2009).

Search Errors. Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; on the contrary, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses. Based on the comparison of the scores of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder actually get a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottlenecks of current PBTS.
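The comparison described above boils down to a per-sentence test on two model scores. The following sketch is illustrative only: the function name, the label strings and the score values are invented for the example, and no real Moses output is involved.

```python
def classify_error(decoder_score, oracle_score):
    """Compare the model score of the decoder's best hypothesis with the
    model score of the oracle hypothesis for one sentence."""
    if oracle_score > decoder_score:
        # The model prefers the oracle hypothesis, yet the decoder
        # did not find it: the search failed.
        return "search error"
    if oracle_score < decoder_score:
        # The model itself ranks the better translation lower.
        return "estimation error"
    return "tie"


# Made-up (decoder_score, oracle_score) pairs for four sentences.
score_pairs = [(-12.4, -15.0), (-9.1, -8.7), (-20.3, -31.2), (-7.5, -7.5)]
labels = [classify_error(d, o) for d, o in score_pairs]
search_error_rate = labels.count("search error") / len(labels)
```

Aggregating the labels over a test set gives the kind of proportion reported in the text, where the share of search errors is found to be small.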
Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotonic alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 for Moses outputs. Given the weight assigned through MERT training to the distortion score, it is no wonder that these hypotheses are severely penalized.

The Impact of Phrase Length. The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of the maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 makes it possible to achieve an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated by a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). While many studies have already acknowledged the predominance of "small" phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources.

4 Related Work

To the best of our knowledge, there are only a few works that try to study the expressive power of phrase-based machine translation systems or to provide tools for analyzing potential causes of failure.
The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search spaces of phrase-based and hierarchical systems, it also provides some insights into the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of statistical translation systems, but do not attempt to provide methods for identifying them.

The authors of (Turchi et al., 2008) study the learning capabilities of Moses by extensively analyzing learning curves representing the translation performance as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performance lies not in the representation power of PBTS but rather in their scoring functions.

Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high-BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007).
To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even though the numbers reported in this study are not directly comparable with ours (footnote 17), it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in a hypergraph representing possible translation hypotheses with n-gram count matches, which amounts to decoding with an n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed up decoding.

Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NP-completeness of this problem and discusses various heuristic approaches. Our problem is somewhat more complex, and using their techniques would require us to build word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence.

5 Conclusions

In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques.
Our main result is a confirmation of the fact that extant PBTS are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve on the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at attempts to perform discriminative training of feature-rich models (Bangalore et al., 2007).

We have shown that the examination of difficult-to-translate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters.

Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discriminative training of PBTS. We are also concerned with determining how tight our approximation of the BLEU-4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem.

Footnote 17: The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48, but they considered a smaller version of the training corpus and the WMT'06 test set.

Acknowledgments

Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program.

References

Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universität Berlin. http://opus.kobv.de/tuberlin/volltexte/2007/1611/.
Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark.
Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic.
Léon Bottou and Olivier Bousquet. 2008. The tradeoffs of large scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece.
David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of EMNLP, pages 610–619, Honolulu, Hawaii.
John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio.
Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for SMT using efficient BLEU oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103–110, Rochester, New York.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127–143.
Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada.
Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session.
Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC.
Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada.
Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In Proc. of AMTA, pages 134–143, Washington DC.
Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii.
Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado.
Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia.
Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania.
D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany.
Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for SMT using context-informed features. In Andy Way and Barbara Gawronska, editors, Proc. of TMI, pages 231–240, Skövde, Sweden.
Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical MT. In Proc. of ACL, pages 721–728, Sydney, Australia.
Marco Turchi, Tijl De Bie, and Nello Cristianini. 2008. Learning performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages 35–43, Columbus, Ohio.
Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan.

6 0.10180655 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

7 0.10038949 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

8 0.09672603 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

9 0.095388979 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

10 0.090014286 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

11 0.087361455 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

12 0.075560234 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

13 0.075075269 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

14 0.073620446 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

15 0.062688172 39 emnlp-2010-EMNLP 044

16 0.056225482 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

17 0.049640797 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

18 0.047966726 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

19 0.047128752 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

20 0.04525036 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.198), (1, -0.221), (2, 0.058), (3, -0.055), (4, 0.122), (5, -0.153), (6, 0.032), (7, -0.001), (8, -0.154), (9, 0.1), (10, 0.003), (11, -0.049), (12, -0.023), (13, -0.097), (14, -0.068), (15, -0.005), (16, -0.111), (17, -0.046), (18, -0.033), (19, 0.072), (20, -0.001), (21, -0.033), (22, 0.05), (23, -0.166), (24, 0.145), (25, -0.083), (26, 0.105), (27, 0.031), (28, 0.243), (29, -0.011), (30, -0.101), (31, -0.185), (32, 0.242), (33, 0.137), (34, -0.028), (35, -0.007), (36, -0.097), (37, 0.028), (38, -0.03), (39, -0.127), (40, 0.016), (41, -0.002), (42, 0.019), (43, -0.082), (44, 0.11), (45, 0.094), (46, -0.063), (47, 0.106), (48, 0.054), (49, -0.258)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92663854 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

Author: Zhongjun He ; Yao Meng ; Hao Yu

Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, phrase reorderings lack contextual information in conventional HPB systems. This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, the absolute improvements in BLEU (case-insensitive) range from 1.2 to 2.1.

2 0.6608097 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

Author: Hendra Setiawan ; Chris Dyer ; Philip Resnik

Abstract: We address the modeling, parameter estimation and search challenges that arise from the introduction of reordering models that capture non-local reordering in alignment modeling. In particular, we introduce several reordering models that utilize (pairs of) function words as contexts for alignment reordering. To address the parameter estimation challenge, we propose to estimate these reordering models from a relatively small amount of manually aligned corpora. To address the search challenge, we devise an iterative local search algorithm that stochastically explores reordering possibilities. By capturing non-local reordering phenomena, our proposed alignment model bears a closer resemblance to state-of-the-art translation models. Empirical results show significant improvements in alignment quality as well as in translation performance over baselines in a large-scale Chinese-English translation task.

3 0.53931212 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

4 0.46052173 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

Author: Zhongqiang Huang ; Martin Cmejrek ; Bowen Zhou

Abstract: In this paper, we present a novel approach to enhance hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we learn a set of linguistically-guided latent syntactic categories automatically from a source-side parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in a SCFG rule is decorated with a real-valued feature vector computed based on its distribution of latent syntactic categories. These feature vectors are utilized at decoding time to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules that are applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while at the same time naturally incorporates soft syntactic constraints.

5 0.38901687 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

Author: Guillaume Wisniewski ; Alexandre Allauzen ; Francois Yvon

Abstract: Extant Statistical Machine Translation (SMT) systems are very complex pieces of software, which embed multiple layers of heuristics and involve very large numbers of numerical parameters. As a result, it is difficult to analyze output translations, and there is a real need for tools that could help developers better understand the various causes of errors. In this study, we take a step in that direction and present an attempt to evaluate the quality of the phrase-based translation model. In order to identify those translation errors that stem from deficiencies in the phrase table (PT), we propose to compute the oracle BLEU-4 score, that is, the best score that a system based on this PT can achieve on a reference corpus. By casting the computation of the oracle BLEU-1 as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower bounds of this score, and report measures performed on several standard benchmarks. Various other applications of these oracle decoding techniques are also reported and discussed.

1 Phrase-Based Machine Translation

1.1 Principle

A Phrase-Based Translation System (PBTS) consists of a ruleset and a scoring function (Lopez, 2009). The ruleset, represented in the phrase table, is a set of phrase (footnote 1) pairs {(f, e)}, each pair expressing that the source phrase f can be rewritten (translated) into a target phrase e. Translation hypotheses are generated by iteratively rewriting portions of the source sentence as prescribed by the ruleset, until each source word has been consumed by exactly one rule. The order of target words in a hypothesis is uniquely determined by the order in which the rewrite operations are performed. The search space of the translation model corresponds to the set of all possible sequences of rule applications.

Footnote 1: Following the usage in the statistical machine translation literature, we use “phrase” to denote a subsequence of consecutive words.
The scoring function aims to rank all possible translation hypotheses in such a way that the best one has the highest score. A PBTS is learned from a parallel corpus in two independent steps. In the first step, the corpus is aligned at the word level, using alignment tools such as Giza++ (Och and Ney, 2003) and some symmetrisation heuristics; phrases are then extracted by other heuristics (Koehn et al., 2003) and assigned numerical weights. In the second step, the parameters of the scoring function are estimated, typically through Minimum Error Rate training (Och, 2003).

Translating a sentence amounts to finding the best scoring translation hypothesis in the search space. Because of the combinatorial nature of this problem, translation has to rely on heuristic search techniques such as greedy hill-climbing (Germann, 2003) or variants of best-first search like multi-stack decoding (Koehn, 2004). Moreover, to reduce the overall complexity of decoding, the search space is typically pruned using simple heuristics. For instance, the state-of-the-art phrase-based decoder Moses (Koehn et al., 2007) considers only a restricted number of translations for each source sequence (footnote 2) and enforces a distortion limit (footnote 3) over which phrases can be reordered. As a consequence, the best translation hypothesis returned by the decoder is not always the one with the highest score.

Footnote 2: the ttl option of Moses, defaulting to 20.
Footnote 3: the dl option of Moses, whose default value is 7.

Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 933–943, MIT, Massachusetts, USA, 9-11 October 2010. ©2010 Association for Computational Linguistics

1.2 Typology of PBTS Errors

Analyzing the errors of an SMT system is not an easy task, because of the number of models that are combined, the size of these models, and the high complexity of the various decision-making processes. For an SMT system, three different kinds of errors can be distinguished (Germann et al., 2004; Auli et al., 2009): search errors, induction errors and model errors. Search errors correspond to cases where the hypothesis with the best score is missed by the search procedure, either because of the use of an approximate search method or because of the restrictions of the search space. Induction errors correspond to cases where, given the model, the search space does not contain the reference. Finally, model errors correspond to cases where the hypothesis with the highest score is not the best translation according to the evaluation metric.

Model errors encompass several types of errors that occur during learning (Bottou and Bousquet, 2008) (footnote 4). Approximation errors are errors caused by the use of a restricted and oversimplistic class of functions (here, finite-state transducers to model the generation of hypotheses and a linear scoring function to discriminate them) to model the translation process. Estimation errors correspond to the use of sub-optimal values for both the phrase pair weights and the parameters of the scoring function. The reasons behind these errors are twofold: first, training only considers a finite sample of data; second, it relies on error-prone alignments. As a result, some “good” phrases are extracted with a small weight or, in the limit, are not extracted at all; conversely, some “poor” phrases are inserted into the phrase table, sometimes with a really optimistic score.

Sorting out and assessing the impact of these various causes of errors is of primary interest for SMT system developers: for lack of such diagnoses, it is difficult to figure out which components of the system require the most urgent attention. Such diagnoses are, however, very difficult to obtain given the tight intertwining among the various components of a system: most evaluations are limited to the computation of global scores and usually do not involve any kind of failure analysis.
Footnote 4: We omit here optimization errors.

1.3 Contribution and organization

To systematically assess the impact of the multiple heuristic decisions made during training and decoding, we propose, following (Dreyer et al., 2007; Auli et al., 2009), to work out oracle scores, that is, to evaluate the best achievable performance of a PBTS. We aim both at studying the expressive power of PBTS and at providing tools for identifying and quantifying causes of failure. Under standard metrics such as BLEU (Papineni et al., 2002), oracle scores are difficult (if not impossible) to compute but, by casting the computation of the oracle unigram recall and precision as an Integer Linear Programming (ILP) problem, we show that it is possible to efficiently compute accurate lower bounds of the oracle BLEU-4 scores and report measurements performed on several standard benchmarks.

The main contributions of this paper are twofold. We first introduce an ILP program able to efficiently find the best hypothesis a PBTS can achieve. This program can be easily extended to test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. Second, we present a number of complementary results illustrating the usage of our oracle decoder for identifying and analyzing PBTS errors. Our experimental results confirm the main conclusions of (Turchi et al., 2008), showing that extant PBTS have the potential to generate hypotheses having very high BLEU-4 scores and that their main bottleneck is their scoring function.

The rest of this paper is organized as follows: in Section 2, we introduce and formalize the oracle decoding problem, and present a series of ILP problems of increasing complexity designed to deliver accurate lower bounds of the oracle score. This section closes with various extensions that allow supplementary constraints to be modeled, most notably reordering constraints (Section 2.5).
Our experiments are reported in Section 3, where we first introduce the training and test corpora, along with a description of our system building pipeline (Section 3.1). We then discuss the baseline oracle BLEU scores (Section 3.2), analyze the non-reachable parts of the reference translations, and comment on several complementary results which help identify causes of failure. Section 4 discusses our approach and findings with respect to the existing literature on error analysis and oracle decoding. We conclude and discuss further prospects in Section 5.

2 Oracle Decoder

2.1 The Oracle Decoding Problem

Definition. To get some insights into the errors of phrase-based systems and better understand their limits, we propose to consider the oracle decoding problem, defined as follows: given a source sentence, its reference translation (footnote 5) and a phrase table, what is the “best” translation hypothesis a system can generate? As usual, the quality of a hypothesis is evaluated by the similarity between the reference and the hypothesis. Note that in the oracle decoding problem, we are only assessing the ability of PBT systems to generate good candidate translations, irrespective of their ability to score them properly.

We believe that studying this problem is interesting for various reasons. First, as described in Section 3.4, comparing the best hypothesis a system could have generated and the hypothesis it actually generates allows us to carry out both quantitative and qualitative failure analysis. The oracle decoding problem can also be used to assess the expressive power of phrase-based systems (Auli et al., 2009). Other applications include computing acceptable pseudo-references for discriminative training (Tillmann and Zhang, 2006; Liang et al., 2006; Arun and Koehn, 2007) or combining machine translation systems in a multi-source setting (Li and Khudanpur, 2009). We have also used oracle decoding to identify erroneous or difficult-to-translate references (Section 3.3).

Footnote 5: The oracle decoding problem can be extended to the case of multiple references. For the sake of simplicity, we only describe the case of a single reference.

Evaluation Measure. To fully define the oracle decoding problem, a measure of the similarity between a translation hypothesis and its reference translation has to be chosen. The most obvious choice is the BLEU-4 score (Papineni et al., 2002) used in most machine translation evaluations. However, using this metric in the oracle decoding problem raises several issues. First, BLEU-4 is a metric defined at the corpus level and is hard to interpret at the sentence level. More importantly, BLEU-4 is not decomposable (footnote 6): as it relies on 4-gram statistics, the contribution of each phrase pair to the global score depends on the translation of the previous and following phrases and cannot be evaluated in isolation. Because of its non-decomposability, maximizing BLEU-4 is hard; in particular, the phrase-level decomposability of the evaluation metric is necessary in our approach.

To circumvent this difficulty, we propose to evaluate the similarity between a translation hypothesis and a reference by the number of their common words. This amounts to evaluating translation quality in terms of unigram precision and recall, which are highly correlated with human judgements (Lavie et al., ). This measure is closely related to the BLEU-1 evaluation metric and the Meteor (Banerjee and Lavie, 2005) metric (when it is evaluated without considering near-matches and the distortion penalty). We also believe that hypotheses that maximize the unigram precision and recall at the sentence level yield corpus-level BLEU-4 scores close to the maximal achievable.
Indeed, in the setting we will introduce in the next section, BLEU-1 and BLEU-4 are highly correlated: as all correct words of the hypothesis will be compelled to be at their correct position, any hypothesis with a high 1-gram precision is also bound to have a high 2-gram precision, etc.

Footnote 6: Neither at the sentence (Chiang et al., 2008), nor at the phrase level.

2.2 Formalizing the Oracle Decoding Problem

The oracle decoding problem has already been considered in the case of word-based models, in which all translation units are bound to contain only one word. The problem can then be solved by a bipartite graph matching algorithm (Leusch et al., 2008): given an n × m binary matrix describing possible translation links between source words and target words (footnote 7), this algorithm finds the subset of links maximizing the number of words of the reference that have been translated, while ensuring that each word is translated only once.

Footnote 7: The (i, j) entry of the matrix is 1 if the ith word of the source can be translated by the jth word of the reference, 0 otherwise.

Generalizing this approach to phrase-based systems amounts to solving the following problem: given a set of possible translation links between potential phrases of the source and of the target, find the subset of links so that the unigram precision and recall are the highest possible. The corresponding oracle hypothesis can then be easily generated by selecting the target phrases that are aligned with one source phrase, disregarding the others. In addition, to mimic the way OOVs are usually handled, we match identical OOV tokens appearing both in the source and target sentences. In this approach, the unigram precision is always one (every word generated in the oracle hypothesis matches exactly one word in the reference). As a consequence, to find the oracle hypothesis, we just have to maximize the recall, that is, the number of words appearing both in the hypothesis and in the reference.
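As an illustration of the word-based case described above, the following sketch computes the oracle unigram recall with a textbook augmenting-path bipartite matching. It is a minimal example under stated assumptions, not the algorithm of (Leusch et al., 2008): the function name `word_oracle`, the toy lexicon and the sentence pair are all hypothetical and introduced here only for illustration.

```python
def word_oracle(links):
    """Maximum bipartite matching between source and reference words.

    links[i][j] is True when source word i may be translated by
    reference word j. Returns the number of matched reference words
    and, for each reference position, the matched source position
    (or -1 when the reference word is left untranslated).
    """
    n_target = len(links[0]) if links else 0
    match_of_target = [-1] * n_target

    def try_assign(i, seen):
        # Try to give source word i a reference word, re-routing
        # earlier assignments along an augmenting path if needed.
        for j in range(n_target):
            if links[i][j] and not seen[j]:
                seen[j] = True
                if match_of_target[j] == -1 or try_assign(match_of_target[j], seen):
                    match_of_target[j] = i
                    return True
        return False

    matched = sum(try_assign(i, [False] * n_target) for i in range(len(links)))
    return matched, match_of_target


# Hypothetical toy data: a word-level lexicon and one sentence pair.
source = "le chat et le chien".split()
reference = "the cat and the dog".split()
lexicon = {("le", "the"), ("chat", "cat"), ("et", "and"), ("chien", "dog")}

links = [[(s, t) in lexicon for t in reference] for s in source]
matched, match_of_target = word_oracle(links)
recall = matched / len(reference)
print(matched, recall)
```

Each reference word is matched at most once, so the unigram precision of the resulting oracle hypothesis is one by construction and only the recall needs to be maximized. Note that several maximum matchings may exist for the two occurrences of “le”; the matching does not prefer any of them.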
Considering phrases instead of isolated words has a major impact on the computational complexity: in this new setting, the optimal segmentations into phrases of both the source and the target have to be worked out in addition to link selection. Moreover, constraints have to be taken into account so as to enforce a proper segmentation of the source and target sentences. These constraints make it impossible to use the approach of (Leusch et al., 2008) and concur in making the oracle decoding problem for phrase-based models more complex than it is for word-based models: it can be proven, using arguments borrowed from (De Nero and Klein, 2008), that this problem is NP-hard even for the simple unigram precision measure.

2.3 An Integer Program for Oracle Decoding

To solve the combinatorial problem introduced in the previous section, we propose to cast it into an Integer Linear Programming (ILP) problem, for which many generic solvers exist. ILP has already been used in SMT to find the optimal translation for word-based models (Germann et al., 2001) and to study the complexity of learning phrase alignments (De Nero and Klein, 2008). Following the latter reference, we introduce the following variables: $f_{i,j}$ (resp. $e_{k,l}$) is a binary indicator variable that is true when the phrase covering the span from between-word position i to between-word position j (resp. k to l) of the source (resp. target) sentence is selected. We also introduce a binary variable, denoted $a_{i,j,k,l}$, to describe a possible link between source phrase $f_{i,j}$ and target phrase $e_{k,l}$. These variables are built from the entries of the phrase table according to selection strategies introduced in Section 2.4. In the following, index variables are such that 0 ≤ i < j ≤ n in the source sentence and 0 ≤ k < l ≤ m in the target sentence, where n (resp. m) is the length of the source (resp. target) sentence.
Solving the oracle decoding problem then amounts to optimizing the following objective function:

$$\max \sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) \qquad (1)$$

under the constraints:

$$\forall x \in [\![1, m]\!] : \sum_{k,l \ \text{s.t.}\ k \le x \le l} e_{k,l} \le 1 \qquad (2)$$

$$\forall y \in [\![1, n]\!] : \sum_{i,j \ \text{s.t.}\ i \le y \le j} f_{i,j} = 1 \qquad (3)$$

$$\forall i,j : \sum_{k,l} a_{i,j,k,l} = f_{i,j} \qquad (4)$$

$$\forall k,l : \sum_{i,j} a_{i,j,k,l} = e_{k,l} \qquad (5)$$

The objective function (1) corresponds to the number of target words that are generated. The first set of constraints (2) ensures that each word in the reference e appears in no more than one phrase. Maximizing the objective under these constraints amounts to maximizing the unigram recall. The second set of constraints (3) ensures that each word in the source f is translated exactly once, which guarantees that the search space of the ILP problem is the same as the search space of a phrase-based system. Constraints (4) bind the $f_{i,j}$ and $a_{i,j,k,l}$ variables, ensuring that whenever a link $a_{i,j,k,l}$ is active, the corresponding source phrase $f_{i,j}$ is also active. Constraints (5) play a similar role for the reference.

The Relaxed Problem. Even though it accurately models the search space of a phrase-based decoder, this program is not really useful as is: due to out-of-vocabulary words or missing entries in the phrase table, the constraint that all source words should be translated yields infeasible problems (footnote 8). We propose to relax this problem and allow some source words to remain untranslated. This is done by replacing constraints (3) by:

$$\forall y \in [\![1, n]\!] : \sum_{i,j \ \text{s.t.}\ i \le y \le j} f_{i,j} \le 1$$

To better reflect the behavior of phrase-based decoders, which attempt to translate all source words, we also need to modify the objective function as follows:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) \qquad (6)$$

The second term in this new objective ensures that optimal solutions translate as many source words as possible.

Footnote 8: An ILP problem is said to be infeasible when every possible solution violates at least one constraint.

The Relaxed-Distortion Problem. A last caveat with the Relaxed optimization program is caused by frequently occurring source tokens, such as function words or punctuation signs, which can often align with more than one target word. Because our objective function does not take distortion information into account, all these alignments are deemed equivalent, even if some of them are clearly more satisfactory than others. This situation is illustrated in Figure 1.

[Figure 1: Equivalent alignments between “le” and “the” for the sentence pair “le chat et le chien” / “the cat and the dog”. The dashed lines correspond to a less interpretable solution.]

To overcome this difficulty, we propose a last change to the objective function:

$$\sum_{i,j,k,l} a_{i,j,k,l} \cdot (l - k) + \sum_{i,j} f_{i,j} \cdot (j - i) - \alpha \sum_{i,j,k,l} a_{i,j,k,l} \, |k - i| \qquad (7)$$

Compared to the objective function of the Relaxed problem (6), we introduce here a supplementary penalty factor which favors monotonous alignments. For each phrase pair, the higher the difference between source and target positions, the higher this penalty. If α is small enough, this extra term allows us to select, among all the optimal alignments of the Relaxed problem, the one with the lowest distortion. In our experiments, we set α to min {n, m} to ensure that the penalty factor is always smaller than the reward for aligning two single words.

2.4 Selecting Indicator Variables

In the approach introduced in the previous sections, the oracle decoding problem is solved by selecting, among a set of possible translation links, the ones that yield the solution with the highest unigram recall. We propose two strategies to build this set of possible translation links. In the first one, denoted exact match, an indicator $a_{i,j,k,l}$ is created if there is an entry (f, e) such that f spans from word position i to j in the source and e from word position k to l in the target.
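The exact match strategy and the Relaxed-Distortion objective (7) can be made concrete on a toy example. The sketch below is an assumption-laden illustration and not the ILP program itself: it enumerates subsets of links exhaustively instead of calling an ILP solver, it treats spans as half-open intervals [i, j) so that a phrase covers j − i words, and the phrase table, the function names and the value of `alpha` are hypothetical.

```python
def exact_match_links(source, reference, phrase_table):
    """Create one candidate link (i, j, k, l) for each phrase-table entry
    (f, e) such that f == source[i:j] and e == reference[k:l]."""
    n, m = len(source), len(reference)
    links = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            for k in range(m):
                for l in range(k + 1, m + 1):
                    if (tuple(source[i:j]), tuple(reference[k:l])) in phrase_table:
                        links.append((i, j, k, l))
    return links


def oracle_decode(links, alpha=0.0):
    """Exhaustive search for the best set of links under objective (7).

    A source (resp. reference) word may be covered by at most one
    selected phrase, as in the relaxed constraints. Only practical
    for a handful of candidate links.
    """
    best_score, best_links = float("-inf"), []

    def overlap(a, b, c, d):
        # True when half-open spans [a, b) and [c, d) share a word.
        return a < d and c < b

    def search(pos, chosen, score):
        nonlocal best_score, best_links
        if pos == len(links):
            if score > best_score:
                best_score, best_links = score, list(chosen)
            return
        search(pos + 1, chosen, score)  # leave this link inactive
        i, j, k, l = links[pos]
        compatible = all(
            not overlap(i, j, i2, j2) and not overlap(k, l, k2, l2)
            for (i2, j2, k2, l2) in chosen
        )
        if compatible:
            chosen.append(links[pos])
            gain = (l - k) + (j - i) - alpha * abs(k - i)
            search(pos + 1, chosen, score + gain)
            chosen.pop()

    search(0, [], 0.0)
    return best_score, best_links


# Hypothetical toy phrase table and sentence pair.
source = "le chat et le chien".split()
reference = "the cat and the dog".split()
phrase_table = {
    (("le",), ("the",)),
    (("chat",), ("cat",)),
    (("et",), ("and",)),
    (("chien",), ("dog",)),
    (("le", "chien"), ("the", "dog")),
}

links = exact_match_links(source, reference, phrase_table)
score, chosen = oracle_decode(links, alpha=0.1)
recall = sum(l - k for (_, _, k, l) in chosen) / len(reference)
print(score, sorted(chosen), recall)
```

With `alpha=0.0` the two occurrences of “le” can be linked to either “the” at no cost; a small positive `alpha` (0.1 here is an arbitrary illustrative value) penalizes the crossing links, so only monotone solutions remain optimal. For sentences of realistic length the enumeration is exponential, which is why a generic ILP solver is the appropriate tool.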
In this strategy, the ILP program considers exactly the same ruleset as conventional phrase-based decoders. We also consider an alternative strategy, which could help us to identify errors made during the phrase extraction process. In this strategy, denoted inside match, an indicator $a_{i,j,k,l}$ is created when the following three criteria are met: i) f spans from position i to j of the source; ii) a substring of e, denoted ē, spans from position k to l of the reference; iii) (f, ē) is not an entry of the phrase table. The resulting set of indicator variables thus contains, at least, all the variables used in the exact match strategy. In addition, we license here the use of phrases containing words that do not occur in the reference. In fact, using such solutions can yield higher BLEU scores when the reward for additional correct matches exceeds the cost incurred by wrong predictions. These cases are symptoms of situations where the extraction heuristic failed to extract potentially useful subphrases.

2.5 Oracle Decoding with Reordering Constraints

The ILP problem introduced in the previous section can be extended in several ways to describe and test various improvements to phrase-based systems or to evaluate the impact of different parameter settings. This flexibility mainly stems from the possibility offered by our framework to express arbitrary constraints over variables. In this section, we illustrate these possibilities by describing how reordering constraints can easily be considered.

As a first example, the Moses decoder uses a distortion limit to constrain the set of possible reorderings. This constraint “enforces (...) that the last word of a phrase chosen for translation cannot be more than d words from the leftmost untranslated word in the source” (Lopez, 2009) and is expressed as:

$$\forall a_{i,j,k,l},\ a_{i',j',k',l'} \ \text{s.t.}\ k > k' : \quad a_{i,j,k,l} \cdot a_{i',j',k',l'} \cdot |j - i' + 1| \le d$$

The maximum distortion limit strategy (Lopez, 2009) is also easily expressed and takes the following form (assuming this constraint is parameterized by d):

$$\forall l < m - 1 : \quad a_{i,j,k,l} \cdot a_{i',j',l+1,l'} \cdot |i' - j - 1|$$ […]

[…] a distortion greater than Moses default distortion limit.

[…] alignment decisions enabled by the use of larger training corpora and phrase table.

To evaluate the impact of the second heuristic, we computed the number of phrases discarded by Moses (because of the default ttl limit) but used in the oracle hypotheses. In the English to French NEWSCO setting, they account for 34.11% of the total number of phrases used in the oracle hypotheses. When the oracle decoder is constrained to use the same phrase table as Moses, its BLEU-4 score drops to 42.78. This shows that filtering the phrase table prior to decoding discards many useful phrase pairs and seriously limits the best achievable performance, a conclusion shared with (Auli et al., 2009).

Search Errors. Search errors can be identified by comparing the score of the best hypothesis found by Moses and the score of the oracle hypothesis. If the score of the oracle hypothesis is higher, then there has been a search error; on the contrary, there has been an estimation error when the score of the oracle hypothesis is lower than the score of the best hypothesis found by Moses.

Based on the comparison of the score of Moses hypotheses and of oracle hypotheses for the English to French NEWSCO setting, our preliminary conclusion is that the number of search errors is quite limited: only about 5% of the hypotheses of our oracle decoder actually get a better score than Moses solutions. Again, this shows that the scoring function (model error) is one of the main bottlenecks of current PBTS.
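The decision rule used above to separate search errors from estimation errors can be written down in a few lines. The following sketch assumes that both hypotheses have already been scored by the same model; the function name `classify_error`, the explicit handling of ties and the numerical scores are hypothetical and serve only as an illustration.

```python
def classify_error(oracle_model_score, decoder_model_score):
    """Compare the model score of the oracle hypothesis with the model
    score of the best hypothesis returned by the decoder."""
    if oracle_model_score > decoder_model_score:
        # The model prefers the oracle, but the search missed it.
        return "search error"
    if oracle_model_score < decoder_model_score:
        # The model itself prefers the decoder output to the oracle.
        return "estimation error"
    return "tie"  # assumption: equal scores are reported separately


# Hypothetical (oracle score, decoder score) pairs, one per sentence.
scores = [(-12.4, -10.1), (-9.8, -11.3), (-15.0, -15.0), (-20.2, -14.7)]
labels = [classify_error(oracle, decoder) for oracle, decoder in scores]
search_error_rate = labels.count("search error") / len(labels)
print(labels, search_error_rate)
```

Aggregating these labels over a test set gives the proportion of sentences for which the oracle hypothesis obtains a better model score than the decoder output.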
Comparing these hypotheses is nonetheless quite revealing: while Moses mostly selects phrase pairs with high translation scores and generates monotonous alignments, our ILP decoder uses larger reorderings and less probable phrases to achieve better solutions: on average, the reordering score of oracle solutions is −5.74, compared to −76.78 for Moses outputs. Given the weight assigned through MERT training to the distortion score, it is no wonder that these hypotheses are severely penalized.

The Impact of Phrase Length. The observed outputs do not only depend on decisions made during the search, but also on decisions made during training. One such decision is the specification of a maximal length for the source and target phrases. In our framework, evaluating the impact of this decision is simple: it suffices to change the definition of indicator variables so as to consider only alignments between phrases of a given length. In the English-French NEWSCO setting, the most restrictive choice, when only alignments between single words are authorized, yields an oracle BLEU-4 of 48.68; however, authorizing phrases up to length 2 yields an oracle value of 66.57, very close to the score achieved when considering all extracted phrases (67.77). This is corroborated by a further analysis of our oracle alignments, which use phrases whose average source length is 1.21 words (respectively 1.31 for target words). While many studies have already acknowledged the predominance of “small” phrases in actual translations, our oracle scores suggest that, for this language pair, increasing the phrase length limit beyond 2 or 3 might be a waste of computational resources.

4 Related Work

To the best of our knowledge, there are only a few works that try to study the expressive power of phrase-based machine translation systems or to provide tools for analyzing potential causes of failure.
The approach described in (Auli et al., 2009) is very similar to ours: in this study, the authors propose to find and analyze the limits of machine translation systems by studying reference reachability. A reference is reachable for a given system if it can be exactly generated by this system. Reference reachability is assessed using Moses in forced decoding mode: during search, all hypotheses that deviate from the reference are simply discarded. Even though the main goal of this study was to compare the search space of phrase-based and hierarchical systems, it also provides some insights into the impact of various search parameters in Moses, delivering conclusions that are consistent with our main results. As described in Section 1.2, these authors also propose a typology of the errors of statistical translation systems, but do not attempt to provide methods for identifying them.

The authors of (Turchi et al., 2008) study the learning capabilities of Moses by extensively analyzing learning curves representing the translation performance as a function of the number of examples, and by corrupting the model parameters. Even though their focus is more on assessing the scoring function, they reach conclusions similar to ours: the current bottleneck of translation performance is not the representation power of PBTS but rather their scoring functions.

Oracle decoding is useful to compute reachable pseudo-references in the context of discriminative training. This is the main motivation of (Tillmann and Zhang, 2006), where the authors compute high-BLEU hypotheses by running a conventional decoder so as to maximize a per-sentence approximation of BLEU-4, under a simple (local) reordering model. Oracle decoding has also been used to assess the limitations induced by various reordering constraints in (Dreyer et al., 2007).
To this end, the authors propose to use a beam-search based oracle decoder, which computes lower bounds of the best achievable BLEU-4 using dynamic programming techniques over finite-state (for so-called local and IBM constraints) or hierarchically structured (for ITG constraints) sets of hypotheses. Even though the numbers reported in this study are not directly comparable with ours (footnote 17), it seems that our decoder is not only conceptually much simpler, but also achieves much more optimistic lower bounds of the oracle BLEU score. The approach described in (Li and Khudanpur, 2009) employs a similar technique, which is to guide a heuristic search in a hypergraph representing possible translation hypotheses with n-gram count matches, which amounts to decoding with an n-gram model trained on the sole reference translation. Additional tricks are presented in this article to speed up decoding.

Computing oracle BLEU scores is also the subject of (Zens and Ney, 2005; Leusch et al., 2008), yet with a different emphasis. These studies are concerned with finding the best hypotheses in a word graph or in a consensus network, a problem that has various implications for multi-pass decoding and/or system combination techniques. The former reference describes an exponential approximate algorithm, while the latter proves the NP-completeness of this problem and discusses various heuristic approaches. Our problem is somewhat more complex, and using their techniques would require us to build word graphs containing all the translations induced by arbitrary segmentations and permutations of the source sentence.

5 Conclusions

In this paper, we have presented a methodology for analyzing the errors of PBTS, based on the computation of an approximation of the BLEU-4 oracle score. We have shown that this approximation could be computed fairly accurately and efficiently using Integer Linear Programming techniques.
Our main result is a confirmation of the fact that extant PBTS are expressive enough to achieve very high translation performance with respect to conventional quality measurements. The main efforts should therefore strive to improve on the way phrases and hypotheses are scored during training. This gives further support to attempts aimed at designing context-dependent scoring functions as in (Stroppa et al., 2007; Gimpel and Smith, 2008), or at attempts to perform discriminative training of feature-rich models (Bangalore et al., 2007).

We have shown that the examination of difficult-to-translate sentences was an effective way to detect errors or inconsistencies in the reference translations, making our approach a potential aid for controlling the quality or assessing the difficulty of test data. Our experiments have also highlighted the impact of various parameters.

Various extensions of the baseline ILP program have been suggested and/or evaluated. In particular, the ILP formalism lends itself well to expressing various constraints that are typically used in conventional PBTS. In our future work, we aim at using this ILP framework to systematically assess various search configurations. We plan to explore how replacing non-reachable references with high-score pseudo-references can improve discriminative training of PBTS. We are also interested in determining how tight our approximation of the BLEU-4 score is: to this end, we intend to compute the best BLEU-4 score within the n-best solutions of the oracle decoding problem.

Footnote 17: The best BLEU-4 oracle they achieve on Europarl German to English is approximately 48; but they considered a smaller version of the training corpus and the WMT’06 test set.

Acknowledgments

Warm thanks to Houda Bouamor for helping us with the annotation tool. This work has been partly financed by OSEO, the French State Agency for Innovation, under the Quaero program.

References

Tobias Achterberg. 2007. Constraint Integer Programming. Ph.D. thesis, Technische Universität Berlin. http://opus.kobv.de/tuberlin/volltexte/2007/1611/.
Abhishek Arun and Philipp Koehn. 2007. Online learning methods for discriminative training of phrase based statistical machine translation. In Proc. of MT Summit XI, Copenhagen, Denmark.
Michael Auli, Adam Lopez, Hieu Hoang, and Philipp Koehn. 2009. A systematic analysis of translation model search spaces. In Proc. of WMT, pages 224–232, Athens, Greece.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan.
Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical machine translation through global lexical selection and sentence reconstruction. In Proc. of ACL, pages 152–159, Prague, Czech Republic.
Léon Bottou and Olivier Bousquet. 2008. The tradeoffs of large scale learning. In Proc. of NIPS, pages 161–168, Vancouver, B.C., Canada.
Chris Callison-Burch, Philipp Koehn, Christof Monz, and Josh Schroeder. 2009. Findings of the 2009 Workshop on Statistical Machine Translation. In Proc. of WMT, pages 1–28, Athens, Greece.
David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proc. of ECML, pages 610–619, Honolulu, Hawaii.
John De Nero and Dan Klein. 2008. The complexity of phrase alignment problems. In Proc. of ACL: HLT, Short Papers, pages 25–28, Columbus, Ohio.
Markus Dreyer, Keith B. Hall, and Sanjeev P. Khudanpur. 2007. Comparing reordering constraints for smt using efficient bleu oracle computation. In NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 103–110, Rochester, New York.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of ACL, pages 228–235, Toulouse, France.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2004. Fast and optimal decoding for machine translation. Artificial Intelligence, 154(1-2):127–143.
Ulrich Germann. 2003. Greedy decoding for statistical machine translation in almost linear time. In Proc. of NAACL, pages 1–8, Edmonton, Canada.
Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proc. of WMT, pages 9–17, Columbus, Ohio.
Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL, pages 48–54, Edmonton, Canada.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, demonstration session.
Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Proc. of AMTA, pages 115–124, Washington DC.
Shankar Kumar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. of HLT, pages 161–168, Vancouver, Canada.
Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In Proc. of AMTA, pages 134–143, Washington DC.
Gregor Leusch, Evgeny Matusov, and Hermann Ney. 2008. Complexity of finding the BLEU-optimal hypothesis in a confusion network. In Proc. of EMNLP, pages 839–847, Honolulu, Hawaii.
Zhifei Li and Sanjeev Khudanpur. 2009. Efficient extraction of oracle-best translations from hypergraphs. In Proc. of NAACL, pages 9–12, Boulder, Colorado.
Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. 2006. An end-to-end discriminative approach to machine translation. In Proc. of ACL, pages 761–768, Sydney, Australia.
Adam Lopez. 2009. Translation as weighted deduction. In Proc. of EACL, pages 532–540, Athens, Greece.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Comput. Linguist., 29(1):19–51.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160–167, Sapporo, Japan.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-jing Zhu. 2002. Bleu: A method for automatic evaluation of machine translation. Technical report, Philadelphia, Pennsylvania.
D. Roth and W. Yih. 2005. Integer linear programming inference for conditional random fields. In Proc. of ICML, pages 737–744, Bonn, Germany.
Nicolas Stroppa, Antal van den Bosch, and Andy Way. 2007. Exploiting source similarity for smt using context-informed features. In Andy Way and Barbara Gawronska, editors, Proc. of TMI, pages 231–240, Skövde, Sweden.
Christoph Tillmann and Tong Zhang. 2006. A discriminative global training algorithm for statistical mt. In Proc. of ACL, pages 721–728, Sydney, Australia.
Marco Turchi, Tijl De Bie, and Nello Cristianini. 2008. Learning performance of a machine translation system: a statistical and computational analysis. In Proc. of WMT, pages 35–43, Columbus, Ohio.
Richard Zens and Hermann Ney. 2005. Word graphs for statistical machine translation. In Proc. of the ACL Workshop on Building and Using Parallel Texts, pages 191–198, Ann Arbor, Michigan.

6 0.3090682 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

7 0.30143782 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

8 0.28905511 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

9 0.27766609 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

10 0.27191758 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

11 0.26439437 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

12 0.2598435 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

13 0.23031594 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

14 0.22347787 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

15 0.22074303 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

16 0.20721382 39 emnlp-2010-EMNLP 044

17 0.20448528 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

18 0.17613015 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

19 0.1698596 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

20 0.16541183 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.307), (10, 0.013), (12, 0.031), (29, 0.1), (30, 0.046), (52, 0.12), (56, 0.041), (66, 0.113), (72, 0.052), (76, 0.017), (92, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70103079 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

Author: Zhongjun He ; Yao Meng ; Hao Yu

Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, phrase reorderings lack contextual information in conventional HPB systems. This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, the absolute improvements in BLEU (case-insensitive) range from 1.2 to 2.1.

2 0.69576848 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators.

Paramveer S. Dhillon, Mark Liberman, Dean Foster, Mohamed Maamouri and Lyle Ungar, University of Pennsylvania, 3451 Walnut Street, Philadelphia, PA 19104, USA. {dhillon|myl|ungar}@cis.upenn.edu, maamouri@ldc.upenn.edu

1 Background and Motivation

This paper presents a methodology for generating high quality lexical analysis of highly inflected languages, and demonstrates excellent performance applying our approach to Arabic. Lexical analysis of the written form of a language involves resolving, explicitly or implicitly, several different kinds of ambiguities. Unfortunately, the usual ways of talking about this process are also ambiguous, and our general approach to the problem, though not unprecedented, has uncommon aspects. Therefore, in order to avoid confusion, we begin by describing how we define the problem.
In an inflected language with an alphabetic writing system, a central issue is how to interpret strings of characters as forms of words. For example, the English letter-string ‘winds’ will normally be interpreted in one of four different ways, all four of which involve the sequence of two formatives wind+s. The stem ‘wind’ might be analyzed as (1) a noun meaning something like “air in motion”, pronounced [wInd], which we can associate with an arbitrary but stable identifier like wind_n1; (2) a verb wind_v1 derived from that noun, and pronounced the same way; (3) a verb wind_v2 meaning something like “(cause to) twist”, pronounced [waInd]; or (4) a noun wind_n2 derived from that verb, and pronounced the same way. Each of these “lemmas”, or dictionary entries, will have several distinguishable senses, which we may also wish to associate with stable identifiers. The affix ‘-s’ might be analyzed as the plural inflection, if the stem is a noun; or as the third-person singular inflection, if the stem is a verb. We see this analysis as conceptually divided into four parts: 1) Morphological analysis, which recognizes that the letter-string ‘winds’ might be (perhaps among other things) wind/N + s/PLURAL or wind/V + s/3SING; 2) Morphological disambiguation, which involves deciding, for example, that in the phrase “the four winds”, ‘winds’ is probably a plural noun, i.e. wind/N + s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above, perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that the phrase “the four winds” probably involves the lemma wind_n1.

Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 725–735, MIT, Massachusetts, USA, 9-11 October 2010. © 2010 Association for Computational Linguistics
Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. 
Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives.
The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk of confusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highly inflected language. In choosing our approach, we have been motivated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations.
SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of a set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analysis and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features.
Lemma disambiguation is similar to the problem of word-sense disambiguation (on some definitions, they are identical) and as a result, it benefits from paragraph-level and document-level bag-of-words attributes that help to characterize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures.

2 Problem and Methodology

Consider a collection of tokens (observations), $t_i$, referred to by index $i \in \{1, \ldots, n\}$, where each token is associated with a set of $p$ features, $x_{ij}$ for the $j$th feature, and a label, $l_i$, which is a combination of a lemma and a morphological analysis. We use indicator functions $y_{ik}$ to indicate whether or not the $k$th label for the $i$th token is present. We represent the complete set of features and labels for the entire training data using matrix notation as $X$ and $Y$, respectively. Our goal is to predict the label $l$ (or equivalently, the vector $y$) for a given feature vector $x$. A standard linear regression model of this problem would be

$y = x\beta + \epsilon$ (1)

The standard linear regression estimate of $\beta$ (ignoring, for simplicity, the fact that the $y$s are 0/1) is:

$\hat{\beta} = (X_{train}^{T} X_{train})^{-1} X_{train}^{T} Y_{train}$ (2)

where $Y_{train}$ is an $n \times h$ matrix containing 0s and 1s indicating whether or not each of the $h$ possible labels is the correct label ($l_i$) for each of the $n$ tokens $t_i$, $X_{train}$ is an $n \times p$ matrix of context features for each of the $n$ tokens, and the coefficients $\hat{\beta}$ are $p \times h$. However, this is a large, sparse, multiple-label problem, and the above formulation is neither statistically nor computationally efficient. Each observation $(x, y)$ consists of thousands of features associated with thousands of potential labels, almost all of which are zero.
Worse, the matrix of coefficients $\beta$ to be estimated is large ($p \times h$) and one should thus use some sort of transfer learning to share strength across the different labels. We present a novel, principled and highly computationally efficient method of estimating this multilabel model. We use a two-stage procedure, first using a subset ($X_{train1}$, $Y_{train1}$) of training data to give a fast approximate estimate of $\beta$; we then use a second smaller subset of the training data ($X_{train2}$, $Y_{train2}$) to “correct” these estimates in a way that we will show can be viewed as a specialized shrinkage. Our first stage estimation approximates $\beta$, but avoids the expensive computation of $(X_{train}^{T} X_{train})^{-1}$. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token. We now present our algorithm. We start with a corpus $D$ of documents $d$ of labeled Arabic text. As described above, each token $t_i$ is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, $l_i = (lemma_i, morphology_i)$, which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components. Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into two subsets, one of which ($X_{train1}$) is used to estimate the coefficients $\beta$ and the other of which ($X_{train2}$) is used to optimally “shrink” these coefficient estimates to reduce variance and prevent overfitting due to data sparsity.
For the first stage of our estimation procedure, we simplify the estimate of the $\beta$ matrix (Equation 2) to avoid the inversion of the very high dimensional ($p \times p$) matrix $X^{T}X$ by approximating $X^{T}X$ by its diagonal, $Var(X)$, the inverse of which is trivial to compute; i.e. we estimate $\beta$ using

$\hat{\beta} = Var(X_{train1})^{-1} X_{train1}^{T} Y_{train1}$ (3)

For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict:

$\hat{g}_{ik} = \sum_{j=1}^{p} w_j \hat{\beta}_{jk} x_{ij}$ (4)

where the weights $w_j$ indicate how much to shrink each of the $p$ features. In practice, we fold the variance of each of the $p$ features into the weight, giving a slightly modified equation:

$\hat{g}_{ik} = \sum_{j=1}^{p} \alpha_j \beta^{*}_{jk} x_{ij}$ (5)

where $\beta^{*} = X_{train1}^{T} Y_{train1}$ is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector $\alpha$, which we will estimate by regression, is just the shrinkage weights $w$ rescaled by the feature variance. Note that the formulation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token $t_i$ and potential label $l_k$, our goal is to approximate the indicator function $g(i, k)$, which is 1 if the $k$th label of token $t_i$ is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say $C(t)$, for each Arabic token $t$. Our predicted label for $t_i$ is then $\arg\max_{k \in C(t_i)} g(i, k)$. The regression model for learning the weights $\alpha_j$ in the second stage thus has a row for each label $g(i, k)$ associated with a SAMA candidate for each token $i = n_{train1}+1 \ldots n_{train2}$ in the second training set. The value of $g(i, k)$ is predicted as a function of the feature vector $z_{ijk} = \beta^{*}_{jk} x_{ij}$.
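The two stages above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the sparse-dictionary representation, the function names and the toy features are all hypothetical, `Var(X)` is read here as the diagonal of $X^{T}X$ (an assumption), and the shrinkage weights `alpha` are made-up constants rather than values fitted by regression.

```python
from collections import defaultdict


def count_cooccurrences(X, Y):
    """Return beta_star[(j, k)] = sum_i x_ij * y_ik and diag[j] = sum_i x_ij ** 2."""
    beta_star = defaultdict(float)
    diag = defaultdict(float)
    for x, active in zip(X, Y):
        for j, v in x.items():
            diag[j] += v * v
            for k in active:
                beta_star[(j, k)] += v
    return dict(beta_star), dict(diag)


def score(x, k, beta_star, alpha):
    """Equation (5): g_ik = sum_j alpha_j * beta*_jk * x_ij."""
    return sum(alpha.get(j, 0.0) * beta_star.get((j, k), 0.0) * v
               for j, v in x.items())


def predict(x, candidates, beta_star, alpha):
    """Argmax of the score, restricted to the candidate labels."""
    return max(candidates, key=lambda k: score(x, k, beta_star, alpha))


# Toy first-stage data: one sparse feature dict and one set of active labels per token.
X1 = [{"prev=four": 1.0}, {"prev=four": 1.0}, {"prev=four": 1.0}, {"prev=it": 1.0}]
Y1 = [{"N"}, {"N"}, {"V"}, {"V"}]
beta_star, diag = count_cooccurrences(X1, Y1)

# Equation (3) under the diagonal assumption: beta_hat_jk = beta*_jk / diag_j.
beta_hat = {(j, k): c / diag[j] for (j, k), c in beta_star.items()}

# Hypothetical shrinkage weights; the paper fits them on a second training subset.
alpha = {"prev=four": 0.5, "prev=it": 0.5}
best = predict({"prev=four": 1.0}, ["N", "V"], beta_star, alpha)
```

Because `predict` only ranks the labels it is given, restricting `candidates` to the analyzer's output keeps the per-token work proportional to the number of candidates rather than to the full label inventory.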
The shrinkage coefficients, $\alpha_j$, could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only $p$ of them (unlike the $p \times h$ $\beta$s), a relatively small training set is sufficient. We found that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem. Prediction is then done in the obvious way by taking the tokens in a test corpus $D_{test}$, generating context features and candidate SAMA labels for each token $t_i$, and selecting the candidate label with the highest score $\hat{g}(i, k)$ that we set out to learn. More formally, the model parameters $\beta^{*}$ and $\alpha$ produced by the algorithm allow one to estimate the most likely label for a new token $t_i$ out of a set of candidate labels $C(t_i)$ using

$k_{pred} = \arg\max_{k \in C(t_i)} \sum_{j=1}^{p} \alpha_j \beta^{*}_{jk} x_{ij}$ (6)

Algorithm 1 Training algorithm.
Input: A training corpus $D_{train}$ of $n$ observations ($X_{train}$, $Y_{train}$)
Partition $D_{train}$ into two sets, $D_1$ and $D_2$, of sizes $n_{train1}$ and $n_{train2} = n - n_{train1}$ observations
// Using $D_1$, estimate $\beta^{*}$
$\beta^{*}_{jk} = \sum_{i=1}^{n_{train1}} x_{ij} y_{ik}$ for the $j$th feature and $k$th label
// Using $D_2$, estimate $\alpha_j$
// Generate new “features” $Z$ and the true labels $g(i, k)$ for each of the SAMA candidate labels for each of the tokens in $D_2$
$z_{ijk} = \beta^{*}_{jk} x_{ij}$ for $i$ in $n_{train1} + 1 \ldots n_{train2}$
Estimate $\alpha_j$ for the above (feature, label) pairs ($z_{ijk}$, $g(i, k)$) using Regression SVMs
Output: $\alpha$ and $\beta^{*}$

The most expensive part of the procedure is estimating $\beta^{*}$, which requires, for each token in corpus $D_1$ (a subset of $D$), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token and jointly with the token and with other tokens or characters in the context of the token of interest.
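The co-occurrence counting that dominates the cost of estimating β∗ can be sketched with the standard library's `Counter`. This is only an illustration under stated assumptions: the corpus layout (sentences as lists of (token, lemma) pairs), the function name, the placeholder tokens `w1` and `w2`, the placeholder lemmas `lem_b` and `lem_c`, and the spelling `Halam-u_1` for the lemma ID are all hypothetical; only the token yHlm and its lemma come from the paper's own example.

```python
from collections import Counter


def cooccurrence_counts(corpus):
    """Count lemma/token, lemma-only and lemma/previous-token occurrences."""
    joint = Counter()       # (lemma, token) -> count
    base = Counter()        # lemma -> count (base rate)
    prev_joint = Counter()  # (lemma, previous token) -> count
    for sentence in corpus:
        prev = "<s>"  # sentence-start marker, an assumption of this sketch
        for token, lemma in sentence:
            joint[(lemma, token)] += 1
            base[lemma] += 1
            prev_joint[(lemma, prev)] += 1
            prev = token
    return joint, base, prev_joint


# Toy labeled corpus; everything except yHlm / Halam-u_1 is a placeholder.
corpus = [
    [("w1", "lem_b"), ("yHlm", "Halam-u_1")],
    [("yHlm", "Halam-u_1"), ("w2", "lem_b")],
    [("yHlm", "lem_c")],
]
joint, base, prev_joint = cooccurrence_counts(corpus)

# Fraction of the occurrences of yHlm that carry the lemma Halam-u_1.
token_total = sum(c for (lemma, tok), c in joint.items() if tok == "yHlm")
fraction = joint[("Halam-u_1", "yHlm")] / token_total
```

A single pass over the corpus fills all three tables, which is why the counts stay cheap to compute even though most (lemma, token) pairs never occur.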
As an example of these co-occurrence statistics, given an Arabic token, “yHlm”, we count what fraction of the time it is associated with each lemma (e.g. Halam-u_1), count(lemma=Halam-u_1, token=yHlm), and each segment (e.g. “ya”), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels, e.g. count(lemma=Halam-u_1), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u_1, previous token = yHlm). We describe these features in more detail below.

3 Features and Labels used for Training

Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard “taggers” work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated. We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to “tagging”. As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a_1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb.
However, for the second analysis, where the lemma is qabol_1 and the gloss is before, the word is a noun. Simultaneous lemma disambiguation and tagging introduces additional complexity: an analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses (“tags”) and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don’t occur, the size of the label space is still on the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label l into a set of 15 label elements (LEs). These include lemmas, as well as morphological elements such as basic part-of-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 15 categorical variables, each y in the first learning stage is actually a vector with 15 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label.

3.1 Label Elements (LEs)

The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients β∗ to predict a single label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations β∗ between features and label elements, rather than features and entire label sets.

(Table 2 caption, partial: ... data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc. Both the basic POS and its suffix may have person, gender and number data.)
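A small sketch may help make the label factoring concrete. It is an illustration only: the label-element names `lemma`, `pos` and `numpos` follow the paper's naming, but the values, the reduced set of three elements, the helper names and the spelling of the lemma IDs with underscores are assumptions made for this example.

```python
# Hypothetical factored labels for two analyses of the word "qbl".
LABEL_VERB = {"lemma": "qabil-a_1", "pos": "VERB", "numpos": "SG"}
LABEL_NOUN = {"lemma": "qabol_1", "pos": "NOUN", "numpos": "SG"}


def factor_label(label):
    """Turn one label (LE name -> value) into a set of indicator keys."""
    return {f"{name}={value}" for name, value in label.items()}


def indicator_vector(label, index):
    """Sparse 0/1 vector: sorted positions of the label's elements in a shared index."""
    return sorted(index[key] for key in factor_label(label))


# Shared index over every label element seen in either label.
all_keys = sorted(factor_label(LABEL_VERB) | factor_label(LABEL_NOUN))
index = {key: pos for pos, key in enumerate(all_keys)}

y_verb = indicator_vector(LABEL_VERB, index)
y_noun = indicator_vector(LABEL_NOUN, index)

# Elements shared by both labels pool their training evidence.
shared = factor_label(LABEL_VERB) & factor_label(LABEL_NOUN)
```

Two labels that never co-occur as wholes can still share elements (here `numpos=SG`), so counts collected for one label inform the other.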
Factoring the labels in this way reduces, but does not come close to eliminating, the problem of sparsity. A complete list of these LEs and their possible values is detailed in Table 2.

3.2 Features

3.2.1 Local Context Features

We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics β∗ in D1, as mentioned in Algorithm 1. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3. Let $Z_{baseline}$ denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model.

(Table 3 caption, partial: ... generate feature sets for our regression SVMs.)

For each label element (LE) e, we define a set of features $Z_e$ similar to $Z_{baseline}$; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set $Z_{aggr}$ as follows:

$Z_{aggr} = Z_{baseline} \cup \{Z_e\}$ (7)

where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}.

3.2.2 Document Level Features

When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also all the words in the document. These words capture the “topic” of the document, and help to disambiguate different lemmas, which tend to be used or not used based on the topic being discussed, similarly to the way that word sense disambiguation systems in English sometimes use the “bag of words” in the document to disambiguate, for example, a “bank” for depositing money from a “bank” of a river.
More precisely, we augment the features for each target token with the counts of each word in the document (the “term frequency” tf) in which the token occurs with a given label.

$Z_{full} = Z_{aggr} \cup Z_{tf}$ (8)

This set $Z_{full}$ is our final feature set. We use $Z_{full}$ to train an SVM model $M_{full}$; this is our final predictive model.

3.3 Corpora used for Training and Testing

We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB1, ATB2 and ATB3, as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets $D_{train}$ and $D_{test}$ from D using 10-fold cross validation, and we construct D1 and D2 from $D_{train}$ by randomly performing a 9:1 split. As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and $D_{test}$ respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels l seen for t in D1:

$C(t) = SAMA(t) \cup \{l \mid (t, l) \in D_1\}$ (9)

We find that doing so improves coverage to 98%. This is an upper bound on the accuracy of our model.

4 Results

We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1 = 100 − A1, the corresponding percentage error), and A2, which measures the percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 − A2). We test our model $M_{full}$ on $D_{test}$ and achieve A1 and A2 scores of 90.6% and 96.2% respectively.
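The candidate-set construction of Equation (9) and the A1/A2 metrics can be sketched as follows. The function names, the dictionary-based inputs and the placeholder labels (`L1`, `a`, `b`, ...) are hypothetical; the sketch assumes each token comes with a list of labels already ranked by model score.

```python
def candidates(token, sama_output, seen_in_d1):
    """Equation (9): analyzer candidates plus labels seen for the token in D1."""
    return set(sama_output.get(token, ())) | seen_in_d1.get(token, set())


def top_n_accuracy(ranked, gold, n):
    """Percentage of tokens whose gold label is among the n highest-ranked labels."""
    hits = sum(1 for labels, g in zip(ranked, gold) if g in labels[:n])
    return 100.0 * hits / len(gold)


# Toy inputs with placeholder label names.
sama_output = {"qbl": ["L1", "L2"]}
seen_in_d1 = {"qbl": {"L3"}}
cands = candidates("qbl", sama_output, seen_in_d1)

# One ranked list per token (best first) and the matching gold labels.
ranked = [["a", "b", "c"], ["b", "a"], ["c", "a", "b"], ["a", "b"]]
gold = ["a", "a", "b", "c"]

a1 = top_n_accuracy(ranked, gold, 1)
a2 = top_n_accuracy(ranked, gold, 2)
e1 = 100.0 - a1
e2 = 100.0 - a2
```

In this toy data the fourth token's gold label is absent from its ranked list altogether, which mirrors why candidate coverage acts as an upper bound on accuracy.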
The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete. In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4. The results reported are 10-fold cross-validation test accuracies, and no parameters have been tuned on them. A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 reduces only slightly from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of document-level features in Mlemma reduces E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively.

4.1 Comparison to Alternate Approaches

4.1.1 Structured Prediction Models

Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model.
To further investigate this effect, we tried running experiments using different structured models, namely CRF (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxation Algorithm) (Crammer et al., 2006) and Structured Perceptron (Collins, 2002). We used linear-chain CRFs as implemented in the MALLET toolbox (McCallum, 2001), and for Structured MIRA and Perceptron we used their implementations from the EDLIN toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels: it took a week to train a linear-chain CRF for only ∼50 labels, and though MIRA and Perceptron are online algorithms, they also become intractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, even after factoring, the dimension of the label space is on the order of thousands. We also tried a naïve version (two-pass approximation) of these structured models. In addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that, given their computational demands, our (unstructured) model Mfull is a better choice.

4.1.2 MADA

(Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don’t use the two step regression procedure that we use. Secondly, they use only “unigram” features.
Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer.1

1 A new version of MADA was released very close to the submission deadline for this conference.

4.1.3 Other Alternatives

Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into LEs, we contrast the performance of our Mfull model to an SVM model Mbaseline trained using only the feature set Zbaseline, which contains only features based on entire labels, not those based on individual LEs.

Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma′ = Zfull − Zlemma and train an SVM model Mlemma′ using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma′. Let Mind denote this approach.

Unigram Features: Finally, we also consider a context-less approach, i.e. using only “unigram” features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni.

The results of these various models, along with those of Mfull, are summarized in Table 5. We see that Mfull has roughly half the error rate of the state-of-the-art MADA system. Note: The results reported are 10-fold cross-validation test accuracies, and no parameters have been tuned on them. We used the same train-test splits for all the datasets.

5 Related Work

(Hajic, 2000) show that for highly inflectional languages, the use of a morphological analyzer improves accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM based learner.
(Ahmed and Nürnberger, 2008) perform word-sense disambiguation using a Naive Bayesian model and rely on parallel corpora and matching schemes instead of a morphological analyzer. (Kulick, 2010) perform simultaneous tokenization and part-of-speech tagging for Arabic by separating closed and open-class items and focusing on the likelihood of possible stems of open-class words. (Mohamed and Kübler, 2010) present a hybrid method between word-based and segment-based POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994). A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn’t include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and Elhadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems.
Our method is closer in style to a CRF, giving much of the accuracy gains of a simultaneous solution, while being about 4 orders of magnitude easier to train. We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data in two parts, estimating log-probabilities of generative models from one part, and discriminatively re-weighting the models using the second part.

6 Conclusions

We introduced a new approach to accurately predict labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90%, substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrence) and then regularizes these features by shrinking each of the coefficients for each feature by an amount determined by supervised learning using only the candidate label sets produced by SAMA. We also showed that using features of word n-grams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as factored components of labels, which we call label elements (LEs), works better than a model created by combining individual models for each LE.
We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages such as Turkish, Hungarian and Finnish that have highly inflectional morphology. The current accuracy of our model, getting the correct answer among the top two choices 96.2% of the time, is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; a more complete automation would require that accuracy for the single top choice.

Acknowledgments

We would like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri for their help. We also wish to thank the anonymous reviewers for their comments and suggestions.

References

Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.

Farag Ahmed and Andreas Nürnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT’08, Hamburg, Germany.

Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0.

Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP’02.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 7:551–585.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics/Human Language Technologies Conference (HLT-NAACL’04).
Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages.

Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP’09.

Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start). In Proceedings of ACL’08.

Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL’05, Ann Arbor, MI, USA.

Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL’00).

Jun Hatori, Yusuke Miyao, and Jun’ichi Tsujii. 2008. Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields. In Proceedings of COLing’08.

W. James and Charles Stein. 1961. Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1.

András Kornai. 1994. On Hungarian morphology (Linguistica, Series A: Studia et Dissertationes 14). Linguistics Institute of Hungarian Academy of Sciences, Budapest.

Seth Kulick. 2010. Simultaneous Tokenization and Part-of-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL’10.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML’01, pages 282–289.

Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large Scale Annotated Arabic Corpus. In Proceedings of NEMLAR Conference on Arabic Language Resources and Tools.
Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0.

Andrew McCallum. 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http://mallet.cs.umass.edu.

Emad Mohamed and Sandra Kübler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC’10.

Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL’08, Columbus, Ohio, USA.

Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).

Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494.

3 0.58431238 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semi-supervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

4 0.54810214 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

Author: Hendra Setiawan ; Chris Dyer ; Philip Resnik

Abstract: We address the modeling, parameter estimation and search challenges that arise from the introduction of reordering models that capture non-local reordering in alignment modeling. In particular, we introduce several reordering models that utilize (pairs of) function words as contexts for alignment reordering. To address the parameter estimation challenge, we propose to estimate these reordering models from a relatively small amount of manually-aligned corpora. To address the search challenge, we devise an iterative local search algorithm that stochastically explores reordering possibilities. By capturing non-local reordering phenomena, our proposed alignment model bears a closer resemblance to state-of-the-art translation models. Empirical results show significant improvements in alignment quality as well as in translation performance over baselines in a large-scale Chinese-English translation task.

5 0.53594851 39 emnlp-2010-EMNLP 044

Author: George Foster

Abstract: We describe a new approach to SMT adaptation that weights out-of-domain phrase pairs according to their relevance to the target domain, determined by both how similar to it they appear to be, and whether they belong to general language or not. This extends previous work on discriminative weighting by using a finer granularity, focusing on the properties of instances rather than corpus components, and using a simpler training procedure. We incorporate instance weighting into a mixture-model framework, and find that it yields consistent improvements over a wide range of baselines.

6 0.51304114 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

7 0.50607097 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

8 0.50217807 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

9 0.49919745 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

10 0.49115884 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

11 0.49083042 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

12 0.49072906 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

13 0.48899221 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

14 0.47940788 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

15 0.47638711 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

16 0.47423351 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

17 0.47370696 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

18 0.47304833 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

19 0.47130048 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

20 0.47116479 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts