acl acl2011 acl2011-44 knowledge-graph by maker-knowledge-mining

44 acl-2011-An exponential translation model for target language morphology

Source: pdf

Author: Michael Subotin

Abstract: This paper presents an exponential model for translation into highly inflected languages which can be scaled to very large datasets. As in other recent proposals, it predicts target-side phrases and can be conditioned on source-side context. However, crucially for the task of modeling morphological generalizations, it estimates feature parameters from the entire training set rather than as a collection of separate classifiers. We apply it to English-Czech translation, using a variety of features capturing potential predictors for case, number, and gender, and one of the largest publicly available parallel data sets. We also describe generation and modeling of inflected forms unobserved in training data and decoding procedures for a model with non-local target-side feature dependencies.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 An exponential translation model for target language morphology Michael Subotin Paxfire, Inc. [sent-1, score-0.621]

2 Abstract This paper presents an exponential model for translation into highly inflected languages which can be scaled to very large datasets. [sent-3, score-0.554]

3 As in other recent proposals, it predicts target-side phrases and can be conditioned on source-side context. [sent-4, score-0.195]

4 However, crucially for the task of modeling morphological generalizations, it estimates feature parameters from the entire training set rather than as a collection of separate classifiers. [sent-5, score-0.279]

5 We apply it to English-Czech translation, using a variety of features capturing potential predictors for case, number, and gender, and one of the largest publicly available parallel data sets. [sent-6, score-0.184]

6 We also describe generation and modeling of inflected forms unobserved in training data and decoding procedures for a model with non-local target-side feature dependencies. [sent-7, score-0.581]

7 Thus, Birch et al (2008) find that translation quality achieved by a popular phrase-based system correlates significantly with a measure of target-side, but not source-side, morphological complexity. [sent-9, score-0.194]

8 , 2009; Yeniterzi and Oflazer, 2010) proposed modeling target-side morphology in a phrase-based factored models framework (Koehn and Hoang, 2007). [sent-11, score-0.275]

9 Under this approach linguistic annotation of source sentences is analyzed using heuristics to identify relevant structural phenomena, whose occurrences are in turn used to compute additional relative frequency (maximum likelihood) estimates predicting target-side inflections. [sent-12, score-0.444]

10 For example, the accusative case is usually preserved in translation, so that nouns appearing in the direct object position of English clauses tend to be translated to words with accusative case markings in languages with richer morphology, and vice versa. [sent-14, score-0.467]

11 This paper presents an alternative approach based on exponential phrase models, which can straightforwardly handle feature sets with arbitrarily elaborate source-side dependencies. [sent-20, score-0.485]

12 2 Hierarchical phrase-based translation We take as our starting point David Chiang’s Hiero system, which generalizes phrase-based translation to substrings with gaps (Chiang, 2007). [sent-21, score-0.21]

13 The translation is chosen to be the target-side yield of the highest-scoring synchronous parse consistent with the source sentence. [sent-26, score-0.219]

14 Although a variety of scores interpolated into the decision rule for phrase-based systems have been investigated over the years, only a handful have been discovered to be consistently useful. [sent-27, score-0.238]

15 For a target-given-source phrase model the predicted outcomes are target-side phrases, the model is conditioned on a source-side phrase together with some context, and each GEN(X) consists of target phrases co-occurring with a given source phrase in the grammar. [sent-31, score-0.875]
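
A target-given-source exponential model normalized over GEN(X) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `phrase_probabilities`, the feature strings, and the toy Czech candidates are hypothetical, and the weights are set by hand rather than estimated.

```python
import math

def phrase_probabilities(weights, candidate_features):
    """Return p(Y | X) for every candidate Y in GEN(X).

    candidate_features maps each target phrase in GEN(X) to the list of
    features that fire for it (hypothetical feature names).
    """
    scores = {y: sum(weights.get(f, 0.0) for f in feats)
              for y, feats in candidate_features.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalizer over GEN(X)
    return {y: e / z for y, e in exp_scores.items()}

# Toy GEN(X) for a source phrase in direct object position (assumed data).
weights = {"case=acc&role=dobj": 1.0}
gen_x = {
    "knihu": ["case=acc&role=dobj"],
    "kniha": ["case=nom&role=dobj"],
}
probs = phrase_probabilities(weights, gen_x)
```

Because the feature weight is not tied to a particular source phrase, the same parameter contributes to the estimate of every rule for which the feature fires.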

16 Exponential models and other classifiers have been used in several recent studies to condition phrase model probabilities on source-side context (Chan et al 2007; Carpuat and Wu 2007a; Carpuat and Wu 2007b). [sent-36, score-0.193]

17 2 Recently, Jeong et al (2010) independently proposed an exponential model with shared features for target-side morphology in application to lexical scores in a treelet-based system. [sent-40, score-0.5]

18 4 Features The feature space for target-side inflection models used in this work consists of features tracking the source phrase and the corresponding target phrase together with its complete morphological tag, which will be referred to as rule features for brevity. [sent-41, score-1.369]

19 The feature space also includes features tracking the source phrase together with the lemmatized representation of the target phrase, called lemma features below. [sent-42, score-0.8]
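
The difference between rule features and lemma features can be illustrated with a short sketch. The tuple layout and the tag strings below are hypothetical and chosen only for readability; they are not the tag set used in the paper.

```python
def rule_feature(source_phrase, target_phrase, tag):
    # Tracks the source phrase, the target phrase, and its full morphological tag.
    return ("rule", source_phrase, target_phrase, tag)

def lemma_feature(source_phrase, target_lemmas):
    # Tracks the source phrase and the lemmatized target phrase only.
    return ("lemma", source_phrase, target_lemmas)

# Two inflected variants of the same target lemma (toy example).
nom = [rule_feature("book", "kniha", "N-fem-sg-nom"), lemma_feature("book", "kniha")]
acc = [rule_feature("book", "knihu", "N-fem-sg-acc"), lemma_feature("book", "kniha")]

shared = set(nom) & set(acc)
```

In this sketch the two variants have distinct rule features but share one lemma feature, so evidence for the lemma is pooled across inflections.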

20 Since there is little ambiguity in lemmatization for Czech, the lemma representations were for simplicity based on the most frequent lemma for each token. [sent-43, score-0.2]

21 Finally, we include features associating aspects of source-side annotation with inflections of aligned target words. [sent-44, score-0.4]

22 We add inflection features for all words aligned to at least one English verb, adjective, noun, pronoun, or determiner, excepting definite and indefinite articles. [sent-46, score-0.588]

23 A separate feature type marks cases where an intended inflection category is not applicable to a target word falling under these criteria due to a POS mismatch between aligned words. [sent-47, score-0.611]

24 1 Number The inflection for number is particularly easy to model in translating from English, since it is generally marked on the source side, and POS taggers based on the Penn treebank tag set attempt to infer it in cases where it is not. [sent-49, score-0.513]

25 For word pairs whose source-side word is a verb, we add a feature marking the number of its subject, with separate features for noun and pronoun subjects. [sent-50, score-0.248]

26 For word pairs whose source side is an adjective, we add a feature marking the number of the head of the smallest noun phrase that contains it. [sent-51, score-0.492]

27 2 Case Among the inflection types of Czech nouns, the only type that is not generally observed in English and does not belong to derivational morphology is inflection for case. [sent-54, score-0.759]

28 Czech adjectives also inflect for case and their case has to match the case of their governing noun. [sent-57, score-0.368]

29 However, since the source sentence and its annotation contain a variety of predictors for case, we model it using only source-dependent features. [sent-58, score-0.269]

30 The following feature types for case were included: • The structural role of the aligned source word or the head of the smallest noun phrase containing the aligned source word. [sent-59, score-0.939]

31 • The preposition governing the smallest noun phrase containing the aligned source word, if it is governed by a preposition. [sent-61, score-0.668]

32 • An indicator for the presence of a possessive marker modifying the aligned source word or the head of the smallest noun phrase containing the aligned source word. [sent-62, score-0.838]

33 • An indicator for the presence of a numeral modifying the aligned source word or the head of the smallest noun phrase containing the aligned source word. [sent-63, score-0.838]

34 • An indication that the aligned source word is modified by the quantifiers many, most, such, or half. [sent-64, score-0.302]

35 These features would be more properly defined based on the identity of the target word aligned to these quantifiers, but little ambiguity seems to arise from this substitution in practice. [sent-65, score-0.31]

36 • The lemma of the verb governing the aligned source word or the head of the smallest noun phrase containing the aligned source word. [sent-66, score-0.963]

37 This is the only lexicalized feature type used in the model and we include only those features which occur over 1,000 times in the training data. [sent-67, score-0.194]
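
The frequency cutoff for lexicalized features can be sketched as a simple count filter. The function name and the feature strings are hypothetical; only the threshold of 1,000 comes from the text.

```python
from collections import Counter

def frequent_features(feature_occurrences, min_count=1000):
    """Keep features that occur more than min_count times in the training data."""
    counts = Counter(feature_occurrences)
    return {feature for feature, n in counts.items() if n > min_count}

# Toy stream of governing-verb lemma features (assumed data).
occurrences = ["verb=give"] * 1001 + ["verb=lend"] * 1000
kept = frequent_features(occurrences)
```

The comparison is strict because the text says "over 1,000 times"; a feature seen exactly 1,000 times is dropped in this sketch.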

38 Features corresponding to aspects of the source word itself and features corresponding to aspects of the head of a noun phrase containing it were treated as separate types. [sent-69, score-0.548]

39 Verbs and adjectives have to agree with nouns for gender, although this agreement is not marked in some forms of the verb. [sent-72, score-0.185]

40 In contrast to number and case, Czech gender generally cannot be predicted from any aspect of the English source sentence, which necessitates the use of features that depend on another target-side word. [sent-73, score-0.309]

41 For verbs we add a feature associating the gender of the verb with the gender of its subject. [sent-75, score-0.349]

42 For adjectives, we add a feature tracking the gender of the governing nouns. [sent-76, score-0.358]

43 5 Decoding with target-side model dependencies The procedure for decoding with non-local target-side feature dependencies is similar in its general outlines to the standard method of decoding with a language model, as described in Chiang (2007). [sent-78, score-0.497]

44 Each rule that has matched the source sentence belongs to a rule chart associated with its location-anchored sequence of non-terminal and terminal source-side symbols and any of its aspects which may affect the score of a translation hypothesis when it is combined with another rule. [sent-80, score-0.788]

45 In the case of non-local target-side dependencies this includes any information about features needed for this rule’s estimate and tracking some target-side inflection beyond it or features tracking target-side inflections within this rule and needed for computation of another rule’s estimate. [sent-82, score-1.068]

46 Thus, a rule chart for a rule with one non-terminal can be denoted as $\langle x_{i+1}^{i_1} A x_{j_1+1}^{j}, \mu \rangle$, where we have introduced the symbol $\mu$ to represent the set of messages associated with a given item in the chart. [sent-84, score-0.481]

47 Each item in the chart is associated with a score s, based on any submodels and heuristic estimates that can already be computed for that item and used to arrange the chart items into a priority queue. [sent-85, score-0.589]

48 Combinations of one or more rules that span a substring of terminals are arranged into a different type of chart which we shall call span charts. [sent-86, score-0.431]

49 A span chart has the form [i1,j1; µ1], where µ1 is a set of messages, and its items are likewise prioritized by a partial score s1. [sent-87, score-0.282]
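
A chart whose items are prioritized by partial score can be sketched with a binary heap. The class name, the item layout `(i1, j1, messages)`, and the scores are hypothetical; the paper does not specify a data structure beyond a priority queue.

```python
import heapq

class Chart:
    """Priority queue of chart items ordered by partial score (best first)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so items themselves are never compared

    def push(self, score, item):
        # heapq is a min-heap, so the score is negated to pop the best item first.
        heapq.heappush(self._heap, (-score, self._counter, item))
        self._counter += 1

    def pop_best(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

# Span chart items of the form [i1, j1; mu1], with mu1 as a frozen set of messages.
chart = Chart()
chart.push(-4.2, (1, 3, frozenset()))
chart.push(-1.7, (1, 3, frozenset({("mf", 2)})))
chart.push(-3.0, (1, 3, frozenset({("mr", 5)})))
best_score, best_item = chart.pop_best()
```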

50 Informally, whenever a rule chart is combined with one or more span charts corresponding to its non-terminals, we select best-scoring items from each chart and update derivation scores by performing any model computations that become possible once we combine the corresponding items. [sent-89, score-0.75]

51 Crucially, whenever an item in one of the charts crosses a pruning threshold, we discard the rest of that chart’s items, even though one of them could generate a better-scoring partial derivation in combination with an item from another chart. [sent-90, score-0.195]

52 We estimate these scores by computing exponential models using all features without non-local dependencies. [sent-92, score-0.41]

53 We take the example of computing an estimate for a rule whose only terminal on both sides is a verb and which requires a feature tracking the target-side gender inflection of the subject. [sent-94, score-0.865]

54 We make use of a cache storing all computed numerators and denominators of the exponential model, which makes it easy to recompute an estimate given an additional feature and use the difference between it and the incomplete estimate to update the score of the partial derivation. [sent-95, score-0.463]
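
The score update from cached numerators and denominators can be sketched as follows. The class layout and names are assumptions made for illustration; the text states only that cached numerators and denominators make it easy to recompute an estimate once an additional feature is known.

```python
import math

class EstimateCache:
    """Caches the unnormalized scores of one rule's candidate set."""

    def __init__(self, local_scores):
        # local_scores: candidate -> summed weights of features without
        # non-local dependencies (the incomplete estimate).
        self.numerators = {y: math.exp(s) for y, s in local_scores.items()}
        self.denominator = sum(self.numerators.values())

    def log_prob(self, y):
        return math.log(self.numerators[y] / self.denominator)

    def score_update(self, y, extra_weights):
        """Difference between the full and the incomplete log-estimate for y.

        extra_weights: candidate -> weight of the non-local feature that has
        just become available (0 for candidates where it does not fire).
        """
        new_num = {c: n * math.exp(extra_weights.get(c, 0.0))
                   for c, n in self.numerators.items()}
        new_den = sum(new_num.values())
        return math.log(new_num[y] / new_den) - self.log_prob(y)

# Toy verb rule with a masculine and a feminine candidate (assumed data).
cache = EstimateCache({"verb_masc": 0.0, "verb_fem": 0.0})
delta = cache.score_update("verb_fem", {"verb_fem": math.log(3.0)})
```

In this toy case the incomplete estimate is 0.5 and the full estimate is 0.75, so the partial derivation's score rises by log(1.5).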

55 In the simplest case, illustrated in figure 2, the non-local feature depends on the position within the span of the rule’s non-terminal symbol, so that its model estimate can be computed when its rule chart is combined with the span chart for its non-terminal symbol. [sent-96, score-0.832]

56 This is accomplished using a feature message, which indicates the gender inflection for the subject and is denoted as mf(i), where the index i refers to the position of its “recipient”. [sent-97, score-0.59]

57 Figure 3 illustrates the case where the non-local feature lies outside the rule’s span, but the estimated rule lies inside a non-terminal of the rule which contains the feature dependency. [sent-98, score-0.712]

58 This requires sending a rule message mr(i), which includes information about the estimated rule (which also serves as a pointer to the score cache) and its feature dependency. [sent-99, score-0.508]

59 The final example, shown in figure 4, illustrates the case where both types of messages need to be propagated until we reach a rule chart that spans both ends of the dependency. [sent-100, score-0.48]

60 In this case, the full estimate for a rule is computed while combining charts neither of which corresponds directly to that rule. [sent-101, score-0.324]

61 The message-updating function um(µ) takes a set of messages and outputs another set that includes those messages mr(k) and mf(k) whose destination k lies outside the span i,j of the ... [Figure 2: Non-local dependency, case A.] [sent-104, score-0.49]

62 Figure 5: Simplified set of inference rules for decoding with target-side model dependencies. [sent-107, score-0.198]

63 6 Modeling unobserved target inflections As a consequence of translating into a morphologically rich language, some inflected forms of target words are unobserved in training data and cannot be generated by the decoder under standard phrase-based approaches. [sent-110, score-0.889]

64 Exponential models with shared features provide a straightforward way to estimate probabilities of unobserved inflections. [sent-111, score-0.32]

65 This is accomplished by extending the sets of target phrases GEN(X) over which the model is normalized by including some phrases which have not been observed in the original sets. [sent-112, score-0.383]
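
The effect of extending GEN(X) with an unobserved form can be sketched as follows. The weights, feature names, and word forms are hypothetical toy values; the point is only that a generated candidate receives probability mass through features it shares with observed rules.

```python
import math

def normalize(scores):
    """Exponentiate and normalize a dict of candidate scores."""
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# Shared inflection feature weights, assumed to be learned from other rules.
shared_weights = {"case=acc&role=dobj": 0.5, "case=nom&role=dobj": -0.5}

def score(features):
    return sum(shared_weights.get(f, 0.0) for f in features)

observed = {"kniha": ["case=nom&role=dobj"]}   # seen with this source phrase
generated = {"knihu": ["case=acc&role=dobj"]}  # unobserved inflected variant

before = normalize({y: score(f) for y, f in observed.items()})
after = normalize({y: score(f) for y, f in {**observed, **generated}.items()})
```

Before the extension the single observed form takes all the probability mass; afterwards the generated accusative form is preferred in this direct-object context even though it never co-occurred with the source phrase.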

66 When additional rule features with these unobserved target phrases are included in the model, their weights will be estimated even though they never appear in the training examples (i. [sent-113, score-0.679]

67 We generate unobserved morphological variants for target phrases starting from a generation procedure for target words. [sent-115, score-0.527]

68 The forms produced by the tool from the lemma of an observed inflected word form were subjected to several restrictions: • For nouns, generated forms had to match the original form for number. [sent-117, score-0.443]

69 • Non-standard inflection forms for all POS were excluded. [sent-121, score-0.376]

70 The following criteria were used to select rules for which expanded inflection sets were generated: • The target phrase had to contain exactly one word for which inflected forms could be generated according to the criteria given above. [sent-122, score-0.642]

71 • If the target phrase contained prepositions or numerals, they had to be in a position not adjacent to the inflected word. [sent-123, score-0.356]

72 The rationale for this criterion was the tendency of prepositions and numerals to determine the inflection of adjacent words. [sent-124, score-0.352]

73 • The lemmatized form of the phrase had to account for at least 25% of target phrases extracted for a given source phrase. [sent-125, score-0.471]
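
The last selection criterion, a minimum share for the lemmatized form, can be sketched as a filter over the target phrases of one source phrase. This sketch assumes the share is computed over distinct extracted target phrases (the text does not say whether types or occurrence counts are meant), and the toy lemma table is hypothetical.

```python
from collections import Counter

def lemmas_meeting_share(target_phrases, lemmatize, min_share=0.25):
    """Return lemmatized forms covering at least min_share of the target phrases."""
    total = len(target_phrases)
    if total == 0:
        return set()
    counts = Counter(lemmatize(phrase) for phrase in target_phrases)
    return {lemma for lemma, n in counts.items() if n / total >= min_share}

# Toy lemma table; unknown phrases are treated as their own lemma.
toy_lemmas = {"kniha": "kniha", "knihu": "kniha", "knihy": "kniha"}
targets = ["kniha", "knihu", "knihy", "svazek", "dilo", "spis", "text", "titul"]
eligible = lemmas_meeting_share(targets, lambda p: toy_lemmas.get(p, p))
```

Here three of the eight target phrases share one lemma (37.5%), so only that lemma qualifies for expansion.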

74 The standard relative frequency estimates for the p(X|Y) phrase model and the lexical models do not provide reasonable values for the decoder scores for unobserved rules and words. [sent-126, score-0.615]

75 For the experiments described below we trained an exponential model for the p(Y |X) lexical model. [sent-128, score-0.326]

76 Thus, the annotation for the development and testing sets provides a realistic reflection of what could be obtained for arbitrary source text. [sent-133, score-0.196]

77 The impact of the models on translation accuracy was investigated for two experimental conditions: • Small data set: trained on the news portion of the data, containing 140,191 sentences; development and testing sets containing 1500 sentences of news text each. [sent-138, score-0.259]

78 The decision rule was based on the standard log-linear interpolation of several models, with weights tuned by MERT on the development set (Och, 2003). [sent-146, score-0.183]
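
A log-linear decision rule of this kind can be sketched as a weighted sum of submodel log-scores. The submodel names, weights, and scores below are hypothetical; in the paper the weights are tuned by MERT, which this sketch does not implement.

```python
def loglinear_score(log_scores, weights):
    """Weighted sum of submodel log-scores for one translation hypothesis."""
    return sum(weights[name] * value for name, value in log_scores.items())

# Hand-set interpolation weights (assumed; MERT would tune these).
weights = {"lm": 0.5, "phrase": 0.3, "lex": 0.2}
hypotheses = {
    "hyp_a": {"lm": -2.0, "phrase": -1.0, "lex": -1.5},
    "hyp_b": {"lm": -1.0, "phrase": -2.0, "lex": -1.0},
}
best = max(hypotheses, key=lambda h: loglinear_score(hypotheses[h], weights))
```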

79 The baselines consisted of the language model, two phrase translation models, two lexical models, and a brevity penalty. [sent-147, score-0.247]

80 The proposed exponential phrase model contains several modifications relative to a standard phrase model (called baseline A below) with potential to improve translation accuracy, including smoothed estimates and estimates incorporating target-side tags. [sent-148, score-1.01]

81 To gain better insight into the role played by different elements of the model, we also tested a second baseline phrase model (baseline B), which attempted to isolate the exponential model itself from auxiliary modifications. [sent-149, score-0.559]

82 Baseline B was different from the experimental condition in using a grammar limited to observed inflections and in replacing the exponential p(Y|X) phrase model by a relative frequency phrase model. [sent-150, score-0.76]

83 It was different from baseline A in computing the frequencies for the p(Y|X) phrase model based on counts of tagged target phrases and in using the same smoothed estimates in the other models as were used in the experimental condition. [sent-151, score-0.477]

84 Following the approach of Mann et al (2009), the training set was split into many approximately equal portions, for which parameters were estimated separately and then averaged for features observed in multiple portions. [sent-158, score-0.193]
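
The averaging step can be sketched as follows, assuming each portion yields a dictionary of feature weights and that a feature's weight is averaged over the portions in which it was observed. The function name and toy weights are hypothetical.

```python
from collections import defaultdict

def average_parameters(portion_weights):
    """Average each feature's weight over the portions where it was observed."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for weights in portion_weights:
        for feature, value in weights.items():
            sums[feature] += value
            counts[feature] += 1
    return {feature: sums[feature] / counts[feature] for feature in sums}

# Weights estimated separately on two portions of the training set (toy values).
portions = [{"a": 1.0, "b": 2.0}, {"a": 3.0}]
averaged = average_parameters(portions)
```

Feature "a" is seen in both portions and is averaged to 2.0, while "b" is seen in one portion and keeps its single estimate.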

85 The sets of target phrases for each source phrase prior to generation of additional inflected variants were truncated by discarding extracted rules which were observed with frequency less than the 200-th most frequent target phrase for that source phrase. [sent-159, score-0.993]
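
The truncation of each target phrase set can be sketched as a rank-based frequency cutoff. The function name is hypothetical, and ties at the cutoff frequency are kept, which follows the wording "frequency less than the 200-th most frequent target phrase" but is otherwise an assumption.

```python
def truncate_targets(target_counts, rank=200):
    """Drop target phrases rarer than the rank-th most frequent one."""
    if len(target_counts) <= rank:
        return dict(target_counts)
    cutoff = sorted(target_counts.values(), reverse=True)[rank - 1]
    return {t: c for t, c in target_counts.items() if c >= cutoff}

# Toy counts for one source phrase, truncated at rank 2 for illustration.
counts = {"a": 5, "b": 3, "c": 3, "d": 1}
kept = truncate_targets(counts, rank=2)
```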

86 Additional computational challenges remained due to an important difference between models with shared features and usual phrase models. [sent-160, score-0.217]

87 Features appearing with source phrases found in development and testing data share their weights with features appearing with other source phrases, so that filtering the training set for development and testing data affects the solution. [sent-161, score-0.458]

88 The large data model used parameters for the inflection features estimated from the small data set. [sent-164, score-0.508]

89 In the runs where exponential models were used they replaced the corresponding baseline phrase translation model. [sent-165, score-0.522]

90 Aside from the two baselines described in section 7 and the full exponential model, the table also reports results for an exponential model that excluded gender-based features (and hence non-local target-side dependencies). [sent-167, score-0.676]

91 The highest scores were achieved by the full exponential model, although baseline B produced surprisingly disparate effects for the two data sets. [sent-168, score-0.275]

92 This suggests a complex interplay of the various aspects of the model and training data whose exploration could further improve the scores. [sent-169, score-0.195]

93 One can see that for both rule sets the estimated probabilities for rules observed a single time is only slightly ... [sent-176, score-0.362]

94 Table 2: Grammar sizes after and before generation of unobserved inflections (all filtered for dev/test sets). [sent-180, score-0.291]

95 However, rules with relatively high counts in the second set receive proportionally higher estimates, while the difference between the singleton rule and the most frequent rule in the second set, which was observed 3 times, is smoothed away to an even greater extent. [sent-182, score-0.471]

96 The last two columns show model estimates when various inflection features are included. [sent-183, score-0.556]

97 There is a grammatical match between nominative case for the target word and subject position for the aligned source word and between accusative case for the target word and direct object role for the aligned source word. [sent-184, score-0.961]

98 Table 3: The effect of inflection features on estimated probabilities. [sent-191, score-0.457]

99 10 Conclusion This paper has introduced a scalable exponential phrase model for target languages with complex morphology that can be trained on the full parallel corpus. [sent-192, score-0.703]

100 We have shown how it can provide estimates for inflected forms unobserved in the training data and described decoding procedures for features with non-local target-side dependencies. [sent-193, score-0.659]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('inflection', 0.308), ('exponential', 0.275), ('unobserved', 0.185), ('rule', 0.183), ('chart', 0.153), ('aa', 0.148), ('aligned', 0.144), ('phrase', 0.142), ('targetside', 0.124), ('inflected', 0.123), ('estimates', 0.122), ('gender', 0.12), ('source', 0.114), ('czech', 0.112), ('accusative', 0.111), ('inflections', 0.106), ('translation', 0.105), ('lemma', 0.1), ('morphology', 0.099), ('carpuat', 0.099), ('karel', 0.092), ('ohre', 0.092), ('target', 0.091), ('morphological', 0.089), ('governing', 0.088), ('messages', 0.088), ('decoding', 0.086), ('span', 0.082), ('tracking', 0.082), ('charts', 0.081), ('features', 0.075), ('markings', 0.074), ('estimated', 0.074), ('ym', 0.074), ('chiang', 0.073), ('phrases', 0.071), ('feature', 0.068), ('forms', 0.068), ('predictors', 0.064), ('smallest', 0.063), ('rules', 0.061), ('colorless', 0.061), ('excepting', 0.061), ('furieusement', 0.061), ('furiously', 0.061), ('kolovratn', 0.061), ('scac', 0.061), ('scipy', 0.061), ('subotin', 0.061), ('noun', 0.061), ('estimate', 0.06), ('xm', 0.059), ('nouns', 0.059), ('adjectives', 0.058), ('item', 0.057), ('gen', 0.057), ('case', 0.056), ('containing', 0.056), ('accomplished', 0.055), ('interpolated', 0.055), ('inflect', 0.054), ('abokrtsk', 0.054), ('avramidis', 0.054), ('fal', 0.054), ('ramanathan', 0.054), ('tohre', 0.054), ('shall', 0.053), ('lemmatized', 0.053), ('factored', 0.052), ('model', 0.051), ('aspects', 0.05), ('jana', 0.05), ('interplay', 0.05), ('sleep', 0.05), ('byrd', 0.05), ('genitive', 0.05), ('yeniterzi', 0.05), ('items', 0.047), ('jeong', 0.047), ('parallel', 0.045), ('observed', 0.044), ('whose', 0.044), ('numerals', 0.044), ('quantifiers', 0.044), ('ees', 0.044), ('regularizer', 0.044), ('rx', 0.042), ('ry', 0.042), ('testing', 0.042), ('dependencies', 0.041), ('verbs', 0.041), ('koehn', 0.041), ('rendered', 0.041), ('annotation', 0.04), ('generated', 0.04), ('role', 0.04), ('lies', 0.04), ('treebank', 0.04), ('mf', 0.039), ('bojar', 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 44 acl-2011-An exponential translation model for target language morphology

Author: Michael Subotin

2 0.18522175 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

Author: Andreas Zollmann ; Stephan Vogel

Abstract: In this work we propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction. The proposals range from simple tag-combination schemes to a phrase clustering model that can incorporate an arbitrary number of features. Our models improve translation quality over the single generic label approach of Chiang (2005) and perform on par with the syntactically motivated approach from Zollmann and Venugopal (2006) on the NIST large Chinese-to-English translation task. These results persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available.

3 0.18058689 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

Author: Markos Mylonakis ; Khalil Sima'an

Abstract: While it is generally accepted that many translation phenomena are correlated with linguistic structures, employing linguistic syntax for translation has proven a highly non-trivial task. The key assumption behind many approaches is that translation is guided by the source and/or target language parse, employing rules extracted from the parse tree or performing tree transformations. These approaches enforce strict constraints and might overlook important translation phenomena that cross linguistic constituents. We propose a novel flexible modelling approach to introduce linguistic information of varying granularity from the source side. Our method induces joint probability synchronous grammars and estimates their parameters, by selecting and weighing together linguistically motivated rules according to an objective function directly targeting generalisation over future data. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.

4 0.16477717 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

Author: Ann Clifton ; Anoop Sarkar

Abstract: This paper extends the training and tuning regime for phrase-based statistical machine translation to obtain fluent translations into morphologically complex languages (we build an English to Finnish translation system). Our methods use unsupervised morphology induction. Unlike previous work we focus on morphologically productive phrase pairs – our decoder can combine morphemes across phrase boundaries. Morphemes in the target language may not have a corresponding morpheme or word in the source language. Therefore, we propose a novel combination of post-processing morphology prediction with morpheme-based translation. We show, using both automatic evaluation scores and linguistically motivated analyses of the output, that our methods outperform previously proposed ones and provide the best known results on the English-Finnish Europarl translation task. Our methods are mostly language independent, so they should improve translation into other target languages with complex morphology. 1 Translation and Morphology Languages with rich morphological systems present significant hurdles for statistical machine translation (SMT), most notably data sparsity, source-target asymmetry, and problems with automatic evaluation. In this work, we propose to address the problem of morphological complexity in an English-to-Finnish MT task within a phrase-based translation framework. We focus on unsupervised segmentation methods to derive the morphological information supplied to the MT model in order to provide coverage on very large datasets and for languages with few hand-annotated resources. In fact, in our experiments, unsupervised morphology always outperforms the use of a hand-built morphological analyzer. Rather than focusing on a few linguistically motivated aspects of Finnish morphological behaviour, we develop techniques for handling morphological complexity in general.
We chose Finnish as our target language for this work, because it exemplifies many of the problems morphologically complex languages present for SMT. Among all the languages in the Europarl data-set, Finnish is the most difficult language to translate from and into, as was demonstrated in the MT Summit shared task (Koehn, 2005). Another reason is the current lack of knowledge about how to apply SMT successfully to agglutinative languages like Turkish or Finnish. Our main contributions are: 1) the introduction of the notion of segmented translation where we explicitly allow phrase pairs that can end with a dangling morpheme, which can connect with other morphemes as part of the translation process, and 2) the use of a fully segmented translation model in combination with a post-processing morpheme prediction system, using unsupervised morphology induction. Both of these approaches beat the state of the art on the English-Finnish translation task. Morphology can express both content and function categories, and our experiments show that it is important to use morphology both within the translation model (for morphology with content) and outside it (for morphology contributing to fluency). Automatic evaluation measures for MT, BLEU (Papineni et al., 2002), WER (Word Error Rate) and PER (Position Independent Word Error Rate) use the word as the basic unit rather than morphemes. In a word comprised of multiple morphemes, getting even a single morpheme wrong means the entire word is wrong. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 32–42, Portland, Oregon, June 19-24, 2011. ©2011 Association for Computational Linguistics.) In addition to standard MT evaluation measures, we perform a detailed linguistic analysis of the output. Our proposed approaches are significantly better than the state of the art, achieving the highest reported BLEU scores on the English-Finnish Europarl version 3 data-set.
Our linguistic analysis shows that our models have fewer morpho-syntactic errors compared to the word-based baseline. 2 Models 2.1 Baseline Models We set up three baseline models for comparison in this work. The first is a basic word-based model (called Baseline in the results); we trained this on the original unsegmented version of the text. Our second baseline is a factored translation model (Koehn and Hoang, 2007) (called Factored), which used as factors the word, “stem” (footnote 1: see Section 2.2) and suffix. These are derived from the same unsupervised segmentation model used in other experiments. The results (Table 3) show that a factored model was unable to match the scores of a simple word-based baseline. We hypothesize that this may be an inherently difficult representational form for a language with the degree of morphological complexity found in Finnish. Because the morphology generation must be precomputed, for languages with a high degree of morphological complexity, the combinatorial explosion makes it unmanageable to capture the full range of morphological productivity. In addition, because the morphological variants are generated on a per-word basis within a given phrase, it excludes productive morphological combination across phrase boundaries and makes it impossible for the model to take into account any long-distance dependencies between morphemes. We conclude from this result that it may be more useful for an agglutinative language to use morphology beyond the confines of the phrasal unit, and condition its generation on more than just the local target stem. In order to compare the performance of unsupervised segmentation for translation, our third baseline is a segmented translation model based on a supervised segmentation model (called Sup), using the hand-built Omorfi morphological analyzer (Pirinen and Listenmaa, 2007), which provided slightly higher BLEU scores than the word-based baseline.
2.2 Segmented Translation

For segmented translation models, it cannot be taken for granted that greater linguistic accuracy in segmentation yields improved translation (Chang et al., 2008). Rather, the goal in segmentation for translation is to maximize the amount of lexical content-carrying morphology, while generalizing over the information not helpful for improving the translation model. We therefore trained several different segmentation models, considering factors of granularity, coverage, and source-target symmetry.

We performed unsupervised segmentation of the target data, using Morfessor (Creutz and Lagus, 2005) and Paramor (Monson, 2008), two top systems from the Morpho Challenge 2008 (their combined output was the Morpho Challenge winner). However, translation models based upon either Paramor alone or the combined systems' output could not match the word-based baseline, so we concentrated on Morfessor. Morfessor uses minimum description length criteria to train an HMM-based segmentation model. When tested against a human-annotated gold standard of linguistic morpheme segmentations for Finnish, this algorithm outperforms competing unsupervised methods, achieving an F-score of 67.0% on a 3 million sentence corpus (Creutz and Lagus, 2006). Varying the perplexity threshold in Morfessor does not segment more word types, but rather over-segments the same word types. In order to get robust, common segmentations, we trained the segmenter on the 5,000 most frequent words; we then used this to segment the entire data set. [Footnote 2: For the factored model baseline we also used the same setting, perplexity = 30 and the 5,000 most frequent words, but with all but the last suffix collapsed and called the "stem".] In order to improve coverage, we then further segmented any word type that contained a match from the most frequent suffix set, looking for the longest matching suffix character string. We call this method Unsup L-match.

After the segmentation, word-internal morpheme boundary markers were inserted into the segmented text to be used to reconstruct the surface forms in the MT output. We then trained the Moses phrase-based system (Koehn et al., 2007) on the segmented and marked text. After decoding, it was a simple matter to join together all adjacent morphemes with word-internal boundary markers to reconstruct the surface forms. Figure 1(a) gives the full model overview for all the variants of the segmented translation model (supervised/unsupervised; with and without the Unsup L-match procedure).

Table 1 shows how morphemes are being used in the MT system. Of the phrases that included segmentations ('Morph' in Table 1), roughly a third were 'productive', i.e. had a hanging morpheme (with a form such as stem+) that could be joined to a suffix ('Hanging Morph' in Table 1). However, in phrases used while decoding the development and test data, roughly a quarter of the phrases that generated the translated output included segmentations, but of these, only a small fraction (6%) had a hanging morpheme. There are many possible reasons that could account for this, but we were unable to find a single convincing cause.

[Table 1: Morpheme occurrences in the phrase table and in translation. The table values are not recoverable from the extracted text.]

2.3 Morphology Generation

Morphology generation as a post-processing step allows major vocabulary reduction in the translation model, and allows the use of morphologically targeted features for modeling inflection. A possible disadvantage of this approach is that there is no opportunity to consider the morphology in translation, since it is removed prior to training the translation model.
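The longest-suffix matching and the marker-based re-stitching described in Section 2.2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `l_match_segment` and `stitch`, the example suffix list, and the exact marker convention (`stem+` followed by `+suffix`, as in Figure 2) are assumptions made for the example.

```python
from typing import Iterable, List


def l_match_segment(word: str, suffixes: Iterable[str]) -> List[str]:
    """Split off the longest suffix from `suffixes` that ends `word`.

    Hypothetical helper: returns ['stem+', '+suffix'] on a match,
    or [word] unchanged when no suffix applies.
    """
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Skip empty suffixes and require a non-empty stem to remain.
        if suffix and len(word) > len(suffix) and word.endswith(suffix):
            stem = word[: -len(suffix)]
            # The '+' signs are word-internal boundary markers.
            return [stem + "+", "+" + suffix]
    return [word]


def stitch(tokens: List[str]) -> List[str]:
    """Join adjacent morphemes whose facing boundary markers match."""
    words: List[str] = []
    for tok in tokens:
        if words and words[-1].endswith("+") and tok.startswith("+"):
            # Drop both markers and glue the morpheme onto the open word.
            words[-1] = words[-1][:-1] + tok[1:]
        else:
            words.append(tok)
    return words


if __name__ == "__main__":
    print(l_match_segment("talossa", ["a", "ssa"]))
    print(stitch("koske+ +va+ +a mietintö+ +ä".split()))
```

A hanging morpheme such as `stem+` that is not followed by a `+suffix` token is left as it is by `stitch`; how such cases are handled in the actual system is not stated in the text.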
Morphology generation models can use a variety of bilingual and contextual information to capture dependencies between morphemes, often more long-distance than what is possible using n-gram language models over morphemes in the segmented model. Similar to previous work (Minkov et al., 2007; Toutanova et al., 2008), we model morphology generation as a sequence learning problem. Unlike previous work, we use unsupervised morphology induction and use automatically generated suffix classes as tags.

The first phase of our morphology prediction model is to train an MT system that produces morphologically simplified word forms in the target language. The output word forms are complex stems (a stem and some suffixes) but still missing some important suffix morphemes. In the second phase, the output of the MT decoder is then tagged with a sequence of abstract suffix tags. In particular, the input to the tagger is the MT decoder output, a sequence of complex stems denoted by x, and its output is a sequence of suffix class tags denoted by y. We use a list of parts from (x, y) and map to a d-dimensional feature vector Φ(x, y), with each dimension being a real number. We infer the best sequence of tags using:

$$F(x) = \operatorname*{argmax}_{y} \; p(y \mid x, w)$$

where F(x) returns the highest scoring output y∗. A conditional random field (CRF) (Lafferty et al., 2001) defines the conditional probability as a linear score for each candidate y and a global normalization term:

$$\log p(y \mid x, w) = \Phi(x, y) \cdot w - \log Z, \qquad Z = \sum_{y' \in \mathrm{GEN}(x)} \exp\big(\Phi(x, y') \cdot w\big)$$

We use stochastic gradient descent (using crfsgd) to train the weight vector w. [Footnote 3: http://leon.bottou.org/projects/sgd] So far, this is all off-the-shelf sequence learning. However, the output y∗ from the CRF decoder is still only a sequence of abstract suffix tags. The third and final phase in our morphology prediction model is to take the abstract suffix tag sequence y∗ and then map it into fully inflected word forms, and rank those outputs using a morphemic language model. The abstract suffix tags are extracted from the unsupervised morpheme learning process, and are carefully designed to enable CRF training and decoding. We call this model CRF-LM for short. Figure 1(b) shows the full pipeline and Figure 2 shows a worked example of all the steps involved.

[Figure 1: Training and testing pipelines for the SMT models. (a) Segmented Translation Model: English and Finnish training data (words); morphological pre-processing (stem+ +morph); MT system; post-process: morph re-stitching; fully inflected surface form; evaluation against original reference. (b) Post-Processing Model, Translation & Generation: English and Finnish training data (words); morphological pre-processing 1 (stem+ +morph1+ +morph2); morphological pre-processing 2 (stem+ +morph1+); MT system; post-process 1: morph re-stitching (complex stem: stem+morph1+); post-process 2: CRF morphology generation (stem+morph1+ morph2) with language model and surface form mapping; fully inflected surface form; evaluation against original reference.]

We use the morphologically segmented training data (obtained using the segmented corpus described in Section 2.2; see footnote 4) and remove selected suffixes to create a morphologically simplified version of the training data. The MT model is trained on the morphologically simplified training data. The output from the MT system is then used as input to the CRF model. The CRF model was trained on ∼210,000 Finnish sentences, consisting of ∼1.5 million tokens; the 2,000 sentence Europarl test set consisted of 41,434 stem tokens.
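To make the CRF definition concrete, the sketch below computes log p(y | x, w) = Φ(x, y) · w − log Z by enumerating every candidate tag sequence in GEN(x). It is only an illustration under stated assumptions: the toy feature function, the weights, and the two-tag inventory are invented for the example, and brute-force enumeration is feasible only for very short inputs (practical CRF implementations compute Z with the forward algorithm instead).

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple


def features(x: Sequence[str], y: Sequence[str]) -> Dict[str, float]:
    """Toy Phi(x, y): stem/tag and previous-tag/tag indicator counts (assumed)."""
    phi: Dict[str, float] = {}
    prev = "<s>"
    for stem, tag in zip(x, y):
        for key in (f"stem={stem}|tag={tag}", f"prev={prev}|tag={tag}"):
            phi[key] = phi.get(key, 0.0) + 1.0
        prev = tag
    return phi


def score(x: Sequence[str], y: Sequence[str], w: Dict[str, float]) -> float:
    """Linear score Phi(x, y) . w over the sparse feature vector."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())


def gen(x: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, ...]]:
    """GEN(x): every tag sequence of the same length as x."""
    return list(itertools.product(tags, repeat=len(x)))


def log_prob(x, y, w, tags) -> float:
    """log p(y | x, w) with the global normalization term Z."""
    log_z = math.log(sum(math.exp(score(x, c, w)) for c in gen(x, tags)))
    return score(x, y, w) - log_z


def decode(x, w, tags) -> Tuple[str, ...]:
    """F(x): the highest-scoring tag sequence y*."""
    return max(gen(x, tags), key=lambda c: score(x, c, w))


if __name__ == "__main__":
    x = ("koskeva+", "mietintö+")
    tags = ("+A", "+n")
    w = {"stem=koskeva+|tag=+A": 2.0, "stem=mietintö+|tag=+A": 1.0}
    best = decode(x, w, tags)
    print(best, math.exp(log_prob(x, best, w, tags)))
```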
The labels in the output sequence y were obtained by selecting the most productive 150 stems, and then collapsing certain vowels into equivalence classes corresponding to Finnish vowel harmony patterns. Thus variants -kö and -ko become the vowel-generic enclitic particle -kO, and variants -ssä and -ssa become the vowel-generic inessive case marker -ssA, etc. This is the only language-specific component of our translation model. However, we expect this approach to work for other agglutinative languages as well. For fusional languages like Spanish, another mapping from suffix to abstract tags might be needed. These suffix transformations to their equivalence classes prevent morphophonemic variants of the same morpheme from competing against each other in the prediction model. This resulted in 44 possible label outputs per stem, which was a reasonably sized tag set for CRF training.

[Footnote 4: Note that unlike Section 2.2 we do not use Unsup L-match because when evaluating the CRF model on the suffix prediction task it obtained 95.61% without using Unsup L-match and 82.99% when using Unsup L-match.]

The CRF was trained on monolingual features of the segmented text for suffix prediction, where t is the current token:

Word Stem: $s_{t-n}, \ldots, s_t, \ldots, s_{t+n}$ (n = 4)
Morph Prediction: $y_{t-2}, y_{t-1}, y_t$

With this simple feature set, we were able to use features over longer distances, resulting in a total of 1,110,075 model features. After CRF-based recovery of the suffix tag sequence, we use a bigram language model trained on a fully segmented version of the training data to recover the original vowels. We used bigrams only, because the suffix vowel harmony alternation depends only upon the preceding phonemes in the word from which it was segmented.
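The vowel-harmony collapsing and the bigram-based vowel recovery can be illustrated with a small sketch. Several things here are assumptions rather than details from the paper: only the two vowel pairs given as examples (a/ä and o/ö) are collapsed, the "language model" is reduced to raw bigram counts over (previous morpheme, suffix) with a unigram tie-break, and the names `to_abstract`, `train_counts` and `recover` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Assumed equivalence classes: a/ä -> A and o/ö -> O, following the
# -ssa/-ssä -> -ssA and -ko/-kö -> -kO examples in the text.
COLLAPSE = str.maketrans({"a": "A", "ä": "A", "o": "O", "ö": "O"})


def to_abstract(suffix: str) -> str:
    """Map a concrete suffix to its vowel-generic abstract tag."""
    return suffix.translate(COLLAPSE)


def train_counts(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """Collect bigram and unigram counts from segmented sentences."""
    bigrams: Counter = Counter()
    unigrams: Counter = Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return bigrams, unigrams


def recover(prev: str, tag: str, bigrams: Counter, unigrams: Counter) -> str:
    """Pick the concrete suffix for `tag` that best follows `prev`."""
    candidates = [m for m in unigrams if to_abstract(m) == tag]
    if not candidates:
        return tag  # nothing known for this tag; leave it abstract
    # Prefer the highest bigram count, then the highest unigram count.
    return max(candidates, key=lambda m: (bigrams[(prev, m)], unigrams[m]))


if __name__ == "__main__":
    corpus = [s.split() for s in ("koske+ +va+ +a", "mietintö+ +ä", "talo+ +a")]
    bigrams, unigrams = train_counts(corpus)
    print(to_abstract("+ssä"), recover("mietintö+", "+A", bigrams, unigrams))
```

Conditioning on the whole previous morpheme is a simplification of the observation that harmony depends on the preceding phonemes of the word.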
(a) Training
original training data: koskevaa mietintöä käsitellään
segmentation: koske+ +va+ +a mietintö+ +ä käsi+ +te+ +llä+ +ä+ +n
(train bigram language model with mapping A = { a , ä })
map final suffix to abstract tag-set: koske+ +va+ +A mietintö+ +A käsi+ +te+ +llä+ +ä+ +n
(train CRF model to predict the final suffix)
peeling of final suffix: koske+ +va+ mietintö+ käsi+ +te+ +llä+ +ä+
(train SMT model on this transformation of training data)

(b) Decoding
decoder output: koske+ +va+ mietintö+ käsi+ +te+ +llä+ +ä+
decoder output stitched up: koskeva+ mietintö+ käsitellää+
CRF model prediction: x = 'koskeva+ mietintö+ käsitellää+', y = '+A +A +n'
koskeva+ +A mietintö+ +A käsitellää+ +n
unstitch morphemes: koske+ +va+ +A mietintö+ +A käsi+ +te+ +llä+ +ä+ +n
language model disambiguation: koske+ +va+ +a mietintö+ +ä käsi+ +te+ +llä+ +ä+ +n
final stitching: koskevaa mietintöä käsitellään
(the output is then compared to the reference translation)

Figure 2: Worked example of all steps in the post-processing morphology prediction model.

3 Experimental Results

For all of the models built in this paper, we used the Europarl version 3 corpus (Koehn, 2005) English-Finnish training data set, as well as the standard development and test data sets. Our parallel training data consists of ∼1 million sentences of 40 words or less, while the development and test sets were each 2,000 sentences long. In all the experiments conducted in this paper, we used the Moses (see footnote 5) phrase-based translation system (Koehn et al., 2007), 2008 version. We trained all of the Moses systems herein using the standard features: language model, reordering model, translation model, and word penalty; in addition to these, the factored experiments called for additional translation and generation features for the added factors as noted above.
We used the following settings in all experiments: a hypothesis stack size 100, distortion limit 6, phrase translations limit 20, and maximum phrase length 20. For the language models, we used SRILM 5-gram language models (Stolcke, 2002) for all factors. For our word-based Baseline system, we trained a word-based model using the same Moses system with identical settings. For evaluation against segmented translation systems in segmented forms before word reconstruction, we also segmented the baseline system's word-based output. All the BLEU scores reported are for lowercase evaluation.

[Footnote 5: http://www.statmt.org/moses/]

We did an initial evaluation of the segmented output translation for each system using the notion of m-BLEU score (Luong et al., 2010), where the BLEU score is computed by comparing the segmented output with a segmented reference translation. Table 2 shows the m-BLEU scores for various systems. We also show the m-BLEU score without unigrams, since over-segmentation could lead to artificially high m-BLEU scores. In fact, if we compare the m-BLEU scores, the Unsup L-match system shows a relative improvement of 39.75% over the baseline. Luong et al. (2010) report an m-BLEU score of 55.64% but obtain a relative improvement of 0.6% over their baseline m-BLEU score. We find that when using a good segmentation model, segmentation of the morphologically complex target language improves model performance over an unsegmented baseline (the confidence scores come from bootstrap resampling).

[Table 2: Segmented Model Scores. Sup refers to the supervised segmentation baseline model. m-BLEU indicates that the segmented output was evaluated against a segmented version of the reference (this measure does not have the same correlation with human judgement as BLEU). No Uni indicates the segmented BLEU score without unigrams. The table values are not recoverable from the extracted text.]
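The idea behind m-BLEU and its "No Uni" variant can be sketched as a BLEU-style score computed over morpheme tokens, with the set of n-gram orders as a parameter. This is a simplified, single-reference illustration and not the scoring script used in the paper; in particular, treating "No Uni" as the geometric mean over orders 2 to 4 is an assumption, and the function name `bleu` and its `orders` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of one token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyps: List[List[str]], refs: List[List[str]],
         orders: Sequence[int] = (1, 2, 3, 4)) -> float:
    """Corpus-level BLEU over the given n-gram orders (one reference each)."""
    log_prec = 0.0
    for n in orders:
        matched = total = 0
        for hyp, ref in zip(hyps, refs):
            h_counts, r_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped matches: each n-gram counts at most as often as in ref.
            matched += sum((h_counts & r_counts).values())
            total += sum(h_counts.values())
        if matched == 0:
            return 0.0
        log_prec += math.log(matched / total) / len(orders)
    hyp_len = sum(len(h) for h in hyps)
    ref_len = sum(len(r) for r in refs)
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(log_prec)


if __name__ == "__main__":
    hyp = "koske+ +va+ +A mietintö+ +ä".split()
    ref = "koske+ +va+ +a mietintö+ +ä".split()
    print(bleu([hyp], [ref], orders=(1, 2)))  # with unigrams
    print(bleu([hyp], [ref], orders=(2,)))    # without unigrams
```

In this toy pair the score without unigrams is lower than the score with them, which mirrors the concern that many short, easily matched morphemes can inflate m-BLEU.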
Table 3 shows the evaluation scores for all the baselines and the methods introduced in this paper using standard word-based lowercase BLEU, WER and PER. We do better than Luong et al. (2010), the previous best score for this task. We also show a better relative improvement over our baseline when compared to Luong et al. (2010): a relative improvement of 4.86% for Unsup L-match compared to our baseline word-based model, compared to their 1.65% improvement over their baseline word-based model. Our best performing method used unsupervised morphology with L-match (see Section 2.2) and the improvement is significant: bootstrap resampling provides a confidence margin of ±0.77 and a t-test (Collins et al., 2005) showed significance with p = 0.001.

[Table 3: Test Scores: lowercase BLEU, WER and TER. The ∗ indicates a statistically significant improvement of BLEU score over the Baseline model. The boldface scores are the best performing scores per evaluation measure. The table values are not recoverable from the extracted text.]

3.1 Morphological Fluency Analysis

To see how well the models were doing at getting morphology right, we examined several patterns of morphological behavior. While we wish to explore minimally supervised morphological MT models, and use as little language-specific information as possible, we do want to use linguistic analysis on the output of our system to see how well the models capture essential morphological information in the target language. So, we ran the word-based baseline system, the segmented model (Unsup L-match), and the prediction model (CRF-LM) outputs, along with the reference translation, through the supervised morphological analyzer Omorfi (Pirinen and Listenmaa, 2007). Using this analysis, we looked at a variety of linguistic constructions that might reveal patterns in morphological behavior.
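The two calculations mentioned in the results, a bootstrap confidence interval and a relative improvement, can be sketched as below. The paper does not spell out its resampling procedure, so this is a generic percentile bootstrap under assumptions: per-sentence sufficient statistics are resampled with replacement, and a simple matched/total ratio stands in for BLEU. All names (`bootstrap_interval`, `match_ratio`, `relative_improvement`) and default parameters are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Stats = Tuple[int, int]  # (matched tokens, total tokens) for one sentence


def match_ratio(stats: Sequence[Stats]) -> float:
    """Corpus-level stand-in metric (not BLEU): matched / total."""
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0


def bootstrap_interval(stats: Sequence[Stats],
                       metric: Callable[[Sequence[Stats]], float],
                       samples: int = 1000, alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for a corpus-level metric."""
    rng = random.Random(seed)
    n = len(stats)
    scores: List[float] = []
    for _ in range(samples):
        # Resample sentences with replacement, then rescore the corpus.
        resampled = [stats[rng.randrange(n)] for _ in range(n)]
        scores.append(metric(resampled))
    scores.sort()
    lo = scores[int((alpha / 2) * samples)]
    hi = scores[min(samples - 1, int((1 - alpha / 2) * samples))]
    return lo, hi


def relative_improvement(system: float, baseline: float) -> float:
    """Relative improvement of `system` over `baseline`, in percent."""
    return 100.0 * (system - baseline) / baseline


if __name__ == "__main__":
    toy = [(3, 5), (4, 4), (1, 6), (2, 3)]
    print(bootstrap_interval(toy, match_ratio))
    print(relative_improvement(110.0, 100.0))
```

A margin such as the reported ±0.77 would correspond to half the width of an interval of this kind, computed on the real test-set statistics.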
These were: (a) explicitly marked noun forms, (b) noun-adjective case agreement, (c) subject-verb person/number agreement, (d) transitive object case marking, (e) postpositions, and (f) possession. In each of these categories, we looked for construction matches on a per-sentence level between the models' output and the reference translation.

Table 4 shows the models' performance on the constructions we examined. In all of the categories, the CRF-LM model achieves the best precision score, as we explain below, while the Unsup L-match model most frequently gets the highest recall score. A general pattern in the most prevalent of these constructions is that the baseline tends to prefer the least marked form for noun cases (corresponding to the nominative) more than the reference or the CRF-LM model. The baseline leaves nouns in the (unmarked) nominative far more than the reference, while the CRF-LM model comes much closer, so it seems to fare better at explicitly marking forms, rather than defaulting to the more frequent unmarked form.

Finnish adjectives must be marked with the same case as their head noun, while verbs must agree in person and number with their subject. We saw that in both these categories, the CRF-LM model outperforms for precision, while the segmented model gets the best recall. In addition, Finnish generally marks direct objects of verbs with the accusative or the partitive case; we observed more accusative/partitive-marked nouns following verbs in the CRF-LM output than in the baseline, as illustrated by example (1) in Fig. 3. While neither translation picks the same verb as in the reference for the input 'clarify', the CRF-LM output paraphrases it by using a grammatical construction of the transitive verb followed by a noun phrase inflected with the accusative case, correctly capturing the transitive construction. The baseline translation instead follows 'give' with a direct object in the nominative case.
To help clarify the constructions in question, we have used Google Translate to provide back-translations of our MT output into English; to contextualize these back-translations, we have provided Google's back-translation of the reference. [Footnote 6: http://translate.google.com/]

[Table 4 (the beginning of the caption and the table values are lost in the extracted text): ... of occurrences per sentence, also averaged over the various translations. P, R and F stand for precision, recall and F-score. The constructions are listed in descending order of their frequency in the texts. The highlighted value in each column is the most accurate with respect to the reference value.]

The use of postpositions shows another difference between the models. Finnish postpositions require the preceding noun to be in the genitive or sometimes partitive case, which occurs correctly more frequently in the CRF-LM than the baseline. In example (2) in Fig. 3, all three translations correspond to the English text, 'with the basque nationalists.' However, the CRF-LM output is more grammatical than the baseline, because not only do the adjective and noun agree for case, but the noun 'baskien' to which the postposition 'kanssa' belongs is marked with the correct genitive case. However, this well-formedness is not rewarded by BLEU, because 'baskien' does not match the reference.

In addition, while Finnish may express possession using case marking alone, it has another construction for possession; this can disambiguate an otherwise ambiguous clause. This alternate construction uses a pronoun in the genitive case followed by a possessive-marked noun; we see that the CRF-LM model correctly marks this construction more frequently than the baseline. As example (3) in Fig. 3 shows, while neither model correctly translates 'matkan' ('trip'), the baseline's output attributes the inessive 'yhteydessä' ('connection') as belonging to 'tulokset' ('results'), and misses marking the possession linking it to 'Commissioner Fischler'.
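The precision, recall and F-score figures reported for each construction can be computed along the following lines. The exact matching criterion is not given in the text, so this sketch assumes that, for each sentence, the number of matches is the construction count in the model output clipped by the count in the reference. The function name `construction_prf` and the example counts are hypothetical.

```python
from typing import Sequence, Tuple


def construction_prf(output_counts: Sequence[int],
                     reference_counts: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F-score for one construction type.

    Each list holds, per sentence, how often the construction occurs in
    the model output and in the reference translation respectively.
    """
    # Assumed criterion: per-sentence matches are clipped by the reference.
    matched = sum(min(o, r) for o, r in zip(output_counts, reference_counts))
    out_total = sum(output_counts)
    ref_total = sum(reference_counts)
    precision = matched / out_total if out_total else 0.0
    recall = matched / ref_total if ref_total else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy counts of one construction in three sentences.
    print(construction_prf([1, 0, 1], [2, 1, 1]))
```

Under this criterion a cautious system that marks a construction only when it is confident gets high precision and lower recall, which is the pattern the text describes for CRF-LM versus Unsup L-match.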
Our manual evaluation shows that the CRF-LM model is producing output translations that are more morphologically fluent than the word-based baseline and the segmented translation Unsup L-match system, even though the word choices lead to a lower BLEU score overall when compared to Unsup L-match.

(1) Input: 'the charter we are to approve today both strengthens and gives visible shape to the common fundamental rights and values our community is to be based upon.'
a. Reference: perusoikeuskirja , jonka tänään aiomme hyväksyä , sekä vahvistaa että selventää (selventää/VERB/ACT/INF/SG/LAT-clarify) niitä (ne/PRONOUN/PL/PAR-them) yhteisiä perusoikeuksia ja arvoja , joiden on oltava yhteisömme perusta.
Back-translation: 'Charter of Fundamental Rights, which today we are going to accept that clarify and strengthen the common fundamental rights and values, which must be community based.'
b. Baseline: perusoikeuskirja me hyväksymme tänään molemmat vahvistaa ja antaa (antaa/VERB/INF/SG/LAT-give) näkyvä (näkyä/VERB/ACT/PCP/SG/NOM-visible) muokata yhteistä perusoikeuksia ja arvoja on perustuttava.
Back-translation: 'Charter today, we accept both confirm and modify to make a visible and common values, fundamental rights must be based.'
c. CRF-LM: perusoikeuskirja on hyväksytty tänään , sekä vahvistaa ja antaa (antaa/VERB/ACT/INF/SG/LAT-give) konkreettisen (konkreettinen/ADJECTIVE/SG/GEN,ACC-concrete) muodon (muoto/NOUN/SG/GEN,ACC-shape) yhteisiä perusoikeuksia ja perusarvoja , yhteisön on perustuttava.
Back-translation: 'Charter has been approved today, and to strengthen and give concrete shape to the common basic rights and fundamental values, the Community must be based.'

(2) Input: 'with the basque nationalists'
a. Reference: baskimaan kansallismielisten kanssa
basque-SG/NOM+land-SG/GEN,ACC nationalists-PL/GEN with-POST
b. Baseline: baskimaan kansallismieliset kanssa
basque-SG/NOM-+land-SG/GEN,ACC kansallismielinen-PL/NOM,ACC-nationalists POST-with
c. CRF-LM: kansallismielisten baskien kanssa
nationalists-PL/GEN basques-PL/GEN with-POST

(3) Input: 'and in this respect we should value the latest measures from commissioner fischler , the results of his trip to morocco on the 26th of last month and the high level meetings that took place, including the one with the king himself'
a. Reference: ja tässä mielessä osaamme myös arvostaa komission jäsen fischlerin viimeisimpiä toimia , jotka ovat hänen (hänen/GEN-his) marokkoon 26 lokakuuta tekemänsä (tekemänsä/POSS-his) matkan (matkan/GEN-tour) ja korkean tason kokousten jopa itsensä kuninkaan kanssa tulosta
Back-translation: 'and in this sense we can also appreciate the Commissioner Fischler's latest actions, which are his to Morocco 26 October trip to high-level meetings and even the king himself with the result'
b. Baseline: ja tässä yhteydessä olisi arvoa viimeisin toimia komission jäsen fischler , tulokset monitulkintaisia marokon yhteydessä (yhteydessä/INE-connection) , ja viime kuussa pidettiin korkean tason kokouksissa , mukaan luettuna kuninkaan kanssa
Back-translation: 'and in this context would be the value of the last act, Commissioner Fischler, the results of the Moroccan context, ambiguous, and last month held high level meetings, including with the king'
c. CRF-LM: ja tässä yhteydessä meidän olisi lisäarvoa viimeistä toimenpiteitä kuin komission jäsen fischler , että hänen (hänen/GEN-his) kokemuksensa (kokemuksensa/POSS-experience) marokolle (marokolle-Moroccan) viime kuun 26 ja korkean tason tapaamiset järjestettiin, kuninkaan kanssa
Back-translation: 'and in this context, we should value the last measures as the Commissioner Fischler, that his experience in Morocco has on the 26th and high-level meetings took place, including with the king.'

Figure 3: Morphological fluency analysis (see Section 3.1).

4 Related Work

The work on morphology in MT can be grouped into three categories: factored models, segmented translation, and morphology generation.

Factored models (Koehn and Hoang, 2007) factor the phrase translation probabilities over additional information annotated to each word, allowing for text to be represented on multiple levels of analysis. We discussed the drawbacks of factored models for our task in Section 2.1. While (Koehn and Hoang, 2007; Yang and Kirchhoff, 2006; Avramidis and Koehn, 2008) obtain improvements using factored models for translation into English, German, Spanish, and Czech, these models may be less useful for capturing long-distance dependencies in languages with much more complex morphological systems such as Finnish. In our experiments factored models did worse than the baseline.

Segmented translation performs morphological analysis on the morphologically complex text for use in the translation model (Brown et al., 1993; Goldwater and McClosky, 2005; de Gispert and Mariño, 2008). This method unpacks complex forms into simpler, more frequently occurring components, and may also increase the symmetry of the lexically realized content between source and target. In a somewhat orthogonal approach to ours, Ma et al. (2007) use alignment of a parallel text to pack together adjacent segments in the alignment output, which are then fed back to the word aligner to bootstrap an improved alignment, which is then used in the translation model. We compared our results against Luong et al. (2010) in Table 3 since their results are directly comparable to ours.
They use a segmented phrase table and language model along with the word-based versions in the decoder and in tuning a Finnish target. Their approach requires segmented phrases to match word boundaries, eliminating morphologically productive phrases. In their work a segmented language model can score a translation, but cannot insert morphology that does not show source-side reflexes. In order to perform a similar experiment that still allowed for morphologically productive phrases, we tried training a segmented translation model, the output of which we stitched up in tuning so as to tune to a word-based reference. The goal of this experiment was to control the segmented model's tendency to overfit by rewarding it for using correct whole-word forms. However, we found that this approach was less successful than using the segmented reference in tuning, and could not meet the baseline (13.97% BLEU best tuning score, versus 14.93% BLEU for the baseline best tuning score).

Previous work in segmented translation has often used linguistically motivated morphological analysis selectively applied based on a language-specific heuristic. A typical approach is to select a highly inflecting class of words and segment them for particular morphology (de Gispert and Mariño, 2008; Ramanathan et al., 2009). Popović and Ney (2004) perform segmentation to reduce morphological complexity of the source to translate into an isolating target, reducing the translation error rate for the English target. For Czech-to-English, Goldwater and McClosky (2005) lemmatized the source text and inserted a set of 'pseudowords' expected to have lexical reflexes in English.

Minkov et al. (2007) and Toutanova et al. (2008) use a Maximum Entropy Markov Model for morphology generation.
The main drawback to this approach is that it removes morphological information from the translation model (which only uses stems); this can be a problem for languages in which morphology expresses lexical content. de Gispert and Mariño (2008) use a language-specific targeted morphological classifier for Spanish verbs to avoid this issue. Talbot and Osborne (2006) use clustering to group morphological variants of words for word alignments and for smoothing phrase translation tables. Habash (2007) provides various methods to incorporate morphological variants of words in the phrase table in order to help recognize out-of-vocabulary words in the source language.

5 Conclusion and Future Work

We found that using a segmented translation model based on unsupervised morphology induction and a model that combined morpheme segments in the translation model with a post-processing morphology prediction model gave us better BLEU scores than a word-based baseline. Using our proposed approach we obtain better scores than the state of the art on the English-Finnish translation task (Luong et al., 2010): from 14.82% BLEU to 15.09%, while using a simpler model. We show that using morphological segmentation in the translation model can improve output translation scores. We also demonstrate that for Finnish (and possibly other agglutinative languages), phrase-based MT benefits from allowing the translation model access to morphological segmentation yielding productive morphological phrases. Taking advantage of linguistic analysis of the output, we show that using a post-processing morphology generation model can improve translation fluency on a sub-word level, in a manner that is not captured by the BLEU word-based evaluation measure.
In order to help with replication of the results in this paper, we have run the various morphological analysis steps and created the necessary training, tuning and test data files needed in order to train, tune and test any phrase-based machine translation system with our data. The files can be downloaded from natlang.cs.sfu.ca.

In future work we hope to explore the utility of phrases with productive morpheme boundaries and explore why they are not used more pervasively in the decoder. Evaluation measures for morphologically complex languages and tuning to those measures are also important future work directions. Also, we would like to explore a non-pipelined approach to morphological pre- and post-processing so that a globally trained model could be used to remove the target side morphemes that would improve the translation model and then predict those morphemes in the target language.

Acknowledgements

This research was partially supported by NSERC, Canada (RGPIN: 264905) and a Google Faculty Award. We would like to thank Christian Monson, Franz Och, Fred Popowich, Howard Johnson, Majid Razmara, Baskaran Sankaran and the anonymous reviewers for their valuable comments on this work. We would particularly like to thank the developers of the open-source Moses machine translation toolkit and the Omorfi morphological analyzer for Finnish which we used for our experiments.

References

Eleftherios Avramidis and Philipp Koehn. 2008. Enriching morphologically poor languages for statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 763–770, Columbus, Ohio, USA. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 224–232, Columbus, Ohio, June. Association for Computational Linguistics.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of 43rd Annual Meeting of the Association for Computational Linguistics (ACL05). Association for Computational Linguistics.

Mathias Creutz and Krista Lagus. 2005. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR '05), pages 106–113, Espoo, Finland.

Mathias Creutz and Krista Lagus. 2006. Morfessor in the morpho challenge. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes.

Adrià de Gispert and José Mariño. 2008. On the impact of morphology in English to Spanish statistical MT. Speech Communication, 50(11-12).

Sharon Goldwater and David McClosky. 2005. Improving statistical MT through morphological analysis. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 676–683, Vancouver, B.C., Canada. Association for Computational Linguistics.

Nizar Habash. 2007. Four techniques for online handling of out-of-vocabulary words in Arabic-English statistical machine translation. In Proceedings of the 46th Annual Meeting of the Association of Computational Linguistics, Columbus, Ohio. Association for Computational Linguistics.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 868–876, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, pages 79–86, Phuket, Thailand. Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289, San Francisco, California, USA. Association for Computing Machinery.

Minh-Thang Luong, Preslav Nakov, and Min-Yen Kan. 2010. A hybrid morpheme-word representation for machine translation of morphologically rich languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 148–157, Cambridge, Massachusetts. Association for Computational Linguistics.

Yanjun Ma, Nicolas Stroppa, and Andy Way. 2007. Bootstrapping word alignment via word packing. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 304–311, Prague, Czech Republic. Association for Computational Linguistics.

Einat Minkov, Kristina Toutanova, and Hisami Suzuki. 2007. Generating complex morphology for machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL07), pages 128–135, Prague, Czech Republic. Association for Computational Linguistics.

Christian Monson. 2008. Paramor and morpho challenge 2008. In Lecture Notes in Computer Science: Workshop of the Cross-Language Evaluation Forum (CLEF 2008), Revised Selected Papers.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Tommi Pirinen and Inari Listenmaa. 2007. Omorfi morphological analyzer. http://gna.org/projects/omorfi.

Maja Popović and Hermann Ney. 2004. Towards the use of word stems and suffixes for statistical machine translation. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pages 1585–1588, Lisbon, Portugal. European Language Resources Association (ELRA).

Ananthakrishnan Ramanathan, Hansraj Choudhary, Avishek Ghosh, and Pushpak Bhattacharyya. 2009. Case markers and morphology: Addressing the crux of the fluency problem in English-Hindi SMT. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 800–808, Suntec, Singapore. Association for Computational Linguistics.

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. 7th International Conference on Spoken Language Processing, 3:901–904.

David Talbot and Miles Osborne. 2006. Modelling lexical redundancy for machine translation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 969–976, Sydney, Australia, July. Association for Computational Linguistics.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying morphology generation models to machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 514–522, Columbus, Ohio, USA. Association for Computational Linguistics.

Mei Yang and Katrin Kirchhoff. 2006. Phrase-based backoff models for machine translation of highly inflected languages. In Proceedings of the European Chapter of the Association for Computational Linguistics, pages 41–48, Trento, Italy. Association for Computational Linguistics.

5 0.15634091 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

Author: Xianchao Wu ; Takuya Matsuzaki ; Jun'ichi Tsujii

Abstract: In the present paper, we propose the effective usage of function words to generate generalized translation rules for forest-based translation. Given aligned forest-string pairs, we extract composed tree-to-string translation rules that account for multiple interpretations of both aligned and unaligned target function words. In order to constrain the exhaustive attachments of function words, we restrict their binding to the nearby syntactic chunks yielded by a target dependency parser. Therefore, the proposed approach can not only capture source-tree-to-target-chunk correspondences but can also use forest structures that compactly encode an exponential number of parse trees to properly generate target function words during decoding. Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.

6 0.15471932 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

7 0.14403404 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

8 0.14166999 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

9 0.1374066 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

10 0.1311442 313 acl-2011-Two Easy Improvements to Lexical Weighting

11 0.12413662 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

12 0.11671352 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

13 0.11658716 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

14 0.11632533 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality

15 0.11280161 61 acl-2011-Binarized Forest to String Translation

16 0.11261449 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation

17 0.11236656 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

18 0.11173163 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

19 0.1104916 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

20 0.11000509 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.315), (1, -0.156), (2, 0.072), (3, -0.043), (4, 0.002), (5, 0.006), (6, 0.008), (7, -0.024), (8, -0.002), (9, 0.039), (10, -0.032), (11, -0.032), (12, -0.071), (13, 0.034), (14, 0.069), (15, -0.062), (16, -0.015), (17, 0.077), (18, -0.083), (19, 0.008), (20, 0.011), (21, -0.041), (22, -0.008), (23, -0.077), (24, -0.025), (25, 0.054), (26, -0.009), (27, -0.05), (28, -0.012), (29, 0.02), (30, 0.044), (31, 0.039), (32, -0.027), (33, -0.021), (34, -0.023), (35, 0.043), (36, -0.004), (37, -0.03), (38, 0.053), (39, 0.071), (40, -0.034), (41, 0.1), (42, -0.013), (43, -0.138), (44, 0.013), (45, -0.031), (46, 0.036), (47, -0.036), (48, 0.07), (49, 0.009)]
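The page does not say how the simValue scores below are derived from these topic weights. As a rough illustration only, the sketch below assumes cosine similarity over sparse (topicId, topicWeight) pairs, which is a common choice for LSI vectors. The function name `cosine_similarity` and the `other_paper` vector are hypothetical; `this_paper` reuses the first four weights listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (topicId, weight) pairs."""
    wa, wb = dict(a), dict(b)
    # Topics missing from one vector contribute zero to the dot product.
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# First four LSI weights of this paper, as listed above.
this_paper = [(0, 0.315), (1, -0.156), (2, 0.072), (3, -0.043)]
# Hypothetical weights for another paper, for illustration only.
other_paper = [(0, 0.290), (1, -0.120), (3, 0.010)]

print(cosine_similarity(this_paper, other_paper))
```

Under this assumption, a vector compared with itself scores 1.0 (up to floating-point rounding), and vectors with no shared topics score 0.0.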

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95521265 44 acl-2011-An exponential translation model for target language morphology

Author: Michael Subotin

2 0.81291366 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

Author: Andreas Zollmann ; Stephan Vogel

3 0.75825155 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

Author: Markos Mylonakis ; Khalil Sima'an

4 0.74309826 313 acl-2011-Two Easy Improvements to Lexical Weighting

Author: David Chiang ; Steve DeNeefe ; Michael Pust

Abstract: We introduce two simple improvements to the lexical weighting features of Koehn, Och, and Marcu (2003) for machine translation: one which smooths the probability of translating word f to word e by simplifying English morphology, and one which conditions it on the kind of training data that f and e co-occurred in. These new variations lead to improvements of up to +0.8 BLEU, with an average improvement of +0.6 BLEU across two language pairs, two genres, and two translation systems.

5 0.70732123 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction

Author: Ann Clifton ; Anoop Sarkar

6 0.69583094 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations

7 0.69291061 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

8 0.68891436 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers

9 0.67514509 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

10 0.66536933 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

11 0.65680659 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar

12 0.65383148 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

13 0.65064234 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

14 0.64841753 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System

15 0.63860875 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

16 0.63724089 154 acl-2011-How to train your multi bottom-up tree transducer

17 0.63222867 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

18 0.63146609 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

19 0.62827051 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

20 0.62664646 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.039), (17, 0.068), (26, 0.029), (31, 0.025), (37, 0.106), (39, 0.056), (41, 0.05), (55, 0.052), (59, 0.054), (72, 0.052), (80, 0.193), (91, 0.035), (96, 0.152), (97, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93726134 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality

Author: Sarah Alkuhlani ; Nizar Habash

Abstract: We present an enriched version of the Penn Arabic Treebank (Maamouri et al., 2004), where latent features necessary for modeling morpho-syntactic agreement in Arabic are manually annotated. We describe our process for efficient annotation, and present the first quantitative analysis of Arabic morphosyntactic phenomena.

same-paper 2 0.83727646 44 acl-2011-An exponential translation model for target language morphology

Author: Michael Subotin

3 0.79857844 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

Author: Yuval Marton ; Nizar Habash ; Owen Rambow

Abstract: We explore the contribution of morphological features, both lexical and inflectional, to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and the undiacritized lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of form-based and functional features, and show that functional gender and number (e.g., “broken plurals”) and the related rationality feature improve over form-based features. It is the first time functional morphological features are used for Arabic NLP.

4 0.79224479 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

Author: Joel Lang ; Mirella Lapata

Abstract: In this paper we describe an unsupervised method for semantic role induction which holds promise for relieving the data acquisition bottleneck associated with supervised role labelers. We present an algorithm that iteratively splits and merges clusters representing semantic roles, thereby leading from an initial clustering to a final clustering of better quality. The method is simple, surprisingly effective, and allows linguistic knowledge to be integrated transparently. By combining role induction with a rule-based component for argument identification we obtain an unsupervised end-to-end semantic role labeling system. Evaluation on the CoNLL 2008 benchmark dataset demonstrates that our method outperforms competitive unsupervised approaches by a wide margin.

5 0.75102633 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic

Author: Muhammad Abdul-Mageed ; Mona Diab ; Mohammed Korayem

Abstract: Although Subjectivity and Sentiment Analysis (SSA) has been witnessing a flurry of novel research, there are few attempts to build SSA systems for Morphologically-Rich Languages (MRL). In the current study, we report efforts to partially fill this gap. We present a newly developed manually annotated corpus of Modern Standard Arabic (MSA) together with a new polarity lexicon. The corpus is a collection of newswire documents annotated on the sentence level. We also describe an automatic SSA tagging system that exploits the annotated data. We investigate the impact of different levels of preprocessing settings on the SSA classification task. We show that by explicitly accounting for the rich morphology the system is able to achieve significantly higher levels of performance.

6 0.7493642 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

7 0.73862708 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

8 0.73794073 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

9 0.73755443 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

10 0.73741621 311 acl-2011-Translationese and Its Dialects

11 0.73701769 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

12 0.73643267 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

13 0.73517025 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.7345897 133 acl-2011-Extracting Social Power Relationships from Natural Language

15 0.73453456 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

16 0.73370731 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

17 0.73279965 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

18 0.73199552 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates

19 0.73189193 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

20 0.73068762 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations