acl acl2013 acl2013-330 knowledge-graph by maker-knowledge-mining

330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages


Source: pdf

Author: Zhiyang Wang ; Yajuan Lu ; Meng Sun ; Qun Liu

Abstract: Current translation models are mainly designed for languages with limited morphology, and are not readily applicable to agglutinative languages because of the difference in the way lexical forms are generated. In this paper, we propose a novel approach for translating agglutinative languages by treating stems and affixes differently. We employ the stem as the atomic translation unit to alleviate data sparseness. In addition, we associate each stem-granularity translation rule with a distribution of related affixes, and select desirable rules according to the similarity of their affix distributions with given spans to be translated. Experimental results show that our approach significantly improves the translation performance on tasks of translating from three Turkic languages to Chinese.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Stem Translation with Affix-Based Rule Selection for Agglutinative Languages Zhiyang Wang†, Yajuan Lü†, Meng Sun†, Qun Liu‡† †Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences P. [sent-1, score-0.024]

2 cn ct ng,lvyajuan,sunmeng,liuqun}@ ‡Centre for Next Generation Localisation, Faculty of Engineering and Computing, Dublin City University qliu@computing. [sent-5, score-0.024]

3 ie Abstract Current translation models are mainly designed for languages with limited morphology, and are not readily applicable to agglutinative languages because of the difference in the way lexical forms are generated. [sent-7, score-0.604]

4 In this paper, we propose a novel approach for translating agglutinative languages by treating stems and affixes differently. [sent-8, score-0.528]

5 We employ the stem as the atomic translation unit to alleviate data sparseness. [sent-9, score-0.728]

6 In addition, we associate each stem-granularity translation rule with a distribution of related affixes, and select desirable rules according to the similarity of their affix distributions with given spans to be translated. [sent-10, score-1.158]

7 Experimental results show that our approach significantly improves the translation performance on tasks of translating from three Turkic languages to Chinese. [sent-11, score-0.336]

8 1 Introduction Currently, most methods on statistical machine translation (SMT) are developed for translation of languages with limited morphology (e. [sent-12, score-0.576]

9 They assumed that word was the atomic translation unit (ATU), always ignoring the internal morphological structure of word. [sent-15, score-0.483]

10 , 2003), hierarchical (Chiang, 2005) and syntactic (Quirk et al. [sent-18, score-0.025]

11 These improved models worked well for translating languages like English with large scale parallel corpora available. [sent-22, score-0.119]

12 Unlike in languages with limited morphology, words in agglutinative languages are formed mainly by concatenating stems and affixes. [sent-23, score-0.448]

13 Generally, a stem can attach to several affixes, leading to tens of hundreds of possible inflected word forms for a single stem. [sent-24, score-0.39]

14 In theory, morphological analysis and larger bilingual corpora could alleviate the problem of data sparsity, but most agglutinative languages are under-studied and suffer from resource scarcity. [sent-26, score-0.475]

15 These works still assume that the atomic translation unit is the word, stem or morpheme, without considering the difference between stems and affixes. [sent-29, score-0.711]

16 In agglutinative languages, the stem is the base part of a word, excluding inflectional affixes. [sent-30, score-0.618]

17 Affixes, especially inflectional affixes, indicate grammatical categories such as tense, person, number and case. [sent-31, score-0.049]

18 Therefore, we employ stem as the atomic translation unit and use affix information to guide translation rule selection. [sent-33, score-1.718]

19 Stem-granularity translation rules have much larger coverage and can lower the OOV rate. [sent-34, score-0.257]

20 Affix based rule selection takes advantage of auxiliary syntactic roles of affixes to make a better rule selection. [sent-35, score-0.464]

21 In this way, we can achieve a balance between rule coverage and matching accuracy, and ultimately improve the translation performance. [sent-36, score-0.396]

22 [Figure 1 (B): translation rules with affix distributions, for the source stems “zunyi yighin”] [sent-55, score-1.029]

23 Figure 1: Translation rule extraction from Uyghur. Here tag “/STM” represents stem and “/SUF” means suffix. [sent-68, score-0.179]

24 2 Affix Based Rule Selection Model Figure 1 (B) shows two translation rules along with affix distributions. [sent-70, score-1.204]

25 Here a translation rule contains three parts: the source part (on stem level), the target part, and the related affix distribution (represented as a vector). [sent-71, score-1.413]

26 We can see that, although the source parts of the two translation rules are identical, their affix distributions are quite different. [sent-72, score-0.927]

27 Affix “gha” in the first rule indicates that something is affiliated to a subject, similar to “of” in English. [sent-73, score-0.179]

28 ” to be translated, we hope to encourage our model to select the second translation rule. [sent-78, score-0.217]

29 We can achieve this by calculating similarity between the affix distributions of the translation rule and the span. [sent-79, score-1.073]

30 The affix distribution can be obtained by keeping the related affixes for each rule instance during translation rule extraction ((A) in Figure 1). [sent-80, score-1.346]

31 After extracting and scoring stem-granularity rules in a traditional way, we extract stem-granularity rules again by keeping affix information and compute the affix distribution with tf-idf (Salton and Buckley, 1987). [sent-81, score-1.365]

32 Finally, the affix distribution will be added to the previous stem-granularity rules. [sent-82, score-0.665]

33 1 Affix Distribution Estimation Formally, translation rule instances with the same source part can be treated as a document collection, so each rule instance in the collection is (Footnote 1: We employ concepts from text classification to illustrate how to estimate affix distribution.) [sent-84, score-1.272]

34 Our goal is to classify the source parts into the target parts on the document collection level with the help of affix distribution. [sent-86, score-0.62]

35 Accordingly, we employ vector space model (VSM) to represent affix distribution of each rule instance. [sent-87, score-0.896]

36 In this model, the feature weights are represented by the classic tf-idf (Salton and Buckley, 1987): $$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad idf_{i,j} = \log \frac{|D|}{|\{j : a_i \in r_j\}|}, \qquad tfidf_{i,j} = tf_{i,j} \times idf_{i,j} \quad (1)$$ where $tfidf_{i,j}$ is the weight of affix $a_i$ in translation rule instance $r_j$. [sent-88, score-1.053]

37 $n_{i,j}$ indicates the number of occurrences of affix $a_i$ in $r_j$. [sent-89, score-0.657]

38 $|D|$ is the number of rule instances with the same source part, and $|\{j : a_i \in r_j\}|$ is the number of rule instances which contain affix $a_i$ within $D$. [sent-90, score-0.467]
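
The tf-idf weighting of equation (1) can be illustrated with a minimal Python sketch. It assumes each rule instance is given simply as a list of the affixes kept during extraction; the function name `affix_tfidf` and the toy affixes "ning" and "da" are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def affix_tfidf(rule_instances):
    """Compute a tf-idf affix vector per rule instance (equation 1).

    rule_instances: list of affix lists, one list per rule instance;
    all instances are assumed to share the same source part.
    """
    num_instances = len(rule_instances)            # |D|
    doc_freq = Counter()                           # |{j : a_i in r_j}|
    for affixes in rule_instances:
        doc_freq.update(set(affixes))

    vectors = []
    for affixes in rule_instances:
        counts = Counter(affixes)                  # n_{i,j}
        total = sum(counts.values())               # sum_k n_{k,j}
        vectors.append({
            affix: (n / total) * math.log(num_instances / doc_freq[affix])
            for affix, n in counts.items()
        })
    return vectors

# Toy data (hypothetical): three rule instances with the same source part.
instances = [["gha"], ["gha", "ning"], ["da"]]
vectors = affix_tfidf(instances)
```

An affix that occurs in every instance gets weight zero under this definition, since its idf term is log(1).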

39 We assume that there are only three instances of translation rules extracted from parallel corpus ((A) in Figure 1). [sent-92, score-0.257]

40 Given a set of N translation rule instances with the same source and target part, we define the centroid vector $d_r$ according to the centroid-based classification algorithm (Han and Karypis, 2000): $$d_r = \frac{1}{N} \sum_{i} d_i \quad (2)$$ [sent-98, score-0.439]
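
As a rough sketch of equation (2), the centroid can be computed by averaging sparse affix vectors. The dictionary representation and the name `centroid` are assumptions made for illustration only.

```python
def centroid(vectors):
    """Average a list of sparse affix vectors (dicts mapping affix -> weight)."""
    n = len(vectors)                      # N rule instances
    totals = {}
    for vec in vectors:
        for affix, weight in vec.items():
            totals[affix] = totals.get(affix, 0.0) + weight
    # An affix missing from an instance contributes zero to the sum.
    return {affix: weight / n for affix, weight in totals.items()}

# Hypothetical weights for two instances of one translation rule.
d_r = centroid([{"gha": 0.4}, {"gha": 0.2, "ning": 0.6}])
```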

41 ∗N means the number of references; morph is short for morpheme. [sent-124, score-0.062]

42 By comparing the similarity of affix distributions, we are able to decide whether a translation rule is suitable for a span to be translated. [sent-127, score-1.076]

43 In this work, similarity is measured using the cosine similarity metric, given by $$sim(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \times \|d_2\|} \quad (3)$$ where $d_i$ corresponds to a vector indicating affix distribution, and “·” denotes the inner product of the two vectors. [sent-128, score-0.711]
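
Equation (3) can be sketched in Python over the same sparse dictionary representation. Returning 0.0 when either vector is all zeros is an assumption of this sketch; the text does not say how that case is handled.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine similarity between two sparse affix vectors (equation 3)."""
    dot = sum(weight * d2.get(affix, 0.0) for affix, weight in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumption: an empty affix vector matches nothing
    return dot / (norm1 * norm2)

same_direction = cosine_similarity({"gha": 1.0}, {"gha": 2.0})
disjoint = cosine_similarity({"gha": 1.0}, {"da": 1.0})
```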

44 Therefore, for a specific span to be translated, we first analyze it to get the corresponding stem sequence and related affix distribution represented as a vector. [sent-129, score-1.02]

45 Then the stem sequence is used to search the translation rule table. [sent-130, score-0.723]

46 If the source part is matched, the similarity will be calculated for each candidate translation rule by cosine similarity (as in equation 3). [sent-131, score-0.485]

47 Therefore, in addition to the traditional translation features on stem level, our model also adds the affix similarity score as a dynamic feature into the log-linear model (Och and Ney, 2002). [sent-132, score-1.196]
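
The lookup-and-score procedure described above might be sketched as follows. Everything here is a simplification under stated assumptions: the rule table is a plain dictionary keyed by stem sequence, `base_score` stands in for the combined traditional stem-level features, the two weights stand in for tuned log-linear weights, and the targets `TARGET_A` and `TARGET_B` are placeholders.

```python
import math

def cosine(d1, d2):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * d2.get(a, 0.0) for a, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def score_candidates(stems, span_affixes, rule_table,
                     base_weight=1.0, affix_weight=1.0):
    """Rank rules matching a stem sequence, adding affix similarity as a feature."""
    scored = []
    for rule in rule_table.get(tuple(stems), []):
        sim = cosine(span_affixes, rule["affixes"])   # dynamic feature
        total = base_weight * rule["base_score"] + affix_weight * sim
        scored.append((total, rule["target"], sim))
    return sorted(scored, reverse=True)

# Hypothetical rule table: two rules with an identical source part
# but different affix distributions.
rule_table = {
    ("zunyi", "yighin"): [
        {"target": "TARGET_A", "base_score": -1.0, "affixes": {"gha": 1.0}},
        {"target": "TARGET_B", "base_score": -1.0, "affixes": {"da": 1.0}},
    ]
}
ranked = score_candidates(["zunyi", "yighin"], {"da": 1.0}, rule_table)
```

With equal base scores, the rule whose affix distribution matches the span ranks first, which is the behaviour the affix similarity feature is meant to encourage.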

48 3 Related Work Most previous work on agglutinative language translation mainly focus on Turkish and Finnish. [sent-133, score-0.46]

49 Bisazza and Federico (2009) and Mermer and Saraclar (2011) optimized morphological analysis as a pre-processing step to improve the translation between Turkish and English. [sent-134, score-0.377]

50 Yeniterzi and Oflazer (2010) mapped the syntax of the English side to the morphology of the Turkish side with the factored model (Koehn and Hoang, 2007). [sent-135, score-0.079]

51 Yang and Kirchhoff (2006) backed off surface form to stem when translating OOV words of Finnish. [sent-136, score-0.374]

52 (2010) focused on Finnish-English translation through improving word alignment and enhancing phrase table. [sent-138, score-0.274]

53 These works still assumed that the atomic translation unit is the word, stem or morpheme, without considering the difference between stems and affixes. [sent-139, score-0.711]

54 There is also some work that employed context information to make a better choice of translation rules (Carpuat and Wu, 2007; Chan et al. [sent-140, score-0.257]

55 , and experiments were mostly done on less inflectional languages (i. [sent-145, score-0.121]

56 4 Experiments In this work, we conduct our experiments on three different agglutinative languages, including Uyghur, Kazakh and Kirghiz. [sent-150, score-0.217]

57 About 24 million people speak these languages as their mother tongue. [sent-152, score-0.072]

58 Table 2: Translation results from Turkic languages to Chinese. [sent-169, score-0.072]

59 word: ATU is the surface form; stem: ATU is the represented stem; morph: ATU is the morpheme; affix: stem translation with affix distribution similarity. [sent-170, score-1.209]

60 1 Using Unsupervised Morphological Analyzer As most agglutinative languages are resource-poor, we employ an unsupervised learning method to obtain the morphological structure. [sent-177, score-0.547]

61 , 2007), we employ the Morfessor Categories-MAP algorithm (Creutz and Lagus, 2005). [sent-179, score-0.052]

62 It applies a hierarchical model with three categories (prefix, stem, and suffix) in an unsupervised way. [sent-180, score-0.071]

63 From Table 1 we can see that the vocabulary sizes of the three languages are clearly reduced after unsupervised morphological analysis. [sent-181, score-0.278]

64 All three translation tasks show clear improvements with the proposed model, which always performs better than using only word, stem or morph. [sent-183, score-0.596]

65 For the Uyghur to Chinese translation (UY-CH) task in Table 2, performances after unsupervised morphological analysis are always better than the baseline. [sent-184, score-0.423]

66 6 BLEU points improvements with affix compared to the baseline. [sent-186, score-0.62]

67 For the Kazakh to Chinese translation (KA-CH) task, the improvements are also significant. [sent-187, score-0.217]

68 As for the Kirghiz to Chinese translation (KI-CH) task, improvements seem relatively small compared to the other two language pairs. [sent-191, score-0.217]

69 Table 3: Statistics of training corpus after unsupervised (Unsup) and supervised (Sup) morphological analysis. [sent-199, score-0.16]

70 Figure 2: Uyghur to Chinese translation results after unsupervised and supervised analysis. [sent-201, score-0.299]

71 2 Using Supervised Morphological Analyzer Taking it further, we also want to see the effect of supervised analysis on our model. [sent-203, score-0.036]

72 A generative statistical model of morphological analysis for Uyghur was developed according to (Mairehaba et al. [sent-204, score-0.197]

73 Table 3 shows the difference of statistics of training corpus after supervised and unsupervised analysis. [sent-206, score-0.082]

74 The supervised method generates fewer types of stems and affixes than the unsupervised approach. [sent-207, score-0.213]

75 As we can see from Figure 2, except for the morph method, stem and affix based approaches perform better after supervised analysis. [sent-208, score-1.045]

76 The results show that our approach can obtain even better translation performance if better morphological analyzers are available. [sent-209, score-0.377]

77 Supervised morphological analysis generates more meaningful morphemes, which leads to better disambiguation of translation rules. [sent-210, score-0.377]

78 5 Conclusions and Future Work In this paper we propose a novel framework for agglutinative language translation by treating stem and affix differently. [sent-211, score-1.406]

79 We employ the stem sequence as the main part for training and decoding. [sent-212, score-0.404]

80 Besides, we associate each stem-granularity translation rule with an affix distribution, which could be used to make better translation decisions by calculating the affix distribution similarity between the rule and the instance to be translated. [sent-213, score-2.109]

81 We evaluate our model on three different language pairs, and it substantially improved the translation performance on all of them. [sent-214, score-0.217]

82 07/CE/I1142) as part of the CNGL at Dublin City University. [sent-221, score-0.025]

83 Morphological pre-processing for Turkish to English statistical machine translation. [sent-225, score-0.037]

84 Inducing the morphological lexicon of a natural language from unannotated text. [sent-251, score-0.16]

85 A joint rule selection model for hierarchical phrase-based translation. [sent-255, score-0.204]

86 In Proceedings of ACL, Short Papers, pages 6–11. [sent-256, score-0.032]

87 Scalable inference and training of context-rich syntactic translation models. [sent-259, score-0.217]

88 In Proceedings of NAACL, Short Papers, pages 49– 52. [sent-268, score-0.032]

89 Improving statistical machine translation using lexicalized rule selection. [sent-275, score-0.433]

90 A hybrid morpheme-word representation for machine translation of morphologically rich languages. [sent-303, score-0.217]

91 Discriminative training and maximum entropy models for statistical machine translation. [sent-315, score-0.037]

92 Morphology-aware statistical machine translation based on morphs induced in an unsupervised manner. [sent-346, score-0.3]

93 Multi-granularity word alignment and decoding for agglutinative language translation. [sent-350, score-0.246]

94 Phrase-based backoff models for machine translation of highly inflected languages. [sent-354, score-0.28]

95 Syntax-to-morphology mapping in factored phrase-based statistical machine translation from English to Turkish. [sent-358, score-0.3]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('affix', 0.62), ('stem', 0.327), ('translation', 0.217), ('agglutinative', 0.217), ('uyghur', 0.189), ('rule', 0.179), ('morphological', 0.16), ('yighin', 0.135), ('zunyi', 0.135), ('gha', 0.108), ('luong', 0.108), ('affixes', 0.106), ('atu', 0.083), ('bisazza', 0.081), ('kazakh', 0.081), ('turkic', 0.081), ('turkish', 0.077), ('qun', 0.073), ('zhiyang', 0.072), ('creutz', 0.072), ('languages', 0.072), ('atomic', 0.065), ('inflected', 0.063), ('morph', 0.062), ('dre', 0.062), ('stems', 0.061), ('idfgha', 0.054), ('igha', 0.054), ('kirghiz', 0.054), ('mairehaba', 0.054), ('mathias', 0.054), ('mermer', 0.054), ('tfgha', 0.054), ('virpioja', 0.054), ('employ', 0.052), ('salton', 0.05), ('och', 0.05), ('morpheme', 0.049), ('inflectional', 0.049), ('yeniterzi', 0.048), ('uy', 0.048), ('translating', 0.047), ('factored', 0.046), ('unsupervised', 0.046), ('distribution', 0.045), ('buckley', 0.044), ('chinese', 0.044), ('dr', 0.043), ('unit', 0.041), ('bleu', 0.041), ('rules', 0.04), ('koehn', 0.038), ('carpuat', 0.038), ('statistical', 0.037), ('ai', 0.037), ('supervised', 0.036), ('kirchhoff', 0.035), ('rj', 0.035), ('franz', 0.034), ('josef', 0.034), ('morphology', 0.033), ('liu', 0.033), ('similarity', 0.032), ('shouxun', 0.032), ('chan', 0.032), ('ch', 0.032), ('pages', 0.032), ('habash', 0.031), ('ka', 0.031), ('cui', 0.03), ('yang', 0.03), ('analyzer', 0.03), ('ki', 0.03), ('yajuan', 0.03), ('alignment', 0.029), ('span', 0.028), ('oov', 0.028), ('enhancing', 0.028), ('goldwater', 0.027), ('wang', 0.027), ('di', 0.027), ('mainly', 0.026), ('alleviate', 0.026), ('della', 0.026), ('treating', 0.025), ('hierarchical', 0.025), ('proceedings', 0.025), ('distributions', 0.025), ('part', 0.025), ('cwmt', 0.024), ('acti', 0.024), ('aticoand', 0.024), ('cneginntreeer', 0.024), ('eofy', 0.024), ('fionrg', 0.024), ('goemnepruattiinogn', 0.024), ('inhtneoll', 0.024), ('institu', 0.024), ('ioggeyn', 0.024), ('ldoucballi', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages


2 0.35351688 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics

Author: Angeliki Lazaridou ; Marco Marelli ; Roberto Zamparelli ; Marco Baroni

Abstract: Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex words from their parts. Semantic representations constructed in this way beat a strong baseline and can be of higher quality than representations directly constructed from corpus data. Our results constitute a novel evaluation of the proposed composition methods, in which the full additive model achieves the best performance, and demonstrate the usefulness of a compositional morphology component in distributional semantics.

3 0.17179357 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

Author: Burak Kerim Akkuş ; Ruket Cakici

Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.

4 0.13725023 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

Author: Graham Neubig

Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar

5 0.13223858 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

Author: Yang Feng ; Trevor Cohn

Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.

6 0.12043747 314 acl-2013-Semantic Roles for String to Tree Machine Translation

7 0.12017439 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

8 0.10753735 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

9 0.10548896 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

10 0.098641731 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

11 0.097351417 255 acl-2013-Name-aware Machine Translation

12 0.096043654 303 acl-2013-Robust multilingual statistical morphological generation models

13 0.091824956 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

14 0.091741793 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

15 0.091104619 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

16 0.088945255 312 acl-2013-Semantic Parsing as Machine Translation

17 0.088170171 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

18 0.086868532 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German

19 0.084553629 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

20 0.082636751 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.19), (1, -0.117), (2, 0.119), (3, 0.043), (4, -0.015), (5, -0.023), (6, -0.045), (7, 0.035), (8, 0.042), (9, 0.081), (10, -0.037), (11, 0.067), (12, 0.143), (13, 0.013), (14, -0.096), (15, 0.021), (16, -0.041), (17, -0.167), (18, 0.023), (19, 0.005), (20, -0.173), (21, 0.105), (22, 0.045), (23, 0.045), (24, -0.116), (25, -0.048), (26, -0.006), (27, -0.051), (28, 0.018), (29, -0.005), (30, 0.112), (31, -0.014), (32, 0.014), (33, 0.06), (34, -0.009), (35, 0.006), (36, -0.108), (37, 0.012), (38, 0.015), (39, -0.103), (40, -0.061), (41, 0.112), (42, 0.094), (43, 0.118), (44, 0.084), (45, 0.006), (46, -0.042), (47, 0.045), (48, -0.063), (49, 0.134)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89738989 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages


2 0.67946035 303 acl-2013-Robust multilingual statistical morphological generation models

Author: Ondrej Dusek ; Filip Jurcicek

Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionary-based baseline.

3 0.65114188 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German

Author: Marion Weller ; Alexander Fraser ; Sabine Schulte im Walde

Abstract: This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on source-side syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-to-German translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.

4 0.6368227 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics

Author: Angeliki Lazaridou ; Marco Marelli ; Roberto Zamparelli ; Marco Baroni

Abstract: Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex words from their parts. Semantic representations constructed in this way beat a strong baseline and can be of higher quality than representations directly constructed from corpus data. Our results constitute a novel evaluation of the proposed composition methods, in which the full additive model achieves the best performance, and demonstrate the usefulness of a compositional morphology component in distributional semantics.

5 0.59200948 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

Author: Graham Neubig

Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT, and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than translation using phrase-based and hierarchical-phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar

6 0.58676583 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

7 0.57012302 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

8 0.53003764 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German

9 0.52275527 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

10 0.51367682 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

11 0.50388366 312 acl-2013-Semantic Parsing as Machine Translation

12 0.49625227 286 acl-2013-Psycholinguistically Motivated Computational Models on the Organization and Processing of Morphologically Complex Words

13 0.49263918 227 acl-2013-Learning to lemmatise Polish noun phrases

14 0.48891264 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

15 0.47277406 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation

16 0.47021484 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

17 0.46951276 255 acl-2013-Name-aware Machine Translation

18 0.46461314 16 acl-2013-A Novel Translation Framework Based on Rhetorical Structure Theory

19 0.45833844 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

20 0.44968957 314 acl-2013-Semantic Roles for String to Tree Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.023), (6, 0.027), (11, 0.053), (15, 0.015), (20, 0.294), (24, 0.033), (26, 0.044), (35, 0.058), (42, 0.049), (48, 0.085), (64, 0.013), (70, 0.05), (88, 0.025), (90, 0.048), (95, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74175048 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages


2 0.61274713 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

Author: Xiaojun Wan

Abstract: The task of review rating prediction can be well addressed by using regression algorithms if there is a reliable training set of reviews with human ratings. In this paper, we aim to investigate a more challenging task of cross-language review rating prediction, which makes use of only rated reviews in a source language (e.g. English) to predict the rating scores of unrated reviews in a target language (e.g. German). We propose a new co-regression algorithm to address this task by leveraging unlabeled reviews. Evaluation results on several datasets show that our proposed co-regression algorithm can consistently improve the prediction results.

3 0.60446912 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue

Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.

4 0.51181376 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

5 0.50981861 97 acl-2013-Cross-lingual Projections between Languages from Different Families

Author: Mo Yu ; Tiejun Zhao ; Yalong Bai ; Hao Tian ; Dianhai Yu

Abstract: Cross-lingual projection methods can benefit from resource-rich languages to improve the performance of NLP tasks in resource-scarce languages. However, these methods face the difficulty of syntactic differences between languages, especially when the languages in a pair differ greatly. To make the projection method generalize well to diverse language pairs, we enhance the projection method based on word alignments by introducing target-language word representations as features and proposing a novel noise-removal method based on these word representations. Experiments show that our methods greatly improve performance on projections between English and Chinese.

6 0.5063681 62 acl-2013-Automatic Term Ambiguity Detection

7 0.50538975 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

8 0.50244808 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

9 0.50155562 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

10 0.50113595 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers

11 0.50105655 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

12 0.49935639 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT

13 0.49814439 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

14 0.49798852 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

15 0.49780399 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

16 0.49715739 288 acl-2013-Punctuation Prediction with Transition-based Parsing

17 0.4969123 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

18 0.49645212 80 acl-2013-Chinese Parsing Exploiting Characters

19 0.49594527 240 acl-2013-Microblogs as Parallel Corpora

20 0.49570361 250 acl-2013-Models of Translation Competitions