emnlp emnlp2010 emnlp2010-50 knowledge-graph by maker-knowledge-mining

50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices


Source: pdf

Author: Jinhua Du ; Jie Jiang ; Andy Way

Abstract: For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out of vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Facilitating Translation Using Source Language Paraphrase Lattices Jinhua Du, Jie Jiang, Andy Way, CNGL, School of Computing, Dublin City University, Dublin, Ireland {jdu, jji, away}@computing. [sent-1, score-0.153]

2 Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. [sent-4, score-0.879]

3 We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. [sent-5, score-1.181]
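The lattice construction in sentence 3 can be sketched as a directed graph whose nodes are the gaps between input tokens and whose extra edges carry weighted paraphrase alternatives. This is a minimal illustration only: the function name, the span-keyed paraphrase table, and the uniform backbone weight of 1.0 are all assumptions, not the paper's actual weight-estimation scheme.

```python
from collections import defaultdict

def build_paraphrase_lattice(sentence, paraphrases):
    """Build a word lattice over `sentence` (a list of tokens).

    `paraphrases` maps a (start, end) source span to a list of
    (paraphrase_tokens, weight) alternatives.  Nodes are the gaps
    between tokens; every edge is a (end_node, label, weight) triple.
    """
    edges = defaultdict(list)
    # Original words form the backbone path, one edge per token.
    for i, tok in enumerate(sentence):
        edges[i].append((i + 1, tok, 1.0))
    # Paraphrase alternatives become extra edges spanning the same nodes.
    for (start, end), alts in paraphrases.items():
        for alt_tokens, weight in alts:
            # A multi-word paraphrase is kept as one edge label here for
            # simplicity; a real lattice would add interior nodes.
            edges[start].append((end, " ".join(alt_tokens), weight))
    return dict(edges)

lattice = build_paraphrase_lattice(
    ["income", "inequality", "rose"],
    {(0, 2): [(["inequality", "of", "income"], 0.6)]},
)
```

The decoder then explores all paths through such a lattice in a single pass, so the paraphrase alternatives compete with the original phrasing during search rather than being substituted beforehand.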

4 Compared to the baseline system, our method achieves relative improvements of 7. [sent-6, score-0.044]

5 63% in terms of BLEU score on small, medium and large-scale English-to-Chinese translation tasks respectively. [sent-9, score-0.453]

6 The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent. [sent-10, score-0.228]

7 1 Introduction In recent years, statistical MT systems have been easy to develop due to the rapid explosion in data availability, especially parallel data. [sent-11, score-0.334]

8 However, in reality there are still many language pairs which lack parallel data, such as Urdu–English and Chinese–Italian, where large numbers of speakers exist for both languages; of course, the problem is far worse for pairs such as Catalan–Irish. [sent-12, score-0.413]

9 For such resource-limited language pairs, sparse amounts of parallel data would cause the word alignment to be inaccurate, which would in turn lead to an inaccurate phrase alignment, and bad translations would result. [sent-13, score-0.799]

10 (2006) argue that limited amounts of parallel training data can lead to the problem of low coverage in that many phrases encountered at run-time are not observed in the training data and so their translations will not be learned. [sent-15, score-0.633]

11 Thus, in recent years, research on addressing the problem of unknown words or phrases has become more and more evident for resource-limited language pairs. [sent-16, score-0.338]

12 (2006) proposed a novel method which substitutes a paraphrase for an unknown source word or phrase in the input sentence, and then proceeds to use the translation of that paraphrase in the production of the target-language result. [sent-18, score-1.284]

13 Their experiments showed that by translating paraphrases a marked improvement was achieved in coverage and translation quality, especially in the case of unknown words which previously had been left untranslated. [sent-19, score-1.072]
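The substitution approach described in sentences 12–13 can be sketched as follows. All names are hypothetical and only single-token substitution is shown; the actual Callison-Burch et al. method also handles multi-word phrases and adds a paraphrase-probability feature to the log-linear model, which this sketch omits.

```python
def substitute_unknowns(tokens, phrase_table, paraphrase_table):
    """Replace source tokens missing from the phrase table with a
    paraphrase the table does cover, so the paraphrase's translation
    can be used in place of an otherwise-untranslated word."""
    out = []
    for tok in tokens:
        if tok in phrase_table:
            out.append(tok)
            continue
        # Pick the first paraphrase the phrase table can translate.
        for para in paraphrase_table.get(tok, []):
            if para in phrase_table:
                out.append(para)
                break
        else:
            out.append(tok)  # no usable paraphrase: leave it untranslated
    return out

src = substitute_unknowns(
    ["automobile", "fast"],
    phrase_table={"car", "fast"},
    paraphrase_table={"automobile": ["vehicle", "car"]},
)
```

Note the contrast with the lattice method of this paper: substitution commits to one paraphrase before decoding, whereas a lattice defers the choice to the decoder.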

14 However, on a large-scale data set, they did not achieve improvements in terms of automatic evaluation. [sent-20, score-0.044]

15 Nakov (2008) proposed another way to use paraphrases in SMT. [sent-21, score-0.323]

16 He generates nearly-equivalent syntactic paraphrases of the source-side training sentences, then pairs each paraphrased sentence with the target translation associated with the original sentence in the training data. [sent-22, score-0.909]

17 Essentially, this method generates new training data using paraphrases to train a new model and obtain more useful [sent-23, score-0.362]

[page footer: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 420–429, MIT, Massachusetts, USA, 9-11 October 2010. ©2010 Association for Computational Linguistics]

18 phrase pairs. [sent-25, score-0.111]

19 However, he reported that this method results in bad system performance. [sent-26, score-0.073]

20 By contrast, real improvements can be achieved by merging the phrase tables of the paraphrase model and the original model, giving priority to the latter. [sent-27, score-0.472]

21 (2009) presented the use of word lattices for multi-source translation, in which the multiple source input texts are compiled into a compact lattice, over which a single decoding pass is then performed. [sent-29, score-0.298]

22 This lattice-based method achieved positive results across all data conditions. [sent-30, score-0.038]

23 In this paper, we propose a novel method using paraphrases to facilitate translation, especially for resource-limited languages. [sent-31, score-0.484]

24 In this case, we neither need to change the phrase table, nor add new features in the log-linear model, nor add new sentences in the training data. [sent-33, score-0.111]

25 The remainder of this paper is organised as follows. [sent-34, score-0.057]

26 In Section 2, we define the “translation difficulty” from the perspective of the source side, and then examine how well the test set is covered by the phrase table and the parallel training data. [sent-35, score-0.342]

27 Section 3 describes our paraphrase lattice method and discusses how to set the weights for the edges in the lattice network. [sent-36, score-0.776]

28 In Section 4, we report comparative experiments conducted on small, medium and large-scale English-to-Chinese data sets. [sent-37, score-0.201]

29 In Section 5, we analyse the influence of our paraphrase lattice method. [sent-38, score-0.538]

30 Section 6 concludes and gives avenues for future work. [sent-39, score-0.099]

31 1 Translation Difficulty We use the term “translation difficulty” to explain how difficult it is to translate the source-side sentence in three respects: • The OOV rates of the source sentences in the test set (Callison-Burch et al. [sent-42, score-0.226]

32 • Translatability of a known phrase in the input sentence. [sent-44, score-0.185]

33 Some particular grammatical structures on the source side cannot be directly translated into the corresponding structures on the target side. [sent-45, score-0.208]

34 Nakov (2008) presents an example showing how hard it is to translate an English construction into Spanish. [sent-46, score-0.162]

35 Assume that an English-to-Spanish SMT system has an entry in its phrase table for “inequality of income”, but not for “income inequality”. [sent-47, score-0.111]

36 He argues that the latter phrase is hard to translate into Spanish where noun compounds are rare: the correct translation in this case requires a suitable Spanish preposition and a reordering, which are hard for the system to realize properly in the target language (Nakov, 2008). [sent-48, score-0.925]

37 • Consistency between the reference and the target-side sentence in the training corpus. [sent-49, score-0.255]

38 In this case, if we use paraphrases for these pieces of text, then we might improve the opportunity for the translation to approach the reference, especially in the case where only one reference is available. [sent-51, score-1.001]

39 2 Coverage As to the first aspect – coverage – we argue that the coverage rate of new words or unknown words is increasingly becoming a “bottleneck” for resource-limited languages. [sent-53, score-0.644]

40 , 2003; Chiang, 2005) or syntax-based (Zollmann and Venugopal, 2006), use phrases as the fundamental translation unit, so how much the phrase table and training data can cover the test set is an important factor which influences the translation quality. [sent-55, score-0.953]

41 Table 1 shows the statistics of the coverage of the test set on English-to-Chinese FBIS data, where we can see that the coverage of unigrams is very high, especially when the data is increased to the medium size (200K), where unigram coverage is greater than 90%. [sent-56, score-0.812]
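The coverage statistic discussed in sentences 40–41 – the fraction of test-set n-gram types also seen in training – can be approximated as below. This is a simplification: the paper's Table 1 additionally measures coverage by the phrase table, which this sketch does not model.

```python
def ngram_coverage(train_sents, test_sents, n):
    """Fraction of distinct test-set n-grams that also occur in the
    training sentences (both given as whitespace-tokenisable strings)."""
    def ngrams(sents):
        out = set()
        for sent in sents:
            toks = sent.split()
            for i in range(len(toks) - n + 1):
                out.add(tuple(toks[i:i + n]))
        return out

    test = ngrams(test_sents)
    if not test:
        return 0.0
    return len(test & ngrams(train_sents)) / len(test)
```

As the table suggests, this ratio is typically high for unigrams and drops quickly with longer n-grams, which is exactly where paraphrase alternatives can add usable edges.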

42 Based on the observations of the unknown un- [Table 1 header residue omitted] [sent-57, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('paraphrases', 0.323), ('translation', 0.309), ('lattice', 0.252), ('paraphrase', 0.233), ('nakov', 0.229), ('coverage', 0.19), ('parallel', 0.161), ('dublin', 0.161), ('inequality', 0.161), ('resourcelimited', 0.161), ('unknown', 0.152), ('medium', 0.144), ('income', 0.137), ('inaccurate', 0.137), ('smt', 0.136), ('respects', 0.124), ('phrase', 0.111), ('opportunity', 0.107), ('lattices', 0.101), ('translate', 0.099), ('phrases', 0.097), ('pieces', 0.091), ('difficulty', 0.086), ('du', 0.084), ('input', 0.074), ('bad', 0.073), ('reference', 0.072), ('source', 0.07), ('ethe', 0.069), ('comput', 0.069), ('ionrd', 0.069), ('explosion', 0.069), ('compounds', 0.069), ('realize', 0.069), ('utilise', 0.069), ('spanish', 0.068), ('amounts', 0.068), ('pairs', 0.067), ('argue', 0.066), ('hard', 0.063), ('ang', 0.062), ('schroeder', 0.062), ('jinhua', 0.062), ('argues', 0.062), ('substitutes', 0.062), ('side', 0.062), ('facilitate', 0.061), ('especially', 0.06), ('largescale', 0.057), ('andy', 0.057), ('facilitating', 0.057), ('venugopal', 0.057), ('zollmann', 0.057), ('paraphrased', 0.057), ('inconsistent', 0.057), ('nin', 0.057), ('organised', 0.057), ('sentence', 0.057), ('compiled', 0.053), ('italian', 0.053), ('aen', 0.053), ('analyse', 0.053), ('avenues', 0.053), ('translations', 0.051), ('fbis', 0.05), ('influences', 0.05), ('evident', 0.05), ('deliver', 0.05), ('jie', 0.05), ('reality', 0.05), ('ld', 0.048), ('bottleneck', 0.048), ('pl', 0.046), ('affects', 0.046), ('concludes', 0.046), ('missed', 0.046), ('priority', 0.046), ('oov', 0.046), ('becoming', 0.046), ('syntactically', 0.046), ('improvements', 0.044), ('rapid', 0.044), ('years', 0.043), ('properly', 0.042), ('factor', 0.041), ('novel', 0.04), ('generates', 0.039), ('discusses', 0.039), ('addressing', 0.039), ('consistency', 0.039), ('transfer', 0.039), ('might', 0.039), ('achieved', 0.038), ('structures', 0.038), ('preposition', 0.038), ('unit', 0.038), ('unigrams', 0.038), ('alignment', 0.037), 
('reordering', 0.036), ('fundamental', 0.036)]
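The per-word weights above presumably drive a cosine-similarity ranking over tf-idf vectors, producing the "similar papers" scores below. A minimal sketch of that standard computation (illustrative only, not the site's actual scoring code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenised document to a {word: tf * idf} dict."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: one count per doc
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["paraphrase", "lattice"], ["paraphrase", "metric"],
        ["alignment", "model"]]
vecs = tfidf_vectors(docs)
```

With this scheme a paper scores near 1.0 against itself and 0.0 against a paper sharing no vocabulary, matching the pattern of the simValue columns below.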

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices


2 0.45538244 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

3 0.29110917 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

4 0.26026827 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic improvement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.
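The sampling idea in this abstract – drawing translations from the decoder lattice rather than enumerating an N-best list – can be illustrated with a toy weighted random walk. Function and data names are invented; real systems sample from normalised arc posteriors, not raw weights as here.

```python
import random

def sample_path(lattice, start, end, rng):
    """Sample one path through a lattice by walking from `start` to
    `end`, choosing each outgoing arc with probability proportional
    to its weight.  `lattice` maps node -> [(next_node, label, weight)]."""
    node, labels = start, []
    while node != end:
        arcs = lattice[node]
        total = sum(w for _, _, w in arcs)
        r = rng.random() * total
        for nxt, label, w in arcs:
            r -= w
            if r <= 0.0:
                node = nxt
                labels.append(label)
                break
    return labels

toy = {0: [(1, "a", 1.0)], 1: [(2, "b", 0.7), (2, "c", 0.3)]}
path = sample_path(toy, 0, 2, random.Random(0))
```

Repeating the draw many times yields a translation pool whose diversity lies between an N-best list and the full lattice, which is the trade-off the abstract describes.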

5 0.22113083 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

6 0.21176533 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

7 0.17608286 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

8 0.1647301 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

9 0.15624756 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.13903625 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

11 0.12307151 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.11810278 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

13 0.11628379 39 emnlp-2010-EMNLP 044

14 0.11013303 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

15 0.10904373 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

16 0.1054196 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

17 0.10531236 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

18 0.096934408 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

19 0.090014286 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

20 0.072990686 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.326), (1, -0.457), (2, -0.174), (3, 0.038), (4, -0.059), (5, 0.109), (6, 0.024), (7, 0.036), (8, 0.111), (9, -0.071), (10, 0.015), (11, -0.013), (12, 0.01), (13, 0.016), (14, -0.083), (15, 0.023), (16, -0.07), (17, 0.115), (18, 0.077), (19, 0.185), (20, 0.032), (21, 0.043), (22, -0.067), (23, 0.0), (24, -0.137), (25, 0.155), (26, -0.065), (27, -0.114), (28, 0.127), (29, -0.019), (30, -0.009), (31, 0.019), (32, 0.063), (33, -0.084), (34, -0.02), (35, 0.053), (36, 0.026), (37, -0.029), (38, -0.098), (39, -0.03), (40, 0.06), (41, 0.045), (42, 0.036), (43, 0.047), (44, -0.071), (45, 0.073), (46, 0.023), (47, -0.001), (48, -0.052), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96299577 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices


2 0.93121964 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation


3 0.7892381 63 emnlp-2010-Improving Translation via Targeted Paraphrasing


4 0.66234446 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice


5 0.62209636 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts


6 0.59703368 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

7 0.59379655 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

8 0.5060848 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

9 0.4679296 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

10 0.42123696 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

11 0.40997368 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.40264344 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

13 0.38857076 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

14 0.38178778 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

15 0.36464769 39 emnlp-2010-EMNLP 044

16 0.31774116 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

17 0.30577114 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

18 0.30458185 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter

19 0.3045319 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

20 0.29297405 86 emnlp-2010-Non-Isomorphic Forest Pair Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.012), (10, 0.019), (12, 0.019), (29, 0.075), (30, 0.084), (32, 0.02), (41, 0.303), (52, 0.057), (56, 0.053), (66, 0.184), (72, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75566065 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices


2 0.58726287 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation


3 0.57113397 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

4 0.57101923 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

5 0.56953019 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

Author: Zhongjun He ; Yao Meng ; Hao Yu

Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, the phrase reorderings lack of contextual information in conventional HPB systems. This paper proposes a context-dependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, the absolute improvements in BLEU (case-insensitive) range from 1.2 to 2.1.

6 0.56828529 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

7 0.56613255 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

8 0.56558937 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

9 0.56550193 104 emnlp-2010-The Necessity of Combining Adaptation Methods

10 0.56480908 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

11 0.56360751 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

12 0.56161255 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.55742335 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

14 0.55578679 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

15 0.55432099 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

16 0.55413252 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

17 0.55346549 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

18 0.55306536 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

19 0.55251217 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

20 0.55157936 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC