emnlp emnlp2010 emnlp2010-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jinhua Du ; Jie Jiang ; Andy Way
Abstract: For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out-of-vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent.
Reference: text
sentIndex sentText sentNum sentScore
1 Facilitating Translation Using Source Language Paraphrase Lattices Jinhua Du, Jie Jiang, Andy Way CNGL, School of Computing Dublin City University, Dublin, Ireland {jdu, jji, away}@computing. [sent-1, score-0.153]
2 Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. [sent-4, score-0.879]
3 We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. [sent-5, score-1.181]
4 Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large-scale English-to-Chinese translation tasks respectively. [sent-6, score-0.044] [sent-9, score-0.453]
6 The results show that the proposed method is effective not only for resourcelimited language pairs, but also for resourcesufficient pairs to some extent. [sent-10, score-0.228]
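The relative-improvement figures quoted in the abstract are simple ratios against the baseline BLEU score. A minimal sketch, using hypothetical absolute scores (the excerpt reports only the relative gains, not the baselines themselves):

```python
def relative_improvement(baseline_bleu: float, system_bleu: float) -> float:
    """Relative improvement in percent: 100 * (system - baseline) / baseline."""
    return 100.0 * (system_bleu - baseline_bleu) / baseline_bleu

# Hypothetical score pairs chosen only to reproduce the reported relative gains:
print(round(relative_improvement(20.0, 21.414), 2))   # 7.07 (small-scale task)
print(round(relative_improvement(30.0, 32.034), 2))   # 6.78 (medium-scale task)
```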
7 1 Introduction In recent years, statistical MT systems have been easy to develop due to the rapid explosion in data availability, especially parallel data. [sent-11, score-0.334]
8 However, in reality there are still many language pairs which lack parallel data, such as Urdu–English and Chinese–Italian, where large numbers of speakers exist for both languages; of course, the problem is far worse for pairs such as Catalan–Irish. [sent-12, score-0.413]
9 For such resourcelimited language pairs, sparse amounts of parallel data would cause the word alignment to be inaccurate, which would in turn lead to an inaccurate phrase alignment, and bad translations would result. [sent-13, score-0.799]
10 (2006) argue that limited amounts of parallel training data can lead to the problem of low coverage in that many phrases encountered at run-time are not observed in the training data and so their translations will not be learned. [sent-15, score-0.633]
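That coverage gap can be quantified by counting how many distinct source phrases of a test set appear on the source side of the phrase table. A toy sketch, modelling the phrase table as a plain set of source strings (an assumption of this illustration, not the paper's actual data structure):

```python
def phrase_coverage(test_sentences, source_phrases, max_len=4):
    """Fraction of distinct test-set phrases (up to max_len words)
    that occur on the source side of the phrase table."""
    seen = set()
    for sent in test_sentences:
        toks = sent.split()
        for n in range(1, max_len + 1):
            for i in range(len(toks) - n + 1):
                seen.add(" ".join(toks[i:i + n]))
    return len(seen & source_phrases) / len(seen) if seen else 0.0

table = {"income", "of", "inequality of income", "income inequality"}
print(phrase_coverage(["inequality of income"], table))  # 0.5: 3 of 6 phrases covered
```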
11 Thus, in recent years, research on addressing the problem of unknown words or phrases has become more and more evident for resource-limited language pairs. [sent-16, score-0.338]
12 (2006) proposed a novel method which substitutes a paraphrase for an unknown source word or phrase in the input sentence, and then proceeds to use the translation of that paraphrase in the production of the target-language result. [sent-18, score-1.284]
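A sketch of that substitution step. The paraphrase-table shape and names here are illustrative assumptions (single-token paraphrases only; Callison-Burch et al. also handle multi-word phrases):

```python
def substitute_unknowns(tokens, translatable, paraphrase_table):
    """Replace each untranslatable token with its best-scoring paraphrase
    that the system can translate; otherwise leave the token unchanged."""
    result = []
    for tok in tokens:
        if tok in translatable:
            result.append(tok)
            continue
        candidates = sorted(paraphrase_table.get(tok, ()),
                            key=lambda pair: pair[1], reverse=True)
        result.append(next((p for p, _ in candidates if p in translatable), tok))
    return result

known = {"the", "car", "is", "fast"}
paras = {"vehicle": [("automobile", 0.6), ("car", 0.4)]}
print(substitute_unknowns("the vehicle is fast".split(), known, paras))
# ['the', 'car', 'is', 'fast']
```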
13 Their experiments showed that by translating paraphrases a marked improvement was achieved in coverage and translation quality, especially in the case of unknown words which previously had been left untranslated. [sent-19, score-1.072]
14 However, on a large-scale data set, they did not achieve improvements in terms of automatic evaluation. [sent-20, score-0.044]
15 Nakov (2008) proposed another way to use paraphrases in SMT. [sent-21, score-0.323]
16 He generates nearly-equivalent syntactic paraphrases of the source-side training sentences, then pairs each paraphrased sentence with the target translation associated with the original sentence in the training data. [sent-22, score-0.909]
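Nakov's augmentation scheme reduces to a simple loop once the paraphraser is abstracted as a callable (our simplification; his actual system generates nearly-equivalent syntactic paraphrases):

```python
def augment_training_data(bitext, paraphrase_source):
    """Pair each paraphrase of a source sentence with the target
    translation of the original sentence (Nakov, 2008)."""
    augmented = list(bitext)
    for src, tgt in bitext:
        for para in paraphrase_source(src):
            if para != src:
                augmented.append((para, tgt))
    return augmented

bitext = [("inequality of income", "desigualdad de ingresos")]
toy_paraphraser = lambda s: [s, "income inequality"]  # stand-in paraphraser
print(augment_training_data(bitext, toy_paraphraser))
```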
17 Essentially, this method generates new training data using paraphrases to train a new model and obtain more useful phrase pairs. [sent-23, score-0.362] [sent-25, score-0.111]
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 420–429, MIT, Massachusetts, USA, 9–11 October 2010. ©2010 Association for Computational Linguistics
19 However, he reported that this method results in bad system performance. [sent-26, score-0.073]
20 By contrast, real improvements can be achieved by merging the phrase tables of the paraphrase model and the original model, giving priority to the latter. [sent-27, score-0.472]
21 (2009) presented the use of word lattices for multi-source translation, in which the multiple source input texts are compiled into a compact lattice, over which a single decoding pass is then performed. [sent-29, score-0.298]
22 This lattice-based method achieved positive results across all data conditions. [sent-30, score-0.038]
23 In this paper, we propose a novel method using paraphrases to facilitate translation, especially for resource-limited languages. [sent-31, score-0.484]
24 In this case, we need neither change the phrase table, nor add new features to the log-linear model, nor add new sentences to the training data. [sent-33, score-0.111]
25 The remainder of this paper is organised as follows. [sent-34, score-0.057]
26 In Section 2, we define the “translation difficulty” from the perspective of the source side, and then examine how well the test set is covered by the phrase table and the parallel training data. [sent-35, score-0.342]
27 Section 3 describes our paraphrase lattice method and discusses how to set the weights for the edges in the lattice network. [sent-36, score-0.776]
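The lattice described here can be represented as weighted edges over the positions between input words: each original token contributes an edge of weight 1.0, and each paraphrase spanning positions i..j adds a parallel edge carrying its estimated weight. A minimal sketch (the actual weight estimation is the subject of Section 3; the numbers below are placeholders):

```python
def build_paraphrase_lattice(tokens, span_paraphrases):
    """Return lattice edges as (start_node, end_node, label, weight).
    `span_paraphrases` maps a (start, end) span of the input sentence
    to a list of (paraphrase_string, weight) alternatives."""
    edges = [(i, i + 1, tok, 1.0) for i, tok in enumerate(tokens)]
    for (i, j), alternatives in span_paraphrases.items():
        for label, weight in alternatives:
            edges.append((i, j, label, weight))
    return edges

tokens = "inequality of income".split()
paras = {(0, 3): [("income inequality", 0.4)]}  # placeholder weight
for edge in build_paraphrase_lattice(tokens, paras):
    print(edge)
```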
28 In Section 4, we report comparative experiments conducted on small, medium and large-scale English-to-Chinese data sets. [sent-37, score-0.201]
29 In Section 5, we analyse the influence of our paraphrase lattice method. [sent-38, score-0.538]
30 Section 6 concludes and gives avenues for future work. [sent-39, score-0.099]
31 2.1 Translation Difficulty We use the term “translation difficulty” to explain how difficult it is to translate the source-side sentence in three respects: • The OOV rates of the source sentences in the test set (Callison-Burch et al. [sent-42, score-0.226]
32 • Translatability of a known phrase in the input sentence. [sent-44, score-0.185]
33 Some particular grammatical structures on the source side cannot be directly translated into the corresponding structures on the target side. [sent-45, score-0.208]
34 Nakov (2008) presents an example showing how hard it is to translate an English construction into Spanish. [sent-46, score-0.162]
35 Assume that an English-to-Spanish SMT system has an entry in its phrase table for “inequality of income”, but not for “income inequality”. [sent-47, score-0.111]
36 He argues that the latter phrase is hard to translate into Spanish where noun compounds are rare: the correct translation in this case requires a suitable Spanish preposition and a reordering, which are hard for the system to realize properly in the target language (Nakov, 2008). [sent-48, score-0.925]
37 • Consistency between the reference and the target-side sentence in the training corpus. [sent-49, score-0.255]
38 In this case, if we use paraphrases for these pieces of text, then we might improve the opportunity for the translation to approach the reference, especially in the case where only one reference is available. [sent-51, score-1.001]
39 2.2 Coverage As to the first aspect, coverage, we argue that the coverage rate of new or unknown words is increasingly becoming a “bottleneck” for resource-limited languages. [sent-53, score-0.644]
, 2003; Chiang, 2005) or syntax-based (Zollmann and Venugopal, 2006), use phrases as the fundamental translation unit, so how well the phrase table and training data cover the test set is an important factor influencing translation quality. [sent-55, score-0.953]
41 Table 1 shows the statistics of the coverage of the test set on English-to-Chinese FBIS data, where we can see that the coverage of unigrams is very high, especially when the data is increased to the medium size (200K), where unigram coverage is greater than 90%. [sent-56, score-0.812]
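Per-order coverage statistics like those in Table 1 can be reproduced by intersecting the distinct n-grams of the test set with those of the training source text (a toy sketch; the FBIS data itself is not included in this excerpt):

```python
def ngram_coverage_by_order(test_sents, train_sents, max_n=4):
    """For each n-gram order, the fraction of distinct test-set
    n-grams that also occur in the training source text."""
    def distinct_ngrams(sents, n):
        grams = set()
        for s in sents:
            toks = s.split()
            grams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
        return grams

    coverage = {}
    for n in range(1, max_n + 1):
        test = distinct_ngrams(test_sents, n)
        train = distinct_ngrams(train_sents, n)
        coverage[n] = len(test & train) / len(test) if test else 0.0
    return coverage

print(ngram_coverage_by_order(["the fast car"], ["the car is fast"], max_n=2))
# {1: 1.0, 2: 0.0}
```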
42 Based on the observations of the unknown un- [sent-57, score-0.152]
wordName wordTfidf (topN-words)
[('paraphrases', 0.323), ('translation', 0.309), ('lattice', 0.252), ('paraphrase', 0.233), ('nakov', 0.229), ('coverage', 0.19), ('parallel', 0.161), ('dublin', 0.161), ('inequality', 0.161), ('resourcelimited', 0.161), ('unknown', 0.152), ('medium', 0.144), ('income', 0.137), ('inaccurate', 0.137), ('smt', 0.136), ('respects', 0.124), ('phrase', 0.111), ('opportunity', 0.107), ('lattices', 0.101), ('translate', 0.099), ('phrases', 0.097), ('pieces', 0.091), ('difficulty', 0.086), ('du', 0.084), ('input', 0.074), ('bad', 0.073), ('reference', 0.072), ('source', 0.07), ('ethe', 0.069), ('comput', 0.069), ('ionrd', 0.069), ('explosion', 0.069), ('compounds', 0.069), ('realize', 0.069), ('utilise', 0.069), ('spanish', 0.068), ('amounts', 0.068), ('pairs', 0.067), ('argue', 0.066), ('hard', 0.063), ('ang', 0.062), ('schroeder', 0.062), ('jinhua', 0.062), ('argues', 0.062), ('substitutes', 0.062), ('side', 0.062), ('facilitate', 0.061), ('especially', 0.06), ('largescale', 0.057), ('andy', 0.057), ('facilitating', 0.057), ('venugopal', 0.057), ('zollmann', 0.057), ('paraphrased', 0.057), ('inconsistent', 0.057), ('nin', 0.057), ('organised', 0.057), ('sentence', 0.057), ('compiled', 0.053), ('italian', 0.053), ('aen', 0.053), ('analyse', 0.053), ('avenues', 0.053), ('translations', 0.051), ('fbis', 0.05), ('influences', 0.05), ('evident', 0.05), ('deliver', 0.05), ('jie', 0.05), ('reality', 0.05), ('ld', 0.048), ('bottleneck', 0.048), ('pl', 0.046), ('affects', 0.046), ('concludes', 0.046), ('missed', 0.046), ('priority', 0.046), ('oov', 0.046), ('becoming', 0.046), ('syntactically', 0.046), ('improvements', 0.044), ('rapid', 0.044), ('years', 0.043), ('properly', 0.042), ('factor', 0.041), ('novel', 0.04), ('generates', 0.039), ('discusses', 0.039), ('addressing', 0.039), ('consistency', 0.039), ('transfer', 0.039), ('might', 0.039), ('achieved', 0.038), ('structures', 0.038), ('preposition', 0.038), ('unit', 0.038), ('unigrams', 0.038), ('alignment', 0.037), 
('reordering', 0.036), ('fundamental', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices
2 0.45538244 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
Author: Aurelien Max
Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.
3 0.29110917 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson
Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.
4 0.26026827 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
Author: Samidh Chatterjee ; Nicola Cancedda
Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.
5 0.22113083 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.
6 0.21176533 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
7 0.17608286 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities
8 0.1647301 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
9 0.15624756 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
10 0.13903625 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
11 0.12307151 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions
12 0.11810278 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model
13 0.11628379 39 emnlp-2010-EMNLP 044
14 0.11013303 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
15 0.10904373 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study
16 0.1054196 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
17 0.10531236 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar
18 0.096934408 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
19 0.090014286 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation
20 0.072990686 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation
topicId topicWeight
[(0, 0.326), (1, -0.457), (2, -0.174), (3, 0.038), (4, -0.059), (5, 0.109), (6, 0.024), (7, 0.036), (8, 0.111), (9, -0.071), (10, 0.015), (11, -0.013), (12, 0.01), (13, 0.016), (14, -0.083), (15, 0.023), (16, -0.07), (17, 0.115), (18, 0.077), (19, 0.185), (20, 0.032), (21, 0.043), (22, -0.067), (23, 0.0), (24, -0.137), (25, 0.155), (26, -0.065), (27, -0.114), (28, 0.127), (29, -0.019), (30, -0.009), (31, 0.019), (32, 0.063), (33, -0.084), (34, -0.02), (35, 0.053), (36, 0.026), (37, -0.029), (38, -0.098), (39, -0.03), (40, 0.06), (41, 0.045), (42, 0.036), (43, 0.047), (44, -0.071), (45, 0.073), (46, 0.023), (47, -0.001), (48, -0.052), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.96299577 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices
2 0.93121964 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
3 0.7892381 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
4 0.66234446 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
5 0.62209636 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
6 0.59703368 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
7 0.59379655 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
8 0.5060848 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
9 0.4679296 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
10 0.42123696 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
11 0.40997368 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions
12 0.40264344 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities
13 0.38857076 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation
14 0.38178778 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar
15 0.36464769 39 emnlp-2010-EMNLP 044
16 0.31774116 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model
17 0.30577114 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study
18 0.30458185 1 emnlp-2010-"Poetic" Statistical Machine Translation: Rhyme and Meter
19 0.3045319 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
20 0.29297405 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
topicId topicWeight
[(5, 0.012), (10, 0.019), (12, 0.019), (29, 0.075), (30, 0.084), (32, 0.02), (41, 0.303), (52, 0.057), (56, 0.053), (66, 0.184), (72, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.75566065 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices
2 0.58726287 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
3 0.57113397 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
Author: Quang Do ; Dan Roth
Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint opti- mization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.
4 0.57101923 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng
Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
5 0.56953019 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation
Author: Zhongjun He ; Yao Meng ; Hao Yu
Abstract: Hierarchical phrase-based (HPB) translation provides a powerful mechanism to capture both short and long distance phrase reorderings. However, the phrase reorderings lack of contextual information in conventional HPB systems. This paper proposes a contextdependent phrase reordering approach that uses the maximum entropy (MaxEnt) model to help the HPB decoder select appropriate reordering patterns. We classify translation rules into several reordering patterns, and build a MaxEnt model for each pattern based on various contextual features. We integrate the MaxEnt models into the HPB model. Experimental results show that our approach achieves significant improvements over a standard HPB system on large-scale translation tasks. On Chinese-to-English translation, , the absolute improvements in BLEU (caseinsensitive) range from 1.2 to 2.1.
6 0.56828529 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
9 0.56550193 104 emnlp-2010-The Necessity of Combining Adaptation Methods
10 0.56480908 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
11 0.56360751 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
12 0.56161255 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
13 0.55742335 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
14 0.55578679 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields
15 0.55432099 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
16 0.55413252 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
17 0.55346549 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
18 0.55306536 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model
19 0.55251217 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac
20 0.55157936 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC