emnlp emnlp2010 emnlp2010-33 knowledge-graph by maker-knowledge-mining

33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning


Source: pdf

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. [sent-11, score-0.879]

2 We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language. [sent-12, score-0.822]

3 1 Introduction Given the accelerated growth of the number of multilingual documents on the Web and elsewhere, the need for effective multilingual and cross-lingual text processing techniques is becoming increasingly important. [sent-13, score-0.358]

4 In this paper, we address the task of cross-lingual text classification (CLTC), which builds text classifiers for multiple languages by using training data in one language, thereby avoiding the costly and time-consuming process of labeling training data for each individual language. [sent-18, score-0.493]

5 Monolingual text classification algorithms can then be applied on these translated data. [sent-21, score-0.511]

6 First, most off-the-shelf machine translation systems typically generate only their best translation for a given text. [sent-23, score-0.87]

7 Since machine translation is known to be a notoriously hard problem, applying monolingual text classification algorithms directly on the erroneous translation of training or test data may severely deteriorate the classification accuracy. [sent-24, score-1.556]

8 So even if the translation of training or test data is perfectly correct, the cross language classifier may not perform as well as the monolingual one trained and tested on the data from the same language. [sent-26, score-0.758]

9 In this paper, we propose a new approach to CLTC, which trains a classification model in the source language and ports the model to the target language, with the translation knowledge learned using the EM algorithm. [sent-27, score-1.064]

10 Unlike previous methods based on machine translation (Fortuna and Shawe-Taylor, 2005), our method takes into account the ambiguity associated with the translation of the model features. [sent-28, score-0.435]

11 The translated model serves as an initial classifier for a semi-supervised process, by which the model is further adjusted to fit the distribution of the target language. [sent-31, score-0.531]

12 Our method does not require any labeled data in the target language, nor a machine translation system. [sent-32, score-0.668]

13 Instead, the only requirement is a reasonable amount of unlabeled data in the target language, which is often easy to obtain. [sent-33, score-0.316]

14 In section 3, we introduce our method that translates the classification model with the translation knowledge learned using the EM algorithm. [sent-35, score-0.796]

15 Section 4 describes model adaptation by training the translated model with unlabeled documents in the target language. [sent-36, score-0.766]

16 Text classification techniques have been applied to many diverse problems, ranging from topic classification (Joachims, 1997), to genre detection (Argamon et al. [sent-39, score-0.47]

17 Text classification is typically formulated as a learning task, where a classifier learns how to distinguish between categories in a given set, using features automatically extracted from a collection of documents. [sent-43, score-0.399]

18 Some of the most successful approaches to date for text classification involve the use of machine learning methods, which assume that enough annotated data is available such that a classification model can be automatically learned. [sent-46, score-0.539]

19 Despite the attention that monolingual text classification has received from the research community, only very little work has been done on cross-lingual text classification. [sent-50, score-0.52]

20 The work that is most closely related to ours is (Gliozzo and Strapparava, 2006), where a multilingual domain kernel is learned from comparable corpora, and subsequently used for the cross-lingual classification of texts. [sent-51, score-0.344]

21 , 2005) studied the use of machine translation tools for the purpose of cross language text classification and mining. [sent-55, score-0.832]

22 The performance of such classifiers very much depends on the quality of the machine translation tools. [sent-57, score-0.491]

23 Unfortunately, the development of statistical machine translation systems (Brown et al. [sent-58, score-0.435]

24 Although in this method the transfer learning is performed across different domains in the same language, the underlying principle is similar to CLTC in the sense that different domains or languages may share a significant amount of knowledge in similar classification tasks. [sent-65, score-0.416]

25 This method bootstraps text classifiers with only unlabeled data or a small amount of labeled training data, which is close to our setting that tries to leverage labeled data and unlabeled data in different languages to build text classifiers. [sent-67, score-0.688]

26 Finally, also closely related is the work carried out in the field of sentiment and subjectivity analysis for cross-lingual classification of opinions. [sent-68, score-0.374]

27 , 2007) use an English corpus annotated for subjectivity along with parallel text to build a subjectivity classifier for Romanian. [sent-70, score-0.474]

28 , 2008) propose a method based on machine translation to generate parallel texts, followed by a cross-lingual projection of subjectivity labels, which are used to train subjectivity annotation tools for Romanian and Spanish. [sent-72, score-0.757]

29 The technique is tested on the automatic sentiment classification of product reviews in Chinese, and shown to successfully make use of both cross-language and within-language knowledge. [sent-74, score-0.331]

30 3 Cross Language Model Translation To make the classifier applicable to documents in a foreign language, we introduce a method where model features that are learned from the training data are translated from the source language into the target language. [sent-75, score-0.864]

31 Using this translation process, a feature associated with a word in the source language is transferred to a word in the target language so that the feature is triggered when the word occurs in the target language test document. [sent-76, score-0.9]

32 In a typical translation process, the features would be translated by making use of a bilingual dictionary. [sent-77, score-0.734]

33 However, this translation method has a major drawback, due to the ambiguity usually associated with the entries in a bilingual dictionary: a word in one language can have multiple translations in another language, with possibly disparate meanings. [sent-78, score-0.678]

34 If an incorrect translation is selected, it can distort the classification accuracy, by introducing erroneous features into the learning model. [sent-79, score-0.67]

35 Therefore, our goal is to minimize the distortion during the model translation process, in order to maximize the classification accuracy in the target language. [sent-80, score-0.815]

36 In this paper, we introduce a method that employs the EM algorithm to automatically learn feature translation probabilities from labeled text in the source language and unlabeled text in the target language. [sent-81, score-1.164]

37 Using the feature translation probabilities, we can derive a classification model for the target language from a mixture model with feature translations. [sent-82, score-0.858]

38 In the first step, a pseudo-document d′ is generated in the source language, followed by a second step, where d′ is translated into the observed document d in the target language. [sent-85, score-0.513]

39 The prior probability P(c) and the probability of the source language word w′ given class c are estimated using the labeled training data in the source language, so we use them as known parameters. [sent-91, score-0.341]
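As a concrete illustration of the estimation step just described, the sketch below derives P(c) and P(w′|c) from labeled source language documents by relative-frequency counting. The function name, the (tokens, label) input format, and the add-one style smoothing are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict

def estimate_source_parameters(labeled_docs, smoothing=1.0):
    """Estimate P(c) and P(w'|c) from labeled source-language documents.

    labeled_docs: iterable of (tokens, class_label) pairs, where tokens is a
    list of source-language words.  The smoothing value is an assumption.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)   # word_counts[c][w'] = count of w' in class c
    for tokens, c in labeled_docs:
        class_counts[c] += 1
        word_counts[c].update(tokens)

    vocab = {w for c in word_counts for w in word_counts[c]}
    total_docs = sum(class_counts.values())
    p_class = {c: class_counts[c] / total_docs for c in class_counts}

    p_word_given_class = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + smoothing * len(vocab)
        for w in vocab:
            p_word_given_class[(w, c)] = (word_counts[c][w] + smoothing) / denom
    return p_class, p_word_given_class
```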

40 P(wi | w′i, c) is the probability of translating the word w′i in the source language to the word wi in the target language given class c, and these are the parameters we want to learn from the corpus in the target language. [sent-92, score-0.711]

41 K is the set of translation candidates in the target language for the source language word w′ according to the bilingual lexicon. [sent-99, score-0.772]

42 Algorithm 1 illustrates the EM learning process, where nw′ denotes the number of translation candidates for w′ according to the bilingual lexicon. [sent-103, score-0.527]
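Since Algorithm 1 itself is not reproduced in this summary, the following is only a minimal sketch of an EM loop consistent with the generative story above: the hidden variables are the class of each unlabeled target language document and the source word that generated each observed target word, P(c) and P(w′|c) are held fixed, and P(w|w′, c) is re-estimated on every iteration. All function and variable names are assumptions, and the uniform initialisation shown here differs from the "UNIGRAM" initialisation the paper uses in its experiments.

```python
import math
from collections import defaultdict

def em_translation_probs(unlabeled_docs, classes, p_class, p_src_word_given_class,
                         lexicon, iterations=10):
    """Learn class-specific translation probabilities P(w | w', c).

    unlabeled_docs:         target-language documents (lists of tokens)
    p_class[c]:             P(c), fixed, from the labeled source data
    p_src_word_given_class: dict {(w', c): P(w'|c)}, fixed, from the source data
    lexicon[w']:            translation candidates of source word w' (the set K)
    """
    # Invert the lexicon: which source words can produce target word w?
    sources_of = defaultdict(set)
    for w_src, candidates in lexicon.items():
        for w in candidates:
            sources_of[w].add(w_src)

    # Uniform initialisation over the n_{w'} candidates of each source word.
    p_trans = {}
    for w_src, candidates in lexicon.items():
        for w in candidates:
            for c in classes:
                p_trans[(w, w_src, c)] = 1.0 / len(candidates)

    for _ in range(iterations):
        counts = defaultdict(float)
        for doc in unlabeled_docs:
            # E-step, document level: posterior over the document's class.
            log_pc = {}
            for c in classes:
                s = math.log(p_class[c])
                for w in doc:
                    pw = sum(p_src_word_given_class.get((w_src, c), 0.0) *
                             p_trans[(w, w_src, c)]
                             for w_src in sources_of.get(w, ()))
                    s += math.log(pw + 1e-12)
                log_pc[c] = s
            m = max(log_pc.values())
            z = sum(math.exp(v - m) for v in log_pc.values())
            post_c = {c: math.exp(log_pc[c] - m) / z for c in classes}

            # E-step, word level: which source word w' generated target word w?
            for w in doc:
                for c in classes:
                    denom = sum(p_src_word_given_class.get((w_src, c), 0.0) *
                                p_trans[(w, w_src, c)]
                                for w_src in sources_of.get(w, ()))
                    if denom <= 0.0:
                        continue
                    for w_src in sources_of[w]:
                        num = (p_src_word_given_class.get((w_src, c), 0.0) *
                               p_trans[(w, w_src, c)])
                        counts[(w, w_src, c)] += post_c[c] * num / denom

        # M-step: renormalise over the translation candidates of each source word.
        for w_src, candidates in lexicon.items():
            for c in classes:
                total = sum(counts[(w, w_src, c)] for w in candidates)
                if total > 0.0:
                    for w in candidates:
                        p_trans[(w, w_src, c)] = counts[(w, w_src, c)] / total
    return p_trans
```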

43 Many statistical machine translation systems such as IBM models (Brown et al. [sent-105, score-0.435]

44 , 1993) learn word translation probabilities from millions of parallel sentences which are mutual translations. [sent-106, score-0.598]

45 (Koehn and Knight, 2000) proposed to use the EM algorithm to learn word translation probabilities from non-parallel monolingual corpora. [sent-108, score-0.669]

46 However, this method estimates only class-independent translation probabilities P(wi | w′i), while our approach is able to learn class-specific translation probabilities P(wi | w′i, c) by leveraging available labeled training data in the source language. [sent-109, score-1.338]

47 2 Model Translation In order to classify documents in the target language, a straightforward approach to transferring the classification model learned from the labeled source language training data is to translate each feature from the bag-of-words model according to the bilingual lexicon. [sent-114, score-1.04]

48 However, because of the translation ambiguity of each word, a model in the source language could be potentially translated into many different models in the target language. [sent-115, score-0.939]

49 Thus, we think of the probability of the class of a target language document as a mixture of the probabilities given by each model translated from the source language model, weighted by their translation probabilities. [sent-116, score-1.131]

50 P(c|d, mt) ≈ ∑mt′ P(mt′|ms, c) P(c|d, mt′), where mt is the target language classification model and mt′ is a candidate model translated from the model ms trained on the labeled training data in the source language. [sent-117, score-0.86]

51 This is a very generic representation for model translation, and the model m could be any type of text classification model. [sent-118, score-0.504]

52 , 1996) as an example for the model translation across languages, since the ME model is one of the most widely used text classification models. [sent-120, score-0.739]

53 During model translation, the feature weight for f(wi, c) is transferred to f(wi′, c) in the target language model, where wi′ is the translation of wi. [sent-123, score-0.655]
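One simple way to realise the feature-weight transfer described above is to spread each source language feature weight over the translation candidates of its word, weighted by the learned P(w|w′, c). The sketch below does exactly that for a Maximum Entropy style bag-of-words model; the function name and data layout are assumptions rather than the paper's exact procedure.

```python
from collections import defaultdict

def translate_maxent_weights(src_weights, lexicon, p_trans):
    """Port Maximum Entropy style feature weights to the target language.

    src_weights[(w_src, c)]:    learned weight of the source feature f(w_src, c)
    lexicon[w_src]:             candidate target translations of w_src
    p_trans[(w_tgt, w_src, c)]: learned P(w_tgt | w_src, c)
    Returns tgt_weights[(w_tgt, c)]: each source weight redistributed over the
    translation candidates of its word, i.e. the expected weight under the
    translation distribution.
    """
    tgt_weights = defaultdict(float)
    for (w_src, c), weight in src_weights.items():
        for w_tgt in lexicon.get(w_src, ()):
            tgt_weights[(w_tgt, c)] += p_trans.get((w_tgt, w_src, c), 0.0) * weight
    return dict(tgt_weights)
```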

54 4 Model Adaptation with Semi-Supervised Learning In addition to translation ambiguity, another challenge in building a classifier using training data in a foreign language is the discrepancy of data distribution in different languages. [sent-138, score-0.599]

55 Direct application of a classifier translated from a foreign model may not fit the distribution of the current language well. [sent-139, score-0.423]

56 Specifically, we first start by using the translated classifier from English as an initial classifier to label a set of Chinese documents. [sent-142, score-0.461]

57 The initial classifier is able to correctly classify a number of unlabeled Chinese documents with the knowledge transferred from English training data. [sent-143, score-0.562]

58 We then pick a set of labeled Chinese documents with high confidence to train a new Chinese classifier. [sent-145, score-0.317]

59 Re-training the classifier with the Chinese documents can adjust the feature weights for these words so that the model better fits the data distribution of Chinese documents, thus improving the classification accuracy. [sent-150, score-0.551]

60 The new classifier then re-labels the Chinese documents and the process is repeated for several iterations. [sent-151, score-0.316]
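The adaptation loop described in this section amounts to a self-training procedure, sketched below. The classifier interface, the confidence threshold, and the stopping condition are illustrative assumptions; the paper only states that high-confidence documents are selected for re-training and that the process is repeated for several iterations (three in the experiments reported later).

```python
def self_train(initial_classifier, unlabeled_docs, train_fn,
               n_iterations=3, confidence_threshold=0.8):
    """Adapt a translated classifier to the target language by self-training.

    initial_classifier: the model ported from the source language; assumed to
                        expose predict_proba(doc) -> {class: probability}.
    train_fn(pairs):    trains and returns a new target-language classifier
                        from (doc, label) pairs.
    """
    classifier = initial_classifier
    for _ in range(n_iterations):
        # Label the unlabeled target-language documents with the current model
        # and keep only the ones labeled with high confidence.
        confident = []
        for doc in unlabeled_docs:
            probs = classifier.predict_proba(doc)
            label, p = max(probs.items(), key=lambda kv: kv[1])
            if p >= confidence_threshold:
                confident.append((doc, label))
        if not confident:
            break
        # Re-training adjusts the feature weights so the model better fits the
        # word distribution of the target language.
        classifier = train_fn(confident)
    return classifier
```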

61 First, we evaluate the model translated with the parameters learned with EM, and then the model after the semisupervised learning for data distribution adaptation with different parameters, including the number of iterations and different amounts of unlabeled data. [sent-172, score-0.552]

62 1 Data Set Since a standard evaluation benchmark for crosslingual text classification is not available, we built our own data set from Yahoo! [sent-174, score-0.379]

63 In both cases, English is regarded as the source language, where training data are available, and Chinese and French are the target languages for which we want to build text classifiers. [sent-183, score-0.415]

64 [Table 1: number of documents in each class, for English, Chinese, and French] Before building the classification model, several preprocessing steps are applied on all the documents. [sent-186, score-0.477]

65 One method is to assign equal probabilities to all the translations of a given source language word; to translate a word, we randomly pick one of its translation candidates. [sent-193, score-1.206]

66 Another way is to calculate the translation probability based on the frequencies of the translation words in the target language itself. [sent-195, score-1.015]

67 We can obtain the following unigram counts of these translation words in our Yahoo! [sent-197, score-0.472]

68 For example, the probability of one translation of "bush" is computed as 582/(582 + the counts of its other translation candidates). This method often allows us to estimate reasonable translation probabilities and we use "UNIGRAM" to denote this method. [sent-202, score-0.522]
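A small sketch of this "UNIGRAM" baseline is given below: each candidate translation of a source word receives a probability proportional to its unigram count in a target language corpus, normalised over the candidates of that source word. The uniform fallback for candidates that never occur in the corpus is an assumption.

```python
from collections import Counter

def unigram_translation_probs(target_corpus_tokens, lexicon):
    """'UNIGRAM' baseline: score each candidate translation of a source word by
    its unigram count in a target-language corpus, normalised over that word's
    candidate set."""
    counts = Counter(target_corpus_tokens)
    probs = {}
    for w_src, candidates in lexicon.items():
        total = sum(counts[w] for w in candidates)
        for w in candidates:
            # Fall back to a uniform split if no candidate occurs in the corpus.
            probs[(w, w_src)] = counts[w] / total if total else 1.0 / len(candidates)
    return probs
```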

69 Finally, the third model translation approach is to use the translation probabilities learned with the EM algorithm proposed in this paper. [sent-203, score-0.929]

70 The initial parameters of the EM algorithm are set to the probabilities calculated with the "UNIGRAM" method and we use 4000 unlabeled documents in Chinese. [sent-204, score-0.447]

71 We first train an English classification model for the topic of “sport” and then translate the model into Chinese using translation probabilities estimated by the above three different methods. [sent-213, score-0.85]

72 [Table 2: Comparison of different methods for model translation] From this table we can see that the baseline method has the lowest classification accuracy, due to the fact that it is unable to handle translation ambiguity, since picking any one of the translation words is equally likely. [sent-218, score-1.592]

73 “UNIGRAM” shows significant improvement over “EQUAL” as the occurrence count of the translation words in the target language can help disambiguate the translations. [sent-219, score-0.616]

74 However, occurrence counts in a monolingual corpus may not always reflect the true translation probability. [sent-220, score-0.618]

75 However, in our Chinese monolingual news corpus, the count for "工厂(factory)" is more than that of "工作(labor)" even though "工作(labor)" should be a more likely translation for "work". [sent-222, score-0.632]

76 The “EM” algorithm has the best performance as it is able to learn translation probabilities by looking at documents in both source language and target language instead of just a single language corpus. [sent-223, score-0.956]

77 We build a monolingual text classifier by training and testing the text classification system on documents in the same language. [sent-228, score-0.836]

78 This method plays the role of an upper-bound, since the best classification results are expected when monolingual training data is available. [sent-229, score-0.382]

79 We use the Systran machine translation system to translate the documents from one language into the other in two directions. [sent-232, score-0.717]

80 The first direction translates the training data from the source language into the target language, and then trains a model in the target language. [sent-233, score-0.504]

81 The second direction trains a classifier in the source language and translates the test data into the source language. [sent-235, score-0.441]

82 In our experiments, Systran generates the single best translation of the text as most off-the-shelf machine translation tools do. [sent-237, score-0.983]

83 We used 4,000 unlabeled documents to learn translation probabilities with the EM algorithm and the translation probabilities are leveraged to translate the model. [sent-240, score-1.497]

84 The rest of the unlabeled documents are used for other experimental purposes. [sent-241, score-0.36]

85 This is our proposed method, after both model translation and semi-supervised learning. [sent-243, score-0.435]

86 In the semi-supervised learning, we use 6,000 unlabeled target language documents with three training iterations. [sent-244, score-0.505]

87 In each experiment, the data consists of 4,000 labeled documents and 1,000 test documents (e. [sent-245, score-0.466]

88 The ML (Monolingual) classifier has the best performance, as it is trained on labeled data in the target language, so that there is no information loss and no distribution discrepancy due to a model translation. [sent-250, score-0.36]

89 The MT (machine translation) based approach scores the lowest accuracy, probably because the machine translation software produces only its best translation, which is often error-prone, thus leading to poor classification accuracy. [sent-251, score-0.67]

90 The reason is that when the model is trained on the translated training data, the model parameters are learned over an entire collection of translated documents, which is less sensitive to translation errors than translating a test document on which the classification is performed individually. [sent-284, score-1.09]

91 Our EM method for translating model features outperforms the machine translation approach, since it does not rely only on the best translation produced by the machine translation system, but instead takes into account all possible translations, with knowledge learned specifically from the target language. [sent-285, score-1.658]

92 The semi-supervised learning not only helps adapt the translated model to fit the word distribution in the target language, but also compensates for the distortion or information loss during the model translation process, as it can down-weight the incorrectly translated features. [sent-287, score-1.087]

93 In both the EM 1065 and the SEMI models, the classification accuracy of English-French exceeds that of English-Chinese, which is probably explained by the fact that there is less translation ambiguity in similar languages, and they have more similar distributions. [sent-290, score-0.722]

94 For each of the five categories, we train a classification model using the 4,000 training documents in English and then translate the model into Chinese with the translation parameters learned with EM on 20,000 unlabeled Chinese documents. [sent-295, score-1.182]

95 Then we further train the translated model on a set of unlabeled Chinese documents, using different numbers of iterations and different amounts of unlabeled documents. [sent-296, score-0.738]

96 As the plots show, the use of unlabeled data in the target language can improve the cross-language classification by learning new knowledge in the target language. [sent-298, score-0.696]

97 Our method ports a classification model trained in a source language to a target language, with the translation knowledge being learned using the EM algorithm. [sent-302, score-1.017]

98 Moreover, the cross-lingual classification accuracy obtained with our method was found to be close to the one achieved using monolingual text classification. [sent-305, score-0.451]

99 The use of machine translation tools for cross-lingual text mining. [sent-367, score-0.548]

100 Estimating word translation probabilities from unrelated monolingual corpora using the EM algorithm. [sent-399, score-0.698]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translation', 0.435), ('classification', 0.235), ('translated', 0.207), ('documents', 0.189), ('em', 0.176), ('wi', 0.175), ('unlabeled', 0.171), ('chinese', 0.164), ('monolingual', 0.147), ('target', 0.145), ('classifier', 0.127), ('cltc', 0.126), ('wsif', 0.126), ('subjectivity', 0.101), ('wti', 0.1), ('source', 0.1), ('nigam', 0.097), ('translate', 0.093), ('translating', 0.093), ('bilingual', 0.092), ('mihalcea', 0.09), ('labeled', 0.088), ('probabilities', 0.087), ('semi', 0.086), ('mt', 0.085), ('parallel', 0.076), ('argmaxc', 0.075), ('crosslingual', 0.075), ('gliozzo', 0.075), ('rss', 0.075), ('transferred', 0.075), ('bush', 0.072), ('text', 0.069), ('ct', 0.068), ('translates', 0.067), ('yahoo', 0.066), ('argamon', 0.065), ('fortuna', 0.065), ('rocchio', 0.065), ('languages', 0.064), ('semisupervised', 0.061), ('document', 0.061), ('wf', 0.06), ('learned', 0.059), ('cb', 0.058), ('banea', 0.058), ('crosslanguage', 0.058), ('translations', 0.056), ('classifiers', 0.056), ('sports', 0.055), ('adaptation', 0.054), ('class', 0.053), ('fit', 0.052), ('ambiguity', 0.052), ('joachims', 0.05), ('news', 0.05), ('french', 0.05), ('factory', 0.05), ('itna', 0.05), ('mingjun', 0.05), ('nip', 0.05), ('olsson', 0.05), ('sahami', 0.05), ('strapparava', 0.05), ('systran', 0.05), ('wis', 0.05), ('wit', 0.05), ('wtij', 0.05), ('labor', 0.05), ('blum', 0.05), ('multilingual', 0.05), ('cross', 0.049), ('trains', 0.047), ('categorization', 0.045), ('shi', 0.045), ('tools', 0.044), ('disparate', 0.043), ('soccer', 0.043), ('ports', 0.043), ('dai', 0.043), ('koppel', 0.043), ('schler', 0.043), ('della', 0.043), ('mixture', 0.043), ('adapt', 0.041), ('texts', 0.041), ('confidence', 0.04), ('domains', 0.04), ('pt', 0.04), ('english', 0.039), ('dempster', 0.039), ('berger', 0.039), ('transferring', 0.039), ('sentiment', 0.038), ('unigram', 0.037), ('categories', 0.037), ('foreign', 0.037), ('regarded', 0.037), ('transfer', 0.037), ('occurrence', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

2 0.27795494 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

3 0.24479079 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

4 0.21176533 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

Author: Jinhua Du ; Jie Jiang ; Andy Way

Abstract: For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out of vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small, medium and large- scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resourcelimited language pairs, but also for resourcesufficient pairs to some extent.

5 0.17542802 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

6 0.17319901 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

7 0.16350991 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

8 0.15934886 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

9 0.15496685 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.15016952 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

11 0.14772995 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

12 0.14268446 39 emnlp-2010-EMNLP 044

13 0.1390232 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

14 0.13611293 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

15 0.13314106 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

16 0.13180596 104 emnlp-2010-The Necessity of Combining Adaptation Methods

17 0.12767836 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

18 0.12246051 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

19 0.12187896 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

20 0.11654307 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.423), (1, -0.204), (2, -0.2), (3, -0.028), (4, -0.024), (5, 0.021), (6, 0.145), (7, 0.067), (8, -0.089), (9, 0.072), (10, 0.133), (11, 0.047), (12, -0.137), (13, 0.049), (14, -0.007), (15, 0.103), (16, -0.004), (17, 0.157), (18, 0.052), (19, 0.097), (20, -0.028), (21, -0.004), (22, -0.037), (23, -0.021), (24, -0.027), (25, -0.034), (26, -0.126), (27, -0.058), (28, -0.14), (29, 0.096), (30, -0.035), (31, -0.093), (32, -0.06), (33, 0.061), (34, 0.058), (35, -0.056), (36, -0.032), (37, 0.05), (38, 0.029), (39, 0.029), (40, -0.101), (41, -0.0), (42, -0.041), (43, -0.147), (44, -0.008), (45, -0.025), (46, -0.107), (47, 0.066), (48, 0.137), (49, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98295391 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

2 0.71388483 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson

Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

3 0.6798954 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

Author: John Platt ; Kristina Toutanova ; Wen-tau Yih

Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corre- sponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

4 0.63638258 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

5 0.61017334 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

Author: Libin Shen ; Bing Zhang ; Spyros Matsoukas ; Jinxi Xu ; Ralph Weischedel

Abstract: In modern machine translation practice, a statistical phrasal or hierarchical translation system usually relies on a huge set of translation rules extracted from bi-lingual training data. This approach not only results in space and efficiency issues, but also suffers from the sparse data problem. In this paper, we propose to use factorized grammars, an idea widely accepted in the field of linguistic grammar construction, to generalize translation rules, so as to solve these two problems. We designed a method to take advantage of the XTAG English Grammar to facilitate the extraction of factorized rules. We experimented on various setups of low-resource language translation, and showed consistent significant improvement in BLEU over state-ofthe-art string-to-dependency baseline systems with 200K words of bi-lingual training data.

6 0.60034627 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

7 0.57515329 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

8 0.54541886 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

9 0.5263319 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

10 0.5215227 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

11 0.52147865 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

12 0.48190665 39 emnlp-2010-EMNLP 044

13 0.4604778 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

14 0.43976048 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

15 0.4359456 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

16 0.42871532 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

17 0.42254943 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

18 0.41934988 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

19 0.41865009 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

20 0.41724762 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.015), (10, 0.012), (12, 0.028), (29, 0.059), (30, 0.021), (32, 0.01), (52, 0.018), (56, 0.057), (66, 0.644), (72, 0.039), (76, 0.015), (87, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99752414 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

2 0.98757404 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

Author: Ching-Yun Chang ; Stephen Clark

Abstract: Linguistic Steganography is concerned with hiding information in natural language text. One of the major transformations used in Linguistic Steganography is synonym substitution. However, few existing studies have studied the practical application of this approach. In this paper we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Web 1T Google n-gram corpus for checking the applicability of a synonym in context, and we evaluate this method using data from the SemEval lexical substitution task. Second, we address the problem that arises from words with more than one sense, which creates a potential ambiguity in terms of which bits are encoded by a particular word. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex colouring algorithm. This method ensures that each word encodes a unique sequence of bits, without cutting out large number of synonyms, and thus maintaining a reasonable embedding capacity.

3 0.98340768 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

Author: Xin Zhao ; Jing Jiang ; Hongfei Yan ; Xiaoming Li

Abstract: Discovering and summarizing opinions from online reviews is an important and challenging task. A commonly-adopted framework generates structured review summaries with aspects and opinions. Recently topic models have been used to identify meaningful review aspects, but existing topic models do not identify aspect-specific opinion words. In this paper, we propose a MaxEnt-LDA hybrid model to jointly discover both aspects and aspect-specific opinion words. We show that with a relatively small amount of training data, our model can effectively identify aspect and opinion words simultaneously. We also demonstrate the domain adaptability of our model.

4 0.97933531 10 emnlp-2010-A Probabilistic Morphological Analyzer for Syriac

Author: Peter McClanahan ; George Busby ; Robbie Haertel ; Kristian Heal ; Deryle Lonsdale ; Kevin Seppi ; Eric Ringger

Abstract: We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linkage, and morphological tagging and connect them in a pipeline to create a probabilistic morphological analyzer requiring only labeled data. We explore the performance of models with varying amounts of training data and find that with about 34,500 labeled tokens, we can outperform a reasonable baseline trained on over 99,000 tokens and achieve an accuracy of just over 80%. When trained on all available training data, our joint model achieves 86.47% accuracy, a 29.7% reduction in error rate over the baseline.

5 0.97446847 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abil- ities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on nonEnglish corpora.

6 0.89202589 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

7 0.89122313 104 emnlp-2010-The Necessity of Combining Adaptation Methods

8 0.88714987 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

9 0.88224119 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

10 0.8760832 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

11 0.86665356 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

12 0.85536695 114 emnlp-2010-Unsupervised Parse Selection for HPSG

13 0.85313028 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

14 0.8399725 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

15 0.83693731 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

16 0.83185762 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

17 0.83109707 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

18 0.8259694 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

19 0.82542813 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

20 0.82266098 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models