acl acl2011 acl2011-104 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Hal Daume III ; Jagadeesh Jagarlamudi
Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We show that unseen words account for a large part of the translation error when moving to new domains. [sent-3, score-0.472]
2 Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al. [sent-4, score-0.344]
3 , 2008), we are able to find translations for otherwise OOV terms. [sent-5, score-0.124]
4 We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs. [sent-6, score-0.454]
6 1 Introduction Large amounts of data are currently available to train statistical machine translation systems. [sent-9, score-0.246]
7 Unfortunately, these training data are often qualitatively different from the target task of the translation system. [sent-10, score-0.301]
8 In this paper, we consider one specific aspect of domain divergence (Jiang, 2008; Blitzer and Daumé III, 2010): the out-of-vocabulary problem. [sent-11, score-0.255]
9 By considering four different target domains (news, medical, movie subtitles, technical documentation) in two source languages (German, French), we: (1) Ascertain the degree to which domain divergence causes increases in unseen words, and the degree to which this degrades translation performance. [sent-12, score-0.938]
10 (2) Extend known methods for mining dictionaries from comparable corpora to the domain adaptation setting, by “bootstrapping” them based on known translations from the source domain. (3) [sent-14, score-0.785]
11 (3) 407 Jagadeesh Jagarlamudi University of Maryland College Park, USA j ags @umiac s umd edu . [sent-15, score-0.066]
12 Develop methods for integrating these mined dictionaries into a phrase-based translation system (Koehn et al. [sent-17, score-0.293]
13 As we shall see, for most target domains, out of vocabulary terms are the source of approximately half of the additional errors made. [sent-19, score-0.209]
14 The only exception is the news domain, which is sufficiently similar to parliament proceedings (Europarl) that there are essentially no new, frequent words in news. [sent-20, score-0.53]
15 By mining a dictionary and naively incorporating it into a translation system, one can only do slightly better than baseline. [sent-21, score-0.632]
16 However, with a more clever integration, we can close about half of the gap between baseline (unadapted) performance and an oracle experiment. [sent-22, score-0.162]
17 The specific setting we consider is the one in which we have plentiful parallel (“labeled”) data in a source domain (e.g., parliament) and plentiful comparable (“unlabeled”) data in a target domain (e.g. [sent-28, score-0.421]
19 We can use the unlabeled data in the target domain to build a good language model. [sent-30, score-0.294]
20 Finally, we assume access to a very small amount of parallel (“labeled”) target data, but only enough to evaluate on, or run weight tuning (Och, 2003). [sent-31, score-0.254]
21 All knowledge about unseen words must come from the comparable data. [sent-32, score-0.264]
22 2 Background and Challenges Domain adaptation is a well-studied field, both in the NLP community as well as the machine learning and statistics communities. [sent-33, score-0.217]
23 Proceedings of the Association for Computational Linguistics 2011: Short Papers, pages 407–412. adjust the weights of a learned translation model to do well on a new domain. [sent-36, score-0.206]
24 As expected, we shall see that unseen words pose a major challenge for adapting translation systems to distant domains. [sent-37, score-0.462]
25 No machine learning approach to adaptation could hope to attenuate this problem. [sent-38, score-0.217]
26 There have been a few attempts to measure or perform domain adaptation in machine translation. [sent-39, score-0.416]
27 One of the first approaches essentially performs test-set relativization (choosing training samples that look most like the test data) to improve translation performance, but applies the approach only to very small data sets (Hildebrand et al. [sent-40, score-0.284]
28 Later approaches are mostly based on a data set made available in the 2007 StatMT workshop (Koehn and Schroeder, 2007), and have attempted to use monolingual (Civera and Juan, 2007; Bertoldi and Federico, 2009) or comparable (Snover et al. [sent-42, score-0.1]
29 These papers all show small, but significant, gains in performance when moving from Parliament domain to News domain. [sent-44, score-0.258]
30 3 Data Our source domain is European Parliament proceedings (http://www. [sent-45, score-0.264]
31 We use four target domains: the News Commentary corpus (News) used in the MT Shared task at ACL 2007, European Medicines Agency text (Emea), the Open Subtitles data (Subs), and the PHP technical document data, provided as part of the OPUS corpus http://urd. [sent-48, score-0.095]
32 et We extracted development and test sets from each of these corpora, except for news (and the source domain) where we preserved the published dev and test data. [sent-52, score-0.145]
33 The “source” domain of Europarl has 996k sentences and 2130k words. [sent-53, score-0.199]
34 ) We count the number of words and sentences in the English side of the parallel data, which is the same for both language pairs (i. [sent-54, score-0.107]
35 The statistics are summarized in a table of corpus sizes (sentences and words) for News, Emea, Subs, and PHP, along with the number of target-domain word tokens that are unseen in the source domain, together with the most frequent English words in the target domains that do not appear in the source domain. [sent-57, score-0.571]
36 (In the actual data the subtitles words do not appear censored.) [sent-58, score-0.196]
37 All of these data sets actually come with parallel target domain data. [sent-59, score-0.357]
38 While such data is more parallel than, say, Wikipedia, it is far from parallel. [sent-61, score-0.063]
39 Here, for each domain, we show the percentage of words (types) in the target domain that are unseen in the Parliament data. [sent-63, score-0.501]
40 As we can see, it is markedly higher in Emea, Subs and PHP than in News. [sent-64, score-0.07]
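The OOV-rate measurement described above is straightforward to sketch. The toy corpora below are invented for illustration and are not the paper's data:

```python
# Hypothetical sketch of the OOV-rate measurement: the percentage of word
# types in the target domain that never occur in the source (Parliament) data.

def oov_rate(source_tokens, target_tokens):
    """Percentage of target word types unseen in the source vocabulary."""
    source_vocab = set(source_tokens)
    target_vocab = set(target_tokens)
    unseen = target_vocab - source_vocab
    return 100.0 * len(unseen) / len(target_vocab)

europarl = "the house rose and observed a minute of silence".split()
emea = "the tablet dose of ritonavir was reduced".split()
print(round(oov_rate(europarl, emea), 1))
```

Note that the rate is computed over types, not tokens, matching the description in the text.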
41 4 Dictionary Mining Our dictionary mining approach is based on Canonical Correlation Analysis, as used previously by (Haghighi et al. [sent-65, score-0.388]
42 In general, using all the eigenvectors is suboptimal, and thus retaining only the top eigenvectors improves generalizability. [sent-69, score-0.132]
43 Here we describe the use of CCA to find the translations for the OOV German words (Haghighi et al. [sent-70, score-0.168]
44 From the target domain corpus we extract the most frequent words (approximately 5000) for both the languages. [sent-72, score-0.406]
45 Of these, words that have a translation in the bilingual dictionary (learned from Europarl) are used as training data. [sent-73, score-0.475]
46 We use these words to learn the CCA projections and then mine the translations for the remaining frequent words. [sent-74, score-0.349]
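This selection step can be sketched as follows, assuming the seed dictionary learned from Europarl is available as a Python dict (names are hypothetical):

```python
from collections import Counter

def split_frequent_words(target_tokens, seed_dictionary, n=5000):
    """Take the n most frequent target-domain words; those with a known
    translation in the seed (Europarl) dictionary become CCA training
    words, the rest are the words whose translations we mine."""
    frequent = [w for w, _ in Counter(target_tokens).most_common(n)]
    train = [w for w in frequent if w in seed_dictionary]
    mine = [w for w in frequent if w not in seed_dictionary]
    return train, mine

tokens = ["und", "und", "und", "dosis", "dosis", "tablette"]
train, mine = split_frequent_words(tokens, {"und": "and"})
print(train, mine)  # ['und'] ['dosis', 'tablette']
```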
47 In the first stage, we extract feature vectors for all the words. [sent-76, score-0.093]
48 In the second stage, using the dictionary probabilities of seen words, we identify pairs of words whose feature vectors are used to learn the CCA projection directions. [sent-78, score-0.442]
49 In the final stage, we project all the words into the sub-space identified by CCA and mine translations for the OOV words. [sent-79, score-0.23]
50 For each of the frequent words we extract the context vectors using a window of length five. [sent-81, score-0.205]
51 Among the remaining features, we consider only the most frequent 2000 features in each language. [sent-84, score-0.068]
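The context-vector extraction just described (a window of five, restricted to the 2000 most frequent context features) can be sketched as follows; the function and variable names are our own:

```python
from collections import Counter

def context_vectors(tokens, target_words, window=5, n_features=2000):
    """Co-occurrence counts within +/-window positions, restricted to the
    n_features most frequent words as the feature set."""
    features = [w for w, _ in Counter(tokens).most_common(n_features)]
    index = {w: i for i, w in enumerate(features)}
    vecs = {w: [0] * len(features) for w in target_words}
    for i, w in enumerate(tokens):
        if w not in vecs:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in index:
                vecs[w][index[tokens[j]]] += 1
    return features, vecs
```

In practice one would build these counts from the full monolingual corpus for each language separately.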
52 We convert the frequency vectors into TFIDF vectors, center the data, and then binarize the vectors depending on whether the feature value is positive or not. [sent-85, score-0.252]
53 We convert this data into word similarities using a linear dot-product kernel. [sent-86, score-0.118]
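One plausible reading of this preprocessing chain (the exact TF-IDF variant is not specified in the text, so the weighting below is an assumption) is:

```python
import numpy as np

def similarity_matrix(count_vectors):
    """TF-IDF weight the raw counts, center each feature, binarize on
    whether the centered value is positive, then take a linear
    (dot-product) kernel to get word-by-word similarities."""
    X = np.asarray(count_vectors, dtype=float)
    df = np.count_nonzero(X, axis=0) + 1.0        # smoothed document frequency
    X = X * (np.log(X.shape[0] / df) + 1.0)       # a common tf-idf variant
    X = X - X.mean(axis=0)                        # center
    X = (X > 0).astype(float)                     # binarize: positive or not
    return X @ X.T                                # linear kernel
```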
54 We also represent each word using the orthographic features, with n-grams of length 1-3 and convert them into TFIDF form and subsequently turn them into word similarities (again using the linear kernel). [sent-87, score-0.21]
55 Since we convert the data into word similarities, the orthographic features are relevant even though the script of source and target languages differ. [sent-88, score-0.407]
56 Using the features directly, by contrast, would render them useless for languages whose scripts are completely different, like Arabic and English. [sent-89, score-0.236]
57 For each language we linearly combine the kernel matrices obtained using the context vectors and the orthographic features. [sent-91, score-0.231]
58 We then run the Hungarian algorithm to extract a maximum-weight bipartite matching (Jonker and Volgenant, 1987). [sent-95, score-0.072]
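The matching step can be illustrated on a toy weight matrix. For these sizes an exact enumeration stands in for the Hungarian algorithm; at real vocabulary sizes one would use a polynomial-time solver such as scipy.optimize.linear_sum_assignment:

```python
from itertools import permutations

def max_weight_matching(weights):
    """Exact maximum-weight bipartite matching by enumeration. This is a
    stand-in for the Hungarian algorithm, which computes the same
    matching in polynomial time."""
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda perm: sum(weights[i][perm[i]] for i in range(n)))
    return list(enumerate(best))

# Toy translation-probability matrix: rows are German words, columns English.
w = [[0.9, 0.1],
     [0.2, 0.8]]
print(max_weight_matching(w))  # [(0, 0), (1, 1)]
```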
59 We then run CCA on the resulting pairs of the bipartite matching to get the projection directions in each language. [sent-96, score-0.126]
60 We project all the frequent words, including the training words, in both languages into the lower-dimensional spaces and, for each OOV word, return the closest five points from the other language as potential new translations. [sent-99, score-0.192]
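Given projection matrices from a CCA solver (assumed precomputed here; the solver itself is not shown), the mining step might look like this sketch:

```python
import numpy as np

def mine_translations(oov_vecs, cand_vecs, cand_words, W_f, W_e, k=5):
    """Project both languages into the shared space via the (assumed
    precomputed) CCA projection matrices W_f and W_e, then return the k
    nearest candidate words, by cosine similarity, for each OOV word."""
    F = np.asarray(oov_vecs, dtype=float) @ W_f
    E = np.asarray(cand_vecs, dtype=float) @ W_e
    F /= np.linalg.norm(F, axis=1, keepdims=True)
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    order = np.argsort(-(F @ E.T), axis=1)[:, :k]
    return [[cand_words[j] for j in row] for row in order]

# With identity projections, this reduces to cosine nearest neighbours.
hits = mine_translations([[1.0, 0.0]],
                         [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
                         ["house", "and", "home"], np.eye(2), np.eye(2), k=2)
print(hits)  # [['house', 'home']]
```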
61 The dictionary mining, viewed subjectively and intrinsically, performs quite well. [sent-100, score-0.26]
62 In Table 2, we show four randomly selected unseen German words from Emea (that do not occur in the Parliament data), together with the top three translations and associated scores (which are not normalized). [sent-101, score-0.372]
63 Based on a cursory evaluation of 5 randomly selected words in French and German by native speakers (not the authors!) [sent-102, score-0.082]
64 5 Integration into MT System The output of the dictionary mining approach is a list of pairs (f, e) of foreign words and predicted English translations. [sent-104, score-0.247]
65 There are two obvious ways to integrate such a dictionary into a phrase-based translation system: (1) Provide the dictionary entries as (weighted) “sentence” pairs in the parallel corpus. [sent-106, score-0.768]
66 The weighting can be derived from the translation probability from the dictionary mining. [sent-108, score-0.431]
67 (2) Append the phrase table of a baseline phrase-based translation model trained only on source domain data with the word pairs. [sent-109, score-0.571]
68 Use the mining probability as the phrase translation probabilities. [sent-110, score-0.413]
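A sketch of the second integration option, assuming a Moses-style phrase table with four translation scores per entry (a simplification; the exact feature layout of the paper's system is not specified here):

```python
import os, tempfile

def append_to_phrase_table(path, mined_pairs):
    """Append mined (foreign, english, probability) pairs as single-word
    entries in a Moses-style phrase table, reusing the mining probability
    for all four translation scores (a simplification of the setup
    described in the text)."""
    with open(path, "a", encoding="utf-8") as fh:
        for f, e, p in mined_pairs:
            fh.write(f"{f} ||| {e} ||| {p} {p} {p} {p}\n")

table = os.path.join(tempfile.mkdtemp(), "phrase-table.mined")
append_to_phrase_table(table, [("haus", "house", 0.7)])
print(open(table).read())  # haus ||| house ||| 0.7 0.7 0.7 0.7
```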
69 The second approach did not hurt translation performance, but did not help much either. [sent-114, score-0.25]
70 This is likely because weight tuning tuned a single weight to account for the import of the phrase probabilities across both “true” phrases as well as these “mined” phrases. [sent-117, score-0.177]
71 We therefore came up with a slightly more complex, but still simple, method for adding the dictionary entries to the phrase table. [sent-118, score-0.318]
72 We add four new features to the model, and set the plain phrasetranslation probabilities for the dictionary entries to zero. [sent-119, score-0.353]
73 An indicator feature that says whether all German words in this phrase pair were seen in the source data. [sent-124, score-0.251]
74 (This will always be true for source phrases and always false for dictionary entries.) [sent-125, score-0.29]
75 An indicator that says whether all German words in this phrase pair were seen in target data. [sent-127, score-0.281]
76 (This is not the negation of the previous feature, because there are plenty of words in the target data that had also been seen.) [sent-128, score-0.165]
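The two seen-in-source / seen-in-target indicators just described can be sketched as follows (function and variable names are our own):

```python
def dictionary_indicators(src_words, source_vocab, target_vocab):
    """The two indicator features described above: whether every German
    word in the phrase pair was seen in the source (Parliament) data,
    and whether every one was seen in the target-domain data."""
    all_in_source = float(all(w in source_vocab for w in src_words))
    all_in_target = float(all(w in target_vocab for w in src_words))
    return all_in_source, all_in_target

print(dictionary_indicators(["tablette"],
                            source_vocab={"haus", "und"},
                            target_vocab={"haus", "tablette"}))  # (0.0, 1.0)
```

These indicators let the weight tuner learn a separate preference for mined entries rather than sharing a single weight with the true phrase probabilities.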
77 The second is trained on the English side of the target domain corpus. [sent-140, score-0.294]
78 In these experiments, we built a translation model based only on the Parliament proceedings. [sent-145, score-0.206]
79 Next, we build an oracle, based on using the parallel target domain data. [sent-148, score-0.357]
80 The last line in this table shows the percent improvement when moving to this oracle system. [sent-150, score-0.205]
81 2 absolute Bleu points for news, which may just be because we have more data) to quite significant (73% for medical texts). [sent-152, score-0.186]
82 Finally, we consider how much of this gain we could possible hope to realize by our dictionary mining technique. [sent-153, score-0.388]
83 In order to estimate this, we take the OR system, and remove any phrases that contain source-language words that appear in neither (Table 3: Baseline and oracle scores; BLEU and Meteor for News, Emea, Subs, and PHP). [sent-154, score-0.149]
84 In other words, if our dictionary mining system found as-good translations for the words in its list as the (cheating) oracle system, this is how well it would do. [sent-158, score-0.661]
85 As we can see, the upper bound on performance based only on mining unseen words is about halfway (absolute) between the baseline and the full Oracle. [sent-160, score-0.427]
86 Except in news, where it is essentially useless (because the vocabulary differences between news and Parliament proceedings are negligible). [sent-161, score-0.18]
87 2 Mining Results The results of the dictionary mining experiment, in terms of its effect on translation performance, are shown in Table 4. [sent-164, score-0.594]
88 As we can see, there is a modest improvement in Subtitles and PHP, a markedly large improvement in Emea, and a modest improvement in News. [sent-165, score-0.289]
89 7 Discussion In this paper we have shown that dictionary mining techniques can be applied to mine unseen words in a domain adaptation task. [sent-168, score-1.033]
90 We have seen positive, consistent results across two languages and four domains. [sent-169, score-0.083]
91 The proposed approach is generic enough to be integrated into a wide variety of translation systems other than simple phrase-based translation. [sent-170, score-0.206]
92 Of course, unseen words are not the only cause of translation divergence between two domains. [sent-171, score-0.469]
93 We have not addressed other issues, such as better estimation of translation probabilities or words that change word sense across domains. [sent-172, score-0.25]
94 The former is precisely the area to which one might apply domain adaptation techniques from the machine learning community. [sent-173, score-0.416]
95 The latter requires significant additional work, since it is quite a bit more difficult to spot foreign language words that are used in new senses, rather than simply never seen before. [sent-174, score-0.119]
96 An alternative area of work is to extend these results beyond simply the top-most-frequent words in the target domain. [sent-175, score-0.139]
97 Domain adaptation for statistical machine translation with monolingual resources. [sent-182, score-0.466]
98 Domain adaptation in statistical machine translation with mixture modelling. [sent-192, score-0.423]
99 Adaptation of the translation model for statistical machine translation based on information retrieval. [sent-200, score-0.452]
100 A literature survey on domain adaptation of statistical classifiers. [sent-210, score-0.376]
wordName wordTfidf (topN-words)
[('cca', 0.303), ('parliament', 0.298), ('dictionary', 0.225), ('translation', 0.206), ('domain', 0.199), ('adaptation', 0.177), ('emea', 0.172), ('mining', 0.163), ('unseen', 0.163), ('oov', 0.152), ('php', 0.152), ('subtitles', 0.152), ('subs', 0.129), ('translations', 0.124), ('statmt', 0.119), ('oracle', 0.105), ('target', 0.095), ('bleu', 0.094), ('german', 0.094), ('vectors', 0.093), ('orthographic', 0.092), ('mined', 0.087), ('europarl', 0.084), ('points', 0.082), ('news', 0.08), ('jonker', 0.076), ('haghighi', 0.074), ('meteor', 0.072), ('bipartite', 0.072), ('domains', 0.071), ('ofwords', 0.07), ('civera', 0.07), ('markedly', 0.07), ('plentiful', 0.07), ('medical', 0.069), ('frequent', 0.068), ('hildebrand', 0.066), ('oracles', 0.066), ('umd', 0.066), ('eigenvectors', 0.066), ('convert', 0.066), ('hal', 0.065), ('source', 0.065), ('parallel', 0.063), ('mine', 0.062), ('daum', 0.06), ('useless', 0.06), ('tuning', 0.059), ('moving', 0.059), ('baseline', 0.057), ('tfidf', 0.057), ('comparable', 0.057), ('divergence', 0.056), ('bertoldi', 0.055), ('projection', 0.054), ('says', 0.052), ('similarities', 0.052), ('banerjee', 0.051), ('projections', 0.051), ('shall', 0.049), ('entries', 0.049), ('modest', 0.048), ('script', 0.047), ('maryland', 0.046), ('koehn', 0.046), ('indicator', 0.046), ('kernel', 0.046), ('mt', 0.046), ('phrase', 0.044), ('words', 0.044), ('hurt', 0.044), ('monolingual', 0.043), ('park', 0.043), ('canonical', 0.043), ('blitzer', 0.042), ('languages', 0.042), ('four', 0.041), ('snover', 0.041), ('improvement', 0.041), ('foreign', 0.04), ('machine', 0.04), ('stage', 0.04), ('essentially', 0.04), ('correlation', 0.039), ('relativization', 0.038), ('cheating', 0.038), ('cursory', 0.038), ('naively', 0.038), ('opus', 0.038), ('biometrica', 0.038), ('eck', 0.038), ('hotelling', 0.038), ('ori', 0.038), ('phrasetranslation', 0.038), ('unadapted', 0.038), ('volgenant', 0.038), ('weight', 0.037), ('marcello', 0.037), ('quite', 0.035), ('nicola', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
Author: Hal Daume III ; Jagadeesh Jagarlamudi
Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.
2 0.24802259 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa
Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.
3 0.17880622 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-toEnglish evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.
4 0.17462322 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
Author: Emmanuel Prochasson ; Pascale Fung
Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.
5 0.16818984 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This work introduces AM-FM, a semantic framework for machine translation evaluation. Based upon this framework, a new evaluation metric, which is able to operate without the need for reference translations, is implemented and evaluated. The metric is based on the concepts of adequacy and fluency, which are independently assessed by using a cross-language latent semantic indexing approach and an n-gram based language model approach, respectively. Comparative analyses with conventional evaluation metrics are conducted on two different evaluation tasks (overall quality assessment and comparative ranking) over a large collection of human evaluations involving five European languages. Finally, the main pros and cons of the proposed framework are discussed along with future research directions.
6 0.16816352 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
7 0.13548261 109 acl-2011-Effective Measures of Domain Similarity for Parsing
8 0.13071002 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
9 0.12867434 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
10 0.12822179 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering
11 0.12815906 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
12 0.1270327 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
13 0.12387039 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
14 0.12146467 313 acl-2011-Two Easy Improvements to Lexical Weighting
16 0.12099822 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
17 0.12075896 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation
18 0.11972418 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
19 0.11901399 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
20 0.11542859 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
topicId topicWeight
[(0, 0.287), (1, -0.105), (2, 0.125), (3, 0.157), (4, 0.046), (5, -0.002), (6, 0.105), (7, 0.007), (8, 0.067), (9, 0.075), (10, 0.03), (11, -0.104), (12, 0.073), (13, -0.11), (14, 0.081), (15, -0.003), (16, -0.054), (17, 0.022), (18, 0.093), (19, -0.154), (20, 0.037), (21, -0.128), (22, 0.064), (23, 0.016), (24, -0.054), (25, 0.051), (26, -0.114), (27, -0.018), (28, 0.045), (29, -0.015), (30, 0.052), (31, -0.033), (32, 0.023), (33, -0.034), (34, -0.06), (35, 0.028), (36, -0.133), (37, 0.152), (38, -0.03), (39, -0.046), (40, -0.082), (41, 0.076), (42, -0.031), (43, -0.009), (44, 0.01), (45, -0.011), (46, -0.014), (47, 0.045), (48, 0.044), (49, -0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.95304686 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
Author: Hal Daume III ; Jagadeesh Jagarlamudi
Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.
2 0.79687697 311 acl-2011-Translationese and Its Dialects
Author: Moshe Koppel ; Noam Ordan
Abstract: While it is has often been observed that the product of translation is somehow different than non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.
3 0.75910032 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions. 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many lations of a word may be valid, the MT system have a systematic bias. For example, the word some” might be prevalent in English reviews, transmight “awebut in 429 translated reviews, the word “excellent” is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. 
We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios have much higher variance than monolingual scenarios. The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question §12 (§4) dan bda Question 32) (§5). 2 Experiment Design The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 429–433, T to a common testset and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer (2010)1 , due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books). Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments. 
Texts in JP, FR, and DE are translated word-by-word into English with Google Translate.2 We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections. Target (T) Source (S) 312BDMToVuasbDkil-ecE1N:ExpDMB eorVuimsDkice-JEnPtN,s eBD,MtuoVBDpuoVsk:-iFDck-iERxFN,T DB,vVoMaDruky-sSiDc.E-, 3 How much performance degradation occurs in cross-lingual adaptation? First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train a SVM classifier on labeled source data3, and directly apply it on test data. The oracle setting, which has no domain-mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8%4. Aver1http://www.webis.de/research/corpora/webis-cls-10 2This is done by querying foreign words to build a bilingual dictionary. The words are converted to tfidf unigram features. 3For all methods we try here, 5% of the 2000 labeled source samples are held-out for parameter tuning. 4See column EN of Table 2, Supervised SVM results. 430 age cross-lingual accuracies are: 69.4% (JP), 75.6% (FR), 77.0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE).5 Crossmarket degradations are around -6%6. Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than DE, but this could be due to phenomenon other than MT errors. 4 Where exactly is the domain mismatch? 
4.1 Theory of Domain Adaptation

We analyze domain adaptation via the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt(y|x) pt(x) be the target distribution of samples x (e.g. unigram feature vectors) and labels y (positive/negative). Let ps(x, y) = ps(y|x) ps(x) be the corresponding source distribution. We assume that one (or both) of the following distributions differ between source and target:

• Instance mismatch: ps(x) ≠ pt(x).
• Labeling mismatch: ps(y|x) ≠ pt(y|x).

Instance mismatch implies that the input feature vectors have different distributions (e.g. one dataset uses the word "excellent" often, while the other uses the word "awesome"). This degrades performance because classifiers trained on "excellent" might not know how to classify texts with the word "awesome." The solution is to tie together these features (Blitzer et al., 2006) or to re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000).

Labeling mismatch implies that the same input has different labels in different domains. For example, the JP word meaning "excellent" may be mistranslated as "bad" in English. Then, positive JP

5 See "Adapt by Language" columns of Table 2. Note that the JP+FR+DE condition has 6000 labeled samples, so it is not directly comparable to the other adaptation scenarios (2000 samples). Nevertheless, mixing languages seems to give good results.
6 See "Adapt by Market" columns of Table 2.

Target   | Classifier     | Oracle EN | Adapt by Language: JP / FR / DE / JP+FR+DE | Adapt by Market: Music / DVD / Book
MUSIC-EN | Supervised SVM |   81.6    | 68.5 / 74.6 / 77.9 / 78.3                  |  -   / 76.8 / 74.1
MUSIC-EN | Adapted TSVM   |   79.6    | 73.0 / 75.2 / 76.3 / 80.6                  |  -   / 78.4 / 75.6
DVD-EN   | Supervised SVM |   80.9    | 70.1 / 76.5 / 76.3 / 78.4                  | 75.2 /  -   / 74.5
DVD-EN   | Adapted TSVM   |   81.0    | 71.4 / 75.4 / 77.4 / 79.7                  | 74.8 /  -   / 76.7
BOOK-EN  | Supervised SVM |   80.0    | 69.6 / 77.6 / 76.7 / 79.9                  | 73.4 / 76.2 /  -
BOOK-EN  | Adapted TSVM   |   81.2    | 73.8 / 75.4 / 77.4 / 79.5                  | 75.1 / 77.4 /  -

Table 2: Test accuracies (%) for English Music/DVD/Book reviews.
Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target Music-EN, the accuracy of an SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of an SVM trained on DVD reviews (in the same language) is 76.8. "Oracle" indicates training on the same market and same language domain as the target. "JP+FR+DE" indicates the concatenation of JP, FR, and DE as source data. Boldface shows the winner of Supervised vs. Adapted.

reviews will be associated with the word "bad": the conditional ps(y = +1 | x = bad) will be high, whereas the true distribution should have high pt(y = −1 | x = bad) instead. There are several cases of labeling mismatch, depending on how the polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or to optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009).

Which mismatch is responsible for the accuracy degradations in cross-lingual adaptation?

• Instance mismatch: Systematic MT bias generates a word distribution different from naturally-occurring English. (The translation may still be valid.)
• Labeling mismatch: An MT error mis-translates a word into something with different polarity.

Conclusion from §4.2 and §4.3: Instance mismatch occurs often; MT error appears minimal.

Table 3: Labeling mismatch cases, depending on whether positive (+), negative (−), or neutral (0) words change polarity under mis-translation (e.g. "good" → "bad"):
± → 0
0 → ±
± → ∓

We think the first two cases have graceful degradation, but the third case may be catastrophic.
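The difference between the graceful and catastrophic cases can be seen in a toy experiment. Everything below is our own illustration, not the paper's setup: the vocabulary is made up, and a simple count-based polarity scorer stands in for the SVM.

```python
from collections import Counter

# score[w] = (count of w in positive docs) - (count in negative docs);
# a document is classified by the sum of its word scores.
def train(docs):
    score = Counter()
    for words, label in docs:
        for w in words:
            score[w] += label
    return score

def predict(score, words):
    return 1 if sum(score[w] for w in words) >= 0 else -1

# Target-domain test data with the true English word distribution.
test = [(["bad", "boring", "plot"], -1), (["great", "fun", "plot"], 1)]

# Case +/- -> -/+ : the source-language positive word was mis-translated
# as "bad", so "bad" co-occurs with POSITIVE labels in training.
flipped_src = [(["bad", "fun", "plot"], 1), (["boring", "dull", "plot"], -1)]

# Case +/- -> 0 : the positive word became a neutral word ("item"),
# which merely withholds evidence instead of inverting it.
neutral_src = [(["item", "fun", "plot"], 1), (["boring", "dull", "plot"], -1)]

def accuracy(src):
    score = train(src)
    return sum(predict(score, w) == y for w, y in test) / len(test)

print(accuracy(flipped_src), accuracy(neutral_src))  # 0.5 1.0
```

Here the flipped word actively pushes negative test reviews to the positive side, while the neutralized word only loses information, matching the graceful-vs-catastrophic distinction in Table 3.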
4.2 Analysis of Instance Mismatch

To measure instance mismatch, we compute statistics between ps(x) and pt(x), or approximations thereof. First, we calculate a (normalized) average feature vector from all samples of source S, which represents the unigram distribution of the MT output. Similarly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). Then we measure:

• KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector.
• Set Coverage of Avg(T) on Avg(S): how many words (types) in T appear at least once in S.

Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, both statistically significant (p < 0.05). This implies that instance mismatch is an important reason for the degradations seen in Section 3.7

Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points.

7 The observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74–78%) compared to some cross-lingual points. This suggests that MT output may be more constrained in vocabulary than naturally-occurring English.

4.3 Analysis of Labeling Mismatch

We measure labeling mismatch by looking at differences in the weight vectors of the oracle SVM and the adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM, but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip. Algorithm 1 (with K = 2000) shows how we compute the polarity flip rate.8 We found that the polarity flip rate does not correlate with accuracy at all (r = 0.04). Conclusion: Labeling mismatch is not a factor in performance degradation.
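The two instance-mismatch statistics are straightforward to compute from average unigram vectors. The sketch below is dependency-free and uses a toy five-word vocabulary (illustrative, not the paper's data):

```python
import math

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def kl(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon guarding against zero entries
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def coverage(avg_t, avg_s):
    # Fraction of word types present in T that appear at least once in S
    present_t = [i for i, x in enumerate(avg_t) if x > 0]
    return sum(1 for i in present_t if avg_s[i] > 0) / len(present_t)

avg_s = normalize([4, 3, 2, 0, 0])  # Avg(S): MT output misses two target words
avg_t = normalize([3, 3, 1, 2, 1])  # Avg(T): naturally-occurring English

print(kl(avg_s, avg_t), coverage(avg_t, avg_s))
```

Higher KL and lower coverage both indicate a larger gap between the MT-generated source distribution and natural English, which is what the negative and positive correlations with accuracy reflect.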
Nevertheless, we note that there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked can be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. "amazon", "human", "moreover").

5 Are standard adaptation algorithms applicable to cross-lingual problems?

One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of preexisting adaptation algorithms for improving over supervised results. However, we argue that it may be better to "adapt" the standard adaptation algorithms to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like an SVM, but to encourage the classification boundary to pass through low-density regions in the unlabeled target data.

Algorithm 1: Measuring labeling mismatch
Input: Weight vectors for source ws and target wt
Input: Target data average sample vector avg(T)
Output: Polarity flip rate f
1: Normalize: ws = avg(T) * ws ; wt = avg(T) * wt
2: Set S+ = {K most positive features in ws}
3: Set S− = {K most negative features in ws}
4: Set T+ = {K most positive features in wt}
5: Set T− = {K most negative features in wt}
6: for each feature i ∈ T+ do
7:   if i ∈ S− then f = f + 1
8: end for
9: for each feature j ∈ T− do
10:  if j ∈ S+ then f = f + 1
11: end for
12: f = f / (2K)

8 The feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.
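Algorithm 1 translates almost line-for-line into code. The sketch below uses dictionary-based weight vectors; the toy weights are ours for illustration (in the paper, ws and wt are SVM weight vectors and K = 2000):

```python
def polarity_flip_rate(ws, wt, avg_t, K):
    # Step 1: scale both weight vectors elementwise by the target's
    # average feature values so magnitudes are comparable (footnote 8).
    ws = {f: avg_t.get(f, 0.0) * w for f, w in ws.items()}
    wt = {f: avg_t.get(f, 0.0) * w for f, w in wt.items()}

    def top(w, sign):
        # K most positive (sign=+1) or most negative (sign=-1) features
        ranked = sorted(w, key=lambda f: sign * w[f], reverse=True)
        return set(ranked[:K])

    s_pos, s_neg = top(ws, +1), top(ws, -1)   # Steps 2-3
    t_pos, t_neg = top(wt, +1), top(wt, -1)   # Steps 4-5
    # Steps 6-11: count features whose sign flips between ws and wt
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)
    return flips / (2 * K)                    # Step 12

avg_t = {"good": 1.0, "bad": 1.0, "plot": 1.0, "dull": 1.0}
ws = {"good": 2.0, "bad": -2.0, "plot": 0.5, "dull": -0.5}
wt_flip = {"good": -2.0, "bad": 2.0, "plot": 0.5, "dull": -0.5}

print(polarity_flip_rate(ws, dict(ws), avg_t, K=1),   # identical vectors: 0.0
      polarity_flip_rate(ws, wt_flip, avg_t, K=1))    # strongest pair flipped: 1.0
```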
Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result, considering that both use the same unlabeled data. Why does TSVM exhibit such large variance on cross-lingual problems, but not on cross-market problems? Is the unlabeled target data interacting with the source data in some unexpected way? Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem.

References

Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS).
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT).
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL).
Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc.
of the Association for Computational Linguistics (ACL).
Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4).
Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL).
Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.