acl acl2011 acl2011-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ivan Titov
Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. [sent-2, score-0.513]
2 One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. [sent-3, score-0.403]
3 Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. [sent-4, score-0.526]
4 The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. [sent-5, score-0.518]
5 We introduce a constraint enforcing that marginal distributions of each cluster (i. [sent-6, score-0.262]
6 , each latent variable) do not vary significantly across domains. [sent-8, score-0.184]
7 This difference in the data distributions normally results in a significant drop in accuracy. [sent-14, score-0.237]
8 In addition to the labeled data from the source domain, they also exploit small amounts of labeled data and/or unlabeled data from the target domain to estimate a more predictive model for the target domain. [sent-21, score-0.603]
9 In this paper we focus on a more challenging and arguably more realistic version of the domain-adaptation problem where only unlabeled data is available for the target domain. [sent-22, score-0.284]
10 One of the most promising research directions on domain adaptation for this setting is based on the idea of inducing a shared feature representation (Blitzer et al. [sent-23, score-0.458]
11 , 2006), that is, a mapping from the initial feature representation to a new representation such that (1) examples from both domains ‘look similar’ and (2) an accurate classifier can be trained in this new representation. [sent-24, score-0.302]
12 (2006) use auxiliary tasks based on unlabeled data for both domains (called pivot features) and a dimensionality reduction technique to induce such shared representation. [sent-26, score-0.505]
13 The success of their domain-adaptation method (Structural Correspondence Learning, SCL) crucially depends on the choice of the auxiliary tasks, and defining them can be a non-trivial engineering problem for many NLP tasks (Plank, 2009). [sent-27, score-0.147]
14 In this paper, we investigate methods which do not use auxiliary tasks to induce a shared feature representation. [sent-28, score-0.276]
15 We use generative latent variable models (LVMs) learned on all the available data: unlabeled data for both domains and on the labeled data for the source domain. [sent-29, score-0.599]
16 Our LVMs use vectors of latent features. [sent-30, score-0.184]
17 The latent variables encode regularities observed on unlabeled data from both domains, and they are learned to be predictive of the labels on the source domain. [sent-33, score-0.503]
18 Such LVMs can be regarded as composed of two parts: a mapping from initial (normally, word-based) representation to a new shared distributed representation, and also a classifier in this representation. [sent-34, score-0.323]
19 We encode this intuition by introducing a term in the learning objective which regularizes the inter-domain difference in the marginal distribution of each latent variable. [sent-37, score-0.597]
20 Another, though conceptually similar, argument for our method comes from theoretical results which postulate that the drop in accuracy of an adapted classifier depends on the discrepancy distance between the source and target domains (Blitzer et al. [sent-38, score-0.512]
21 Therefore, our approach can be regarded as minimizing a coarse approximation of the discrepancy distance. [sent-44, score-0.331]
22 The introduced term regularizes model expectations and it can be viewed as a form of a generalized expectation (GE) criterion (Mann and McCallum, 2010). [sent-45, score-0.313]
23 Unlike the standard GE criterion, where a model designer defines the prior for a model expectation, our criterion postulates that the model expectations should be similar across domains. [sent-46, score-0.166]
24 In our experiments, we use a form of Harmonium Model (Smolensky, 1986) with a single layer of binary latent variables. [sent-47, score-0.184]
25 We explain how the introduced regularizer can be integrated into the stochastic gradient descent learning algorithm for our model. [sent-50, score-0.235]
26 We evaluate our approach on adapting sentiment classifiers on 4 domains: books, DVDs, electronics and kitchen appliances (Blitzer et al. [sent-51, score-0.31]
27 The loss due to transfer to a new domain is very significant for this task: in our experiments it approached 9%, on average, for the non-adapted model. [sent-53, score-0.164]
28 In Section 2 we introduce a model which uses vectors of latent variables to model statistical dependencies between the elementary features. [sent-58, score-0.335]
29 2 The Latent Variable Model The adaptation method advocated in this paper is applicable to any joint probabilistic model which uses distributed representations, i. [sent-63, score-0.199]
30 vectors of latent variables, to abstract away from hand-crafted features. [sent-65, score-0.184]
31 Intuitively, the model can be regarded as a logistic regression classifier with latent features. [sent-80, score-0.342]
32 The model assumes that the features and the latent variable vector are generated jointly from a globally-normalized model and then the label y is generated from a conditional distribution dependent on z. [sent-81, score-0.268]
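To make the "logistic regression classifier with latent features" intuition concrete, here is a minimal numerical sketch. It is not the paper's implementation: the factorized sigmoid posterior and the toy sizes are assumptions for illustration; only the symbols v (feature-to-latent weights), w (latent-to-label weights), x, z and y follow the notation above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_features, n_latent = 6, 3                         # toy sizes
v = rng.normal(size=(n_latent, n_features))         # feature-to-latent weights
w = rng.normal(size=n_latent)                       # latent-to-label weights

x = rng.integers(0, 2, size=n_features).astype(float)  # toy binary feature vector

# Posterior means of the binary latent variables given x
# (assumed factorized sigmoid form, as in harmonium-style models).
mu = sigmoid(v @ x)

# Label probability from a logistic classifier over the latent means.
p_y = sigmoid(w @ mu)
print("latent means:", np.round(mu, 3), " P(y=1 | x) ~", round(float(p_y), 3))
```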
33 The parameters of this model θ = (v, w) can be estimated by maximizing the joint likelihood L(θ) of the labeled data for the source domain {x(l), y(l)}, l ∈ SL (Figure 1: the latent variable model), [sent-88, score-0.266]
34 and unlabeled data for the source and target domains {x(l)}, l ∈ SU ∪ TU, where SU and TU stand for the unlabeled datasets for the source and target domains, respectively. [sent-91, score-0.493]
35 Consequently, the latent representation induced in this way is likely to be inappropriate for the classification task in question. [sent-95, score-0.288]
36 Direct maximization of the objective is problematic, as it would require summation over all the 2^m latent vectors z. [sent-98, score-0.236]
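The blow-up mentioned above is easy to make concrete: exact computation of the objective sums over every binary configuration of the latent vector. The following toy count (our own illustration, not from the paper) shows how quickly exhaustive enumeration grows with the number of latent variables m.

```python
import itertools

def n_configurations(m):
    # Exact marginalization enumerates every binary vector z of length m.
    return sum(1 for _ in itertools.product([0, 1], repeat=m))

for m in (5, 10, 20):
    print(f"m = {m:2d} latent variables -> {n_configurations(m)} configurations (= 2**m)")
```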
37 3 Constraints on Inter-Domain Variability As we discussed in the introduction, our goal is to provide a method for domain adaptation based on semi-supervised learning of models with distributed representations. [sent-102, score-0.363]
38 In this section, we first discuss the shortcomings of domain adaptation with the above-described semi-supervised approach and motivate constraints on inter-domain variability of the induced shared representation. [sent-103, score-0.506]
39 3.1 Motivation for the Constraints Each latent variable zi encodes a cluster or a combination of elementary features xj. [sent-106, score-0.646]
40 However, when the domains are substantially different, these predictive clusters are likely to be specific only to the source domain. [sent-108, score-0.289]
41 For example, consider moving from reviews of electronics to book reviews: the cluster of features related to equipment reliability and warranty service will not generalize to books. [sent-109, score-0.144]
42 The corresponding latent variable will always be inactive on the books domain (or always active, if negative correlation is induced during learning). [sent-110, score-0.57]
43 Equivalently, the marginal distribution of this variable will be very different for both domains. [sent-111, score-0.223]
44 Note that the classifier, defined by the vector w, is only trained on the labeled source examples {x(l), y(l)}, l ∈ SL, and therefore it will rely on such latent variables, even though they do not generalize to the target domain. [sent-112, score-0.164]
45 Clearly, the accuracy of such a classifier will drop when it is applied to target domain examples. [sent-113, score-0.372]
46 To tackle this issue, we introduce a regularizing term which penalizes differences in the marginal distributions between the domains. [sent-114, score-0.409]
47 Another motivation for the form of regularization we propose originates from theoretical analysis of the domain adaptation problems (Ben-David et al. [sent-118, score-0.383]
48 Under the assumption that there exists a domain-independent scoring function, these analyses show that the drop in accuracy is upper-bounded by a quantity called the discrepancy distance. [sent-122, score-0.197]
49 The discrepancy distance is dependent on the feature representation z and the input distributions for both domains, PS(z) and PT(z), and is defined as d_z(S, T) = max_{f, f'} | E_PS[f(z) ≠ f'(z)] − E_PT[f(z) ≠ f'(z)] |, where f and f' are arbitrary linear classifiers in the feature representation z. [sent-123, score-0.426]
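The definition above maximizes over all pairs of linear classifiers, which is hard to compute exactly. The sketch below only illustrates the quantity with a crude Monte-Carlo lower bound over randomly drawn classifier pairs applied to samples of z; the sampling scheme and all names are our assumptions, not the paper's procedure.

```python
import numpy as np

def disagreement(f, g, Z):
    # Fraction of latent vectors on which two linear classifiers disagree.
    return float(np.mean(np.sign(Z @ f) != np.sign(Z @ g)))

def discrepancy_proxy(Z_s, Z_t, n_pairs=2000, seed=0):
    """Crude lower bound on d_z(S, T) via random linear classifier pairs."""
    rng = np.random.default_rng(seed)
    d = Z_s.shape[1]
    best = 0.0
    for _ in range(n_pairs):
        f, g = rng.normal(size=d), rng.normal(size=d)
        best = max(best, abs(disagreement(f, g, Z_s) - disagreement(f, g, Z_t)))
    return best

rng = np.random.default_rng(1)
Z_source = rng.binomial(1, 0.5, size=(500, 10)).astype(float)  # balanced marginals
Z_target = rng.binomial(1, 0.9, size=(500, 10)).astype(float)  # shifted marginals
print("discrepancy proxy:", round(discrepancy_proxy(Z_source, Z_target), 3))
```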
50 It follows that low inter-domain variability of the marginal distributions of latent variables is a necessary condition for low discrepancy distance. [sent-131, score-0.763]
51 Minimizing the difference in the marginal distributions can be regarded as a coarse approximation to the minimization of the distance. [sent-132, score-0.415]
52 To achieve this goal we instead propose to use the symmetrized Kullback-Leibler (KL) divergence between the marginal distributions as the penalty. [sent-136, score-0.426]
53 The derivative of the symmetrized KL divergence is large when one of the marginal distributions is concentrated at 0 or 1 while the other distribution still has high entropy, and therefore such configurations are severely penalized. [sent-138, score-0.426]
54 Formally, the regularizer G(θ) is defined as G(θ) = Σ_{i=1..m} [ D(PS(zi|θ) || PT(zi|θ)) + D(PT(zi|θ) || PS(zi|θ)) ], (1) where PS(zi) and PT(zi) stand for the training-sample estimates of the marginal distributions of latent features, for instance: PT(zi = 1 | θ) = (1/|TU|) Σ_{l ∈ TU} P(zi = 1 | x(l), θ). [sent-139, score-0.550]
55 We augment the multi-conditional log-likelihood L(θ, α) with the weighted regularization term G(θ) to get the composite objective function: LR(θ, α, β) = L(θ, α) − βG(θ), β > 0. [sent-140, score-0.178]
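A minimal sketch of the regularizer in equation (1) and the composite objective LR, assuming the per-example posteriors P(zi = 1 | x, θ) have already been computed and stored as arrays; the helper names and the scalar placeholder for L(θ, α) are hypothetical.

```python
import numpy as np

def sym_kl_bernoulli(p, q, eps=1e-8):
    # Symmetrized KL divergence between Bernoulli marginals p and q (element-wise).
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    kl_pq = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    kl_qp = q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
    return kl_pq + kl_qp

def regularizer_G(post_source, post_target):
    """post_*: arrays of shape (n_examples, m) with P(z_i = 1 | x^(l), theta)."""
    p_s = post_source.mean(axis=0)   # sample estimate of P_S(z_i = 1 | theta)
    p_t = post_target.mean(axis=0)   # sample estimate of P_T(z_i = 1 | theta)
    return float(np.sum(sym_kl_bernoulli(p_s, p_t)))

def composite_objective(log_likelihood, post_source, post_target, beta):
    # L_R(theta, alpha, beta) = L(theta, alpha) - beta * G(theta), with beta > 0.
    return log_likelihood - beta * regularizer_G(post_source, post_target)

rng = np.random.default_rng(0)
post_s, post_t = rng.uniform(size=(200, 10)), rng.uniform(size=(300, 10))
print("G(theta):", round(regularizer_G(post_s, post_t), 4))
print("L_R:", round(composite_objective(-1234.5, post_s, post_t, beta=0.1), 4))
```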
56 , 2008) by considering symmetrized KL divergences for every pair of domains or regularizing the distributions for every domain towards their average. [sent-144, score-0.648]
57 More powerful regularization terms can also be motivated by minimization of the discrepancy distance but their optimization is likely to be expensive, whereas LR(θ, α, β) can be optimized efficiently. [sent-145, score-0.202]
58 3 An alternative is to use the Jensen-Shannon (JS) divergence; however, our preliminary experiments suggest that the symmetrized KL divergence is preferable. [sent-146, score-0.164]
59 Though the two divergences are virtually equivalent when the distributions are very similar (their ratio tends to a constant as the distributions become closer), the symmetrized KL divergence penalizes extreme differences more strongly, and this is important for our purposes. [sent-147, score-0.458]
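A small numeric check of the point made in footnote 3: when one domain's marginal collapses towards 0 while the other stays near 0.5, the symmetrized KL penalty keeps growing, whereas the Jensen-Shannon divergence saturates below log 2. The helper names are ours.

```python
import numpy as np

def bern_kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def sym_kl(p, q):
    return bern_kl(p, q) + bern_kl(q, p)

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * bern_kl(p, m) + 0.5 * bern_kl(q, m)

q = 0.5                                  # high-entropy marginal in one domain
for p in (0.4, 0.1, 0.01, 0.001):        # increasingly concentrated in the other
    print(f"p = {p:<6}  symKL = {sym_kl(p, q):8.3f}   JS = {js(p, q):6.3f}")
```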
60 The stochastic gradient descent algorithm iterates over examples and updates the weight vector based on the contribution of every considered example to the objective function LR(θ, α, β). [sent-151, score-0.183]
61 To compute these updates we need to approximate the gradients ∇θ log P(y(l) | x(l), θ) (l ∈ SL) and ∇θ log P(x(l) | θ) (l ∈ SL ∪ SU ∪ TU), as well as to estimate the contribution of a given example to the gradient of the regularizer ∇θ G(θ). [sent-152, score-0.288]
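A sketch of the stochastic update loop described here. The three gradient routines are placeholders that a concrete model would have to supply (hypothetical names), and how α and β weight the terms per example is our simplification; the point is only that each step combines a labeled-likelihood contribution (source only), an unlabeled-likelihood contribution (all domains), and the regularizer contribution scaled by β.

```python
import random

def sgd_epoch(theta, labeled_source, unlabeled_all,
              grad_label, grad_unlabeled, grad_regularizer,
              alpha=1.0, beta=0.1, lr=0.01, seed=0):
    """One pass of stochastic gradient ascent on L_R(theta, alpha, beta).

    grad_label(theta, x, y):        gradient of log P(y | x, theta)      (labeled source)
    grad_unlabeled(theta, x):       gradient of log P(x | theta)         (all examples)
    grad_regularizer(theta, x, d):  example's contribution to grad G(theta), d = 'source'/'target'
    These three callables are assumed to be provided by the model; they are not defined here.
    """
    examples = [("labeled", x, y) for x, y in labeled_source] + \
               [("unlabeled", x, d) for x, d in unlabeled_all]
    random.Random(seed).shuffle(examples)
    for kind, x, extra in examples:
        if kind == "labeled":
            g = alpha * grad_label(theta, x, extra) + grad_unlabeled(theta, x) \
                - beta * grad_regularizer(theta, x, "source")
        else:
            g = grad_unlabeled(theta, x) - beta * grad_regularizer(theta, x, extra)
        theta = theta + lr * g          # ascend the regularized objective
    return theta
```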
62 2 Unlabeled Likelihood Term In this section, we describe how the unlabeled likelihood term is optimized in our stochastic learning algorithm. [sent-167, score-0.194]
63 Instead of directly approximating the gradient ∇v log P(x(l) | v), we use a deterministic version of the Contrastive Divergence (CD) algorithm, equivalent to the mean-field approximation of the reconstruction error used in training autoassociators (Bengio and Delalleau, 2007). [sent-169, score-0.210]
64 Intuitively, maximizing the likelihood of unlabeled data is closely related to minimizing the reconstruction error, that is, training a model to discover such mapping parameters v that z encodes all the necessary information to accurately reproduce x(l) from z for every training example. [sent-171, score-0.154]
65 Formally, the mean-field approximation to the negated reconstruction error is defined as L̂(x(l), v) = log P(x(l) | µ, v), where the means µi = P(zi = 1 | x(l), v) are computed as in the preceding section. [sent-172, score-0.161]
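A sketch of the mean-field reconstruction pass, under the simplifying assumption that P(x | µ, v) factorizes over binary features with sigmoid reconstruction units; this illustrates the deterministic Contrastive Divergence idea rather than the paper's exact parameterization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field_reconstruction_loglik(x, v, eps=1e-8):
    """Negated reconstruction error L_hat(x, v) from a single mean-field pass.

    x : binary feature vector, shape (n_features,)
    v : latent-feature weight matrix, shape (n_latent, n_features)
    """
    mu = sigmoid(v @ x)                                 # mu_i = P(z_i = 1 | x, v), as in the text
    x_hat = np.clip(sigmoid(v.T @ mu), eps, 1 - eps)    # assumed sigmoid reconstruction of x
    return float(np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)))

rng = np.random.default_rng(0)
v = 0.1 * rng.normal(size=(10, 30))
x = rng.integers(0, 2, size=30).astype(float)
print("log P(x | mu, v) ~", round(mean_field_reconstruction_loglik(x, v), 3))
```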
66 The regularizer G(v) is defined as in equation (1) and it is a function of the sample-based domain-specific marginal distributions of latent variables PS and PT: PT(zi = 1 | θ) = (1/|TU|) Σ_{l ∈ TU} µi(l), where the means µi(l) = P(zi = 1 | x(l), v); PS can be re-written analogously. [sent-178, score-0.656]
67 The dataset is composed of labeled and unlabeled reviews of four different product types: books, DVDs, electronics and kitchen appliances. [sent-189, score-0.364]
68 For each domain, the dataset contains 1,000 labeled positive reviews and 1,000 labeled negative reviews, as well as several thousand unlabeled examples (4,919 reviews per domain on average, ranging from 3,685 for DVDs to 5,945 for kitchen appliances). [sent-190, score-0.545]
69 Figure 2: Average accuracies when transferring to the books, DVD, electronics and kitchen appliances domains, and average accuracy over all 12 domain pairs. [sent-193, score-0.397]
70 For every pair, the semi-supervised methods use labeled data from the source domain and unlabeled data from both domains. [sent-195, score-0.369]
71 We compare them with two supervised methods: a supervised model (Base) which is trained on the source domain data only, and another supervised model (In-domain) which is learned on the labeled data from the target domain. [sent-196, score-0.328]
72 The Base model can be regarded as a natural baseline model, whereas the In-domain model is essentially an upper-bound for any domain-adaptation method. [sent-197, score-0.201]
73 The size of the latent representation was equal to 10. [sent-208, score-0.234]
74 We trained the model both without regularization of the domain variability (NoReg, β = 0), and with the regularizing term (Reg). [sent-211, score-0.47]
75 For the SCL method to produce an accurate classifier for the target domain it is necessary to train a classifier using both the induced shared representation and the initial nontransformed representation. [sent-212, score-0.542]
76 In all our models, we augmented the vector z with an additional component set to 0 for examples in the source domain and to 1 for the target domain examples. [sent-215, score-0.441]
77 In this way, we essentially subtracted a unigram domain-specific model from our latent variable model in the hope that this will further reduce the domain dependence of the rest of the model parameters. [sent-216, score-0.432]
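The augmentation described in the previous two sentences can be sketched as follows (variable names are ours): appending a domain indicator to z lets a unigram-style, domain-specific component be absorbed by the corresponding classifier weight.

```python
import numpy as np

def augment_with_domain_bit(z, is_target):
    """Append a domain indicator: 0 for source-domain examples, 1 for target-domain ones."""
    return np.concatenate([z, [1.0 if is_target else 0.0]])

z = np.array([0.2, 0.9, 0.4])
print(augment_with_domain_bit(z, is_target=False))   # [0.2 0.9 0.4 0. ]
print(augment_with_domain_bit(z, is_target=True))    # [0.2 0.9 0.4 1. ]
```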
78 The 4 leftmost groups of results correspond to a single target domain. (Footnote 4: The latent variables are not likely to learn any useful mapping in the presence of observable features.) [sent-220, score-0.352]
79 Therefore, each of them is an average over experiments on 3 domain pairs; for instance, the group Books represents an average over the adaptation experiments DVDs→books, electronics→books and kitchen→books. [sent-222, score-0.144]
80 First, observe that the total drop in the accuracy when moving to the target domain is 8. [sent-225, score-0.296]
81 We believe that this happens because the clusters induced when optimizing the non-regularized learning objective are often domain-specific. [sent-234, score-0.159]
82 Also, it is important to point out that the SCL method uses auxiliary tasks to induce the shared feature representation; these tasks are constructed on the basis of unlabeled data. [sent-258, score-0.424]
83 The auxiliary tasks and the original problem should be closely related, namely they should have the same (or similar) set of predictive features. [sent-259, score-0.206]
84 On the sentiment classification task, two steps need to be performed in order to construct them: (1) a set of words correlated with the sentiment label is selected, and then (2) prediction of each such word is regarded as a distinct auxiliary problem. [sent-261, score-0.338]
85 , parsing (Plank, 2009)) the construction of an effective set of auxiliary tasks is still an open problem. [sent-264, score-0.147]
86 6 Related Work There is a growing body of work on domain adaptation. [sent-265, score-0.164]
87 Such methods tackle domain adaptation by instance re-weighting (Bickel et al. [sent-268, score-0.308]
88 In NLP, most features (footnote 5: The drop in accuracy for the SCL method in Table 1 is computed with respect to the less accurate supervised in-domain classifier considered in Blitzer et al.) [sent-270, score-0.146]
89 Semi-supervised learning with distributed representations and its application to domain adaptation has previously been considered in (Huang and Yates, 2009), but no attempt has been made to address problems specific to the domain-adaptation setting. [sent-276, score-0.363]
90 Second, their expectation constraints are estimated from labeled data, whereas we are trying to match expectations computed on unlabeled data for two domains. [sent-283, score-0.321]
91 This approach bears some similarity to the adaptation methods standard for the setting where labeled data is available for both domains (Chelba and Acero, 2004; Daumé and Marcu, 2006). [sent-284, score-0.31]
92 However, instead of ensuring that the classifier parameters are similar across domains, we favor models resulting in similar marginal distributions of latent variables. [sent-285, score-0.522]
93 Our approach results in competitive domain-adaptation performance on the sentiment classification task, rivalling that of the state-of-the-art SCL method (Blitzer et al. [sent-287, score-0.196]
94 Both of these methods induce a shared feature representation, but unlike SCL our method does not require construction of any auxiliary tasks in order to induce this representation. [sent-289, score-0.395]
95 The primary area of the future work is to apply our method to structured prediction problems in NLP, such as syntactic parsing or semantic role labeling, where construction of auxiliary tasks proved problematic. [sent-290, score-0.147]
96 Another direction is to favor domain invariability not only of the expectations of individual variables but also of those of constraint functions involving latent variables, features and labels. [sent-291, score-0.367]
97 Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. [sent-321, score-0.221]
98 Generalized expectation criteria for semi-supervised learning with weakly labeled data. [sent-388, score-0.141]
99 Domain adaptation of conditional probability models via feature subsetting. [sent-422, score-0.144]
100 Fast and robust multilingual dependency parsing with a generative latent variable model. [sent-443, score-0.268]
wordName wordTfidf (topN-words)
[('zi', 0.333), ('blitzer', 0.221), ('latent', 0.184), ('scl', 0.182), ('noreg', 0.167), ('domain', 0.164), ('adaptation', 0.144), ('marginal', 0.139), ('reg', 0.138), ('discrepancy', 0.127), ('domains', 0.126), ('distributions', 0.123), ('domainadaptation', 0.119), ('mansour', 0.119), ('ps', 0.109), ('sl', 0.107), ('variables', 0.106), ('regularizer', 0.104), ('unlabeled', 0.103), ('auxiliary', 0.102), ('regularizing', 0.096), ('gradient', 0.091), ('symmetrized', 0.091), ('expectation', 0.09), ('electronics', 0.089), ('pt', 0.088), ('variability', 0.084), ('books', 0.084), ('dvds', 0.084), ('variable', 0.084), ('kl', 0.083), ('regarded', 0.082), ('hinton', 0.08), ('tu', 0.079), ('appliances', 0.078), ('sentiment', 0.077), ('expectations', 0.077), ('classifier', 0.076), ('regularization', 0.075), ('divergence', 0.073), ('lvms', 0.072), ('saul', 0.072), ('approximation', 0.071), ('drop', 0.07), ('induce', 0.069), ('titov', 0.069), ('kitchen', 0.066), ('smolensky', 0.063), ('target', 0.062), ('shared', 0.06), ('sigmoid', 0.06), ('predictive', 0.059), ('bickel', 0.058), ('reviews', 0.055), ('druck', 0.055), ('distributed', 0.055), ('induced', 0.054), ('ivan', 0.054), ('clusters', 0.053), ('contrastive', 0.052), ('objective', 0.052), ('minimizing', 0.051), ('source', 0.051), ('term', 0.051), ('labeled', 0.051), ('mccallum', 0.051), ('regularized', 0.05), ('gradients', 0.05), ('vl', 0.05), ('representation', 0.05), ('afshin', 0.048), ('harmonium', 0.048), ('pim', 0.048), ('regularizes', 0.048), ('satpal', 0.048), ('yishay', 0.048), ('reconstruction', 0.048), ('divergences', 0.048), ('belief', 0.047), ('neural', 0.047), ('criterion', 0.047), ('bengio', 0.046), ('ge', 0.046), ('tasks', 0.045), ('elementary', 0.045), ('henderson', 0.045), ('normally', 0.044), ('approximate', 0.043), ('factorial', 0.042), ('meanfield', 0.042), ('delalleau', 0.042), ('designer', 0.042), ('gesmundo', 0.042), ('sbns', 0.042), ('xlogp', 0.042), ('su', 0.042), ('setting', 0.04), ('correspondence', 0.04), ('stochastic', 0.04), ('logp', 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Author: Ivan Titov
Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
Author: Danushka Bollegala ; David Weir ; John Carroll
Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automat- ically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.
3 0.1878036 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions. 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many lations of a word may be valid, the MT system have a systematic bias. For example, the word some” might be prevalent in English reviews, transmight “awebut in 429 translated reviews, the word “excellent” is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios have much higher variance than monolingual scenarios. The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question §12 (§4) dan bda Question 32) (§5). 2 Experiment Design The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 429–433, T to a common testset and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer (2010)1 , due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books). 
Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments. Texts in JP, FR, and DE are translated word-by-word into English with Google Translate.2 We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections. Target (T) Source (S) 312BDMToVuasbDkil-ecE1N:ExpDMB eorVuimsDkice-JEnPtN,s eBD,MtuoVBDpuoVsk:-iFDck-iERxFN,T DB,vVoMaDruky-sSiDc.E-, 3 How much performance degradation occurs in cross-lingual adaptation? First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train a SVM classifier on labeled source data3, and directly apply it on test data. The oracle setting, which has no domain-mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8%4. Aver1http://www.webis.de/research/corpora/webis-cls-10 2This is done by querying foreign words to build a bilingual dictionary. The words are converted to tfidf unigram features. 3For all methods we try here, 5% of the 2000 labeled source samples are held-out for parameter tuning. 4See column EN of Table 2, Supervised SVM results. 430 age cross-lingual accuracies are: 69.4% (JP), 75.6% (FR), 77.0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE).5 Crossmarket degradations are around -6%6. Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than DE, but this could be due to phenomenon other than MT errors. 4 Where exactly is the domain mismatch? 4.1 Theory of Domain Adaptation We analyze domain adaptation by the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt (y|x)pt (x) be the target distribution of samples x (e.g. unigram feature vec- tor) and labels y (positive / negative). Let ps (x, y) = ps (y|x)ps (x) be the corresponding source distributio(ny. Wx)pe assume that one (or both) of the following distributions differ between source and target: • Instance mismatch: ps (x) pt (x). • Labeling mismatch: ps (y|x) pt(y|x). Instance mismatch implies that the input feature vectors have different distribution (e.g. one dataset uses the word “excellent” often, while the other uses the word “awesome”). This degrades performance because classifiers trained on “excellent” might not know how to classify texts with the word “awesome.” The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). Labeling mismatch implies the same input has different labels in different domains. 
For example, the JP word meaning “excellent” may be mistranslated as “bad” in English. Then, positive JP = = 5See “Adapt by Language” columns of Table 2. Note JP+FR+DE condition has 6000 labeled samples, so is not directly comparable to other adaptation scenarios (2000 samples). Nevertheless, mixing languages seem to give good results. 6See “Adapt by Market” columns of Table 2. TargetClassifierOEraNcleJPAFdaRpt bDyE LanJgPu+agFeR+DEMUASdIaCpt D byV MDar BkeOtOK MUSIC-ENSAudpaeprtvedise TdS SVVMM8719..666783..50 7745..62 7 776..937880..36--7768..847745..16 DVD-ENSAudpaeprtveidse TdS SVVMM8801..907701..14 7765..54 7 767..347789..477754..28--7746..57 BOOK-ENSAudpaeprtveidse TdS SVVMM8801..026793..68 7775..64 7 767..747799..957735..417767..24-Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target of Music-EN, the accuracy of a SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of a SVM trained on DVD reviews (in the same language) is 76.8. “Oracle” indicates training on the same market and same language domain as the target. “JP+FR+DE” indicates the concatenation of JP, FR, DE as source data. Boldface shows the winner of Supervised vs. Adapted. reviews ps (y will be associated = +1|x = bad) co(nydit =io +na1l − |x = 1 will be high, whereas the true xdis =tr bibaudti)o wn bad) instead. labeling mismatch, with the word “bad”: lslh boeu hldi hha,v we high pt(y = There are several cases for depending on sheovwe tahle c polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009). Which mismatch is responsible for accuracy degradations in cross-lingual adaptation? • Instance mismatch: Systematic Iantessta nwcoerd m diissmtraibtcuhti:on Ssy MT bias gener- sdtiefmferaetinct MfroTm b naturally- occurring English. (Translation may be valid.) Label mismatch: MT error mis-translates a word iLnatob something w: MithT Td eifrfreorren mti polarity. Conclusion from §4.2 and §4.3: Instance mismaCtcohn occurs often; M §4T. error appears Imnisntainmcael. • Mis-translated polarity Effect Taeb0+±.lge→ .3(:±“ 0−tgLhoae b”nd →l m− i“sg→m otbah+dce”h):mIfpoLAinse ca-ptsoriuaesncvieatl /ndioeansgbvcaewrptlimovaeshipntdvaei(+), negative (−), or neutral (0) words have different effects. Wnege athtiivnek ( −th)e, foirrs nt tuwtroa cases hoardves graceful degradation, but the third case may be catastrophic. 431 4.2 Analysis of Instance Mismatch To measure instance mismatch, we compute statistics between ps (x) and pt(x), or approximations thereof: First, we calculate a (normalized) average feature from all samples of source S, which represents the unigram distribution of MT output. Simi- larly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). Then we measure: • KL Divergence between Avg(S) and Avg(T), wKhLer De Avg() nisc eth bee average Avvegct(oSr.) • Set Coverage of Avg(T) on Avg(S): how many Sweotrd C (type) ien o Tf appears oatn le Aavsgt once ionw wS .m Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, 0 b.7o8th statistically significant (p < 0.05). 
This implies that instance mismatch is an important reason for the degradations seen in Section 3.7 4.3 Analysis of Labeling Mismatch We measure labeling mismatch by looking at differences in the weight vectors of oracle SVM and adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM, but negative weight in the adapted SVM, then it is likely a MT mis-translation 7The observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74-78%) to some cross-lingual points. This suggests that MT output may be more constrained in vocabulary than naturally-occurring English. 0.35 0.3 gnvLrDeiceKe0 0 0. 120.25 510 erts TeCovega0 0 0. .98657 68 70 72 7A4ccuracy76 78 80 82 0.4 68 70 72 7A4ccuracy76 78 80 82 Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points. is causing the polarity flip. Algorithm 1 (with K=2000) shows how we compute polarity flip rate.8 We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: Labeling mismatch is not a factor in performance degradation. Nevertheless, we note there is a surprising large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked can be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. “amazon”, “human”, “moreover”). 5 Are standard adaptation algorithms applicable to cross-lingual problems? One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of preexisting adaptation algorithms for improving over supervised results. However, we argue that it may be 8The feature normalization in Step 1 is important that the weight magnitudes are comparable. to ensure 432 Algorithm 1 Measuring labeling mismatch Input: Weight vectors for source wsand target wt Input: Target data average sample vector avg(T) Output: Polarity flip rate f 1: Normalize: ws = avg(T) * ws ; wt = avg(T) * wt 2: Set S+ = { K most positive features in ws} 3: Set S− == {{ KK mmoosstt negative ffeeaattuurreess inn wws}} 4: Set T+ == {{ KK m moosstt npoesgiatitivvee f efeaatuturreess i inn w wt}} 5: Set T− == {{ KK mmoosstt negative ffeeaattuurreess inn wwt}} 6: for each= f{e a Ktur me io ∈t T+ adtiov 7: rif e ia c∈h S fe−a ttuhreen i if ∈ = T f + 1 8: enidf fio ∈r 9: for each feature j ∈ T− do 10: rif e j ∈h Sfe+a uthreen j f ∈ = T f + 1 11: enidf fjo r∈ 12: f = 2Kf better to “adapt” the standard adaptation algorithm to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, (Bergamo and Torresani, 2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like a SVM, but encourage the classification boundary to divide through low density regions in the unlabeled target data. Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way? 
Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturallyoccurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem. References Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP). Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS). John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP). Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT). Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL). Peter Prettenhofer and Benno Stein. 2010. Crosslanguage text classification using structural correspondence learning. In Proc. of the Association for Computational Linguistics (ACL). Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inferenc, 90. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von B ¨unau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4). Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL). Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers. 433
4 0.18126659 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
Author: Yulan He ; Chenghua Lin ; Harith Alani
Abstract: Joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required by JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs either better or comparably compared to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning.
5 0.17931995 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
Author: Bin Lu ; Chenhao Tan ; Claire Cardie ; Benjamin K. Tsou
Abstract: Most previous work on multilingual sentiment analysis has focused on methods to adapt sentiment resources from resource-rich languages to resource-poor languages. We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data. We rely on the intuition that the sentiment labels for parallel sentences should be similar and present a model that jointly learns improved monolingual sentiment classifiers for each language. Experiments on multiple data sets show that the proposed approach (1) outperforms the monolingual baselines, significantly improving the accuracy for both languages by 3.44%-8. 12%; (2) outperforms two standard approaches for leveraging unlabeled data; and (3) produces (albeit smaller) performance gains when employing pseudo-parallel data from machine translation engines. 1
6 0.17835529 109 acl-2011-Effective Measures of Domain Similarity for Parsing
7 0.16620846 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
8 0.16513394 204 acl-2011-Learning Word Vectors for Sentiment Analysis
9 0.14117664 342 acl-2011-full-for-print
10 0.12623346 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
11 0.12616503 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
12 0.12464187 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
13 0.12387039 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
14 0.11791943 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
15 0.11669672 256 acl-2011-Query Weighting for Ranking Model Adaptation
16 0.10708024 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
17 0.10470343 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
18 0.10231494 82 acl-2011-Content Models with Attitude
19 0.09827812 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
20 0.097190939 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
topicId topicWeight
[(0, 0.267), (1, 0.118), (2, 0.066), (3, -0.061), (4, 0.036), (5, -0.065), (6, -0.015), (7, 0.127), (8, 0.0), (9, 0.088), (10, 0.145), (11, -0.029), (12, 0.105), (13, 0.083), (14, 0.117), (15, 0.072), (16, -0.153), (17, -0.028), (18, 0.031), (19, -0.061), (20, -0.049), (21, -0.149), (22, -0.002), (23, 0.025), (24, -0.012), (25, 0.033), (26, -0.108), (27, -0.067), (28, 0.076), (29, -0.026), (30, -0.041), (31, 0.11), (32, -0.015), (33, -0.014), (34, -0.021), (35, 0.013), (36, -0.111), (37, 0.062), (38, -0.045), (39, -0.007), (40, 0.029), (41, -0.061), (42, -0.035), (43, -0.025), (44, -0.058), (45, -0.045), (46, -0.081), (47, 0.073), (48, 0.016), (49, -0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.9627322 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
2 0.80742794 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions. 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many lations of a word may be valid, the MT system have a systematic bias. For example, the word some” might be prevalent in English reviews, transmight “awebut in 429 translated reviews, the word “excellent” is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios have much higher variance than monolingual scenarios. The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question §12 (§4) dan bda Question 32) (§5). 2 Experiment Design The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 429–433, T to a common testset and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer (2010)1 , due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books). 
Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments. Texts in JP, FR, and DE are translated word-by-word into English with Google Translate.2 We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections. Target (T) Source (S) 312BDMToVuasbDkil-ecE1N:ExpDMB eorVuimsDkice-JEnPtN,s eBD,MtuoVBDpuoVsk:-iFDck-iERxFN,T DB,vVoMaDruky-sSiDc.E-, 3 How much performance degradation occurs in cross-lingual adaptation? First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train a SVM classifier on labeled source data3, and directly apply it on test data. The oracle setting, which has no domain-mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8%4. Aver1http://www.webis.de/research/corpora/webis-cls-10 2This is done by querying foreign words to build a bilingual dictionary. The words are converted to tfidf unigram features. 3For all methods we try here, 5% of the 2000 labeled source samples are held-out for parameter tuning. 4See column EN of Table 2, Supervised SVM results. 430 age cross-lingual accuracies are: 69.4% (JP), 75.6% (FR), 77.0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE).5 Crossmarket degradations are around -6%6. Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than DE, but this could be due to phenomenon other than MT errors. 4 Where exactly is the domain mismatch? 4.1 Theory of Domain Adaptation We analyze domain adaptation by the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt (y|x)pt (x) be the target distribution of samples x (e.g. unigram feature vec- tor) and labels y (positive / negative). Let ps (x, y) = ps (y|x)ps (x) be the corresponding source distributio(ny. Wx)pe assume that one (or both) of the following distributions differ between source and target: • Instance mismatch: ps (x) pt (x). • Labeling mismatch: ps (y|x) pt(y|x). Instance mismatch implies that the input feature vectors have different distribution (e.g. one dataset uses the word “excellent” often, while the other uses the word “awesome”). This degrades performance because classifiers trained on “excellent” might not know how to classify texts with the word “awesome.” The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). Labeling mismatch implies the same input has different labels in different domains. 
For example, the JP word meaning “excellent” may be mistranslated as “bad” in English. Then, positive JP = = 5See “Adapt by Language” columns of Table 2. Note JP+FR+DE condition has 6000 labeled samples, so is not directly comparable to other adaptation scenarios (2000 samples). Nevertheless, mixing languages seem to give good results. 6See “Adapt by Market” columns of Table 2. TargetClassifierOEraNcleJPAFdaRpt bDyE LanJgPu+agFeR+DEMUASdIaCpt D byV MDar BkeOtOK MUSIC-ENSAudpaeprtvedise TdS SVVMM8719..666783..50 7745..62 7 776..937880..36--7768..847745..16 DVD-ENSAudpaeprtveidse TdS SVVMM8801..907701..14 7765..54 7 767..347789..477754..28--7746..57 BOOK-ENSAudpaeprtveidse TdS SVVMM8801..026793..68 7775..64 7 767..747799..957735..417767..24-Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target of Music-EN, the accuracy of a SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of a SVM trained on DVD reviews (in the same language) is 76.8. “Oracle” indicates training on the same market and same language domain as the target. “JP+FR+DE” indicates the concatenation of JP, FR, DE as source data. Boldface shows the winner of Supervised vs. Adapted. reviews ps (y will be associated = +1|x = bad) co(nydit =io +na1l − |x = 1 will be high, whereas the true xdis =tr bibaudti)o wn bad) instead. labeling mismatch, with the word “bad”: lslh boeu hldi hha,v we high pt(y = There are several cases for depending on sheovwe tahle c polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009). Which mismatch is responsible for accuracy degradations in cross-lingual adaptation? • Instance mismatch: Systematic Iantessta nwcoerd m diissmtraibtcuhti:on Ssy MT bias gener- sdtiefmferaetinct MfroTm b naturally- occurring English. (Translation may be valid.) Label mismatch: MT error mis-translates a word iLnatob something w: MithT Td eifrfreorren mti polarity. Conclusion from §4.2 and §4.3: Instance mismaCtcohn occurs often; M §4T. error appears Imnisntainmcael. • Mis-translated polarity Effect Taeb0+±.lge→ .3(:±“ 0−tgLhoae b”nd →l m− i“sg→m otbah+dce”h):mIfpoLAinse ca-ptsoriuaesncvieatl /ndioeansgbvcaewrptlimovaeshipntdvaei(+), negative (−), or neutral (0) words have different effects. Wnege athtiivnek ( −th)e, foirrs nt tuwtroa cases hoardves graceful degradation, but the third case may be catastrophic. 431 4.2 Analysis of Instance Mismatch To measure instance mismatch, we compute statistics between ps (x) and pt(x), or approximations thereof: First, we calculate a (normalized) average feature from all samples of source S, which represents the unigram distribution of MT output. Simi- larly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). Then we measure: • KL Divergence between Avg(S) and Avg(T), wKhLer De Avg() nisc eth bee average Avvegct(oSr.) • Set Coverage of Avg(T) on Avg(S): how many Sweotrd C (type) ien o Tf appears oatn le Aavsgt once ionw wS .m Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, 0 b.7o8th statistically significant (p < 0.05). 
This implies that instance mismatch is an important reason for the degradations seen in Section 3. (The observant reader may notice that cross-market points exhibit higher coverage but accuracy equal to some cross-lingual points (74-78%); this suggests that MT output may be more constrained in vocabulary than naturally-occurring English.) Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points. 4.3 Analysis of Labeling Mismatch We measure labeling mismatch by looking at differences between the weight vectors of the oracle SVM and the adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM but negative weight in the adapted SVM, then an MT mis-translation is likely causing the polarity flip. Algorithm 1 (with K = 2000) shows how we compute the polarity flip rate. (The feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.) Algorithm 1 Measuring labeling mismatch. Input: weight vectors for source ws and target wt. Input: target data average sample vector avg(T). Output: polarity flip rate f. 1: Normalize: ws = avg(T) * ws; wt = avg(T) * wt. 2: Set S+ = {K most positive features in ws}. 3: Set S− = {K most negative features in ws}. 4: Set T+ = {K most positive features in wt}. 5: Set T− = {K most negative features in wt}. 6: for each feature i ∈ T+ do 7: if i ∈ S− then f = f + 1 8: end for 9: for each feature j ∈ T− do 10: if j ∈ S+ then f = f + 1 11: end for 12: f = f / (2K). We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: labeling mismatch is not a factor in the performance degradation. Nevertheless, we note that there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked could be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. “amazon”, “human”, “moreover”). 5 Are standard adaptation algorithms applicable to cross-lingual problems? One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of pre-existing adaptation algorithms for improving over supervised results. However, we argue that it may be better to “adapt” the standard adaptation algorithm to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like an SVM, but to encourage the classification boundary to pass through low-density regions of the unlabeled target data. Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way?
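For concreteness, Algorithm 1 above can be transcribed almost line-for-line into numpy; the variable names are placeholders and the code is only an illustrative sketch.

import numpy as np

def polarity_flip_rate(ws, wt, avg_t, K=2000):
    # Step 1: rescale both weight vectors by the average target feature vector
    # so that the weight magnitudes are comparable.
    ws = avg_t * ws
    wt = avg_t * wt
    # Steps 2-5: the K most positive / most negative features of each model.
    s_pos = set(np.argsort(ws)[-K:])
    s_neg = set(np.argsort(ws)[:K])
    t_pos = set(np.argsort(wt)[-K:])
    t_neg = set(np.argsort(wt)[:K])
    # Steps 6-11: count features whose polarity flips between oracle and adapted model.
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)
    # Step 12: normalize by the 2K features considered.
    return flips / (2.0 * K)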
Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem. References Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS). John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT). Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL). Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc. of the Association for Computational Linguistics (ACL). Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90. Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4). Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL). Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.
3 0.78015506 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
Author: Yulan He ; Chenghua Lin ; Harith Alani
Abstract: The joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required for JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modified topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, in-domain supervised classifiers learned from the augmented feature representation achieve state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criterion for cross-domain sentiment classification, our proposed approach performs better than or comparably to previous approaches. Moreover, our approach is much simpler and does not require difficult parameter tuning.
4 0.73213756 109 acl-2011-Effective Measures of Domain Similarity for Parsing
Author: Barbara Plank ; Gertjan van Noord
Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English. 1 Introduction and Motivation Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given, which had been determined by a human. However, with the growth of the web, more and more data is becoming available, where each document “is potentially its own domain” (McClosky et al., 2010). It is not straightforward to determine which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e. whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text. Moreover, if we had such a measure, a related question is whether it can tell us something more about what is actually meant by “domain”. So far, the term has mostly been used arbitrarily to refer to some kind of coherent unit (related to topic, style or genre), e.g. newspaper text, biomedical abstracts, questions, fiction. Most previous work on domain adaptation, for instance Hara et al. (2005), McClosky et al. (2006), Blitzer et al. (2006), and Daumé III (2007), sidestepped this problem of automatic domain selection and adaptation. For parsing, to our knowledge only one recent study has started to examine this issue (McClosky et al., 2010); we will discuss their approach in Section 2. Rather, an implicit assumption of all of these studies is that domains are given, i.e. that they are represented by the respective corpora. Thus, a corpus has been considered a homogeneous unit. As more data becomes available, it is unlikely that domains will be ‘given’. Moreover, a given corpus might not always be as homogeneous as originally thought (Webber, 2009; Lippincott et al., 2010). For instance, recent work has shown that the well-known Penn Treebank (PT) Wall Street Journal (WSJ) actually contains a variety of genres, including letters, wit and short verse (Webber, 2009). In this study we take a different approach. Rather than viewing a given corpus as a monolithic entity,
we break it down to the article level and disregard corpora boundaries. Given the resulting set of documents (articles), we evaluate various ways to automatically acquire related training data for a given test set, to find answers to the following questions: • Given a pool of data (a collection of articles from unknown domains) and a test article, is there a way to automatically select data that is relevant for the new domain? If so: • Which similarity measure is good for parsing? • How does it compare to human-annotated data? • Is the measure also useful for other languages and/or tasks? To this end, we evaluate measures of domain similarity and feature representations and their impact on dependency parsing accuracy. Given a collection of annotated articles and a new article that we want to parse, we want to select the most similar articles to train the best parser for that new article. In the following, we will first compare automatic measures to human-annotated labels by examining parsing performance within subdomains of the Penn Treebank WSJ. Then, we extend the experiments to the domain adaptation scenario. Experiments were performed on two languages: English and Dutch. The empirical results show that a simple measure based on topic distributions is effective for both languages and also works well for Part-of-Speech tagging. As the approach is based on plain surface-level information (words) and finds related data in a completely unsupervised fashion, it can easily be applied to other tasks or languages for which annotated (or automatically annotated) data is available. 2 Related Work The work most related to ours is McClosky et al. (2010). They try to find the best combination of source models to parse data from a new domain, which is related to Plank and Sima’an (2008). In the latter, unlabeled data was used to create several parsers by weighting trees in the WSJ according to their similarity to the subdomain. McClosky et al. (2010) coined the term multiple source domain adaptation. Inspired by work on parsing accuracy prediction (Ravi et al., 2008), they train a linear regression model to predict the best (linear interpolation) of source domain models. Similar to us, McClosky et al. (2010) regard a target domain as a mixture of source domains, but they focus on phrase-structure parsing. Furthermore, our approach differs from theirs in two respects: we do not treat source corpora as one entity and try to mix models, but rather consider articles as base units and try to find subsets of related articles (the most similar articles); moreover, instead of creating a supervised model (in their case to predict parsing accuracy), our approach is ‘simplistic’: we apply measures of domain similarity directly (in an unsupervised fashion), without the need to train a supervised model. Two other related studies are Lippincott et al. (2010) and Van Asch and Daelemans (2010). Van Asch and Daelemans (2010) explore a measure of domain difference (Renyi divergence) between pairs of domains and its correlation with Part-of-Speech tagging accuracy. Their empirical results show a linear correlation between the measure and the performance loss. Their goal is different, but related: rather than finding related data for a new domain, they want to estimate the loss in accuracy of a PoS tagger when applied to a new domain.
We will briefly discuss results obtained with the Renyi divergence in Section 5.1. Lippincott et al. (2010) examine subdomain variation in biomedicine corpora and propose making NLP tools aware of such variation. However, they did not yet evaluate the effect on a practical task, thus our study is somewhat complementary to theirs. The issue of data selection has recently been examined for Language Modeling (Moore and Lewis, 2010). A subset of the available data is automatically selected as training data for a Language Model based on a scoring mechanism that compares cross-entropy scores. Their approach considerably outperformed random selection and two previously proposed approaches, both based on perplexity scoring. (We tested data selection by perplexity scoring, but found the Language Models too small to be useful in our setting.) 3 Measures of Domain Similarity 3.1 Measuring Similarity Automatically Feature Representations A similarity function may be defined over any set of events that are considered to be relevant for the task at hand. For parsing, these might be words, characters, n-grams (of words or characters), Part-of-Speech (PoS) tags, bilexical dependencies, syntactic rules, etc. However, to obtain more abstract types such as PoS tags or dependency relations, one would first need to gather the respective labels. The necessary tools for this are again trained on particular corpora, and will suffer from domain shifts, rendering the labels noisy. Therefore, we want to gauge the effect of the simplest representation possible: plain surface characteristics (unlabeled text). This has the advantage that we do not need to rely on additional supervised tools; moreover, it is interesting to know how far we can get with this level of information only. We examine the following feature representations: relative frequencies of words, relative frequencies of character tetragrams, and topic models. Our motivation was as follows. Relative frequencies of words are a simple and effective representation used e.g. in text classification (Manning and Schütze, 1999), while character n-grams have proven successful in genre classification (Wu et al., 2010). Topic models (Blei et al., 2003; Steyvers and Griffiths, 2007) can be considered an advanced model over word distributions: every article is represented by a topic distribution, which in turn is a distribution over words. Similarity between documents can be measured by comparing topic distributions. Similarity Functions There are many possible similarity (or distance) functions. They fall broadly into two categories: probabilistically-motivated and geometrically-motivated functions. The similarity functions examined in this study are described in the following. The Kullback-Leibler (KL) divergence D(q||r) is a classic measure of ‘distance’ between two probability distributions (it is not a proper distance metric, since it is asymmetric), and is defined as D(q||r) = Σ_y q(y) log [q(y)/r(y)]. It is a non-negative, additive, asymmetric measure, and 0 iff the two distributions are identical. However, the KL-divergence is undefined if there exists an event y such that q(y) > 0 but r(y) = 0, a property that “makes it unsuitable for distributions derived via maximum-likelihood estimates” (Lee, 2001). One option to overcome this limitation is to apply smoothing techniques to gather non-zero estimates for all y.
The alternative, examined in this paper, is to consider an approximation to the KL divergence, such as the Jensen-Shannon (JS) divergence (Lin, 1991) and the skew divergence (Lee, 2001). The Jensen-Shannon divergence, which is symmetric, computes the KL-divergence between q, r, and the average of the two. We use the JS divergence as defined in Lee (2001): JS(q, r) = 1/2 [D(q||avg(q, r)) + D(r||avg(q, r))]. The asymmetric skew divergence s_α, proposed by Lee (2001), mixes one distribution with the other by a degree defined by α ∈ [0, 1): s_α(q, r, α) = D(q||αr + (1 − α)q). As α approaches 1, the skew divergence approximates the KL-divergence. An alternative way to measure similarity is to consider the distributions as vectors and apply geometrically-motivated distance functions. This family of similarity functions includes the cosine, cos(q, r) = Σ_y q(y) · r(y) / (||q|| ||r||); the euclidean distance, euc(q, r) = sqrt(Σ_y (q(y) − r(y))^2); and the variational (also known as L1 or Manhattan) distance, defined as var(q, r) = Σ_y |q(y) − r(y)|. 3.2 Human-annotated data In contrast to the automatic measures devised in the previous section, we might have access to human-annotated data. That is, we can use label information such as topic or genre to define the set of similar articles. Genre For the Penn Treebank (PT) Wall Street Journal (WSJ) section, more specifically the subset available in the Penn Discourse Treebank, there exists a partition of the data by genre (Webber, 2009). Every article is assigned one of the following genre labels: news, letters, highlights, essays, errata, wit and short verse, quarterly progress reports, notable and quotable. This classification has been made on the basis of meta-data (Webber, 2009). It is well-known that there is no meta-data directly associated with the individual WSJ files in the Penn Treebank. However, meta-data can be obtained by looking at the articles in the ACL/DCI corpus (LDC99T42), and a mapping file that aligns document numbers of DCI (DOCNO) to WSJ keys (Webber, 2009). An example document is given in Figure 1. The meta-data field HL contains headlines, SO contains source info, and the IN field includes topic markers.
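As an illustration of the similarity functions just listed, a plain-numpy sketch could look as follows; q and r are assumed to be 1-D probability vectors over the same vocabulary or topic set, and the alpha default is our own choice.

import numpy as np

def kl(q, r):
    # D(q || r); assumes r > 0 wherever q > 0.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / r[mask])))

def js(q, r):
    # Jensen-Shannon divergence: average KL to the midpoint distribution.
    m = 0.5 * (q + r)
    return 0.5 * (kl(q, m) + kl(r, m))

def skew(q, r, alpha=0.99):
    # Skew divergence s_alpha(q, r) = D(q || alpha*r + (1 - alpha)*q).
    return kl(q, alpha * r + (1.0 - alpha) * q)

def cosine(q, r):
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def euclidean(q, r):
    return float(np.linalg.norm(q - r))

def variational(q, r):
    # L1 / Manhattan distance.
    return float(np.sum(np.abs(q - r)))

Note that js and skew stay finite even when r has zero entries, which is exactly why they are preferred over the raw KL divergence in this setting.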
5 0.71884072 342 acl-2011-full-for-print
Author: Kuzman Ganchev
Abstract: unknown-abstract
7 0.64942765 204 acl-2011-Learning Word Vectors for Sentiment Analysis
8 0.64561194 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
9 0.62628651 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
10 0.60801506 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
11 0.59938914 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
12 0.59388703 150 acl-2011-Hierarchical Text Classification with Latent Concepts
13 0.58497047 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging
14 0.55330354 238 acl-2011-P11-2093 k2opt.pdf
15 0.50698525 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
16 0.50554848 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
17 0.49020022 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
18 0.48442388 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
19 0.4842369 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
20 0.48205906 256 acl-2011-Query Weighting for Ranking Model Adaptation
topicId topicWeight
[(5, 0.018), (17, 0.06), (26, 0.021), (37, 0.166), (39, 0.054), (41, 0.083), (55, 0.046), (59, 0.053), (72, 0.019), (88, 0.221), (91, 0.041), (96, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.85698181 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Author: Ivan Titov
Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
2 0.83985662 194 acl-2011-Language Use: What can it tell us?
Author: Marjorie Freedman ; Alex Baron ; Vasin Punyakanok ; Ralph Weischedel
Abstract: For 20 years, information extraction has focused on facts expressed in text. In contrast, this paper is a snapshot of research in progress on inferring properties and relationships among participants in dialogs, even though these properties/relationships need not be expressed as facts. For instance, can a machine detect that someone is attempting to persuade another to action or to change beliefs or is asserting their credibility? We report results on both English and Arabic discussion forums. 1
Author: Danushka Bollegala ; David Weir ; John Carroll
Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automatically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.
4 0.74093044 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
Author: Yulan He ; Chenghua Lin ; Harith Alani
Abstract: The joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required for JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modified topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, in-domain supervised classifiers learned from the augmented feature representation achieve state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criterion for cross-domain sentiment classification, our proposed approach performs better than or comparably to previous approaches. Moreover, our approach is much simpler and does not require difficult parameter tuning.
5 0.73180234 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
Author: Chung-chi Huang ; Mei-hua Chen ; Shih-ting Huang ; Jason S. Chang
Abstract: We introduce a new method for learning to detect grammatical errors in learner’s writing and provide suggestions. The method involves parsing a reference corpus and inferring grammar patterns in the form of a sequence of content words, function words, and parts-of-speech (e.g., “play ~ role in Ving” and “look forward to Ving”). At runtime, the given passage submitted by the learner is matched using an extended Levenshtein algorithm against the set of pattern rules in order to detect errors and provide suggestions. We present a prototype implementation of the proposed method, EdIt, that can handle a broad range of errors. Promising results are illustrated with three common types of errors in nonnative writing. 1
6 0.7268151 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
7 0.72465038 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
8 0.72148043 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
9 0.72129351 85 acl-2011-Coreference Resolution with World Knowledge
10 0.71906126 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
11 0.71832204 122 acl-2011-Event Extraction as Dependency Parsing
12 0.71682453 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
13 0.71502185 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars
14 0.71499658 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
15 0.71482855 334 acl-2011-Which Noun Phrases Denote Which Concepts?
16 0.71462142 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
17 0.71417004 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
18 0.71368909 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
19 0.71248722 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
20 0.71164805 256 acl-2011-Query Weighting for Ranking Model Adaptation