acl acl2011 acl2011-179 knowledge-graph by maker-knowledge-mining

179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?


Source: pdf

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work has achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about cross-lingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully-designed experiments that led us to these conclusions.

1 Summary

Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many translations of a word may be valid, the MT system may have a systematic bias. For example, the word "awesome" might be prevalent in English reviews, but in translated reviews the word "excellent" is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions.

Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios has much higher variance than in monolingual scenarios.

The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question 1 (§4) and Question 2 (§5).

2 Experiment Design

The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing T to a common testset and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer and Stein (2010) (http://www.webis.de/research/corpora/webis-cls-10), due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books). Unlike Prettenhofer and Stein (2010), we reverse the direction of cross-lingual adaptation and consider English as the target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, and 50k samples for Music, DVD, and Books respectively), and an additional 2000 labeled samples reserved for oracle experiments. Texts in JP, FR, and DE are translated word-by-word into English with Google Translate; this is done by querying foreign words to build a bilingual dictionary, and the words are converted to tfidf unigram features. We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections.

Table 1: Experiment setup: fix the target T, vary the source S.
1. Target: Music-EN; Sources: Music-JP, Music-FR, Music-DE, DVD-EN, Book-EN
2. Target: DVD-EN; Sources: DVD-JP, DVD-FR, DVD-DE, Music-EN, Book-EN
3. Target: Book-EN; Sources: Book-JP, Book-FR, Book-DE, Music-EN, DVD-EN

3 How much performance degradation occurs in cross-lingual adaptation?

First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train an SVM classifier on labeled source data and directly apply it on the test data (for all methods we try here, 5% of the 2000 labeled source samples are held out for parameter tuning). The oracle setting, which has no domain mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8% (see column EN of Table 2, Supervised SVM results). Average cross-lingual accuracies are 69.4% (JP), 75.6% (FR), and 77.0% (DE), so degradations compared to oracle are -11% (JP), -5% (FR), and -4% (DE); see the "Adapt by Language" columns of Table 2. (Note that the JP+FR+DE condition has 6000 labeled samples, so it is not directly comparable to the other adaptation scenarios with 2000 samples; nevertheless, mixing languages seems to give good results.) Cross-market degradations are around -6% (see the "Adapt by Market" columns of Table 2).

Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source languages by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than for DE, but this could be due to phenomena other than MT errors.
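To make the supervised baseline of this section concrete, the following minimal sketch trains a linear SVM on tfidf unigram features of translated source reviews and evaluates it on English target reviews. The file names, the loader, and the tuning grid are illustrative assumptions, not the authors' code or data format.

# Hypothetical sketch of the Section 3 baseline: linear SVM on tfidf unigrams,
# trained on translated source reviews, tested on English target reviews.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def load_reviews(path):
    # Assumed format: one review per line, "label<TAB>text", label in {1, -1}.
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            texts.append(text)
            labels.append(int(label))
    return texts, labels

# Example scenario: translated Music-JP reviews as source, Music-EN as target.
src_texts, src_labels = load_reviews("music_jp_translated.tsv")
tgt_texts, tgt_labels = load_reviews("music_en_test.tsv")

# tfidf unigram features, fit on the translated source text.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 1))
X_src = vectorizer.fit_transform(src_texts)
X_tgt = vectorizer.transform(tgt_texts)

# Hold out 5% of the 2000 labeled source samples for parameter tuning,
# mirroring the protocol described above; the C grid is an assumption.
X_tr, X_dev, y_tr, y_dev = train_test_split(
    X_src, src_labels, test_size=0.05, stratify=src_labels, random_state=0)

best_acc, best_clf = -1.0, None
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LinearSVC(C=C).fit(X_tr, y_tr)
    acc = clf.score(X_dev, y_dev)
    if acc > best_acc:
        best_acc, best_clf = acc, clf

print("held-out source accuracy:", best_acc)
print("target (EN) test accuracy:", best_clf.score(X_tgt, tgt_labels))

Any linear classifier with a tfidf front end would serve here; the point is only that the training reviews and the test reviews come from different languages or markets.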
4 Where exactly is the domain mismatch?

4.1 Theory of Domain Adaptation

We analyze domain adaptation using the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt(y|x)pt(x) be the target distribution of samples x (e.g. unigram feature vectors) and labels y (positive / negative). Let ps(x, y) = ps(y|x)ps(x) be the corresponding source distribution. We assume that one (or both) of the following distributions differ between source and target:
• Instance mismatch: ps(x) ≠ pt(x).
• Labeling mismatch: ps(y|x) ≠ pt(y|x).
Instance mismatch implies that the input feature vectors have different distributions (e.g. one dataset uses the word "excellent" often, while the other uses the word "awesome"). This degrades performance because classifiers trained on "excellent" might not know how to classify texts with the word "awesome." The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). Labeling mismatch implies that the same input has different labels in different domains.
For example, the JP word meaning "excellent" may be mistranslated as "bad" in English. Then, positive JP reviews will be associated with the word "bad": ps(y = +1 | x = bad) will be high, whereas the true distribution should have high pt(y = −1 | x = bad) instead. There are several cases of labeling mismatch, depending on how the polarity changes (Table 3): mis-translations of positive (+), negative (−), or neutral (0) words have different effects. We think the first two cases have graceful degradation, but the third case may be catastrophic. The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009).

[Table 3: mis-translated polarity cases and their effects.]

[Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Rows are Supervised SVM and Adapted TSVM classifiers for each target (MUSIC-EN, DVD-EN, BOOK-EN); columns are Oracle EN, Adapt by Language (JP, FR, DE, JP+FR+DE), and Adapt by Market (Music, DVD, Book). Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target Music-EN, the accuracy of an SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of an SVM trained on DVD reviews (in the same language) is 76.8. "Oracle" indicates training on the same market and same language domain as the target. "JP+FR+DE" indicates the concatenation of JP, FR, DE as source data. Boldface shows the winner of Supervised vs. Adapted.]

Which mismatch is responsible for accuracy degradations in cross-lingual adaptation?
• Instance mismatch: systematic MT bias generates word distributions different from naturally-occurring English. (The translations may still be valid.)
• Labeling mismatch: an MT error mis-translates a word into something with a different polarity.
Conclusion from §4.2 and §4.3: instance mismatch occurs often; MT error appears minimal.

4.2 Analysis of Instance Mismatch

To measure instance mismatch, we compute statistics between ps(x) and pt(x), or approximations thereof. First, we calculate a (normalized) average feature vector from all samples of the source S, which represents the unigram distribution of MT output. Similarly, the average feature vector for the target T approximates the unigram distribution of English reviews, pt(x). Then we measure:
• KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector.
• Set Coverage of Avg(T) on Avg(S): how many words (types) in T appear at least once in S.
Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, both statistically significant (p < 0.05).

[Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points.]

This implies that instance mismatch is an important reason for the degradations seen in Section 3. (The observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74-78%) compared to some cross-lingual points; this suggests that MT output may be more constrained in vocabulary than naturally-occurring English.)
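The two instance-mismatch statistics above are straightforward to compute from average feature vectors. The sketch below is a minimal version; the additive smoothing constant, the function names, and the reuse of the feature matrices from the earlier sketch are our assumptions rather than details from the paper.

# Sketch of the Section 4.2 measures: KL divergence between the normalized
# average feature vectors Avg(S) and Avg(T), and word-type coverage.
import numpy as np

def avg_vector(X):
    # Normalized average of the rows of a (samples x features) matrix;
    # approximates the unigram distribution of the corpus.
    v = np.asarray(X.mean(axis=0)).ravel()
    return v / v.sum()

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) with additive smoothing so zero entries in q stay defined.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def coverage(avg_t, avg_s):
    # Fraction of word types seen in the target that appear at least once
    # in the source.
    target_types = avg_t > 0
    return float((target_types & (avg_s > 0)).sum() / target_types.sum())

# With X_src and X_tgt as in the earlier sketch:
# avg_s, avg_t = avg_vector(X_src), avg_vector(X_tgt)
# print("KL(Avg(S) || Avg(T)):", kl_divergence(avg_s, avg_t))
# print("Coverage of Avg(T) on Avg(S):", coverage(avg_t, avg_s))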
4.3 Analysis of Labeling Mismatch

We measure labeling mismatch by looking at differences in the weight vectors of the oracle SVM and the adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM, but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip. Algorithm 1 (with K = 2000) shows how we compute the polarity flip rate; the feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.

Algorithm 1 Measuring labeling mismatch
Input: weight vectors for source ws and target wt
Input: target data average sample vector avg(T)
Output: polarity flip rate f
1: Normalize: ws = avg(T) * ws ; wt = avg(T) * wt
2: Set S+ = {K most positive features in ws}
3: Set S− = {K most negative features in ws}
4: Set T+ = {K most positive features in wt}
5: Set T− = {K most negative features in wt}
6: for each feature i ∈ T+ do
7:   if i ∈ S− then f = f + 1
8: end for
9: for each feature j ∈ T− do
10:   if j ∈ S+ then f = f + 1
11: end for
12: f = f / (2K)

We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: labeling mismatch is not a factor in performance degradation. Nevertheless, we note there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked can be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. "amazon", "human", "moreover").
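Algorithm 1 translates almost line for line into the following sketch. It assumes the weight vectors of the adapted (source-trained) and oracle (target-trained) linear SVMs, along with the target average feature vector, are numpy arrays over the same vocabulary; the variable names are ours, with K = 2000 as in the paper.

# Sketch of Algorithm 1: polarity flip rate between two linear SVMs.
import numpy as np

def polarity_flip_rate(w_source, w_target, avg_t, K=2000):
    # Step 1: scale both weight vectors by the target average feature vector
    # so that weight magnitudes are comparable.
    ws = avg_t * w_source
    wt = avg_t * w_target

    # Steps 2-5: the K most positive / most negative features of each SVM.
    order_s, order_t = np.argsort(ws), np.argsort(wt)
    s_pos, s_neg = set(order_s[-K:]), set(order_s[:K])
    t_pos, t_neg = set(order_t[-K:]), set(order_t[:K])

    # Steps 6-11: count features whose polarity flips between the two SVMs.
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)

    # Step 12: normalize by the 2K features considered.
    return flips / (2.0 * K)

Feeding this procedure the oracle and adapted SVM weight vectors for each adaptation scenario yields the flip rates discussed above (24% on average, and uncorrelated with accuracy).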
5 Are standard adaptation algorithms applicable to cross-lingual problems?

One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of preexisting adaptation algorithms for improving over supervised results. However, we argue that it may be better to "adapt" the standard adaptation algorithm to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like an SVM, but encourage the classification boundary to pass through low-density regions in the unlabeled target data. Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way?

Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem.

References

Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS).
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT).
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL).
Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc. of the Association for Computational Linguistics (ACL).
Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4).
Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL).
Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. [sent-9, score-0.327]

2 This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. [sent-10, score-0.558]

3 Various prior work has achieved positive results using this approach. [sent-11, score-0.035]

4 In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. [sent-12, score-0.386]

5 First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. [sent-13, score-0.679]

6 Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. [sent-14, score-1.201]

7 This paper will describe a series of carefullydesigned experiments that led us to these conclusions. [sent-15, score-0.036]

8 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? [sent-16, score-0.607]

9 For example, the word "awesome" might be prevalent in English reviews, but in translated reviews the word "excellent" is generated instead. [sent-19, score-0.056]

10 From the perspective of MT, this translation is correct and preserves sentiment polarity. [sent-20, score-0.099]

11 But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. [sent-21, score-0.545]

12 Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? [sent-22, score-0.772]

13 It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. [sent-24, score-0.228]

14 We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios has much higher variance than in monolingual scenarios. [sent-25, score-0.542]

15 The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. [sent-26, score-0.588]

16 We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). [sent-27, score-0.61]

17 Here we present a series of experiments that led us to this conclusion. [sent-28, score-0.036]

18 2 Experiment Design The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. [sent-30, score-0.563]

19 Our experiments will involve fixing T to a common testset and varying S. [sent-40, score-0.034]

20 Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. [sent-43, score-0.386]

21 Each source dataset has 2000 reviews, equally balanced between positive and negative. [sent-45, score-0.104]

22 The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments. [sent-46, score-0.373]

23 Texts in JP, FR, and DE are translated word-by-word into English with Google Translate. [sent-47, score-0.056]

24 3 How much performance degradation occurs in cross-lingual adaptation? [sent-51, score-0.088]

25 First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. [sent-52, score-0.711]

26 So we train an SVM classifier on labeled source data, and directly apply it on test data. [sent-53, score-0.155]

27 train on Music-EN, test on Music-EN), achieves an average test accuracy of (81. [sent-56, score-0.046]

28 For all methods we try here, 5% of the 2000 labeled source samples are held out for parameter tuning. [sent-65, score-0.218]

29 0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE). [sent-70, score-0.299]

30 Observation 1: Degradations due to market and language mismatch are comparable in several cases (e. [sent-72, score-0.538]

31 Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. [sent-75, score-0.115]

32 Certainly, the domain mismatch for JP is larger than for DE, but this could be due to phenomena other than MT errors. [sent-78, score-0.545]

33 1 Theory of Domain Adaptation We analyze domain adaptation by the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). [sent-81, score-1.056]

34 Let pt(x, y) = pt (y|x)pt (x) be the target distribution of samples x (e. [sent-82, score-0.276]

35 unigram feature vec- tor) and labels y (positive / negative). [sent-84, score-0.049]

36 Let ps(x, y) = ps(y|x)ps(x) be the corresponding source distribution. [sent-85, score-0.313]

37 We assume that one (or both) of the following distributions differ between source and target: • Instance mismatch: ps(x) ≠ pt(x). [sent-86, score-0.303]

38 Instance mismatch implies that the input feature vectors have different distribution (e. [sent-88, score-0.467]

39 covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). [sent-97, score-0.247]

40 Labeling mismatch implies the same input has different labels in different domains. [sent-98, score-0.467]

41 For example, the JP word meaning “excellent” may be mistranslated as “bad” in English. [sent-99, score-0.034]

42 Then, positive JP reviews will be associated with the word "bad"; see the "Adapt by Language" columns of Table 2. [sent-100, score-0.068]

43 Note JP+FR+DE condition has 6000 labeled samples, so is not directly comparable to other adaptation scenarios (2000 samples). [sent-101, score-0.472]

44 Each column is an adaptation scenario using different source data. [sent-147, score-0.455]

45 The source data may vary by language or by market. [sent-148, score-0.069]

46 For example, the first row shows that for the target of Music-EN, the accuracy of a SVM trained on translated JP reviews (in the same market) is 68. [sent-149, score-0.285]

47 5, while the accuracy of a SVM trained on DVD reviews (in the same language) is 76. [sent-150, score-0.164]

48 “Oracle” indicates training on the same market and same language domain as the target. [sent-152, score-0.237]

49 “JP+FR+DE” indicates the concatenation of JP, FR, DE as source data. [sent-153, score-0.069]

50 reviews will be associated with the word "bad": ps(y = +1|x = bad) will be high, whereas the true distribution should have high pt(y = −1|x = bad) instead. [sent-156, score-0.24]

51 There are several cases of labeling mismatch, depending on how the polarity changes (Table 3). [sent-157, score-0.172]

52 The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009). [sent-158, score-0.099]

53 Which mismatch is responsible for accuracy degradations in cross-lingual adaptation? [sent-159, score-0.66]

54 Mis-translations of positive (+), negative (−), or neutral (0) words have different effects (Table 3). [sent-169, score-0.036]

55 We think the first two cases have graceful degradation, but the third case may be catastrophic. [sent-170, score-0.034]

56 2 Analysis of Instance Mismatch To measure instance mismatch, we compute statistics between ps (x) and pt(x), or approximations thereof: First, we calculate a (normalized) average feature from all samples of source S, which represents the unigram distribution of MT output. [sent-172, score-0.385]

57 Similarly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). [sent-173, score-0.232]

58 Then we measure: • KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector. [sent-174, score-0.034]

59 This implies that instance mismatch is an important reason for the degradations seen in Section 3. [sent-182, score-0.704]

60 3 Analysis of Labeling Mismatch We measure labeling mismatch by looking at differences in the weight vectors of oracle SVM and adapted SVM. [sent-184, score-0.65]

61 Intuitively, if a feature has positive weight in the oracle SVM, but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip; the observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74-78%) compared to some cross-lingual points. [sent-185, score-0.404]

62 Algorithm 1 (with K=2000) shows how we compute polarity flip rate. [sent-198, score-0.186]

63 We found that the polarity flip rate does not correlate well with accuracy at all (r = 0. [sent-199, score-0.232]

64 Conclusion: Labeling mismatch is not a factor in performance degradation. [sent-201, score-0.423]

65 Nevertheless, we note there is a surprisingly large number of flips (24% on average). [sent-202, score-0.034]

66 A manual check of the flipped words in BOOK-JP revealed few MT mistakes. [sent-203, score-0.076]

67 The majority of flipped words do not have a clear sentiment orientation (e. [sent-206, score-0.175]

68 5 Are standard adaptation algorithms applicable to cross-lingual problems? [sent-209, score-0.386]

69 One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. [sent-210, score-0.189]

70 This makes available a host of preexisting adaptation algorithms for improving over supervised results. [sent-211, score-0.42]

71 However, we argue that it may be better to "adapt" the standard adaptation algorithm to the cross-lingual setting; the feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable. [sent-212, score-0.077]

72 We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. [sent-214, score-0.04]

73 Recently, (Bergamo and Torresani, 2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. [sent-215, score-0.386]

74 The idea is to train on source data like a SVM, but encourage the classification boundary to divide through low density regions in the unlabeled target data. [sent-216, score-0.218]

75 This is a puzzling result considering that both use the same unlabeled data. [sent-218, score-0.051]

76 Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? [sent-219, score-0.068]

77 Is unlabeled target data interacting with source data in some unexpected way? [sent-220, score-0.228]

78 , 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. [sent-222, score-0.386]

79 MT output) is a different story than adapting from naturally-occurring text (e. [sent-225, score-0.035]

80 In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem. [sent-228, score-0.588]

81 Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. [sent-236, score-0.508]

82 Improving predictive inference under covariate shift by weighting the loglikelihood function. [sent-260, score-0.152]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mismatch', 0.423), ('adaptation', 0.386), ('jp', 0.241), ('degradations', 0.191), ('avg', 0.19), ('mt', 0.162), ('fr', 0.122), ('ps', 0.122), ('domain', 0.122), ('reviews', 0.118), ('dvd', 0.116), ('market', 0.115), ('pt', 0.112), ('oracle', 0.108), ('tds', 0.101), ('prettenhofer', 0.101), ('ripe', 0.101), ('tsvm', 0.101), ('svm', 0.1), ('samples', 0.099), ('sentiment', 0.099), ('polarity', 0.093), ('covariate', 0.093), ('flip', 0.093), ('degradation', 0.088), ('excellent', 0.082), ('labeling', 0.079), ('ws', 0.077), ('bergamo', 0.076), ('enidf', 0.076), ('ffeeaattuurreess', 0.076), ('flipped', 0.076), ('mmoosstt', 0.076), ('kk', 0.076), ('source', 0.069), ('wt', 0.069), ('inn', 0.069), ('rif', 0.067), ('sugiyama', 0.067), ('de', 0.066), ('target', 0.065), ('bad', 0.06), ('shift', 0.059), ('banea', 0.058), ('kl', 0.057), ('translated', 0.056), ('unlabeled', 0.051), ('divergence', 0.05), ('labeled', 0.05), ('unigram', 0.049), ('adapt', 0.048), ('accuracy', 0.046), ('instance', 0.046), ('jiang', 0.045), ('music', 0.045), ('implies', 0.044), ('argue', 0.043), ('unexpected', 0.043), ('adapted', 0.04), ('certainly', 0.039), ('monolingual', 0.039), ('books', 0.038), ('coverage', 0.038), ('piece', 0.037), ('blitzer', 0.037), ('finkel', 0.036), ('svms', 0.036), ('negative', 0.036), ('classifier', 0.036), ('scenarios', 0.036), ('led', 0.036), ('io', 0.035), ('zhai', 0.035), ('adapting', 0.035), ('positive', 0.035), ('variance', 0.035), ('reader', 0.034), ('amazon', 0.034), ('larly', 0.034), ('invite', 0.034), ('observant', 0.034), ('barrier', 0.034), ('breakthroughs', 0.034), ('distributio', 0.034), ('awesome', 0.034), ('dce', 0.034), ('flips', 0.034), ('graceful', 0.034), ('magnitudes', 0.034), ('mistranslated', 0.034), ('motoaki', 0.034), ('nakajima', 0.034), ('nisc', 0.034), ('preexisting', 0.034), ('shinichi', 0.034), ('testset', 0.034), ('unau', 0.034), ('classification', 0.033), ('exhibit', 0.033), ('columns', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions.

2 0.21844524 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

Author: Danushka Bollegala ; David Weir ; John Carroll

Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automat- ically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.

3 0.19429396 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

Author: Bin Lu ; Chenhao Tan ; Claire Cardie ; Benjamin K. Tsou

Abstract: Most previous work on multilingual sentiment analysis has focused on methods to adapt sentiment resources from resource-rich languages to resource-poor languages. We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data. We rely on the intuition that the sentiment labels for parallel sentences should be similar and present a model that jointly learns improved monolingual sentiment classifiers for each language. Experiments on multiple data sets show that the proposed approach (1) outperforms the monolingual baselines, significantly improving the accuracy for both languages by 3.44%-8.12%; (2) outperforms two standard approaches for leveraging unlabeled data; and (3) produces (albeit smaller) performance gains when employing pseudo-parallel data from machine translation engines.

4 0.1878036 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

Author: Ivan Titov

Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.

5 0.16829285 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

Author: Yulan He ; Chenghua Lin ; Harith Alani

Abstract: Joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required by JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs either better or comparably compared to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning.

6 0.16816352 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

7 0.14465041 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

8 0.14392661 256 acl-2011-Query Weighting for Ranking Model Adaptation

9 0.1349013 204 acl-2011-Learning Word Vectors for Sentiment Analysis

10 0.13265564 109 acl-2011-Effective Measures of Domain Similarity for Parsing

11 0.10085412 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

12 0.099088505 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews

13 0.098541625 238 acl-2011-P11-2093 k2opt.pdf

14 0.09311156 292 acl-2011-Target-dependent Twitter Sentiment Classification

15 0.089877129 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations

16 0.084737293 253 acl-2011-PsychoSentiWordNet

17 0.083675943 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

18 0.080774158 105 acl-2011-Dr Sentiment Knows Everything!

19 0.080556616 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

20 0.075275086 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.203), (1, 0.108), (2, 0.159), (3, 0.008), (4, 0.036), (5, -0.012), (6, 0.037), (7, 0.026), (8, 0.06), (9, 0.028), (10, 0.103), (11, -0.08), (12, 0.03), (13, -0.015), (14, 0.083), (15, 0.069), (16, -0.069), (17, -0.0), (18, 0.06), (19, -0.094), (20, -0.01), (21, -0.119), (22, -0.003), (23, 0.078), (24, -0.001), (25, 0.067), (26, -0.109), (27, -0.099), (28, 0.093), (29, 0.01), (30, 0.014), (31, 0.03), (32, -0.034), (33, -0.058), (34, -0.002), (35, 0.032), (36, -0.146), (37, 0.054), (38, 0.008), (39, -0.026), (40, -0.02), (41, -0.005), (42, -0.07), (43, -0.009), (44, -0.059), (45, -0.032), (46, -0.024), (47, 0.092), (48, -0.057), (49, -0.137)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97379005 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata

Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work have achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about crosslingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefullydesigned experiments that led us to these conclusions.
Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as the target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, and 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled samples reserved for oracle experiments. Texts in JP, FR, and DE are translated word-by-word into English with Google Translate [2]. We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections.

Table 1: Experiments (fix T, vary S).
1. Target: Music-EN; Sources: Music-JP, Music-FR, Music-DE, DVD-EN, Book-EN.
2. Target: DVD-EN; Sources: DVD-JP, DVD-FR, DVD-DE, Music-EN, Book-EN.
3. Target: Book-EN; Sources: Book-JP, Book-FR, Book-DE, Music-EN, DVD-EN.

3 How much performance degradation occurs in cross-lingual adaptation?

First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train a SVM classifier on labeled source data [3] and directly apply it to the test data. The oracle setting, which has no domain mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8% [4]. Average cross-lingual accuracies are 69.4% (JP), 75.6% (FR), and 77.0% (DE), so degradations compared to the oracle are -11% (JP), -5% (FR), and -4% (DE) [5]. Cross-market degradations are around -6% [6].

Footnotes: 1. Dataset: http://www.webis.de/research/corpora/webis-cls-10. 2. This is done by querying foreign words to build a bilingual dictionary. The words are converted to tf-idf unigram features. 3. For all methods we try here, 5% of the 2000 labeled source samples are held out for parameter tuning. 4. See column EN of Table 2, Supervised SVM results. 5. See the "Adapt by Language" columns of Table 2. Note that the JP+FR+DE condition has 6000 labeled samples, so it is not directly comparable to the other adaptation scenarios (2000 samples); nevertheless, mixing languages seems to give good results. 6. See the "Adapt by Market" columns of Table 2.

Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source languages by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than for DE, but this could be due to phenomena other than MT errors.
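The Section 3 baseline (tf-idf unigram features and a supervised SVM trained on translated source reviews, applied directly to English test data) can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn; the paper does not name its SVM implementation, and the function and variable names here are ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def cross_lingual_baseline(src_texts, src_labels, tgt_texts, tgt_labels):
    """Train on (machine-translated) source reviews, test directly on English target reviews."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 1))   # tf-idf unigram features
    X_src = vectorizer.fit_transform(src_texts)        # vocabulary is fixed on the source side
    X_tgt = vectorizer.transform(tgt_texts)            # target words unseen in the source are dropped
    clf = LinearSVC(C=1.0)                             # C would be tuned on a 5% held-out slice of the source
    clf.fit(X_src, src_labels)
    return accuracy_score(tgt_labels, clf.predict(X_tgt))

# Degradation is then cross-lingual accuracy minus oracle accuracy,
# e.g. roughly 69.4 - 80.8 = -11.4 points for Japanese sources.
```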
4 Where exactly is the domain mismatch?

4.1 Theory of Domain Adaptation

We analyze domain adaptation through the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt(y|x) pt(x) be the target distribution of samples x (e.g. unigram feature vectors) and labels y (positive / negative). Let ps(x, y) = ps(y|x) ps(x) be the corresponding source distribution. We assume that one (or both) of the following distributions differ between source and target:
• Instance mismatch: ps(x) ≠ pt(x).
• Labeling mismatch: ps(y|x) ≠ pt(y|x).
Instance mismatch implies that the input feature vectors have different distributions (e.g. one dataset uses the word "excellent" often, while the other uses the word "awesome"). This degrades performance because classifiers trained on "excellent" might not know how to classify texts with the word "awesome." The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). Labeling mismatch implies that the same input has different labels in different domains. For example, the JP word meaning "excellent" may be mistranslated as "bad" in English. Then, positive JP reviews will be associated with the word "bad": the source conditional ps(y = +1 | x = bad) will be high, whereas the true distribution should have high pt(y = −1 | x = bad) instead. There are several cases of labeling mismatch, depending on how the polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or to optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009).

Table 3: Labeling mismatch cases (e.g. "good" → "bad"), listing the mis-translated polarity pattern and its effect. Mis-translations of positive (+), negative (−), or neutral (0) words have different effects; we think the first two cases cause graceful degradation, but the third case may be catastrophic.

Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target of Music-EN, the accuracy of a SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of a SVM trained on DVD reviews (in the same language) is 76.8. "Oracle" indicates training on the same market and same language domain as the target. "JP+FR+DE" indicates the concatenation of JP, FR, DE as source data. Boldface shows the winner of Supervised vs. Adapted.

Which mismatch is responsible for accuracy degradations in cross-lingual adaptation?
• Instance mismatch: systematic MT bias generates word distributions different from naturally-occurring English. (The translations may still be valid.)
• Labeling mismatch: an MT error mis-translates a word into something with a different polarity.
Conclusion from §4.2 and §4.3: instance mismatch occurs often; MT error appears minimal.

4.2 Analysis of Instance Mismatch

To measure instance mismatch, we compute statistics between ps(x) and pt(x), or approximations thereof. First, we calculate a (normalized) average feature vector from all samples of source S, which represents the unigram distribution of MT output. Similarly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). Then we measure:
• KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector.
• Set Coverage of Avg(T) on Avg(S): how many word types in T appear at least once in S.
Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, both statistically significant (p < 0.05). This implies that instance mismatch is an important reason for the degradations seen in Section 3 [7].

Figure 1: KL Divergence and Coverage vs. accuracy. (o) are cross-lingual and (x) are cross-market data points.

Footnote 7: The observant reader may notice that cross-market points exhibit higher coverage but equal accuracy (74-78%) compared to some cross-lingual points. This suggests that MT output may be more constrained in vocabulary than naturally-occurring English.
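A minimal sketch of the two instance-mismatch statistics described in §4.2 is given below, assuming dense numpy arrays of unigram counts or tf-idf values. The smoothing constant and the direction of the KL computation are our assumptions, since the paper does not spell them out.

```python
import numpy as np

def normalized_average(X):
    """Average feature vector over all samples (rows of X), normalized to sum to 1."""
    v = np.asarray(X, dtype=float).mean(axis=0)
    return v / v.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small additive smoothing term so zero entries do not blow up."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def set_coverage(avg_t, avg_s):
    """Fraction of word types present in the target average that also occur in the source average."""
    present_t = avg_t > 0
    return float(np.sum(present_t & (avg_s > 0)) / np.sum(present_t))

# usage sketch:
#   kl_divergence(normalized_average(X_src), normalized_average(X_tgt))
#   set_coverage(normalized_average(X_tgt), normalized_average(X_src))
```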
4.3 Analysis of Labeling Mismatch

We measure labeling mismatch by looking at differences in the weight vectors of the oracle SVM and the adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip. Algorithm 1 (with K = 2000) shows how we compute the polarity flip rate [8].

Algorithm 1: Measuring labeling mismatch
Input: weight vectors for source ws and target wt
Input: target data average sample vector avg(T)
Output: polarity flip rate f
1: Normalize: ws = avg(T) * ws; wt = avg(T) * wt
2: Set S+ = {K most positive features in ws}
3: Set S− = {K most negative features in ws}
4: Set T+ = {K most positive features in wt}
5: Set T− = {K most negative features in wt}
6: for each feature i ∈ T+ do
7:   if i ∈ S− then f = f + 1
8: end for
9: for each feature j ∈ T− do
10:  if j ∈ S+ then f = f + 1
11: end for
12: f = f / (2K)

Footnote 8: The feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.

We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: labeling mismatch is not a factor in performance degradation. Nevertheless, we note that there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked could be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. "amazon", "human", "moreover").
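A direct numpy sketch of Algorithm 1 as reconstructed above is given below; the oracle and adapted weight vectors would come from the trained SVMs, and the handling of ties among the top-K features is our assumption.

```python
import numpy as np

def polarity_flip_rate(w_src, w_tgt, avg_t, K=2000):
    """Polarity flip rate per Algorithm 1: how often a top-K feature changes sign
    between the source (oracle) and target (adapted) weight vectors."""
    ws = avg_t * w_src                 # Step 1: scale weights by the target average sample
    wt = avg_t * w_tgt                 #         so that the magnitudes are comparable
    s_pos = set(np.argsort(ws)[-K:])   # K most positive source features
    s_neg = set(np.argsort(ws)[:K])    # K most negative source features
    t_pos = set(np.argsort(wt)[-K:])
    t_neg = set(np.argsort(wt)[:K])
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)
    return flips / (2.0 * K)
```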
5 Are standard adaptation algorithms applicable to cross-lingual problems?

One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of preexisting adaptation algorithms for improving over supervised results. However, we argue that it may be better to "adapt" the standard adaptation algorithm to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like a SVM, but encourage the classification boundary to pass through low-density regions of the unlabeled target data. Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way? Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem.

References

Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS).
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT).
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL).
Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc. of the Association for Computational Linguistics (ACL).
Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4).
Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL).
Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.

2 0.79725325 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

Author: Ivan Titov

Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effective on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
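One loose illustration of the idea in this abstract, penalising differences between the per-domain marginal distributions of the latent clusters, is sketched below. This is not Titov's actual model or objective, only a hypothetical regulariser in that spirit, with all names chosen by us.

```python
import numpy as np

def cluster_marginal_penalty(post_src, post_tgt, eps=1e-12):
    """Symmetrised KL divergence between the marginal cluster (latent-variable)
    distributions of the source and target domains.
    post_src, post_tgt: (n_docs, n_clusters) arrays of posterior cluster probabilities."""
    p = post_src.mean(axis=0) + eps
    r = post_tgt.mean(axis=0) + eps
    p, r = p / p.sum(), r / r.sum()
    return float(np.sum(p * np.log(p / r)) + np.sum(r * np.log(r / p)))

# hypothetical training sketch:
#   loss = model_nll + label_loss_on_source + lam * cluster_marginal_penalty(q_src, q_tgt)
```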

3 0.76351351 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

Author: Danushka Bollegala ; David Weir ; John Carroll

Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automatically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.
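A toy sketch of the expand-then-train idea from this abstract is given below; the actual thesaurus construction and scoring in the paper are considerably more involved, and all names, weights, and thesaurus entries here are hypothetical.

```python
def expand_with_thesaurus(doc_counts, thesaurus, weight=0.5, top_k=5):
    """Expand a bag-of-words dict {word: count} with related words from a
    word -> [(related_word, relatedness_score), ...] thesaurus."""
    expanded = dict(doc_counts)
    for word, count in doc_counts.items():
        for related, score in thesaurus.get(word, [])[:top_k]:
            expanded[related] = expanded.get(related, 0.0) + weight * score * count
    return expanded

# toy example with a hypothetical thesaurus entry
thesaurus = {"excellent": [("awesome", 0.8), ("superb", 0.7)]}
print(expand_with_thesaurus({"excellent": 2, "plot": 1}, thesaurus))
```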

4 0.72200465 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

Author: Yulan He ; Chenghua Lin ; Harith Alani

Abstract: The joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required for JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from the augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs better than or comparably to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning.

5 0.71580333 109 acl-2011-Effective Measures of Domain Similarity for Parsing

Author: Barbara Plank ; Gertjan van Noord

Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective: it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.

1 Introduction and Motivation

Previous research on domain adaptation has focused on the task of adapting a system trained on one domain, say newspaper text, to a particular new domain, say biomedical data. Usually, some amount of (labeled or unlabeled) data from the new domain was given which has been determined by a human. However, with the growth of the web, more and more data is becoming available, where each document "is potentially its own domain" (McClosky et al., 2010). It is not straightforward to determine which data or model (in case we have several source domain models) will perform best on a new (unknown) target domain. Therefore, an important issue that arises is how to measure domain similarity, i.e. whether we can find a simple yet effective method to determine which model or data is most beneficial for an arbitrary piece of new text. Moreover, if we had such a measure, a related question is whether it can tell us something more about what is actually meant by "domain". So far, it was mostly arbitrarily used to refer to some kind of coherent unit (related to topic, style or genre), e.g.: newspaper text, biomedical abstracts, questions, fiction. Most previous work on domain adaptation, for instance Hara et al. (2005), McClosky et al. (2006), Blitzer et al. (2006), Daumé III (2007), sidestepped this problem of automatic domain selection and adaptation. For parsing, to our knowledge only one recent study has started to examine this issue (McClosky et al., 2010); we will discuss their approach in Section 2. Rather, an implicit assumption of all of these studies is that domains are given, i.e. that they are represented by the respective corpora. Thus, a corpus has been considered a homogeneous unit. As more data is becoming available, it is unlikely that domains will be 'given'. Moreover, a given corpus might not always be as homogeneous as originally thought (Webber, 2009; Lippincott et al., 2010). For instance, recent work has shown that the well-known Penn Treebank (PT) Wall Street Journal (WSJ) actually contains a variety of genres, including letters, wit and short verse (Webber, 2009). In this study we take a different approach. Rather than viewing a given corpus as a monolithic entity,
we break it down to the article level and disregard corpora boundaries. Given the resulting set of documents (articles), we evaluate various ways to automatically acquire related training data for a given test set, to find answers to the following questions:
• Given a pool of data (a collection of articles from unknown domains) and a test article, is there a way to automatically select data that is relevant for the new domain? If so:
• Which similarity measure is good for parsing?
• How does it compare to human-annotated data?
• Is the measure also useful for other languages and/or tasks?
To this end, we evaluate measures of domain similarity and feature representations and their impact on dependency parsing accuracy. Given a collection of annotated articles, and a new article that we want to parse, we want to select the most similar articles to train the best parser for that new article. In the following, we will first compare automatic measures to human-annotated labels by examining parsing performance within subdomains of the Penn Treebank WSJ. Then, we extend the experiments to the domain adaptation scenario. Experiments were performed on two languages: English and Dutch. The empirical results show that a simple measure based on topic distributions is effective for both languages and works well also for Part-of-Speech tagging. As the approach is based on plain surface-level information (words) and it finds related data in a completely unsupervised fashion, it can be easily applied to other tasks or languages for which annotated (or automatically annotated) data is available.

2 Related Work

The work most related to ours is McClosky et al. (2010). They try to find the best combination of source models to parse data from a new domain, which is related to Plank and Sima'an (2008). In the latter, unlabeled data was used to create several parsers by weighting trees in the WSJ according to their similarity to the subdomain. McClosky et al. (2010) coined the term multiple source domain adaptation. Inspired by work on parsing accuracy prediction (Ravi et al., 2008), they train a linear regression model to predict the best (linear interpolation) of source domain models. Similar to us, McClosky et al. (2010) regard a target domain as a mixture of source domains, but they focus on phrase-structure parsing. Furthermore, our approach differs from theirs in two respects: we do not treat source corpora as one entity and try to mix models, but rather consider articles as base units and try to find subsets of related articles (the most similar articles); moreover, instead of creating a supervised model (in their case to predict parsing accuracy), our approach is 'simplistic': we apply measures of domain similarity directly (in an unsupervised fashion), without the necessity to train a supervised model. Two other related studies are (Lippincott et al., 2010; Van Asch and Daelemans, 2010). Van Asch and Daelemans (2010) explore a measure of domain difference (Renyi divergence) between pairs of domains and its correlation to Part-of-Speech tagging accuracy. Their empirical results show a linear correlation between the measure and the performance loss. Their goal is different, but related: rather than finding related data for a new domain, they want to estimate the loss in accuracy of a PoS tagger when applied to a new domain.
We will briefly discuss results obtained with the Renyi divergence in Section 5.1. Lippincott et al. (2010) examine subdomain variation in biomedicine corpora and propose awareness of NLP tools to such variation. However, they did not yet evaluate the effect on a practical task, thus our study is somewhat complementary to theirs. The issue of data selection has recently been examined for Language Modeling (Moore and Lewis, 2010). A subset of the available data is automatically selected as training data for a Language Model based on a scoring mechanism that compares cross-entropy scores. Their approach considerably outperformed random selection and two previously proposed approaches, both based on perplexity scoring [1]. (Footnote 1: We tested data selection by perplexity scoring, but found the Language Models too small to be useful in our setting.)

3 Measures of Domain Similarity

3.1 Measuring Similarity Automatically

Feature Representations: A similarity function may be defined over any set of events that are considered to be relevant for the task at hand. For parsing, these might be words, characters, n-grams (of words or characters), Part-of-Speech (PoS) tags, bilexical dependencies, syntactic rules, etc. However, to obtain more abstract types such as PoS tags or dependency relations, one would first need to gather respective labels. The necessary tools for this are again trained on particular corpora, and will suffer from domain shifts, rendering labels noisy. Therefore, we want to gauge the effect of the simplest representation possible: plain surface characteristics (unlabeled text). This has the advantage that we do not need to rely on additional supervised tools; moreover, it is interesting to know how far we can get with this level of information only. We examine the following feature representations: relative frequencies of words, relative frequencies of character tetragrams, and topic models. Our motivation was as follows. Relative frequencies of words are a simple and effective representation used e.g. in text classification (Manning and Schütze, 1999), while character n-grams have proven successful in genre classification (Wu et al., 2010). Topic models (Blei et al., 2003; Steyvers and Griffiths, 2007) can be considered an advanced model over word distributions: every article is represented by a topic distribution, which in turn is a distribution over words. Similarity between documents can be measured by comparing topic distributions. Similarity Functions: There are many possible similarity (or distance) functions. They fall broadly into two categories: probabilistically-motivated and geometrically-motivated functions. The similarity functions examined in this study will be described in the following. The Kullback-Leibler (KL) divergence D(q||r) is a classic measure of 'distance' [2] between two probability distributions, and is defined as: D(q||r) = Σ_y q(y) log(q(y)/r(y)). It is a non-negative, additive, asymmetric measure, and 0 iff the two distributions are identical. However, the KL-divergence is undefined if there exists an event y such that q(y) > 0 but r(y) = 0, which is a property that "makes it unsuitable for distributions derived via maximum-likelihood estimates" (Lee, 2001). (Footnote 2: It is not a proper distance metric since it is asymmetric.) One option to overcome this limitation is to apply smoothing techniques to gather non-zero estimates for all y.
The alternative, examined in this paper, is to consider an approximation to the KL divergence, such as the Jensen-Shannon (JS) divergence (Lin, 1991) and the skew divergence (Lee, 2001). The Jensen-Shannon divergence, which is symmetric, computes the KL-divergence between q, r, and the average between the two. We use the JS divergence as defined in Lee (2001): JS(q, r) = 1/2 [D(q || avg(q, r)) + D(r || avg(q, r))]. The asymmetric skew divergence s_α, proposed by Lee (2001), mixes one distribution with the other by a degree defined by α ∈ [0, 1): s_α(q, r) = D(q || α·r + (1 − α)·q). As α approaches 1, the skew divergence approximates the KL-divergence. An alternative way to measure similarity is to consider the distributions as vectors and apply geometrically-motivated distance functions. This family of similarity functions includes the cosine cos(q, r) = Σ_y q(y)·r(y) / (||q|| ||r||), the euclidean distance euc(q, r) = sqrt(Σ_y (q(y) − r(y))²), and the variational (also known as L1 or Manhattan) distance, defined as var(q, r) = Σ_y |q(y) − r(y)|.

3.2 Human-annotated data

In contrast to the automatic measures devised in the previous section, we might have access to human-annotated data. That is, we can use label information such as topic or genre to define the set of similar articles. Genre: For the Penn Treebank (PT) Wall Street Journal (WSJ) section, more specifically, the subset available in the Penn Discourse Treebank, there exists a partition of the data by genre (Webber, 2009). Every article is assigned one of the following genre labels: news, letters, highlights, essays, errata, wit and short verse, quarterly progress reports, notable and quotable. This classification has been made on the basis of meta-data (Webber, 2009). It is well-known that there is no meta-data directly associated with the individual WSJ files in the Penn Treebank. However, meta-data can be obtained by looking at the articles in the ACL/DCI corpus (LDC99T42), and a mapping file that aligns document numbers of DCI (DOCNO) to WSJ keys (Webber, 2009). An example document is given in Figure 1. The metadata field HL contains headlines, SO source info, and the IN field includes topic markers.
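A small numpy sketch of the similarity and distance functions listed above is given below, for two probability vectors q and r over the same vocabulary; the zero-handling in the KL helper and the default α value are our assumptions.

```python
import numpy as np

def kl(q, r):
    """D(q || r); only defined where r > 0 for every y with q > 0."""
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / r[m])))

def js(q, r):
    """Jensen-Shannon divergence as defined in Lee (2001)."""
    avg = 0.5 * (q + r)
    return 0.5 * (kl(q, avg) + kl(r, avg))

def skew(q, r, alpha=0.99):
    """Skew divergence: KL of q against a mixture of r and q."""
    return kl(q, alpha * r + (1.0 - alpha) * q)

def cosine(q, r):
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def euclidean(q, r):
    return float(np.sqrt(np.sum((q - r) ** 2)))

def variational(q, r):
    """L1 / Manhattan distance."""
    return float(np.sum(np.abs(q - r)))
```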

6 0.67120188 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

7 0.66281092 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

8 0.63187224 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

9 0.56760997 204 acl-2011-Learning Word Vectors for Sentiment Analysis

10 0.55887604 238 acl-2011-P11-2093 k2opt.pdf

11 0.51169288 256 acl-2011-Query Weighting for Ranking Model Adaptation

12 0.49566364 297 acl-2011-That's What She Said: Double Entendre Identification

13 0.48799232 311 acl-2011-Translationese and Its Dialects

14 0.48700121 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews

15 0.48546574 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging

16 0.4775846 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations

17 0.47397688 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic

18 0.47180775 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis

19 0.46793565 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

20 0.45714441 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.015), (17, 0.041), (26, 0.014), (37, 0.613), (39, 0.022), (41, 0.052), (55, 0.016), (59, 0.012), (72, 0.018), (91, 0.021), (96, 0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97214997 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?

Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata


2 0.93222725 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation

Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport

Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.

3 0.92894304 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai

Abstract: In this paper, we present a novel approach which incorporates the web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.

4 0.92515308 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

Author: Mark-Jan Nederhof ; Giorgio Satta

Abstract: We present a method for the computation of prefix probabilities for synchronous contextfree grammars. Our framework is fairly general and relies on the combination of a simple, novel grammar transformation and standard techniques to bring grammars into normal forms.

5 0.92351425 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation

Author: Bing Xiang ; Abraham Ittycheriah

Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between maximum-likelihood training and discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
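A toy reading of the scoring idea in this abstract is sketched below; it only shows a log-linear combination of per-component scores with tied feature weights, and omits the discriminative training of the mixture weights described in the paper. All names are ours.

```python
import numpy as np

def mixture_score(feature_blocks, component_weights, mixture_weights):
    """Score a hypothesis as a combination of mixture components.
    feature_blocks[m] and component_weights[m] are the feature vector and the
    tied feature weights of component m; mixture_weights[m] is the shared
    weight of that component."""
    component_scores = np.array(
        [np.dot(w, f) for w, f in zip(component_weights, feature_blocks)]
    )
    return float(np.dot(mixture_weights, component_scores))
```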

6 0.90777373 204 acl-2011-Learning Word Vectors for Sentiment Analysis

7 0.90669101 122 acl-2011-Event Extraction as Dependency Parsing

8 0.90169597 334 acl-2011-Which Noun Phrases Denote Which Concepts?

9 0.85045171 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

10 0.78973305 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

11 0.78690034 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers

12 0.78608942 256 acl-2011-Query Weighting for Ranking Model Adaptation

13 0.77555603 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

14 0.7729218 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

15 0.76243013 85 acl-2011-Coreference Resolution with World Knowledge

16 0.75243086 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning

17 0.7500236 292 acl-2011-Target-dependent Twitter Sentiment Classification

18 0.74854535 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features

19 0.74791282 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation

20 0.74673688 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing