acl acl2011 acl2011-204 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts
Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. [sent-6, score-0.858]
2 The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. [sent-7, score-0.711]
3 We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). [sent-8, score-0.788]
4 We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. [sent-11, score-1.309]
5 We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area. [sent-12, score-0.498]
6 They encode continuous similarities between words as distance or angle between word vectors in a high-dimensional space. [sent-16, score-0.253]
7 In this paper, we present a model to capture both semantic and sentiment similarities among words. [sent-20, score-0.807]
8 The semantic component of our model learns word vectors via an unsupervised probabilistic model of documents. [sent-21, score-0.558]
9 However, in keeping with linguistic and cognitive research arguing that expressive content and descriptive semantic content are distinct (Kaplan, 1999; Jay, 2000; Potts, 2007), we find that this basic model misses crucial sentiment information. [sent-22, score-0.694]
10 For example, while it learns that wonderful and amazing are semantically close, it doesn’t capture the fact that these are both very strong positive sentiment words, at the opposite end of the spectrum from terrible and awful. [sent-23, score-0.718]
11 Thus, we extend the model with a supervised sentiment component that is capable of embracing many social and attitudinal aspects of meaning (Wilson et al. [sent-24, score-0.69]
12 This component of the model uses the vector representation of words to predict the sentiment annotations on contexts in which the words appear. [sent-27, score-0.818]
13 This causes words expressing similar sentiment to have similar vector representations. [sent-28, score-0.714]
14 The full objective function of the model thus learns semantic vectors that are imbued with nuanced sentiment information. [sent-29, score-0.978]
15 In our experiments, we show how the model can leverage document-level sentiment annotations of a sort that are abundant online in the form of consumer reviews for movies, products, etc. [sent-30, score-0.898]
16 The model is sufficiently general to work also with continuous and multi-dimensional notions of sentiment as well as non-sentiment annotations (e. [sent-33, score-0.620]
17 After presenting the model in detail, we provide illustrative examples of the vectors it learns, and then we systematically evaluate the approach on document-level and sentence-level classification tasks. [sent-36, score-0.214]
18 Our experiments involve the small, widely used sentiment and subjectivity corpora of Pang and Lee (2004), which permits us to make comparisons with a number of related approaches and published results. [sent-37, score-0.678]
19 This leads us to evaluate on, and make publicly available, a large dataset of informal movie reviews from the Internet Movie Database (IMDB). [sent-39, score-0.434]
20 Latent Dirichlet Allocation (LDA; Blei et al., 2003) is a probabilistic document model that assumes each document is a mixture of latent topics. [sent-42, score-0.431]
21 For each latent topic T, the model learns a conditional distribution p(w|T) for the probability that word w occurs in topic T. [sent-43, score-0.282]
22 One can obtain a k-dimensional vector representation of words by first training a k-topic model and then filling the matrix with the p(w|T) values (normalized to unit length). [sent-44, score-0.242]
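A minimal sketch of that construction, assuming a trained k-topic model exposes its p(w|T) values as a k × |V| array; the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def lda_word_vectors(topic_word_probs):
    """Turn a k x |V| matrix of p(w|T) values into unit-length
    k-dimensional word vectors (one column per word)."""
    vecs = np.asarray(topic_word_probs, dtype=float)
    # Normalize each word's column to unit length, guarding empty columns.
    norms = np.linalg.norm(vecs, axis=0, keepdims=True)
    norms[norms == 0.0] = 1.0
    return vecs / norms

# Toy example: 3 topics over a 5-word vocabulary; each row sums to 1.
probs = np.random.dirichlet(np.ones(5), size=3)   # shape (3, 5)
W = lda_word_vectors(probs)                       # shape (3, 5)
```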
23 The semantic component of our model shares its probabilistic foundation with LDA, but is factored in a manner designed to discover word vectors rather than latent topics. [sent-48, score-0.478]
24 Some recent work introduces extensions of LDA to capture sentiment in addition to topical information (Li et al. [sent-49, score-0.636]
25 Latent Semantic Analysis (LSA), perhaps the best known VSM, explicitly learns semantic word vectors by applying singular value decomposition (SVD) to factor a term–document co-occurrence matrix. [sent-52, score-0.301]
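A hedged sketch of the LSA construction just described, using numpy's SVD; the truncation dimension k and the toy count matrix are illustrative choices, not the paper's settings:

```python
import numpy as np

def lsa_word_vectors(term_doc, k):
    """Factor a term-document count matrix with a truncated SVD and
    return k-dimensional word vectors (one row per term)."""
    X = np.asarray(term_doc, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale the left singular vectors by the singular values, as in LSA.
    return U[:, :k] * s[:k]

X = np.random.poisson(1.0, size=(1000, 200))  # |V| x |D| toy counts
W = lsa_word_vectors(X, k=50)                 # shape (1000, 50)
```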
26 Using term frequency (tf) and inverse document frequency (idf) weighting to transform the values in a VSM often increases the performance of retrieval and categorization systems. [sent-56, score-0.281]
27 Delta idf weighting (Martineau and Finin, 2009) is a supervised variant of idf weighting in which the idf calculation is done for each document class and then one value is subtracted from the other. [sent-57, score-0.788]
28 Martineau and Finin present evidence that this weighting helps with sentiment classification, and Paltoglou and Thelwall (2010) systematically explore a number of weighting schemes in the context of sentiment analysis. [sent-58, score-1.52]
29 The success of delta idf weighting in previous work suggests that incorporating sentiment information into VSM values via supervised methods is helpful for sentiment analysis. [sent-59, score-1.511]
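A hedged sketch of the delta idf idea: compute a per-class idf for the positive and negative document sets and subtract one from the other. The add-one smoothing and the sign convention below are one common variant and our own assumptions, not necessarily the exact formula of the papers cited:

```python
import math
from collections import Counter

def delta_idf(pos_docs, neg_docs):
    """Per-term delta idf: idf within positive docs minus idf within
    negative docs, with add-one smoothing on document frequencies."""
    df_pos = Counter(t for d in pos_docs for t in set(d))
    df_neg = Counter(t for d in neg_docs for t in set(d))
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    vocab = set(df_pos) | set(df_neg)
    return {
        t: math.log2((n_pos + 1) / (df_pos[t] + 1))
           - math.log2((n_neg + 1) / (df_neg[t] + 1))
        for t in vocab
    }

pos = [["great", "fun", "movie"], ["great", "acting"]]
neg = [["dull", "movie"], ["awful", "dull", "plot"]]
weights = delta_idf(pos, neg)
# Terms concentrated in one class get weights far from zero; under this
# sign convention weights["great"] and weights["dull"] take opposite signs,
# while shared terms like "movie" land near zero.
```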
30 3 Our Model To capture semantic similarities among words, we derive a probabilistic model of documents which learns word representations. [sent-62, score-0.477]
31 The sentiment component of our model uses sentiment annotations to constrain words expressing similar sentiment to have similar representations. [sent-64, score-1.264]
32 Like LDA, these methods focus on modeling topics rather than learning word representations. [sent-65, score-0.620]
33 1 Capturing Semantic Similarities We build a probabilistic model of a document using a continuous mixture distribution over words indexed by a multi-dimensional random variable θ. [sent-68, score-0.322]
34 The energy function uses a word representation matrix R ∈ R^(β×|V|) where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φw = Rw corresponding to that word’s column in R. [sent-76, score-0.313]
35 We additionally introduce a bias bw for each word to capture differences in overall word frequencies. [sent-78, score-0.262]
36 One could view the entries of a word vector φ as that word’s association strength with respect to each latent topic dimension. [sent-86, score-0.284]
37 The random variable θ then defines a weighting over topics. [sent-87, score-0.218]
38 However, our model does not attempt to model individual topics, but instead directly models word probabilities conditioned on the topic mixture variable θ. [sent-88, score-0.285]
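On our reading, the energy function just described corresponds to a softmax over the vocabulary with logits θᵀφw + bw. The sketch below makes that concrete; treat it as an illustration of the described model, not the authors' code:

```python
import numpy as np

def word_log_probs(R, b, theta):
    """log p(w | theta; R, b): a softmax over the vocabulary whose
    logits are theta^T phi_w + b_w (phi_w is column w of R)."""
    logits = theta @ R + b
    logits = logits - logits.max()          # numerical stability
    return logits - np.log(np.exp(logits).sum())

# Toy usage with illustrative sizes.
beta, V = 50, 5000
R = 0.01 * np.random.randn(beta, V)         # word vectors as columns
b = np.zeros(V)                             # per-word frequency biases
theta = np.random.randn(beta)               # topic-mixture variable
assert np.isclose(np.exp(word_log_probs(R, b, theta)).sum(), 1.0)
```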
39 The word biases b are not regularized, reflecting the fact that we want the biases to capture whatever overall word frequency statistics are present in the data. [sent-99, score-0.226]
40 The hyper-parameters in the model are the regularization weights (λ and ν) and the word vector dimensionality β. [sent-101, score-0.245]
41 2 Capturing Word Sentiment The model presented so far does not explicitly capture sentiment information. [sent-103, score-0.693]
42 Applying this algorithm to documents will produce representations where words that occur together in documents have similar representations. [sent-104, score-0.264]
43 However, this unsupervised approach has no explicit way of capturing which words are predictive of sentiment as opposed to content-related. [sent-105, score-0.645]
44 Depending on which aspects of sentiment we wish to capture, we can give some body of text a sentiment label s which can be categorical, continuous, or multi-dimensional. [sent-109, score-1.148]
45 To leverage such labels, we introduce an objective that the word vectors of our model should predict the sentiment label using some appropriate predictor, ŝ = f(φw) (8). [sent-110, score-0.908]
46 Using an appropriate predictor function f(x) we map a word vector φw to a predicted sentiment label ŝ. We can then improve our word vector φw to better predict the sentiment labels of contexts in which that word occurs. [sent-111, score-1.531]
47 The logistic regression weights ψ and bc define a linear hyperplane in the word vector space where a word vector’s positive sentiment probability depends on where it lies with respect to this hyperplane. [sent-114, score-0.853]
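A minimal rendering of that predictor: logistic regression from a word vector to the probability of positive sentiment, with ψ and bc defining the hyperplane. The names and toy values are illustrative:

```python
import numpy as np

def positive_sentiment_prob(phi_w, psi, b_c):
    """p(s = 1 | w): logistic regression on the word vector phi_w; psi
    and b_c define the separating hyperplane in word-vector space."""
    return 1.0 / (1.0 + np.exp(-(psi @ phi_w + b_c)))

beta = 50
phi_w = np.random.randn(beta)    # a word's learned vector (toy)
psi = np.random.randn(beta)      # hyperplane normal (toy)
b_c = 0.0
print(positive_sentiment_prob(phi_w, psi, b_c))
```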
48 Learning over a collection of documents results in words residing at different distances from this hyperplane based on the average polarity of documents in which the words occur. [sent-115, score-0.256]
49 With sk denoting the sentiment label for document dk, we wish to maximize the probability of the document labels given the documents. [sent-117, score-0.764]
50 Then we find the new MAP estimate for each document while leaving the word representations fixed, and continue this process until convergence. [sent-128, score-0.278]
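A toy, runnable rendering of this alternating scheme under our reading of the model: alternate between per-document MAP estimates of θ (word parameters fixed) and gradient updates to R and b (θ's fixed). Learning rates, priors, and iteration counts are illustrative, and the sentiment term is omitted for brevity:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def train(counts, R, b, lam=1.0, nu=1e-3, iters=20, lr=0.01):
    """Toy alternating optimization: MAP-estimate each document's theta
    with R, b fixed, then take a gradient step on R, b with the thetas
    fixed. lam is a Gaussian prior on theta; nu weakly regularizes R."""
    n_docs, V = counts.shape
    thetas = np.zeros((n_docs, R.shape[0]))
    for _ in range(iters):
        # Step 1: per-document MAP estimate of theta (a few ascent steps).
        for k in range(n_docs):
            for _ in range(5):
                p = np.exp(log_softmax(thetas[k] @ R + b))
                grad = R @ (counts[k] - counts[k].sum() * p) - lam * thetas[k]
                thetas[k] += lr * grad
        # Step 2: gradient step on word vectors and biases, thetas fixed.
        gR, gb = -nu * R, np.zeros(V)
        for k in range(n_docs):
            p = np.exp(log_softmax(thetas[k] @ R + b))
            resid = counts[k] - counts[k].sum() * p
            gR += np.outer(thetas[k], resid)
            gb += resid
        R += lr * gR / n_docs
        b += lr * gb / n_docs
    return thetas, R, b

# Toy usage: 10 documents over a 200-word vocabulary, 5-dim vectors.
counts = np.random.poisson(0.2, size=(10, 200)).astype(float)
R, b = 0.01 * np.random.randn(5, 200), np.zeros(200)
thetas, R, b = train(counts, R, b)
```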
51 4 Experiments We evaluate our model with document-level and sentence-level categorization tasks in the domain of online movie reviews. [sent-132, score-0.224]
52 For document categorization, we compare our method to previously published results on a standard dataset, and introduce a new dataset for the task. [sent-133, score-0.208]
53 In both tasks we compare our model’s word representations with several bag of words weighting methods, and alternative approaches to word vector induction. [sent-134, score-0.646]
54 1 Word Representation Learning We induce word representations with our model using 25,000 movie reviews from IMDB. [sent-136, score-0.593]
55 Because some movies receive substantially more reviews than others, we limited ourselves to including at most 30 reviews from any movie in the collection. [sent-137, score-0.609]
56 Stemming was not applied because the model learns similar representations for words of the same stem when the data suggests it. [sent-142, score-0.275]
57 The semantic component of our model does not require document labels. [sent-154, score-0.274]
58 We train a variant of our model which uses 50,000 unlabeled reviews in addition to the labeled set; this unlabeled set of reviews contains neutral reviews as well as those which are polarized as found in the labeled set. [sent-155, score-0.761]
59 Training the model with additional unlabeled data captures a common scenario where the amount of labeled data is small relative to the amount of unlabeled data available. [sent-156, score-0.214]
60 As a qualitative assessment of word representations, we visualize the words most similar to a query word using vector similarity of the learned representations. [sent-158, score-0.225]
61 Given a query word w and another word w′ we obtain their vector representations φw and φw′, and evaluate their cosine similarity as φwᵀφw′ / (||φw|| · ||φw′||). [sent-159, score-0.413]
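A small nearest-neighbor query using this cosine similarity, with word vectors stored as columns of a matrix R as above; the vocabulary and matrix here are toy stand-ins:

```python
import numpy as np

def most_similar(query, vocab, R, topn=5):
    """Return the topn words whose vectors have the highest cosine
    similarity to the query word's vector (columns of R)."""
    V = R / np.linalg.norm(R, axis=0, keepdims=True)   # unit columns
    q = V[:, vocab.index(query)]
    sims = q @ V                                       # cosine per word
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order
            if vocab[i] != query][:topn]

vocab = ["wonderful", "amazing", "terrible", "awful", "movie"]
R = np.random.randn(10, len(vocab))
print(most_similar("wonderful", vocab, R))
```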
62 (e.g., screwball and grant as similar to romantic). A comparison of the two versions of our model also begins to highlight the importance of adding sentiment information. [sent-165, score-0.682]
63 In general, words indicative of sentiment tend to have high similarity with words of the same sentiment polarity, so even the purely unsupervised model’s results look promising. [sent-166, score-1.261]
64 For example, the sentiment-enriched vectors for ghastly are truly semantic alternatives to that word, whereas the vectors without sentiment also contain some content words that tend to have ghastly predicated of them. [sent-168, score-1.581]
65 Of course, this is only an impressionistic analysis of a few cases, but it is helpful in understanding why the sentiment-enriched model proves superior in the sentiment classification results we report next. [sent-169, score-0.714]
66 Each target word is given with its five most similar words using cosine similarity of the vectors determined by each model. [sent-175, score-0.245]
67 The full version of our model (left) captures both lexical similarity as well as similarity of sentiment strength and orientation. [sent-176, score-0.743]
68 Latent Dirichlet Allocation (LDA; Blei et al., 2003): word representations are induced from the topic matrix as described in section 2; to train the 50-topic LDA model we use code released by Blei et al. [sent-181, score-0.24]
69 Weighting Variants We evaluate both binary (b) term frequency weighting with smoothed delta idf (Δt′) and no idf (n) because these variants worked well in previous experiments in sentiment (Martineau and Finin, 2009; Pang et al. [sent-186, score-1.044]
70 Paltoglou and Thelwall (2010) perform a systematic study of such weighting variants for sentiment tasks. [sent-189, score-0.915]
71 3 Document Polarity Classification Our first evaluation task is document-level sentiment polarity classification. [sent-191, score-0.667]
72 Given a document’s bag of words vector v, we obtain features from our model using a matrix-vector product Rv, where v can have arbitrary tf.idf weighting. [sent-193, score-0.287]
73 In preliminary experiments, we found ‘bnn’ weighting to work best for v when generating document features via the product Rv. [sent-197, score-0.281]
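A sketch of that feature construction: binary ('bnn'-style) presence weighting of v, the matrix-vector product Rv, and optional concatenation with the bag of words itself. This is our illustrative reading, not the authors' code:

```python
import numpy as np

def doc_features(tokens, vocab_index, R, concat_bow=True):
    """Features for one document: Rv with binary term weighting ('bnn'),
    optionally concatenated with the bag-of-words vector itself."""
    v = np.zeros(R.shape[1])
    for t in set(tokens):                  # binary: presence, not counts
        if t in vocab_index:
            v[vocab_index[t]] = 1.0
    feats = R @ v                          # beta-dimensional Rv features
    return np.concatenate([feats, v]) if concat_bow else feats

vocab = ["great", "dull", "movie", "plot"]
vocab_index = {w: i for i, w in enumerate(vocab)}
R = np.random.randn(8, len(vocab))
x = doc_features(["great", "great", "movie"], vocab_index, R)
```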
74 From left to right the datasets are: a collection of 2,000 movie reviews often used as a benchmark of sentiment classification (Pang and Lee, 2004), 50,000 reviews we gathered from IMDB, and the sentence subjectivity dataset also released by Pang and Lee (2004). [sent-236, score-1.378]
75 The polarity dataset version 2.0 introduced by Pang and Lee (2004) consists of 2,000 movie reviews, where each is associated with a binary sentiment polarity label. [sent-242, score-1.008]
76 Bag of words vectors are denoted by their weighting notation. [sent-247, score-0.295]
77 As a control, we trained versions of our model with only the unsupervised semantic component, and the full model (semantic and sentiment). [sent-249, score-0.248]
78 We also include results for a version of our full model trained with 50,000 additional unlabeled reviews. Our method’s features clearly outperform those of other VSMs, and perform best when combined with the original bag of words representation. [sent-250, score-0.231]
79 The variant of our model trained with additional unlabeled data performed best, suggesting the model can effectively utilize large amounts of unlabeled data along with labeled examples. [sent-251, score-0.271]
80 We extracted the movie title associated with each review and found that 1,299 of the 2,000 reviews in the dataset have at least one other review of the same movie in the dataset. [sent-253, score-0.747]
81 In the random folds distributed by the authors, approximately 50% of reviews in each validation fold’s test set have a review of the same movie with the same label in the training set. [sent-257, score-0.462]
82 These steps minimize the ability of a learner to rely on idiosyncratic word–class associations, thereby focusing attention on genuine sentiment features. [sent-267, score-0.606]
83 2 IMDB Review Dataset We constructed a collection of 50,000 reviews from IMDB, allowing no more than 30 reviews per movie. [sent-270, score-0.372]
84 The training set is the same 25,000 labeled reviews used to induce word vectors with our model. [sent-277, score-0.375]
85 Our model showed superior performance to other approaches, and performed best when concatenated with the bag of words representation. [sent-280, score-0.228]
86 We used the dataset of Pang and Lee (2004), which contains subjective sentences from movie review summaries and objective sentences from movie plot summaries. [sent-288, score-0.543]
87 This task is substantially different from the review classification task because it uses sentences as opposed to entire documents and the target concept is subjectivity instead of opinion polarity. (The dataset and further details are available online.) [sent-289, score-0.289]
88 5 Discussion We presented a vector space model that learns word representations capturing semantic and sentiment information. [sent-294, score-1.053]
89 The model’s probabilistic foundation gives a theoretically justified technique for word vector induction as an alternative to the overwhelming number of matrix factorization-based techniques commonly used. [sent-295, score-0.249]
90 We parametrize the topical component of our model in a manner that aims to capture word representations instead of latent topics. [sent-299, score-0.453]
91 We extended the unsupervised model to incorporate sentiment information and showed how this extended model can leverage the abundance of sentiment-labeled texts available online to yield word representations that capture both sentiment and semantic relations. [sent-301, score-1.637]
92 We demonstrated the utility of such representations on two tasks of sentiment classification, using existing datasets as well as a larger one that we release for future research. [sent-302, score-0.71]
93 These tasks involve relatively simple sentiment information, but the model is highly flexible in this regard; it can be used to characterize a wide variety of annotations, and thus is broadly applicable. [sent-303, score-0.631]
94 Mining WordNet for fuzzy sentiment: sentiment tag extraction from WordNet glosses. [sent-320, score-0.574]
95 Holistic sentiment analysis across languages: multilingual supervised latent Dirichlet allocation. [sent-344, score-0.666]
96 Seeing stars when there aren’t many stars: graph-based semi-supervised learning for sentiment categorization. [sent-391, score-0.625]
97 Delta tfidf: an improved feature space for sentiment analysis. [sent-413, score-0.574]
98 A study of information retrieval weighting schemes for sentiment analysis. [sent-426, score-0.76]
99 A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. [sent-432, score-0.678]
100 Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. [sent-438, score-0.574]
wordName wordTfidf (topN-words)
[('sentiment', 0.574), ('lda', 0.192), ('weighting', 0.186), ('reviews', 0.186), ('movie', 0.167), ('pang', 0.148), ('representations', 0.136), ('bag', 0.136), ('imdb', 0.116), ('vectors', 0.109), ('idf', 0.107), ('subjectivity', 0.104), ('lsa', 0.095), ('document', 0.095), ('vector', 0.094), ('polarity', 0.093), ('latent', 0.092), ('martineau', 0.088), ('collobert', 0.082), ('vsm', 0.082), ('learns', 0.082), ('wi', 0.082), ('dataset', 0.081), ('ghastly', 0.076), ('vsms', 0.076), ('bw', 0.074), ('review', 0.073), ('blei', 0.072), ('lee', 0.071), ('movies', 0.07), ('bnc', 0.07), ('delta', 0.07), ('sk', 0.068), ('paltoglou', 0.067), ('documents', 0.064), ('finin', 0.063), ('semantic', 0.063), ('mnih', 0.062), ('capture', 0.062), ('unlabeled', 0.062), ('component', 0.059), ('weston', 0.058), ('matrix', 0.057), ('model', 0.057), ('bc', 0.056), ('objective', 0.055), ('map', 0.054), ('turney', 0.053), ('romance', 0.053), ('cosine', 0.052), ('similarities', 0.051), ('expe', 0.051), ('hideous', 0.051), ('polarized', 0.051), ('potts', 0.051), ('screwball', 0.051), ('sweet', 0.051), ('uninspired', 0.051), ('stars', 0.051), ('topic', 0.051), ('probabilistic', 0.051), ('classification', 0.048), ('word', 0.047), ('continuous', 0.046), ('expressing', 0.046), ('iyn', 0.045), ('romantic', 0.045), ('thelwall', 0.045), ('indicative', 0.043), ('mixture', 0.041), ('meanings', 0.04), ('zp', 0.039), ('topics', 0.038), ('full', 0.038), ('capturing', 0.038), ('dirichlet', 0.037), ('similarity', 0.037), ('svm', 0.037), ('deerwester', 0.037), ('folds', 0.036), ('alm', 0.035), ('wallach', 0.035), ('hyperplane', 0.035), ('andreevskaia', 0.035), ('turian', 0.035), ('biases', 0.035), ('superior', 0.035), ('leverage', 0.034), ('representation', 0.034), ('hinton', 0.034), ('dk', 0.034), ('pantel', 0.034), ('labeled', 0.033), ('bengio', 0.033), ('star', 0.033), ('unsupervised', 0.033), ('benchmark', 0.032), ('learner', 0.032), ('introduce', 0.032), ('variable', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 204 acl-2011-Learning Word Vectors for Sentiment Analysis
Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts
Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it out-performs several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.
Author: Danushka Bollegala ; David Weir ; John Carroll
Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automatically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.
3 0.36336255 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
Author: Oscar Tackstrom ; Ryan McDonald
Abstract: We derive two variants of a semi-supervised model for fine-grained sentiment analysis. Both models leverage abundant natural supervision in the form of review ratings, as well as a small amount of manually crafted sentence labels, to learn sentence-level sentiment classifiers. The proposed model is a fusion of a fully supervised structured conditional model and its partially supervised counterpart. This allows for highly efficient estimation and inference algorithms with rich feature definitions. We describe the two variants as well as their component models and verify experimentally that both variants give significantly improved results for sentence-level sentiment analysis compared to all baselines. 1 Sentence-level sentiment analysis In this paper, we demonstrate how combining coarse-grained and fine-grained supervision benefits sentence-level sentiment analysis, an important task in the field of opinion classification and retrieval (Pang and Lee, 2008). Typical supervised learning approaches to sentence-level sentiment analysis rely on sentence-level supervision. While such fine-grained supervision rarely exists naturally, and thus requires labor intensive manual annotation effort (Wiebe et al., 2005), coarse-grained supervision is naturally abundant in the form of online review ratings. This coarse-grained supervision is, of course, less informative compared to fine-grained supervision; however, by combining a small amount of sentence-level supervision with a large amount of document-level supervision, we are able to substantially improve on the sentence-level classification task. Our work combines two strands of research: models for sentiment analysis that take document structure into account, and models that use latent variables to learn unobserved phenomena from that which can be observed. Exploiting document structure for sentiment analysis has attracted research attention since the early work of Pang and Lee (2004), who performed minimal cuts in a sentence graph to select subjective sentences. McDonald et al. (2007) later showed that jointly learning fine-grained (sentence) and coarse-grained (document) sentiment improves predictions at both levels. More recently, Yessenalina et al. (2010) described how sentence-level latent variables can be used to improve document-level prediction and Nakagawa et al. (2010) used latent variables over syntactic dependency trees to improve sentence-level prediction, using only labeled sentences for training. In a similar vein, Sauper et al. (2010) integrated generative content structure models with discriminative models for multi-aspect sentiment summarization and ranking. These approaches all rely on the availability of fine-grained annotations, but Täckström and McDonald (2011) showed that latent variables can be used to learn fine-grained sentiment using only coarse-grained supervision. While this model was shown to beat a set of natural baselines with quite a wide margin, it has its shortcomings. Most notably, due to the loose constraints provided by the coarse supervision, it tends to only predict the two dominant fine-grained sentiment categories well for each document sentiment category, so that almost all sentences in positive documents are deemed positive or neutral, and vice versa for negative documents. As a way of overcoming these shortcomings, we propose to fuse a coarsely supervised model with a fully supervised model.
Below, we describe two ways of achieving such a combined model in the framework of structured conditional latent variable models. Contrary to (generative) topic models (Mei et al., 2007; Titov and McDonald, 2008; Lin and He, 2009), structured conditional models can handle rich and overlapping features and allow for exact inference and simple gradient based estimation. [Figure 1: a) Factor graph of the fully observed graphical model. b) Factor graph of the corresponding latent variable model. During training, shaded nodes are observed, while non-shaded nodes are unobserved. The input sentences s_i are always observed. Note that there are no factors connecting the document node, y^d, with the input nodes, s, so that the sentence-level variables, y^s, in effect form a bottleneck between the document sentiment and the input sentences.] The former models are largely orthogonal to the one we propose in this work and combining their merits might be fruitful. As shown by Sauper et al. (2010), it is possible to fuse generative document structure models and task specific structured conditional models. While we do model document structure in terms of sentiment transitions, we do not model topical structure. An interesting avenue for future work would be to extend the model of Sauper et al. (2010) to take coarse-grained task-specific supervision into account, while modeling fine-grained task-specific aspects with latent variables. Note also that the proposed approach is orthogonal to semi-supervised and unsupervised induction of context independent (prior polarity) lexicons (Turney, 2002; Kim and Hovy, 2004; Esuli and Sebastiani, 2009; Rao and Ravichandran, 2009; Velikovich et al., 2010). The output of such models could readily be incorporated as features in the proposed model. 1.1 Preliminaries Let d be a document consisting of n sentences, s = (s_i)_{i=1}^n, with a document–sentence-sequence pair denoted d = (d, s). Let y = (y^d, y^s) denote random variables: the document level sentiment, y^d, and the sequence of sentence level sentiment, y^s = (y_i^s)_{i=1}^n. (We are abusing notation throughout by using the same symbols to refer to random variables and their particular assignments.) In what follows, we assume that we have access to two training sets: a small set of fully labeled instances, D_F = {(d_j, y_j)}_{j=1}^{m_f}, and a large set of coarsely labeled instances, D_C = {(d_j, y_j^d)}_{j=m_f+1}^{m_f+m_c}. Furthermore, we assume that y^d and all y_i^s take values in {POS, NEG, NEU}. We focus on structured conditional models in the exponential family, with the standard parametrization p_θ(y^d, y^s | s) = exp{⟨φ(y^d, y^s, s), θ⟩ − A_θ(s)}
4 0.34685415 292 acl-2011-Target-dependent Twitter Sentiment Classification
Author: Long Jiang ; Mo Yu ; Ming Zhou ; Xiaohua Liu ; Tiejun Zhao
Abstract: Sentiment analysis on Twitter data has attracted much attention recently. In this paper, we focus on target-dependent Twitter sentiment classification; namely, given a query, we classify the sentiments of the tweets as positive, negative or neutral according to whether they contain positive, negative or neutral sentiments about that query. Here the query serves as the target of the sentiments. The state-of-the-art approaches for solving this problem always adopt the target-independent strategy, which may assign irrelevant sentiments to the given target. Moreover, the state-of-the-art approaches only take the tweet to be classified into consideration when classifying the sentiment; they ignore its context (i.e., related tweets). However, because tweets are usually short and more ambiguous, sometimes it is not enough to consider only the current tweet for sentiment classification. In this paper, we propose to improve target-dependent Twitter sentiment classification by 1) incorporating target-dependent features; and 2) taking related tweets into consideration. According to the experimental results, our approach greatly improves the performance of target-dependent sentiment classification. 1
5 0.341573 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
Author: Bin Lu ; Chenhao Tan ; Claire Cardie ; Benjamin K. Tsou
Abstract: Most previous work on multilingual sentiment analysis has focused on methods to adapt sentiment resources from resource-rich languages to resource-poor languages. We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data. We rely on the intuition that the sentiment labels for parallel sentences should be similar and present a model that jointly learns improved monolingual sentiment classifiers for each language. Experiments on multiple data sets show that the proposed approach (1) outperforms the monolingual baselines, significantly improving the accuracy for both languages by 3.44%–8.12%; (2) outperforms two standard approaches for leveraging unlabeled data; and (3) produces (albeit smaller) performance gains when employing pseudo-parallel data from machine translation engines. 1
6 0.33630368 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
7 0.32509303 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
8 0.32259732 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
9 0.28877982 253 acl-2011-PsychoSentiWordNet
10 0.28594747 105 acl-2011-Dr Sentiment Knows Everything!
11 0.26890332 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
12 0.25761369 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages
13 0.18737032 82 acl-2011-Content Models with Attitude
14 0.16513394 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
15 0.16310436 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic
16 0.14882605 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
17 0.14850752 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
18 0.14766875 256 acl-2011-Query Weighting for Ranking Model Adaptation
19 0.14301184 211 acl-2011-Liars and Saviors in a Sentiment Annotated Corpus of Comments to Political Debates
20 0.1349013 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
topicId topicWeight
[(0, 0.298), (1, 0.441), (2, 0.348), (3, -0.052), (4, 0.103), (5, -0.056), (6, -0.061), (7, 0.07), (8, 0.026), (9, -0.006), (10, 0.136), (11, -0.047), (12, 0.02), (13, 0.038), (14, 0.113), (15, 0.081), (16, -0.028), (17, 0.003), (18, -0.015), (19, 0.047), (20, 0.016), (21, -0.049), (22, 0.002), (23, -0.018), (24, -0.028), (25, 0.004), (26, -0.038), (27, -0.045), (28, 0.023), (29, 0.009), (30, -0.098), (31, 0.038), (32, -0.033), (33, -0.005), (34, -0.043), (35, -0.011), (36, -0.026), (37, -0.013), (38, 0.025), (39, 0.055), (40, 0.01), (41, -0.07), (42, 0.011), (43, 0.062), (44, 0.014), (45, 0.057), (46, 0.01), (47, -0.02), (48, 0.02), (49, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.96238607 204 acl-2011-Learning Word Vectors for Sentiment Analysis
2 0.93364644 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
4 0.89794725 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
Author: Yulan He ; Chenghua Lin ; Harith Alani
Abstract: Joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required by JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs either better or comparably compared to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning.
5 0.82271928 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
Author: Jianxing Yu ; Zheng-Jun Zha ; Meng Wang ; Tat-Seng Chua
Abstract: In this paper, we dedicate to the topic of aspect ranking, which aims to automatically identify important product aspects from online consumer reviews. The important aspects are identified according to two observations: (a) the important aspects of a product are usually commented by a large number of consumers; and (b) consumers’ opinions on the important aspects greatly influence their overall opinions on the product. In particular, given consumer reviews of a product, we first identify the product aspects by a shallow dependency parser and determine consumers’ opinions on these aspects via a sentiment classifier. We then develop an aspect ranking algorithm to identify the important aspects by simultaneously considering the aspect frequency and the influence of consumers’ opinions given to each aspect on their overall opinions. The experimental results on 11 popular products in four domains demonstrate the effectiveness of our approach. We further apply the aspect ranking results to the application of document-level sentiment classification, and improve the performance significantly.
6 0.80595917 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages
7 0.75860476 292 acl-2011-Target-dependent Twitter Sentiment Classification
8 0.72302854 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
9 0.71101463 82 acl-2011-Content Models with Attitude
10 0.70319045 253 acl-2011-PsychoSentiWordNet
11 0.6994952 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
12 0.69377875 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
13 0.69153923 105 acl-2011-Dr Sentiment Knows Everything!
14 0.59069639 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
15 0.57226175 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic
16 0.56054956 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
17 0.53490758 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
18 0.46086165 211 acl-2011-Liars and Saviors in a Sentiment Annotated Corpus of Comments to Political Debates
19 0.45838353 133 acl-2011-Extracting Social Power Relationships from Natural Language
20 0.41040501 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
topicId topicWeight
[(5, 0.017), (17, 0.032), (26, 0.017), (31, 0.01), (37, 0.501), (39, 0.035), (41, 0.061), (55, 0.031), (59, 0.042), (72, 0.024), (91, 0.025), (96, 0.128)]
simIndex simValue paperId paperTitle
1 0.98576814 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
Author: Kevin Duh ; Akinori Fujino ; Masaaki Nagata
Abstract: Recent advances in Machine Translation (MT) have brought forth a new paradigm for building NLP applications in low-resource scenarios. To build a sentiment classifier for a language with no labeled resources, one can translate labeled data from another language, then train a classifier on the translated text. This can be viewed as a domain adaptation problem, where labeled translations and test data have some mismatch. Various prior work has achieved positive results using this approach. In this opinion piece, we take a step back and make some general statements about cross-lingual adaptation problems. First, we claim that domain mismatch is not caused by MT errors, and accuracy degradation will occur even in the case of perfect MT. Second, we argue that the cross-lingual adaptation problem is qualitatively different from other (monolingual) adaptation problems in NLP; thus new adaptation algorithms ought to be considered. This paper will describe a series of carefully designed experiments that led us to these conclusions. 1 Summary Question 1: If MT gave perfect translations (semantically), do we still have a domain adaptation challenge in cross-lingual sentiment classification? Answer: Yes. The reason is that while many translations of a word may be valid, the MT system might have a systematic bias. For example, the word “awesome” might be prevalent in English reviews, but in translated reviews, the word “excellent” is generated instead. From the perspective of MT, this translation is correct and preserves sentiment polarity. But from the perspective of a classifier, there is a domain mismatch due to differences in word distributions. Question 2: Can we apply standard adaptation algorithms developed for other (monolingual) adaptation problems to cross-lingual adaptation? Answer: No. It appears that the interaction between target unlabeled data and source data can be rather unexpected in the case of cross-lingual adaptation. We do not know the reason, but our experiments show that the accuracy of adaptation algorithms in cross-lingual scenarios has much higher variance than in monolingual scenarios. The goal of this opinion piece is to argue the need to better understand the characteristics of domain adaptation in cross-lingual problems. We invite the reader to disagree with our conclusion (that the true barrier to good performance is not insufficient MT quality, but inappropriate domain adaptation methods). Here we present a series of experiments that led us to this conclusion. First we describe the experiment design (§2) and baselines (§3), before answering Question 1 (§4) and Question 2 (§5). 2 Experiment Design The cross-lingual setup is this: we have labeled data from source domain S and wish to build a sentiment classifier for target domain T. Domain mismatch can arise from language differences (e.g. English vs. translated text) or market differences (e.g. DVD vs. Book reviews). Our experiments will involve fixing T to a common testset and varying S. This allows us to experiment with different settings for adaptation. We use the Amazon review dataset of Prettenhofer (2010) [1], due to its wide range of languages (English [EN], Japanese [JP], French [FR], German [DE]) and markets (music, DVD, books).
Unlike Prettenhofer (2010), we reverse the direction of cross-lingual adaptation and consider English as target. English is not a low-resource language, but this setting allows for more comparisons. Each source dataset has 2000 reviews, equally balanced between positive and negative. The target has 2000 test samples, large unlabeled data (25k, 30k, 50k samples respectively for Music, DVD, and Books), and an additional 2000 labeled data reserved for oracle experiments. Texts in JP, FR, and DE are translated word-by-word into English with Google Translate. [2] We perform three sets of experiments, shown in Table 1. Table 2 lists all the results; we will interpret them in the following sections. [Table 1: Experiment setup; for each target (Music-EN, DVD-EN, Book-EN), the source S varies by language (JP, FR, DE) and by market (Music, DVD, Books).] 3 How much performance degradation occurs in cross-lingual adaptation? First, we need to quantify the accuracy degradation under different source data, without consideration of domain adaptation methods. So we train a SVM classifier on labeled source data [3], and directly apply it on test data. The oracle setting, which has no domain-mismatch (e.g. train on Music-EN, test on Music-EN), achieves an average test accuracy of (81.6 + 80.9 + 80.0)/3 = 80.8% [4]. Average cross-lingual accuracies are: 69.4% (JP), 75.6% (FR), 77.0% (DE), so degradations compared to oracle are: -11% (JP), -5% (FR), -4% (DE). [5] Cross-market degradations are around -6%. [6] (Footnotes: [1] http://www.webis.de/research/corpora/webis-cls-10 [2] This is done by querying foreign words to build a bilingual dictionary. The words are converted to tfidf unigram features. [3] For all methods we try here, 5% of the 2000 labeled source samples are held-out for parameter tuning. [4] See column EN of Table 2, Supervised SVM results.) Observation 1: Degradations due to market and language mismatch are comparable in several cases (e.g. MUSIC-DE and DVD-EN perform similarly for target MUSIC-EN). Observation 2: The ranking of source language by decreasing accuracy is DE > FR > JP. Does this mean JP-EN is a more difficult language pair for MT? The next section will show that this is not necessarily the case. Certainly, the domain mismatch for JP is larger than for DE, but this could be due to phenomena other than MT errors. 4 Where exactly is the domain mismatch? 4.1 Theory of Domain Adaptation We analyze domain adaptation by the concepts of labeling and instance mismatch (Jiang and Zhai, 2007). Let pt(x, y) = pt(y|x)pt(x) be the target distribution of samples x (e.g. unigram feature vector) and labels y (positive / negative). Let ps(x, y) = ps(y|x)ps(x) be the corresponding source distribution. We assume that one (or both) of the following distributions differ between source and target: • Instance mismatch: ps(x) ≠ pt(x). • Labeling mismatch: ps(y|x) ≠ pt(y|x). Instance mismatch implies that the input feature vectors have different distributions (e.g. one dataset uses the word “excellent” often, while the other uses the word “awesome”). This degrades performance because classifiers trained on “excellent” might not know how to classify texts with the word “awesome.” The solution is to tie together these features (Blitzer et al., 2006) or re-weight the input distribution (Sugiyama et al., 2008). Under some assumptions (i.e. covariate shift), oracle accuracy can be achieved theoretically (Shimodaira, 2000). Labeling mismatch implies the same input has different labels in different domains.
For example, the JP word meaning “excellent” may be mistranslated as “bad” in English. Then, positive JP reviews will be associated with the word “bad”: ps(y = +1 | x = bad) will be high, whereas the true conditional distribution should have high pt(y = −1 | x = bad) instead. There are several cases for labeling mismatch, depending on how the polarity changes (Table 3). The solution is to filter out these noisy samples (Jiang and Zhai, 2007) or optimize loosely-linked objectives through shared parameters or Bayesian priors (Finkel and Manning, 2009). (Footnotes: [5] See “Adapt by Language” columns of Table 2. Note the JP+FR+DE condition has 6000 labeled samples, so it is not directly comparable to other adaptation scenarios (2000 samples). Nevertheless, mixing languages seems to give good results. [6] See “Adapt by Market” columns of Table 2.) [Table 2: Test accuracies (%) for English Music/DVD/Book reviews. Each column is an adaptation scenario using different source data. The source data may vary by language or by market. For example, the first row shows that for the target of Music-EN, the accuracy of a SVM trained on translated JP reviews (in the same market) is 68.5, while the accuracy of a SVM trained on DVD reviews (in the same language) is 76.8. “Oracle” indicates training on the same market and same language domain as the target. “JP+FR+DE” indicates the concatenation of JP, FR, DE as source data. Boldface shows the winner of Supervised vs. Adapted.] Which mismatch is responsible for accuracy degradations in cross-lingual adaptation? • Instance mismatch: systematic MT bias generates word distributions different from naturally-occurring English. (Translation may be valid.) • Label mismatch: MT error mis-translates a word into something with different polarity. Conclusion from §4.2 and §4.3: Instance mismatch occurs often; MT error appears minimal. [Table 3: Effects of mis-translated polarity, depending on whether positive (+), negative (−), or neutral (0) words change polarity.] We think the first two cases have graceful degradation, but the third case may be catastrophic. 4.2 Analysis of Instance Mismatch To measure instance mismatch, we compute statistics between ps(x) and pt(x), or approximations thereof: First, we calculate a (normalized) average feature vector from all samples of source S, which represents the unigram distribution of MT output. Similarly, the average feature vector for target T approximates the unigram distribution of English reviews pt(x). Then we measure: • KL Divergence between Avg(S) and Avg(T), where Avg() is the average vector. • Set Coverage of Avg(T) on Avg(S): how many words (types) in T appear at least once in S. Both measures correlate strongly with final accuracy, as seen in Figure 1. The correlation coefficients are r = −0.78 for KL Divergence and r = 0.71 for Coverage, both statistically significant (p < 0.05).
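Both statistics are easy to sketch in code. The smoothing, normalization, and the direction of the KL divergence below are our own assumptions, not specified by the paper:

```python
import numpy as np

def avg_feature(X):
    """Average (then L1-normalized) feature vector over all samples."""
    m = np.asarray(X, dtype=float).mean(axis=0)
    return m / m.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with small smoothing to avoid division by zero."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def coverage(avg_t, avg_s):
    """Fraction of target word types that appear at least once in source."""
    return float(np.mean(avg_s[avg_t > 0] > 0))

S = np.random.poisson(0.05, size=(2000, 10000))   # toy source doc-term counts
T = np.random.poisson(0.05, size=(2000, 10000))   # toy target doc-term counts
at, as_ = avg_feature(T), avg_feature(S)
print(kl_divergence(at, as_), coverage(at, as_))
```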
This implies that instance mismatch is an important reason for the degradations seen in Section 3.⁷

⁷ The observant reader may notice that cross-market points exhibit higher coverage but accuracy equal (74-78%) to some cross-lingual points. This suggests that MT output may be more constrained in vocabulary than naturally-occurring English.

4.3 Analysis of Labeling Mismatch

We measure labeling mismatch by looking at differences between the weight vectors of the oracle SVM and the adapted SVM. Intuitively, if a feature has positive weight in the oracle SVM but negative weight in the adapted SVM, then it is likely that an MT mis-translation is causing the polarity flip. Algorithm 1 (with K = 2000) shows how we compute the polarity flip rate.⁸

⁸ The feature normalization in Step 1 is important to ensure that the weight magnitudes are comparable.

Algorithm 1: Measuring labeling mismatch
Input: weight vectors for source ws and target wt
Input: target data average sample vector avg(T)
Output: polarity flip rate f
1: Normalize: ws = avg(T) * ws ; wt = avg(T) * wt
2: Set S+ = {K most positive features in ws}
3: Set S− = {K most negative features in ws}
4: Set T+ = {K most positive features in wt}
5: Set T− = {K most negative features in wt}
6: for each feature i ∈ T+ do
7:   if i ∈ S− then f = f + 1
8: end for
9: for each feature j ∈ T− do
10:   if j ∈ S+ then f = f + 1
11: end for
12: f = f / (2K)

We found that the polarity flip rate does not correlate well with accuracy at all (r = 0.04). Conclusion: Labeling mismatch is not a factor in performance degradation. Nevertheless, we note that there is a surprisingly large number of flips (24% on average). A manual check of the flipped words in BOOK-JP revealed few MT mistakes. Only 3.7% of 450 random EN-JP word pairs checked can be judged as blatantly incorrect (without sentence context). The majority of flipped words do not have a clear sentiment orientation (e.g. "amazon", "human", "moreover").
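For concreteness, Algorithm 1 transcribes directly into Python. This is our sketch, not the authors' code; it assumes w_s and w_t are dense weight vectors over a shared feature vocabulary, and avg_t is the target's average sample vector avg(T).

import numpy as np

def polarity_flip_rate(w_s, w_t, avg_t, K=2000):
    # Step 1: scale by avg(T) so that weight magnitudes are comparable.
    ws = avg_t * w_s
    wt = avg_t * w_t
    # Steps 2-5: the K most positive / most negative features per domain.
    order_s, order_t = np.argsort(ws), np.argsort(wt)
    s_pos, s_neg = set(order_s[-K:]), set(order_s[:K])
    t_pos, t_neg = set(order_t[-K:]), set(order_t[:K])
    # Steps 6-11: a flip is a target-positive feature that is
    # source-negative, or a target-negative feature that is source-positive.
    flips = len(t_pos & s_neg) + len(t_neg & s_pos)
    # Step 12: normalize by the 2K target features inspected.
    return flips / (2 * K)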
5 Are standard adaptation algorithms applicable to cross-lingual problems?

One of the breakthroughs in cross-lingual text classification is the realization that it can be cast as domain adaptation. This makes available a host of pre-existing adaptation algorithms for improving over supervised results. However, we argue that it may be better to "adapt" the standard adaptation algorithms to the cross-lingual setting. We arrived at this conclusion by trying the adapted counterpart of SVMs off-the-shelf. Recently, Bergamo and Torresani (2010) showed that Transductive SVMs (TSVM), originally developed for semi-supervised learning, are also strong adaptation methods. The idea is to train on source data like an SVM, but to encourage the classification boundary to pass through low-density regions of the unlabeled target data. Table 2 shows that TSVM outperforms SVM in all but one case for cross-market adaptation, but gives mixed results for cross-lingual adaptation. This is a puzzling result considering that both use the same unlabeled data. Why does TSVM exhibit such a large variance on cross-lingual problems, but not on cross-market problems? Is unlabeled target data interacting with source data in some unexpected way?

Certainly there are several successful studies (Wan, 2009; Wei and Pal, 2010; Banea et al., 2008), but we think it is important to consider the possibility that cross-lingual adaptation has some fundamental differences. We conjecture that adapting from artificially-generated text (e.g. MT output) is a different story than adapting from naturally-occurring text (e.g. cross-market). In short, MT is ripe for cross-lingual adaptation; what is not ripe is probably our understanding of the special characteristics of the adaptation problem.

References

Carmen Banea, Rada Mihalcea, Janyce Wiebe, and Samer Hassan. 2008. Multilingual subjectivity analysis using machine translation. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Alessandro Bergamo and Lorenzo Torresani. 2010. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in Neural Information Processing Systems (NIPS).
John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In Proc. of Conference on Empirical Methods in Natural Language Processing (EMNLP).
Jenny Rose Finkel and Chris Manning. 2009. Hierarchical Bayesian domain adaptation. In Proc. of NAACL Human Language Technologies (HLT).
Jing Jiang and ChengXiang Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proc. of the Association for Computational Linguistics (ACL).
Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proc. of the Association for Computational Linguistics (ACL).
Hidetoshi Shimodaira. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. 2008. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4).
Xiaojun Wan. 2009. Co-training for cross-lingual sentiment classification. In Proc. of the Association for Computational Linguistics (ACL).
Bin Wei and Chris Pal. 2010. Cross lingual adaptation: an experiment on sentiment classification. In Proceedings of the ACL 2010 Conference Short Papers.
2 0.95881331 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation
Author: Roy Schwartz ; Omri Abend ; Roi Reichart ; Ari Rappoport
Abstract: Dependency parsing is a central NLP task. In this paper we show that the common evaluation for unsupervised dependency parsing is highly sensitive to problematic annotations. We show that for three leading unsupervised parsers (Klein and Manning, 2004; Cohen and Smith, 2009; Spitkovsky et al., 2010a), a small set of parameters can be found whose modification yields a significant improvement in standard evaluation measures. These parameters correspond to local cases where no linguistic consensus exists as to the proper gold annotation. Therefore, the standard evaluation does not provide a true indication of algorithm quality. We present a new measure, Neutral Edge Direction (NED), and show that it greatly reduces this undesired phenomenon.
3 0.95636356 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
Author: Guangyou Zhou ; Jun Zhao ; Kang Liu ; Li Cai
Abstract: In this paper, we present a novel approach which incorporates web-derived selectional preferences to improve statistical dependency parsing. Conventional selectional preference learning methods have usually focused on word-to-class relations, e.g., a verb selects as its subject a given nominal class. This paper extends previous work to word-to-word selectional preferences by using web-scale data. Experiments show that web-scale data improves statistical dependency parsing, particularly for long dependency relationships. There is no data like more data: performance improves log-linearly with the number of parameters (unique N-grams). More importantly, when operating on new domains, we show that using web-derived selectional preferences is essential for achieving robust performance.
4 0.9494217 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars
Author: Mark-Jan Nederhof ; Giorgio Satta
Abstract: We present a method for the computation of prefix probabilities for synchronous contextfree grammars. Our framework is fairly general and relies on the combination of a simple, novel grammar transformation and standard techniques to bring grammars into normal forms.
5 0.94823396 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between maximum-likelihood training and discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
6 0.94172496 122 acl-2011-Event Extraction as Dependency Parsing
same-paper 7 0.94070888 204 acl-2011-Learning Word Vectors for Sentiment Analysis
8 0.93686664 334 acl-2011-Which Noun Phrases Denote Which Concepts?
10 0.84022498 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
11 0.8393259 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
12 0.8348273 256 acl-2011-Query Weighting for Ranking Model Adaptation
13 0.82983041 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
14 0.82464439 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
15 0.81931037 85 acl-2011-Coreference Resolution with World Knowledge
16 0.80777466 292 acl-2011-Target-dependent Twitter Sentiment Classification
17 0.80561256 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning
18 0.80397409 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
19 0.80336457 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
20 0.80095416 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing