emnlp emnlp2011 emnlp2011-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samuel Brody ; Nicholas Diakopoulos
Abstract: We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. The contributions of the paper are as follows. First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. We then show that lengthening is strongly associated with subjectivity and sentiment. Finally, we present an automatic method which leverages this association to detect domain-specific sentiment- and emotion-bearing words. We evaluate our method by comparison to human judgments, and analyze its strengths and weaknesses. Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes.
Reference: text
sentIndex sentText sentNum sentScore
1 We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. [sent-16, score-1.392]
2 First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. [sent-18, score-0.829]
3 We then show that lengthening is strongly associated with subjectivity and sentiment. [sent-19, score-0.691]
4 Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes. [sent-22, score-0.558]
5 1 Introduction Recently, there has been a surge of interest in sentiment analysis of Twitter messages. [sent-23, score-0.42]
6 2011; Kivran-Swaine and Naaman 2011) are interested in studying structure and interactions in social networks, where sentiment can play an important role. [sent-27, score-0.491]
7 Others use Twitter as a barometer for public mood and opinion in diverse areas such as entertainment, politics and economics. [sent-28, score-0.182]
8 (2010) measure public mood on Twitter and use it to predict upcoming stock market fluctuations. [sent-30, score-0.22]
9 (2010) connect public opinion data from polls to sentiment expressed in Twitter messages along a timeline. [sent-32, score-0.598]
10 A prerequisite of all such research is an effective method for measuring the sentiment of a post or tweet. [sent-33, score-0.42]
11 For example, Kivran-Swaine and Naaman (2011) use manual coding of tweets in several emotion categories (e. [sent-38, score-0.175]
12 The automatic approach to sentiment analysis is commonly used for processing data from social networks and microblogs, where there is often a huge quantity of information and a need for low latency. [sent-45, score-0.554]
13 The overall sentiment of a piece of text is calculated as a function of the labels of the component words. [sent-49, score-0.42]
14 There are also approaches that use deeper machine learning techniques to train sentiment classifiers on examples that have been labeled for sentiment, either manually or automatically, as described above. [sent-51, score-0.42]
15 2005, see Section 5) were created for a general domain, and suffer from limited coverage and inaccuracies when applied to the highly informal domain of social networks communication. [sent-56, score-0.285]
16 By creating a sentiment lexicon which is specifically tailored to the microblogging domain, or adapting an existing one, we can expect to achieve higher accuracy and increased coverage. [sent-57, score-0.693]
17 (2010), who developed a method for automatically deriving an extensive sentiment lexicon from the web as a whole. [sent-59, score-0.642]
18 The resulting lexicon has greatly increased coverage compared to existing dictionaries and can handle spelling errors and web-specific jargon. [sent-60, score-0.276]
19 They use this expanded emotion lexicon (named GPOS) in conjunction with the lexicon of Wilson et al. [sent-67, score-0.444]
20 The method we present in this paper leverages a phenomenon that is specific to informal social communication to enable the extension of an existing lexicon in a domain specific manner. [sent-69, score-0.479]
21 However, there exist some orthographic conventions which are used to mark or substitute for prosody, including punctuation and typographic styling (italic, bold, and underlined text). [sent-75, score-0.149]
22 In this work, we hypothesize that the commonly observed phenomenon of lengthening words by repeating letters is a substitute for prosodic emphasis (increased duration or change of pitch). [sent-80, score-0.91]
23 Our experiments are designed to analyze the phenomenon of lengthening and its implications to sentiment detection. [sent-82, score-1.144]
24 First, in Experiment I, we show the pervasiveness of the phenomenon in our dataset, and measure the potential gains in coverage that can be achieved by considering lengthening when processing Twitter data. [sent-83, score-0.752]
25 Experiment II substantiates the claim that word lengthening is not arbitrary, and is used for emphasis of important words, including those conveying sentiment and emotion. [sent-84, score-1.083]
26 In the first part of Experiment III we demonstrate the implications of this connection for the purpose of sentiment detection using an existing sentiment lexicon. [sent-85, score-0.947]
27 In the second part, we present an unsupervised method for using the lengthening phenomenon to expand an existing sentiment lexicon and tailor it to our domain. [sent-86, score-1.326]
28 4 Experiment I Detection - To detect and analyze lengthened words, we employ the procedure described in Figure 1. [sent-105, score-0.229]
29 Finally, in Step 4, we associate all the forms in a single set with a canonical form, which is the most common one observed in the data. [sent-108, score-0.186]
30 To reduce noise resulting from typos and misspellings, we do not consider words containing non-alphabetic characters, or sets where the canonical form is a single character or occurs fewer than 10 times. [sent-110, score-0.135]
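The grouping and noise filtering described in the preceding sentences can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Figure 1 procedure: only the choice of the most frequent form as canonical and the three noise filters come from the text. Collapsing every run of a repeated letter to a single letter to build the grouping key is an assumption, and the names `condense` and `group_lengthened` are hypothetical.

```python
import re
from collections import Counter, defaultdict


def condense(word):
    """Collapse each run of a repeated character to one character.

    Assumed grouping key (not taken from the source): "cool", "coool"
    and "cooool" all map to "col".
    """
    return re.sub(r"(.)\1+", r"\1", word)


def group_lengthened(tokens, min_canonical_count=10):
    """Group word forms into sets and pick a canonical form for each set.

    Returns {canonical_form: {form: count}}.
    """
    # Filter from the text: ignore words with non-alphabetic characters.
    counts = Counter(t.lower() for t in tokens if t.isalpha())

    sets = defaultdict(Counter)
    for word, n in counts.items():
        sets[condense(word)][word] = n

    result = {}
    for forms in sets.values():
        # Step 4 in the text: the most common form is the canonical one.
        canonical, freq = forms.most_common(1)[0]
        # Filters from the text: drop sets whose canonical form is a
        # single character or occurs fewer than 10 times.
        if len(canonical) == 1 or freq < min_canonical_count:
            continue
        result[canonical] = dict(forms)
    return result
```

One known limitation of this key is that distinct words such as "cool" and "col" fall into the same set; the source does not say how such collisions are handled.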
31 Analysis Table 1 lists the canonical forms of the 20 largest sets in our list (in terms of the number of variations). [sent-112, score-0.186]
32 , ow, ugh, yay) are often lengthened and, for some, the combined frequency of the different lengthened forms is actually greater than that of the canonical (single most frequent) one. [sent-116, score-0.594]
33 5 million words, our procedure identifies 108,762 word occurrences which are lengthenings of a canonical form. [sent-119, score-0.165]
34 The wide-spread use of lengthening is surprising in light of the length restriction of Twitter. [sent-122, score-0.613]
35 The fact that lengthening is used in spite of the need for brevity suggests that it conveys important information. [sent-124, score-0.613]
36 Canonical Assumption We validate the assumption that the most frequent form in the set is the canonical form by examining sets containing one or more word forms that were identified in a standard terms of cardinality), with the number of occurrences of the canonical and non-canonical forms. [sent-125, score-0.375]
37 This indicates that the strategy of choosing the most frequent form as the canonical one is reliable and highly accurate (> 97%). [sent-131, score-0.159]
38 Implications for NLP To examine the effects of lengthening on analyzing Twitter data, we look at the difference in coverage of a standard English dictionary when we explicitly handle lengthened words by mapping them to the canonical form. [sent-132, score-1.013]
39 We searched for occurrences of these words which were lengthened by two or more characters, meaning they would not be identified using standard lemmatization methods or spell-correction techniques that are based on edit 3We use the standard dictionary for U. [sent-135, score-0.263]
40 5 Experiment II- Relation to Sentiment At the beginning of Section 2 we presented the hypothesis that lengthening represents a textual substitute for prosodic indicators in speech. [sent-142, score-0.716]
41 As such, it is not used arbitrarily, but rather applied to subjective words to strengthen the sentiment or emotion they convey. [sent-143, score-0.597]
42 In this section we wish to provide experimental evidence for our hypothesis, by demonstrating a significant degree of association between lengthening and subjectivity. [sent-145, score-0.613]
43 For this purpose we use an existing sentiment lexicon (Wilson et al. [sent-146, score-0.635]
44 , 2005), which is commonly used in the literature (see Section 1) and is at the core of OpinionFinder4, a popular sentiment analysis tool designed to determine opinion in a general domain. [sent-147, score-0.495]
45 The lexicon provides a list of subjective words, each annotated with its degree of subjectivity (strongly subjective, weakly subjective), as well as its sentiment polarity (positive, negative, or neutral). [sent-148, score-0.82]
46 In these experiments, we use the presence of a word (canonical form) in the lexicon as an indicator of subjectivity. [sent-149, score-0.186]
47 , the fact that a word is absent from the lexicon does not indicate it is objective. [sent-152, score-0.222]
48 As a measure of tendency to lengthen a word, we look at the number of distinct forms of that word appearing in our dataset (the cardinality of the set to which it belongs). [sent-153, score-0.139]
49 The graph shows a clear trend - the more lengthening forms a word has, 4http : / /www . [sent-157, score-0.691]
50 We can verify this by calculating the average number of distinct forms for words in our data that are subjective and comparing to the rest. [sent-162, score-0.156]
51 41 forms for words appearing in our sentiment lexicon (our proxy for subjectivity), compared to an average of 1. [sent-164, score-0.691]
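The comparison described above (average number of distinct forms for lexicon words versus all other words) can be illustrated with a short sketch. The data layout `{canonical_form: {form: count}}` and the function name `mean_cardinality` are assumptions made for illustration; the toy numbers in the usage note are invented and are not results from the paper.

```python
def mean_cardinality(groups, lexicon):
    """Average number of distinct forms per set, split by lexicon membership.

    groups  -- {canonical_form: {form: count}} (assumed layout)
    lexicon -- set of words used as a proxy for subjectivity
    Returns (mean for canonical forms in the lexicon, mean for the rest).
    """
    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    in_lexicon = [len(forms) for canon, forms in groups.items() if canon in lexicon]
    rest = [len(forms) for canon, forms in groups.items() if canon not in lexicon]
    return mean(in_lexicon), mean(rest)
```

With a toy input in which "cool" has three forms, "nice" and "the" two each, and "table" one, and a lexicon containing "cool" and "nice", the function returns one average per group, which mirrors the comparison the text reports.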
52 The lexicon we use was designed for a general domain, and suffers from limited coverage (see below) and inaccuracies (see O’Connor et al. [sent-168, score-0.284]
53 The sentiment lexicon contains 6,878 words, but only 4,939 occur in our data, and only 2,446 appear more than 10 times. [sent-171, score-0.606]
54 Of those appearing in our data, only 485 words (7% of the lexicon vocabulary) are lengthened (the bar for group 2+ in Figure 2), but these are extremely salient. [sent-172, score-0.448]
55 This provides further evidence that lengthening is used with salient sentiment words. [sent-174, score-1.033]
56 These results also demonstrate the limitations of using a sentiment lexicon which is not tailored to the domain. [sent-175, score-0.637]
57 Only a small fraction of the lexicon is represented in our data, and it is likely that there are many sentiment words that are commonly used but are absent from it. [sent-176, score-0.674]
58 6 Experiment III - Adapting the Sentiment Lexicon The previous experiment showed the connection between lengthening and sentiment-bearing words. [sent-178, score-0.683]
59 It also demonstrated some of the shortcomings of a lexicon which is not specifically tailored to our domain. [sent-179, score-0.217]
60 There are two steps we can take to use the lengthening phenomenon to adapt an existing sentiment lexicon. [sent-180, score-1.164]
61 The first of these is simply to take lengthening into account when identifying sentiment-bearing words in our corpus. [sent-181, score-0.613]
62 Figure 2: The percentage of subjective word-sets (those whose canonical form appears in the lexicon) as a function of cardinality (number of lengthening variations). [sent-183, score-0.907]
63 is to exploit the connection between lengthening and sentiment to expand the lexicon itself. [sent-185, score-1.264]
64 1 Expanding Coverage of Existing Words We can assess the effect of specifically considering lengthening in our domain by measuring the increase of coverage of the existing sentiment lexicon. [sent-187, score-1.161]
65 Similarly to Experiment I (Section 4), we searched for occurrences of words from the lexicon which were lengthened by two or more characters, and would therefore not be detected using edit distance. [sent-188, score-0.449]
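The test "lengthened by two or more characters" can be made concrete by comparing run-length encodings of a form and a lexicon word. This is a hedged sketch: the source does not give the matching rule, so the requirement that every run in the form be at least as long as the corresponding run in the base word is an assumption, and `runs` and `is_lengthening` are hypothetical names.

```python
from itertools import groupby


def runs(word):
    """Run-length encode a word: "coool" -> [("c", 1), ("o", 3), ("l", 1)]."""
    return [(ch, len(list(group))) for ch, group in groupby(word)]


def is_lengthening(form, base, min_extra=2):
    """True if `form` is `base` with some letter runs stretched,
    adding at least `min_extra` characters in total."""
    form_runs, base_runs = runs(form), runs(base)
    if len(form_runs) != len(base_runs):
        return False
    for (form_ch, form_n), (base_ch, base_n) in zip(form_runs, base_runs):
        # Same letter sequence, and no run may be shorter than in the base.
        if form_ch != base_ch or form_n < base_n:
            return False
    return len(form) - len(base) >= min_extra
```

A form such as "coooool" differs from "cool" by three inserted characters, so a spell-corrector limited to a small edit distance would not map it back, which is the case the text singles out.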
66 This increase in coverage is relatively small6, but comes at almost no cost, by simply considering lengthening in the analysis. [sent-191, score-0.674]
67 A much greater benefit of lengthening, however, results from using it as an aid in expanding the sentiment lexicon and detecting new sentiment-bearing words. [sent-192, score-0.674]
68 2 Expanding the Sentiment Vocabulary In Experiment II (Section 5) we showed that lengthening is strongly associated with sentiment. [sent-195, score-0.657]
69 Therefore, words which are lengthened can provide us with good candidates for inclusion in the lexicon. [sent-196, score-0.232]
70 We can employ existing sentiment-detection meth6Note that almost half of the increase in coverage calculated in Experiment I(Section 4) comes from subjective words! [sent-197, score-0.22]
71 Choosing a Candidate Set The first step in expanding the lexicon is to choose a set of candidate words for inclusion. [sent-199, score-0.223]
72 15%) are currently in our lexicon (see Figure 2). [sent-202, score-0.186]
73 Since we are looking for commonly lengthened words, we disregard those where the combined frequency of the non-canonical forms is less than 1% that of the canonical one. [sent-203, score-0.422]
74 We also remove stop words, even though some are often lengthened for emphasis (e. [sent-204, score-0.254]
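The candidate selection described above can be sketched as a simple filter. The 1% threshold and the stop-word removal come from the text; the data layout `{canonical_form: {form: count}}`, the function name `choose_candidates`, and the stop-word set passed in are assumptions made for illustration.

```python
def choose_candidates(groups, stop_words, min_ratio=0.01):
    """Select commonly lengthened words as candidates for the lexicon.

    groups     -- {canonical_form: {form: count}} (assumed layout)
    stop_words -- set of words to exclude (caller-supplied)
    min_ratio  -- combined frequency of non-canonical forms must be at
                  least this fraction of the canonical form's frequency
    """
    candidates = []
    for canonical, forms in groups.items():
        if canonical in stop_words:
            continue
        canonical_freq = forms[canonical]
        lengthened_freq = sum(n for form, n in forms.items() if form != canonical)
        # Disregard sets whose lengthened forms are too rare (< 1% in the text).
        if lengthened_freq < min_ratio * canonical_freq:
            continue
        candidates.append(canonical)
    return sorted(candidates)
```

Note that a set containing only its canonical form has a lengthened frequency of zero and is therefore never selected.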
75 Graph Approach We examine two methods for sentiment detection - that of Brody and Elhadad (2010) for detecting sentiment in reviews, and that of Velikovich et al. [sent-209, score-0.871]
76 (2010) for finding sentiment terms in a giga-scale web corpus. [sent-210, score-0.456]
77 Both of these employ a graph-based approach, where candidate terms are nodes, and sentiment is propagated from a set of seed words of known sentiment polarity. [sent-211, score-0.542]
78 We calculated the frequency in our corpus of all strongly positive and strongly negative words in the Wilson et al. [sent-212, score-0.197]
79 We consider all our candidate words as nodes, along with the words in our positive and negative seed sets. [sent-223, score-0.234]
80 The words in the positive and negative seed groups are assigned a polarity score of 1 and 0, respectively. [sent-229, score-0.281]
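The seed assignment above (1 for positive seeds, 0 for negative seeds) is the usual starting point for iterative label propagation. The following is a generic sketch in that spirit, not the authors' implementation: the edge weights, the initial score of 0.5 for non-seed nodes, the fixed iteration count, and the name `propagate` are all assumptions.

```python
def propagate(weights, positive, negative, iterations=50):
    """Propagate polarity scores from clamped seed nodes over a weighted graph.

    weights  -- {node: {neighbor: weight}} (assumed, caller-supplied)
    positive -- seed words clamped to 1.0
    negative -- seed words clamped to 0.0
    Returns {node: score in [0, 1]}.
    """
    nodes = set(weights)
    for neighbors in weights.values():
        nodes.update(neighbors)

    seeds = set(positive) | set(negative)
    score = {node: 0.5 for node in nodes}          # neutral start (assumption)
    score.update({node: 1.0 for node in positive})
    score.update({node: 0.0 for node in negative})

    for _ in range(iterations):
        updated = dict(score)
        for node in nodes - seeds:                 # seeds stay clamped
            neighbors = weights.get(node, {})
            total = sum(neighbors.values())
            if total > 0:
                # New score is the weighted average of the neighbors' scores.
                updated[node] = sum(w * score[m] for m, w in neighbors.items()) / total
        score = updated
    return score
```

Under this scheme a candidate connected only to positive seeds converges to 1, one connected only to negative seeds converges to 0, and one linked equally to both stays near 0.5.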
81 In their paper, the authors claim that their algorithm is more suitable than that of Zhu and Ghahramani (2002) to a web-based dataset, which P and N are the positive and negative seed sets, respectively, w_ij are the weights, and T and γ are parameters. [sent-239, score-0.206]
82 For words appearing in the sentiment lexicon, we used the polarity label provided. [sent-243, score-0.529]
83 The Web algorithm takes into account the strongest path to every seed word, while the Reviews algorithm propagates from each seed to its neighbors and then onward. [sent-274, score-0.194]
84 The Web algorithm, on the other hand, finds words that have a strong association with the positive or negative seed group as a whole, thus making it more robust. [sent-277, score-0.23]
85 The words yeah and yea, which often follow the negative seed word hell, are considered negative by the Reviews algorithm. [sent-279, score-0.213]
86 The word Justin refers to Justin Bieber, and is closely associated with the positive seed word love. [sent-280, score-0.148]
87 For example, the word bull is listed as positive in the sentiment lexicon, presumably because of its financial sense. [sent-288, score-0.5]
88 This example also demonstrates that our method is capable of detecting terms which are associated with sentiment at different time points, something that is not possible with a fixed lexicon. [sent-292, score-0.451]
89 We also demonstrated that lengthening is not arbitrary, and is often used with subjective words, presumably to emphasize the sentiment they convey. [sent-296, score-1.167]
90 This finding leads us to develop an unsupervised method based on lengthening for detecting new sentiment-bearing words that are not in the existing lexicon, and discovering their polarity. [sent-297, score-1.093]
91 In the rapidly-changing domain of microblogging and net-speak, such a method is essential for up-to-date sentiment detection. [sent-298, score-0.485]
92 8 Future Work This paper examined one aspect of the lengthening phenomenon. [sent-299, score-0.613]
93 There are other aspects of lengthening that merit research, such as the connection between the amount of lengthening and the strength of emphasis in individual instances of a word. [sent-300, score-1.321]
94 , very, so, odee), and named entities associated with sentiment (e. [sent-303, score-0.42]
95 Another direction of research is the connection between lengthening and other orthographic conventions associated with sentiment and emphasis, such as emoticons, punctuation, and capitalization. [sent-308, score-1.162]
96 Finally, we plan to integrate lengthening and its related phenomena into an accurate, Twitter-specific, sentiment classifier. [sent-309, score-1.033]
97 Robust sentiment detection on twitter from biased and noisy data. [sent-314, score-0.607]
98 Network properties and social sharing of emotions in social awareness streams. [sent-367, score-0.142]
99 From tweets to polls: Linking text sentiment to public opinion time series. [sent-383, score-0.629]
100 Twitter as a corpus for sentiment analysis and opinion mining. [sent-387, score-0.463]
wordName wordTfidf (topN-words)
[('lengthening', 0.613), ('sentiment', 0.42), ('lengthened', 0.204), ('twitter', 0.187), ('lexicon', 0.186), ('brody', 0.139), ('canonical', 0.135), ('velikovich', 0.121), ('subjective', 0.105), ('tweets', 0.103), ('seed', 0.097), ('diakopoulos', 0.093), ('elhadad', 0.083), ('bollen', 0.08), ('phenomenon', 0.078), ('mood', 0.076), ('polarity', 0.075), ('emotion', 0.072), ('social', 0.071), ('microblogs', 0.067), ('reviews', 0.064), ('public', 0.063), ('coverage', 0.061), ('negative', 0.058), ('naaman', 0.056), ('odee', 0.056), ('rutgers', 0.056), ('shamma', 0.056), ('wilson', 0.054), ('cardinality', 0.054), ('conventions', 0.054), ('positive', 0.051), ('forms', 0.051), ('emphasis', 0.05), ('messaging', 0.048), ('prosodic', 0.048), ('informal', 0.047), ('ratings', 0.045), ('connection', 0.045), ('strongly', 0.044), ('opinion', 0.043), ('messages', 0.04), ('domain', 0.038), ('neutral', 0.037), ('expanding', 0.037), ('accent', 0.037), ('bolinger', 0.037), ('canes', 0.037), ('grinter', 0.037), ('happiness', 0.037), ('inaccuracies', 0.037), ('mcnair', 0.037), ('mor', 0.037), ('pitch', 0.037), ('poms', 0.037), ('styling', 0.037), ('web', 0.036), ('absent', 0.036), ('appearing', 0.034), ('subjectivity', 0.034), ('characters', 0.033), ('connor', 0.033), ('justin', 0.033), ('associations', 0.033), ('implications', 0.033), ('bermingham', 0.032), ('calhoun', 0.032), ('duration', 0.032), ('joy', 0.032), ('pak', 0.032), ('polls', 0.032), ('commonly', 0.032), ('networks', 0.031), ('tailored', 0.031), ('detecting', 0.031), ('leverages', 0.03), ('orthographic', 0.03), ('occurrences', 0.03), ('edges', 0.029), ('presumably', 0.029), ('repeating', 0.029), ('barbosa', 0.029), ('searched', 0.029), ('existing', 0.029), ('substitute', 0.028), ('candidates', 0.028), ('graph', 0.027), ('indicators', 0.027), ('zhu', 0.027), ('microblogging', 0.027), ('samuel', 0.027), ('sampled', 0.025), ('stock', 0.025), ('ghahramani', 0.025), ('nicholas', 0.025), ('employ', 0.025), ('experiment', 0.025), 
('group', 0.024), ('adapt', 0.024), ('frequent', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
Author: Samuel Brody ; Nicholas Diakopoulos
Abstract: We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. The contributions of the paper are as follows. First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. We then show that lengthening is strongly associated with subjectivity and sentiment. Finally, we present an automatic method which leverages this association to detect domain-specific sentiment- and emotion-bearing words. We evaluate our method by comparison to human judgments, and analyze its strengths and weaknesses. Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes.
2 0.264815 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis
Author: Ainur Yessenalina ; Claire Cardie
Abstract: We present a general learning-based approach for phrase-level sentiment analysis that adopts an ordinal sentiment scale and is explicitly compositional in nature. Thus, we can model the compositional effects required for accurate assignment of phrase-level sentiment. For example, combining an adverb (e.g., “very”) with a positive polar adjective (e.g., “good”) produces a phrase (“very good”) with increased polarity over the adjective alone. Inspired by recent work on distributional approaches to compositionality, we model each word as a matrix and combine words using iterated matrix multiplication, which allows for the modeling of both additive and multiplicative semantic effects. Although the multiplication-based matrix-space framework has been shown to be a theoretically elegant way to model composition (Rudolph and Giesbrecht, 2010), training such models has to be done carefully: the optimization is nonconvex and requires a good initial starting point. This paper presents the first such algorithm for learning a matrix-space model for semantic composition. In the context of the phrase-level sentiment analysis task, our experimental results show statistically significant improvements in performance over a bag-of-words model.
3 0.23193276 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
Author: Richard Socher ; Jeffrey Pennington ; Eric H. Huang ; Andrew Y. Ng ; Christopher D. Manning
Abstract: We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model’s ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.
4 0.20038266 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification
Author: Balamurali AR ; Aditya Joshi ; Pushpak Bhattacharyya
Abstract: Traditional approaches to sentiment classification rely on lexical features, syntax-based features or a combination of the two. We propose semantic features using word senses for a supervised document-level sentiment classifier. To highlight the benefit of sense-based features, we compare word-based representation of documents with a sense-based representation where WordNet senses of the words are used as features. In addition, we highlight the benefit of senses by presenting a part-of-speech-wise effect on sentiment classification. Finally, we show that even if a WSD engine disambiguates between a limited set of words in a document, a sentiment classifier still performs better than what it does in absence of sense annotation. Since word senses used as features show promise, we also examine the possibility of using similarity metrics defined on WordNet to address the problem of not finding a sense in the training corpus. We perform experiments using three popular similarity metrics to mitigate the effect of unknown synsets in a test corpus by replacing them with similar synsets from the training corpus. The results show promising improvement with respect to the baseline.
5 0.18633251 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
Author: Song Feng ; Ritwik Bose ; Yejin Choi
Abstract: In this paper, we introduce a connotation lexicon, a new type of lexicon that lists words with connotative polarity, i.e., words with positive connotation (e.g., award, promotion) and words with negative connotation (e.g., cancer, war). Connotation lexicons differ from much studied sentiment lexicons: the latter concerns words that express sentiment, while the former concerns words that evoke or associate with a specific polarity of sentiment. Understanding the connotation of words would seem to require common sense and world knowledge. However, we demonstrate that much of the connotative polarity of words can be inferred from natural language text in a nearly unsupervised manner. The key linguistic insight behind our approach is selectional preference of connotative predicates. We present graph-based algorithms using PageRank and HITS that collectively learn connotation lexicon together with connotative predicates. Our empirical study demonstrates that the resulting connotation lexicon is of great value for sentiment analysis complementing existing sentiment lexicons.
6 0.16333972 71 emnlp-2011-Identifying and Following Expert Investors in Stock Microblogs
7 0.14505936 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
8 0.14124914 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
9 0.13902332 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
10 0.13871807 41 emnlp-2011-Discriminating Gender on Twitter
11 0.12866031 89 emnlp-2011-Linguistic Redundancy in Twitter
12 0.1250874 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
13 0.079659164 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
14 0.077391744 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
15 0.074631304 142 emnlp-2011-Unsupervised Discovery of Discourse Relations for Eliminating Intra-sentence Polarity Ambiguities
16 0.065051816 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
17 0.062125079 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models
18 0.058322851 38 emnlp-2011-Data-Driven Response Generation in Social Media
19 0.05270464 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
20 0.051848359 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
topicId topicWeight
[(0, 0.187), (1, -0.31), (2, 0.282), (3, 0.117), (4, 0.321), (5, 0.031), (6, 0.058), (7, 0.085), (8, 0.005), (9, 0.043), (10, 0.043), (11, 0.183), (12, 0.002), (13, 0.043), (14, 0.044), (15, 0.033), (16, -0.031), (17, -0.022), (18, 0.048), (19, -0.043), (20, -0.008), (21, -0.028), (22, 0.011), (23, -0.041), (24, -0.009), (25, -0.001), (26, 0.047), (27, -0.065), (28, -0.026), (29, -0.034), (30, -0.005), (31, -0.008), (32, 0.018), (33, -0.003), (34, -0.007), (35, 0.01), (36, -0.038), (37, -0.003), (38, -0.056), (39, -0.005), (40, -0.04), (41, -0.025), (42, 0.031), (43, -0.011), (44, 0.047), (45, 0.002), (46, 0.017), (47, -0.004), (48, 0.001), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.96741533 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
Author: Samuel Brody ; Nicholas Diakopoulos
Abstract: We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. The contributions of the paper are as follows. First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. We then show that lengthening is strongly associated with subjectivity and sentiment. Finally, we present an automatic method which leverages this association to detect domain-specific sentiment- and emotion-bearing words. We evaluate our method by comparison to human judgments, and analyze its strengths and weaknesses. Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes.
2 0.79360348 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
Author: Richard Socher ; Jeffrey Pennington ; Eric H. Huang ; Andrew Y. Ng ; Christopher D. Manning
Abstract: We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model’s ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.
3 0.7909773 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis
Author: Ainur Yessenalina ; Claire Cardie
Abstract: We present a general learning-based approach for phrase-level sentiment analysis that adopts an ordinal sentiment scale and is explicitly compositional in nature. Thus, we can model the compositional effects required for accurate assignment of phrase-level sentiment. For example, combining an adverb (e.g., “very”) with a positive polar adjective (e.g., “good”) produces a phrase (“very good”) with increased polarity over the adjective alone. Inspired by recent work on distributional approaches to compositionality, we model each word as a matrix and combine words using iterated matrix multiplication, which allows for the modeling of both additive and multiplicative semantic effects. Although the multiplication-based matrix-space framework has been shown to be a theoretically elegant way to model composition (Rudolph and Giesbrecht, 2010), training such models has to be done carefully: the optimization is nonconvex and requires a good initial starting point. This paper presents the first such algorithm for learning a matrix-space model for semantic composition. In the context of the phrase-level sentiment analysis task, our experimental results show statistically significant improvements in performance over a bag-of-words model.
4 0.77009869 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
Author: Song Feng ; Ritwik Bose ; Yejin Choi
Abstract: In this paper, we introduce a connotation lexicon, a new type of lexicon that lists words with connotative polarity, i.e., words with positive connotation (e.g., award, promotion) and words with negative connotation (e.g., cancer, war). Connotation lexicons differ from much studied sentiment lexicons: the latter concerns words that express sentiment, while the former concerns words that evoke or associate with a specific polarity of sentiment. Understanding the connotation of words would seem to require common sense and world knowledge. However, we demonstrate that much of the connotative polarity of words can be inferred from natural language text in a nearly unsupervised manner. The key linguistic insight behind our approach is selectional preference of connotative predicates. We present graph-based algorithms using PageRank and HITS that collectively learn connotation lexicon together with connotative predicates. Our empirical study demonstrates that the resulting connotation lexicon is of great value for sentiment analysis complementing existing sentiment lexicons.
5 0.66699976 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification
Author: Balamurali AR ; Aditya Joshi ; Pushpak Bhattacharyya
Abstract: Traditional approaches to sentiment classification rely on lexical features, syntax-based features or a combination of the two. We propose semantic features using word senses for a supervised document-level sentiment classifier. To highlight the benefit of sense-based features, we compare word-based representation of documents with a sense-based representation where WordNet senses of the words are used as features. In addition, we highlight the benefit of senses by presenting a part-of-speech-wise effect on sentiment classification. Finally, we show that even if a WSD engine disambiguates between a limited set of words in a document, a sentiment classifier still performs better than what it does in absence of sense annotation. Since word senses used as features show promise, we also examine the possibility of using similarity metrics defined on WordNet to address the problem of not finding a sense in the training corpus. We perform experiments using three popular similarity metrics to mitigate the effect of unknown synsets in a test corpus by replacing them with similar synsets from the training corpus. The results show promising improvement with respect to the baseline.
6 0.57065761 71 emnlp-2011-Identifying and Following Expert Investors in Stock Microblogs
7 0.55959213 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
8 0.5468899 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
9 0.44757441 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
10 0.41904426 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
11 0.39965078 89 emnlp-2011-Linguistic Redundancy in Twitter
12 0.36328369 41 emnlp-2011-Discriminating Gender on Twitter
13 0.30875325 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
14 0.27875224 142 emnlp-2011-Unsupervised Discovery of Discourse Relations for Eliminating Intra-sentence Polarity Ambiguities
15 0.23164909 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models
16 0.22345883 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
17 0.19352227 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
18 0.18254924 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
19 0.18189576 38 emnlp-2011-Data-Driven Response Generation in Social Media
20 0.18179269 122 emnlp-2011-Simple Effective Decipherment via Combinatorial Optimization
topicId topicWeight
[(15, 0.013), (23, 0.085), (36, 0.026), (37, 0.025), (45, 0.09), (52, 0.036), (53, 0.031), (54, 0.025), (57, 0.013), (62, 0.02), (64, 0.024), (66, 0.026), (69, 0.012), (70, 0.257), (79, 0.063), (82, 0.012), (96, 0.065), (97, 0.069), (98, 0.025)]
simIndex simValue paperId paperTitle
1 0.80073249 72 emnlp-2011-Improved Transliteration Mining Using Graph Reinforcement
Author: Ali El Kahki ; Kareem Darwish ; Ahmed Saad El Din ; Mohamed Abd El-Wahab ; Ahmed Hefny ; Waleed Ammar
Abstract: Mining of transliterations from comparable or parallel text can enhance natural language processing applications such as machine translation and cross language information retrieval. This paper presents an enhanced transliteration mining technique that uses a generative graph reinforcement model to infer mappings between source and target character sequences. An initial set of mappings is learned through automatic alignment of transliteration pairs at character sequence level. Then, these mappings are modeled using a bipartite graph. A graph reinforcement algorithm is then used to enrich the graph by inferring additional mappings. During graph reinforcement, appropriate link reweighting is used to promote good mappings and to demote bad ones. The enhanced transliteration mining technique is tested in the context of mining transliterations from parallel Wikipedia titles in 4 alphabet-based language pairs, namely English-Arabic, English-Russian, English-Hindi, and English-Tamil. The improvements in F1-measure over the baseline system were 18.7, 1.0, 4.5, and 32.5 basis points for the four language pairs respectively. The results herein outperform the best reported results in the literature by 2.6, 4.8, 0.8, and 4.1 basis points for the four language pairs respectively.
same-paper 2 0.74945009 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
Author: Samuel Brody ; Nicholas Diakopoulos
Abstract: We present an automatic method which leverages word lengthening to adapt a sentiment lexicon specifically for Twitter and similar social messaging networks. The contributions of the paper are as follows. First, we call attention to lengthening as a widespread phenomenon in microblogs and social messaging, and demonstrate the importance of handling it correctly. We then show that lengthening is strongly associated with subjectivity and sentiment. Finally, we present an automatic method which leverages this association to detect domain-specific sentiment- and emotion-bearing words. We evaluate our method by comparison to human judgments, and analyze its strengths and weaknesses. Our results are of interest to anyone analyzing sentiment in microblogs and social networks, whether for research or commercial purposes.
3 0.5500533 30 emnlp-2011-Compositional Matrix-Space Models for Sentiment Analysis
Author: Ainur Yessenalina ; Claire Cardie
Abstract: We present a general learning-based approach for phrase-level sentiment analysis that adopts an ordinal sentiment scale and is explicitly compositional in nature. Thus, we can model the compositional effects required for accurate assignment of phrase-level sentiment. For example, combining an adverb (e.g., “very”) with a positive polar adjective (e.g., “good”) produces a phrase (“very good”) with increased polarity over the adjective alone. Inspired by recent work on distributional approaches to compositionality, we model each word as a matrix and combine words using iterated matrix multiplication, which allows for the modeling of both additive and multiplicative semantic effects. Although the multiplication-based matrix-space framework has been shown to be a theoretically elegant way to model composition (Rudolph and Giesbrecht, 2010), training such models has to be done carefully: the optimization is nonconvex and requires a good initial starting point. This paper presents the first such algorithm for learning a matrix-space model for semantic composition. In the context of the phrase-level sentiment analysis task, our experimental results show statistically significant improvements in performance over a bag-of-words model.
4 0.51913172 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models
Author: Deepak Agarwal ; Bee-Chung Chen ; Bo Pang
Abstract: In recent years, the amount of user-generated opinionated texts (e.g., reviews, user comments) continues to grow at a rapid speed: featured news stories on a major event easily attract thousands of user comments on a popular online News service. How to consume subjective information of this volume becomes an interesting and important research question. In contrast to previous work on review analysis that tried to filter or summarize information for a generic average user, we explore a different direction of enabling personalized recommendation of such information. For each user, our task is to rank the comments associated with a given article according to personalized user preference (i.e., whether the user is likely to like or dislike the comment). To this end, we propose a factor model that incorporates rater-comment and rater-author interactions simultaneously in a principled way. Our full model significantly outperforms strong baselines as well as related models that have been considered in previous work.
5 0.51683587 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: Context-dependent word similarity can be measured over multiple cross-cutting dimensions. For example, lung and breath are similar thematically, while authoritative and superficial occur in similar syntactic contexts, but share little semantic similarity. Both of these notions of similarity play a role in determining word meaning, and hence lexical semantic models must take them both into account. Towards this end, we develop a novel model, Multi-View Mixture (MVM), that represents words as multiple overlapping clusterings. MVM finds multiple data partitions based on different subsets of features, subject to the marginal constraint that feature subsets are distributed according to Latent Dirichlet Allocation. Intuitively, this constraint favors feature partitions that have coherent topical semantics. Furthermore, MVM uses soft feature assignment, hence the contribution of each data point to each clustering view is variable, isolating the impact of data only to views where they assign the most features. Through a series of experiments, we demonstrate the utility of MVM as an inductive bias for capturing relations between words that are intuitive to humans, outperforming related models such as Latent Dirichlet Allocation.
6 0.51127362 81 emnlp-2011-Learning General Connotation of Words using Graph-based Algorithms
7 0.51121628 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
8 0.49806342 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
9 0.48985809 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
10 0.48652014 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
11 0.48617396 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
12 0.48533753 128 emnlp-2011-Structured Relation Discovery using Generative Models
13 0.48512217 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models
14 0.48348221 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
15 0.48278594 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
16 0.48050624 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
17 0.48048413 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
18 0.47974974 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
19 0.47950166 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
20 0.47949851 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances