acl acl2013 acl2013-147 knowledge-graph by maker-knowledge-mining

147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

Source: pdf

Author: Jianfeng Si ; Arjun Mukherjee ; Bing Liu ; Qing Li ; Huayi Li ; Xiaotie Deng

Abstract: This paper proposes a technique to leverage topic based sentiments from Twitter to help predict the stock market. We first utilize a continuous Dirichlet Process Mixture model to learn the daily topic set. Then, for each topic we derive its sentiment according to its opinion words distribution to build a sentiment time series. We then regress the stock index and the Twitter sentiment time series to predict the market. Experiments on real-life S&P100 Index show that our approach is effective and performs better than existing state-of-the-art non-topic based methods.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 AIMS Lab, Department of Computer Science, Shanghai Jiaotong University, Shanghai, China. Abstract: This paper proposes a technique to leverage topic based sentiments from Twitter to help predict the stock market. [sent-9, score-0.779]

2 We first utilize a continuous Dirichlet Process Mixture model to learn the daily topic set. [sent-10, score-0.368]

3 Then, for each topic we derive its sentiment according to its opinion words distribution to build a sentiment time series. [sent-11, score-0.988]

4 We then regress the stock index and the Twitter sentiment time series to predict the market. [sent-12, score-1.113]

5 In this paper, we use them for the application of stock index time series analysis. [sent-18, score-0.724]

6 Here are some example tweets upon querying the keyword “$aapl” (which is the stock symbol for Apple Inc. [sent-19, score-0.533]

7 As shown, the retrieved tweets may talk about Apple’s products, Apple’s competition relationship with other companies, etc. [sent-34, score-0.196]

8 These messages are often related to people’s sentiments about Apple Inc. [sent-35, score-0.223]

9 , which can affect or reflect its stock trading since positive sentiments can impact sales and financial gains. [sent-36, score-0.587]

10 Naturally, this hints that topic based sentiment is a useful factor to consider for stock prediction as they reflect people’s sentiment on different topics in a certain time frame. [sent-37, score-1.465]

11 This paper focuses on daily one-day-ahead prediction of stock index based on the temporal characteristics of topics in Twitter in the recent past. [sent-38, score-0.827]

12 Specifically, we propose a non-parametric topic-based sentiment time series approach to analyzing the streaming Twitter data. [sent-39, score-0.574]

13 The key motivation here is that Twitter’s streaming messages reflect fresh sentiments of people which are likely to be correlated with stocks in a short time frame. [sent-40, score-0.447]

14 We also analyze the effect of training window size which best fits the temporal dynamics of stocks. [sent-41, score-0.178]

15 Here window size refers to the number of days of tweets used in model building. [sent-42, score-0.413]

16 Our final prediction model is built using vector autoregression (VAR). [sent-43, score-0.182]

17 To our knowledge, this is the first attempt to use non-parametric continuous topic based Twitter sentiments for stock prediction in an autoregressive framework. [sent-44, score-1.031]

18 1 Related Work Market Prediction and Social Media Stock market prediction has attracted a great deal of attention in the past. [sent-46, score-0.211]

19 , can be analyzed to extract public sentiments to help predict the market (Lavrenko et al. [sent-48, score-0.357]

20 (2011) used tweet based public mood to predict the movement of Dow Jones (page footer: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August 4-9 2013, page 24). [sent-51, score-0.222]

21 (2012) studied the relationship between Twitter activities and stock market under a graph based view. [sent-55, score-0.44]

22 (2011) introduced a hybrid approach for stock sentiment analysis based on companies’ news articles. [sent-57, score-0.62]

23 LDA can learn a predefined number of topics and has been widely applied in its extended forms in sentiment analysis and many other tasks (Mei et al. [sent-62, score-0.439]

24 , 2006), which can estimate the number of topics inherent in the data itself. [sent-70, score-0.156]

25 In this work, we employ topic based sentiment analysis using DPM on Twitter posts (or tweets). [sent-71, score-0.516]

26 First, we employ a DPM to estimate the number of topics in the streaming snapshot of tweets in each day. [sent-72, score-0.424]

27 Next, we build a sentiment time series based on the estimated topics of daily tweets. [sent-73, score-0.716]

28 Lastly, we regress the stock index and the sentiment time series in an autoregressive framework. [sent-74, score-1.18]

29 3 Model We now present our stock prediction framework. [sent-75, score-0.445]

30 Continuous DPM Model: Compared to edited articles, it is much harder to preset the number of topics to best fit continuous streaming Twitter data due to the large topic diversity in tweets. [sent-77, score-0.538]

31 Thus, we resort to a nonparametric approach: the Dirichlet Process Mixture (DPM) model, and let the model estimate the number of topics inherent in the data itself. [sent-78, score-0.156]

32 In our setting of DPM, the number of mixture components (topics) K is not fixed a priori but estimated from the tweets of each day. [sent-80, score-0.255]
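The idea that K is not fixed in advance can be illustrated with the Chinese restaurant process view of a Dirichlet process mixture: an item joins an existing cluster with prior probability proportional to the cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The sketch below shows only this prior term. It is not the paper's sampler, which also involves the word likelihood and previous-day priors; the function name and `alpha` are assumptions made for illustration.

```python
def crp_prior(topic_sizes, alpha):
    """Prior probability of joining each existing topic, or opening a new one.

    topic_sizes: number of tweets currently assigned to each topic.
    alpha: concentration parameter (assumed name); larger values favour new topics.
    Returns a list whose last entry is the probability of a new topic.
    """
    total = sum(topic_sizes) + alpha
    probs = [n / total for n in topic_sizes]
    probs.append(alpha / total)
    return probs


# Two existing topics holding 3 and 1 tweets, concentration 1.0.
probs = crp_prior([3, 1], alpha=1.0)
```

Because the last entry is never zero for a positive `alpha`, the number of topics can grow with the data instead of being preset.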

33 We note that neighboring days may share the same or closely related topics because some topics may last for a long period of time covering multiple days, while other topics may just last for a short period of time. [sent-82, score-0.79]

34 Given a set of timestamped tweets, the overall generative process should be dynamic as the topics evolve over time. [sent-83, score-0.194]

35 As shown, the tweets set is divided into daily based collections: the observed tweets, and the model parameters (latent topics) that generate these tweets. [sent-90, score-0.45]

36 However, for later days, besides the base measure, we make use of topics learned from previous days as priors. [sent-95, score-0.375]

37 This ensures smooth topic chains or links (details in §3. [sent-96, score-0.263]

38 For efficiency, we only consider topics of one previous day as priors. [sent-98, score-0.266]

39 Because a tweet has at most 140 characters, we assume that each tweet contains only one topic. [sent-103, score-0.152]

40 Figure 2: Linking the continuous neighboring priors. [sent-106, score-0.115]

41 Sample the topic assignment for each tweet. According to different situations with respect to a topic’s prior, for each tweet, the [sent-108, score-0.385]

42 conditional distribution for its topic assignment, given all other tweets’ topic assignments, can be summarized as follows: 1. [sent-109, score-0.233]

43 If the tweet takes one topic k from the previous day as its prior: [equation (4); symbols lost in extraction] 2. [sent-111, score-0.265]

44 [equation (5); symbols lost in extraction] If k takes a previous day's topic as its prior: [equation (6); symbols lost in extraction] Notations in the above equations are listed as follows: the number of topics learned in day t-1. [sent-115, score-0.566]

45 The document length of the tweet; the term frequency of a word in the tweet; the probability of a word in previous day’s topic k. [sent-117, score-0.233]

46 The number of tweets assigned to topic k excluding the current one; the term frequency of a word in topic k, . [sent-118, score-0.662]

47 The marginalized sum of all words in topic k, with the statistic from the current tweet excluded. [sent-122, score-0.323]

48 Similarly, the posteriors on the topic word distributions are given according to their prior situations as follows: if topic k takes the base prior: [equation (7); symbols lost in extraction] where the two quantities are the frequency of a word in topic k and the marginalized sum over all words. [sent-123, score-0.646]

49 [equation (8); symbols lost in extraction] where the previous day's topic serves as the topic prior. Finally, for each day we estimate the topic weights as follows: $\pi_k = n_k / \sum_{k'} n_{k'}$, where $n_k$ is the number of tweets in topic k. [sent-125, score-1.075]
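The daily topic weights described above (each topic's share of the day's tweets) can be sketched as a simple normalization. This is a minimal illustration; the function name and the toy topic labels are hypothetical, and the counts are made up for the example.

```python
def topic_weights(tweet_counts):
    """Normalize per-topic tweet counts for one day into weights summing to 1.

    tweet_counts: dict mapping a topic id to the number of tweets assigned to it.
    """
    total = sum(tweet_counts.values())
    if total == 0:
        # No tweets that day: return zero weights rather than dividing by zero.
        return {topic: 0.0 for topic in tweet_counts}
    return {topic: count / total for topic, count in tweet_counts.items()}


# Toy counts for three topics on one day (illustrative only).
weights = topic_weights({"earnings": 6, "products": 3, "lawsuit": 1})
```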

50 2 Topic-based Sentiment Time Series Based on an opinion lexicon (a list of positive and negative opinion words, e. [sent-127, score-0.282]

51 , good and bad), each opinion word is assigned a polarity label of “+1” if it is positive and “-1” if negative. [sent-129, score-0.124]

52 We split each tweet’s text into an opinion part and a non-opinion part. [sent-130, score-0.124]

53 Only non-opinion words in tweets are used for Gibbs sampling. [sent-131, score-0.196]

54 (the non-opinion words space). The opinion words of a tweet share the same topic assignment as the tweet itself. [sent-133, score-0.357]
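The split into opinion and non-opinion words can be sketched as a lexicon lookup. The tiny lexicon below is a stand-in for illustration only; the paper uses the opinion lexicon of Hu and Liu (2004), which is not reproduced here, and the function name is an assumption.

```python
# Toy polarity lexicon (assumption): +1 for positive, -1 for negative words.
TOY_POLARITY = {"good": 1, "great": 1, "bad": -1, "weak": -1}


def split_tweet(tokens, polarity=TOY_POLARITY):
    """Separate a tokenized tweet into opinion words and non-opinion words.

    Only the non-opinion words would be passed to topic inference; the opinion
    words inherit whatever topic the tweet is assigned.
    """
    opinion = [w for w in tokens if w in polarity]
    non_opinion = [w for w in tokens if w not in polarity]
    return opinion, non_opinion


opinion, non_opinion = split_tweet(
    "apple earnings look good but iphone sales weak".split()
)
```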

55 Then, we compute the posterior on opinion word probability for each topic analogously to equations (7) and (8). [sent-134, score-0.392]

56 Finally, we define the topic based sentiment score $S(t, k)$ of topic k in day t as a weighted linear combination of the opinion polarity labels: $S(t, k) = \sum_{w} p(w \mid t, k)\, l(w)$ (10), where $p(w \mid t, k)$ is the posterior probability of opinion word w in topic k on day t and $l(w)$ is its polarity label. According to the generative process of cDPM, topics between neighboring days are linked if a topic k takes another topic as its prior. [sent-135, score-1.762]
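The weighted linear combination in the sentiment score can be sketched as a probability-weighted sum of polarity labels. This is a minimal illustration under the assumption that the per-topic opinion word probabilities are already available; the function name and the toy numbers are hypothetical.

```python
def topic_sentiment(opinion_word_probs, polarity):
    """Sentiment score of one topic on one day.

    opinion_word_probs: dict mapping an opinion word to its probability
        within the topic (assumed to be estimated elsewhere).
    polarity: dict mapping an opinion word to +1 or -1.
    """
    return sum(
        prob * polarity[word]
        for word, prob in opinion_word_probs.items()
        if word in polarity
    )


# Toy distribution over three opinion words (illustrative only).
score = topic_sentiment(
    {"good": 0.5, "great": 0.25, "bad": 0.25},
    {"good": 1, "great": 1, "bad": -1},
)
```

When the word probabilities sum to one, the score stays between -1 and +1, so scores from different days are directly comparable.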

57 Then, the sentiment scores for each topic series form the sentiment time series { S(t-1, k), S(t, k), S(t+1, k), . [sent-138, score-1.172]

58 In this example, two continuous topic chains or links (via linked priors) exist for the time interval [t-1, t+1]: one in light grey color, and the other in black. [sent-143, score-0.405]

59 As shown, there may be more than one topic chain/link (5-20 in our experiments) for a certain time interval (see footnote 1). [sent-144, score-0.298]

60 Thus, we sort multiple sentiment series according to the accumulated weights of their topics over each link. [sent-145, score-0.593]

61 In our experiments, we try the top five series and use the one that gives the best result, which is mostly the first (top ranked) series with a few exceptions of the second series. [sent-146, score-0.308]
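Ranking the sentiment series by accumulated topic weight and keeping the top few can be sketched as below. The data layout (a chain as a list of per-day `(weight, sentiment)` pairs), the function name, and the toy values are assumptions made for illustration.

```python
def rank_chains(chains, top_n=5):
    """Return chain ids sorted by accumulated topic weight, largest first.

    chains: dict mapping a chain id to a list of (weight, sentiment) pairs,
        one pair per day covered by the chain.
    top_n: how many of the highest-ranked chains to keep.
    """
    totals = {
        chain_id: sum(weight for weight, _ in days)
        for chain_id, days in chains.items()
    }
    return sorted(totals, key=totals.get, reverse=True)[:top_n]


# Three toy chains over two days (illustrative values only).
toy_chains = {
    "A": [(0.2, 0.1), (0.3, -0.2)],
    "B": [(0.5, 0.4), (0.4, 0.3)],
    "C": [(0.1, 0.0), (0.1, 0.2)],
}
order = rank_chains(toy_chains)
```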

62 The topics mostly focus on hot keywords like: news, stocknews, earning, report, which stimulate active discussions on the social media platform. [sent-147, score-0.255]

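63 The first order (time steps of historical information to use: lag = 1) VAR model for two time series $\{x_t\}$ and $\{y_t\}$ is given by: $x_t = \nu_{11} x_{t-1} + \nu_{12} y_{t-1} + \epsilon_{x,t}$, $y_t = \nu_{21} x_{t-1} + \nu_{22} y_{t-1} + \epsilon_{y,t}$ (11) where $\{\epsilon\}$ are the white noises and $\{\nu\}$ are model parameters. [sent-150, score-0.465]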
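A first-order VAR for two series can be fitted by ordinary least squares, regressing each series on the lagged values of both. The sketch below is a minimal pure-Python version under stated assumptions: no intercept term, lag fixed at 1, and hypothetical function names. It is not the paper's implementation, and a statistics library would normally be used instead.

```python
def fit_var1(x, y):
    """Least-squares fit of a lag-1 VAR without intercept for two series.

    Returns [(a11, a12), (a21, a22)] such that
    x[t] ~ a11 * x[t-1] + a12 * y[t-1] and y[t] ~ a21 * x[t-1] + a22 * y[t-1].
    """
    px, py = x[:-1], y[:-1]  # lagged regressors
    sxx = sum(a * a for a in px)
    sxy = sum(a * b for a, b in zip(px, py))
    syy = sum(b * b for b in py)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("lagged series are collinear; cannot fit")
    coeffs = []
    for target in (x[1:], y[1:]):
        sxz = sum(a * z for a, z in zip(px, target))
        syz = sum(b * z for b, z in zip(py, target))
        # Solve the 2x2 normal equations with the closed-form inverse.
        coeffs.append(((syy * sxz - sxy * syz) / det,
                       (sxx * syz - sxy * sxz) / det))
    return coeffs


def predict_next(coeffs, x_last, y_last):
    """One-step-ahead forecast for both series."""
    return tuple(a * x_last + b * y_last for a, b in coeffs)


# Noise-free toy data generated from known coefficients, so the fit can be checked.
x, y = [1.0], [0.0]
for _ in range(6):
    x_prev, y_prev = x[-1], y[-1]
    x.append(0.5 * x_prev + 0.25 * y_prev)
    y.append(-0.25 * x_prev + 0.5 * y_prev)

coeffs = fit_var1(x, y)
forecast = predict_next(coeffs, x[-1], y[-1])
```

On real data one series would be the stock index and the other a sentiment series, and the fitted coefficients would only approximate the relationship because of noise.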

64 Instead of training in one period and predicting over another disjoint period, we use a moving training and prediction process under sliding windows (i. [sent-152, score-0.21]

65 , train in [t, t + w] and predict index on t + w + 1) with two main considerations: due to the dynamic and random nature of both the stock market and public sentiments, we are more interested in their short term relationship. [sent-154, score-0.723]

66 Figure 3 details the algorithm for stock index prediction. [sent-156, score-0.505]

67 The accuracy is computed based on the index up and down dynamics: the matching function returns True only if our prediction and the actual value share the same index up or down direction. [sent-157, score-0.365]
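The sliding-window evaluation (train on days t to t + w, predict day t + w + 1, count a hit when the predicted and actual directions agree) can be sketched as follows. The predictor is passed in as a function so that any model can be plugged in. The function names, the treatment of an unchanged index as "not up", and the toy momentum predictor are assumptions for illustration.

```python
def same_direction(predicted, actual, previous):
    """True if prediction and actual value move the same way from the previous day.

    An unchanged value is treated as "not up" (an assumption; ties are not
    discussed in the source).
    """
    return (predicted > previous) == (actual > previous)


def sliding_window_accuracy(index, w, predict):
    """Directional accuracy of one-day-ahead forecasts under a sliding window.

    index: list of daily index values.
    w: training window size; each window covers days t .. t + w.
    predict: function taking the training window and returning a forecast
        for the following day.
    """
    hits = total = 0
    for t in range(len(index) - w - 1):
        train = index[t:t + w + 1]
        actual = index[t + w + 1]
        forecast = predict(train)
        if same_direction(forecast, actual, train[-1]):
            hits += 1
        total += 1
    return hits / total if total else 0.0


def momentum(train):
    """Toy predictor: assume yesterday's change repeats."""
    return 2 * train[-1] - train[-2]


accuracy = sliding_window_accuracy([10, 11, 12, 11, 12, 13], w=1, predict=momentum)
```

Here the toy predictor is right on two of the four windows, so `accuracy` is 0.5. Repeating the call for several values of `w` gives the per-window-size comparison the text describes.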

68 Footnote 1: The actual topic priors for topic links are governed by the four cases of the Gibbs Sampler. [sent-158, score-0.502]

69 Parameter: w: training window size; lag: the order of VAR; Input: the dates of the time series; the sentiment time series; the index time series; Output: prediction accuracy. [sent-162, score-0.884]

70 Return Accuracy; Figure 3: Prediction algorithm and accuracy. 4 Dataset: We collected the tweets via Twitter’s REST API for streaming data, using symbols of the Standard & Poor's 100 stocks (S&P100) as keywords. [sent-172, score-0.384]

71 (2011) used the mood dimension Calm, together with the index value itself, to predict the Dow Jones Industrial Average. [sent-180, score-0.242]

72 We identified and labeled a Calm lexicon (words like “anxious”, “shocked”, “settled” and “dormant”) using the opinion lexicon of Hu and Liu (2004) and computed the sentiment score using the method of Bollen et al. [sent-183, score-0.441]

73 Our pilot experiments showed that using the full opinion lexicon of Hu and Liu (2004) actually performs consistently better than the Calm lexicon. [sent-185, score-0.158]

74 Hence, we use the entire opinion lexicon in Hu and Liu (2004). [sent-186, score-0.158]

75 The first (Index) uses only the index itself, which reduces the VAR model to the univariate autoregressive model (AR), resulting in only one index time series in the algorithm of Figure 3. [sent-189, score-0.654]

76 Table 1: Average (best) accuracies over all training window sizes and different lags 1, 2, 3. [sent-199, score-0.193]

77 Index, Raw and cDPM averaged over all training window sizes. [sent-205, score-0.13]

78 , 2012) simply compute the sentiment score as ratio of pos/neg opinion words per day. [sent-208, score-0.435]

79 This generates a lexicon-based sentiment time series, which is then combined with the index value series to give us the second baseline Raw. [sent-209, score-0.67]
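The Raw baseline's daily score, a ratio of positive to negative opinion words, can be sketched as below. The add-one smoothing that avoids division by zero on days without negative words is an assumption of this sketch, as are the function name, the whitespace tokenization, and the toy lexicon and tweets.

```python
def daily_sentiment_ratio(tweets_by_day, polarity):
    """Ratio of positive to negative opinion words for each day.

    tweets_by_day: dict mapping a day label to a list of tweet strings.
    polarity: dict mapping an opinion word to +1 or -1.
    Add-one smoothing is applied to both counts (an assumption of this sketch).
    """
    ratios = {}
    for day, tweets in tweets_by_day.items():
        pos = neg = 0
        for tweet in tweets:
            for word in tweet.lower().split():
                label = polarity.get(word, 0)
                if label > 0:
                    pos += 1
                elif label < 0:
                    neg += 1
        ratios[day] = (pos + 1) / (neg + 1)
    return ratios


# Toy lexicon and tweets (illustrative only).
toy_polarity = {"good": 1, "great": 1, "bad": -1, "weak": -1}
ratios = daily_sentiment_ratio(
    {"day1": ["good good bad", "great"], "day2": ["bad weak"]},
    toy_polarity,
)
```

Unlike the topic based score, this ratio pools every opinion word of the day regardless of what the tweet is about.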

80 In summary, Index uses index only with the AR model while Raw uses index and opinion lexicon based time series. [sent-210, score-0.559]

81 Our cDPM uses index and the proposed topic based sentiment time series. [sent-211, score-0.749]

82 We experiment with different lag settings from 1-3 days. [sent-213, score-0.246]

83 We also experiment with different training window sizes, ranging from 15 - 30 days, and compute the prediction accuracy for each window size. [sent-214, score-0.397]

84 Table 1 shows the respective average and best accuracies over all window sizes for each lag and Table 2 summarizes the pairwise performance improvements of averaged scores over all training window sizes. [sent-215, score-0.569]

85 Figure 4 shows the detailed accuracy comparison for lag 1 and lag 3. [sent-216, score-0.521]

86 Topic-based public sentiments from tweets can improve stock prediction over a simple sentiment ratio, which may suffer from backchannel noise and lack of focus on prevailing topics. [sent-218, score-1.174]

87 cDPM outperforms all others in terms of both the best accuracy (lag 3) and the average accuracies for different window sizes. [sent-222, score-0.189]

88 This is due to the fact that cDPM learns the topic based sentiments instead of just using the opinion words’ ratio like Raw, and in a short time period, some topics are more correlated with the stock market [sent-226, score-1.12]

89 Figure 4: Comparison of prediction accuracy of up/down stock index on S&P 100 index for different training window sizes (panels: Comparison on Lag 1 and Comparison on Lag 3; x-axis: training window size; y-axis: accuracy; series: Index, Raw, cDPM). [sent-228, score-0.94]

90 Our proposed sentiment time series using cDPM can capture this phenomenon and also help reduce backchannel noise of raw sentiments. [sent-230, score-0.553]

91 On average, cDPM gets the best performance for training window sizes within [21, 22], and the best prediction accuracy is 68. [sent-232, score-0.3]

92 6 Conclusions Predicting the stock market is an important but difficult problem. [sent-234, score-0.44]

93 This paper showed that Twitter’s topic based sentiment can improve the prediction accuracy beyond existing non-topic based approaches. [sent-235, score-0.545]

94 Specifically, a non-parametric topic-based sentiment time series approach was proposed for the Twitter stream. [sent-236, score-0.502]

95 For prediction, vector autoregression was used to regress S&P100 index with the learned sentiment time series. [sent-237, score-0.664]

96 Besides the short term dynamics based prediction, we believe that the proposed method can be extended for long range dependency analysis of Twitter sentiments and stocks, which can render deep insights into the complex phenomenon of stock market. [sent-238, score-0.562]

97 Aspect and sentiment unification model for online review analysis. [sent-308, score-0.283]

98 Topic sentiment mixture: modeling facets and opinions in weblogs. [sent-339, score-0.283]

99 Markov chain sampling methods for dirichlet process mixture models. [sent-354, score-0.127]

100 Textual analysis of stock market prediction using breaking financial news. [sent-375, score-0.585]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stock', 0.337), ('sentiment', 0.283), ('dpm', 0.271), ('lag', 0.246), ('topic', 0.233), ('cdpm', 0.222), ('tweets', 0.196), ('twitter', 0.196), ('var', 0.181), ('sentiments', 0.177), ('index', 0.168), ('topics', 0.156), ('series', 0.154), ('window', 0.13), ('opinion', 0.124), ('day', 0.11), ('prediction', 0.108), ('market', 0.103), ('aapl', 0.099), ('autoregressive', 0.099), ('bollen', 0.09), ('days', 0.087), ('calm', 0.087), ('stocks', 0.087), ('continuous', 0.077), ('tweet', 0.076), ('autoregression', 0.074), ('regress', 0.074), ('streaming', 0.072), ('prior', 0.07), ('dirichlet', 0.068), ('period', 0.066), ('time', 0.065), ('apple', 0.061), ('ruiz', 0.06), ('mixture', 0.059), ('daily', 0.058), ('blei', 0.055), ('oh', 0.053), ('neal', 0.052), ('raw', 0.051), ('shanghai', 0.05), ('social', 0.05), ('asur', 0.049), ('moghaddam', 0.049), ('rightnum', 0.049), ('schumaker', 0.049), ('media', 0.049), ('dynamics', 0.048), ('messages', 0.046), ('base', 0.045), ('public', 0.045), ('sauper', 0.044), ('mood', 0.042), ('dow', 0.04), ('ester', 0.04), ('chua', 0.04), ('liu', 0.04), ('hu', 0.039), ('dynamic', 0.038), ('neighboring', 0.038), ('symmetric', 0.038), ('bishop', 0.038), ('branavan', 0.038), ('financial', 0.037), ('priors', 0.036), ('sales', 0.036), ('sliding', 0.036), ('feldman', 0.036), ('lda', 0.036), ('mining', 0.035), ('equations', 0.035), ('lexicon', 0.034), ('sizes', 0.033), ('marginalized', 0.033), ('lavrenko', 0.033), ('sun', 0.032), ('mukherjee', 0.032), ('predict', 0.032), ('takes', 0.032), ('industrial', 0.031), ('aspect', 0.03), ('chains', 0.03), ('accuracies', 0.03), ('mei', 0.029), ('companies', 0.029), ('statistic', 0.029), ('accuracy', 0.029), ('denotes', 0.028), ('gibbs', 0.028), ('acm', 0.028), ('ratio', 0.028), ('teh', 0.028), ('brody', 0.027), ('jo', 0.027), ('movement', 0.027), ('illinois', 0.027), ('zhao', 0.027), ('gmai', 0.027), ('elhadad', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

Author: Jianfeng Si ; Arjun Mukherjee ; Bing Liu ; Qing Li ; Huayi Li ; Xiaotie Deng

2 0.32628867 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts

Author: Alexandra Balahur ; Hristo Tanev

Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no longer remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.

3 0.2217544 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

Author: Angeliki Lazaridou ; Ivan Titov ; Caroline Sporleder

Abstract: We propose a joint model for unsupervised induction of sentiment, aspect and discourse information and show that by incorporating a notion of latent discourse relations in the model, we improve the prediction accuracy for aspect and sentiment polarity on the sub-sentential level. We deviate from the traditional view of discourse, as we induce types of discourse relations and associated discourse cues relevant to the considered opinion analysis task; consequently, the induced discourse relations play the role of opinion and aspect shifters. The quantitative analysis that we conducted indicated that the integration of a discourse model increased the prediction accuracy results with respect to the discourse-agnostic approach and the qualitative analysis suggests that the induced representations encode a meaningful discourse structure.

4 0.20358799 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

Abstract: We study subjective language in social media and create Twitter-specific lexicons via bootstrapping sentiment-bearing terms from multilingual Twitter streams. Starting with a domain-independent, high-precision sentiment lexicon and a large pool of unlabeled data, we bootstrap Twitter-specific sentiment lexicons, using a small amount of labeled data to guide the process. Our experiments on English, Spanish and Russian show that the resulting lexicons are effective for sentiment classification for many underexplored languages in social media.

5 0.19258881 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

Author: Simone Paolo Ponzetto ; Andrea Zielinski

Abstract: unkown-abstract

6 0.18576762 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words

7 0.18509044 310 acl-2013-Semantic Frames to Predict Stock Price Movement

8 0.18466194 121 acl-2013-Discovering User Interactions in Ideological Discussions

9 0.17118113 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

10 0.17078042 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

11 0.15624022 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

12 0.15412909 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

13 0.15160932 318 acl-2013-Sentiment Relevance

14 0.1472176 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework

15 0.14646249 240 acl-2013-Microblogs as Parallel Corpora

16 0.13440658 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

17 0.13382749 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

18 0.13167074 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

19 0.12747343 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

20 0.12549859 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, 0.336), (2, 0.01), (3, 0.246), (4, 0.086), (5, -0.016), (6, 0.152), (7, 0.074), (8, -0.008), (9, -0.138), (10, 0.024), (11, 0.076), (12, 0.149), (13, 0.003), (14, 0.093), (15, -0.048), (16, 0.01), (17, 0.008), (18, 0.009), (19, -0.048), (20, 0.031), (21, 0.035), (22, -0.033), (23, -0.058), (24, 0.0), (25, -0.013), (26, -0.023), (27, -0.046), (28, 0.022), (29, 0.02), (30, 0.056), (31, 0.039), (32, 0.005), (33, -0.03), (34, -0.038), (35, -0.025), (36, 0.007), (37, -0.028), (38, 0.005), (39, -0.016), (40, -0.037), (41, -0.001), (42, -0.042), (43, -0.029), (44, -0.057), (45, 0.023), (46, -0.004), (47, 0.018), (48, -0.024), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95726705 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

Author: Jianfeng Si ; Arjun Mukherjee ; Bing Liu ; Qing Li ; Huayi Li ; Xiaotie Deng

2 0.83971989 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts

Author: Alexandra Balahur ; Hristo Tanev

3 0.71844327 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab

Abstract: Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus completes the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.

4 0.68230367 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

5 0.67408818 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

Author: Simone Paolo Ponzetto ; Andrea Zielinski

Abstract: unkown-abstract

6 0.66176492 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

7 0.64051348 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and User's Behavior in Social Media

8 0.6079123 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context

9 0.59486777 33 acl-2013-A user-centric model of voting intention from Social Media

10 0.59367019 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

11 0.58267647 49 acl-2013-An annotated corpus of quoted opinions in news articles

12 0.5775314 42 acl-2013-Aid is Out There: Looking for Help from Tweets during a Large Scale Disaster

13 0.56488663 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words

14 0.56251538 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting

15 0.5583154 121 acl-2013-Discovering User Interactions in Ideological Discussions

16 0.54880792 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

17 0.53488672 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

18 0.532736 318 acl-2013-Sentiment Relevance

19 0.52958715 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

20 0.52841598 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.053), (4, 0.015), (6, 0.042), (11, 0.041), (24, 0.07), (26, 0.126), (35, 0.086), (38, 0.019), (42, 0.048), (48, 0.043), (70, 0.049), (76, 0.22), (88, 0.026), (90, 0.016), (95, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82801497 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

Author: Jianfeng Si ; Arjun Mukherjee ; Bing Liu ; Qing Li ; Huayi Li ; Xiaotie Deng

2 0.79492915 57 acl-2013-Arguments and Modifiers from the Learner's Perspective

Author: Leon Bergen ; Edward Gibson ; Timothy J. O'Donnell

Abstract: We present a model for inducing sentential argument structure, which distinguishes arguments from optional modifiers. We use this model to study whether representing an argument/modifier distinction helps in learning argument structure, and whether a linguistically-natural argument/modifier distinction can be induced from distributional data alone. Our results provide evidence for both hypotheses.

3 0.74924785 312 acl-2013-Semantic Parsing as Machine Translation

Author: Jacob Andreas ; Andreas Vlachos ; Stephen Clark

Abstract: Semantic parsing is the problem of deriving a structured meaning representation from a natural language utterance. Here we approach it as a straightforward machine translation task, and demonstrate that standard machine translation components can be adapted into a semantic parser. In experiments on the multilingual GeoQuery corpus we find that our parser is competitive with the state of the art, and in some cases achieves higher accuracy than recently proposed purpose-built systems. These results support the use of machine translation methods as an informative baseline in semantic parsing evaluations, and suggest that research in semantic parsing could benefit from advances in machine translation.

4 0.65157038 257 acl-2013-Natural Language Models for Predicting Programming Comments

Author: Dana Movshovitz-Attias ; William W. Cohen

Abstract: Statistical language models have successfully been used to describe and analyze natural language documents. Recent work applying language models to programming languages is focused on the task of predicting code, while mainly ignoring the prediction of programmer comments. In this work, we predict comments from JAVA source files of open source projects, using topic models and n-grams, and we analyze the performance of the models given varying amounts of background data on the project being predicted. We evaluate models on their comment-completion capability in a setting similar to code-completion tools built into standard code editors, and show that using a comment completion tool can save up to 47% of the comment typing. 1 Introduction and Related Work Statistical language models have traditionally been used to describe and analyze natural language documents. Recently, software engineering researchers have adopted the use of language models for modeling software code. Hindle et al. (2012) observe that, as code is created by humans it is likely to be repetitive and predictable, similar to natural language. NLP models have thus been used for a variety of software development tasks such as code token completion (Han et al., 2009; Jacob and Tairas, 2010), analysis of names in code (Lawrie et al., 2006; Binkley et al., 2011) and mining software repositories (Gabel and Su, 2008). An important part of software programming and maintenance lies in documentation, which may come in the form of tutorials describing the code, or inline comments provided by the programmer. The documentation provides a high level description of the task performed by the code, and may include examples of use-cases for specific code segments or identifiers such as classes, methods and variables.
Well documented code is easier to read and maintain in the long run, but writing comments is a laborious task that is often overlooked or at least postponed by many programmers. Code commenting not only provides a summarization of the conceptual idea behind the code (Sridhara et al., 2010), but can also be viewed as a form of document expansion where the comment contains significant terms relevant to the described code. Accurately predicted comment words can therefore be used for a variety of linguistic uses including improved search over code bases using natural language queries, code categorization, and locating parts of the code that are relevant to a specific topic or idea (Tseng and Juang, 2003; Wan et al., 2007; Kumar and Carterette, 2013; Shepherd et al., 2007; Rastkar et al., 2011). A related and well studied NLP task is that of predicting natural language caption and commentary for images and videos (Blei and Jordan, 2003; Feng and Lapata, 2010; Feng and Lapata, 2013; Wu and Li, 2011).

In this work, our goal is to apply statistical language models for predicting class comments. We show that n-gram models are extremely successful in this task, and can lead to a saving of up to 47% in comment typing. This is expected, as n-grams have been shown to be a strong model for language and speech prediction that is hard to improve upon (Rosenfeld, 2000). In some cases however, for example in a document expansion task, we wish to extract important terms relevant to the code regardless of local syntactic dependencies. We hence also evaluate the use of LDA (Blei et al., 2003) and link-LDA (Erosheva et al., 2004) topic models, which are more relevant for the term extraction scenario. We find that the topic model performance can be improved by distinguishing code and text tokens in the code.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 35–40, Sofia, Bulgaria, August 4-9 2013. ©2013 Association for Computational Linguistics

2 Method

2.1 Models

We train n-gram models (n = 1, 2, 3) over source code documents containing sequences of combined code and text tokens from multiple training datasets (described below). We use the Berkeley Language Model package (Pauls and Klein, 2011) with absolute discounting (Kneser-Ney smoothing; Kneser and Ney, 1995), which includes a backoff strategy to lower-order n-grams.

Next, we use LDA topic models (Blei et al., 2003) trained on the same data, with 1, 5, 10 and 20 topics. The joint distribution of a topic mixture $\theta$ and a set of $N$ topics $z$, for a single source code document with $N$ observed word tokens, $d = \{w_i\}_{i=1}^{N}$, given the Dirichlet parameters $\alpha$ and $\beta$, is therefore

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{w} p(z \mid \theta)\, p(w \mid z, \beta) \tag{1}$$

Under the models described so far, there is no distinction between text and code tokens. Finally, we consider documents as having a mixed membership of two entity types, code and text tokens, $d = (\{w_i^{code}\}_{i=1}^{C_n}, \{w_i^{text}\}_{i=1}^{T_n})$, where the text words are tokens from comments and string literals, and the code words include the programming language syntax tokens (e.g., public, private, for, etc.) and all identifiers. In this case, we train link-LDA models (Erosheva et al., 2004) with 1, 5, 10 and 20 topics. Under the link-LDA model, the mixed-membership joint distribution of a topic mixture, words and topics is then

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \cdot \prod_{w^{text}} p(z^{text} \mid \theta)\, p(w^{text} \mid z^{text}, \beta) \cdot \prod_{w^{code}} p(z^{code} \mid \theta)\, p(w^{code} \mid z^{code}, \beta) \tag{2}$$

where $\theta$ is the joint topic distribution, $w$ is the set of observed document words, $z^{text}$ is a topic associated with a text word, and $z^{code}$ a topic associated with a code word. The LDA and link-LDA models use Gibbs sampling (Griffiths and Steyvers, 2004) for topic inference, based on the implementation of Balasubramanyan and Cohen (2011) with single or multiple entities per document, respectively.
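The backoff idea behind the n-gram models of Section 2.1 can be illustrated with a minimal sketch. It is not the Berkeley Language Model implementation and not full Kneser-Ney smoothing: it assumes a bigram model with interpolated absolute discounting that falls back to a maximum-likelihood unigram distribution. The class name, the discount value of 0.75 and the toy training data are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


class AbsoluteDiscountBigram:
    """Toy bigram model with absolute discounting and unigram backoff.

    Simplified illustration only; names and the default discount are
    assumptions, not taken from the paper or from any library.
    """

    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for tokens in sentences:
            padded = ["<s>"] + list(tokens)
            # The start symbol is a context only, never a predicted word.
            self.unigrams.update(padded[1:])
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[prev][cur] += 1
        self.total = sum(self.unigrams.values())

    def p_unigram(self, word):
        return self.unigrams[word] / self.total

    def prob(self, word, prev):
        followers = self.bigrams.get(prev)
        if not followers:
            # Unseen context: back off entirely to the unigram estimate.
            return self.p_unigram(word)
        context_count = sum(followers.values())
        # Subtract a fixed discount from every seen bigram count ...
        discounted = max(followers[word] - self.d, 0.0) / context_count
        # ... and hand the freed probability mass to the unigram model.
        backoff_mass = self.d * len(followers) / context_count
        return discounted + backoff_mass * self.p_unigram(word)


corpus = [
    ["train", "a", "named", "entity", "extractor"],
    ["train", "a", "classifier"],
]
model = AbsoluteDiscountBigram(corpus)
vocab = list(model.unigrams)
```

With this construction the probabilities over the vocabulary still sum to one for any seen context, while words never observed after that context (for example "extractor" after "a") receive a small non-zero probability from the lower-order model.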
2.2 Testing Methodology

Our goal is to predict the tokens of the JAVA class comment (the one preceding the class definition) in each of the test files. Each of the models described above assigns a probability to the next comment token. In the case of n-grams, the probability of a token word $w_i$ is given by considering previous words, $p(w_i \mid w_{i-1}, \ldots, w_0)$. This probability is estimated given the previous $n-1$ tokens as $p(w_i \mid w_{i-1}, \ldots, w_{i-(n-1)})$.

For the topic models, we separate the document tokens into the class definition, $w^r$, and the comment we wish to predict, $w^c$. The set of tokens of the class comment are all considered as text tokens. The rest of the tokens in the document are considered to be the class definition, and they may contain both code and text tokens (from string literals and other comments in the source file). We then compute the posterior probability of document topics by solving the following inference problem conditioned on the tokens $w^r$:

$$p(\theta, z^r \mid w^r, \alpha, \beta) = \frac{p(\theta, z^r, w^r \mid \alpha, \beta)}{p(w^r \mid \alpha, \beta)} \tag{3}$$

This gives us an estimate of the document distribution, $\theta$, with which we infer the probability of the comment tokens as

$$p(w^c \mid \theta, \beta) = \sum_{z} p(w^c \mid z, \beta)\, p(z \mid \theta) \tag{4}$$

Following Blei et al.
(2003), for the case of a single entity LDA, the inference problem from equation (3) can be solved by considering $p(\theta, z, w \mid \alpha, \beta)$, as in equation (1), and by taking the marginal distribution of the document tokens as a continuous mixture distribution for the set $w = w^r$, integrating over $\theta$ and summing over the set of topics $z$:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{w} \sum_{z} p(z \mid \theta)\, p(w \mid z, \beta) \right) d\theta \tag{5}$$

For the case of link-LDA, where the document is comprised of two entities, in our case code tokens and text tokens, we can consider the mixed-membership joint distribution $p(\theta, z, w \mid \alpha, \beta)$, as in equation (2), and similarly the marginal distribution $p(w \mid \alpha, \beta)$ over both code and text tokens from $w^r$. Since comment words in $w^c$ are all considered as text tokens, they are sampled using text topics, namely $z^{text}$, in equation (4).

3 Experimental Settings

3.1 Data and Training Methodology

We use source code from nine open source JAVA projects: Ant, Cassandra, Log4j, Maven, MinorThird, Batik, Lucene, Xalan and Xerces. For each project, we divide the source files into a training and testing dataset. Then, for each project in turn, we consider three main training scenarios, each with its own training dataset.

To emulate a scenario in which we are predicting comments in the middle of project development, we can use data (documented code) from the same project. In this case, we use the in-project training dataset (IN). Alternatively, if we train a comment prediction model at the beginning of the development, we need to use source files from other, possibly related projects. To analyze this scenario, for each of the projects above we train models using an out-of-project dataset (OUT) containing data from the other eight projects. Typically, source code files contain a greater amount of code versus comment text.
Since we are interested in predicting comments, we consider a third training data source which contains more English text as well as some code segments. We use data from the popular Q&A website StackOverflow (SO), where users ask and answer technical questions about software development, tools, algorithms, etc. We downloaded a dataset of all actions performed on the site from its launch in August 2008 until August 2012. The data includes 3,453,742 questions and 6,858,133 answers posted by 1,295,620 users. We used only posts that are tagged as JAVA related questions and answers.

All the models for each project are then tested on the testing set of that project. We report results averaged over all projects in Table 1.

Source files were tokenized using the Eclipse JDT compiler tools, separating code tokens and identifiers. Identifier names (of classes, methods and variables) were further tokenized by camel case notation (e.g., 'minMargin' was converted to 'min margin'). Non alpha-numeric tokens (e.g., dot, semicolon) were discarded from the code, as well as numeric and single character literals. Text from comments or any string literals within the code was further tokenized with the Mallet statistical natural language processing package (McCallum, 2002). Posts from SO were parsed using the Apache Tika toolkit¹ and then tokenized with the Mallet package. We considered as raw code tokens anything marked up as code (as indicated by the SO users who wrote the post).

3.2 Evaluation

Since our models are trained using various data sources, the vocabularies used by each of them are different, making the comment likelihood given by each model incomparable due to different sets of out-of-vocabulary tokens. We thus evaluate models using a character saving metric which aims at quantifying the percentage of characters that can be saved by using the model in a word-completion setting, similar to standard code completion tools built into code editors.
For a comment word with n characters, w = w1, . . . , wn, we predict the two most likely words given each model, filtered by the first 0, . . . , n characters of w. Let k be the minimal k_i for which w is in the top two predicted word tokens, where tokens are filtered by the first k_i characters. Then, the number of saved characters for w is n − k. In Table 1 we report the average percentage of saved characters per comment using each of the above models. The final results are also averaged over the nine input projects. As an example, in the predicted comment shown in Table 2, taken from the project Minor-Third, the token entity is the most likely token according to the model SO trigram, out of tokens starting with the prefix 'en'. The saved characters in this case are 'tity'.

4 Results

Table 1 displays the average percentage of characters saved per class comment using each of the models. Models trained on in-project data (IN) perform significantly better than those trained on another data source, regardless of the model type, with an average saving of 47.1% of characters using a trigram model. This is expected, as files from the same project are likely to contain similar comments, and identifier names that appear in the comment of one class may appear in the code of another class in the same project. Clearly, in-project data should be used when available, as it improves comment prediction, leading to an average increase of between 6 percentage points for the worst model (26.6 for OUT unigram versus 33.05 for IN) and 14 percentage points for the best (32.96 for OUT trigram versus 47.1 for IN).
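The character-saving metric defined in Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' evaluation code: the model is replaced by a fixed list of candidate words sorted from most to least likely (a unigram-style ranking that ignores context), and the function names and toy vocabulary are hypothetical.

```python
def saved_characters(word, ranked_vocab, top_k=2):
    """Return n - k, where k is the shortest typed prefix length for which
    `word` appears among the top_k candidates sharing that prefix.

    `ranked_vocab` is assumed to be sorted from most to least likely.
    """
    for k in range(len(word) + 1):
        prefix = word[:k]
        candidates = [w for w in ranked_vocab if w.startswith(prefix)][:top_k]
        if word in candidates:
            return len(word) - k
    # The word is never suggested (e.g. out of vocabulary): nothing saved.
    return 0


def percent_saved(comment_words, ranked_vocab, top_k=2):
    """Percentage of a comment's characters saved by word completion."""
    total = sum(len(w) for w in comment_words)
    if total == 0:
        return 0.0
    saved = sum(saved_characters(w, ranked_vocab, top_k) for w in comment_words)
    return 100.0 * saved / total


# Hypothetical ranking, most likely word first.
ranked = ["the", "extractor", "entry", "entity", "a", "train", "end"]
```

With this toy ranking, "entity" first enters the top two suggestions once the prefix 'en' has been typed, so the four characters 'tity' are saved, mirroring the example given for Table 2. A context-dependent model such as a trigram would supply a different ranking for each position in the comment.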
¹ http://tika.apache.org/

Model                  IN             OUT            SO
n-gram, n = 1          33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
n-gram, n = 2          43.27 (5.79)   31.52 (4.17)   33.29 (4.40)
n-gram, n = 3          47.1 (6.87)    32.96 (4.33)   34.56 (4.78)
LDA, 20 topics         34.20 (3.63)   26.79 (3.26)   27.25 (3.67)
LDA, 10 topics         33.93 (3.67)   26.8 (3.36)    27.22 (3.44)
LDA, 5 topics          33.63 (3.67)   26.86 (3.44)   27.34 (3.55)
LDA, 1 topic           33.05 (3.62)   26.6 (3.37)    27.8 (3.51)
link-LDA, 20 topics    35.76 (3.95)   28.03 (3.60)   28.08 (3.48)
link-LDA, 10 topics    35.81 (4.12)   28 (3.56)      28.12 (3.58)
link-LDA, 5 topics     35.37 (3.98)   28 (3.67)      27.94 (3.56)
link-LDA, 1 topic      34.59 (3.92)   27.82 (3.62)   27.9 (3.45)

Table 1: Average percentage of characters saved per comment using n-gram, LDA and link-LDA models trained on three training sets: IN, OUT, and SO. The results are averaged over nine JAVA projects (with standard deviations in parentheses).

Model          Predicted comment
IN trigram     "Train a named-entity extractor"
IN link-LDA    "Train a named-entity extractor"
OUT trigram    "Train a named-entity extractor"
SO trigram     "Train a named-entity extractor"

Table 2: Sample comment from the Minor-Third project predicted using IN, OUT and SO based models. Saved characters are underlined in the original.

Of the out-of-project data sources, models using a greater amount of text (SO) mostly outperformed models based on more code (OUT). This increase in performance, however, comes at a cost of greater run-time due to the larger word dictionary associated with the SO data. Note that in the scope of this work we did not investigate the contribution of each of the background projects used in OUT, and how their relevance to the target prediction project affects their performance.

The trigram model shows the best performance across all training data sources (47% for IN, 32% for OUT and 34% for SO). Amongst the tested topic models, link-LDA models, which distinguish code and text tokens, perform consistently better than simple LDA models in which all tokens are considered as text. We did not however find a correlation between the number of latent topics learned by a topic model and its performance.
In fact, for each of the data sources, a different number of topics gave the optimal character saving results. Note that in this work, all topic models are based on unigram tokens, therefore their results are most comparable with that of the unigram in Table 1, which does not benefit from the backoff strategy used by the bigram and trigram models. By this comparison, the link-LDA topic model proves more successful in the comment prediction task than the simpler models which do not distinguish code and text tokens. Using n-grams without backoff leads to results significantly worse than any of the presented models (not shown).

Dataset    n-gram     link-LDA
IN         2778.35    574.34
OUT        1865.67    670.34
SO         1898.43    638.55

Table 3: Average words per project for which each tested model completes the word better than the other. This indicates that each of the models is better at predicting a different set of comment words.

Table 2 shows a sample comment segment for which words were predicted using trigram models from all training sources and an in-project link-LDA. The comment is taken from the TrainExtractor class in the Minor-Third project, a machine learning library for annotating and categorizing text. Both IN models show a clear advantage in completing the project-specific word Train, compared to models based on out-of-project data (OUT and SO). Interestingly, in this example the trigram is better at completing the term named-entity given the prefix named. However, the topic model is better at completing the word extractor, which refers to the target class. This example indicates that each model type may be more successful in predicting different comment words, and that combining multiple models may be advantageous. This can also be seen by the analysis in Table 3, where we compare the average number of words completed better by either the best n-gram or topic model given each training dataset.
Again, while n-grams generally complete more words better, a considerable portion of the words is better completed using a topic model, further motivating a hybrid solution.

5 Conclusions

We analyze the use of language models for predicting class comments for source file documents containing a mixture of code and text tokens. Our experiments demonstrate the effectiveness of using language models for comment completion, showing a saving of up to 47% of the comment characters. When available, using in-project training data proves significantly more successful than using out-of-project data. However, we find that when using out-of-project data, a dataset based on more words than code performs consistently better. The results also show that different models are better at predicting different comment words, which motivates a hybrid solution combining the advantages of multiple models.

Acknowledgments

This research was supported by the NSF under grant CCF-1247088.

References

Ramnath Balasubramanyan and William W Cohen. 2011. Block-lda: Jointly modeling entity-annotated text and entity-entity links. In Proceedings of the 7th SIAM International Conference on Data Mining.
Dave Binkley, Matthew Hearn, and Dawn Lawrie. 2011. Improving identifier informativeness using part of speech information. In Proc. of the Working Conference on Mining Software Repositories. ACM.
David M Blei and Michael I Jordan. 2003. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research.
Elena Erosheva, Stephen Fienberg, and John Lafferty. 2004. Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences of the United States of America.
Yansong Feng and Mirella Lapata. 2010. How many words is a picture worth?
automatic caption generation for news images. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Yansong Feng and Mirella Lapata. 2013. Automatic caption generation for news images. IEEE transactions on pattern analysis and machine intelligence.
Mark Gabel and Zhendong Su. 2008. Javert: fully automatic mining of general temporal properties from dynamic traces. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pages 339–349. ACM.
Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proc. of the National Academy of Sciences of the United States of America.
Sangmok Han, David R Wallace, and Robert C Miller. 2009. Code completion from abbreviated input. In Automated Software Engineering, 2009. ASE'09. 24th IEEE/ACM International Conference on, pages 332–343. IEEE.
Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE.
Ferosh Jacob and Robert Tairas. 2010. Code template inference using language models. In Proceedings of the 48th Annual Southeast Regional Conference. ACM.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., volume 1, pages 181–184. IEEE.
Naveen Kumar and Benjamin Carterette. 2013. Time based feedback and query expansion for twitter search. In Advances in Information Retrieval, pages 734–737. Springer.
Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. What's in a name? a study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on, pages 3–12. IEEE.
Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit.
Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models.
In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 258–267.
Sarah Rastkar, Gail C Murphy, and Alexander WJ Bradley. 2011. Generating natural language summaries for crosscutting source code concerns. In Software Maintenance (ICSM), 2011 27th IEEE International Conference on, pages 103–112. IEEE.
Ronald Rosenfeld. 2000. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278.
David Shepherd, Zachary P Fry, Emily Hill, Lori Pollock, and K Vijay-Shanker. 2007. Using natural language program analysis to locate and understand action-oriented concerns. In Proceedings of the 6th international conference on Aspect-oriented software development, pages 212–224. ACM.
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering, pages 43–52. ACM.
Yuen-Hsien Tseng and Da-Wei Juang. 2003. Document-self expansion for text categorization. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 399–400. ACM.
Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Single document summarization with document expansion. In Proc. of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.
Roung-Shiunn Wu and Po-Chun Li. 2011. Video annotation using hierarchical dirichlet process mixture model. Expert Systems with Applications, 38(4):3040–3048.

5 0.65029722 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

Author: Angeliki Lazaridou ; Ivan Titov ; Caroline Sporleder

6 0.64983618 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

7 0.64796227 318 acl-2013-Sentiment Relevance

8 0.64479727 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

9 0.63907534 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

10 0.63735062 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation

11 0.63626808 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

12 0.6343016 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

13 0.63148534 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

14 0.62592357 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

15 0.6258257 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

16 0.62543142 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

17 0.62509352 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

18 0.62392467 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

19 0.62309521 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

20 0.62281942 163 acl-2013-From Natural Language Specifications to Program Input Parsers