acl acl2011 acl2011-305 knowledge-graph by maker-knowledge-mining

305 acl-2011-Topical Keyphrase Extraction from Twitter


Source: pdf

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. [sent-10, score-0.694]

2 We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. [sent-11, score-1.793]

3 Experiments show that these methods are very effective for topical keyphrase extraction. [sent-13, score-0.777]

4 In this paper, we propose to extract keyphrases as a way to summarize Twitter content. [sent-26, score-0.538]

5 Traditionally, keyphrases are defined as a short list of terms to summarize the topics of a document (Turney, 2000). [sent-27, score-0.613]

6 So far there is little work on keyword or keyphrase extraction from Twitter. [sent-31, score-0.819]

7 Existing work on keyphrase extraction identifies keyphrases from either individual documents or an entire text collection (Turney, 2000; Tomokiyo and Hurst, 2003). [sent-35, score-1.148]

8 Therefore, in this paper, we propose to study the novel problem of extracting topical keyphrases for summarizing and analyzing Twitter content. [sent-37, score-0.684]

9 In other words, we extract and organize keyphrases by topics learnt from Twitter. [sent-38, score-0.603]

10 In our work, we follow the standard three steps of keyphrase extraction, namely, keyword ranking, candidate keyphrase generation, [sent-39, score-0.86]

11 and keyphrase ranking. [sent-41, score-0.621]

12 For keyphrase ranking, we propose a principled probabilistic phrase ranking method, which can be flexibly combined with any keyword ranking method and candidate keyphrase generation method. [sent-45, score-1.75]

13 Experiments on a large Twitter data set show that our proposed methods are very effective in topical keyphrase extraction from Twitter. [sent-46, score-0.805]

14 Interestingly, our proposed keyphrase ranking method can incorporate users’ interests by modeling the retweet behavior. [sent-47, score-0.797]

15 We further examine what topics are suitable for incorporating users’ interests for topical keyphrase extraction. [sent-48, score-0.898]

16 To the best of our knowledge, our work is the first to study how to extract keyphrases from microblogs. [sent-49, score-0.505]

17 2 Related Work: Our work is related to unsupervised keyphrase extraction. [sent-51, score-0.621]

18 Graph-based ranking methods are the state of the art in unsupervised keyphrase extraction. [sent-52, score-0.747]

19 Language modeling methods (Tomokiyo and Hurst, 2003) and natural language processing techniques (Barker and Cornacchia, 2000) have also been used for unsupervised keyphrase extraction. [sent-56, score-0.621]

20 We focus on extracting topical keyphrases in microblogs, which has its own challenges. [sent-62, score-0.638]

21 , 2010), but we further extract keyphrases from each topic for summarizing and analyzing Twitter content. [sent-68, score-0.708]

22 Given a topic set T and a tweet collection C, topical keyphrase extraction is to [sent-79, score-0.777]

23 discover a list of keyphrases for each topic t ∈ T. [sent-80, score-0.639]

24 Here each keyphrase is a sequence of words. [sent-81, score-0.621]

25 To extract keyphrases, we first identify topics from the Twitter collection using topic models. [sent-82, score-0.295]

26 Next, for each topic, we run a topical PageRank algorithm to rank keywords and then generate candidate keyphrases using the top-ranked keywords. [sent-84, score-0.846]

27 Finally, we use a probabilistic model to rank the candidate keyphrases. [sent-86, score-0.574]

28 When a user wants to write a tweet, she first chooses a topic based on her topic distribution. [sent-99, score-0.352]

29 However, not all words in a tweet are closely related to the topic of that tweet; some are background words commonly used in tweets on different topics. [sent-101, score-0.347]

30 Therefore, for each word in a tweet, the user first decides whether it is a background word or a topic word and then chooses the word from its respective word distribution. [sent-102, score-0.218]

31 Formally, let φt denote the word distribution for topic t and φB the word distribution for background words. [sent-103, score-0.229]

32 Let π denote a Bernoulli distribution that governs the choice between background words and topic words. [sent-105, score-0.208]
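
The generative story described in the sentences above (one topic per tweet from the user's topic distribution, then a Bernoulli switch between background and topic words) can be written down directly. A minimal sketch, assuming π is the probability of drawing a topic word and that tweet length is given; the names theta_u, phi, and phi_B stand for the user topic distribution, the φt distributions, and φB, and are illustrative, not from the paper.

```python
import random

def sample(dist):
    """Draw one item from a {item: prob} distribution."""
    r, cum = random.random(), 0.0
    for item, p in dist.items():
        cum += p
        if r < cum:
            return item
    return item  # guard against floating-point slack

def generate_tweet(theta_u, phi, phi_B, pi, length):
    """Generate one tweet: choose a topic from the user's topic distribution
    theta_u, then for each word position flip a Bernoulli coin pi to decide
    between a topic word (from phi[topic]) and a background word (from phi_B).
    """
    topic = sample(theta_u)
    words = []
    for _ in range(length):
        if random.random() < pi:
            words.append(sample(phi[topic]))   # topic word
        else:
            words.append(sample(phi_B))        # background word
    return topic, words
```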

33 It runs topic-biased PageRank for each topic separately and boosts those words with high relevance to the corresponding topic. [sent-111, score-0.22]

34 A large Rt(·) indicates that a word is a good candidate keyword in topic t. [sent-114, score-0.378]

35 However, the original TPR ignores the topic context when setting the edge weights; the edge weight is set by counting the number of co-occurrences of the two words within a certain window size. [sent-116, score-0.205]

36 (2) Here we compute the propagation from wj to wi in the context of topic t, namely, the edge weight from wj to wi is parameterized by t. [sent-121, score-0.37]

37 In this paper, we compute edge weight et(wj , wi) between two words by counting the number of co-occurrences of these two words in tweets assigned to topic t. [sent-122, score-0.273]
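
The sentences above describe the cTPR recipe: a topic-biased PageRank run per topic, with edge weights e_t(wj, wi) counted only over tweets assigned to topic t. A hedged sketch of that power iteration follows; the damping factor and the use of a topic-specific preference vector as the random-jump distribution are standard PageRank choices assumed here, not details given in the excerpt.

```python
from collections import defaultdict

def ctpr(edges_t, pref_t, damping=0.85, iters=50):
    """Context-sensitive topical PageRank - a sketch, not the authors' exact
    formulation. edges_t maps (wj, wi) to e_t(wj, wi), the co-occurrence
    count of wj and wi within tweets assigned to topic t; pref_t maps each
    word to a topic-specific preference (e.g. P(w|t)) used for random jumps.
    """
    words = set(pref_t) | {w for pair in edges_t for w in pair}
    out_w = defaultdict(float)            # total outgoing edge weight per word
    for (wj, _), e in edges_t.items():
        out_w[wj] += e
    rank = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new = {w: (1.0 - damping) * pref_t.get(w, 0.0) for w in words}
        for (wj, wi), e in edges_t.items():
            new[wi] += damping * rank[wj] * e / out_w[wj]
        rank = new
    return rank  # a large rank[w] marks w as a good keyword for topic t
```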

38 After keyword ranking using cTPR or any other method, we adopt a common candidate keyphrase generation method proposed by Mihalcea and Tarau (2004) as follows. [sent-124, score-0.986]
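
The candidate generation step credited to Mihalcea and Tarau (2004) is commonly implemented by marking the top-ranked keywords and collapsing runs of adjacent marked words in the text into phrases. A sketch under that reading; top_n is an illustrative cutoff.

```python
def candidate_keyphrases(tweets, rank, top_n=200):
    """Keep the top_n ranked keywords and collapse maximal runs of adjacent
    kept keywords in each tweet into multi-word candidate keyphrases."""
    keep = set(sorted(rank, key=rank.get, reverse=True)[:top_n])
    candidates = set()
    for tokens in tweets:               # each tweet as a list of tokens
        run = []
        for tok in tokens + [None]:     # sentinel flushes the final run
            if tok in keep:
                run.append(tok)
            elif run:
                candidates.add(tuple(run))
                run = []
    return candidates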

39 While a standard method is to simply aggregate the scores of keywords inside a candidate keyphrase as the score for the keyphrase, here we propose a different probabilistic scoring function. [sent-129, score-0.779]

40 Our method is based on the following hypotheses about good keyphrases given a topic. [Figure 2: Assumptions of variable dependencies.] [sent-130, score-0.482]

41 Relevance: A good keyphrase should be closely related to the given topic and also discriminative. [sent-131, score-0.778]

42 For example, for the topic “news,” “president obama” is a good keyphrase while “math class” is not. [sent-132, score-0.778]

43 Interestingness: A good keyphrase should be interesting and attract users’ attention. [sent-133, score-0.621]

44 Sometimes, there is a trade-off between these two properties, and a good keyphrase has to balance both. [sent-135, score-0.621]

45 Following the probabilistic relevance models in information retrieval (Lafferty and Zhai, 2003), we propose to use P(R = 1, I = 1|t, k) to rank candidate keyphrases for topic t. [sent-139, score-0.794]

46 the interestingness of a keyphrase is independent of the topic or whether the keyphrase is relevant to the topic. [sent-143, score-1.555]

47 In general, because there are many more non-relevant keyphrases than relevant ones, that is, δ ≪ 1. [sent-150, score-0.498]

48 We can see that the ranking score log P(R = 1, I = 1|t, k) can be decomposed into two components, a relevance score log[P(k|t, R = 1)/P(k|t, R = 0)] and an interestingness score log P(I = 1|k). [sent-157, score-0.382]

49 Estimating the relevance score: Let a keyphrase candidate k be a sequence of words (w1, w2, . [sent-159, score-0.815]

50 (4) Given the topic model φt previously learned for topic t, we can set P(w|t, R = 1) to φtw. [sent-168, score-0.314]

51 (5) Here Ct denotes the collection of tweets assigned to topic t, #(Ct, w) is the number of times w appears in Ct, and #(Ct, ·) is the total number of words in Ct. [sent-172, score-0.266]
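
The sentences above suggest the relevance score multiplies per-word probability ratios under a word-independence assumption, with P(w|t, R = 1) = φtw. A sketch, assuming the non-relevant side P(w|t, R = 0) is estimated from background word counts over the whole collection, which is a plausible reading of the excerpt rather than a confirmed detail.

```python
import math

def relevance_score(k, phi_t, bg_counts, bg_total, eps=1e-12):
    """Per-keyphrase relevance log[P(k|t,R=1)/P(k|t,R=0)] under word
    independence. phi_t: {w: prob} topic word distribution (phi_tw);
    bg_counts/bg_total: word counts for an assumed background model of
    non-relevant text."""
    score = 0.0
    for w in k:
        p_rel = max(phi_t.get(w, 0.0), eps)
        p_non = max(bg_counts.get(w, 0) / bg_total, eps)
        score += math.log(p_rel / p_non)
    return score
```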

52 (7) and |k| denotes the number of words in k. Estimating the interestingness score: To capture the interestingness of keyphrases, we make use of the retweeting behavior in Twitter. [sent-181, score-0.316]

53 (8) Here |ReTweetsk| and |Tweetsk| denote the numbers of retweets and tweets containing the keyphrase k, respectively, and lavg is the average number of tweets that a candidate keyphrase appears in. [sent-189, score-1.413]
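
Equation (8) itself is not recoverable from this excerpt, but the quantities it names (|ReTweetsk|, |Tweetsk|, and the smoothing constant lavg) suggest a smoothed retweet ratio. One plausible form, offered only as a sketch:

```python
def interestingness(n_retweets, n_tweets, l_avg):
    """Estimate P(I=1|k) from retweeting behavior: a smoothed ratio of
    retweets to tweets containing k, with l_avg acting as a pseudo-count.
    The exact form of Equation (8) is an assumption here."""
    return (n_retweets + 1.0) / (n_tweets + l_avg)
```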

54 Incorporating length preference: Our preliminary experiments with Equation (9) show that this scoring function usually ranks longer keyphrases higher than shorter ones. [sent-194, score-0.5]

55 However, because our candidate keyphrases are extracted without using any linguistic knowledge such as noun phrase boundaries, longer candidate keyphrases tend to be less meaningful as phrases. [sent-195, score-1.232]

56 Moreover, for our task of using keyphrases to summarize Twitter, we hypothesize that shorter keyphrases are preferred by users as they are more compact. [sent-196, score-1.025]

57 We further observe that Equation (9) tends to give longer keyphrases higher scores, mainly due to the term |k|η. [sent-199, score-0.482]
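
Putting the pieces together, a kpRelInt-style score adds the log-relevance and log-interestingness components. The per-word length discount below is a stand-in for the paper's correction to the |k|η term, whose exact form the excerpt does not give; this sketch reuses the relevance_score and interestingness helpers from the snippets above.

```python
import math

def kp_rel_int(k, phi_t, bg_counts, bg_total, n_retweets, n_tweets, l_avg,
               length_penalty=0.5):
    """Combined relevance + interestingness ranking score for a candidate
    keyphrase k, with a simple length discount (an assumption, standing in
    for the paper's adjustment of the |k|^eta term)."""
    rel = relevance_score(k, phi_t, bg_counts, bg_total)
    inter = math.log(interestingness(n_retweets, n_tweets, l_avg))
    return rel + inter - length_penalty * len(k)
```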

58 We selected 10 topics that cover a diverse range of content in Twitter for evaluation of topical keyphrase extraction. [sent-218, score-0.875]

59 1, there are three steps to generate keyphrases, namely, keyword ranking, candidate keyphrase generation, and keyphrase ranking. [sent-226, score-1.463]

60 We have proposed a context-sensitive topical PageRank method (cTPR) for the first step of keyword ranking, and a probabilistic scoring function for the third step of keyphrase ranking. [sent-227, score-0.982]

61 Keyphrase Ranking: We use kpRelInt to denote our relevance- and interestingness-based keyphrase ranking function P(R = 1, I = 1|t, k). [sent-239, score-0.978]

62 (2010), we can rank candidate keyphrases by Σ_{w∈k} f(w), where f(w) is the score assigned to word w by a keyword ranking method. [sent-248, score-0.795]
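
The kpBL baselines score a candidate by aggregating its keyword scores, i.e. Σ_{w∈k} f(w). For concreteness, a one-line sketch:

```python
def kp_baseline(k, f):
    """kpBL-style baseline: score a candidate keyphrase k by summing the
    keyword scores f(w) of its words."""
    return sum(f[w] for w in k)
```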

63 • kpRel: If we consider only relevance but not interestingness, we can rank candidate keyphrases by Σ_{w∈k} log[P(w|t, R = 1)/P(w|t, R = 0)]. [sent-250, score-0.661]

64 4.3 Gold Standard Generation: Since there is no existing test collection for topical keyphrase extraction from Twitter, we manually constructed our test collection. [sent-251, score-0.822]

65 For each topic, the judges were given the top topic words and a short topic description. [sent-261, score-0.373]

66 The number of the remaining keyphrases for each topic ranges from 56 to 282. [sent-270, score-0.639]

67 4.4 Evaluation Metrics: Traditionally, keyphrase extraction is evaluated using precision and recall on all the extracted keyphrases. [sent-272, score-0.649]

68 We choose not to use these measures for the following reasons: (1) Traditional keyphrase extraction works on single documents while we study topical keyphrase extraction. [sent-273, score-1.426]

69 The gold standard keyphrase list for a single document is usually short and clean, while for each Twitter topic there can be many keyphrases, some are more relevant and interesting than others. [sent-274, score-0.794]

70 (2) Our extracted topical keyphrases are meant for summarizing Twitter content, and they are likely to be directly shown to the users. [sent-275, score-0.669]

71 score(·) is the average score from the two human judges, and IdealScore(K, t) is the normalization factor: the score of the top K keyphrases of topic t under the ideal ranking. [sent-282, score-0.679]

72 Intuitively, if M returns more good keyphrases in top rankings, its nKQM value will be higher. [sent-283, score-0.505]
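
The two sentences above describe nKQM@K as a judge-scored, ideal-normalized ranking metric in the spirit of nDCG (Järvelin and Kekäläinen appear in the term list below). A sketch with an assumed log2 position discount; the exact discount is not spelled out in the excerpt.

```python
import math

def nkqm_at_k(ranked, gold, K):
    """nKQM@K sketch: average over topics of the discounted sum of human
    scores of the top K returned keyphrases, normalized by IdealScore(K, t),
    the same discounted sum under the ideal ranking.

    ranked: {topic: [keyphrase, ...]} method output, best first
    gold:   {topic: {keyphrase: avg judge score}}
    """
    total = 0.0
    for t, kps in ranked.items():
        scores = gold[t]
        gain = sum(scores.get(kp, 0.0) / math.log2(j + 2)
                   for j, kp in enumerate(kps[:K]))
        ideal = sum(s / math.log2(j + 2)
                    for j, s in enumerate(sorted(scores.values(),
                                                 reverse=True)[:K]))
        total += gain / ideal if ideal > 0 else 0.0
    return total / len(ranked)
```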

73 4.5 Experiment Results. Evaluation of keyword ranking methods: Since keyword ranking is the first step for keyphrase extraction, we first compare our keyword ranking method cTPR with other methods. [sent-286, score-1.509]

74 We manually examined whether a word is a good keyword or a noisy word based on topic context. [sent-288, score-0.327]

75 Since our final goal is to extract topical keyphrases, we further compare the performance of cTPR and TPR when they are combined with a keyphrase ranking algorithm. [sent-291, score-0.926]

76 [Table 3: Comparisons of keyphrase extraction for cTPR and baselines.] [sent-303, score-0.649]

77 [Table 4: Comparisons of keyphrase extraction for different keyphrase ranking methods.] [sent-324, score-1.396]

78 Evaluation of keyphrase ranking methods: In this section we compare keyphrase ranking methods. [sent-331, score-0.873]

79 Therefore we use cTPR as the keyword ranking method and compare the keyphrase ranking method kpRelInt with kpBL1, kpBL2, and kpRel, each combined with cTPR. [sent-333, score-1.043]

80 Interestingly, we also see that for the nKQM metric, kpBL1, which is the most commonly used keyphrase ranking method, did not perform as well as kpBL2, a modified version of kpBL1. [sent-337, score-0.747]

81 These findings support our assumption that our proposed keyphrase ranking method is effective. [sent-340, score-0.747]

82 “Singapore”) tend to gain higher weights during keyword ranking due to their high frequency, especially in graph-based methods, but we do not want such words to contribute too much to keyphrase scores. [sent-348, score-0.917]

83 Here we study why it worked better for keyphrase ranking. [sent-351, score-0.621]

84 We then counted the numbers of entity and event keyphrases for these four topics retrieved by different methods, shown in Table 6. [sent-357, score-0.605]

85 We can see that in these four topics, kpRelInt is consistently better than kpRel in terms of the number of entity and event keyphrases retrieved. [sent-358, score-0.507]

86 [Table 6: Numbers of entity and event keyphrases retrieved by different methods (cTPR+kpRel and cTPR+kpRelInt) within top 20, for topics T5, T12, T20, T25.] [sent-360, score-0.53]

87 On the other hand, we also find that for some topics interestingness helped little or even hurt the performance a little. [sent-361, score-0.238]

88 We find that the keyphrases in these topics are stable and change less over time. [sent-364, score-0.58]

89 4.6 Qualitative evaluation of cTPR+kpRelInt: We show the top 10 keyphrases discovered by cTPR+kpRelInt in Table 7. [sent-383, score-0.505]

90 We can observe that these keyphrases are clear, interesting and informative for summarizing Twitter topics. [sent-384, score-0.513]

91 We hypothesize that the following applications can benefit from the extracted keyphrases. Automatic generation of realtime trendy phrases: For example, keyphrases in the topic “Food” (T2) can be used to help online restaurant reviews. [sent-385, score-0.699]

92 Event detection and topic tracking: In the topic “News,” top keyphrases can be used as candidate trendy topics for event detection and topic tracking. [sent-386, score-1.173]

93 5 Conclusion: In this paper, we studied the novel problem of topical keyphrase extraction for summarizing and analyzing Twitter content. [sent-388, score-0.695]

94 We proposed the context-sensitive topical PageRank (cTPR) method for keyword ranking. [sent-389, score-0.326]

95 Experiments showed that cTPR is consistently better than the original TPR and other baseline methods in terms of top keyword and keyphrase extraction. [sent-390, score-0.814]

96 For keyphrase ranking, we proposed a probabilistic ranking method, which models both relevance and interestingness of keyphrases. [sent-391, score-0.967]

97 In our experiments, this method is shown to be very effective at boosting the performance of keyphrase extraction for different kinds of keyword ranking methods. [sent-392, score-0.945]

98 In the future, we may consider how to incorporate keyword scores into our keyphrase ranking method. [sent-393, score-0.917]

99 Note that we propose to rank keyphrases by a general formula P(R = 1, I = 1|t, k), and we have made some approximations based on reasonable assumptions. [sent-394, score-0.506]

100 Automatic generation of personalized annotation tags for twitter users. [sent-474, score-0.253]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('keyphrase', 0.621), ('keyphrases', 0.482), ('twitter', 0.219), ('ctpr', 0.186), ('keyword', 0.17), ('kprelint', 0.163), ('topic', 0.157), ('topical', 0.156), ('interestingness', 0.14), ('ranking', 0.126), ('kprel', 0.105), ('pagerank', 0.1), ('topics', 0.098), ('tweets', 0.092), ('nkqm', 0.081), ('tpr', 0.081), ('tweet', 0.075), ('kk', 0.07), ('relevance', 0.063), ('rr', 0.062), ('keywords', 0.055), ('wj', 0.052), ('candidate', 0.051), ('cct', 0.047), ('tomokiyo', 0.047), ('log', 0.041), ('equation', 0.039), ('rt', 0.038), ('tarau', 0.038), ('xin', 0.037), ('judges', 0.036), ('tewtweeetestks', 0.035), ('xiaoming', 0.035), ('summarize', 0.033), ('summarizing', 0.031), ('logpp', 0.031), ('hurst', 0.031), ('lav', 0.031), ('wi', 0.031), ('users', 0.028), ('extraction', 0.028), ('denote', 0.028), ('meaningful', 0.027), ('jing', 0.027), ('mihalcea', 0.027), ('retweet', 0.027), ('textrank', 0.027), ('pp', 0.026), ('event', 0.025), ('ct', 0.025), ('weng', 0.025), ('zhao', 0.025), ('social', 0.024), ('edge', 0.024), ('pw', 0.024), ('rank', 0.024), ('top', 0.023), ('arvelin', 0.023), ('jianshu', 0.023), ('kek', 0.023), ('methodnkqm', 0.023), ('sakaki', 0.023), ('scoret', 0.023), ('trendy', 0.023), ('tumasjan', 0.023), ('interests', 0.023), ('background', 0.023), ('propagation', 0.023), ('extract', 0.023), ('singapore', 0.022), ('user', 0.022), ('pt', 0.021), ('let', 0.021), ('ww', 0.021), ('liu', 0.021), ('juice', 0.021), ('klog', 0.021), ('litvak', 0.021), ('realtime', 0.019), ('barker', 0.019), ('celebrities', 0.019), ('damping', 0.019), ('retweeting', 0.019), ('wxj', 0.019), ('tw', 0.018), ('scoring', 0.018), ('traditional', 0.018), ('generation', 0.018), ('probabilistic', 0.017), ('microblogs', 0.017), ('publish', 0.017), ('collection', 0.017), ('score', 0.017), ('griffiths', 0.016), ('chooses', 0.016), ('relevant', 0.016), ('lim', 0.016), ('personalized', 0.016), ('ramage', 0.016), ('analyzing', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 305 acl-2011-Topical Keyphrase Extraction from Twitter

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

2 0.16546153 177 acl-2011-Interactive Group Suggesting for Twitter

Author: Zhonghua Qu ; Yang Liu

Abstract: The number of users on Twitter has drastically increased in the past years. However, Twitter does not have an effective user grouping mechanism. Therefore tweets from other users can quickly overrun and become inconvenient to read. In this paper, we propose methods to help users group the people they follow using their provided seeding users. Two sources of information are used to build sub-systems: textural information captured by the tweets sent by users, and social connections among users. We also propose a measure of fitness to determine which subsystem best represents the seed users and use it for target user ranking. Our experiments show that our proposed framework works well and that adaptively choosing the appropriate sub-system for group suggestion results in increased accuracy.

3 0.15470403 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith

Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

4 0.14315933 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

5 0.14113578 292 acl-2011-Target-dependent Twitter Sentiment Classification

Author: Long Jiang ; Mo Yu ; Ming Zhou ; Xiaohua Liu ; Tiejun Zhao

Abstract: Sentiment analysis on Twitter data has attracted much attention recently. In this paper, we focus on target-dependent Twitter sentiment classification; namely, given a query, we classify the sentiments of the tweets as positive, negative or neutral according to whether they contain positive, negative or neutral sentiments about that query. Here the query serves as the target of the sentiments. The state-ofthe-art approaches for solving this problem always adopt the target-independent strategy, which may assign irrelevant sentiments to the given target. Moreover, the state-of-the-art approaches only take the tweet to be classified into consideration when classifying the sentiment; they ignore its context (i.e., related tweets). However, because tweets are usually short and more ambiguous, sometimes it is not enough to consider only the current tweet for sentiment classification. In this paper, we propose to improve target-dependent Twitter sentiment classification by 1) incorporating target-dependent features; and 2) taking related tweets into consideration. According to the experimental results, our approach greatly improves the performance of target-dependent sentiment classification. 1

6 0.12759113 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

7 0.12025507 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

8 0.11411578 117 acl-2011-Entity Set Expansion using Topic information

9 0.11131322 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

10 0.11085302 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis

11 0.10514564 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

12 0.094666883 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

13 0.093920015 178 acl-2011-Interactive Topic Modeling

14 0.087968685 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

15 0.078234948 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

16 0.07280878 14 acl-2011-A Hierarchical Model of Web Summaries

17 0.070197783 285 acl-2011-Simple supervised document geolocation with geodesic grids

18 0.068971835 261 acl-2011-Recognizing Named Entities in Tweets

19 0.057738286 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

20 0.056673218 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.123), (1, 0.126), (2, 0.0), (3, 0.071), (4, -0.016), (5, -0.058), (6, -0.076), (7, 0.015), (8, -0.014), (9, 0.087), (10, -0.216), (11, 0.142), (12, 0.146), (13, -0.096), (14, -0.0), (15, -0.041), (16, -0.006), (17, 0.001), (18, -0.058), (19, 0.016), (20, -0.05), (21, 0.023), (22, -0.069), (23, 0.023), (24, 0.006), (25, 0.032), (26, -0.046), (27, 0.016), (28, -0.03), (29, 0.004), (30, 0.02), (31, 0.013), (32, 0.016), (33, 0.015), (34, -0.005), (35, -0.004), (36, -0.0), (37, -0.009), (38, 0.016), (39, -0.005), (40, 0.014), (41, 0.005), (42, 0.022), (43, 0.043), (44, 0.044), (45, -0.032), (46, -0.019), (47, 0.005), (48, 0.001), (49, -0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92670053 305 acl-2011-Topical Keyphrase Extraction from Twitter

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

2 0.74300301 177 acl-2011-Interactive Group Suggesting for Twitter

Author: Zhonghua Qu ; Yang Liu

Abstract: The number of users on Twitter has drastically increased in the past years. However, Twitter does not have an effective user grouping mechanism. Therefore tweets from other users can quickly overrun and become inconvenient to read. In this paper, we propose methods to help users group the people they follow using their provided seeding users. Two sources of information are used to build sub-systems: textural information captured by the tweets sent by users, and social connections among users. We also propose a measure of fitness to determine which subsystem best represents the seed users and use it for target user ranking. Our experiments show that our proposed framework works well and that adaptively choosing the appropriate sub-system for group suggestion results in increased accuracy.

3 0.64661157 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

Author: Roberto Gonzalez-Ibanez ; Smaranda Muresan ; Nina Wacholder

Abstract: Sarcasm transforms the polarity of an apparently positive or negative utterance into its opposite. We report on a method for constructing a corpus of sarcastic Twitter messages in which determination of the sarcasm of each message has been made by its author. We use this reliable corpus to compare sarcastic utterances in Twitter to utterances that express positive or negative attitudes without sarcasm. We investigate the impact of lexical and pragmatic factors on machine learning effectiveness for identifying sarcastic utterances and we compare the performance of machine learning techniques and human judges on this task. Perhaps unsurprisingly, neither the human judges nor the machine learning techniques perform very well. 1

4 0.64417732 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith

Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.

5 0.61321402 52 acl-2011-Automatic Labelling of Topic Models

Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin

Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.

6 0.61005646 178 acl-2011-Interactive Topic Modeling

7 0.58489871 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis

8 0.56940812 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

9 0.55504787 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

10 0.54638636 117 acl-2011-Entity Set Expansion using Topic information

11 0.53905094 285 acl-2011-Simple supervised document geolocation with geodesic grids

12 0.52515364 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

13 0.5142318 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

14 0.50955105 14 acl-2011-A Hierarchical Model of Web Summaries

15 0.50775486 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

16 0.5039072 261 acl-2011-Recognizing Named Entities in Tweets

17 0.50087565 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

18 0.4745411 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

19 0.45825005 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

20 0.45542648 292 acl-2011-Target-dependent Twitter Sentiment Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(17, 0.05), (26, 0.016), (37, 0.063), (39, 0.068), (41, 0.039), (55, 0.026), (57, 0.313), (59, 0.032), (72, 0.041), (91, 0.031), (96, 0.181)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.79508483 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

2 0.79250586 243 acl-2011-Partial Parsing from Bitext Projections

Author: Prashanth Mannem ; Aswarth Dara

Abstract: Recent work has shown how a parallel corpus can be leveraged to build syntactic parser for a target language by projecting automatic source parse onto the target sentence using word alignments. The projected target dependency parses are not always fully connected to be useful for training traditional dependency parsers. In this paper, we present a greedy non-directional parsing algorithm which doesn’t need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them. Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi. It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi.

same-paper 3 0.7899344 305 acl-2011-Topical Keyphrase Extraction from Twitter

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

4 0.77068073 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics andjour- nalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

5 0.74593067 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

Author: Manaal Faruqui ; Sebastian Pado

Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b), lexical cues can predict T/V almost at human level.

6 0.58152366 101 acl-2011-Disentangling Chat with Local Coherence Models

7 0.5731163 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

8 0.57284224 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

9 0.57272965 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

10 0.57042897 117 acl-2011-Entity Set Expansion using Topic information

11 0.56896758 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

12 0.56817609 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

13 0.56805718 28 acl-2011-A Statistical Tree Annotator and Its Applications

14 0.56770611 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

15 0.56618887 187 acl-2011-Jointly Learning to Extract and Compress

16 0.5652281 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

17 0.56494808 175 acl-2011-Integrating history-length interpolation and classes in language modeling

18 0.56446832 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

19 0.5642432 61 acl-2011-Binarized Forest to String Translation

20 0.56408954 178 acl-2011-Interactive Topic Modeling