acl acl2012 acl2012-61 knowledge-graph by maker-knowledge-mining

61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

Source: pdf

Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu

Abstract: Extracting sentiment and topic lexicons is important for opinion mining. Previous work has shown that supervised learning methods are superior for this task. However, the performance of supervised methods relies heavily on manually labeled training data. In this paper, we propose a domain adaptation framework for sentiment- and topic-lexicon co-extraction in a domain of interest where we do not require any labeled data, but have lots of labeled data in another related domain. The framework is twofold. In the first step, we generate a few high-confidence sentiment and topic seeds in the target domain. In the second step, we propose a novel Relational Adaptive bootstraPping (RAP) algorithm to expand the seeds in the target domain by exploiting the labeled source domain data and the relationships between topic and sentiment words. Experimental results show that our domain adaptation framework can extract precise lexicons in the target domain without any annotation.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Extracting sentiment and topic lexicons is important for opinion mining. [sent-10, score-1.124]

2 In this paper, we propose a domain adaptation framework for sentiment- and topic- lexicon co-extraction in a domain of interest where we do not require any labeled data, but have lots of labeled data in another related domain. [sent-13, score-0.763]

3 In the first step, we generate a few high-confidence sentiment and topic seeds in the target domain. [sent-15, score-1.194]

4 In the second step, we propose a novel Relational Adaptive bootstraPping (RAP) algorithm to expand the seeds in the target domain by exploiting the labeled source domain data and the relationships between topic and sentiment words. [sent-16, score-1.795]

5 Experimental results show that our domain adaptation framework can extract precise lexicons in the target domain without any annotation. [sent-17, score-0.749]

6 1 Introduction In the past few years, opinion mining and sentiment analysis have attracted much attention in Natural Language Processing (NLP) and Information Retrieval (IR) (Pang and Lee, 2008; Liu, 2010). [sent-18, score-0.738]

7 Sentiment lexicon construction and topic lexicon extraction are two fundamental subtasks for opinion mining (Qiu et al. [sent-19, score-0.8]

8 A sentiment lexicon is a list of sentiment expressions, which are used to indicate sentiment polarity (e. [sent-21, score-2.18]

9 The sentiment lexicon is domain dependent as users may use different sentiment words to express their opinion in different domains (e. [sent-24, score-1.856]

10 A topic lexicon is a list of topic expressions, on which [sent-27, score-0.747]

11 Extracting the topic lexicon from a specific domain is important because users not only care about the overall sentiment polarity of a review but also care about which aspects are mentioned in review. [sent-30, score-1.351]

12 Note that, similar to sentiment lexicons, different domains may have very different topic lexicons. [sent-31, score-0.998]

13 In this paper, we focus on the co-extraction task of sentiment and topic lexicons in a target domain where we do not have any labeled data, but have plenty of labeled data in a source domain. [sent-38, score-1.608]

14 In the first step, we build a bridge between the source and target domains by identifying some common sentiment words as sentiment seeds in the target domain, such as “good”, “bad”, “nice”, etc. [sent-41, score-1.73]

15 After that, we generate topic seeds in the target domain by mining some general syntactic relation patterns between the sentiment and topic words from the source domain. [sent-42, score-1.835]

16 (© 2012 Association for Computational Linguistics, pages 410–419) … utilize useful labeled data from the source domain as well as exploit the relationships between the topic and sentiment words to propagate information for lexicon construction in the target domain. [sent-46, score-1.678]

17 In summary, we have three main contributions: 1) We give a systematic study of cross-domain sentiment analysis at the word level. [sent-48, score-0.65]

18 1 Sentiment or Topic Lexicon Extraction Sentiment or topic lexicon extraction is to identify the sentiment or topic words from text. [sent-52, score-1.489]

19 (2004) proposed an association-rule-based method to extract topic words and a dictionary-based method to identify sentiment words, independently. [sent-55, score-1.022]

20 Some researchers also proposed to use topic modeling to identify implicit topics and sentiment words (Mei et al. [sent-60, score-0.994]

21 However, their method requires manually defining some general syntactic rules among sentiment and topic words. [sent-70, score-0.928]

22 There are also lots of studies for cross-domain sentiment analysis (Blitzer et al. [sent-77, score-0.65]

23 However, most of them focused on coarse-grained document-level sentiment classification, which is different from our fine-grained word-level extraction. [sent-85, score-0.65]

24 In contrast, we extract both topic and sentiment words and allow non-adjective sentiment words, which is more practical. [sent-91, score-1.646]

25 yi ∈ {1, 2, 3}, where yi = 1 denotes that the corresponding word wi is a sentiment word, yi = 2 denotes that wi is a topic word, and yi = 3 denotes that wi is neither a sentiment nor a topic word. [sent-98, score-2.228]

26 Our goal is to predict … and sentiment words for constructing topic and sentiment lexicons, respectively. [sent-100, score-0.928]

27 From the table, we can observe that there are some common sentiment words across different domains, such as “great”, “excellent” and “amazing”. [sent-104, score-0.69]

28 Boldfaces are topic words and italics are sentiment words. [sent-114, score-0.968]

29 Based on the observations, we can build a connection between the source and target domains by identifying the common sentiment words. [sent-115, score-0.887]

30 Furthermore, intuitively, there are some general syntactic relationships or patterns between topic and sentiment words across different domains. [sent-116, score-1.08]

31 Therefore, if we can mine the patterns from the source and target domain data, then we are able to construct an indirect connection between topic words across domains by using the common sentiment words as a bridge, which makes knowledge transfer across domains possible. [sent-117, score-1.619]

32 Figure 1 shows two dependency trees for the sentence “the camera is great” in the camera domain and the sentence “the movie is excellent” in the movie domain, respectively. [sent-118, score-0.681]

33 As can be observed, the relationships between the topic and sentiment words in the two sentences are the same. [sent-119, score-1.03]

34 Let the camera domain be the source domain and the movie domain be the target domain. [sent-121, score-0.99]

35 If the word “excellent” is identified as a common sentiment word, and the “TOPIC-nsubj-SENTIMENT” relation extracted from the camera domain is recognized as a common 412 syntactic pattern, then the word “movie” can be predicted as a topic word in the movie domain with high probability. [sent-122, score-1.583]

36 After new topic words are extracted in the movie domain, we can apply the same syntactic pattern or other syntactic patterns to extract new sentiment and topic words iteratively. [sent-123, score-1.558]
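
The iterative use of shared patterns described above can be illustrated with a small sketch. It assumes the reviews have already been reduced to (topic candidate, pattern, sentiment candidate) triples; the function name `propagate`, the triple format, and the pattern string are hypothetical, and this plain propagation is only an illustration of the intuition, not the RAP algorithm itself.

```python
def propagate(triples, patterns, sentiment_seeds, max_iters=10):
    """Alternately grow topic and sentiment sets through trusted patterns.

    triples: iterable of (topic_candidate, pattern, sentiment_candidate).
    patterns: set of pattern strings assumed to be shared across domains.
    """
    sentiments, topics = set(sentiment_seeds), set()
    for _ in range(max_iters):
        # A word linked to a known sentiment word by a trusted pattern
        # becomes a topic word.
        new_topics = {t for t, p, s in triples
                      if p in patterns and s in sentiments} - topics
        topics |= new_topics
        # A word linked to a known topic word by a trusted pattern
        # becomes a sentiment word.
        new_sents = {s for t, p, s in triples
                     if p in patterns and t in topics} - sentiments
        sentiments |= new_sents
        if not new_topics and not new_sents:
            break  # nothing new was extracted, so stop early
    return sentiments, topics
```

Starting from the single seed “excellent”, such a loop would first pick up “movie” as a topic word and could then reach further sentiment and topic words connected through the same pattern.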

37 More specifically, we use the shortest path between a topic word and a sentiment word in the corresponding dependency tree to denote the relation between them. [sent-127, score-0.928]

38 As an example shown in Figure 2, we can extract two paths or relationships between topic and sentiment words from the dependency tree of the sentence “The movie has good script”: “NN-amod-JJ” from “script” and “good”, and “NN-nsubj-VB-dobj-NN-amod-JJ” from “movie” and “good”. [sent-129, score-1.168]
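
As a rough illustration of the shortest-path relation, the sketch below runs a breadth-first search over a hand-written dependency tree for “The movie has good script”. The parse (POS tags and edges) is typed in by hand as an assumption; a real system would take it from a dependency parser. The name `relation_path` is hypothetical.

```python
from collections import deque

# Hand-written toy parse (an assumption, not parser output).
POS = {"The": "DT", "movie": "NN", "has": "VB", "good": "JJ", "script": "NN"}
# Each edge is (head, relation, dependent).
EDGES = [
    ("has", "nsubj", "movie"),
    ("has", "dobj", "script"),
    ("script", "amod", "good"),
    ("movie", "det", "The"),
]

def relation_path(topic, sentiment):
    """Return the POS/relation string along the shortest tree path."""
    # Treat the tree as undirected so paths may go up and down.
    adj = {}
    for head, rel, dep in EDGES:
        adj.setdefault(head, []).append((dep, rel))
        adj.setdefault(dep, []).append((head, rel))
    queue = deque([(topic, [POS[topic]])])
    seen = {topic}
    while queue:
        node, path = queue.popleft()
        if node == sentiment:
            return "-".join(path)
        for nxt, rel in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, POS[nxt]]))
    return None  # the two words are not connected
```

Calling `relation_path("script", "good")` and `relation_path("movie", "good")` on this toy parse yields the two path strings quoted in the sentence above.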

39 In the following sections, we present the proposed two-stage domain adaptation framework: 1) generating some sentiment and topic seeds in the target domain; and 2) expanding the seeds in the target domain to construct sentiment and topic lexicons. [sent-131, score-2.889]

40 4 Seed Generation Our basic idea is to first identify several common sentiment words across domains as sentiment seeds. [sent-132, score-1.41]

41 Meanwhile, we mine some general patterns between sentiment and topic words from the source domain. [sent-133, score-1.072]

42 Finally, we use the sentiment seeds and general patterns to generate topic seeds in the target domain. [sent-134, score-1.397]

43 1 Sentiment Seed Generation To identify common sentiment words across domains, we extract all sentiment words from the × source domain as candidates. [sent-136, score-1.655]

44 If a word wi has high S1 score, which implies that the word wi occurs frequently and similarly in both domains, then it can be considered as a common sentiment word (Pan et al. [sent-138, score-0.898]

45 We select top r candidates with highest S1 scores as sentiment seeds. [sent-141, score-0.717]
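
The top-r selection can be sketched as follows. The definition of the S1 score is not reproduced in this summary, so the score used here (the smaller of a word's relative frequencies in the two domains) is only a stand-in assumption chosen to reward words that are frequent in both domains; `sentiment_seeds` and its parameters are hypothetical names.

```python
from collections import Counter
import heapq

def sentiment_seeds(candidates, source_tokens, target_tokens, r):
    """Pick the r candidates with the highest stand-in score."""
    src, tgt = Counter(source_tokens), Counter(target_tokens)
    n_src, n_tgt = len(source_tokens), len(target_tokens)

    def score(word):
        # Stand-in for S1 (an assumption): high only if the word is
        # frequent in BOTH domains.
        return min(src[word] / n_src, tgt[word] / n_tgt)

    return heapq.nlargest(r, candidates, key=score)
```

Both token lists are assumed to be non-empty. Any other score with the same "frequent and similar in both domains" behaviour could be substituted for `score`.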

46 2 Topic Seed Generation We extract all patterns between sentiment and topic words in the source domain as candidates. [sent-143, score-1.293]

47 5 Seed Expansion After generating the topic and sentiment seeds, we aim to expand them in the target domain to construct topic and sentiment lexicons. [sent-149, score-2.23]

48 In each iteration, we employ a cross-domain classifier trained on the source domain lexicons and the extracted target domain lexicons to predict the labels of the target unlabeled data, and select top k2 predicted topic and sentiment words as candidates based on confidence. [sent-159, score-2.061]

49 With the extracted syntactic patterns in the previous iterations, we construct a bipartite graph between sentiment and topic words on the extracted target domain lexicons and candidates. [sent-160, score-1.623]

50 The main idea of TrAdaBoost is to re-weight the source domain data based on a small amount of target domain labeled data, which is referred to as seeds in our task. [sent-169, score-0.772]
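
For readers unfamiliar with the re-weighting idea, here is a minimal sketch of one TrAdaBoost-style weight update, written from the general description of that algorithm rather than from this paper. The function name, the argument layout, and the clamping of the error rate are assumptions made for illustration.

```python
import math

def tradaboost_reweight(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One weight update. err_* hold 0/1 errors of the current weak learner."""
    n = len(w_src)
    # Weighted error of the weak learner on the target (seed) data only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    eps = min(max(eps, 1e-10), 0.499)  # assumption: clamp to keep beta_t valid
    beta_t = eps / (1 - eps)
    beta = 1 / (1 + math.sqrt(2 * math.log(n) / n_rounds))
    # Misclassified source instances are down-weighted ...
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]
    # ... while misclassified target instances are up-weighted.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]
    return new_src, new_tgt
```

The effect is that source instances the current classifier gets wrong gradually lose influence, while hard target seeds gain influence.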

51 In each iteration of RAP, we train cross-domain classifiers fOT and fPT for sentiment- and topic-word extraction using TrAdaBoost separately (taking sentiment or topic words as positive instances). [sent-171, score-1.016]

52 Predict the sentiment score hfT(wTj) and topic score hfT(wTj) on DTu, and select k2 sentiment words and topic words with the highest scores as candidates. [sent-184, score-1.321]

53 Construct a bipartite graph between sentiment and topic words on DTl and the k2 sentiment- and topic-word candidates, and calculate the normalized weights θij’s for each edge of the graph. [sent-185, score-1.048]

54 Refine the scores Se1 and Se3 of the k2 sentiment and topic word candidates using Eqs. [sent-186, score-0.97]

55 Select k1 new sentiment words and k1 new topic words with the final scores, and add them to lexicons B and C. [sent-188, score-1.142]
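
The steps above can be sketched as an outer loop with pluggable parts. This is only a skeleton under stated assumptions: `score_fn` stands in for the cross-domain classifiers, `refine_fn` stands in for the graph-based score refinement, and all names and signatures are hypothetical. For simplicity it does not prevent a word from entering both lexicons.

```python
def rap_loop(unlabeled, seeds_sent, seeds_topic, score_fn, refine_fn,
             k1, k2, n_iters):
    """Grow sentiment lexicon B and topic lexicon C from seeds."""
    B, C = set(seeds_sent), set(seeds_topic)
    for _ in range(n_iters):
        pool = [w for w in unlabeled if w not in B and w not in C]
        if not pool:
            break
        # Score every remaining word with the (stand-in) classifiers.
        sent_scores, topic_scores = score_fn(pool, B, C)
        # Keep the k2 most confident candidates of each kind.
        top_s = sorted(pool, key=lambda w: sent_scores.get(w, 0.0),
                       reverse=True)[:k2]
        top_t = sorted(pool, key=lambda w: topic_scores.get(w, 0.0),
                       reverse=True)[:k2]
        # Refine the candidate scores (stand-in for the graph step).
        s_ref, t_ref = refine_fn({w: sent_scores.get(w, 0.0) for w in top_s},
                                 {w: topic_scores.get(w, 0.0) for w in top_t},
                                 B, C)
        # Add only the k1 best refined candidates to each lexicon.
        B |= set(sorted(s_ref, key=s_ref.get, reverse=True)[:k1])
        C |= set(sorted(t_ref, key=t_ref.get, reverse=True)[:k1])
    return B, C
```

Choosing k1 much smaller than k2 means each iteration commits to only a few words while still letting the refinement step see a wider pool of candidates.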

56 2 Graph Construction Based on the cross-domain classifiers fOT and fPT, we can predict the sentiment label score hfT(wTi) and topic label score hfT(wTi) for the target domain data wTi. [sent-192, score-1.307]

57 Together with the extracted sentiment and topic lexicons in the target domain, we build a bipartite graph among them as shown in Figure 3. [sent-194, score-1.28]

58 In the bipartite graph, one set of nodes represents topic words, including new topic candidates and words in the lexicon C, and the other set of nodes represents sentiment words, including new sentiment candidates and words in the lexicon B. [sent-195, score-2.453]

59 For a pair of sentiment and topic words wTOi and wTPj, if there is a pattern Rj in the pattern set that can satisfy, then there exists an edge eij between them. [sent-196, score-1.153]

60 Note that in the beginning of each iteration, Se2 is updated based on the new sentiment score Se1 and topic score Se3. [sent-202, score-0.7]

61 Figure 3: Topic and sentiment word graph. [sent-204, score-0.65]

62 3 Score Computation We construct the bipartite graph to exploit the relationships between sentiment and topic words to propagate information for lexicon extraction. [sent-206, score-1.367]

63 (5) and (6), the sentiment scores and topic scores are iteratively refined until the state of the graph tends to be stable. [sent-211, score-0.957]

64 Finally, we select k1 ≪ k2 sentiment and topic words from the k2 candidates based on their refined scores, and add them to the target domain lexicons, respectively. [sent-213, score-1.299]

65 We also update the sentiment score Se1 and topic score Se3 for next iteration. [sent-214, score-1.001]

66 (5) and (6), if the parameter = 1, then RAP only uses the relationships between sentiment and topic words with their patterns to propagate label information in the target domain without using the cross-domain classifier. [sent-218, score-1.417]

67 If = 0, then RAP only utilizes useful source domain labeled data to assist learning of the target domain classifier without considering the relationships between sentiment and topic words. [sent-220, score-1.655]
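
Equations (5) and (6) are not reproduced in this summary, so the following sketch assumes a simple interpolation that matches the two limiting cases described above: with the parameter at 0 the scores are the classifier scores, and with it at 1 they come only from neighbours in the bipartite graph. The parameter name `lam`, the update form, and `refine_scores` are all assumptions for illustration, not the paper's equations.

```python
def refine_scores(sent_clf, topic_clf, theta, lam, max_iters=100, tol=1e-9):
    """Iteratively mix classifier scores with neighbour scores.

    sent_clf / topic_clf: dicts word -> classifier score.
    theta: dict (sentiment_word, topic_word) -> normalized edge weight;
           every word in a key is assumed to appear in the score dicts.
    """
    sent, topic = dict(sent_clf), dict(topic_clf)
    for _ in range(max_iters):
        new_sent = {}
        for s in sent:
            neighbour = sum(w * topic[t]
                            for (s2, t), w in theta.items() if s2 == s)
            new_sent[s] = (1 - lam) * sent_clf[s] + lam * neighbour
        new_topic = {}
        for t in topic:
            neighbour = sum(w * sent[s]
                            for (s, t2), w in theta.items() if t2 == t)
            new_topic[t] = (1 - lam) * topic_clf[t] + lam * neighbour
        # Largest change in any score during this sweep.
        delta = max([abs(new_sent[s] - sent[s]) for s in sent] +
                    [abs(new_topic[t] - topic[t]) for t in topic],
                    default=0.0)
        sent, topic = new_sent, new_topic
        if delta < tol:
            break  # the graph state has become stable
    return sent, topic
```

With intermediate values of `lam`, a sentiment candidate that the classifier scores highly pulls up the topic words it is connected to, and vice versa.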

68 In this dataset, all types of sentiment words are annotated instead of adjective words only. [sent-227, score-0.73]

69 For example, the verbs, such as “like”, “recommend”, and nouns, such as “masterpiece”, are also labeled as sentiment words. [sent-228, score-0.716]

70 Since this method requires some target domain labeled data, we manually label 30 sentiment words in the target domain. [sent-240, score-1.175]

71 , 2007) on the source domain labeled data and the generated seeds in the target domain to train a lexicon extractor. [sent-243, score-0.963]

72 From Table 2, we can observe that our proposed methods are effective for sentiment lexicon extraction. [sent-246, score-0.867]

73 In addition, we also observe that embedding the TrAdaBoost algorithm into a bootstrapping process can further boost the performance of the classifier for sentiment lexicon extraction. [sent-251, score-1.047]

74 From the table, we can observe that different from the sentiment lexicon extraction task, the relational bootstrapping method performs better than the adaptive bootstrapping method slightly. [sent-253, score-1.373]

75 The reason may be that for the sentiment lexicon extraction task, there exist some common sentiment words … (Table note: Numbers in boldface denote significant improvement.) [sent-254, score-1.561]

76 However, for the topic lexicon extraction task, the topic words may be totally different, and as a result, we may not be able to find useful source domain labeled data to boost the performance for lexicon extraction in the target domain. [sent-257, score-1.508]

77 In this case, mutual label propagation between sentiment and topic words may be more reasonable for knowledge transfer. [sent-258, score-0.968]

78 This is because relational bootstrapping only utilizes the patterns to propagate label information, which may cover more topic and sentiment seeds, but include some noisy words. [sent-261, score-1.24]

79 However, by using this pattern and the topic word “camera”, we may extract “take” as a sentiment word from another phrase “take the camera”, which is incorrect. [sent-263, score-1.015]

80 Our RAP method can exploit both relationships between sentiment and topic words and part of labeled source domain data for cross-domain lexicon extraction. [sent-266, score-1.534]

81 Observe that for sentiment word extraction, the results of the proposed methods are not sensitive to the values of r. [sent-277, score-0.676]

82 7 Application: Sentiment Classification To further verify the usefulness of the lexicons extracted by the RAP method, we apply the extracted sentiment lexicon for sentiment classification. [sent-293, score-1.675]

83 1 Experiment Setting Our work is motivated by the work of (Pang and Lee, 2004), which only used subjective sentences for document-level sentiment classification, instead of using all sentences. [sent-295, score-0.68]

84 In this experiment, we only use sentiment related words as features to represent opinion documents for classification, instead of using all words. [sent-296, score-0.752]

85 Our goal is to compare the sentiment lexicon constructed by the RAP method with other general lexicons in terms of their impact on sentiment classification. [sent-297, score-1.625]

86 To construct domain specific sentiment lexicons, we apply RAP on each product domain with the movie domain described in Section 6. [sent-305, score-1.412]

87 2 Experimental Results Experimental results on sentiment classification are shown in Table 5, where we denote “All” using all unigram and bigram features instead of using subjective words. [sent-314, score-0.702]

88 These promising results imply that our RAP can be applied for sentiment classification effectively and efficiently. [sent-317, score-0.672]

89 8 Conclusions In this paper, we propose a two-stage framework for co-extraction of sentiment and topic lexicons across domains where we have no labeled data in the target domain but have plenty of labeled data in another domain. [sent-325, score-1.624]

90 In the first stage, we propose a simple strategy to generate a few high-quality sentiment and topic seeds for the target domain. [sent-326, score-1.194]

91 In the second stage, we propose a novel Relational Adaptive bootstraPping (RAP) method to expand the seeds, which can exploit the relationships between topic and opinion words, and make use of part of useful source domain labeled data for help. [sent-327, score-0.748]

92 Extensive experimental results show our proposed method can extract precise sentiment and topic lexicons from the target domain. [sent-328, score-1.263]

93 Furthermore, the extracted sentiment lexicon can be applied to sentiment classification effectively. [sent-329, score-1.538]

94 In the future work, besides the heterogeneous relationships between topic and sentiment words, we intend to investigate the homogeneous relationships among topic words and those among sentiment words (Qiu et al. [sent-330, score-2.06]

95 Furthermore, in our framework, we do not identify the polarity of the extracted sentiment lexicon. [sent-332, score-0.714]

96 Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. [sent-349, score-0.704]

97 Using multiple sources to construct a sentiment sensitive thesaurus for cross-domain sentiment classification. [sent-358, score-1.335]

98 Adapting information bottleneck method for automatic construction of domain-oriented sentiment lexicon. [sent-377, score-0.65]

99 Domain adaptation for large-scale sentiment classification: A deep learning approach. [sent-386, score-0.704]

100 A generation model to unify topic relevance and lexicon-based sentiment for opinion retrieval. [sent-525, score-0.99]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sentiment', 0.65), ('topic', 0.278), ('rap', 0.233), ('domain', 0.193), ('lexicon', 0.191), ('bootstrapping', 0.16), ('seeds', 0.153), ('rj', 0.149), ('lexicons', 0.134), ('camera', 0.134), ('wi', 0.124), ('wtj', 0.12), ('target', 0.113), ('movie', 0.11), ('tradaboost', 0.107), ('wti', 0.094), ('adaptive', 0.089), ('relational', 0.071), ('domains', 0.07), ('wj', 0.069), ('eij', 0.067), ('labeled', 0.066), ('opinion', 0.062), ('relationships', 0.062), ('pattern', 0.059), ('source', 0.054), ('adaptation', 0.054), ('pan', 0.054), ('dtl', 0.054), ('hft', 0.054), ('plenty', 0.054), ('wsi', 0.054), ('extraction', 0.052), ('bipartite', 0.051), ('patterns', 0.05), ('classifier', 0.046), ('qiu', 0.043), ('candidates', 0.042), ('words', 0.04), ('crossdomain', 0.04), ('fangtao', 0.04), ('jialin', 0.04), ('sinno', 0.04), ('ysi', 0.04), ('polarity', 0.039), ('qiang', 0.038), ('product', 0.038), ('ij', 0.038), ('construct', 0.035), ('fpt', 0.035), ('blitzer', 0.034), ('precise', 0.034), ('expand', 0.033), ('riloff', 0.033), ('ellen', 0.032), ('pages', 0.031), ('propagate', 0.031), ('subjective', 0.03), ('boldface', 0.03), ('jin', 0.03), ('graph', 0.029), ('li', 0.029), ('wiebe', 0.029), ('extract', 0.028), ('bollegala', 0.028), ('xiaoyan', 0.027), ('glorot', 0.027), ('songbo', 0.027), ('sults', 0.027), ('xsi', 0.027), ('xtj', 0.027), ('proposed', 0.026), ('mining', 0.026), ('crf', 0.026), ('transfer', 0.026), ('pang', 0.026), ('seed', 0.026), ('ds', 0.025), ('extracted', 0.025), ('select', 0.025), ('score', 0.025), ('iteration', 0.025), ('dai', 0.025), ('wk', 0.025), ('theresa', 0.025), ('reviews', 0.024), ('daum', 0.024), ('xueqi', 0.023), ('tan', 0.023), ('janyce', 0.023), ('classifiers', 0.023), ('update', 0.023), ('classification', 0.022), ('jakob', 0.022), ('ps', 0.022), ('excellent', 0.022), ('unlabeled', 0.021), ('ando', 0.021), ('sensitivity', 0.021), ('dtu', 0.021)]
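
The sentence scores in the summary above look consistent with simply summing these word weights over whitespace-separated tokens. For example, sentence 68 contains “sentiment” once and “words” twice, and 0.65 + 0.04 + 0.04 = 0.73. The generator's actual procedure is not documented on this page, so the sketch below is only an illustration under that assumption; `sentence_score` is a hypothetical name and `WEIGHTS` copies just a few entries from the list.

```python
# A few (word, weight) pairs copied from the tfidf list above.
WEIGHTS = {"sentiment": 0.65, "labeled": 0.066, "adaptation": 0.054,
           "words": 0.04}

def sentence_score(sentence, weights):
    """Sum the weights of exact, case-sensitive whitespace tokens."""
    # Tokens with attached punctuation (e.g. "words.") do not match,
    # which appears to mirror the listed scores.
    return sum(weights.get(tok, 0.0) for tok in sentence.split())
```

Under this reading, tokens with trailing punctuation or different capitalization contribute nothing, which would explain why some sentences score lower than their visible keywords suggest.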

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu

2 0.36541304 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

Author: Xinfan Meng ; Furu Wei ; Xiaohua Liu ; Ming Zhou ; Ge Xu ; Houfeng Wang

Abstract: The amount of labeled sentiment data in English is much larger than that in other languages. Such a disproportion arouses interest in cross-lingual sentiment classification, which aims to conduct sentiment classification in the target language (e.g. Chinese) using labeled data in the source language (e.g. English). Most existing work relies on machine translation engines to directly adapt labeled data from the source language to the target language. This approach suffers from the limited coverage of vocabulary in the machine translation results. In this paper, we propose a generative cross-lingual mixture model (CLMM) to leverage unlabeled bilingual parallel data. By fitting parameters to maximize the likelihood of the bilingual parallel data, the proposed model learns previously unseen sentiment words from the large bilingual parallel data and improves vocabulary coverage significantly. Experiments on multiple data sets show that CLMM is consistently effective in two settings: (1) labeled data in the target language are unavailable; and (2) labeled data in the target language are also available.

3 0.32538509 100 acl-2012-Fine Granular Aspect Analysis using Latent Structural Models

Author: Lei Fang ; Minlie Huang

Abstract: In this paper, we present a structural learning model for joint sentiment classification and aspect analysis of text at various levels of granularity. Our model aims to identify highly informative sentences that are aspect-specific in online custom reviews. The primary advantages of our model are two-fold: first, it performs document-level and sentence-level sentiment polarity classification jointly; second, it is able to find informative sentences that are closely related to some respects in a review, which may be helpful for aspect-level sentiment analysis such as aspect-oriented summarization. The proposed method was evaluated with 9,000 Chinese restaurant reviews. Preliminary experiments demonstrate that our model obtains promising performance.

4 0.28318626 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

Author: Arjun Mukherjee ; Bing Liu

Abstract: Aspect extraction is a central problem in sentiment analysis. Current methods either extract aspects without categorizing them, or extract and categorize them using unsupervised topic modeling. By categorizing, we mean the synonymous aspects should be clustered into the same category. In this paper, we solve the problem in a different setting where the user provides some seed words for a few aspect categories and the model extracts and clusters aspect terms into categories simultaneously. This setting is important because categorizing aspects is a subjective task. For different application purposes, different categorizations may be needed. Some form of user guidance is desired. In this paper, we propose two statistical models to solve this seeded problem, which aim to discover exactly what the user wants. Our experimental results show that the two proposed models are indeed able to perform the task effectively.

5 0.28135026 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan

Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.

6 0.2460743 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

7 0.24138862 115 acl-2012-Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification

8 0.23643956 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

9 0.22302279 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

10 0.19626708 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries

11 0.18365441 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

12 0.15751965 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation

13 0.1515484 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

14 0.14771245 187 acl-2012-Subgroup Detection in Ideological Discussions

15 0.12081959 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

16 0.11858319 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

17 0.11198577 144 acl-2012-Modeling Review Comments

18 0.10463367 188 acl-2012-Subgroup Detector: A System for Detecting Subgroups in Online Discussions

19 0.10254306 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

20 0.098459966 98 acl-2012-Finding Bursty Topics from Microblogs

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.258), (1, 0.274), (2, 0.371), (3, -0.272), (4, -0.012), (5, -0.133), (6, 0.046), (7, -0.04), (8, -0.246), (9, -0.102), (10, 0.086), (11, 0.049), (12, -0.107), (13, -0.182), (14, -0.021), (15, -0.042), (16, 0.007), (17, 0.027), (18, -0.022), (19, 0.005), (20, -0.031), (21, 0.024), (22, -0.043), (23, -0.038), (24, 0.039), (25, 0.011), (26, 0.078), (27, 0.012), (28, -0.059), (29, 0.049), (30, 0.014), (31, 0.099), (32, -0.053), (33, 0.008), (34, -0.046), (35, -0.059), (36, 0.076), (37, -0.036), (38, 0.022), (39, -0.029), (40, -0.048), (41, -0.099), (42, -0.016), (43, 0.018), (44, -0.022), (45, 0.05), (46, 0.027), (47, 0.015), (48, -0.014), (49, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97327346 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu

2 0.82239503 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

Author: Xinfan Meng ; Furu Wei ; Xiaohua Liu ; Ming Zhou ; Ge Xu ; Houfeng Wang

3 0.78471065 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

Author: Sida Wang ; Christopher Manning

Abstract: Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, but their performance varies greatly depending on the model variant, features used and task/ dataset. We show that: (i) the inclusion of word bigram features gives consistent gains on sentiment analysis tasks; (ii) for short snippet sentiment tasks, NB actually does better than SVMs (while for longer documents the opposite result holds); (iii) a simple but novel SVM variant using NB log-count ratios as feature values consistently performs well across tasks and datasets. Based on these observations, we identify simple NB and SVM variants which outperform most published results on sentiment analysis datasets, sometimes providing a new state-of-the-art performance level.

4 0.77046949 100 acl-2012-Fine Granular Aspect Analysis using Latent Structural Models

Author: Lei Fang ; Minlie Huang

5 0.72546673 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries

Author: Eduard Dragut ; Hong Wang ; Clement Yu ; Prasad Sistla ; Weiyi Meng

Abstract: Polarity classification of words is important for applications such as Opinion Mining and Sentiment Analysis. A number of sentiment word/sense dictionaries have been manually or (semi)automatically constructed. The dictionaries have substantial inaccuracies. Besides obvious instances, where the same word appears with different polarities in different dictionaries, the dictionaries exhibit complex cases, which cannot be detected by mere manual inspection. We introduce the concept of polarity consistency of words/senses in sentiment dictionaries in this paper. We show that the consistency problem is NP-complete. We reduce the polarity consistency problem to the satisfiability problem and utilize a fast SAT solver to detect inconsistencies in a sentiment dictionary. We perform experiments on four sentiment dictionaries and WordNet.

6 0.71911699 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

7 0.71166939 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

8 0.62845838 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

9 0.58234358 115 acl-2012-Identifying High-Impact Sub-Structures for Convolution Kernels in Document-level Sentiment Classification

10 0.55152172 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

11 0.40927991 120 acl-2012-Information-theoretic Multi-view Domain Adaptation

12 0.40859267 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation

13 0.40806368 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

14 0.40588611 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

15 0.39816314 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

16 0.39152354 31 acl-2012-Authorship Attribution with Author-aware Topic Models

17 0.3910315 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model

18 0.37837875 187 acl-2012-Subgroup Detection in Ideological Discussions

19 0.37410259 79 acl-2012-Efficient Tree-Based Topic Modeling

20 0.3739545 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.017), (26, 0.029), (28, 0.037), (30, 0.018), (37, 0.057), (39, 0.106), (57, 0.219), (74, 0.042), (82, 0.033), (84, 0.015), (85, 0.014), (90, 0.138), (92, 0.084), (94, 0.013), (99, 0.071)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9082734 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

2 0.87815058 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model

Author: William Yang Wang ; Elijah Mayfield ; Suresh Naidu ; Jeremiah Dittmar

Abstract: We propose a latent variable model to enhance historical analysis of large corpora. This work extends prior work in topic modelling by incorporating metadata, and the interactions between the components in metadata, in a general way. To test this, we collect a corpus of slavery-related United States property law judgements sampled from the years 1730 to 1866. We study the language use in these legal cases, with a special focus on shifts in opinions on controversial topics across different regions. Because this is a longitudinal data set, we are also interested in understanding how these opinions change over the course of decades. We show that the joint learning scheme of our sparse mixed-effects model improves on other state-of-the-art generative and discriminative models on the region and time period identification tasks. Experiments show that our sparse mixed-effects model is more accurate quantitatively and qualitatively interesting, and that these improvements are robust across different parameter settings.

3 0.83855134 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li

Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierarchical phrase-based model, and various syntax-based models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algorithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning.

4 0.82623595 83 acl-2012-Error Mining on Dependency Trees

Author: Claire Gardent ; Shashi Narayan

Abstract: In recent years, error mining approaches were developed to help identify the most likely sources of parsing failures in parsing systems using handcrafted grammars and lexicons. However, the techniques they use to enumerate and count n-grams build on the sequential nature of a text corpus and do not easily extend to structured data. In this paper, we propose an algorithm for mining trees and apply it to detect the most likely sources of generation failure. We show that this tree mining algorithm permits identifying not only errors in the generation system (grammar, lexicon) but also mismatches between the structures contained in the input and the input structures expected by our generator as well as a few idiosyncrasies/errors in the input data.

same-paper 5 0.79965454 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

Author: Fangtao Li ; Sinno Jialin Pan ; Ou Jin ; Qiang Yang ; Xiaoyan Zhu

6 0.70361459 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

7 0.68392158 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

8 0.67739344 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

9 0.67561483 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

10 0.6732589 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

11 0.66752088 120 acl-2012-Information-theoretic Multi-view Domain Adaptation

12 0.66574407 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

13 0.66573393 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

14 0.66545647 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

15 0.66233462 198 acl-2012-Topic Models, Latent Space Models, Sparse Coding, and All That: A Systematic Understanding of Probabilistic Semantic Extraction in Large Corpus

16 0.66056228 191 acl-2012-Temporally Anchored Relation Extraction

17 0.65630019 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

18 0.6554758 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

19 0.65194315 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

20 0.65174907 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing