acl acl2013 acl2013-350 knowledge-graph by maker-knowledge-mining

350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection


Source: pdf

Author: Jiwei Li ; Claire Cardie ; Sujian Li

Abstract: Product reviews are now widely used by individuals and organizations for decision making (Litvin et al., 2008; Jansen, 2010). And because of the profits at stake, people have been known to try to game the system by writing fake reviews to promote target products. As a result, the task of deceptive review detection has been gaining increasing attention. In this paper, we propose a generative LDA-based topic modeling approach for fake review detection. Our model can aptly detect the subtle differences between deceptive reviews and truthful ones and achieves about 95% accuracy on review spam datasets, outperforming existing baselines by a large margin.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Product reviews are now widely used by individuals and organizations for decision making (Litvin et al. [sent-4, score-0.159]

2 And because of the profits at stake, people have been known to try to game the system by writing fake reviews to promote target products. [sent-6, score-0.258]

3 As a result, the task of deceptive review detection has been gaining increasing attention. [sent-7, score-0.666]

4 In this paper, we propose a generative LDA-based topic modeling approach for fake review detection. [sent-8, score-0.296]

5 Our model can aptly detect the subtle differences between deceptive reviews and truthful ones and achieves about 95% accuracy on review spam datasets, outperforming existing baselines by a large margin. [sent-9, score-1.387]

6 1 Introduction Consumers rely increasingly on user-generated online reviews to make purchase decisions. [sent-10, score-0.159]

7 This gives rise to deceptive opinion spam (Ott et al. [sent-12, score-0.829]

8 , 2008), fake reviews written to sound authentic and deliberately mislead readers. [sent-14, score-0.258]

9 Previous research has shown that humans have difficulty distinguishing fake from truthful reviews, operating for the most part at chance (Ott et al. [sent-15, score-0.442]

10 My husband and I stayed for two nights at the Hilton Chicago. [sent-19, score-0.171]

11 We were very pleased with the accommodations and enjoyed the service every minute of it! [sent-20, score-0.214]

12 The bedrooms are immaculate, and the linens are very soft. [sent-21, score-0.09]

13 We also appreciated the free wifi, as we could stay in touch with friends while staying in Chicago. [sent-22, score-0.19]

14 The bathroom was quite spacious, and I loved the smell of the shampoo they provided. [sent-23, score-0.266]

15 Their service was amazing, (footnote 1: The first example is a deceptive review.) [sent-24, score-0.594]

16 and we absolutely loved the beautiful indoor pool. [sent-29, score-0.22]

17 We stayed at the Sheraton by Navy Pier the first weekend of November. [sent-32, score-0.116]

18 The view from both rooms was spectacular (as you can tell from the picture attached). [sent-33, score-0.203]

19 They also left a plate of cookies and treats in the kids room upon check-in made us all feel very special. [sent-34, score-0.175]

20 The hotel is central to both Navy Pier and Michigan Ave. [sent-35, score-0.11]

21 so we walked, trolleyed, and cabbed all around the area. [sent-36, score-0.045]

22 We ate the breakfast buffet on both mornings and thought it was pretty good. [sent-37, score-0.181]

23 Our six year old ate free and our two eleven year old were $14 (instead of the adult $20). [sent-39, score-0.265]

24 The rooms were clean, the concierge and reception staff were both friendly and helpful. [sent-40, score-0.226]

25 we will definitely visit this Sheraton again when we stay in Chicago next time. [sent-43, score-0.129]

26 Because of the difficulty of recognizing deceptive opinions, there has been a widespread and growing interest in developing automatic, usually learning-based methods to help users identify deceptive reviews (Ott et al. [sent-44, score-1.259]

27 The state-of-the-art approach treats the task of spam detection as a text categorization problem and was first introduced by Jindal and Liu (2009) who trained a supervised classifier to distinguish duplicated reviews (assumed deceptive) from original ones (assumed truthful). [sent-51, score-0.41]

28 Since then, many supervised approaches have been proposed for spam detection. [sent-52, score-0.219]

29 (2011) employed standard word and part-of-speech (POS) n-gram features for supervised learning and built a gold-standard opinion dataset of 800 reviews. [sent-54, score-0.06]

30 (2011) carefully explored review-related features based on content and sentiment, training a semi-supervised classifier for opinion spam detection. [sent-58, score-0.279]

31 (page footer: © 2013 Association for Computational Linguistics, pages 217–221) ...prediction of how likely a review is to be deceptive vs. [sent-62, score-0.666]

32 Furthermore, identifying features that provide direct evidence against deceptive reviews is always a hard problem. [sent-64, score-0.709]

33 , 2003) have widely been used for their ability to model latent topics in document collection. [sent-66, score-0.098]

34 In LDA, each document is presented as a mixture distribution of topics and each topic is presented as a mixture distribution of words. [sent-67, score-0.206]
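The mixture view in the sentence above can be made concrete with a toy sampler. This is only a minimal sketch, not the paper's code: the topic names, vocabulary, and probabilities are made up for illustration, and `sample_document` is a hypothetical helper.

```python
import random


def sample_document(doc_topic_probs, topic_word_probs, n_words, rng):
    """Draw n_words (topic, word) pairs: pick a topic per token from the
    document's topic mixture, then pick a word from that topic's word mixture."""
    topics = list(doc_topic_probs)
    tokens = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=[doc_topic_probs[t] for t in topics])[0]
        words = list(topic_word_probs[topic])
        word = rng.choices(words, weights=[topic_word_probs[topic][w] for w in words])[0]
        tokens.append((topic, word))
    return tokens


# Toy distributions, invented for this example (not taken from the paper).
doc_topic_probs = {"rooms": 0.7, "food": 0.3}
topic_word_probs = {
    "rooms": {"bed": 0.5, "clean": 0.3, "view": 0.2},
    "food": {"breakfast": 0.6, "buffet": 0.4},
}

tokens = sample_document(doc_topic_probs, topic_word_probs, 10, random.Random(0))
```

Each token carries the latent topic that produced it, which is the quantity that inference later has to recover from the words alone.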

35 Researchers also integrated different levels of information into LDA topic models to model the specific knowledge that they are interested in, such as user-specific information (Rosen-zvi et al. [sent-68, score-0.081]

36 (2009) developed a Labeled LDA model to define a one-to-one correspondence between LDA latent topics and tags. [sent-73, score-0.098]

37 (2008) illustrated that by considering background information and document-specific information, we can largely improve the performance of topic modeling. [sent-75, score-0.146]

38 In this paper, we propose a Bayesian approach called TopicSpam for deceptive review detection. [sent-76, score-0.666]

39 , 2003), aims to detect the subtle differences between the topic-word distributions of deceptive reviews vs. [sent-78, score-0.709]

40 In addition, our model can give a clear probabilistic prediction on how likely a review should be treated as deceptive or truthful. [sent-80, score-0.666]

41 (2011) that contains 800 reviews of 20 Chicago hotels. [sent-82, score-0.159]

42 2 TopicSpam We are presented with four subsets of hotel reviews, $M = \{M_i\}_{i=1}^{4}$, representing deceptive train, truthful train, deceptive test and truthful test data, respectively. [sent-84, score-1.896]

43 Each review r is comprised of a number of words $r = \{w_t\}_{t=1}^{n_r}$. [sent-85, score-0.116]

44 Input for the TopicSpam algorithm is the datasets M; output is the label (deceptive, truthful) for each review in $M_3$ and $M_4$. [sent-86, score-0.116]

45 1 Details of TopicSpam In TopicSpam, each document is modeled as a bag of words, which are assumed to be generated from a mixture of latent topics. [sent-89, score-0.097]

46 Each word is associated with a latent variable that specifies the topic from which it is generated. (Figure 1: Graphical Model for TopicSpam.) [sent-90, score-0.116]

47 Words in a document are assumed to be conditionally independent given the hidden topics. [sent-91, score-0.031]

48 A general background distribution φB and hotel-specific distributions φHj (j = 1, . [sent-92, score-0.065]

49 , 20) are first introduced to capture the background information and hotel-specific information. [sent-95, score-0.065]

50 To capture the difference between deceptive reviews and truthful reviews, TopicSpam also learns a deceptive topic distribution φD and truthful topic distribution φT. [sent-96, score-2.107]

51 The generative model of TopicSpam is shown as follows: • For a training review r1j ∈ M1, words originate from one of three different topics: φB, φHj and φD. [sent-97, score-0.213]

52 • For a training review r2j ∈ M2, words originate from one of three different topics: φB, φHj and φT. [sent-98, score-0.213]

53 • For a test review rmj ∈ Mm, m = 3, 4, words originate from one of four different topics: φB, φHj, φD and φT. [sent-99, score-0.168]
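The three cases of the generative model differ only in which topics a word may be drawn from. A minimal sketch of that constraint follows; `allowed_topics` and the string topic labels are hypothetical names chosen for illustration, not identifiers from the paper.

```python
def allowed_topics(subset, hotel):
    """Return the topics a word may come from, given the subset index m
    (1 = deceptive train, 2 = truthful train, 3 and 4 = test data) and the
    hotel index j. Labels: B = background, Hj = hotel j, D = deceptive,
    T = truthful."""
    shared = ["B", f"H{hotel}"]
    if subset == 1:
        return shared + ["D"]
    if subset == 2:
        return shared + ["T"]
    if subset in (3, 4):
        # The label of a test review is unknown, so both topics stay open.
        return shared + ["D", "T"]
    raise ValueError(f"unknown subset index: {subset}")


train_deceptive = allowed_topics(1, hotel=5)
train_truthful = allowed_topics(2, hotel=5)
test_review = allowed_topics(3, hotel=5)
```

Restricting training reviews to a single class topic is what ties the deceptive and truthful topics to the labels, while test reviews can draw on both.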

54 We use λ = (λG, λHi , λD, λT) to represent the asymmetric priors for topic-word distribution generation. [sent-101, score-0.033]

55 The intuition for the asymmetric priors is that there should be more words assigned to the background topic. [sent-105, score-0.098]

56 γ = [γB, γHi , γD, γT] denotes the priors for the document-level topic distribution in the LDA model. [sent-106, score-0.114]

57 We set γB = 2 and γT = γD = γHi = 1, reflecting the intuition that more words in each document should cover the background topic. [sent-107, score-0.065]
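To see what these values imply, one can compute the prior expected share of each topic in a review. The sketch below assumes, as in standard LDA, that γ parameterizes a Dirichlet prior, whose mean for component k is γ_k divided by the sum of all components; the helper name `prior_mean` is hypothetical.

```python
def prior_mean(gamma):
    """Mean of a Dirichlet distribution with parameters gamma (dict: topic -> value)."""
    total = sum(gamma.values())
    return {topic: value / total for topic, value in gamma.items()}


# Test review: four candidate topics (background, hotel, deceptive, truthful).
test_prior = prior_mean({"B": 2.0, "H": 1.0, "D": 1.0, "T": 1.0})

# Training review: three candidate topics (assumed here to use the same values).
train_prior = prior_mean({"B": 2.0, "H": 1.0, "D": 1.0})
```

Under these assumptions the background topic is expected a priori to cover 2/5 of the words in a test review and 2/4 in a training review, twice the share of any other single topic.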

58 2 Inference We adopt the collapsed Gibbs sampling strategy to infer the latent parameters in TopicSpam. [sent-109, score-0.035]

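59 $$P(z_w = m \mid z_{-w}, i, j, \gamma, \lambda) \propto \frac{N_r^m + \gamma_m}{\sum_{m'} (N_r^{m'} + \gamma_{m'})} \cdot \frac{E_m^w + \lambda_m}{\sum_{w'}^{V} E_m^{w'} + V\lambda_m} \quad (1)$$ where $N_r^m$ denotes the number of times that topic m appears in the current review r and $E_m^w$ denotes the number of times that word w is assigned to topic m. [sent-117, score-0.278]

60 After each sampling iteration, the latent parameters can be estimated using the following formulas: $$\theta_r^m = \frac{N_r^m + \gamma_m}{\sum_{m'} (N_r^{m'} + \gamma_{m'})}, \qquad \phi_m^{(w)} = \frac{E_m^w + \lambda_m}{\sum_{w'} E_m^{w'} + V\lambda_m} \quad (2)$$ [sent-118, score-0.035]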
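The sampling distribution in Equation (1) can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' implementation: the function name, the dictionary layout of the counts, and all numbers in the toy example (including the λ values, which this extract does not report) are assumptions.

```python
def topic_conditional(word, topics, n_r, e, gamma, lam, vocab_size):
    """Normalized probability of each allowed topic for one word.

    n_r[m]  : times topic m appears in the current review (current word excluded)
    e[m][w] : times word w is assigned to topic m (current word excluded)
    gamma, lam : per-topic prior values; vocab_size : V in Equation (1)
    """
    doc_total = sum(n_r.get(m, 0) + gamma[m] for m in topics)
    weights = {}
    for m in topics:
        doc_part = (n_r.get(m, 0) + gamma[m]) / doc_total
        word_part = (e[m].get(word, 0) + lam[m]) / (
            sum(e[m].values()) + vocab_size * lam[m]
        )
        weights[m] = doc_part * word_part
    total = sum(weights.values())
    return {m: weight / total for m, weight in weights.items()}


# Toy counts, invented for illustration only.
topics = ["B", "D"]
n_r = {"B": 3, "D": 1}
e = {"B": {"hotel": 5, "my": 1}, "D": {"my": 4}}
gamma = {"B": 2.0, "D": 1.0}
lam = {"B": 0.1, "D": 0.1}

p_my = topic_conditional("my", topics, n_r, e, gamma, lam, vocab_size=3)
p_hotel = topic_conditional("hotel", topics, n_r, e, gamma, lam, vocab_size=3)
```

A Gibbs sweep would draw a new topic for each word from this distribution and update the counts. In the toy data, "my" is pulled toward the deceptive topic because that topic has already absorbed most of its occurrences.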

61 3 Labeling the Test Data For each review r in the test data, let NrD denote the number of words generated from the deceptive topic and NrT, the number of words generated from the truthful topic. [sent-119, score-1.09]

62 The decision for whether a review is deceptive or truthful is made as follows: • if NrD > NrT, r is deceptive. [sent-120, score-1.009]
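The counting rule can be sketched as follows. Only the "greater than" case survives in this extract, so treating every other case (including ties) as truthful is an assumption, and `label_review` is a hypothetical helper name.

```python
def label_review(topic_assignments):
    """Label a test review from the topics assigned to its words.

    topic_assignments: one topic label per word, e.g. "B", "H", "D" or "T".
    Assumption: anything other than N_D > N_T (ties included) maps to truthful.
    """
    n_deceptive = topic_assignments.count("D")
    n_truthful = topic_assignments.count("T")
    return "deceptive" if n_deceptive > n_truthful else "truthful"


label_a = label_review(["B", "D", "D", "T", "H"])
label_b = label_review(["B", "T", "H", "T", "D"])
```

Background and hotel-specific words do not vote, so only the words attributed to the deceptive and truthful topics influence the label.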

63 Let P(D) denote the probability that r is deceptive and P(T) denote the probability that r is truthful. [sent-123, score-0.55]

64 (2011), which contains reviews of the 20 most popular hotels on TripAdvisor in the Chicago area. [sent-126, score-0.204]

65 There are 20 truthful and 20 deceptive reviews for each of the chosen hotels (800 reviews total). [sent-127, score-1.256]

66 2 Baselines We employ a number of techniques as baselines: TopicTD: A topic-modeling approach that only considers two topics: deceptive and truthful. [sent-133, score-0.55]

67 Words in deceptive train are all generated from the deceptive topic and words in truthful train are generated from the truthful topic. [sent-134, score-1.867]

68 Test documents are presented with a mixture of the deceptive and truthful topics. [sent-135, score-0.924]

69 TopicTDB: A topic-modeling approach that only considers background, deceptive and truthful information. [sent-136, score-0.893]

70 SVM-Unigram-Removal2: Same as SVM-Unigram-Removal1 but removing all background words and hotel-specific words. [sent-142, score-0.065]

71 This illustrates the effectiveness of modeling background and hotel-specific information for the opinion spam detection problem. [sent-147, score-0.344]

72 This can be explained by the fact that both models use unigram frequency as features for the classifier or topic distribution training. [sent-164, score-0.081]

73 This can be explained as follows: for example, the word "my" has a large probability of being generated from the background topic. [sent-169, score-0.065]

74 However, it can also occasionally be generated by the deceptive topic, but can hardly be generated from the truthful topic. [sent-170, score-0.974]

75 Here we present the results of the sample reviews from Section 1. [sent-173, score-0.213]

76 Stop words are labeled in black, background topics (B) in blue, hotel specific topics (H) in orange, deceptive topics (D) in red and truthful topic (T) in green. [sent-174, score-1.338]

77 My husband and I stayed for two nights at the Hilton Chicago. [sent-176, score-0.171]

78 We were very pleased with the accommodations and enjoyed the service every minute of it! [sent-177, score-0.214]

79 The bedrooms are immaculate, and the linens are very soft. [sent-178, score-0.09]

80 We also appreciated the free wifi, as we could stay in touch with friends while staying in Chicago. [sent-179, score-0.19]

81 The bathroom was quite spacious, and I loved the smell of the shampoo they provided not like most hotel shampoos. [sent-180, score-0.376]

82 Their service was amazing, and we absolutely loved the beautiful indoor pool. [sent-181, score-0.264]

83 We stayed at the Sheraton by Navy Pier the first weekend of November. [sent-186, score-0.116]

84 The view from both rooms was spectacular (as you can tellfrom thepicture attached). [sent-187, score-0.203]

85 They also left a plate of cookies and treats in the kids room upon check-in made us all feel very special. [sent-188, score-0.175]

86 The hotel is central to both Navy Pier and Michigan Ave. [sent-189, score-0.11]

87 so we walked, trolleyed, and cabbed all around the area. [sent-190, score-0.045]

88 We ate the breakfast buffet both mornings and thought it was pretty good. [sent-191, score-0.181]

89 Our six year old ate free and our two eleven year old were $14 (instead of the adult $20). The rooms were clean, the concierge and reception staff were both friendly and helpful. [sent-193, score-0.491]

90 we will definitely visit this Sheraton again when we're in Chicago next time. [sent-196, score-0.072]

91 (flattened table of top words per topic) background: hotel, stay, we, room, !; deceptive: hotel, my, chicago, will; truthful: room, ), (, but; Hilton: Hilton, palmer, millennium, lockwood [sent-199, score-0.173]

92 Finding deceptive opinion spam by any stretch of the imagination. [sent-270, score-0.829]

93 Labeled LDA: a supervised topic model for credit attribution in multilabeled corpora. [sent-279, score-0.081]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('deceptive', 0.55), ('truthful', 0.343), ('topicspam', 0.291), ('spam', 0.219), ('reviews', 0.159), ('ott', 0.141), ('hj', 0.141), ('nrt', 0.138), ('jindal', 0.138), ('review', 0.116), ('sheraton', 0.112), ('hotel', 0.11), ('fake', 0.099), ('nrd', 0.099), ('navy', 0.091), ('bathroom', 0.089), ('pier', 0.089), ('topic', 0.081), ('indoor', 0.079), ('stayed', 0.079), ('rooms', 0.073), ('loved', 0.073), ('lda', 0.073), ('dir', 0.069), ('shampoo', 0.067), ('topictd', 0.067), ('background', 0.065), ('topics', 0.063), ('chicago', 0.063), ('staying', 0.06), ('opinion', 0.06), ('bing', 0.058), ('stay', 0.057), ('sample', 0.054), ('room', 0.053), ('originated', 0.052), ('breakfast', 0.049), ('ate', 0.047), ('husband', 0.047), ('accommodations', 0.045), ('bedrooms', 0.045), ('cabbed', 0.045), ('concierge', 0.045), ('cookies', 0.045), ('diao', 0.045), ('hilton', 0.045), ('hotels', 0.045), ('ifnrgom', 0.045), ('kids', 0.045), ('linens', 0.045), ('minute', 0.045), ('mornings', 0.045), ('nights', 0.045), ('nrdn', 0.045), ('nrm', 0.045), ('spacious', 0.045), ('tellfrom', 0.045), ('thepicture', 0.045), ('topictdb', 0.045), ('trolleyed', 0.045), ('nitin', 0.044), ('service', 0.044), ('year', 0.042), ('tourism', 0.042), ('visit', 0.041), ('hi', 0.04), ('appreciated', 0.04), ('wifi', 0.04), ('zw', 0.04), ('buffet', 0.04), ('eggs', 0.04), ('pleased', 0.04), ('spectacular', 0.04), ('walked', 0.04), ('enjoyed', 0.04), ('emw', 0.04), ('staff', 0.04), ('beautiful', 0.037), ('weekend', 0.037), ('smell', 0.037), ('reception', 0.037), ('latent', 0.035), ('old', 0.035), ('chemudugunta', 0.034), ('padhraic', 0.034), ('eleven', 0.034), ('priors', 0.033), ('myle', 0.033), ('touch', 0.033), ('spammers', 0.033), ('treats', 0.032), ('cardie', 0.032), ('definitely', 0.031), ('friendly', 0.031), ('absolutely', 0.031), ('mixture', 0.031), ('assumed', 0.031), ('adult', 0.03), ('ramage', 0.03), ('ree', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection


2 0.3897751 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

Author: Fangtao Li ; Yang Gao ; Shuchang Zhou ; Xiance Si ; Decheng Dai

Abstract: In Community question answering (QA) sites, malicious users may provide deceptive answers to promote their products or services. It is important to identify and filter out these deceptive answers. In this paper, we first solve this problem with the traditional supervised learning methods. Two kinds of features, including textual and contextual features, are investigated for this task. We further propose to exploit the user relationships to identify the deceptive answers, based on the hypothesis that similar users will have similar behaviors to post deceptive or authentic answers. To measure the user similarity, we propose a new user preference graph based on the answer preference expressed by users, such as “helpful” voting and “best answer” selection. The user preference graph is incorporated into traditional supervised learning framework with the graph regularization technique. The experiment results demonstrate that the user preference graph can indeed help improve the performance of deceptive answer prediction.

3 0.16979226 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

Author: Maria Yancheva ; Frank Rudzicz

Abstract: It is important that the testimony of children be admissible in court, especially given allegations of abuse. Unfortunately, children can be misled by interrogators or might offer false information, with dire consequences. In this work, we evaluate various parameterizations of five classifiers (including support vector machines, neural networks, and random forests) in deciphering truth from lies given transcripts of interviews with 198 victims of abuse between the ages of 4 and 7. These evaluations are performed using a novel set of syntactic features, including measures of complexity. Our results show that sentence length, the mean number of clauses per utterance, and the Stajner-Mitkov measure of complexity are highly informative syntactic features, that classification accuracy varies greatly by the age of the speaker, and that accuracy up to 91.7% can be achieved by support vector machines given a sufficient amount of data.

4 0.12067006 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset

Author: Mohamed Aly ; Amir Atiya

Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the dataset, and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.

5 0.11346823 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

Author: Xiaojun Wan

Abstract: The task of review rating prediction can be well addressed by using regression algorithms if there is a reliable training set of reviews with human ratings. In this paper, we aim to investigate a more challenging task of cross-language review rating prediction, which makes use of only rated reviews in a source language (e.g. English) to predict the rating scores of unrated reviews in a target language (e.g. German). We propose a new co-regression algorithm to address this task by leveraging unlabeled reviews. Evaluation results on several datasets show that our proposed co-regression algorithm can consistently improve the prediction results.

6 0.080522373 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

7 0.076745041 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

8 0.072665907 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

9 0.06917952 121 acl-2013-Discovering User Interactions in Ideological Discussions

10 0.069040231 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework

11 0.067305803 336 acl-2013-Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

12 0.066820107 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

13 0.065177858 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

14 0.063120835 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

15 0.060237177 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

16 0.058939882 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

17 0.05152747 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews

18 0.049096204 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

19 0.047177244 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

20 0.046758637 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.104), (1, 0.113), (2, 0.005), (3, 0.048), (4, 0.026), (5, -0.009), (6, 0.03), (7, -0.128), (8, -0.081), (9, -0.016), (10, 0.043), (11, 0.055), (12, 0.039), (13, 0.043), (14, 0.015), (15, -0.043), (16, -0.072), (17, 0.059), (18, 0.03), (19, 0.026), (20, 0.025), (21, -0.042), (22, 0.046), (23, -0.084), (24, -0.01), (25, -0.033), (26, -0.054), (27, -0.086), (28, 0.032), (29, -0.073), (30, 0.001), (31, -0.079), (32, 0.084), (33, 0.074), (34, 0.129), (35, 0.125), (36, -0.136), (37, 0.088), (38, -0.054), (39, -0.011), (40, 0.068), (41, -0.126), (42, -0.193), (43, 0.22), (44, 0.138), (45, -0.161), (46, -0.201), (47, -0.04), (48, -0.114), (49, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91015792 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection


2 0.68877697 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

Author: Fangtao Li ; Yang Gao ; Shuchang Zhou ; Xiance Si ; Decheng Dai

Abstract: In Community question answering (QA) sites, malicious users may provide deceptive answers to promote their products or services. It is important to identify and filter out these deceptive answers. In this paper, we first solve this problem with the traditional supervised learning methods. Two kinds of features, including textual and contextual features, are investigated for this task. We further propose to exploit the user relationships to identify the deceptive answers, based on the hypothesis that similar users will have similar behaviors to post deceptive or authentic answers. To measure the user similarity, we propose a new user preference graph based on the answer preference expressed by users, such as “helpful” voting and “best answer” selection. The user preference graph is incorporated into traditional supervised learning framework with the graph regularization technique. The experiment results demonstrate that the user preference graph can indeed help improve the performance of deceptive answer prediction.

3 0.58008504 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

Author: Maria Yancheva ; Frank Rudzicz

Abstract: It is important that the testimony of children be admissible in court, especially given allegations of abuse. Unfortunately, children can be misled by interrogators or might offer false information, with dire consequences. In this work, we evaluate various parameterizations of five classifiers (including support vector machines, neural networks, and random forests) in deciphering truth from lies given transcripts of interviews with 198 victims of abuse between the ages of 4 and 7. These evaluations are performed using a novel set of syntactic features, including measures of complexity. Our results show that sentence length, the mean number of clauses per utterance, and the Stajner-Mitkov measure of complexity are highly informative syntactic features, that classification accuracy varies greatly by the age of the speaker, and that accuracy up to 91.7% can be achieved by support vector machines given a sufficient amount of data.

4 0.5001356 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

Author: Xiaojun Wan

Abstract: The task of review rating prediction can be well addressed by using regression algorithms if there is a reliable training set of reviews with human ratings. In this paper, we aim to investigate a more challenging task of cross-language review rating prediction, which makes use of only rated reviews in a source language (e.g. English) to predict the rating scores of unrated reviews in a target language (e.g. German). We propose a new co-regression algorithm to address this task by leveraging unlabeled reviews. Evaluation results on several datasets show that our proposed co-regression algorithm can consistently improve the prediction results.

5 0.416502 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset

Author: Mohamed Aly ; Amir Atiya

Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the dataset, and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.

6 0.38699669 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

7 0.3213037 126 acl-2013-Diverse Keyword Extraction from Conversations

8 0.31403658 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

9 0.30962422 241 acl-2013-Minimum Bayes Risk based Answer Re-ranking for Question Answering

10 0.30275255 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals

11 0.28577265 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

12 0.28312773 33 acl-2013-A user-centric model of voting intention from Social Media

13 0.27734426 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

14 0.27352023 54 acl-2013-Are School-of-thought Words Characterizable?

15 0.27184483 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

16 0.27178136 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting

17 0.26026163 266 acl-2013-PAL: A Chatterbot System for Answering Domain-specific Questions

18 0.26015836 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

19 0.2548742 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

20 0.24681413 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.03), (4, 0.011), (6, 0.02), (11, 0.042), (24, 0.048), (26, 0.035), (29, 0.463), (35, 0.073), (42, 0.024), (48, 0.031), (70, 0.034), (88, 0.025), (90, 0.02), (95, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80202997 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection


2 0.66475934 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling

Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz

Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate part-of-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models

3 0.5501892 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

Author: Alan Akbik ; Oresti Konomi ; Michail Melnikov

Abstract: The use of deep syntactic information such as typed dependencies has been shown to be very effective in Information Extraction. Despite this potential, the process of manually creating rule-based information extractors that operate on dependency trees is not intuitive for persons without an extensive NLP background. In this system demonstration, we present a tool and a workflow designed to enable initiate users to interactively explore the effect and expressivity of creating Information Extraction rules over dependency trees. We introduce the proposed five step workflow for creating information extractors, the graph query based rule language, as well as the core features of the PROPMINER tool.

4 0.52261084 247 acl-2013-Modeling of term-distance and term-occurrence information for improving n-gram language model performance

Author: Tze Yuang Chong ; Rafael E. Banchs ; Eng Siong Chng ; Haizhou Li

Abstract: In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence.

5 0.42642617 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

Author: Kai Liu ; Yajuan Lu ; Wenbin Jiang ; Qun Liu

Abstract: This paper describes a novel strategy for automatic induction of a monolingual dependency grammar under the guidance of bilingually-projected dependency. By moderately leveraging the dependency information projected from the parsed counterpart language, and simultaneously mining the underlying syntactic structure of the language considered, it effectively integrates the advantages of bilingual projection and unsupervised induction, so as to induce a monolingual grammar much better than previous models only using bilingual projection or unsupervised induction. We induced dependency grammar for five different languages under the guidance of dependency information projected from the parsed English translation, experiments show that the bilingually-guided method achieves a significant improvement of 28.5% over the unsupervised baseline and 3.0% over the best projection baseline on average.

6 0.33959165 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction

7 0.27050859 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

8 0.2702131 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

9 0.26918787 121 acl-2013-Discovering User Interactions in Ideological Discussions

10 0.26905954 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

11 0.26807374 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

12 0.26743749 172 acl-2013-Graph-based Local Coherence Modeling

13 0.26739573 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

14 0.26733661 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

15 0.26669234 318 acl-2013-Sentiment Relevance

16 0.26652345 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

17 0.26606032 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

18 0.2655828 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

19 0.26539296 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

20 0.2652033 351 acl-2013-Topic Modeling Based Classification of Clinical Reports