emnlp emnlp2013 emnlp2013-95 knowledge-graph by maker-knowledge-mining

95 emnlp-2013-Identifying Multiple Userids of the Same Author


Source: pdf

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper studies the problem of identifying users who use multiple userids to post in social media. [sent-4, score-0.5]

2 Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. [sent-5, score-0.523]

3 Instead, it uses documents from other userids for classifier building. [sent-7, score-0.604]

4 The experimental results using a large number of userids and their reviews show that the proposed method is highly effective. [sent-11, score-0.577]

5 He/she then registers another userid in order to regain his/her status. [sent-15, score-0.398]

6 A user may also use multiple userids to instigate controversy or debates to popularize a topic to make it “hot” or even just to promote activities at a website. [sent-16, score-0.479]

7 Yet, a user may also use multiple userids to post fake or deceptive opinions to promote or demote some products (Liu, 2012). [sent-17, score-0.615]

8 Problem definition: Given a set of userids ID = {id1, …, idn} and each idi has a set of documents Di, we want to identify userids that belong to the same physical author. [sent-24, score-1.353]

9 The main related works to ours are in the area of authorship attribution (AA), which aims to identify authors of documents. [sent-25, score-0.28]

10 Let A = {a1, …, ak} be a set of authors (or classes) and each author ai ∈ A has a set of training documents Di. [sent-27, score-0.277]

11 A classifier is then built to decide the author a of each test document d, where a ∈ A. [sent-28, score-0.268]

12 This supervised AA formulation, however, is not suitable for our task because we only have userids but not real authors. [sent-30, score-0.5]

13 Since some of the userids may belong to the same author, we cannot treat each userid as a class because in that case, we will be classifying based on userids, which won’t help us find authors with multiple userids (see also Section 7). [sent-31, score-1.417]

14 To simplify the presentation, we assume that at most two userids can belong to a single author, but the algorithm can be extended to handle more than two userids from the same author. [sent-33, score-0.981]

15 Candidate identification: For each userid idi, we first find the most likely userid idj (i ≠ j) that may have the same author as idi. [sent-35, score-1.356]

16 Decision making: We then run candidate identification for idj, obtaining idk = candid-iden(idj, ID). If k = i, we conclude that idi and idj are from the same author. [sent-48, score-0.693]

17 Otherwise, idi and idj are not from the same author. [sent-49, score-0.693]

18 We can first split the documents Di of each idi into two subsets, a query set Qi and a sample set Si. [sent-52, score-0.551]

19 We then compare each query document in Qi with each sample document in Sj from other userids idj ∈ (ID – {idi}). [sent-53, score-1.272]

20 All the similarity scores are then aggregated and used to rank the userids in ID – {idi}. [sent-55, score-0.533]

21 Note that partitioning the documents of a userid idi into the query set Qi and the sample set Si is crucial here. [sent-57, score-0.949]

22 If so and we get candid-iden(idi) = idj, we will definitely get candid-iden(idj) = idi since the similarity function is symmetric. [sent-59, score-0.329]
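
A minimal Python sketch of this two-step flow, assuming a candid_iden-style function that returns the most likely candidate userid for a given userid; all names here (split_documents, identify_same_author) are illustrative, not from the paper:

    def split_documents(docs, n_queries):
        # Partition a userid's documents Di into a query set Qi and a sample set Si.
        return docs[:n_queries], docs[n_queries:]

    def identify_same_author(ids_to_docs, candid_iden, n_queries=9):
        # candid_iden(uid, splits) is a hypothetical function returning the
        # userid most likely to share an author with uid.
        splits = {uid: split_documents(docs, n_queries)
                  for uid, docs in ids_to_docs.items()}
        pairs = []
        for id_i in splits:
            id_j = candid_iden(id_i, splits)  # step 1: candidate identification
            id_k = candid_iden(id_j, splits)  # step 2: candidate confirmation
            if id_k == id_i:                  # confirmed: likely the same author
                pairs.append((id_i, id_j))
        return pairs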

23 Specifically, in LSS, each document d is first represented with a document space vector (called a d-vector) based on the document itself as in the traditional classification learning of AA. [sent-67, score-0.328]

24 sv consists of a set of similarity values between document d (in a d-vector) and query q (in a d-vector): sv = Sim(d, q), where Sim is a similarity function comprising a set of similarity measures. [sent-71, score-0.486]

25 Thus, the d-vector for document d in the document space is transformed to an s-vector sv for d in the similarity space. [sent-72, score-0.292]
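
A sketch of what a Sim function could look like; the concrete measures below (cosine over token counts, Jaccard overlap, a length ratio) are stand-in examples, since the paper defines its own set of similarity measures:

    import math
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two token-count vectors (Counters).
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def sim(d_tokens, q_tokens):
        # Maps a (document, query) pair of token lists to an s-vector of
        # similarity values; the chosen features are illustrative only.
        d, q = Counter(d_tokens), Counter(q_tokens)
        jaccard = len(set(d) & set(q)) / max(len(set(d) | set(q)), 1)
        ratio = min(len(d_tokens), len(q_tokens)) / max(len(d_tokens), len(q_tokens), 1)
        return [cosine(d, q), jaccard, ratio]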

26 We also have two non-query documents: d1, which is written by the author of query q, and d2, which is not. [sent-75, score-0.568]

27 Class 1 means “written by author of query q”, also called q-positive, and class -1 means “not written by author of query q”, also called q-negative. [sent-84, score-0.568]

28 In this formulation, a test userid and his/her documents do not have to be seen in training as long as a set of known documents from this userid is available. [sent-86, score-0.99]

29 The resulting classifier is employed to compute a score for each review to be used in the two-step algorithm above to find the candidate for each userid and then the userids with the same authors. [sent-89, score-0.972]

30 Due to the use of query documents, the LSS formulation has some resemblance to document ranking based on learning to rank (Li, 2011; Liu, 2011). [sent-90, score-0.261]

31 However, classification will not return any document if the desired documents do not exist in the test data (unless there are classification errors). [sent-93, score-0.263]

32 Using online review as the application domain, we conduct experiments on a large number of reviews and their author/reviewer userids from Amazon. [sent-95, score-0.619]

33 Since we use online reviews as our experiment domain, our work is related to fake review detection (Jindal and Liu, 2008) as imposters can use multiple userids to post fake reviews. [sent-137, score-0.806]

34 However, none of them identifies userids belonging to the same person. [sent-145, score-0.479]

35 Test data: We are given:  A query q from query author (userid) aq;  A set of test documents DT = {dt1, …, dtm}. [sent-150, score-0.594]

36 ii) Unlike traditional supervised classification, here the test query author aq does not have to be used in training. [sent-153, score-0.376]

37 Training document representation: As noted earlier, each document is represented with a similarity vector (s-vector) computed using a similarity function. [sent-156, score-0.25]

38 For each query qij ∈ Qi // produce positive s-training examples [sent-159, score-0.281]

39 select a set of documents DRij ⊆ DRi – {qij} from author ari [sent-160, score-0.268]

40 Sim takes a query document and a non-query document and produces a vector of similarity values or s-features to represent the non-query document. [sent-170, score-0.412]

41 We randomly select a small set of queries Qi from documents DRi of each author ari (lines 1 and 2). [sent-177, score-0.339]

42 For each query qij ∈ Qi (line 3), it selects a set of documents DRij also from DRi (excluding qij) of the same author (line 4) to be the positive documents for qij, called q-positive and labeled 1. [sent-178, score-0.617]

43 Then, for each document drijk in DRij, a q-positive s-training example with the label 1 is generated for drijk by computing the similarities of qij and drijk using the similarity function Sim (lines 5, 6). [sent-179, score-0.477]

44 In line 7, it selects a set of documents DRij,rest from other authors to be the negative documents for qij, called q-negative and labeled -1. [sent-180, score-0.263]

45 For each document drijk,rest in DRij,rest (line 8), a q-negative s-training example with label -1 is generated for drijk by computing the similarities of qij and drijk,rest using the similarity function Sim. [sent-181, score-0.281]
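
Putting sentences 38-45 together, the s-training data generation (Figure 2) can be sketched as follows; sim is a similarity function as above, and names such as build_s_training are illustrative:

    import random

    def build_s_training(authors_to_docs, sim, n_queries=1, n_neg_per_author=2):
        # authors_to_docs: author id -> list of tokenized documents (DRi).
        X, y = [], []
        for author, docs in authors_to_docs.items():
            for q in random.sample(docs, n_queries):               # lines 1-3
                for d in docs:                                     # lines 4-6: q-positive
                    if d is not q:
                        X.append(sim(d, q)); y.append(1)
                for other, other_docs in authors_to_docs.items():  # lines 7-8: q-negative
                    if other == author:
                        continue
                    k = min(n_neg_per_author, len(other_docs))
                    for d in random.sample(other_docs, k):
                        X.append(sim(d, q)); y.append(-1)
        return X, y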

46 For each document set Di of idi ∈ ID do // step 1: candidate identification [sent-184, score-0.398]

47 idj = candid-iden(idi, ID), i < j; // step 2: candidate confirmation [sent-185, score-0.463]

48 If k = i then idi and idj are from the same author [sent-187, score-0.835]

49 else idi and idj are not from the same author. [Figure 3: Identifying userids from the same authors] Function candid-iden(idi, ID) [sent-188, score-1.352]

50 For each sample document set Sj of idj ∈ ID – {idi} do [sent-189, score-0.553]

51 score // Four methods to decide which idj is the candidate for idi [sent-206, score-0.718]

52 If for all idj ∈ ID – {idi}, pcount[idj] = 0 then [sent-207, score-0.418]

53 The classes are 1 (q-positive meaning “written by author of query qij”) and -1 (q-negative meaning “not written by author of query qij”). [sent-220, score-0.568]

54 For easy presentation, we assume that there are k queries in every Qi, p documents in every DRij, and u documents in every DRij,rest. [sent-222, score-0.265]

55 Given a query q from author aq and a set of test documents DT, each test document dti is converted to an s-vector svi = Sim(dti, q). [sent-232, score-0.625]

56 To reflect that svi is computed based on query q from author aq, an s-test case is thus represented as <(aq, q), (svi, ?)>. [sent-233, score-0.335]
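
At test time this is a small amount of code; clf below is assumed to be any trained classifier exposing scikit-learn's decision_function interface, which is an assumption for illustration rather than the paper's exact setup:

    def score_test_docs(clf, sim, q_tokens, test_docs):
        # svi = Sim(dti, q) for each test document dti, then score the s-vectors.
        s_vectors = [sim(d, q_tokens) for d in test_docs]
        return clf.decision_function(s_vectors)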

57 Lines 1-2 partition the document set Di of each idi in ID = {id1, id2, …, idn}, the set of userids that we are working on. [sent-247, score-0.851]

58 The candid-iden function takes two arguments: the query userid idi and the whole set of userids ID. [sent-254, score-1.294]

59 It classifies each sample ssjf in sample set Sj of idj ∈ ID – {idi} to positive (qi-positive) or negative (qi-negative) (lines 4, 5, 6). [sent-255, score-0.56]

60 We then aggregate the classification results to determine which userid is likely to have the same author as idi. [sent-256, score-0.574]

61 We count the total number of positive classifications of the sample documents of each userid in ID-{idi}. [sent-258, score-0.589]

62 The userid idj with the highest count is the candidate cid, which may share the same author as the query userid idi. [sent-259, score-1.217]

63 Voting: For each sample from userid idj, if it is classified as positive, one vote/count is added to pcount[idj]. [sent-270, score-0.435]

64 The userid with the highest pcount is regarded as the candidate userid, cid (line 15). [sent-271, score-0.586]

65 Lines 13 and 14 mean that if all documents of all userids are classified as negative (pcount[idj] = 0, which also implies psum[idj] = psqsum[idj] = 0), we use method 4 (ScoreMax). [sent-273, score-0.576]

66 ScoreSum: This method works similarly to the voting method above except that instead of counting positive classifications, this method sums up all scores of positive classifications in psum[idj] for each userid (line 9). [sent-275, score-0.519]

67 ScoreSqSum: This method works similarly to ScoreSum above except that it sums up the squared scores of positive classifications in psqsum[idj] for each userid (line 10). [sent-278, score-0.455]

68 ScoreMax: This method works similarly to the voting method as well except that it finds the maximum classification score for the documents of each userid (lines 11 and 12). [sent-281, score-0.566]
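
The four decision methods, including the fall-back to ScoreMax when no document is classified as positive (lines 13-14), can be sketched as follows; names are illustrative, and scores[idj] is assumed to hold the classification scores of idj's sample documents:

    def pick_candidate(scores, method="voting"):
        agg = {}
        for idj, s in scores.items():
            pos = [x for x in s if x > 0]
            if method == "voting":
                agg[idj] = len(pos)                 # count of positive classifications
            elif method == "scoresum":
                agg[idj] = sum(pos)                 # sum of positive scores
            elif method == "scoresqsum":
                agg[idj] = sum(x * x for x in pos)  # sum of squared positive scores
            elif method == "scoremax":
                agg[idj] = max(s)                   # maximum score, possibly negative
        if method != "scoremax" and all(v == 0 for v in agg.values()):
            return pick_candidate(scores, "scoremax")  # fall back to ScoreMax
        return max(agg, key=agg.get)                   # the candidate cid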

69 Since s-features are calculated using d-features of a non-query document and a query document, we discuss d-features first; these are extracted from each document itself. [sent-284, score-0.379]

70 (lrd) denote the average word, sentence, and document length respectively, either in query q or non-query document d. [sent-320, score-0.358]

71 The three formulae are given in Table 3, where f(t, s) is the frequency count of token t in a document s, and lq and ld are the average document length of the query and non-query document, respectively. [sent-327, score-0.362]

72 However, since they are randomly selected from a large number of userids, the probability that two sampled userids belong to the same person is very small. [sent-351, score-0.502]

73 Thus, it should be safe to assume that each userid here represents a unique author. [sent-352, score-0.398]

74 Training data: We randomly choose 1 (one) review for each author as the query and all of his/her other reviews as q-positive reviews. [sent-353, score-0.424]

75 The q-negative reviews consist of reviews randomly selected from the other 730 authors, two reviews per author. [sent-354, score-0.294]

76 The purpose is to simulate the situation where there are two userids idia and idib from the same author ai. [sent-361, score-0.753]

77 Our objective is that given one userid idia and its query set, we want to find the other userid idib from the same author. [sent-362, score-1.07]

78 For the review subset of idia (or idib), we randomly select 9 reviews as the query set and another 10 reviews as the sample set for the userid. [sent-363, score-0.478]

79 We don’t use more queries or sample reviews from each author since in the review domain most authors do not have many reviews (Jindal and Liu, 2008). [sent-365, score-0.526]
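
A rough sketch of this test-set construction under the stated 9-query/10-sample split; the function name and the returned dictionary layout are my own:

    import random

    def make_test_userids(author_reviews, n_query=9, n_sample=10):
        # Split one author's reviews into two pseudo-userids (id_a, id_b),
        # each with its own query set and sample set.
        reviews = random.sample(author_reviews, 2 * (n_query + n_sample))
        half = len(reviews) // 2

        def split(part):
            return {"queries": part[:n_query], "samples": part[n_query:]}

        return {"id_a": split(reviews[:half]), "id_b": split(reviews[half:])}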

80 For example, T50_Q9S10 stands for a test data with 50 userids, and for each userid, 9 reviews are selected as queries and 10 reviews are selected as samples. [sent-368, score-0.267]

81 In the fake review detection research, researchers have manually labeled fake reviews and reviewers (Yoo and Gretzel 2009; Lim et al. [sent-372, score-0.306]

82  Type I: Identify two userids that belong to the same author. [sent-380, score-0.502]

83 In each iteration, we plant one userid of an author in the test set and use the other userid of the same author as the query userid. [sent-382, score-1.222]

84 Test userids {id1a, …, id(i-1)a, idib, …, idma} and their corresponding sample review sets {S1a, …, S(i-1)a, Sib, …, Sma}. [sent-385, score-0.558]

85 Note that the query userid idia and the test userid idib are from the same author. [sent-386, score-1.07]

86 That is, we do not plant any matching userid for the query userid. [sent-391, score-0.54]

87 Test userids {id1a, …, id(i-1)a, id(i+1)a, …, idma} and their sample review sets {S1a, …, S(i-1)a, S(i+1)a, …, Sma}. [sent-395, score-0.558]

88 For each test userid id, we build an SVM classifier based on the one-vs-all scheme. [sent-411, score-0.426]

89 That is, for training we use id’s queries in T*_Q*S10 as the positive documents, and all queries of the other test userids as the negative documents. [sent-413, score-0.648]

90 For example, if the test data has 100 userids, the queries of the other 99 userids serve as the negative documents. [sent-415, score-0.479]

91 Note that TSL cannot use the 731 userids for training as in LSS because they do not appear in the test data. [sent-416, score-0.479]

92 In testing, userid id’s sample (non-query) documents in T*_Q*S10 are used as positive documents, and the sample documents of all other test userids are used as negative documents. [sent-417, score-1.172]
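
A rough sketch of this TSL baseline using scikit-learn; the tf-idf features are a simplification (the paper's baseline has its own feature set), and the function name is hypothetical:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    def tsl_scores(target_queries, other_queries, sample_docs):
        # One-vs-all SVM for one test userid: its queries are the positive
        # documents, all other test userids' queries are the negative documents.
        vec = TfidfVectorizer()
        X = vec.fit_transform(target_queries + other_queries)
        y = [1] * len(target_queries) + [-1] * len(other_queries)
        clf = LinearSVC().fit(X, y)
        return clf.decision_function(vec.transform(sample_docs))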

93 We then apply the same 4 strategies to decide the final author attribution, except voting, since cosine similarity does not produce classifications. [sent-422, score-0.357]

94 We use 9 queries per userid in all other experiments. [sent-453, score-0.469]

95 The main reason again is that more samples from a userid give more identifying information about the userid. [sent-459, score-0.398]

96 We use 10 test documents (samples) per userid in all experiments. [sent-460, score-0.495]

97 In one-vs-all learning, the negative training data actually contain positive documents, i.e., documents written by the same author (under another userid) as the positive data, which confuses the classifier. [sent-505, score-0.691]

98 8 Conclusion This paper proposed a novel method to identify userids that may be from the same author. [sent-514, score-0.479]

99 This learning method is able to better determine whether a document may be written by a known author, although no document from the author has been used in training (as long as we have some documents from the author to serve as queries). [sent-516, score-0.577]

100 Short text authorship attribution via sequence kernels, Markov chains and author unmasking: an investigation. [sent-712, score-0.384]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('userids', 0.479), ('idj', 0.418), ('userid', 0.398), ('idi', 0.275), ('lss', 0.173), ('authorship', 0.155), ('query', 0.142), ('author', 0.142), ('tsl', 0.133), ('qij', 0.112), ('document', 0.098), ('reviews', 0.098), ('documents', 0.097), ('cid', 0.092), ('attribution', 0.087), ('fake', 0.083), ('sim', 0.082), ('scoresqsum', 0.082), ('drijk', 0.071), ('idib', 0.071), ('pcount', 0.071), ('queries', 0.071), ('aq', 0.071), ('richness', 0.064), ('drij', 0.061), ('idia', 0.061), ('qi', 0.061), ('id', 0.056), ('literary', 0.054), ('similarity', 0.054), ('dri', 0.053), ('psqsum', 0.051), ('psum', 0.051), ('scoremax', 0.051), ('simad', 0.051), ('simug', 0.051), ('svi', 0.051), ('shlomo', 0.045), ('type', 0.044), ('ii', 0.043), ('sv', 0.042), ('review', 0.042), ('dfeatures', 0.041), ('novak', 0.041), ('scoresum', 0.041), ('ssjf', 0.041), ('halteren', 0.04), ('authors', 0.038), ('sample', 0.037), ('rewrite', 0.037), ('voting', 0.037), ('cosine', 0.037), ('jindal', 0.034), ('classification', 0.034), ('deceptive', 0.032), ('argamon', 0.032), ('graham', 0.031), ('line', 0.031), ('aixdi', 0.031), ('hedegaard', 0.031), ('qia', 0.031), ('simc', 0.031), ('classifications', 0.03), ('ari', 0.029), ('decision', 0.028), ('bing', 0.028), ('classifier', 0.028), ('sj', 0.027), ('stylistic', 0.027), ('koppel', 0.027), ('ott', 0.027), ('liu', 0.027), ('positive', 0.027), ('idk', 0.027), ('sanderson', 0.027), ('candidate', 0.025), ('di', 0.025), ('idm', 0.025), ('gamon', 0.024), ('mosteller', 0.024), ('formulae', 0.024), ('narayanan', 0.024), ('dti', 0.024), ('hirst', 0.024), ('lines', 0.024), ('belong', 0.023), ('tokens', 0.023), ('supervised', 0.021), ('schler', 0.021), ('post', 0.021), ('formulation', 0.021), ('aidxi', 0.02), ('confirmation', 0.02), ('dfeature', 0.02), ('feiguina', 0.02), ('hapax', 0.02), ('iadrjg', 0.02), ('idarjg', 0.02), ('idma', 0.02), ('nonquery', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 95 emnlp-2013-Identifying Multiple Userids of the Same Author

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective.

2 0.1840736 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

3 0.12601726 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching

Author: Ahmed Hassan

Abstract: Web search users frequently modify their queries in hope of receiving better results. This process is referred to as “Query Reformulation”. Previous research has mainly focused on proposing query reformulations in the form of suggested queries for users. Some research has studied the problem of predicting whether the current query is a reformulation of the previous query or not. However, this work has been limited to bag-of-words models where the main signals being used are word overlap, character level edit distance and word level edit distance. In this work, we show that relying solely on surface level text similarity results in many false positives where queries with different intents yet similar topics are mistakenly predicted as query reformulations. We propose a new representation for Web search queries based on identifying the concepts in queries and show that we can significantly improve query reformulation performance using features of query concepts.

4 0.10551548 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

5 0.098488718 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals

Author: Jiwei Li ; Myle Ott ; Claire Cardie

Abstract: Recent work has developed supervised methods for detecting deceptive opinion spam— fake reviews written to sound authentic and deliberately mislead readers. And whereas past work has focused on identifying individual fake reviews, this paper aims to identify offerings (e.g., hotels) that contain fake reviews. We introduce a semi-supervised manifold ranking algorithm for this task, which relies on a small set of labeled individual reviews for training. Then, in the absence of gold standard labels (at an offering level), we introduce a novel evaluation procedure that ranks artificial instances of real offerings, where each artificial offering contains a known number of injected deceptive reviews. Experiments on a novel dataset of hotel reviews show that the proposed method outperforms state-of-art learning baselines.

6 0.081813462 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries

7 0.066446163 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

8 0.062169239 24 emnlp-2013-Application of Localized Similarity for Web Documents

9 0.059726026 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

10 0.057255153 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

11 0.054668222 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

12 0.053311877 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering

13 0.052269332 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

14 0.048417732 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews

15 0.039640997 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

16 0.03942975 61 emnlp-2013-Detecting Promotional Content in Wikipedia

17 0.03901108 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

18 0.037176173 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?

19 0.035577804 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

20 0.035340723 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.127), (1, 0.046), (2, -0.085), (3, -0.028), (4, 0.027), (5, -0.016), (6, 0.077), (7, 0.12), (8, 0.052), (9, -0.085), (10, -0.062), (11, 0.078), (12, -0.077), (13, -0.023), (14, 0.112), (15, 0.036), (16, -0.207), (17, -0.167), (18, -0.033), (19, -0.036), (20, -0.149), (21, 0.049), (22, 0.005), (23, -0.12), (24, -0.163), (25, 0.151), (26, 0.002), (27, -0.05), (28, 0.033), (29, -0.002), (30, 0.075), (31, -0.068), (32, -0.008), (33, 0.116), (34, -0.058), (35, -0.057), (36, -0.019), (37, 0.06), (38, -0.015), (39, -0.07), (40, -0.041), (41, 0.122), (42, -0.081), (43, 0.08), (44, 0.149), (45, -0.079), (46, 0.078), (47, -0.007), (48, -0.123), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92839509 95 emnlp-2013-Identifying Multiple Userids of the Same Author

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective.

2 0.71776426 37 emnlp-2013-Automatically Identifying Pseudepigraphic Texts

Author: Moshe Koppel ; Shachar Seidman

Abstract: The identification of pseudepigraphic texts – texts not written by the authors to which they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.

3 0.62049186 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

4 0.47575173 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals

Author: Jiwei Li ; Myle Ott ; Claire Cardie

Abstract: Recent work has developed supervised methods for detecting deceptive opinion spam— fake reviews written to sound authentic and deliberately mislead readers. And whereas past work has focused on identifying individual fake reviews, this paper aims to identify offerings (e.g., hotels) that contain fake reviews. We introduce a semi-supervised manifold ranking algorithm for this task, which relies on a small set of labeled individual reviews for training. Then, in the absence of gold standard labels (at an offering level), we introduce a novel evaluation procedure that ranks artificial instances of real offerings, where each artificial offering contains a known number of injected deceptive reviews. Experiments on a novel dataset of hotel reviews show that the proposed method outperforms state-of-art learning baselines.

5 0.44728443 178 emnlp-2013-Success with Style: Using Writing Style to Predict the Success of Novels

Author: Vikas Ganjigunte Ashok ; Song Feng ; Yejin Choi

Abstract: Predicting the success of literary works is a curious question among publishers and aspiring writers alike. We examine the quantitative connection, if any, between writing style and successful literature. Based on novels over several different genres, we probe the predictive power of statistical stylometry in discriminating successful literary works, and identify characteristic stylistic elements that are more prominent in successful writings. Our study reports for the first time that statistical stylometry can be surprisingly effective in discriminating highly successful literature from less successful counterpart, achieving accuracy up to 84%. Closer analyses lead to several new insights into characteristics of the writing style in successful literature, including findings that are contrary to the conventional wisdom with respect to good writing style and readability.

6 0.44429889 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching

7 0.40688407 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

8 0.39396659 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews

9 0.36467722 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

10 0.35794762 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries

11 0.31963268 24 emnlp-2013-Application of Localized Similarity for Web Documents

12 0.29157153 61 emnlp-2013-Detecting Promotional Content in Wikipedia

13 0.2801491 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

14 0.27753446 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

15 0.27361667 26 emnlp-2013-Assembling the Kazakh Language Corpus

16 0.2582702 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching

17 0.24748042 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

18 0.23792301 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs

19 0.21204044 54 emnlp-2013-Decipherment with a Million Random Restarts

20 0.20498219 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.026), (18, 0.022), (22, 0.041), (29, 0.016), (30, 0.068), (40, 0.024), (50, 0.012), (51, 0.173), (59, 0.342), (66, 0.027), (71, 0.033), (73, 0.045), (75, 0.013), (77, 0.011), (90, 0.015), (95, 0.011), (96, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83906341 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers

Author: Franz Matthies ; Anders Søgaard

Abstract: Nilsson and Nivre (2009) introduced a tree-based model of persons’ eye movements in reading. The individual variation between readers reportedly made application across readers impossible. While a tree-based model seems plausible for eye movements, we show that competitive results can be obtained with a linear CRF model. Increasing the inductive bias also makes learning across readers possible. In fact we observe next-to-no performance drop when evaluating models trained on gaze records of multiple readers on new readers.

2 0.71269011 30 emnlp-2013-Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora

Author: Ramy Eskander ; Nizar Habash ; Owen Rambow

Abstract: We present a method for automatically learning inflectional classes and associated lemmas from morphologically annotated corpora. The method consists of a core languageindependent algorithm, which can be optimized for specific languages. The method is demonstrated on Egyptian Arabic and German, two morphologically rich languages. Our best method for Egyptian Arabic provides an error reduction of 55.6% over a simple baseline; our best method for German achieves a 66.7% error reduction.

same-paper 3 0.7073006 95 emnlp-2013-Identifying Multiple Userids of the Same Author

Author: Tieyun Qian ; Bing Liu

Abstract: This paper studies the problem of identifying users who use multiple userids to post in social media. Since multiple userids may belong to the same author, it is hard to directly apply supervised learning to solve the problem. This paper proposes a new method, which still uses supervised learning but does not require training documents from the involved userids. Instead, it uses documents from other userids for classifier building. The classifier can be applied to documents of the involved userids. This is possible because we transform the document space to a similarity space and learning is performed in this new space. Our evaluation is done in the online review domain. The experimental results using a large number of userids and their reviews show that the proposed method is highly effective.

4 0.62877417 143 emnlp-2013-Open Domain Targeted Sentiment

Author: Margaret Mitchell ; Jacqui Aguilar ; Theresa Wilson ; Benjamin Van Durme

Abstract: We propose a novel approach to sentiment analysis for a low resource setting. The intuition behind this work is that sentiment expressed towards an entity, targeted sentiment, may be viewed as a span of sentiment expressed across the entity. This representation allows us to model sentiment detection as a sequence tagging problem, jointly discovering people and organizations along with whether there is sentiment directed towards them. We compare performance in both Spanish and English on microblog data, using only a sentiment lexicon as an external resource. By leveraging linguistically-informed features within conditional random fields (CRFs) trained to minimize empirical risk, our best models in Spanish significantly outperform a strong baseline, and reach around 90% accuracy on the combined task of named entity recognition and sentiment prediction. Our models in English, trained on a much smaller dataset, are not yet statistically significant against their baselines.

5 0.50722432 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

6 0.49102008 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

7 0.49096879 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

8 0.49062246 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

9 0.49011961 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

10 0.48926076 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

11 0.48858628 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

12 0.48848242 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

13 0.48775128 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

14 0.48709944 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

15 0.48683381 152 emnlp-2013-Predicting the Presence of Discourse Connectives

16 0.4866479 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

17 0.48633289 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

18 0.48570102 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

19 0.48562706 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

20 0.48472524 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing