acl acl2013 acl2013-301 knowledge-graph by maker-knowledge-mining

301 acl-2013-Resolving Entity Morphs in Censored Data

Source: pdf

Author: Hongzhao Huang ; Zhen Wen ; Dian Yu ; Heng Ji ; Yizhou Sun ; Jiawei Han ; He Li

Abstract: In some societies, internet users have to create information morphs (e.g. “Peace West King” to refer to “Bo Xilai”) to avoid active censorship or achieve other communication goals. In this paper we aim to solve a new problem of resolving entity morphs to their real targets. We exploit temporal constraints to collect cross-source comparable corpora relevant to any given morph query and identify target candidates. Then we propose various novel similarity measurements including surface features, meta-path based semantic features and social correlation features and combine them in a learning-to-rank framework. Experimental results on Chinese Sina Weibo data demonstrate that our approach is promising and significantly outperforms baseline methods.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In some societies, internet users have to create information morphs (e. [sent-10, score-0.422]

2 In this paper we aim to solve a new problem of resolving entity morphs to their real targets. [sent-13, score-0.491]

3 We exploit temporal constraints to collect cross-source comparable corpora relevant to any given morph query and identify target candidates. [sent-14, score-0.85]

4 Then we propose various novel similarity measurements including surface features, meta-path based semantic features and social correlation features and combine them in a learning-to-rank framework. [sent-15, score-0.409]

5 1 Introduction Language constantly evolves to maximize communicative success and expressive power in daily social interactions. [sent-17, score-0.229]

6 The proliferation of online social media significantly expedites this evolution, as new phrases triggered by social events may be disseminated rapidly in social media. [sent-18, score-0.67]

7 To automatically analyze such fast evolving language in social media, new computational models are demanded. [sent-19, score-0.199]

8 For example, when Chinese online users talk about the former politician “Bo Xilai”, they use a morph “Peace West King” instead, a historical figure from four hundred years ago who governed the same region as Bo. [sent-28, score-0.671]

9 However, social network plays an important role in generating morphs. [sent-31, score-0.295]

10 Usually morphs are generated by harvesting the collective wisdom of the crowd to achieve certain communication goals. [sent-32, score-0.357]

11 Aside from the purpose of avoiding censorship, other motivations for using morph include expressing sarcasm/irony, positive/negative sentiment or making descriptions more vivid toward some entities or events. [sent-33, score-0.671]

12 We can see that a morph can be either a regular term with new meaning or a newly created term. [sent-35, score-0.632]

13 We believe that successful resolution of morphs is a crucial step for automated understanding of the fast evolving social media language, which is important for social media marketing (Barwise and Meehan, 2010). [sent-36, score-0.915]

14 [© 2013 Association for Computational Linguistics, pages 1083–1093] However, morph resolution in social media is challenging due to the following reasons. [sent-42, score-0.943]

15 First, the sensitive real targets that exist in the same data source under active censorship are often automatically filtered. [sent-43, score-0.259]

16 Table 2 presents the distributions of some examples of morphs and their targets in English Twitter and Chinese Sina Weibo. [sent-44, score-0.392]

17 Thus, the co-occurrence of a morph and its target is quite low in the vast amount of information in social media. [sent-46, score-0.912]

18 Second, most morphs were not created based on pronunciations, spellings or other encryptions of their original targets. [sent-47, score-0.357]

19 “Peace West King” as morph of “Bo Xilai”) and thus very difficult to capture based on typical lexical features. [sent-50, score-0.632]

20 Our approach builds and analyzes heterogeneous information networks from multiple sources, such as Twitter, Sina Weibo and web documents in formal genre (e. [sent-55, score-0.202]

21 news) because a morph and its target tend to appear in similar contexts. [sent-57, score-0.713]

22 We propose two new similarity measures, as well as integrating temporal information into the similarity measures to generate global semantic features. [sent-58, score-0.269]

23 • We model social user behaviors and use social correlation to assist in measuring semantic similarities because the users who posted a morph and its corresponding target tend to share similar interests and opinions. [sent-59, score-0.99]

24 2 Approach Overview [Figure 1: Overview of Morph Resolution] Given a morph query m, the goal of morph resolution is to find its real target. [sent-62, score-1.382]

25 In this paper we collect comparable censored data from Weibo and uncensored data from Twitter and Web documents such as news articles. [sent-70, score-0.266]

26 We explore various features including surface, semantic and social features, and incorporate them into a learning to rank framework. [sent-73, score-0.238]

27 3 Target Candidate Identification The general goal of the first step is to identify a list of target candidates for each morph query from the comparable corpora including Sina Weibo, Chinese News websites and English Twitter. [sent-75, score-0.799]

28 In addition, morphs are not limited to named entity forms. [sent-77, score-0.398]

29 The intuition is that a morph m and its real target e should have similar temporal distributions in terms of their occurrences. [sent-79, score-0.863]

30 , tmZm } be the set of temporal slots each morph m occurs, and Te = {te1, te2, . [sent-85, score-0.728]
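The temporal intuition above can be illustrated with a minimal sketch. The exact constraint is cut off in this extract, so the rule below (keep a candidate if any of its occurrence slots lies within a fixed window of some morph slot) is an assumption, as are the names `temporally_compatible`, `filter_candidates` and the `window_days` parameter.

```python
from datetime import date, timedelta

def temporally_compatible(morph_slots, cand_slots, window_days=7):
    """True if some candidate slot lies within window_days of some morph slot.

    Assumed rule for illustration; the source text truncates the real one.
    """
    window = timedelta(days=window_days)
    return any(abs(tm - te) <= window for tm in morph_slots for te in cand_slots)

def filter_candidates(morph_slots, candidates, window_days=7):
    """Keep the candidates (name -> set of dates) that pass the temporal check."""
    return [name for name, slots in candidates.items()
            if temporally_compatible(morph_slots, slots, window_days)]

# Hypothetical toy data: T_m for one morph, T_e for two candidates.
t_m = {date(2012, 5, 10)}
t_e = {"candidate_a": {date(2012, 5, 12)}, "candidate_b": {date(2012, 6, 30)}}
print(filter_candidates(t_m, t_e))
```

Only the candidate whose occurrences fall near the morph's occurrences survives the filter; the window size would need tuning on real data.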

31 1 Surface Features We first extract surface features between the morph and the candidate based on measuring orthographic similarity measures which were commonly used in entity coreference resolution (e. [sent-103, score-0.927]

32 1 Motivations Fortunately, although a morph and its target may have very different orthographic forms, they tend to be embedded in similar semantic contexts which involve similar topics and events. [sent-111, score-0.783]

33 Figure 2 presents some example messages under censorship (Weibo) and not under censorship (Twitter and Chinese Daily). [sent-112, score-0.258]

34 [Panels: Weibo (censored); Twitter and Chinese News (uncensored)] Figure 2: Cross-source Comparable Data Example (each morph and target pair is shown in the same color) 4. [sent-128, score-0.713]

35 An information network is homogeneous if and only if there is only one type for both objects and links, and an information network is heterogeneous when the objects are from multiple distinct types or there exist more than one type of links. [sent-133, score-0.373]
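The definition above can be stated as a small check. This is only a sketch: the dictionary representation and the type labels ("M", "E", "co-occur") are illustrative assumptions, not the paper's data structures.

```python
def is_homogeneous(node_types, link_types):
    """node_types: node -> type label; link_types: (u, v) -> type label.

    Homogeneous means exactly one object type and exactly one link type.
    """
    return len(set(node_types.values())) == 1 and len(set(link_types.values())) == 1

def is_heterogeneous(node_types, link_types):
    """Heterogeneous means more than one object type or more than one link type."""
    return len(set(node_types.values())) > 1 or len(set(link_types.values())) > 1

# Hypothetical toy network mixing morph (M) and entity (E) nodes.
nodes = {"Peace West King": "M", "Bo Xilai": "E", "Chongqing": "E"}
links = {("Peace West King", "Chongqing"): "co-occur",
         ("Bo Xilai", "Chongqing"): "co-occur"}
print(is_homogeneous(nodes, links), is_heterogeneous(nodes, links))
```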

36 Unfortunately the state-of-the-art techniques for these tasks still perform poorly on social media in terms of both accuracy and coverage of important information; these sophisticated semantic links all produced negative impact on the target ranking performance. [sent-140, score-0.367]

37 A network schema of such networks is shown in Figure 3. [sent-144, score-0.187]

38 3 Meta-Path-Based Semantic Similarity Measurements Given the constructed network, a straightforward solution for finding the target for a morph is to use link-based similarity search. [sent-148, score-0.754]

39 Therefore, the semantic features generated from neighbors such as the entity “重庆 (Chongqing)” should be treated differently from other types of neighbors such as “人才 (talented people)” . [sent-151, score-0.296]

40 In this work, we propose to measure the similarity of two nodes over heterogeneous networks as shown in Figure 3, by distinguishing neighbors into three types according to the network schema (i. [sent-152, score-0.426]

41 , 2011b), which are defined over heterogeneous networks to extract semantic features. [sent-157, score-0.186]

42 For example, as shown in Figure 3, a morph and its target candidate can be connected by three meta-paths, including “M - E - E”, “M - EV - E”, and “M - NP - E”. [sent-159, score-0.762]

43 We denote the neighbor sets of certain type for a morph m and a target candidate e as Γ(m) and Γ(e), and a meta-path as P. [sent-164, score-0.791]

44 For m and e, the pairwise random walk probability of their neighbors can be represented as two probability vectors, then Kullback-Leibler distance (Hsiung et al. [sent-174, score-0.185]
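A minimal sketch of this comparison follows, assuming one-step random-walk probabilities proportional to edge weights and a small smoothing constant `eps` to keep the divergence finite. Both choices, and the function names, are assumptions made for illustration; the extract does not give the exact formulation.

```python
import math

def walk_probabilities(edge_weights, vocabulary):
    """One-step random-walk probabilities from a node to each neighbor in vocabulary."""
    total = sum(edge_weights.values())
    if total == 0:
        return [0.0 for _ in vocabulary]
    return [edge_weights.get(n, 0.0) / total for n in vocabulary]

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) with additive smoothing (assumed)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical co-occurrence counts with neighbors of one type.
morph_edges = {"Chongqing": 3, "fall": 1}
cand_edges = {"Chongqing": 2, "fall": 2}
vocab = sorted(set(morph_edges) | set(cand_edges))
p = walk_probabilities(morph_edges, vocab)
q = walk_probabilities(cand_edges, vocab)
print(kl_divergence(p, q))
```

A smaller divergence indicates that the morph and the candidate distribute their links over neighbors in a similar way. Note that KL divergence is asymmetric, so the argument order matters.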

45 Beyond the above similarity measures, we also propose to use cosine-similarity-style normalization method to modify common neighbor and pairwise random walk measures so that we can ensure the morph node and the target candidate node are strongly connected and also have similar popularity. [sent-176, score-0.961]
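One way to read the cosine-style normalization for the common neighbor measure is sketched below. The formula, |Γ(m) ∩ Γ(e)| divided by sqrt(|Γ(m)| · |Γ(e)|), is an assumption based on the usual cosine form for sets, since the exact definition is not part of this extract. Γ(m) and Γ(e) are taken to be neighbor sets of a single type along one meta-path.

```python
import math

def common_neighbors(gamma_m, gamma_e):
    """Raw count of shared neighbors."""
    return len(gamma_m & gamma_e)

def normalized_common_neighbors(gamma_m, gamma_e):
    """Cosine-style normalization (assumed form): overlap / sqrt(|Γ(m)| * |Γ(e)|)."""
    if not gamma_m or not gamma_e:
        return 0.0
    return len(gamma_m & gamma_e) / math.sqrt(len(gamma_m) * len(gamma_e))

# Hypothetical entity-type neighbor sets for a morph and a candidate.
gamma_m = {"a", "b", "c", "d"}
gamma_e = {"c", "d", "e", "f"}
print(common_neighbors(gamma_m, gamma_e), normalized_common_neighbors(gamma_m, gamma_e))
```

The normalization penalizes a very popular candidate that shares a few neighbors with the morph only because it has a huge neighbor set.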

46 The above similarity measures can also be applied to homogeneous networks that do not differentiate the neighbor types. [sent-185, score-0.208]

47 4 Global Semantic Feature Generation A morph tends to have higher temporal correlation with its real target, and share more similar topics compared to other irrelevant targets. [sent-188, score-0.829]

48 Therefore, we propose to incorporate temporal information into similarity measures to generate global semantic features. [sent-189, score-0.228]

49 the set of target candidates for each morph m. [sent-199, score-0.758]

50 Integrate Cross Source/Cross Genre Information Due to internet information censorship or surveillance, users may need to use morphs to post sensitive information. [sent-209, score-0.592]

51 In contrast, users are less restricted in some other uncensored social media such as Twitter. [sent-212, score-0.399]

52 )” contains both the morph and the real target 薄熙来 (Bo Xilai). [sent-225, score-0.767]

53 Twitter) to help resolution of sensitive morphs in Weibo. [sent-228, score-0.462]

54 Another difficulty from morph resolution in micro-blogging is that tweets are only allowed to contain maximum 140 characters with a lot of noise and diverse topics. [sent-229, score-0.796]

55 In this work, we also exploit the background web documents from the embedded URLs in tweets to enrich information network construction. [sent-232, score-0.285]

56 After applying the same annotation techniques as tweets for uncensored data sets, sentence-level co-occurrence relations are extracted and integrated into the network as shown in Figure 3. [sent-233, score-0.309]

57 3 Social Features It has been shown that there exists correlation between neighbors in social networks (Anagnostopoulos et al. [sent-235, score-0.411]

58 Because of such social correlation, close social neighbors in social media such as Twitter and Weibo may post similar information, or share similar opinion. [sent-237, score-0.753]

59 Therefore, we can utilize social correlation to assist in resolving morphs. [sent-238, score-0.285]

60 As social correlation can be defined as a function of social distance between a pair of users, we use social distance as a proxy to social correlation in our approach. [sent-239, score-0.89]

61 The social distance between user i and j is defined by considering the degree of separation in their interaction (e. [sent-240, score-0.199]

62 A similar definition has been shown effective in characterizing social distance in social networks extracted from communication data (Lin et al. [sent-243, score-0.455]

63 We integrate social correlation and temporal information to define our social features. [sent-249, score-0.541]

64 The intuition is that when a morph is used by a user, the real target may also appear in the posts by the user or his/her close friends within a certain time period. [sent-250, score-0.767]

65 Let T be the set of temporal slots in which a morph m occurs, Ut be the set of users whose posts include m in slot t where t ∈ T, and Uc be the set of close friends (i. [sent-251, score-0.804]

66 The social features are defined as $s(m, e) = \frac{\sum_{t \in T} f(e, t, U_t, U_c)}{|T|}$. [sent-255, score-0.199]

67 where f(e, t, Ut, Uc) is an indicator function which returns 1 if one of the users in Ut or Uc posts tweets that include the target candidate e within 7 days before t. [sent-256, score-0.269]
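A minimal sketch of this social feature follows, reading the definition as the fraction of morph slots t for which the indicator fires. The post representation (a list of `(user, day, entities)` tuples), the per-slot dictionaries for Ut and Uc, and the choice to count day t itself as inside the 7-day window are all assumptions made for illustration.

```python
from datetime import date, timedelta

def indicator(candidate, t, users, posts, window_days=7):
    """1 if any user in `users` posted `candidate` within window_days before t."""
    start = t - timedelta(days=window_days)
    return int(any(user in users and candidate in entities and start <= day <= t
                   for user, day, entities in posts))

def social_feature(candidate, slots, ut_by_slot, uc_by_slot, posts, window_days=7):
    """Average of the indicator over all temporal slots in which the morph occurs."""
    if not slots:
        return 0.0
    hits = sum(
        indicator(candidate, t,
                  ut_by_slot.get(t, set()) | uc_by_slot.get(t, set()),
                  posts, window_days)
        for t in slots)
    return hits / len(slots)

# Hypothetical toy data: two slots, users who posted the morph, their close friends.
slots = [date(2012, 5, 10), date(2012, 6, 1)]
ut = {date(2012, 5, 10): {"u1"}, date(2012, 6, 1): {"u2"}}
uc = {date(2012, 5, 10): {"u3"}}
posts = [("u3", date(2012, 5, 8), {"Bo Xilai"}),
         ("u2", date(2012, 4, 1), {"Bo Xilai"})]
print(social_feature("Bo Xilai", slots, ut, uc, posts))
```

In the toy data, a close friend mentions the candidate two days before the first slot, while the only mention tied to the second slot is two months old, so one of the two slots contributes.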

68 , 2011a), we then model the probability of linkage prediction between a morph m and its target candidate e as a function incorporating the surface, semantic and social features. [sent-260, score-1.0]

69 The learnt model is used to predict the probability of linking an unseen morph and its target candidate. [sent-263, score-0.713]

70 1 Data and Evaluation Metric We collected 1,553,347 tweets from Chinese Sina Weibo from May 1 to June 30 to construct the censored data set, and retrieved 66,559 web documents from the embedded URLs in tweets as the initial uncensored data set. [sent-267, score-0.483]

71 We asked two native Chinese annotators to analyze the data, and construct a test set consisting of 107 morph entities (81 persons and 26 locations) and their real targets as our references. [sent-269, score-0.76]

72 We verified the references by Web resources including the summary of popular morphs in Wikipedia. [sent-270, score-0.357]

73 In addition, we used 23 sensitive morphs and the entities that appear in the tweets as queries and retrieved 25,128 Chinese tweets from 10% Twitter feeds within the same time period, as well as 7,473 web documents from the embedded URLs and added them into the uncensored data set. [sent-271, score-0.839]

74 To evaluate the system performance, we use leave-one-out cross validation by computing accuracy as Acc@k = Ck / Q, where Ck is the total number of correctly resolved morphs at top k ranked answers, and Q is the total number of morph queries. [sent-272, score-1.037]

75 We consider a morph as correctly resolved at the top k answers if the top k answer set contains the real target of the morph. [sent-273, score-0.767]
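The metric can be sketched as follows: a morph counts as correctly resolved when its real target appears among the top k ranked candidates, and the count is divided by the number of queries. The function name `acc_at_k` and the dictionary layout are illustrative assumptions.

```python
def acc_at_k(rankings, gold, k):
    """rankings: morph -> ranked candidate list; gold: morph -> real target."""
    if not rankings:
        return 0.0
    correct = sum(1 for morph, ranked in rankings.items()
                  if gold[morph] in ranked[:k])
    return correct / len(rankings)

# Hypothetical system output for two morph queries.
rankings = {
    "Peace West King": ["Bo Xilai", "Chongqing"],
    "Qiao Boss": ["George", "Steve Jobs"],
}
gold = {"Peace West King": "Bo Xilai", "Qiao Boss": "Steve Jobs"}
print(acc_at_k(rankings, gold, 1), acc_at_k(rankings, gold, 2))
```

With k = 1 only the first query is resolved; with k = 2 both are, which is why Acc@k is non-decreasing in k.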

76 The poor performance based on surface features shows that morph resolution task is very challenging since 70% of morphs are not orthographically similar to their real targets. [sent-282, score-1.18]

77 Specifically, comparing “HomB” and “HetB”, “HomE” and “HetE”, we can see that the semantic features based on heterogeneous networks have advantages over those based on homogeneous networks. [sent-285, score-0.215]

78 This indicates that capturing both temporal correlations and semantics of morphing simultaneously are important for morph resolution. [sent-291, score-0.728]

79 For example, using only surface features, the real target “乔布斯 (Steve Jobs)” of the morph “乔帮主 (Qiao Boss)” is not top ranked since some other candidates such as “乔治 (George)” are more orthographically similar. [sent-293, score-0.885]

80 2 Cross Source and Cross Genre Information We integrate the cross source information from Twitter, and the cross genre information from web documents into Weibo tweets for information network construction, and extract a new set of semantic features. [sent-299, score-0.443]

81 This is because Weibo dominates our dataset, and in Weibo many of these sensitive morphs are mostly used with their traditional meanings instead of the morph senses. [sent-302, score-1.03]

82 3 Effects of Social Features Table 7 shows that adding social features can improve the best performance achieved so far. [sent-307, score-0.199]

83 One includes a morph “Buhou” and the other includes its target “Bo Xilai”. [sent-326, score-0.713]

84 The gain is small since the combination of all features in the learning to rank framework can already well capture the relationship between a morph and a target candidate. [sent-329, score-0.713]

85 Named entities which co-occur at least δ times with a morph query in the same topic are selected as its target candidates. [sent-341, score-0.752]

86 As shown in Table 9 (K is the number of predefined topics), PLSA is not quite effective mainly because traditional topic modeling approaches do not perform well on short texts from social media. [sent-342, score-0.199]

87 One important aspect affecting the resolution performance is the morph & non-morph ambiguity. [sent-349, score-0.696]

88 We categorize a morph query as “Unique” if the string is mainly used as a morph when it occurs, such as “薄督 (Bodu)” which is used to refer to “Bo Xilai”; otherwise as “Common” (e. [sent-350, score-1.264]

89 We can see that the morphs in “Unique” category have much better resolution performance than those in “Common” category. [sent-354, score-0.421]

90 [Table 10: Performance of Two Categories] We also investigate the effects of popularity of morphs on the resolution performance. [sent-359, score-0.448]

91 6 Related Work To analyze social media behavior under active censorship, (Bamman et al. [sent-367, score-0.247]

92 In contrast, our work goes beyond target identification by resolving implicit morphs to their real targets. [sent-369, score-0.531]

93 We demonstrated that state-of-the-art alias detection methods did not perform well on morph resolution. [sent-374, score-0.765]

94 In this paper we exploit cross-genre information and social correlation to measure semantic similarity. [sent-375, score-0.285]

95 In contrast, our work constructs heterogeneous information networks from unstructured, noisy multi-genre text without explicit entity attributes. [sent-391, score-0.188]

96 7 Conclusion and Future Work To the best of our knowledge, this is the first work of resolving implicit information morphs from the data under active censorship. [sent-392, score-0.396]

97 Both of the Meta-path based and social correlation based semantic similarity measurements are proven powerful and complementary. [sent-394, score-0.361]

98 In addition, automatic identification of candidate morphs is another challenging task, especially when the mentions are ambiguous and can also refer to other real entities. [sent-397, score-0.46]

99 Our ongoing work includes identifying candidate morphs from scratch, as well as discovering morphs for a given target based on anomaly analysis and textual coherence modeling. [sent-398, score-0.844]

100 On the quality of inferring interests from social neighbors. [sent-571, score-0.199]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('morph', 0.632), ('morphs', 0.357), ('social', 0.199), ('weibo', 0.185), ('xilai', 0.177), ('hsiung', 0.145), ('censorship', 0.129), ('uncensored', 0.113), ('neighbors', 0.108), ('tweets', 0.1), ('alias', 0.099), ('buhou', 0.097), ('temporal', 0.096), ('network', 0.096), ('heterogeneous', 0.09), ('censored', 0.081), ('chongqing', 0.081), ('target', 0.081), ('bo', 0.077), ('twitter', 0.071), ('chinese', 0.067), ('peace', 0.066), ('resolution', 0.064), ('sina', 0.059), ('networks', 0.057), ('west', 0.056), ('genre', 0.054), ('ji', 0.054), ('real', 0.054), ('vk', 0.053), ('measures', 0.052), ('candidate', 0.049), ('hete', 0.048), ('homb', 0.048), ('yizhou', 0.048), ('media', 0.048), ('surface', 0.048), ('sun', 0.048), ('cross', 0.048), ('correlation', 0.047), ('candidates', 0.045), ('zhen', 0.043), ('tweet', 0.042), ('king', 0.042), ('comparable', 0.041), ('entity', 0.041), ('similarity', 0.041), ('sensitive', 0.041), ('walk', 0.041), ('heng', 0.04), ('entities', 0.039), ('resolving', 0.039), ('users', 0.039), ('semantic', 0.039), ('wen', 0.039), ('slots', 0.037), ('pairwise', 0.036), ('bollegala', 0.035), ('measurements', 0.035), ('link', 0.035), ('targets', 0.035), ('detection', 0.034), ('tac', 0.034), ('schema', 0.034), ('barwise', 0.032), ('hetb', 0.032), ('holzer', 0.032), ('kld', 0.032), ('retweeting', 0.032), ('simglobal', 0.032), ('simti', 0.032), ('sing', 0.032), ('object', 0.032), ('objects', 0.031), ('documents', 0.031), ('embedded', 0.031), ('jiawei', 0.031), ('daily', 0.03), ('uc', 0.029), ('homogeneous', 0.029), ('neighbor', 0.029), ('ncn', 0.029), ('hongzhao', 0.029), ('anagnostopoulos', 0.029), ('txi', 0.029), ('acc', 0.028), ('strength', 0.028), ('web', 0.027), ('popularity', 0.027), ('adamic', 0.026), ('wagner', 0.026), ('han', 0.026), ('internet', 0.026), ('urls', 0.026), ('grishman', 0.025), ('events', 0.025), ('award', 0.025), ('orthographically', 0.025), ('malicious', 0.025), ('shao', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 301 acl-2013-Resolving Entity Morphs in Censored Data

Author: Hongzhao Huang ; Zhen Wen ; Dian Yu ; Heng Ji ; Yizhou Sun ; Jiawei Han ; He Li

2 0.17214711 240 acl-2013-Microblogs as Parallel Corpora

Author: Wang Ling ; Guang Xiang ; Chris Dyer ; Alan Black ; Isabel Trancoso

Abstract: In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/∼lingwang/utopia.

3 0.14584886 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts

Author: Alexandra Balahur ; Hristo Tanev

Abstract: Nowadays, the importance of Social Media is constantly growing, as people often use such platforms to share mainstream media news and comment on the events that they relate to. As such, people no longer remain mere spectators to the events that happen in the world, but become part of them, commenting on their developments and the entities involved, sharing their opinions and distributing related content. This paper describes a system that links the main events detected from clusters of newspaper articles to tweets related to them, detects complementary information sources from the links they contain and subsequently applies sentiment analysis to classify them into positive, negative and neutral. In this manner, readers can follow the main events happening in the world, both from the perspective of mainstream as well as social media and the public’s perception on them. This system will be part of the EMM media monitoring framework working live and it will be demonstrated using Google Earth.

4 0.13323924 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

Author: Simone Paolo Ponzetto ; Andrea Zielinski

Abstract: unknown-abstract

5 0.11309175 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab

Abstract: Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus complete the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.

6 0.098329432 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context

7 0.093435906 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

8 0.088827476 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

9 0.087619893 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

10 0.087570414 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

11 0.086680248 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

12 0.085933261 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

13 0.085434869 139 acl-2013-Entity Linking for Tweets

14 0.085314021 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

15 0.082912527 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction

16 0.0823742 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and User's Behavior in Social Media

17 0.077766091 339 acl-2013-Temporal Signals Help Label Temporal Relations

18 0.074115239 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

19 0.073489964 219 acl-2013-Learning Entity Representation for Entity Disambiguation

20 0.070080981 33 acl-2013-A user-centric model of voting intention from Social Media

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.193), (1, 0.094), (2, 0.003), (3, -0.002), (4, 0.15), (5, 0.088), (6, 0.043), (7, 0.096), (8, 0.104), (9, -0.07), (10, -0.108), (11, 0.01), (12, 0.084), (13, -0.046), (14, 0.034), (15, 0.01), (16, 0.02), (17, -0.024), (18, -0.025), (19, -0.061), (20, 0.034), (21, -0.039), (22, 0.024), (23, -0.034), (24, -0.027), (25, 0.044), (26, 0.035), (27, 0.004), (28, -0.038), (29, -0.023), (30, -0.024), (31, -0.012), (32, 0.022), (33, -0.007), (34, 0.087), (35, -0.006), (36, 0.014), (37, -0.045), (38, 0.053), (39, -0.041), (40, -0.0), (41, -0.002), (42, 0.021), (43, 0.005), (44, 0.046), (45, -0.023), (46, -0.018), (47, -0.041), (48, -0.035), (49, -0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91135281 301 acl-2013-Resolving Entity Morphs in Censored Data

Author: Hongzhao Huang ; Zhen Wen ; Dian Yu ; Heng Ji ; Yizhou Sun ; Jiawei Han ; He Li

2 0.80399144 45 acl-2013-An Empirical Study on Uncertainty Identification in Social Media Context

Author: Zhongyu Wei ; Junwen Chen ; Wei Gao ; Binyang Li ; Lanjun Zhou ; Yulan He ; Kam-Fai Wong

Abstract: Uncertainty text detection is important to many social-media-based applications since more and more users utilize social media platforms (e.g., Twitter, Facebook, etc.) as information source to produce or derive interpretations based on them. However, existing uncertainty cues are ineffective in social media context because of its specific characteristics. In this paper, we propose a variant of annotation scheme for uncertainty identification and construct the first uncertainty corpus based on tweets. We then conduct experiments on the generated tweets corpus to study the effectiveness of different types of features for uncertainty text identification.

3 0.7921418 146 acl-2013-Exploiting Social Media for Natural Language Processing: Bridging the Gap between Language-centric and Real-world Applications

Author: Simone Paolo Ponzetto ; Andrea Zielinski

Abstract: unknown-abstract

4 0.7516771 20 acl-2013-A Stacking-based Approach to Twitter User Geolocation Prediction

Author: Bo Han ; Paul Cook ; Timothy Baldwin

Abstract: We implement a city-level geolocation prediction system for Twitter users. The system infers a user’s location based on both tweet text and user-declared metadata using a stacking approach. We demonstrate that the stacking method substantially outperforms benchmark methods, achieving 49% accuracy on a benchmark dataset. We further evaluate our method on a recent crawl of Twitter data to investigate the impact of temporal factors on model generalisation. Our results suggest that user-declared location metadata is more sensitive to temporal change than the text of Twitter messages. We also describe two ways of accessing/demoing our system.

5 0.75066274 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab

6 0.67362148 114 acl-2013-Detecting Chronic Critics Based on Sentiment Polarity and User's Behavior in Social Media

7 0.66708767 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

8 0.66182441 115 acl-2013-Detecting Event-Related Links and Sentiments from Social Media Texts

9 0.64141476 33 acl-2013-A user-centric model of voting intention from Social Media

10 0.61693931 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

11 0.60829747 42 acl-2013-Aid is Out There: Looking for Help from Tweets during a Large Scale Disaster

12 0.58044738 240 acl-2013-Microblogs as Parallel Corpora

13 0.5802961 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

14 0.54626292 139 acl-2013-Entity Linking for Tweets

15 0.5233829 138 acl-2013-Enriching Entity Translation Discovery using Selective Temporality

16 0.52060688 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

17 0.52012432 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision

18 0.52006477 30 acl-2013-A computational approach to politeness with application to social factors

19 0.49782747 296 acl-2013-Recognizing Identical Events with Graph Kernels

20 0.45666966 219 acl-2013-Learning Entity Representation for Entity Disambiguation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.052), (6, 0.029), (11, 0.052), (15, 0.021), (24, 0.061), (26, 0.063), (28, 0.012), (35, 0.088), (42, 0.061), (48, 0.035), (70, 0.056), (74, 0.224), (88, 0.033), (90, 0.019), (95, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79967105 301 acl-2013-Resolving Entity Morphs in Censored Data

Author: Hongzhao Huang ; Zhen Wen ; Dian Yu ; Heng Ji ; Yizhou Sun ; Jiawei Han ; He Li

2 0.64132106 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

Author: Angeliki Lazaridou ; Ivan Titov ; Caroline Sporleder

Abstract: We propose a joint model for unsupervised induction of sentiment, aspect and discourse information and show that by incorporating a notion of latent discourse relations in the model, we improve the prediction accuracy for aspect and sentiment polarity on the sub-sentential level. We deviate from the traditional view of discourse, as we induce types of discourse relations and associated discourse cues relevant to the considered opinion analysis task; consequently, the induced discourse relations play the role of opinion and aspect shifters. The quantitative analysis that we conducted indicated that the integration of a discourse model increased the prediction accuracy results with respect to the discourse-agnostic approach and the qualitative analysis suggests that the induced representations encode a meaningful discourse structure.

3 0.6391961 318 acl-2013-Sentiment Relevance

Author: Christian Scheible ; Hinrich Schütze

Abstract: A number of different notions, including subjectivity, have been proposed for distinguishing parts of documents that convey sentiment from those that do not. We propose a new concept, sentiment relevance, to make this distinction and argue that it better reflects the requirements of sentiment analysis systems. We demonstrate experimentally that sentiment relevance and subjectivity are related, but different. Since no large amount of labeled training data for our new notion of sentiment relevance is available, we investigate two semi-supervised methods for creating sentiment relevance classifiers: a distant supervision approach that leverages structured information about the domain of the reviews; and transfer learning on feature representations based on lexical taxonomies that enables knowledge transfer. We show that both methods learn sentiment relevance classifiers that perform well.

4 0.63760054 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

Author: Barbara Plank ; Alessandro Moschitti

Abstract: Relation Extraction (RE) is the task of extracting semantic relationships between entities in text. Recent studies on relation extraction are mostly supervised. The clear drawback of supervised methods is the need for training data: labeled data is expensive to obtain, and there is often a mismatch between the training data and the data the system will be applied to. This is the problem of domain adaptation. In this paper, we propose to combine (i) term generalization approaches such as word clustering and latent semantic analysis (LSA) and (ii) structured kernels to improve the adaptability of relation extractors to new text genres/domains. The empirical evaluation on ACE 2005 domains shows that a suitable combination of syntax and lexical generalization is very promising for domain adaptation.

5 0.63540411 172 acl-2013-Graph-based Local Coherence Modeling

Author: Camille Guinaudeau ; Michael Strube

Abstract: We propose a computationally efficient graph-based approach for local coherence modeling. We evaluate our system on three tasks: sentence ordering, summary coherence rating and readability assessment. The performance is comparable to entity grid based approaches though these rely on a computationally expensive training phase and face data sparsity problems.

6 0.63442123 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

7 0.63440257 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

8 0.63390851 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

9 0.63326341 267 acl-2013-PARMA: A Predicate Argument Aligner

10 0.63322574 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

11 0.63183433 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

12 0.63149416 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

13 0.6312362 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media

14 0.63059676 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

15 0.63051987 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

16 0.62885702 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

17 0.62868452 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

18 0.62835824 224 acl-2013-Learning to Extract International Relations from Political Context

19 0.62810564 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

20 0.62792635 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing