
153 acl-2012-Named Entity Disambiguation in Streaming Data


Source: pdf

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, a few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data, and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20% when compared to the current state-of-the-art biased SVMs, while being more than 120 times faster.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Federal University of Minas Gerais, Computer Science Dept. [sent-8, score-0.046]

2 Federal University of Amazonas [sent-9, score-0.1]

3 Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. [sent-11, score-0.916]

4 This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. [sent-12, score-0.288]

5 High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. [sent-13, score-0.647]

6 On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. [sent-14, score-0.331]

7 Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. [sent-15, score-0.642]

8 Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. [sent-16, score-0.158]

9 We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster. [sent-17, score-0.574]

10 The task of named entity disambiguation is to identify which names refer to the same entity in a textual collection (Sarmento et al. [sent-22, score-1.165]

11 The emergence of new communication technologies, such as micro-blog platforms, brought a humongous amount of textual mentions with ambiguous entity names, raising an urgent need for novel disambiguation approaches and algorithms. [sent-26, score-0.885]

12 In this paper we address the named entity disambiguation task under a particularly challenging scenario. [sent-27, score-0.645]

13 We are given a stream of messages from a micro-blog channel such as Twitter and a list of names n1, n2, . [sent-28, score-0.803]

14 Our problem is to monitor the stream and predict whether an incoming message containing ni indeed refers to e (positive example) or not (negative example). [sent-32, score-0.885]

15 First, micro-blog messages are composed of a small amount of words and they are written in informal, sometimes cryptic style. [sent-34, score-0.359]

16 These characteristics make hard the identification of entities and the semantics of their relationships (Liu et al. [sent-35, score-0.24]

17 Further, the scarcity of text in the messages makes it even harder to properly characterize a common context for the entities. [sent-37, score-0.405]

18 Second, as we need to monitor messages that keep coming at a fast pace, we cannot afford to gather information from external ... Human language is not exact. [sent-38, score-0.618]

19 Finally, fresh data coming in the stream introduces new patterns, quickly invalidating static disambiguation models. [sent-40, score-0.411]

20 ... entity may be referred by multiple names (i.e., polysemy), and also the same name may refer to different ... [sent-42, score-0.542]

21 Footnote 2: Twitter is one of the fastest-growing micro-blog channels, and an authoritative source for breaking news (Jansen et al. Footnote 1: The term entity refers to anything that has a distinct, sepa- [sent-45, score-0.358]

22 The information embedded in such a stream of messages may be exploited for entity disambiguation through the application of supervised learning methods, for instance, with the application of binary classifiers. [sent-51, score-1.163]

23 Such methods, however, suffer from a data acquisition bottleneck, since they are based on training datasets that are built by skilled human annotators who manually inspect the messages. [sent-52, score-0.093]

24 This annotation process is usually lengthy and laborious, being clearly unfeasible to be adopted in data streaming scenarios. [sent-53, score-0.422]

25 As an alternative to such manual process, a large amount of unlabeled data, augmented with a small amount of (likely) positive examples, can be collected automatically from the message stream (Liu et al. [sent-54, score-0.742]

26 Binary classifiers may be learned from such data by considering unlabeled data as negative examples. [sent-58, score-0.351]

27 This strategy, however, leads to classifiers with poor disambiguation performance, due to a potentially large number of false-negative examples. [sent-59, score-0.571]

28 In this paper we propose to refine binary classifiers iteratively, by performing Expectation-Maximization (EM) approaches (Dempster et al. [sent-60, score-0.229]

29 Basically, a partial classifier is used to evaluate the likelihood of an unlabeled example being a positive example or a negative example, thus automatically (and continuously) creating a labeled training corpus. [sent-62, score-0.227]

30 This process continues iteratively by changing the label of some examples (an operation we call label-transition), so that, after some iterations, the combination of labels is expected to converge to the one for which the observed data is most likely. [sent-63, score-0.155]

31 Based on such an approach, we introduce novel disambiguation algorithms that differ among themselves on the granularity in which the classifier is updated, and on the label-transition operations that are allowed. [sent-64, score-0.385]

32 An important feature of the proposed approach is that, at each iteration of the EM-process, a new classifier (an improved one) is produced in order to account for the current set of labeled examples. [sent-65, score-0.122]

33 We introduce a novel strategy to maintain the classifiers up-to-date incrementally after each iteration, or even after each label-transition operation. [sent-66, score-0.204]

34 Indeed, we theoretically show that our classifier needs to be updated just partially and we are able to determine exactly which parts must be updated, making our disambiguation methods extremely fast. [sent-67, score-0.277]

35 To evaluate the effectiveness of the proposed algorithms, we performed a systematic set of experiments using large-scale Twitter data containing messages with ambiguous entity names. [sent-68, score-0.702]

36 In order to validate our claims, disambiguation performance is investigated by varying the proportion of false-negative examples in the unlabeled dataset. [sent-69, score-0.505]

37 Our algorithms are compared against a state-of-the-art technique for named entity disambiguation based on classifiers, providing performance gains ranging from 1% to 20% and being roughly 120 times faster. [sent-70, score-0.691]

38 2 Related Work In the context of databases, traditional entity disambiguation methods rely on similarity functions over attributes associated to the entities (de Carvalho et al. [sent-71, score-0.663]

39 Obviously, such an approach is unfeasible for the scenario we consider here. [sent-73, score-0.158]

40 al (2005) propose graph-based disambiguation methods that generate clusters of coreferent entities using known relationships between entities of several types. [sent-75, score-0.679]

41 Methods to disambiguate person names in e-mail (Minkov et al. [sent-76, score-0.207]

42 In emails, information taken from the header of the messages leads to establish relationships between users and building a co-reference graph. [sent-79, score-0.488]

43 Such graph-based approach could hardly be applied to the context we consider, in which the implied relationships between entities mentioned in a given micro-blog message are not clearly defined. [sent-81, score-0.351]

44 In the case of textual corpora, traditional disambiguation methods represent entity names and their context (Hasegawa et al. [sent-82, score-0.826]

45 , words, phrases and other names occurring near them) as weighted vectors (Bagga and Baldwin, 1998; Pedersen et al. [sent-85, score-0.207]

46 To evaluate whether two names refer to the same entity, these methods compute the similarity between these vectors. [sent-87, score-0.207]

47 Clusters of co-referent names are then built based on such similarity measure. [sent-88, score-0.207]

48 Although effective for the tasks considered in these papers, the simplistic BOW-based approaches they adopt are not suitable for cases in which the context is harder to capture due to the small number of terms available or to informal writing style. [sent-89, score-0.186]

49 To address these problems, some authors argue that contextual information may be enriched with knowledge from external sources, such as search results and the Wikipedia (Cucerzan, 2007; Bunescu and Pasca, 2006; Han and Zhao, 2009). [sent-90, score-0.072]

50 While such a strategy is feasible in an off-line setting, two problems arise when monitoring streams of micro-blog messages. [sent-91, score-0.089]

51 First, gathering information from external sources through the Internet can be costly and, second, informal mentions to named entities make it hard to look for related information in such sources. [sent-92, score-0.505]

52 The disambiguation methods we propose fall into a learning scenario known as PU (positive and unlabeled) learning (Liu et al. [sent-93, score-0.372]

53 , 2000), in which a classifier is built from a set of positive examples plus unlabeled data. [sent-96, score-0.41]

54 Most of the approaches for PU learning, such as the biased-SVM approach (Li and Liu, 2003), are based on extracting negative examples from unlabeled data. [sent-97, score-0.262]

55 We notice that existing approaches for PU learning are not likely to scale given the restrictions imposed by streaming data. [sent-98, score-0.284]

56 Thus, we propose highly incremental approaches, which are able to process large-scale streaming data. [sent-99, score-0.284]

57 3 Disambiguation in Streaming Data Consider a stream of messages from a micro-blog channel such as Twitter and let n1, n2, . [sent-100, score-0.596]

58 , nN be names used for mentioning a specific entity e in these messages. [sent-103, score-0.547]

59 Our problem is to continually monitor the stream and predict whether an incoming message containing ni indeed refers to e or not. [sent-104, score-0.928]

60 In this case, we are given an input data set called the training corpus (denoted as D), which consists of examples of the form <e, m, c>, where e is the entity, m is a message containing the entity name (i. [sent-106, score-0.624]

61 and c ∈ {...} is a binary variable that specifies whether or [sent-110, score-0.071]

62 not the entity name in m refers to the desired real-world entity e. [sent-112, score-0.35]

63 The training corpus is used to produce a classifier that relates textual patterns (i. [sent-113, score-0.138]

64 The test set (denoted as T) consists of a set of ... [sent-116, score-0.041]
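
The refinement loop described in sentences 26 to 30 above (treat unlabeled messages as negative, score them with the current classifier, flip labels, retrain) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: it swaps in a plain Laplace-smoothed naive Bayes classifier that is retrained from scratch at every iteration, whereas the paper proposes finer-grained, incrementally updated classifiers. The function names, the 0.5 threshold, and the toy messages are all hypothetical.

```python
import math
from collections import Counter


def train_nb(examples):
    """Train a naive Bayes model from (tokens, label) pairs, label in {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def prob_positive(model, tokens):
    """Return P(label = 1 | tokens) under the Laplace-smoothed model."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        log_p = math.log((class_counts[label] + 1) / (total + 2))
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokens:
            if token in vocab:  # tokens never seen in training are ignored
                log_p += math.log((word_counts[label][token] + 1) / denom)
        log_scores[label] = log_p
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


def em_refine(positives, unlabeled, max_iters=10, threshold=0.5):
    """Iteratively relabel unlabeled messages, starting from all-negative."""
    labels = [0] * len(unlabeled)  # assumption: unlabeled starts as negative
    for _ in range(max_iters):
        data = [(m, 1) for m in positives] + list(zip(unlabeled, labels))
        model = train_nb(data)
        new_labels = [
            1 if prob_positive(model, m) > threshold else 0 for m in unlabeled
        ]
        if new_labels == labels:  # no label-transition happened: stop
            break
        labels = new_labels
    # Final model trained on the last label assignment.
    final_data = [(m, 1) for m in positives] + list(zip(unlabeled, labels))
    return train_nb(final_data), labels


if __name__ == "__main__":
    # Hypothetical toy messages, already tokenized.
    positives = [["goal", "match", "team"], ["team", "coach", "match"]]
    unlabeled = [
        ["team", "goal", "coach"],
        ["rooster", "farm", "egg"],
        ["farm", "chicken", "rooster"],
        ["match", "goal", "team"],
    ]
    _, labels = em_refine(positives, unlabeled)
    print(labels)
```

Retraining from scratch keeps the sketch short; it is exactly the cost that the incremental update strategy in sentences 33 and 34 is meant to avoid.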


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('disambiguation', 0.306), ('messages', 0.296), ('streaming', 0.284), ('entity', 0.254), ('stream', 0.236), ('names', 0.207), ('message', 0.165), ('classifiers', 0.158), ('monitor', 0.138), ('unlabeled', 0.13), ('comit', 0.116), ('letouzey', 0.116), ('coming', 0.112), ('updated', 0.105), ('pu', 0.103), ('entities', 0.103), ('incoming', 0.101), ('unfeasible', 0.092), ('fresh', 0.092), ('informal', 0.086), ('mentioning', 0.086), ('positive', 0.085), ('named', 0.085), ('twitter', 0.084), ('ni', 0.084), ('relationships', 0.083), ('classifier', 0.079), ('denis', 0.077), ('external', 0.072), ('binary', 0.071), ('examples', 0.069), ('scenario', 0.066), ('channel', 0.064), ('ambiguous', 0.064), ('amount', 0.063), ('negative', 0.063), ('refers', 0.061), ('textual', 0.059), ('leads', 0.059), ('harder', 0.059), ('databases', 0.059), ('sources', 0.057), ('indeed', 0.055), ('hard', 0.054), ('scarcity', 0.05), ('bhattacharya', 0.05), ('ale', 0.05), ('inspect', 0.05), ('ocfc', 0.05), ('hoffart', 0.05), ('yosef', 0.05), ('header', 0.05), ('sisi', 0.05), ('homonymy', 0.05), ('ambiguation', 0.05), ('adriano', 0.05), ('anb', 0.05), ('hasegawa', 0.05), ('urgent', 0.05), ('mentions', 0.048), ('poor', 0.048), ('plus', 0.047), ('strategy', 0.046), ('tse', 0.046), ('ssp', 0.046), ('lengthy', 0.046), ('federal', 0.046), ('channels', 0.046), ('alberto', 0.046), ('ranging', 0.046), ('iteratively', 0.045), ('containing', 0.045), ('extremely', 0.043), ('liu', 0.043), ('compensated', 0.043), ('minkov', 0.043), ('bagga', 0.043), ('coreferent', 0.043), ('monitoring', 0.043), ('fto', 0.043), ('skilled', 0.043), ('expectationmaximization', 0.043), ('continually', 0.043), ('authoritative', 0.043), ('cucerzan', 0.043), ('systematic', 0.043), ('iteration', 0.043), ('nn', 0.041), ('denoted', 0.041), ('clusters', 0.041), ('emails', 0.041), ('simplistic', 0.041), ('wagner', 0.041), ('getoor', 0.041), ('tthse', 0.041), ('pasca', 0.041), ('continues', 0.041), ('platforms', 0.041), ('emergence', 0.041), 
('ooff', 0.041)]
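
For readers unfamiliar with how tf-idf weights like the ones above turn into a paper-to-paper similarity value, here is a small sketch. It assumes cosine similarity over tf-idf vectors with a smoothed idf; the actual pipeline behind this page is not documented here, so the weighting formula, function names, and toy documents are illustrative assumptions only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse tf-idf vector (dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed idf (an assumption): log((1 + N) / (1 + df)) + 1
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical toy "papers", already tokenized.
    docs = [
        ["entity", "disambiguation", "stream"],
        ["entity", "disambiguation", "twitter"],
        ["parse", "tree", "grammar"],
    ]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Under this scheme a document compared with itself scores 1 up to floating-point rounding, which would be consistent with the same-paper row in the list below.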

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 153 acl-2012-Named Entity Disambiguation in Streaming Data

2 0.14338467 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: To discover relation types from text, most methods cluster shallow or syntactic patterns of relation mentions, but consider only one possible sense per pattern. In practice this assumption is often violated. In this paper we overcome this issue by inducing clusters of pattern senses from feature representations of patterns. In particular, we employ a topic model to partition entity pairs associated with patterns into sense clusters using local and global features. We merge these sense clusters into semantic relations using hierarchical agglomerative clustering. We compare against several baselines: a generative latent-variable model, a clustering method that does not disambiguate between path senses, and our own approach but with only local features. Experimental results show our proposed approach discovers dramatically more accurate clusters than models without sense disambiguation, and that incorporating global features, such as the document theme, is crucial.

3 0.10801408 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

Author: Dani Yogatama ; Yanchuan Sim ; Noah A. Smith

Abstract: We present a statistical model for canonicalizing named entity mentions into a table whose rows represent entities and whose columns are attributes (or parts of attributes). The model is novel in that it incorporates entity context, surface features, first-order dependencies among attribute-parts, and a notion of noise. Transductive learning from a few seeds and a collection of mention tokens combines Bayesian inference and conditional estimation. We evaluate our model and its components on two datasets collected from political blogs and sports news, finding that it outperforms a simple agglomerative clustering approach and previous work.

4 0.10257109 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

Author: Enrique Alfonseca ; Katja Filippova ; Jean-Yves Delort ; Guillermo Garrido

Abstract: We describe the use of a hierarchical topic model for automatically identifying syntactic and lexical patterns that explicitly state ontological relations. We leverage distant supervision using relations from the knowledge base FreeBase, but do not require any manual heuristic nor manual seed list selections. Results show that the learned patterns can be used to extract new relations with good precision.

5 0.10243277 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

6 0.10203619 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

7 0.096798293 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

8 0.095504664 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale

9 0.095372267 73 acl-2012-Discriminative Learning for Joint Template Filling

10 0.094814964 160 acl-2012-Personalized Normalization for a Multilingual Chat System

11 0.08656209 216 acl-2012-Word Epoch Disambiguation: Finding How Words Change Over Time

12 0.077821836 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

13 0.076579429 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

14 0.075313047 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

15 0.070593134 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction

16 0.06529431 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

17 0.064500622 42 acl-2012-Bootstrapping via Graph Propagation

18 0.063904479 7 acl-2012-A Computational Approach to the Automation of Creative Naming

19 0.063225001 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System

20 0.062040258 134 acl-2012-Learning to Find Translations and Transliterations on the Web


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.172), (1, 0.133), (2, -0.007), (3, 0.079), (4, 0.081), (5, 0.101), (6, 0.038), (7, 0.046), (8, 0.077), (9, -0.008), (10, 0.138), (11, -0.062), (12, -0.058), (13, 0.039), (14, 0.065), (15, 0.019), (16, -0.017), (17, 0.012), (18, -0.086), (19, -0.045), (20, -0.101), (21, -0.017), (22, -0.017), (23, -0.059), (24, -0.049), (25, 0.061), (26, -0.002), (27, 0.06), (28, 0.01), (29, -0.03), (30, 0.09), (31, -0.042), (32, 0.07), (33, 0.138), (34, -0.01), (35, -0.002), (36, 0.1), (37, 0.115), (38, -0.076), (39, 0.147), (40, 0.029), (41, 0.083), (42, -0.115), (43, 0.092), (44, -0.138), (45, -0.056), (46, -0.158), (47, 0.21), (48, 0.081), (49, -0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97545028 153 acl-2012-Named Entity Disambiguation in Streaming Data

2 0.60155308 160 acl-2012-Personalized Normalization for a Multilingual Chat System

Author: Ai Ti Aw ; Lian Hau Lee

Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if a user uses vocabulary outside the training data and creates short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat.

3 0.54976535 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

Author: Fei Liu ; Fuliang Weng ; Xiao Jiang

Abstract: Social media language contains a huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.

4 0.54914331 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

Author: Xiaohua Liu ; Ming Zhou ; Xiangyang Zhou ; Zhongyang Fu ; Furu Wei

Abstract: Tweets represent a critical source of fresh information, in which named entities occur frequently with rich variations. We study the problem of named entity normalization (NEN) for tweets. Two main challenges are the errors propagated from named entity recognition (NER) and the dearth of information in a single tweet. We propose a novel graphical model to simultaneously conduct NER and NEN on multiple tweets to address these challenges. Particularly, our model introduces a binary random variable for each pair of words with the same lemma across similar tweets, whose value indicates whether the two related words are mentions of the same entity. We evaluate our method on a manually annotated data set, and show that our method outperforms the baseline that handles these two tasks separately, boosting the F1 from 80.2% to 83.6% for NER, and the Accuracy from 79.4% to 82.6% for NEN, respectively.

5 0.54523319 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

Author: Marco Guerini ; Carlo Strapparava ; Oliviero Stock

Abstract: In recent years there has been a growing interest in crowdsourcing methodologies to be used in experimental research for NLP tasks. In particular, evaluation of systems and theories about persuasion is difficult to accommodate within existing frameworks. In this paper we present a new cheap and fast methodology that allows fast experiment building and evaluation with fully-automated analysis at a low cost. The central idea is exploiting existing commercial tools for advertising on the web, such as Google AdWords, to measure message impact in an ecological setting. The paper includes a description of the approach, tips for how to use AdWords for scientific research, and results of pilot experiments on the impact of affective text variations which confirm the effectiveness of the approach.

6 0.52245396 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

7 0.5196805 73 acl-2012-Discriminative Learning for Joint Template Filling

8 0.51600707 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

9 0.4730038 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

10 0.46373552 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

11 0.46117991 216 acl-2012-Word Epoch Disambiguation: Finding How Words Change Over Time

12 0.41867983 7 acl-2012-A Computational Approach to the Automation of Creative Naming

13 0.39016044 186 acl-2012-Structuring E-Commerce Inventory

14 0.38524625 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

15 0.37122902 195 acl-2012-The Creation of a Corpus of English Metalanguage

16 0.36988243 42 acl-2012-Bootstrapping via Graph Propagation

17 0.3690061 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

18 0.36841959 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

19 0.36766744 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

20 0.3657662 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(26, 0.028), (28, 0.032), (39, 0.025), (74, 0.017), (82, 0.011), (85, 0.015), (90, 0.068), (92, 0.038), (99, 0.68)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97882748 169 acl-2012-Reducing Wrong Labels in Distant Supervision for Relation Extraction

Author: Shingo Takamatsu ; Issei Sato ; Hiroshi Nakagawa

Abstract: In relation extraction, distant supervision seeks to extract relations between entities from text by using a knowledge base, such as Freebase, as a source of supervision. When a sentence and a knowledge base refer to the same entity pair, this approach heuristically labels the sentence with the corresponding relation in the knowledge base. However, this heuristic can fail with the result that some sentences are labeled wrongly. This noisy labeled data causes poor extraction performance. In this paper, we propose a method to reduce the number of wrong labels. We present a novel generative model that directly models the heuristic labeling process of distant supervision. The model predicts whether assigned labels are correct or wrong via its hidden variables. Our experimental results show that this model detected wrong labels with higher performance than baseline methods. In the experiment, we also found that our wrong label reduction boosted the performance of relation extraction.

same-paper 2 0.96702188 153 acl-2012-Named Entity Disambiguation in Streaming Data

3 0.92350703 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development

Author: Rafael E. Banchs

Abstract: This paper describes Movie-DiC, a Movie Dialogue Corpus recently collected for research and development purposes. The collected dataset comprises 132,229 dialogues containing a total of 764,146 turns that have been extracted from 753 movies. Details on how the data collection has been created and how it is structured are provided along with its main statistics and characteristics.

4 0.92328221 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

Author: Elena Cabrio ; Serena Villata

Abstract: Blogs and forums are widely adopted by online communities to debate about various issues. However, a user that wants to cut in on a debate may experience some difficulties in extracting the current accepted positions, and can be discouraged from interacting through these applications. In our paper, we combine textual entailment with argumentation theory to automatically extract the arguments from debates and to evaluate their acceptability.

5 0.88175607 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

6 0.8604725 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

7 0.56226617 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

8 0.5519951 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

9 0.54563856 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

10 0.50231779 191 acl-2012-Temporally Anchored Relation Extraction

11 0.49842736 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

12 0.49396443 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

13 0.48480299 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP

14 0.47397298 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

15 0.47278893 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

16 0.46937868 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

17 0.45544302 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

18 0.45052958 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

19 0.44673243 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

20 0.44338581 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base