emnlp emnlp2011 emnlp2011-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
Reference: text
sentIndex sentText sentNum sentScore
1 People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. [sent-3, score-0.159]
2 T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. [sent-7, score-0.511]
3 LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. [sent-8, score-0.28]
4 , 2008), tweets are particularly terse and difficult (See Table 1). [sent-12, score-0.43]
5 Yet tweets provide a unique compilation of information that is more up-to-date and inclusive than news articles, due to the low barrier to tweeting and the proliferation of mobile devices. [sent-13, score-0.395]
6 The corpus of tweets already exceeds … (footnote: see the “trending topics” displayed on twitter). [sent-14, score-0.623]
7 Not surprisingly, the performance of “off the shelf” NLP tools, which were trained on news corpora, is weak on tweet corpora. [sent-17, score-0.142]
8 In response, we report on a re-trained “NLP pipeline” that leverages previously-tagged out-of-domain text, tagged tweets, and unlabeled tweets to achieve more effective part-of-speech tagging, chunking, and named-entity recognition. [sent-18, score-0.463]
9 We find that classifying named entities in tweets is a difficult task for two reasons. [sent-24, score-0.893]
10 First, tweets contain a plethora of distinctive named entity types (Companies, Products, Bands, Movies, and more). [sent-25, score-1.019]
11 Almost all these types (except for People and Locations) are relatively infrequent, so even a large sample of manually annotated tweets will contain few training examples. [sent-26, score-0.454]
12 , 2009) to leverage large amounts of unlabeled data in addition to large dictionaries of entities gathered from Freebase, and combines information about an entity’s context across its mentions. [sent-32, score-0.385]
13 By utilizing in-domain, out-of-domain, and unlabeled data we are able to substantially boost performance, for example obtaining a 52% increase in F1 score on segmenting named entities. [sent-38, score-0.396]
14 This approach increases F1 score by 25% relative to co-training (Blum and Mitchell, 1998; Yarowsky, 1995) on the task of classifying named entities in Tweets. [sent-43, score-0.529]
15 We first present our approaches to shallow syntax: part-of-speech tagging (§2.1) … [sent-46, score-0.152]
16 All tools in §2 are used as features for named entity segmentation in §3. [sent-55, score-0.416]
17 We also discuss a novel capitalization classifier in §2.3. [sent-63, score-0.186]
18 …feature generation for named entity recognition in the next section. [sent-67, score-0.554]
19 2.1 Part of Speech Tagging. Part-of-speech tagging is applicable to a wide range of NLP tasks including named entity segmentation and information extraction. [sent-72, score-0.657]
20 However, the application of a similar baseline on tweets (see Table 2) obtains a much weaker 0. [sent-77, score-0.364]
21 In addition to differences in vocabulary, the grammar of tweets is quite different from edited news text. [sent-98, score-0.395]
22 For instance, tweets often start with a verb (where the subject ‘I’ is implied), as in: “watchng american dad. [sent-99, score-0.364]
23 ” To overcome these differences in style and vocabulary, we manually annotated a set of 800 tweets (16K tokens) with tags from the Penn TreeBank tag set for use as in-domain training data for our POS tagging system, T-POS. [sent-100, score-0.458]
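A minimal sketch of this mixed-domain training setup, assuming the sklearn-crfsuite library and toy stand-ins for the PTB sentences and the 800 PTB-tagged tweets (the released T-POS is also CRF-based, but its feature set, which includes Brown clusters, is richer than shown here):

    import sklearn_crfsuite

    def word_features(sent, i):
        w = sent[i]
        return {
            'lower': w.lower(),
            'suffix3': w[-3:],                  # inflectional endings
            'is_title': w.istitle(),
            'has_digit': any(c.isdigit() for c in w),
            'is_mention': w.startswith('@'),    # Twitter-specific shapes
            'is_hashtag': w.startswith('#'),
            'is_url': w.startswith('http'),
            'prev': sent[i - 1].lower() if i > 0 else '<S>',
            'next': sent[i + 1].lower() if i + 1 < len(sent) else '</S>',
        }

    # Toy stand-ins for the out-of-domain PTB data and the annotated tweets.
    ptb = [(['The', 'band', 'played', '.'], ['DT', 'NN', 'VBD', '.'])]
    tweets = [(['watchng', 'american', 'dad', '!'], ['VBG', 'JJ', 'NN', '.'])]

    train = ptb + tweets
    X = [[word_features(toks, i) for i in range(len(toks))] for toks, _ in train]
    y = [tags for _, tags in train]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X, y)
    print(crf.predict(X[:1]))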
24 Accurate shallow parsing of tweets could benefit several applications such as Information Extraction and Named Entity Recognition. [sent-127, score-0.456]
25 Off-the-shelf shallow parsers perform noticeably worse on tweets, motivating us again to annotate in-domain training data. [sent-128, score-0.158]
26 We annotate the same set of 800 tweets mentioned previously with tags from the CoNLL shared task (Tjong Kim Sang and Buchholz, 2000). [sent-129, score-0.364]
27 2.3 Capitalization. A key orthographic feature for recognizing named entities is capitalization (Florian, 2002; Downey et al. [sent-137, score-0.643]
28 Unfortunately in tweets, capitalization is much less reliable than in edited texts. [sent-139, score-0.186]
29 In some tweets capitalization is informative, whereas in other cases, non-entity words are capitalized simply for emphasis. [sent-141, score-0.602]
30 Some tweets contain all lowercase words (8%), whereas others are in ALL CAPS (0. [sent-142, score-0.364]
31 …message to determine whether or not its capitalization is informative. [sent-150, score-0.186]
32 To this end, we build a capitalization classifier, T-CAP, which predicts whether or not a tweet is informatively capitalized. [sent-151, score-0.297]
33 The criteria we use for labeling are as follows: if a tweet contains any non-entity words which are capitalized but do not begin a sentence, or if it contains any entities which are not capitalized, then its capitalization is “uninformative”; otherwise it is “informative”. [sent-154, score-0.518]
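A direct transcription of this labeling rule as code (a sketch with hypothetical input conventions; the T-CAP classifier itself is then trained on tweets labeled this way):

    def capitalization_label(tokens, entity_idxs, sentence_starts):
        """Return 'informative' or 'uninformative' per the rule above.

        tokens          -- the tweet's words
        entity_idxs     -- indices of tokens inside gold entities
        sentence_starts -- indices of tokens that begin a sentence
        """
        for i, w in enumerate(tokens):
            if not w[:1].isalpha():
                continue
            cap = w[0].isupper()
            # capitalized non-entity word that does not begin a sentence
            if cap and i not in entity_idxs and i not in sentence_starts:
                return 'uninformative'
            # entity word that is not capitalized
            if not cap and i in entity_idxs:
                return 'uninformative'
        return 'informative'

    print(capitalization_label(['OMG', 'i', 'love', 'nintendo'], {3}, {0}))
    # -> 'uninformative': the entity 'nintendo' is left lowercase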
34 Results comparing against the majority baseline, which predicts capitalization is always informative, are shown in Table 5. [sent-157, score-0.186]
35 Features based on our capitalization classifier, we show, improve performance at named entity segmentation. [sent-159, score-0.702]
36 3 Named Entity Recognition. We now discuss our approach to named entity recognition on Twitter data. [sent-160, score-0.554]
37 As with POS tagging and shallow parsing, off-the-shelf named-entity recognizers perform poorly on tweets. [sent-161, score-0.218]
38 (2009), we treat classification and segmentation of named entities as separate tasks. [sent-175, score-0.584]
39 For example, we are able to use discriminative methods for named entity segmentation and distantly supervised approaches for classification. [sent-177, score-0.714]
40 While it might be beneficial to jointly model segmentation and (distantly supervised) classification using a joint sequence labeling and topic model similar to that proposed by Sauper et al. [sent-178, score-0.212]
41 Because most words found in tweets are not part of an entity, we need a larger annotated dataset to effectively learn a model of named entities. [sent-180, score-0.634]
42 We therefore use a randomly sampled set of 2,400 tweets for NER. [sent-181, score-0.364]
43 3.1 Segmenting Named Entities. Because capitalization in Twitter is less informative than in news, in-domain data is needed to train models which rely less heavily on capitalization and are also able to utilize features provided by T-CAP. [sent-184, score-0.234]
44 We exhaustively annotated our set of 2,400 tweets (34K tokens) with named entities. [sent-185, score-0.634]
45 We deliberately choose not to annotate @usernames as entities in our data set: they are unambiguous, trivial to identify with 100% accuracy using a simple regular expression, and would only serve to inflate our performance statistics. [sent-187, score-0.221]
46 T-SEG models Named Entity Segmentation as a sequence-labeling task using IOB encoding for representing segmentations (each word either begins, is inside, or is outside of a named entity), and uses Conditional Random Fields for learning and inference. [sent-193, score-0.236]
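A sketch of the IOB encoding and the kind of feature function involved (feature names here are illustrative; training would mirror the CRF sketch shown in the POS section above):

    def iob_encode(n_tokens, entity_spans):
        """entity_spans: (start, end) token offsets of gold entities."""
        tags = ['O'] * n_tokens
        for start, end in entity_spans:
            tags[start] = 'B'
            tags[start + 1:end] = ['I'] * (end - start - 1)
        return tags

    def seg_features(tokens, i, type_lists, cap_informative):
        w = tokens[i]
        feats = {
            'lower': w.lower(),
            'is_title': w.istitle(),
            # gate the orthographic cue on T-CAP's tweet-level prediction
            'title_and_informative': w.istitle() and cap_informative,
        }
        # dictionary features: one indicator per Freebase type list
        for name, entries in type_lists.items():
            feats['in_' + name] = w.lower() in entries
        return feats

    tokens = ['Yess', 'its', 'official', 'Nintendo', 'announced', 'it']
    print(iob_encode(len(tokens), [(3, 4)]))   # ['O','O','O','B','O','O']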
47 Again we include orthographic, contextual and dictionary features; our dictionaries included a set of type lists gathered from Freebase. [sent-194, score-0.196]
48 We report results at segmenting named entities in Table 6. [sent-196, score-0.513]
49 3.2 Classifying Named Entities. Because Twitter contains many distinctive and infrequent entity types, gathering sufficient training data for named entity classification is a difficult task. [sent-200, score-0.873]
50 Moreover, due to their terse nature, individual tweets often do not contain enough context to determine the type of the entities they contain. [sent-202, score-0.686]
51 …without any prior knowledge, there is not enough context to determine what type of entity “KKTNY” refers to; however, by exploiting redundancy in the data (Downey et al. [sent-213, score-0.315]
52 In order to handle the problem of many infrequent types, we leverage large lists of entities and their types gathered from an open-domain ontology (Freebase) as a source of distant supervision, allowing use of large amounts of unlabeled data in learning. [sent-215, score-0.503]
53 Freebase Baseline: Although Freebase has very broad coverage, simply looking up entities and their types is inadequate for classifying named entities in context (0. [sent-216, score-0.806]
54 This problem is very common: 35% of the entities in our data appear in more than one of our (mutually exclusive) Freebase dictionaries. [sent-224, score-0.221]
55 Additionally, 30% of entities mentioned on Twitter do not appear in any Freebase dictionary, as they are either too new (for example a newly released videogame), or are misspelled or abbreviated (for example ‘mbp’ is often used to refer to the “mac book pro”). [sent-225, score-0.221]
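A sketch of that lookup baseline with toy dictionaries (the real type lists are the mutually exclusive Freebase lists described above):

    FREEBASE = {
        'PERSON':  {'justin bieber'},
        'COMPANY': {'nintendo', 'apple'},
        'BAND':    {'the beatles', 'apple'},   # 'apple' is ambiguous
    }

    def freebase_baseline(entity):
        """Predict a type only when the entity is in exactly one dictionary."""
        hits = [t for t, names in FREEBASE.items() if entity.lower() in names]
        return hits[0] if len(hits) == 1 else None   # abstain otherwise

    print(freebase_baseline('Nintendo'))   # COMPANY
    print(freebase_baseline('apple'))      # None: the 35%-ambiguous case
    print(freebase_baseline('KKTNY'))      # None: the 30%-out-of-coverage case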
56 Distant Supervision with Topic Models: To model unlabeled entities and their possible types, we apply LabeledLDA (Ramage et al. [sent-226, score-0.289]
57 This allows information about an entity’s distribution over types to be shared across mentions, naturally handling ambiguous entity strings whose mentions could refer to different types. [sent-229, score-0.511]
58 Each entity string in our data is associated with a bag of words found within a context window around all of its mentions, and also within the entity itself. [sent-230, score-0.592]
59 …constrain θe, the distribution over topics for each entity string, based on its set of possible types, FB[e]. [sent-235, score-0.348]
60 For entities which aren’t found in any of the Freebase dictionaries, we leave their topic distributions θe unconstrained. [sent-237, score-0.266]
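The constraint amounts to giving topics outside FB[e] zero probability in the sampler; a sketch of one collapsed-Gibbs step under that constraint (the simplified count structures and symmetric hyperparameters are assumptions here, not the paper's exact implementation):

    import random

    def sample_topic(word, allowed, n_et, n_tw, n_t,
                     alpha=0.1, beta=0.01, V=100_000):
        """One collapsed-Gibbs draw for one word of an entity's 'document'.

        allowed -- FB[e] for Freebase entities, else all types (unconstrained)
        n_et[t] -- words in this entity's document currently assigned t
        n_tw[t][w], n_t[t] -- corpus-wide topic-word and topic totals
        """
        weights = []
        for t in allowed:   # zero mass outside FB[e]
            w = (n_et.get(t, 0) + alpha) \
                * (n_tw[t].get(word, 0) + beta) / (n_t[t] + beta * V)
            weights.append(w)
        r = random.uniform(0, sum(weights))
        for t, w in zip(allowed, weights):
            r -= w
            if r <= 0:
                return t
        return allowed[-1]

    print(sample_topic('console', ['COMPANY', 'PRODUCT'],
                       {'COMPANY': 3},
                       {'COMPANY': {'console': 5}, 'PRODUCT': {}},
                       {'COMPANY': 40, 'PRODUCT': 12}))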
61 In making predictions, we found it beneficial to consider θtrain as a prior distribution over types for entities which were encountered during training. [sent-255, score-0.353]
62 In practice this sharing of information across contexts is very beneficial as there is often insufficient evidence in an isolated tweet to determine an entity’s type. [sent-256, score-0.151]
63 For entities which weren’t encountered during training, we instead use a prior based on the distribution of types across all entities. [sent-257, score-0.313]
64 One approach to classifying entities in context is to assume that θtrain is fixed, and that all of the words inside the entity mention and its context, w, are drawn based on a single topic z; that is, they are all drawn from Multinomial(βz). [sent-258, score-0.573]
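Under this single-topic assumption, prediction reduces to a naive-Bayes-style argmax over types; a sketch in log space (toy per-type word counts; the prior stands in for θtrain, or for the global type distribution when the entity is unseen):

    import math

    def classify(words, prior, topic_words, V=100_000, s=0.01):
        """argmax_z  log prior[z] + sum_w log p(w | z)."""
        best, best_lp = None, -math.inf
        for z, pz in prior.items():
            total = sum(topic_words[z].values()) + s * V
            lp = math.log(pz) + sum(
                math.log((topic_words[z].get(w, 0) + s) / total)
                for w in words)
            if lp > best_lp:
                best, best_lp = z, lp
        return best

    topic_words = {'TV-SHOW': {'watching': 9, 'episode': 7},
                   'COMPANY': {'stock': 8, 'ceo': 5}}
    print(classify(['watching', 'kktny', 'episode'],
                   {'TV-SHOW': 0.5, 'COMPANY': 0.5}, topic_words))  # TV-SHOW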
65 None of the entities shown were found in Freebase; these are typically either too new to have been added, or are misspelled or abbreviated (for example rhobh = “Real Housewives of Beverly Hills”). [sent-260, score-0.221]
66 In order to make predictions, for each entity we use an informative Dirichlet prior based on θtrain and perform 100 iterations of Gibbs Sampling holding the hidden topic variables in the training data fixed (Yao et al. [sent-263, score-0.373]
67 In some cases, we combine multiple Freebase types to create a dictionary of entities representing a single type (for example the COMPANY dictionary contains the Freebase types /business/consumer_company and /business/brand). [sent-270, score-0.48]
68 Training: To gather unlabeled data for inference, we run T-SEG, our entity segmenter (from §3.1). [sent-272, score-0.348]
69 This results in a set of 23,651 distinct entity strings. [sent-274, score-0.28]
70 For each entity string, we collect words occurring in a context window of 3 words from all mentions in our data, and use a vocabulary of the 100K most frequent words. [sent-275, score-0.419]
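A sketch of that grouping step over T-SEG's output (window of ±3 tokens around each mention plus the entity's own words; the 100K-word vocabulary cutoff would be applied afterwards):

    from collections import Counter, defaultdict

    WINDOW = 3

    def entity_documents(segmented_tweets):
        """segmented_tweets: list of (tokens, [(start, end), ...]) pairs."""
        docs = defaultdict(Counter)
        for tokens, spans in segmented_tweets:
            for start, end in spans:
                name = ' '.join(tokens[start:end]).lower()
                lo = max(0, start - WINDOW)
                hi = min(len(tokens), end + WINDOW)
                # context window plus the entity's own words
                docs[name].update(t.lower() for t in tokens[lo:hi])
        return docs

    tweets = [(['Yess', 'its', 'official', 'Nintendo', 'announced', 'today'],
               [(3, 4)]),
              (['Nintendo', 'pushed', 'back', 'the', 'launch'], [(0, 1)])]
    print(entity_documents(tweets)['nintendo'])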
71 Table 7 displays the 20 entities (not found in Freebase) whose posterior distribution θe assigns highest probability to selected types. [sent-277, score-0.257]
72 Results: Table 8 presents the classification results of T-CLASS compared against a majority baseline which simply picks the most frequent class (PERSON), in addition to the Freebase baseline, which only makes predictions if an entity appears in exactly one dictionary (i.e. [sent-278, score-0.36]
73 T-CLASS also outperforms a simple supervised baseline which applies a MaxEnt classifier using 4-fold cross validation over the 1,450 entities which were annotated for testing. [sent-281, score-0.29]
74 Additionally we compare against the co-training algorithm of Collins and Singer (1999) which also leverages unlabeled data and uses our Freebase type lists; for seed rules we use the “unambiguous” Freebase entities. [sent-282, score-0.134]
75 Tables 9 and 10 present a breakdown of F1 scores by type, both collapsing types into the standard classes used in the MUC competitions (PERSON, LOCATION, ORGANIZATION), and using the 10 popular Twitter types described earlier. [sent-284, score-0.145]
76 LabeledLDA groups together words across all mentions of an entity string, and infers a distribution over its possible types, whereas DL-Cotrain considers the entity mentions separately as unlabeled examples and predicts a type independently for each. [sent-287, score-0.139] [sent-315, score-0.558]
77 Table 11: Comparing LabeledLDA and DL-Cotrain grouping unlabeled data by entities vs. mentions. [sent-305, score-0.15643]
79 In order to ensure that the difference in performance between LabeledLDA and DL-Cotrain is not simply due to this difference in representation, we compare both DL-Cotrain and LabeledLDA using both unlabeled datasets (grouping words by all mentions vs. [sent-316, score-0.207]
80 As expected, DL-Cotrain performs poorly when the unlabeled examples group mentions: co-training uses a discriminative learning algorithm, so a model trained on entities and tested on individual mentions loses performance. [sent-318, score-0.289]
81 Additionally, LabeledLDA’s performance is poorer when considering mentions as “documents”. [sent-319, score-0.139]
82 Locke and Martin (2009) train a classifier to recognize named entities based on annotated Twitter data, handling the types PERSON, LOCATION, and ORGANIZATION. [sent-325, score-0.547]
83 …(2011) build a POS tagger for tweets using 20 coarse-grained tags. [sent-329, score-0.405]
84 , 2011) has proposed lexical normalization of tweets, which may be useful as a preprocessing step for upstream tasks like POS tagging and NER. [sent-334, score-0.424]
85 (2010) apply a minimally supervised approach to extracting entities from text advertisements. [sent-341, score-0.256]
86 In addition we take a distantly supervised approach to Named Entity Classification which exploits large dictionaries of entities gathered from Freebase, requires no manually annotated data, and as a result is able to handle a larger number of types than previous work. [sent-343, score-0.524]
87 Although we found manually annotated data to be very beneficial for named entity segmentation, we were motivated to explore approaches that don’t rely on manual labels for classification due to Twitter’s wide range of named entity types. [sent-344, score-1.152]
88 Additionally, unlike previous work on NER in informal text, our approach allows the sharing of information across an entity’s mentions, which is quite beneficial due to Twitter’s terse nature. [sent-345, score-0.283]
89 Previous work on Semantic Bootstrapping has taken a weakly-supervised approach to classifying named entities based on large amounts of unlabeled text (Etzioni et al. [sent-346, score-0.597]
90 In contrast, rather than predicting which classes an entity belongs to (e.g. [sent-349, score-0.28]
91 a multi-label classification task), LabeledLDA estimates a distribution over its types, which is then useful as a prior when classifying mentions in context. [sent-351, score-0.293]
92 , 2005) which enforce consistency when classifying multiple occurrences of an entity within a document. [sent-353, score-0.352]
93 LabeledLDA) for classifying named entities has a similar effect, in that information about an entity’s distribution of possible types is shared across its mentions. [sent-356, score-0.621]
94 To address this challenge we have annotated tweets and built tools trained on unlabeled, in-domain, and out-of-domain data, showing substantial improvement over their state-of-the-art news-trained counterparts; for example, T-POS outperforms the Stanford POS Tagger, reducing error by 41%. [sent-358, score-0.453]
95 We identified named entity classification as a particularly challenging task on Twitter. [sent-360, score-0.562]
96 Due to their terse nature, tweets often lack enough context to identify the types of the entities they contain. [sent-361, score-0.707]
97 In addition, a plethora of distinctive named entity types are present, necessitating large amounts of training data. [sent-362, score-0.655]
98 Named entity recognition as a house of cards: classifier stacking. [sent-444, score-0.318]
99 Extracting personal names from email: applying named entity recognition to informal text. [sent-530, score-0.592]
100 Collective segmentation and labeling of distant entities in information extraction. [sent-565, score-0.36]
wordName wordTfidf (topN-words)
[('tweets', 0.364), ('labeledlda', 0.343), ('freebase', 0.303), ('entity', 0.28), ('twitter', 0.259), ('named', 0.236), ('entities', 0.221), ('capitalization', 0.186), ('mentions', 0.139), ('tweet', 0.111), ('shallow', 0.092), ('downey', 0.083), ('irc', 0.082), ('distantly', 0.082), ('segmentation', 0.081), ('terain', 0.076), ('stanford', 0.076), ('muc', 0.074), ('classifying', 0.072), ('pos', 0.069), ('unlabeled', 0.068), ('shelf', 0.066), ('terse', 0.066), ('tagging', 0.06), ('distant', 0.058), ('dictionaries', 0.058), ('cotraining', 0.057), ('types', 0.056), ('segmenting', 0.056), ('elsner', 0.055), ('tools', 0.055), ('fb', 0.053), ('oov', 0.053), ('capitalized', 0.052), ('person', 0.051), ('interjections', 0.049), ('singer', 0.048), ('informative', 0.048), ('chunking', 0.047), ('classification', 0.046), ('topic', 0.045), ('usernames', 0.045), ('distinctive', 0.045), ('company', 0.044), ('brown', 0.042), ('opennlp', 0.041), ('misclassified', 0.041), ('tagger', 0.041), ('beneficial', 0.04), ('recognizer', 0.04), ('chat', 0.039), ('doug', 0.039), ('gathered', 0.038), ('informal', 0.038), ('forsythand', 0.038), ('github', 0.038), ('kktny', 0.038), ('kobus', 0.038), ('locke', 0.038), ('nintendo', 0.038), ('plethora', 0.038), ('uller', 0.038), ('yess', 0.038), ('recognition', 0.038), ('ramage', 0.037), ('oregon', 0.037), ('gibbs', 0.036), ('obtaining', 0.036), ('distribution', 0.036), ('etzioni', 0.036), ('additionally', 0.036), ('location', 0.036), ('supervised', 0.035), ('portland', 0.035), ('type', 0.035), ('dictionary', 0.034), ('annotated', 0.034), ('oren', 0.033), ('recognizers', 0.033), ('aritt', 0.033), ('competitions', 0.033), ('gouws', 0.033), ('benson', 0.033), ('string', 0.032), ('topics', 0.032), ('nlp', 0.031), ('news', 0.031), ('infrequent', 0.031), ('leverages', 0.031), ('tokens', 0.031), ('lists', 0.031), ('clusters', 0.03), ('band', 0.03), ('chunker', 0.03), ('loc', 0.03), ('minkov', 0.03), ('sha', 0.03), ('mult', 0.03), ('sauper', 0.03), ('doubles', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
2 0.30900258 89 emnlp-2011-Linguistic Redundancy in Twitter
Author: Fabio Massimo Zanzotto ; Marco Pennaccchiotti ; Kostas Tsioutsiouliklis
Abstract: In the last few years, the interest of the research community in micro-blogs and social media services, such as Twitter, is growing exponentially. Yet, so far not much attention has been paid on a key characteristic of microblogs: the high level of information redundancy. The aim of this paper is to systematically approach this problem by providing an operational definition of redundancy. We cast redundancy in the framework of Textual Entailment Recognition. We also provide quantitative evidence on the pervasiveness of redundancy in Twitter, and describe a dataset of redundancy-annotated tweets. Finally, we present a general purpose system for identifying redundant tweets. An extensive quantitative evaluation shows that our system successfully solves the redundancy detection task, improving over baseline systems with statistical significance.
3 0.2853348 128 emnlp-2011-Structured Relation Discovery using Generative Models
Author: Limin Yao ; Aria Haghighi ; Sebastian Riedel ; Andrew McCallum
Abstract: We explore unsupervised approaches to relation extraction between two named entities; for instance, the semantic bornIn relation between a person and location entity. Concretely, we propose a series of generative probabilistic models, broadly similar to topic models, each which generates a corpus of observed triples of entity mention pairs and the surface syntactic dependency path between them. The output of each model is a clustering of observed relation tuples and their associated textual expressions to underlying semantic relation types. Our proposed models exploit entity type constraints within a relation as well as features on the dependency path between entity mentions. We examine effectiveness of our approach via multiple evaluations and demonstrate 12% error reduction in precision over a state-of-the-art weakly supervised baseline.
4 0.28341806 41 emnlp-2011-Discriminating Gender on Twitter
Author: John D. Burger ; John Henderson ; George Kim ; Guido Zarrella
Abstract: Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show the degree to which classifier accuracy varies based on tweet volumes as well as when various kinds of profile metadata are included in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly out-perform both baseline models and almost all humans on the same task.
5 0.26381898 71 emnlp-2011-Identifying and Following Expert Investors in Stock Microblogs
Author: Roy Bar-Haim ; Elad Dinur ; Ronen Feldman ; Moshe Fresko ; Guy Goldstein
Abstract: Information published in online stock investment message boards, and more recently in stock microblogs, is considered highly valuable by many investors. Previous work focused on aggregation of sentiment from all users. However, in this work we show that it is beneficial to distinguish expert users from non-experts. We propose a general framework for identifying expert investors, and use it as a basis for several models that predict stock rise from stock microblogging messages (stock tweets). In particular, we present two methods that combine expert identification and per-user unsupervised learning. These methods were shown to achieve relatively high precision in predicting stock rise, and significantly outperform our baseline. In addition, our work provides an in-depth analysis of the content and potential usefulness of stock tweets.
6 0.23708875 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
7 0.21391954 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
8 0.1268539 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
10 0.10314306 38 emnlp-2011-Data-Driven Response Generation in Social Media
11 0.1007923 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
12 0.095420383 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
13 0.095202833 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
14 0.092187747 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
15 0.087150469 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases
16 0.085442193 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
17 0.084316753 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts
18 0.082131825 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions
19 0.080394909 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
20 0.078936078 17 emnlp-2011-Active Learning with Amazon Mechanical Turk
topicId topicWeight
[(0, 0.265), (1, -0.371), (2, 0.234), (3, 0.066), (4, -0.382), (5, -0.008), (6, 0.024), (7, -0.037), (8, -0.054), (9, 0.024), (10, 0.063), (11, 0.162), (12, 0.032), (13, 0.01), (14, -0.027), (15, -0.065), (16, 0.074), (17, 0.057), (18, 0.131), (19, -0.042), (20, -0.159), (21, -0.013), (22, -0.003), (23, -0.125), (24, -0.003), (25, -0.015), (26, 0.006), (27, 0.05), (28, -0.075), (29, -0.006), (30, 0.029), (31, -0.032), (32, 0.071), (33, 0.015), (34, 0.022), (35, 0.005), (36, 0.012), (37, -0.074), (38, -0.059), (39, 0.011), (40, 0.091), (41, -0.164), (42, -0.024), (43, 0.014), (44, 0.006), (45, 0.029), (46, -0.031), (47, -0.007), (48, -0.005), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.95633078 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
2 0.71550816 89 emnlp-2011-Linguistic Redundancy in Twitter
Author: Fabio Massimo Zanzotto ; Marco Pennaccchiotti ; Kostas Tsioutsiouliklis
Abstract: In the last few years, the interest of the research community in micro-blogs and social media services, such as Twitter, is growing exponentially. Yet, so far not much attention has been paid on a key characteristic of microblogs: the high level of information redundancy. The aim of this paper is to systematically approach this problem by providing an operational definition of redundancy. We cast redundancy in the framework of Textual Entailment Recognition. We also provide quantitative evidence on the pervasiveness of redundancy in Twitter, and describe a dataset of redundancy-annotated tweets. Finally, we present a general purpose system for identifying redundant tweets. An extensive quantitative evaluation shows that our system successfully solves the redundancy detection task, improving over baseline systems with statistical significance.
3 0.70061642 71 emnlp-2011-Identifying and Following Expert Investors in Stock Microblogs
Author: Roy Bar-Haim ; Elad Dinur ; Ronen Feldman ; Moshe Fresko ; Guy Goldstein
Abstract: Information published in online stock investment message boards, and more recently in stock microblogs, is considered highly valuable by many investors. Previous work focused on aggregation of sentiment from all users. However, in this work we show that it is beneficial to distinguish expert users from non-experts. We propose a general framework for identifying expert investors, and use it as a basis for several models that predict stock rise from stock microblogging messages (stock tweets). In particular, we present two methods that combine expert identification and per-user unsupervised learning. These methods were shown to achieve relatively high precision in predicting stock rise, and significantly outperform our baseline. In addition, our work provides an in-depth analysis of the content and potential usefulness of stock tweets.
4 0.63327259 41 emnlp-2011-Discriminating Gender on Twitter
Author: John D. Burger ; John Henderson ; George Kim ; Guido Zarrella
Abstract: Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show the degree to which classifier accuracy varies based on tweet volumes as well as when various kinds of profile metadata are included in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly out-perform both baseline models and almost all humans on the same task.
5 0.63000464 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
Author: Vahed Qazvinian ; Emily Rosengren ; Dragomir R. Radev ; Qiaozhu Mei
Abstract: A rumor is commonly defined as a statement whose true value is unverifiable. Rumors may spread misinformation (false information) or disinformation (deliberately false information) on a network of people. Identifying rumors is crucial in online social media where large amounts of information are easily spread across a large network by sources with unverified authority. In this paper, we address the problem of rumor detection in microblogs and explore the effectiveness of 3 categories of features: content-based, network-based, and microblog-specific memes for correctly identifying rumors. Moreover, we show how these features are also effective in identifying disinformers, users who endorse a rumor and further help it to spread. We perform our experiments on more than 10,000 manually annotated tweets collected from Twitter and show how our retrieval model achieves more than 0.95 in Mean Average Precision (MAP). Fi- nally, we believe that our dataset is the first large-scale dataset on rumor detection. It can open new dimensions in analyzing online misinformation and other aspects of microblog conversations.
6 0.61901307 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
7 0.5755657 139 emnlp-2011-Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
8 0.5070014 128 emnlp-2011-Structured Relation Discovery using Generative Models
9 0.44515491 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
10 0.42599842 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
11 0.3517392 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
12 0.33957714 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
13 0.30574939 14 emnlp-2011-A generative model for unsupervised discovery of relations and argument classes from clinical texts
14 0.2970244 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
16 0.28850907 114 emnlp-2011-Relation Extraction with Relation Topics
17 0.27868858 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion
18 0.27565181 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
19 0.27008897 38 emnlp-2011-Data-Driven Response Generation in Social Media
20 0.26663974 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases
topicId topicWeight
[(15, 0.011), (23, 0.13), (36, 0.022), (37, 0.026), (45, 0.083), (52, 0.056), (53, 0.028), (54, 0.019), (57, 0.028), (62, 0.025), (64, 0.03), (66, 0.041), (69, 0.016), (75, 0.232), (79, 0.042), (82, 0.021), (87, 0.023), (96, 0.061), (98, 0.026)]
simIndex simValue paperId paperTitle
1 0.85243785 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
Author: Altaf Rahman ; Vincent Ng
Abstract: An entity in a dialogue may be old, new, or mediated/inferrable with respect to the hearer’s beliefs. Knowing the information status of the entities participating in a dialogue can therefore facilitate its interpretation. We address the under-investigated problem of automatically determining the information status of discourse entities. Specifically, we extend Nissim’s (2006) machine learning approach to information-status determination with lexical and structured features, and exploit learned knowledge of the information status of each discourse entity for coreference resolution. Experimental results on a set of Switchboard dialogues reveal that (1) incorporating our proposed features into Nissim’s feature set enables our system to achieve stateof-the-art performance on information-status classification, and (2) the resulting information can be used to improve the performance of learning-based coreference resolvers.
same-paper 2 0.78476703 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http://github.com/aritter/twitter_nlp
3 0.63131809 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: Context-dependent word similarity can be measured over multiple cross-cutting dimensions. For example, lung and breath are similar thematically, while authoritative and superficial occur in similar syntactic contexts, but share little semantic similarity. Both of these notions of similarity play a role in determining word meaning, and hence lexical semantic models must take them both into account. Towards this end, we develop a novel model, Multi-View Mixture (MVM), that represents words as multiple overlapping clusterings. MVM finds multiple data partitions based on different subsets of features, subject to the marginal constraint that feature subsets are distributed according to Latent Dirich- let Allocation. Intuitively, this constraint favors feature partitions that have coherent topical semantics. Furthermore, MVM uses soft feature assignment, hence the contribution of each data point to each clustering view is variable, isolating the impact of data only to views where they assign the most features. Through a series of experiments, we demonstrate the utility of MVM as an inductive bias for capturing relations between words that are intuitive to humans, outperforming related models such as Latent Dirichlet Allocation.
4 0.63039929 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
Author: Matthias Hartung ; Anette Frank
Abstract: This paper introduces an attribute selection task as a way to characterize the inherent meaning of property-denoting adjectives in adjective-noun phrases, such as e.g. hot in hot summer denoting the attribute TEMPERATURE, rather than TASTE. We formulate this task in a vector space model that represents adjectives and nouns as vectors in a semantic space defined over possible attributes. The vectors incorporate latent semantic information obtained from two variants of LDA topic models. Our LDA models outperform previous approaches on a small set of 10 attributes with considerable gains on sparse representations, which highlights the strong smoothing power of LDA models. For the first time, we extend the attribute selection task to a new data set with more than 200 classes. We observe that large-scale attribute selection is a hard problem, but a subset of attributes performs robustly on the large scale as well. Again, the LDA models outperform the VSM baseline.
5 0.61315 41 emnlp-2011-Discriminating Gender on Twitter
Author: John D. Burger ; John Henderson ; George Kim ; Guido Zarrella
Abstract: Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show the degree to which classifier accuracy varies based on tweet volumes as well as when various kinds of profile metadata are included in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly out-perform both baseline models and almost all humans on the same task.
6 0.60198063 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
7 0.59735179 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
8 0.59596896 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
9 0.59432864 128 emnlp-2011-Structured Relation Discovery using Generative Models
10 0.58955628 136 emnlp-2011-Training a Parser for Machine Translation Reordering
11 0.58913034 33 emnlp-2011-Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
12 0.58791226 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
13 0.58695412 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
14 0.5846048 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
15 0.58460158 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
16 0.58137411 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
17 0.5807687 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
18 0.57958364 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction
19 0.57945395 38 emnlp-2011-Data-Driven Response Generation in Social Media
20 0.57914406 89 emnlp-2011-Linguistic Redundancy in Twitter