acl acl2011 acl2011-133 knowledge-graph by maker-knowledge-mining

133 acl-2011-Extracting Social Power Relationships from Natural Language


Source: pdf

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects. [sent-4, score-0.475]

2 This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. [sent-5, score-0.286]

3 In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. [sent-6, score-0.848]

4 We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. [sent-7, score-0.598]

5 We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. [sent-8, score-0.811]

6 Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. [sent-9, score-0.34]

7 [1 Introduction] Linguists in sociolinguistics, pragmatics and related fields have analyzed the influence of social context on language and have catalogued countless phenomena that are influenced by it, confirming many with qualitative and quantitative studies. [sent-10, score-0.292]

8 Indeed, social context and function influence language at every level – morphologically, lexically, syntactically, and semantically, through discourse structure, and through higher-level abstractions such as pragmatics. [sent-15, score-0.292]

9 Considered together, the extent to which speakers modify their language for a social context amounts to an identifiable variation on language, which we call a lect. [sent-16, score-0.292]

10 In this paper, we describe lect classifiers for social power relationships. [sent-18, score-0.723]

11 We refer to these lects as: • UpSpeak: Communication directed to someone with greater social authority. [sent-19, score-0.475]

12 DownSpeak: Communication directed to someone with less social authority. [sent-20, score-0.292]

13 PeerSpeak: Communication to someone of equal social authority. [sent-21, score-0.292]

14 We call the problem of modeling these lects Social Power Modeling (SPM). [sent-22, score-0.237]

15 Our approach first identifies statistically salient phrases of words and parts of speech – known as n-grams – in training texts generated in conditions where the social power relationship is known. [sent-26, score-0.467]
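To make this step concrete, the sketch below counts mixed word/POS bigrams of the kind used later in the paper (e.g., please ^VB, the word "please" followed by any verb). It assumes NLTK's off-the-shelf tokenizer and tagger, which the paper does not specify; the tools and tagset are illustrative only.

```python
from collections import Counter
from itertools import product

# Requires NLTK with the 'punkt' and 'averaged_perceptron_tagger' models.
from nltk import pos_tag, word_tokenize

def mixed_bigrams(text):
    """Count bigrams over surface words and '^TAG' wildcards, so a pattern
    like ('please', '^VB') matches 'please' followed by any verb."""
    tagged = pos_tag(word_tokenize(text))
    grams = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # each position contributes either its surface word or its POS tag
        for a, b in product((w1.lower(), '^' + t1), (w2.lower(), '^' + t2)):
            grams[(a, b)] += 1
    return grams
```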

16 This methodology is a cost-effective approach to modeling social information and requires no language- or culture-specific feature engineering, although we believe sociolinguistics-inspired features hold promise. [sent-31, score-0.442]

17 When applied to the corpus of emails sent and received by Enron employees (CALO Project 2009), this approach produced solid results, despite a limited number of training and test instances. [sent-32, score-0.209]

18 Since manually determining the power structure of social networks is a time-consuming process, even for an expert, effective SPM could support data-driven sociocultural research and greatly aid analysts doing national intelligence work. [sent-34, score-0.467]

19 Social network analysis (SNA) presupposes a collection of individuals, whereas a social power lect classifier, once trained, would provide useful information about individual author-recipient links. [sent-35, score-0.66]

20 If SPM were yoked with sentiment analysis, we might identify which opinions belong to respected members of online communities or lay the groundwork for understanding how respect is earned in social networks. [sent-37, score-0.386]

21 The results in this paper suggest that successes to date modeling authorship, sentiment, emotion, and personality extend to social power modeling, and our approach may well be applicable to other dimensions of social meaning. [sent-39, score-0.938]

22 [2 Related Work] The feasibility of Social Power Modeling is supported by sociolinguistic research identifying specific ways in which a person’s language reflects his relative power over others. [sent-42, score-0.254]

23 Similarly, Erikson et al. identified measurable characteristics of the speech of witnesses in a courtroom setting which were directly associated with the witness’s level of social power (Erikson, 1978). [sent-49, score-0.467]

24 Given, then, that there are distinct differences among what we term UpSpeak and DownSpeak, we treat Social Power Modeling as an instance of text classification (or categorization): we seek to assign a class (UpSpeak or DownSpeak) to a text sample. [sent-50, score-0.18]

25 Closely related natural language processing problems are authorship attribution, sentiment analysis, emotion detection, and personality classification: all aim to extract higher-level information from language. [sent-51, score-0.4]

26 The earliest modern authorship attribution work was (Mosteller & Wallace, 1964), although forensic authorship analysis has been around much longer. [sent-53, score-0.29]

27 Since then, authorship identification has become a mature area productively exploring a broad spectrum of features (stylistic, lexical, syntactic, and semantic) and many generative and discriminative modeling approaches (Stamatatos, 2009). [sent-55, score-0.221]

28 The generative models of authorship identification motivated our statistically extracted lexical and grammatical features, and future work should consider these language modeling (a. [sent-56, score-0.172]

29 For example, the polarity of an expression is determined by the majority polarity of its lexical items, or by rules applied to syntactic patterns that determine the polarity from its lexical components. [sent-66, score-0.201]

30 Their work jointly classifies sentiment at both levels instead of using independent classifiers for each level or cascaded classifiers. [sent-68, score-0.157]

31 Unlike their work, our text classification techniques take into account the frequency of occurrence of word n-grams and part-of-speech (POS) tag sequences, and other measures of statistical salience in training data. [sent-70, score-0.217]

32 Text-based emotion prediction is another instance of text classification, where the goal is to detect the emotion appropriate to a text (Alm, Roth & Sproat, 2005) or provoked by an author, for example (Strapparava & Mihalcea, 2008). [sent-71, score-0.194]

33 Alm, Roth, and Sproat explored a broad array of lexical and syntactic features, reminiscent of those of authorship attribution, as well as features related to story structure. [sent-72, score-0.167]

34 In personality classification, a person’s language is used to classify him on different personality dimensions, such as extraversion or neuroticism (Oberlander & Nowson, 2006; Mairesse & Walker, 2006). [sent-78, score-0.283]

35 Oberlander and Nowson explore using a Naïve Bayes and an SVM classifier to perform binary classification of text on each personality dimension. [sent-80, score-0.29]

36 Their attempt to classify each personality trait as either “high” or “low” echoes early sentiment analysis work that reduced sentiments to either positive or negative (Pang, Lee, & Vaithyanathan, 2002), and supports initially treating Social Power Modeling as a binary classification task. [sent-82, score-0.311]

37 Personality classification seems to be the application of text classification most relevant to Social Power Modeling. [sent-83, score-0.152]

38 As Mairesse and Walker note, certain personality traits are indicative of leaders. [sent-84, score-0.207]

39 Thus, the ability to model personality suggests an ability to model social power lects as well. [sent-85, score-0.775]

40 This was the first significant work to model the content and relationships of communication in a social network. [sent-88, score-0.385]

41 However, we model social power relationships, not roles or topics, and our approach produces discriminative classifiers, not generative models, which enables more concrete evaluation. [sent-96, score-0.503]

42 Namata, Getoor, and Diehl effectively applied role modeling to the Enron email corpus, allowing them to infer the social hierarchy structure of Enron (Namata et al. [sent-97, score-0.5]

43 They applied machine learning classifiers to map individuals to their roles in the hierarchy based on features related to email traffic patterns. [sent-99, score-0.34]

44 They also attempt to identify cases of manager-subordinate relationships within the email domain by ranking emails using traffic-based and content-based features (Diehl et al.). [sent-100, score-0.364]

45 While their task is similar to ours, our goal is to classify any case in which one person has more social power than the other, not just identify instances of direct reporting. [sent-102, score-0.543]

46 Morand’s study, for instance, identified specific features that correlate with the direction of communication within a social hierarchy (Morand, 2000). [sent-106, score-0.426]

47 The feature associated with S on text T would be: $f(S, T) = \sum_{i=1}^{k} \mathrm{freq}(n_i, T)$, where $\mathrm{freq}(n_i, T)$ is the relative frequency (defined later) of $n_i$ in text T. [sent-114, score-0.384]

48 The frequency of this n-gram in T would then be 1/9, where 1 is the number of substrings in T that match (footnote: to distinguish a comma separating elements of a set from a comma that is part of an n-gram, we use ‘comma’ to denote the punctuation mark ‘,’ within an n-gram). [sent-121, score-0.215]

49 please ^VB and 9 is the number of bigrams in T, excluding sentence initial and final markers. [sent-122, score-0.157]

50 The other n-gram, the trigram please ‘comma’ ^VB, does not have any match, so the final value of the feature is 1/9. [sent-123, score-0.17]
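A minimal sketch of this computation, reproducing the 1/9 worked example; the counts and totals are placeholders taken from the example, not measured on a real text.

```python
def feature_value(ngram_set, counts, totals):
    """f(S, T) = sum over n_i in S of freq(n_i, T), where freq(n_i, T) is
    count(n_i, T) divided by the number of n-grams of length |n_i| in T."""
    value = 0.0
    for gram in ngram_set:
        n = len(gram)
        if totals.get(n):
            value += counts.get(gram, 0) / totals[n]
    return value

# Worked example: one match of the bigram, no match of the trigram.
S = {('please', '^VB'), ('please', 'comma', '^VB')}
counts = {('please', '^VB'): 1}   # one matching bigram in T
totals = {2: 9, 3: 8}             # 9 bigrams in T; the trigram total is a placeholder
assert abs(feature_value(S, counts, totals) - 1/9) < 1e-12
```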

51 These are: • Absolute frequency: The total number of times a particular n-gram occurs in the text of a given class (social power lect). [sent-128, score-0.262]

52 Normalization by the size of the class makes relative frequency a better metric for comparing n-gram usage across classes. [sent-130, score-0.187]

53 We require that the ratio of the relative frequency of the n-gram in one class to its relative frequency in the other class is also greater than a threshold. [sent-133, score-0.42]

54 In experiments based on the bag-of-words model, we only consider an absolute frequency threshold, whereas in later experiments, we also take into account the relative frequency ratio threshold. [sent-135, score-0.263]
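The two selection criteria might look like the following sketch; the threshold values are assumptions, since the paper's settings vary by experiment.

```python
def select_ngrams(counts_a, counts_b, size_a, size_b, min_abs=20, min_ratio=1.2):
    """Keep n-grams that clear an absolute frequency threshold in class A and
    whose relative frequency in A exceeds that in B by at least min_ratio."""
    selected = set()
    for gram, count in counts_a.items():
        if count < min_abs:                   # absolute frequency threshold
            continue
        rf_a = count / size_a                 # relative frequency in class A
        rf_b = counts_b.get(gram, 0) / size_b
        if rf_b == 0 or rf_a / rf_b >= min_ratio:
            selected.add(gram)                # relative frequency ratio threshold
    return selected
```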

55 [3 N-gram Binning] In experiments in which we bin n-grams, selected n-grams are assigned to the class in which their relative frequency is highest. [sent-137, score-0.187]

56 For example, an n-gram whose relative frequency in UpSpeak text is twice that in DownSpeak text would be assigned to the class UpSpeak. [sent-138, score-0.285]

57 This partition is based on the n-gram type, the length of n-grams and the relative frequency ratio of the n-grams. [sent-141, score-0.18]
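A sketch of this binning step; the class names follow the paper, but the ratio-bin boundaries are assumed for illustration.

```python
def bin_ngrams(selected, rf_up, rf_down, ratio_edges=(1.5, 2.0, float('inf'))):
    """Assign each n-gram to the class with the higher relative frequency,
    then group class members into bins by n-gram length and ratio range."""
    bins = {}  # (class, n-gram length, ratio bin index) -> set of n-grams
    for gram in selected:
        fu, fd = rf_up.get(gram, 0.0), rf_down.get(gram, 0.0)
        if fu >= fd:
            cls, ratio = 'UpSpeak', (fu / fd if fd else float('inf'))
        else:
            cls, ratio = 'DownSpeak', (fd / fu if fu else float('inf'))
        for i, edge in enumerate(ratio_edges):
            if ratio <= edge:
                bins.setdefault((cls, len(gram), i), set()).add(gram)
                break
    return bins
```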

58 While the n-grams composing a set may themselves be indicative of social power lects, this method of grouping them makes no guarantees as to how indicative the overall set is. [sent-142, score-0.631]

59 Many features are weak on their own; they either occur rarely or occur frequently but only hint weakly at social information. [sent-149, score-0.341]

60 However, we generally achieved the best results using support vector machines, a machine learning method that has been successfully applied to many previous text classification problems. [sent-151, score-0.165]

61 After filtering for duplicates and removing empty or otherwise unusable emails, the total number of emails is 245K, containing roughly 90 million words. [sent-156, score-0.153]

62 However, this total includes emails to non-Enron employees, such as family members and employees of other corporations, emails to multiple people, and emails received from Enron employees without a known corporate role. [sent-157, score-0.622]

63 Because the author-recipient relationships of these emails could not be established, they were not included in our experiments. [sent-158, score-0.2]

64 From this information, we determined the author-recipient relationship by applying general rules about the structure of a corporate hierarchy (an email from an Employee to a CEO, for instance, is UpSpeak). [sent-161, score-0.205]
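As a sketch of such rules, with a hypothetical rank ordering over corporate roles (the paper's exact role inventory is not reproduced here):

```python
# Hypothetical rank table; only the Employee -> CEO example is from the text.
RANK = {'Employee': 0, 'Manager': 1, 'Director': 2, 'VP': 3, 'CEO': 4}

def label_pair(author_role, recipient_role):
    """Label an author-recipient link by comparing corporate ranks."""
    a, r = RANK[author_role], RANK[recipient_role]
    if a < r:
        return 'UpSpeak'      # e.g., an email from an Employee to a CEO
    if a > r:
        return 'DownSpeak'
    return 'PeerSpeak'
```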

65 The emails were pre-processed to eliminate text not written by the author, such as forwarded text and email headers. [sent-164, score-0.336]

66 Then, we used text authored by individuals in A as a training set and text authored by individuals in B as a test set. [sent-171, score-0.144]

67 We found that partitioning by authors was necessary to avoid artificially inflated scores, because the classifiers pick up aspects of particular authors’ language (idiolect) in addition to social power lect information. [sent-176, score-0.66]

68 It was not necessary to account for recipients because the emails did not contain text from the recipients. [sent-177, score-0.187]

69 Because preliminary experiments suggested that smaller text samples were harder to classify, the classifiers we describe in this paper were both trained and tested on a subset of the Enron corpus where at least 500 words of text were communicated from a specific author to a specific recipient. [sent-179, score-0.169]

70 Varying the weight given to training instances is a technique for creating a classifier that is cost-sensitive, since a classifier built on an unbalanced training set can be biased towards avoiding errors on the overrepresented class (Witten, 2005). [sent-182, score-0.24]
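The paper's classifiers were SVMs; the sketch below shows analogous instance weighting with scikit-learn's LinearSVC, an assumed stand-in for the authors' actual toolkit.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_weighted(X, y):
    """Weight instances inversely to class frequency so the classifier is
    not biased toward the overrepresented class."""
    classes, counts = np.unique(y, return_counts=True)
    weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    sample_weight = np.array([weight[label] for label in y])
    # equivalent shortcut: LinearSVC(class_weight='balanced')
    clf = LinearSVC()
    clf.fit(X, y, sample_weight=sample_weight)
    return clf
```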

71 A baseline classifier that always predicted the majority class would, on its own, achieve an accuracy of 74% on UpSpeak/DownSpeak classification of unweighted test set instances with a minimum length of 500 words. [sent-188, score-0.297]

72 [2 UpSpeak/DownSpeak Classifiers] In this section, we describe experiments on classification of interpersonal email communication into UpSpeak and DownSpeak. [sent-192, score-0.262]

73 For these experiments, only emails exchanged between two people related by a superior/subordinate power relationship were used; we trained on the weighted training set and evaluated against the weighted and unweighted test sets. [sent-193, score-0.464]

74 While the feature set was too small to produce notable results, we identified which features actually were indicative of lect. [sent-200, score-0.178]

75 The polite imperative feature was represented by the n-gram set: {please ^VB, please ‘comma’ ^VB}. [sent-202, score-0.234]

76 Features used in these experiments consist of single words which occurred a minimum of four times in the relevant lects (UpSpeak and DownSpeak) of the training set. [sent-204, score-0.183]

77 We then performed experiments with word bigrams, selecting as features those which occurred at least seven times in the relevant lects of the training set. [sent-206, score-0.232]

78 While the bigrams on their own were less successful than the unigrams, as seen in line (2), adding them to the unigram features improved accuracy against the test set, shown in line (3). [sent-208, score-0.147]

79 As we had speculated that including surface-level grammar information in the form of tag n-grams would be beneficial to our problem, we performed experiments using all tag unigrams and all tag bigrams occurring in the training set as features. [sent-209, score-0.246]

80 In addition to binning, we also reduced the total number of n-grams by setting higher frequency thresholds and relative frequency ratio thresholds. [sent-215, score-0.263]

81 Word n-grams were required to meet an absolute frequency threshold of 18 * nrlinks / n, where nrlinks is the number of links in each class (431 for UpSpeak and 328 for DownSpeak), and n is the number of words in the class. [sent-220, score-0.161]

82 The relative frequency ratio was required to be at least 1. [sent-221, score-0.18]

83 The tag sequences were required to meet an absolute frequency threshold of 20, but the same relative frequency ratio of 1. [sent-223, score-0.304]

84 Binning the n-grams into features was done based on both the length of the n-gram and the relative frequency ratio. [sent-225, score-0.183]

85 For example, one feature might represent the set of all word unigrams which have a relative frequency ratio between 1. [sent-226, score-0.271]

86 Before filtering for low information gain, we used six word n-gram bins per class (relative frequency ratios of 1. [sent-230, score-0.178]

87 To ascertain which feature reduction method had the greatest effect on performance – binning or setting a relative frequency ratio threshold – we performed an experiment in which all the n-grams that we used in the previous experiment were their own features. [sent-248, score-0.338]

88 Our goal was to have successful results using only statistically extracted features; however, we examined the effect of augmenting this feature set with the most indicative of the human-identified features – polite imperatives. [sent-251, score-0.24]

89 On the first iteration, we trained the classifier on the labeled training set, classified the instances of the unlabeled test set, and then added the instances of the test set along with their predicted class to the training set to be used for the next iteration. [sent-259, score-0.211]
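A sketch of this self-training loop; clf is any classifier with fit/predict (such as the weighted SVM above), and the loop mirrors the described procedure rather than the authors' exact protocol.

```python
import numpy as np

def self_train(clf, X_train, y_train, X_pool, iterations=3):
    """Retrain on gold labels plus the pool's predicted labels each round."""
    X, y = X_train, y_train
    for _ in range(iterations):
        clf.fit(X, y)
        pseudo = clf.predict(X_pool)            # label the unlabeled pool
        X = np.concatenate([X_train, X_pool])   # gold + pseudo-labeled data
        y = np.concatenate([y_train, pseudo])
    return clf
```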

90 After three iterations, the accuracy of the classifier when evaluated on the weighted test set improved to 82%, suggesting that our classifiers would benefit from more data. [sent-260, score-0.168]

91 [5 Conclusions and Future Research] We presented a corpus-based statistical learning approach to modeling social power relationships and experimental results for our methods. [sent-265, score-0.568]

92 To our knowledge, this is the first corpus-based approach to learning social power lects beyond those in direct reporting relationships. [sent-268, score-0.65]

93 Our work strongly suggests that statistically extracted features are an efficient and effective approach to modeling social information. [sent-269, score-0.395]

94 Our methods exploit many aspects of language use and effectively model social power information while using statistical methods at every stage to tease out the information we seek, significantly reducing language-, culture-, and lect-specific engineering needs. [sent-270, score-0.467]

95 Our text classification problem is similar to sentiment analysis in that there are class dependencies; for example, DownSpeak is more closely related to PeerSpeak than to UpSpeak. [sent-285, score-0.24]

96 In early, unpublished work, we had promising results with a generative model-based approach to SPM, and we plan to revisit it; language models are a natural fit for lect modeling. [sent-288, score-0.193]

97 Finally, we hope to investigate how SPM and SNA can enhance one another, and explore other lect classification problems for which the ground truth can be found. [sent-289, score-0.252]

98 Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. [sent-327, score-0.161]

99 Topic and role discovery in social networks with experiments on Enron and academic eMail. [sent-371, score-0.292]

100 Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. [sent-399, score-0.194]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('upspeak', 0.438), ('downspeak', 0.42), ('social', 0.292), ('lect', 0.193), ('lects', 0.183), ('power', 0.175), ('enron', 0.153), ('emails', 0.153), ('personality', 0.125), ('please', 0.123), ('authorship', 0.118), ('email', 0.115), ('binning', 0.111), ('spm', 0.11), ('vb', 0.098), ('sentiment', 0.094), ('frequency', 0.083), ('indicative', 0.082), ('morand', 0.073), ('peerspeak', 0.073), ('classifier', 0.072), ('unweighted', 0.07), ('polarity', 0.067), ('comma', 0.066), ('polite', 0.064), ('classifiers', 0.063), ('emotion', 0.063), ('choi', 0.061), ('namata', 0.059), ('classification', 0.059), ('employees', 0.056), ('bramsen', 0.055), ('attribution', 0.054), ('modeling', 0.054), ('politeness', 0.053), ('sociolinguistics', 0.053), ('class', 0.053), ('relative', 0.051), ('corporate', 0.051), ('freq', 0.051), ('features', 0.049), ('diehl', 0.048), ('levinson', 0.048), ('sna', 0.048), ('feature', 0.047), ('relationships', 0.047), ('ratio', 0.046), ('communication', 0.046), ('ngrams', 0.045), ('ni', 0.045), ('unigrams', 0.044), ('oberlander', 0.044), ('mairesse', 0.044), ('mosteller', 0.044), ('instances', 0.043), ('bins', 0.042), ('weka', 0.042), ('interpersonal', 0.042), ('tag', 0.041), ('hierarchy', 0.039), ('individuals', 0.038), ('pang', 0.038), ('author', 0.038), ('alm', 0.038), ('erikson', 0.037), ('fairclough', 0.037), ('nrlinks', 0.037), ('rart', 0.037), ('yejin', 0.037), ('roles', 0.036), ('strapparava', 0.035), ('witten', 0.034), ('text', 0.034), ('links', 0.034), ('bigrams', 0.034), ('classify', 0.033), ('sproat', 0.033), ('weighted', 0.033), ('line', 0.032), ('nowson', 0.032), ('galileo', 0.032), ('calo', 0.032), ('imperatives', 0.032), ('misclassifying', 0.032), ('organizational', 0.032), ('tactics', 0.032), ('mixed', 0.032), ('mccallum', 0.031), ('dimensionality', 0.031), ('cardie', 0.031), ('tb', 0.031), ('ngram', 0.03), ('ll', 0.03), ('vaithyanathan', 0.03), ('ve', 0.029), ('claire', 0.029), ('pos', 0.028), ('wallace', 0.028), ('sociolinguistic', 0.028), ('employee', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics.

2 0.24896775 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

Author: Apoorv Agarwal

Abstract: In my thesis, I propose to build a system that would enable extraction of social interactions from texts. To date I have defined a comprehensive set of social events and built a preliminary system that extracts social events from news articles. I plan to improve the performance of my current system by incorporating semantic information. Using domain adaptation techniques, I propose to apply my system to a wide range of genres. By extracting linguistic constructs relevant to social interactions, I will be able to empirically analyze different kinds of linguistic constructs that people use to express social interactions. Lastly, I will attempt to make convolution kernels more scalable and interpretable.

3 0.12053417 204 acl-2011-Learning Word Vectors for Sentiment Analysis

Author: Andrew L. Maas ; Raymond E. Daly ; Peter T. Pham ; Dan Huang ; Andrew Y. Ng ; Christopher Potts

Abstract: Unsupervised vector-based approaches to semantics can model rich lexical meanings, but they largely fail to capture sentiment information that is central to many word meanings and important for a wide range of NLP tasks. We present a model that uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. The proposed model can leverage both continuous and multi-dimensional sentiment information as well as non-sentiment annotations. We instantiate the model to utilize the document-level sentiment polarity annotations present in many online documents (e.g. star ratings). We evaluate the model using small, widely used sentiment and subjectivity corpora and find it outperforms several previously introduced methods for sentiment classification. We also introduce a large dataset of movie reviews to serve as a more robust benchmark for work in this area.

4 0.11710786 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

5 0.11344883 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification

Author: Danushka Bollegala ; David Weir ; John Carroll

Abstract: We describe a sentiment classification method that is applicable when we do not have any labeled data for a target domain but have some labeled data for multiple other domains, designated as the source domains. We automatically create a sentiment sensitive thesaurus using both labeled and unlabeled data from multiple source domains to find the association between words that express similar sentiments in different domains. The created thesaurus is then used to expand feature vectors to train a binary classifier. Unlike previous cross-domain sentiment classification methods, our method can efficiently learn from multiple source domains. Our method significantly outperforms numerous baselines and returns results that are better than or comparable to previous cross-domain sentiment classification methods on a benchmark dataset containing Amazon user reviews for different types of products.

6 0.11203165 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features

7 0.10189823 194 acl-2011-Language Use: What can it tell us?

8 0.10154372 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

9 0.089867875 292 acl-2011-Target-dependent Twitter Sentiment Classification

10 0.082931794 253 acl-2011-PsychoSentiWordNet

11 0.081627756 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

12 0.081338152 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models

13 0.081106551 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

14 0.07702215 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

15 0.075980544 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis

16 0.075425245 105 acl-2011-Dr Sentiment Knows Everything!

17 0.074520193 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages

18 0.073934846 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

19 0.073471755 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification

20 0.072434321 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, 0.142), (2, 0.054), (3, -0.028), (4, -0.002), (5, 0.037), (6, 0.035), (7, -0.027), (8, 0.002), (9, 0.009), (10, -0.038), (11, -0.019), (12, -0.01), (13, 0.049), (14, -0.02), (15, -0.01), (16, -0.037), (17, -0.016), (18, 0.004), (19, -0.071), (20, 0.079), (21, -0.009), (22, -0.058), (23, 0.049), (24, -0.026), (25, -0.028), (26, 0.04), (27, 0.02), (28, -0.022), (29, -0.046), (30, -0.007), (31, 0.029), (32, -0.105), (33, 0.055), (34, 0.085), (35, -0.033), (36, -0.109), (37, -0.031), (38, -0.082), (39, 0.158), (40, 0.064), (41, 0.087), (42, 0.051), (43, -0.091), (44, -0.14), (45, 0.082), (46, -0.114), (47, -0.03), (48, -0.069), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90577161 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1

2 0.84466606 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

Author: Apoorv Agarwal

Abstract: In my thesis, I propose to build a system that would enable extraction of social interactions from texts. To date I have defined a comprehensive set of social events and built a preliminary system that extracts social events from news articles. I plan to improve the performance of my current system by incorporating semantic information. Using domain adaptation techniques, I propose to apply my system to a wide range of genres. By extracting linguistic constructs relevant to social interactions, I will be able to empirically analyze different kinds of linguistic constructs that people use to express social interactions. Lastly, I will attempt to make convolution kernels more scalable and interpretable.

3 0.79718012 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

4 0.7451421 194 acl-2011-Language Use: What can it tell us?

Author: Marjorie Freedman ; Alex Baron ; Vasin Punyakanok ; Ralph Weischedel

Abstract: For 20 years, information extraction has focused on facts expressed in text. In contrast, this paper is a snapshot of research in progress on inferring properties and relationships among participants in dialogs, even though these properties/relationships need not be expressed as facts. For instance, can a machine detect that someone is attempting to persuade another to action or to change beliefs or is asserting their credibility? We report results on both English and Arabic discussion forums. 1

5 0.7212137 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that—contrary to current belief—naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

6 0.63898402 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

7 0.63699234 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

8 0.63301164 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

9 0.57189941 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

10 0.56284207 218 acl-2011-MemeTube: A Sentiment-based Audiovisual System for Analyzing and Displaying Microblog Messages

11 0.51664013 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System

12 0.50604862 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

13 0.49511629 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

14 0.4891164 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

15 0.48861584 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

16 0.48518267 55 acl-2011-Automatically Predicting Peer-Review Helpfulness

17 0.45948684 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts

18 0.44871452 150 acl-2011-Hierarchical Text Classification with Latent Concepts

19 0.44785884 74 acl-2011-Combining Indicators of Allophony

20 0.43709633 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (5, 0.059), (17, 0.04), (26, 0.048), (36, 0.013), (37, 0.097), (39, 0.05), (41, 0.049), (53, 0.014), (55, 0.029), (59, 0.04), (63, 0.182), (72, 0.044), (88, 0.012), (91, 0.045), (93, 0.011), (96, 0.131), (97, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83610499 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

same-paper 2 0.828215 133 acl-2011-Extracting Social Power Relationships from Natural Language

Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso

Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects 1. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics. 1

3 0.71454418 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

Author: Stefan Rud ; Massimiliano Ciaramita ; Jens Muller ; Hinrich Schutze

Abstract: We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.

4 0.71304792 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

Author: Ines Rehbein ; Josef Ruppenhofer

Abstract: Active Learning (AL) has been proposed as a technique to reduce the amount of annotated data needed in the context of supervised classification. While various simulation studies for a number of NLP tasks have shown that AL works well on goldstandard data, there is some doubt whether the approach can be successful when applied to noisy, real-world data sets. This paper presents a thorough evaluation of the impact of annotation noise on AL and shows that systematic noise resulting from biased coder decisions can seriously harm the AL process. We present a method to filter out inconsistent annotations during AL and show that this makes AL far more robust when applied to noisy data.

5 0.71148127 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

Author: Shane Bergsma ; David Yarowsky ; Kenneth Church

Abstract: Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.

6 0.71005976 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

7 0.70852274 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

8 0.70710927 311 acl-2011-Translationese and Its Dialects

9 0.70702726 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing

10 0.70659167 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

11 0.70547569 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

12 0.70524341 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

13 0.70503855 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

14 0.70497441 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

15 0.70411503 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

16 0.70274782 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

17 0.70128977 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

18 0.70102429 44 acl-2011-An exponential translation model for target language morphology

19 0.70091468 292 acl-2011-Target-dependent Twitter Sentiment Classification

20 0.70068479 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora