emnlp emnlp2012 emnlp2012-134 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Katja Filippova
Abstract: We consider the task of predicting the gender of the YouTube1 users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with the accuracy above 90%. We also show that the gender can be predicted from language alone (89%). A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. We also investigate how the two views (linguistic and social) can be combined and analyse how prediction accuracy changes over different age groups.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract We consider the task of predicting the gender of the YouTube1 users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. [sent-4, score-1.446]
2 We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with the accuracy above 90%. [sent-5, score-1.903]
3 We also show that the gender can be predicted from language alone (89%). [sent-6, score-0.727]
4 A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. [sent-7, score-0.934]
5 The last decade has seen several studies investigating the relationship between the language and the demographics of the users of blogs or Twitter (see Sec. [sent-27, score-0.409]
6 Most of those studies used social network sites to collect labeled data: samples of text together with the demographic variable. [sent-29, score-0.462]
7 However, they did not analyse how social environment affects language, although very similar questions have been recently posed (but not yet answered) by Ellist (2009). [sent-30, score-0.342]
8 In particular, we consider the task of user gender prediction on YouTube and contrast two information sources: (1) the comments written by the user and (2) her social neighborhood as defined by the bipartite user-video graph. [sent-32, score-1.441]
9 We use the comments to train a gender classifier on a variety of linguistic features. [sent-33, score-0.807]
10 We also introduce a simple gender propagation procedure to predict person’s gender from the user-video graph. [sent-34, score-1.522]
11 The paper is organized as follows: we first review related work on the language of social media and user demographics (Sec. [sent-41, score-0.602]
12 5) and the experiments on supervised learning of gender from language (Sec. [sent-46, score-0.727]
13 (2) A standard goal of an NLP study is to build an automatic system which accurately solves a given task which in the case of demographics is predicting user age, gender or country of origin. [sent-50, score-1.128]
14 Then we briefly summarize a selection of work on gender prediction. (Footnote 5: although it might be more correct to talk about the user's sex in place of gender (Eckert & McConnell-Ginet, 2003), we stick to the terminology adopted in previous NLP research.) [sent-53, score-1.454]
15 1 Language and demographics analysis. Previous sociolinguistic studies mostly checked hypotheses formulated before the widespread use of the Internet, such as that women use hedges more often (Lakoff, 1973) or that men use more negations (Mulac et al. [sent-56, score-0.351]
16 Looking at a number of stylistic features which had previously been claimed to be predictive of gender (Argamon et al. [sent-62, score-0.727]
17 , 2004), such as personal pronouns, determiners and other function words, they find no gender effect. [sent-64, score-0.727]
18 , age, gender, sexuality) and find language features indicative of gender (e. [sent-68, score-0.755]
19 2 Demographics prediction from language. The studies we review here used supervised machine learning to obtain models for predicting gender or age. [sent-78, score-0.806]
20 Also, generative approaches have been applied to discover associations between language and demographics of social media users (Eisenstein et al. [sent-80, score-0.545]
21 For supervised approaches, major feature sources are the text the user has written and also her profile which may list the name, interests, friends, etc. [sent-82, score-0.411]
22 There have also been studies which did not look at the language at all but considered the social environment only. [sent-83, score-0.338]
23 What they found is that there is a remarkable correlation between the age and the location of the user and those of her friends, although there are interesting exceptions. [sent-85, score-0.402]
24 (201 1) train a gender classifier on tweets with word and character-based ngram features achieving accuracy of 75. [sent-87, score-0.799]
25 Other kinds of sociolinguistic features and a different classifier have been applied to gender prediction on tweets by Rao & Yarowsky (2010). [sent-92, score-0.923]
26 Nowson & Oberlander (2006) achieve 92% accuracy on the gender prediction task using ngram features only. [sent-93, score-0.766]
27 However, the ngram features were preselected based on whether they occurred with significant relative frequency in the language of one gender over the other. [sent-95, score-0.727]
28 Yan & Yan (2006) train a Naive Bayes classifier to predict the gender of a blog entry author. [sent-97, score-0.857]
29 Indeed, a topic which has not yet been investigated much in the reviewed studies on language and user demographics is the relationship between the language of the user and her social environment. [sent-114, score-0.647]
30 2) mostly relied on language and user profile features and considered users in isolation. [sent-124, score-0.529]
31 An exception to this is Garera & Yarowsky (2009) who showed that, for gender prediction in a dialogue, it helps to know the interlocutor’s gender. [sent-125, score-0.766]
32 For example, it has been shown that the more a person is integrated in a certain community and the tighter the ties of the social network are, the more prominent are the representative traits of that community in the language of the person (Milroy & Milroy, 1992; Labov, 1994). [sent-134, score-0.355]
33 In our study we adopt a similar view and analyse the implications it has for gender prediction. [sent-135, score-0.771]
34 Given its social nature, does the language reflect the norms of a community the user belongs to or the actual value of a demographic variable? [sent-136, score-0.43]
35 We use language-based features and a supervised approach to gender prediction to analyse the relationship between the language and the variable to be predicted. [sent-139, score-0.81]
36 To our knowledge, we are the first to question whether it is really the in- born gender that language-based classifiers learn to predict. [sent-140, score-0.727]
37 Concerning the prediction task, how can we make use of what we know about the user’s social environment to reduce the effect of noise? [sent-145, score-0.337]
38 How can we benefit from the language samples from the users whose gender we do not know at all? [sent-146, score-0.873]
39 When analyzing the language of a user, how much are its gender-specific traits due to the user's inborn gender, and to what extent can they be explained by her social environment? [sent-148, score-1.084]
40 Using our modeling technique and a language-based gender classifier, how is its performance affected by what we know about the online social environment of the user? [sent-149, score-1.047]
41 Concerning gender predictions across different age groups, how does classifier performance change? [sent-151, score-0.955]
42 Judging from the online communication, do teenagers signal their gender identity more than older people? [sent-152, score-0.903]
43 In terms of classifier accuracy, is it easier to predict a teenager’s gender than the gender of an adult? [sent-153, score-1.531]
44 4 Data. Most social networks strive to protect user privacy and by default do not expose profile information or reveal user activity (e. [sent-156, score-0.787]
45 Most registered YouTube users list their gender, age and location on their profile pages, which, like their comments, are publicly available. [sent-161, score-0.485]
46 From the users, videos and the comment relationship we build an affiliation graph (Easley & Kleinberg, 2010): a user and a video are connected if the user commented on the video (Fig. [sent-184, score-0.928]
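The affiliation-graph construction described here can be written down in a few lines. This is a minimal sketch, not the paper's code; the function name and the (user, video) pair input format are our assumptions:

```python
from collections import defaultdict

def build_affiliation_graph(comments):
    """Build a bipartite user-video affiliation graph: a user and a
    video are connected iff the user commented on the video.

    `comments` is an iterable of (user_id, video_id) pairs."""
    user_to_videos = defaultdict(set)
    video_to_users = defaultdict(set)
    for user, video in comments:
        user_to_videos[user].add(video)
        video_to_users[video].add(user)
    return user_to_videos, video_to_users
```

Both adjacency maps are kept so that information can later be pushed from users to videos and back again.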
47 The users' gender distribution is presented in Table 1. [sent-196, score-0.727]
48 A random sample from a pool of users without the 20-comments threshold showed that there are more male commenters overall, although the difference is less remarkable for teenagers: 58% of the teenagers with known gender are male as opposed to 74% and 79% for the age groups 20-29 and 30+. [sent-199, score-1.465]
49 Although we did not filter users based on their location or mother tongue, and many users comment in multiple languages, the comment set is overwhelmingly English. [sent-201, score-0.378]
50 5 Gender propagation. We first consider the user's social environment to see whether there is any correlation between the gender of a user and the gender distribution in her vicinity, independent of the language. [sent-206, score-1.986]
51 We send the gender information (female, male or unknown) to all the videos the user has commented on. [sent-209, score-1.276]
52 We send the gender distributions from every video back to all the users who commented on it and average over all the videos the user is connected with (see Fig. [sent-213, score-1.392]
53 However, in doing so we adjust the distribution for every user so that her own demographics is excluded. [sent-215, score-0.401]
54 This way we have a fair setting where the original gender of the user is never included in what she gets back from the connected videos. [sent-216, score-0.93]
55 Thus, the gender of a user contributes to the vicinity distributions of all the neighbors but not to her own final gender distribution. [sent-217, score-1.778]
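The two-step propagation with self-exclusion described in this passage can be sketched as follows. This is a hedged illustration (names and data structures are ours, not the paper's): each video aggregates the genders of its commenters, leaving out the user being predicted, and the user then averages the distributions of her videos:

```python
from collections import Counter

GENDERS = ("f", "m", "n")  # female, male, unknown

def video_distribution(commenters, gender, exclude=None):
    """Gender distribution over a video's commenters, optionally
    leaving out one user so her own label is never echoed back."""
    users = [u for u in commenters if u != exclude]
    counts = Counter(gender[u] for u in users)
    total = sum(counts.values()) or 1
    return {g: counts[g] / total for g in GENDERS}

def propagate(user, user_to_videos, video_to_users, gender):
    """Average the per-video gender distributions over all videos the
    user commented on, excluding the user's own gender from each."""
    dists = [video_distribution(video_to_users[v], gender, exclude=user)
             for v in user_to_videos[user]]
    return {g: sum(d[g] for d in dists) / len(dists) for g in GENDERS}
```

Excluding the user from every video's distribution gives the "fair setting" the text describes: a user's own label contributes to her neighbors' vicinity distributions but never to her own.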
56 Note that if the user’s gender has no influence on her choice of videos, then, on average, we would expect every video to have the same distribution as in our data overall: 62% male, 26% female and 12% unknown (Table 1). [sent-224, score-0.878]
57 To obtain a single gender prediction from the propagated distribution, for a given user we select the gender class (female or male) which got more mass. (Figure legend: blue=male, red=female, grey=unknown.) [sent-225, score-1.788]
58 The exact procedure is as follows: given user u connected with videos Vu = {v1, ..., vm}, there are m gender distributions sent to u: PV(u) = {p(g|vi) : 1 ≤ i ≤ m, g ∈ {f, m, n}}. [sent-230, score-0.354] [sent-233, score-0.727]
60 Given the fact that 70% of our users (62/(26 + 62)) with known gender are male, we select the female gender if (a) it got more than zero mass and at least as much mass as male: pˆ(f) > 0 ∧ pˆ(f) ≥ pˆ(m), or (b) it got at least τ of the mass: pˆ(f) ≥ τ. [sent-237, score-1.705]
61 The fact that the optimal τ value is different from the overall proportion of females (26%) is not surprising given that we aggregate per video distributions and not raw user counts. [sent-246, score-0.354]
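The female/male decision rule just described (predict female if it has non-zero mass and at least as much as male, or at least τ of the mass) can be written down directly. Since the paper tunes τ on the aggregated per-video distributions and its optimal value is not reproduced here, τ is left as a free parameter in this sketch:

```python
def predict_gender(dist, tau):
    """Decision rule over a propagated gender distribution `dist`
    (keys "f", "m", "n"): predict female if it got
    (a) non-zero mass and at least as much mass as male, or
    (b) at least `tau` of the mass; otherwise predict male."""
    f = dist.get("f", 0.0)
    m = dist.get("m", 0.0)
    if (f > 0 and f >= m) or f >= tau:
        return "f"
    return "m"
```

The τ clause compensates for the male-skewed prior: a female prediction is allowed even when male mass dominates, provided the female share clears the threshold.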
62 The baseline of assigning all the users the majority class (all male) provides us with an accuracy of 70%, the proportion of males among the users with known gender. [sent-248, score-0.38]
63 Although the purpose of this section is not to present a gender prediction method, we find it worth emphasizing that 90% accuracy is remarkable given that we only look at the immediate user vicinity. [sent-249, score-1.009]
64 In the following section we are going to investigate how this social view on demographics can help us in the supervised learning of gender. [sent-250, score-0.399]
65 6 Supervised learning of gender. In this section we start by describing our first gender prediction experiment and several extensions to it, and then turn to the results. [sent-254, score-1.493]
66 We do not rely on any information from the social environment of the user and do not use any features extracted from the user profile, like name, which would make the gender prediction task considerably easier (Burger et al. [sent-257, score-1.47]
67 Finally, we do not extract any features from the videos the user has commented on because our goal here is to explore the language as a sole source of information. [sent-259, score-0.401]
68 Here we simply want to investigate the extent to which the language of the user is indicative of her gender which is found in the profile and which, ignoring the noise, corresponds to the inborn gender. [sent-260, score-1.226]
69 We take 80% of the users for training and generate a training instance for every user who made her gender visible on the profile page (4. [sent-264, score-1.256]
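A minimal language-only classifier in this spirit might look as follows. This is a hedged sketch, not the paper's system: the learner (a tiny multinomial Naive Bayes, in the spirit of the Yan & Yan work cited above) and the word-unigram features are stand-ins for the paper's richer linguistic feature set:

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels):
    """Train a multinomial Naive Bayes model over word unigrams with
    add-one smoothing; texts[i] is one user's concatenated comments,
    labels[i] the profile gender label."""
    class_counts = Counter(labels)      # documents (users) per class
    word_counts = defaultdict(Counter)  # word counts per class
    vocab = set()
    for text, y in zip(texts, labels):
        for w in text.lower().split():
            word_counts[y][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the most probable class for a new text."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y in class_counts:
        lp = math.log(class_counts[y] / n_docs)  # class prior
        total = sum(word_counts[y].values())
        for w in text.lower().split():
            if w in vocab:  # ignore out-of-vocabulary words
                lp += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

One instance per user, as in the experiment described above, and the label can come either from the profile or, in the later experiments, from the propagation procedure.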
70 The first question we consider is how the affiliation graph and propagated gender can be used to enhance our data for the supervised experiments. [sent-270, score-0.883]
71 One possibility would be to train a classifier on a refined set of users by eliminating all those whose reported gender did not match the gender predicted by the neighborhood. [sent-271, score-1.666]
72 Another possibility would be to extend the training set with the users who did not make their gender visible to the public but whose gender we can predict from their vicinity. [sent-273, score-1.637]
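The two data-enhancement strategies described here, refining the training set by dropping profile/vicinity mismatches and extending it with vicinity-labeled users, amount to simple set operations. A sketch under our own naming and dict-of-labels layout (not the paper's code):

```python
def refine_training_set(profile_gender, vicinity_gender):
    """Strategy 1: keep only users whose reported profile gender
    agrees with the gender propagated from the vicinity."""
    return {u: g for u, g in profile_gender.items()
            if vicinity_gender.get(u) == g}

def extend_training_set(profile_gender, vicinity_gender):
    """Strategy 2: add users with no visible profile gender, labeled
    with the propagated vicinity gender (assumed to be "f" or "m")."""
    extra = {u: g for u, g in vicinity_gender.items()
             if u not in profile_gender}
    return {**profile_gender, **extra}
```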
73 The next question posed in the motivation section is as follows: Does the fact that language is a social phenomenon and that it is being shaped by the social environment of the speaker impact our gender classifier? [sent-278, score-1.271]
74 If there are truly gender-specific language traits and they are reflected in our features, then we should not observe any significant difference between the prediction results on the users whose gender matches the gender propagated from the vicinity and those whose gender does not match. [sent-279, score-2.597]
75 A contrary hypothesis would be that what the classifier actually learns to predict is not so much the inborn gender as a social gender. [sent-280, score-0.366]
76 In this case, the classifier trained on the propagated gender labels should be more accurate than the one trained on the labels extracted from the profiles. [sent-281, score-0.834]
77 Finally, we look at how gender predictions change with age and train three age-specific models to predict gender for teenagers (13-19), people in their twenties (20-29) and people over thirty (30+); the age is also extracted from the profiles. [sent-284, score-2.057]
78 These groups are identified in order to check whether teenagers tend to signalize their gender identity more than older people, a hypothesis investigated earlier on a sample of blog posts (Huffaker & Calvert, 2005). [sent-285, score-1.047]
79 In order to investigate the relationship between the social environment of a person, her gender and the language, we split the users from the test set into two groups: those whose profile gender matched the gender propagated from the vicinity and those for whom there was a mismatch. [sent-292, score-2.993]
Table 4: Results for users whose profile gender matches/differs from the vicinity gender. [sent-298, score-1.174]
81 In the next experiment we refined the training set by removing all the users whose vicinity gender did not match the gender reported in the profile. [sent-300, score-1.747]
82 The refined model performed slightly (< 1%) better than the starting one on the users whose vicinity and the profile genders matched but got very poor results on the users with a gender mismatch, the accuracy being as low as 37%. [sent-303, score-1.418]
83 In another experiment we extended the training data with the users whose gender was unknown but was predicted with the propagation method. [sent-306, score-0.904]
84 To investigate this, we looked at the results of the model which knew nothing about the profile gender but was trained to predict the vicinity gender instead (Table 6). [sent-315, score-1.835]
85 This model relied on the exact same set of features but both for training and testing it used the gender labels obtained from the propagation procedure described in Section 5. [sent-316, score-0.758]
86 According to all the evaluation metrics, for both genders the performance of the classifier trained and tested on the propagated gender is higher (cf. [sent-318, score-0.881]
87 This indicates that it is the predominant environment gender that a language-based classifier is better at learning rather than the inborn gender. [sent-320, score-0.989]
88 Finally, to address the question of whether gender differences are more prominent, and thus easier to identify, in the language of younger people, we looked at the accuracy of gender predictions across three age groups. [sent-322, score-1.729]
89 For a comparison, the accuracy of the propagated gender (Prop-acc) also decreases from younger to older age groups although it is slightly higher than that of language-based predictions. [sent-326, score-1.082]
90 One conclusion we can make at this point is that a teenager's gender is easier to predict from the language, which is in line with the hypothesis that younger people signalize their gender identities more than older people. [sent-327, score-1.649]
91 Another observation is that, as the person gets older, we can be less sure about her gender by looking at her social environment. [sent-328, score-0.972]
92 This in turn might explain why there are fewer gender signals in the language of a person: the environment becomes more mixed, and the influence of both genders becomes more balanced. [sent-329, score-0.871]
93 We also investigated a few ways in which the performance of a language-based classifier can be enhanced by the social aspect, compared the accuracy of predictions across different age groups, and found support for hypotheses made in earlier sociolinguistic studies. [sent-333, score-0.584]
94 We are not the first to predict gender from language features with online data. [sent-334, score-0.764]
95 However, to our knowledge, we are the first to contrast the two views, social and language-based, using online data and to question whether there is a clear understanding of what gender classifiers actually learn to predict from language. [sent-335, score-0.965]
96 Our results indicate that from the standard language cues we are better at predicting a social gender, that is the gender defined by the environment of a person, rather than the inborn gender. [sent-336, score-1.113]
97 On the practical side, it may have implications for targeted advertisement as it enriches the understanding of what gender classifiers predict. [sent-338, score-0.727]
98 The identity of bloggers: Openness and gender in personal weblogs. [sent-542, score-0.761]
99 Inferring gender of movie reviewers: Exploiting writing style, content and metadata. [sent-547, score-0.727]
100 Mining user home location and gender from Flickr tags. [sent-553, score-0.93]
wordName wordTfidf (topN-words)
[('gender', 0.727), ('user', 0.203), ('social', 0.201), ('demographics', 0.198), ('profile', 0.18), ('age', 0.159), ('videos', 0.151), ('users', 0.146), ('youtube', 0.132), ('male', 0.126), ('vicinity', 0.121), ('teenagers', 0.099), ('environment', 0.097), ('video', 0.096), ('inborn', 0.088), ('males', 0.088), ('sociolinguistic', 0.085), ('propagated', 0.067), ('affiliation', 0.066), ('herring', 0.066), ('females', 0.055), ('female', 0.055), ('blog', 0.053), ('people', 0.049), ('commented', 0.047), ('weblogs', 0.047), ('genders', 0.047), ('analyse', 0.044), ('person', 0.044), ('blogger', 0.044), ('milroy', 0.044), ('teenager', 0.044), ('younger', 0.044), ('traits', 0.043), ('older', 0.043), ('comment', 0.043), ('looked', 0.043), ('groups', 0.042), ('burger', 0.042), ('studies', 0.04), ('comments', 0.04), ('classifier', 0.04), ('remarkable', 0.04), ('prediction', 0.039), ('argamon', 0.038), ('predict', 0.037), ('predominant', 0.037), ('profiles', 0.037), ('yarowsky', 0.034), ('identity', 0.034), ('baxter', 0.033), ('bloggers', 0.033), ('bybee', 0.033), ('calvert', 0.033), ('huffaker', 0.033), ('tweets', 0.032), ('koppel', 0.032), ('propagation', 0.031), ('predictions', 0.029), ('indicative', 0.028), ('hypotheses', 0.028), ('yan', 0.028), ('spring', 0.028), ('written', 0.028), ('posts', 0.027), ('sociolinguistics', 0.026), ('interests', 0.026), ('demographic', 0.026), ('eckert', 0.026), ('refined', 0.026), ('blogs', 0.025), ('got', 0.025), ('march', 0.025), ('analyzing', 0.025), ('rao', 0.023), ('interactions', 0.023), ('network', 0.023), ('graph', 0.023), ('speaker', 0.023), ('friends', 0.022), ('baluja', 0.022), ('coulmas', 0.022), ('easley', 0.022), ('ellist', 0.022), ('hudson', 0.022), ('kapidzic', 0.022), ('labov', 0.022), ('lakoff', 0.022), ('languagebased', 0.022), ('livejournal', 0.022), ('mackinnon', 0.022), ('mulac', 0.022), ('paolillo', 0.022), ('rosenthal', 0.022), ('send', 0.022), ('shaped', 0.022), ('shimoni', 0.022), ('signalize', 0.022), ('thirty', 
0.022), ('trudgill', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 134 emnlp-2012-User Demographics and Language in an Implicit Social Network
2 0.16947301 120 emnlp-2012-Streaming Analysis of Discourse Participants
Author: Benjamin Van Durme
Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batchprocessing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space efficient streaming classification.
Author: Ahmed Hassan ; Amjad Abu-Jbara ; Dragomir Radev
Abstract: A mixture of positive (friendly) and negative (antagonistic) relations exist among users in most social media applications. However, many such applications do not allow users to explicitly express the polarity of their interactions. As a result most research has either ignored negative links or was limited to the few domains where such relations are explicitly expressed (e.g. Epinions trust/distrust). We study text exchanged between users in online communities. We find that the polarity of the links between users can be predicted with high accuracy given the text they exchange. This allows us to build a signed network representation of discussions; where every edge has a sign: positive to denote a friendly relation, or negative to denote an antagonistic relation. We also connect our analysis to social psychology theories of balance. We show that the automatically predicted networks are consistent with those theories. Inspired by that, we present a technique for identifying subgroups in discussions by partitioning singed networks representing them.
4 0.097669654 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems
Author: Nina Dethlefs ; Helen Hastie ; Verena Rieser ; Oliver Lemon
Abstract: Incremental processing allows system designers to address several discourse phenomena that have previously been somewhat neglected in interactive systems, such as backchannels or barge-ins, but that can enhance the responsiveness and naturalness of systems. Unfortunately, prior work has focused largely on deterministic incremental decision making, rendering system behaviour less flexible and adaptive than is desirable. We present a novel approach to incremental decision making that is based on Hierarchical Reinforcement Learning to achieve an interactive optimisation of Information Presentation (IP) strategies, allowing the system to generate and comprehend backchannels and barge-ins, by employing the recent psycholinguistic hypothesis of information density (ID) (Jaeger, 2010). Results in terms of average rewards and a human rating study show that our learnt strategy outperforms several baselines that are | v not sensitive to ID by more than 23%.
5 0.075435489 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management
Author: Aciel Eshky ; Ben Allison ; Mark Steedman
Abstract: User simulation is frequently used to train statistical dialog managers for task-oriented domains. At present, goal-driven simulators (those that have a persistent notion of what they wish to achieve in the dialog) require some task-specific engineering, making them impossible to evaluate intrinsically. Instead, they have been evaluated extrinsically by means of the dialog managers they are intended to train, leading to circularity of argument. In this paper, we propose the first fully generative goal-driven simulator that is fully induced from data, without hand-crafting or goal annotation. Our goals are latent, and take the form of topics in a topic model, clustering together semantically equivalent and phonetically confusable strings, implicitly modelling synonymy and speech recognition noise. We evaluate on two standard dialog resources, the Communicator and Let’s Go datasets, and demonstrate that our model has substantially better fit to held out data than competing approaches. We also show that features derived from our model allow significantly greater improvement over a baseline at distinguishing real from randomly permuted dialogs.
6 0.065885648 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge
7 0.064480633 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
8 0.063430965 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
9 0.053754397 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
10 0.040270787 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
11 0.039771087 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
12 0.036691822 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
13 0.035837173 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
14 0.033935852 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
15 0.028916731 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
16 0.027497126 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls
17 0.027069353 97 emnlp-2012-Natural Language Questions for the Web of Data
18 0.027039645 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
19 0.02688501 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic
20 0.026861999 41 emnlp-2012-Entity based QA Retrieval
topicId topicWeight
[(0, 0.107), (1, 0.068), (2, 0.013), (3, 0.054), (4, -0.04), (5, -0.025), (6, 0.015), (7, -0.06), (8, 0.097), (9, 0.059), (10, 0.11), (11, -0.195), (12, -0.283), (13, -0.046), (14, -0.021), (15, 0.005), (16, 0.045), (17, -0.041), (18, -0.115), (19, -0.08), (20, -0.248), (21, -0.142), (22, -0.036), (23, -0.089), (24, 0.04), (25, 0.144), (26, -0.037), (27, 0.141), (28, 0.06), (29, 0.185), (30, -0.09), (31, 0.101), (32, 0.145), (33, 0.065), (34, 0.124), (35, 0.045), (36, -0.1), (37, -0.088), (38, -0.012), (39, 0.053), (40, -0.011), (41, 0.01), (42, 0.009), (43, 0.092), (44, 0.04), (45, -0.062), (46, 0.027), (47, -0.148), (48, -0.048), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.98404002 134 emnlp-2012-User Demographics and Language in an Implicit Social Network
2 0.70748615 120 emnlp-2012-Streaming Analysis of Discourse Participants
4 0.47585267 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities
Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li
Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model the smoothness of a state sequence, we propose a novel function which can capture the state context. Experiments on a large Twitter dataset show that our methods are very effective.
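The core idea of smooth burst labelling can be illustrated with a simple two-state dynamic program: each time step is "normal" (0) or "bursty" (1), an emission cost prefers the bursty state when activity exceeds a baseline, and a transition penalty enforces smoothness of the state sequence. This is a generic Kleinberg-style sketch, not the paper's novel smoothness function.

```python
def label_bursts(counts, baseline, switch_cost=2.0):
    """Viterbi-style labelling of a count series into normal/bursty states."""
    def emit(c, state):
        # cost of explaining count c under the normal vs. bursty state
        return abs(c - baseline) if state == 0 else abs(c - 3 * baseline)

    INF = float("inf")
    prev = [emit(counts[0], 0), emit(counts[0], 1)]
    back = []
    for c in counts[1:]:
        cur, ptr = [INF, INF], [0, 0]
        for s in (0, 1):
            for p in (0, 1):
                cost = prev[p] + (switch_cost if p != s else 0.0) + emit(c, s)
                if cost < cur[s]:
                    cur[s], ptr[s] = cost, p
        prev, back = cur, back + [ptr]

    # backtrack from the cheaper final state
    state = 0 if prev[0] <= prev[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

counts = [2, 3, 2, 9, 10, 8, 2, 3]
labels = label_bursts(counts, baseline=2.5)
```

The switch cost keeps the state sequence from flickering on single noisy counts, which is the smoothness property the abstract emphasizes.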
5 0.34993002 60 emnlp-2012-Generative Goal-Driven User Simulation for Dialog Management
Author: Aciel Eshky ; Ben Allison ; Mark Steedman
Abstract: User simulation is frequently used to train statistical dialog managers for task-oriented domains. At present, goal-driven simulators (those that have a persistent notion of what they wish to achieve in the dialog) require some task-specific engineering, making them impossible to evaluate intrinsically. Instead, they have been evaluated extrinsically by means of the dialog managers they are intended to train, leading to circularity of argument. In this paper, we propose the first fully generative goal-driven simulator that is fully induced from data, without hand-crafting or goal annotation. Our goals are latent, and take the form of topics in a topic model, clustering together semantically equivalent and phonetically confusable strings, implicitly modelling synonymy and speech recognition noise. We evaluate on two standard dialog resources, the Communicator and Let’s Go datasets, and demonstrate that our model has substantially better fit to held out data than competing approaches. We also show that features derived from our model allow significantly greater improvement over a baseline at distinguishing real from randomly permuted dialogs.
6 0.34121111 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems
7 0.27655321 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media
8 0.26253703 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid
9 0.24332628 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge
10 0.18177149 62 emnlp-2012-Identifying Constant and Unique Relations by using Time-Series Text
11 0.17607817 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
12 0.15361805 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution
13 0.15319344 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification
14 0.15177025 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
15 0.14586836 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
16 0.13801572 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls
17 0.12955077 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
18 0.1263576 41 emnlp-2012-Entity based QA Retrieval
19 0.12488353 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
20 0.12485167 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic
topicId topicWeight
[(2, 0.025), (16, 0.02), (25, 0.01), (34, 0.042), (45, 0.01), (60, 0.079), (63, 0.037), (64, 0.018), (65, 0.027), (70, 0.018), (73, 0.02), (74, 0.037), (76, 0.085), (80, 0.02), (86, 0.041), (87, 0.374), (95, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.78769463 134 emnlp-2012-User Demographics and Language in an Implicit Social Network
Author: Katja Filippova
Abstract: We consider the task of predicting the gender of YouTube users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with an accuracy above 90%. We also show that gender can be predicted from language alone (89%). A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. We also investigate how the two views (linguistic and social) can be combined and analyse how prediction accuracy changes over different age groups.
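One round of the propagation the abstract describes — pushing gender labels through shared videos in the bipartite user–video affiliation graph — can be sketched as follows. A video's score is the average over its commenters with known gender, and an unlabelled user is scored by averaging over the videos she commented on. This is a minimal illustrative sketch; the paper's actual propagation may differ in details.

```python
def propagate_gender(user_videos, known):
    """user_videos: {user: [video, ...]}.
    known: {user: score}, e.g. 1.0 for male, 0.0 for female."""
    # step 1: each video aggregates the labels of its known commenters
    video_scores = {}
    for user, videos in user_videos.items():
        if user not in known:
            continue
        for v in videos:
            video_scores.setdefault(v, []).append(known[user])

    # step 2: each unlabelled user averages over her videos' scores
    predictions = {}
    for user, videos in user_videos.items():
        if user in known:
            continue
        scores = [sum(video_scores[v]) / len(video_scores[v])
                  for v in videos if v in video_scores]
        if scores:
            predictions[user] = sum(scores) / len(scores)
    return predictions

# toy affiliation graph: user "x" is unlabelled
user_videos = {"a": ["v1", "v2"], "b": ["v1"], "c": ["v2", "v3"], "x": ["v1", "v2"]}
known = {"a": 1.0, "b": 1.0, "c": 0.0}
preds = propagate_gender(user_videos, known)
```

Here "x" inherits a mostly-male score because her videos are dominated by male commenters — exactly the "gender predominant in the user's environment" signal the abstract discusses.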
Author: Ahmed Hassan ; Amjad Abu-Jbara ; Dragomir Radev
Abstract: A mixture of positive (friendly) and negative (antagonistic) relations exists among users in most social media applications. However, many such applications do not allow users to explicitly express the polarity of their interactions. As a result, most research has either ignored negative links or been limited to the few domains where such relations are explicitly expressed (e.g. Epinions trust/distrust). We study text exchanged between users in online communities. We find that the polarity of the links between users can be predicted with high accuracy given the text they exchange. This allows us to build a signed network representation of discussions, where every edge has a sign: positive to denote a friendly relation, or negative to denote an antagonistic relation. We also connect our analysis to social psychology theories of balance. We show that the automatically predicted networks are consistent with those theories. Inspired by that, we present a technique for identifying subgroups in discussions by partitioning the signed networks representing them.
3 0.3403292 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
Author: Jayant Krishnamurthy ; Tom Mitchell
Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.
4 0.33696935 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces
Author: Richard Socher ; Brody Huval ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state-of-the-art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
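The matrix-vector composition described in this abstract can be sketched in a few lines: each node carries a vector (its meaning) and a matrix (how it transforms a neighbour's meaning); the parent vector applies each child's matrix to the other child's vector, and the parent matrix linearly combines the child matrices. The weights below are random placeholders standing in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, 2 * n)) * 0.1    # vector-composition weights
W_M = rng.standard_normal((n, 2 * n)) * 0.1  # matrix-composition weights

def compose(a, A, b, B):
    """(a, A) and (b, B) are the vector/matrix pairs of the two children."""
    # parent vector: nonlinearity over each child's matrix applied
    # to the other child's vector
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))
    # parent matrix: linear map over the stacked child matrices
    P = W_M @ np.vstack([A, B])
    return p, P

a, b = rng.standard_normal(n), rng.standard_normal(n)
A, B = np.eye(n), np.eye(n)   # identity matrices: children act as no-ops
p, P = compose(a, A, b, B)
```

With identity child matrices the composition reduces to a plain recursive network over the concatenated child vectors; learned matrices are what let operator-like words (e.g. "not", "very") transform their neighbours.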
5 0.33438656 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
Author: Heeyoung Lee ; Marta Recasens ; Angel Chang ; Mihai Surdeanu ; Dan Jurafsky
Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.
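The cautious cluster-merging loop this abstract describes has a simple control flow: repeatedly score every pair of clusters with a linear model and merge the best-scoring pair while the score stays above a threshold. The sketch below shows only that control flow, with a hand-set toy scorer in place of the learned linear regression and without the entity/event feature sharing.

```python
def greedy_merge(clusters, score, threshold=0.5):
    """Greedily merge the best-scoring cluster pair until no pair clears
    the threshold; mirrors an iterative, cautious coreference merger."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = score(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:          # no merge clears the threshold
            break
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

def head_overlap(c1, c2):
    """Toy scorer: Jaccard overlap of mention head words (last tokens)."""
    h1 = {m.split()[-1] for m in c1}
    h2 = {m.split()[-1] for m in c2}
    return len(h1 & h2) / len(h1 | h2)

mentions = [{"Barack Obama"}, {"President Obama"}, {"the senator"}]
merged = greedy_merge(mentions, head_overlap)
```

The two Obama mentions merge on head overlap while "the senator" survives as its own cluster; in the real system the scorer would be the trained regression over entity and event features.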
6 0.33164239 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
7 0.32968616 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
8 0.32967469 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
9 0.32849643 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
10 0.32785457 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
11 0.32650489 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
12 0.32503933 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields
13 0.32476515 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation
14 0.32235858 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
15 0.32220468 120 emnlp-2012-Streaming Analysis of Discourse Participants
16 0.3215363 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
17 0.3209587 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
18 0.32087657 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
19 0.31910723 122 emnlp-2012-Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts
20 0.31862006 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon