emnlp emnlp2013 emnlp2013-89 knowledge-graph by maker-knowledge-mining

89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts


Source: pdf

Author: Morgane Ciot ; Morgan Sonderegger ; Derek Ruths

Abstract: While much work has considered the problem of latent attribute inference for users of social media such as Twitter, little has been done on non-English-based content and users. Here, we conduct the first assessment of latent attribute inference in languages beyond English, focusing on gender inference. We find that the gender inference problem in quite diverse languages can be addressed using existing machinery. Further, accuracy gains can be made by taking language-specific features into account. We identify languages with complex orthography, such as Japanese, as difficult for existing methods, suggesting a valuable direction for future research.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract While much work has considered the problem of latent attribute inference for users of social media such as Twitter, little has been done on non-English-based content and users. [sent-8, score-0.576]

2 Here, we conduct the first assessment of latent attribute inference in languages beyond English, focusing on gender inference. [sent-9, score-0.769]

3 We find that the gender inference problem in quite diverse languages can be addressed using existing machinery. [sent-10, score-0.665]

4 1 Introduction A 2012 study reported that US-based Twitter users now account for only 28% of all active accounts on the platform (Semiocast, 2012). [sent-13, score-0.289]

5 It is remarkable, then, that advances in latent attribute inference on social media have been largely confined to English content, e. [sent-21, score-0.315]

6 Here we specifically focus on gender inference, as it has been the basis for significant work in recent years (Liu et al. [sent-30, score-0.403]

7 First, we quantify the extent to which established gender inference methods can be used with non-English Twitter content. [sent-36, score-0.539]

8 This second aspect, in particular, acknowledges the fact that latent attribute inference may be easier in some languages due not to conventions in word usage, but to syntactic structure. [sent-38, score-0.396]

9 Each dataset consisted of approximately 1000 users who tweeted primarily in a given language. [sent-40, score-0.303]

10 Gender in Japanese, in contrast, could not be reliably inferred with any reasonable accuracy (61% on average) despite numerous attempts to preprocess the tweets and tune the classifier to accommodate the language’s complex orthography. [sent-52, score-0.313]

11 French is a valuable case study because, unlike English, it has a number of syntax-based mechanisms that can encode the gender of the speaker. [sent-55, score-0.431]

12 The most common instantiation of gender marking is the modification of adjective and some past participle endings to match the gender of the subject in constructions beginning with “je suis” (trans. [sent-56, score-1.051]

13 Overall, our results show that, with little modification, existing gender inference machinery can perform comparably to English on several other languages. [sent-59, score-0.565]

14 The majority of recent work in this area has focused on Twitter users (Rao et al. [sent-89, score-0.261]

15 With one exception, gender inference accuracy has been reported between 80% and 85%. [sent-104, score-0.553]

16 The one study which reported 90% accuracy involved the use of a dataset which has been shown to be quite different from typical anglophone Twitter users (Burger et al. [sent-105, score-0.388]

17 This same study did involve non-English Twitter users, but did not analyze the performance of the classifier on different languages (e. [sent-107, score-0.297]

18 Human languages can be classified into different language families, defined as a set of languages which are all descended from a single, ancient parent language. [sent-113, score-0.262]

19 This selection of languages allows us to conduct the most far-reaching survey of non-English latent attribute inference performance to date. [sent-119, score-0.366]

20 A variety of features make each language selected interesting within the gender inference context. [sent-120, score-0.503]

21 Like many languages of the world, they do not have distinct male and female pronouns (like English and French), or grammatical gender (like French). [sent-128, score-1.067]

22 1 Data The core data for this project consisted of four datasets of content from Twitter users who tweeted predominantly in one of four languages—French, Indonesian, Turkish, and Japanese—collected using the methods described below. [sent-131, score-0.303]

, 2012), the dominant way of obtaining datasets consisting of Twitter users with high-confidence gender labels is to use gender-name associations. [sent-143, score-0.261]

24 We instead used Amazon Mechanical Turk workers to identify the gender of the person shown in the profile picture associated with a user’s account (Liu and Ruths, 2013). [sent-146, score-0.446]

25 Users with non-photographic or celebrity-based profile pictures were discarded, as well as any users with profile pictures where the gender could not be confidently assessed (less than 4 out of 5 votes for one gender). [sent-148, score-0.75]
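The 4-out-of-5 agreement rule can be expressed as a small vote-aggregation step. A minimal sketch, assuming annotator votes arrive as a list of gender labels (the function name and data shapes are hypothetical, not from the paper):

```python
from collections import Counter

def aggregate_gender_votes(votes, min_agreement=4):
    """Return the majority gender label only if at least
    `min_agreement` annotators agree (the 4-out-of-5 rule);
    otherwise return None, meaning the user is discarded."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None
```

Users mapped to None would simply be dropped from the labeled dataset.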

26 2 Methods The majority of prior work in gender inference (and latent inference in general) has used support vector machines (SVMs). [sent-154, score-0.65]

27 We followed prior work in this regard, particularly since our intent here is to evaluate the relevance of existing gender inference machinery on other languages. [sent-155, score-0.534]

28 If the numbers of male and female users were unbalanced in a dataset, the larger set was subsampled randomly to obtain a set of users the same size as the smaller labeled set. [sent-192, score-1.001]

29 The values of the features were extracted from the training users (e. [sent-199, score-0.261]

30 In this way, the gender model implemented by the SVM was language-specific, in the sense that a particular language’s gender model contained a different set of features. [sent-202, score-0.806]

31 We call our method language-agnostic on the grounds that, given a labeled set of users and tweets drawn from a particular language, a model can be built without any knowledge of the structure or content of the language itself. [sent-203, score-0.386]
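The balancing step described above (randomly subsampling the larger gender class down to the size of the smaller one) can be sketched as follows; representing users as (id, gender) pairs is an illustrative assumption:

```python
import random

def balance_by_gender(users, seed=0):
    """Subsample the larger of the male/female user sets so both
    classes end up the same size, as described in the text.
    `users` is a list of (user_id, gender) pairs."""
    males = [u for u in users if u[1] == "M"]
    females = [u for u in users if u[1] == "F"]
    k = min(len(males), len(females))
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(males, k) + rng.sample(females, k)
```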

32 Overall, the classifier demonstrated good performance on all languages except for Japanese. [sent-210, score-0.269]

33 Throughout, we omit discussion of non-alphanumeric “words” (such as punctuation or emoticons), and call the k-top discriminating words for male and female users the k-top male words and k-top female words. [sent-213, score-1.264]
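One standard way to obtain such k-top lists from a linear classifier is to rank features by their learned weights. A sketch under two assumptions not stated in the source: each word maps to a single weight, and positive weights indicate the male class:

```python
def k_top_words(weights, k=25):
    """Given word -> weight from a linear gender classifier
    (positive = male-indicative, by assumption), return the k-top
    male and k-top female word lists, skipping non-alphanumeric
    "words" such as punctuation or emoticons."""
    items = [(w, c) for w, c in weights.items() if w.isalnum()]
    by_desc = sorted(items, key=lambda x: x[1], reverse=True)
    by_asc = sorted(items, key=lambda x: x[1])
    top_male = [w for w, c in by_desc[:k] if c > 0]
    top_female = [w for w, c in by_asc[:k] if c < 0]
    return top_male, top_female
```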

34 The k-top words for men and women are of very different grammatical types. [sent-215, score-0.396]

35 In contrast, many female words (11/25) are pronouns or basic verb forms referring to the speaker or a single addressee (je ‘I’, mon/mes/ma ‘my’, tu ‘you’, j’ai ‘I have’). [sent-219, score-0.573]

36 The most salient pattern is that use of words (pronouns, basic verbs) associated with talking about the speaker or addressee indicates a tweet is more likely to be from a female user. [sent-221, score-0.512]

37 These patterns reflect known gender differences in word usage by male and female French speakers (Witsen, 1981). [sent-223, score-0.516]

38 The k-top lists for men and women give some justification for why the classifier performed well. [sent-226, score-0.534]

39 Some differences can be tentatively linked to general trends in how men and women use language differently across cultures. [sent-227, score-0.396]

40 It seems plausible that men tweet about soccer significantly more than women. [sent-229, score-0.401]

41 In such a situation, a reasonable concern is that our classifier discriminated soccer from non-soccer enthusiasts rather than males from females. [sent-230, score-0.294]

42 More interestingly, many of the k-top words correspond to men and women using different terms of address and self-reference. [sent-235, score-0.396]

43 Among the k-top words, 7/25 for men and 4/25 for women are terms of address or self-reference. [sent-236, score-0.396]

44 The terms men use are mostly highly informal, including the slang term lu (you) and the English borrowing bro; the address terms women use are mostly medium-formality, such as aku (I) and kamu (you). [sent-237, score-0.396]

45 Thus, women seem to be using “more polite” self-reference and address terms than men on average on Indonesian Twitter, in line with the more general tendency for women to use polite forms more frequently than men crossculturally (Holmes, 1995). [sent-238, score-0.921]

46 In fact, to our knowledge, this is the highest accuracy achieved in the entire Twitter gender inference literature on a dataset drawn from the Twitter general population. [sent-241, score-0.553]

47 The k-top lists of male and female words again give some justification for the classifier’s performance. [sent-242, score-0.479]

48 Many differences between the male and female lists can be linked to men and women talking about different topics, or to differ- ent people. [sent-243, score-0.914]

49 Several of the male words refer to soccer (gol ‘goal’ , galatasaray ‘popular Istanbul team’, mac ¸ ‘match’, at ‘[part of imperative for] score’), which men plausibly tweet about more. [sent-244, score-0.638]

50 Many other k-top words are familiar terms of address for men (lan, abi, karde sim, adam, kanka) or a greeting used mainly between men (eyvallah), suggesting that male users are addressing or discussing men more often than female users are. [sent-248, score-1.646]

51 In contrast, 9/25 of the k-top female words are pronouns referring to the speaker, a familiar addressee, or a third party (he/she/it), while none of the k-top male words are, suggesting female users are more often talking directly about themselves or to others. [sent-249, score-1.126]

52 Finally, 2/25 of the k-top male words are profanity (amk, ulan), while none of the female k-top words are, suggesting male users swear more. [sent-250, score-0.977]

53 Despite the classifier’s poor performance, the ktop discriminating words for male and female users differ in interesting ways. [sent-255, score-0.785]

54 Japanese speakers have a choice of many first-person singular pronouns (equivalent to “I”), which signal different levels of politeness and of male versus female speech. [sent-257, score-0.533]

55 The pronoun boku (僕) is associated with informal male speech; accordingly, it is among the k-top male words. [sent-258, score-0.474]

56 Women tend to use polite verb forms and honorifics more frequently than men in Japanese speech (Peng, 1981). [sent-260, score-0.42]

57 In agreement with this pattern, several polite verb forms (-masu, -mashi) and a polite honorific (o-) are among the k-top female words, as is a diminutive honorific often used to refer to women (-chan). [sent-261, score-0.723]

58 Thus, it is in principle often possible to infer the gender of the speaker by which form they use, although it is not clear a priori that this method will work for Twitter data. [sent-268, score-0.488]

59 1 Method French grammar dictates that which forms of words are used often reflects the gender of the speaker. [sent-270, score-0.461]

60 Adjectives must agree in gender with the noun they refer to. [sent-272, score-0.403]

61 For example, “I am happy” would be je suis heureuse for a female speaker and je suis heureux for a male speaker (literally “I-am-happy”); heureuse and heureux are the feminine and masculine singular forms of the adjective, and are pronounced differently. [sent-273, score-1.506]

62 Past participles of verbs also agree with the gender of the subject or object of the verb, for certain verbs and constructions. [sent-274, score-0.445]

63 For example, “I went” would be je suis allée for a female speaker and je suis allé for a male speaker (here suis is used to form the simple past of the verb aller, ‘to go’); allé and allée are the masculine and feminine forms of the past participle of aller, and are pronounced the same. [sent-275, score-1.799]

64 Note that the phrase je suis (“I am”) occurs in both the adjectival and verbal constructions referring to the speaker; however, the function of suis differs between the two. [sent-276, score-0.361]

65 suis is the first-person singular form of the verb être (“to be”), and functions as a copula when followed by an adjective (“I am happy”) but as an auxiliary verb to mark the past tense, when followed by the past participle of certain verbs (“I went”). [sent-277, score-0.577]

66 For our purposes, what is important is that, in both cases, a following adjective or past participle will take on the gender of the speaker. [sent-278, score-0.648]

67 When this construction occurs in a tweet, it is likely that je is referring to the author of the tweet, and the rules of French grammar dictate that the gender of the associated adjective or past participle should reflect the gender of the tweet’s author. [sent-279, score-1.205]

68 We implemented a classifier that used this logic to classify the gender of francophone Twitter users. [sent-280, score-0.541]

69 It is worth emphasizing that the existence of adjectives and participles which reflect the speaker’s gender does not automatically make gender identification in French tweets a trivial task. [sent-281, score-0.973]

70 [Table: shorthand variants of the suis-construction searched for, e.g. je suis, jsuis, jmesuis, je ne suis pas, jnesuis pas, jsuis pas, je me suis, je ne me suis pas] […] tweets to be reliably used for speaker gender identification. [sent-284, score-1.153]

71 Of these, we can identify the number of those tweets that involve an adjective or past participle with a female ending, T^F_suis(u) ⊆ T_suis(u). [sent-288, score-0.612]

72 As expected, cursory inspection of tweets revealed that Twitter users often employed shorthand forms of the suis-construction. [sent-291, score-0.303]

73 Recognizing the gender of the adjective or past participle involved in a suis-construction required a second processing stage. [sent-294, score-0.648]

74 If the tag was not an adjective or verb, the construction was discarded as it would not contain a gender indication. [sent-296, score-0.483]

75 If the word was recognized as an adjective or verb, Lexique would also return the gender, which would be returned as the gender indication for that particular suis-construction. [sent-297, score-0.483]
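The two-stage pipeline described above (find a suis-construction, then check the gender of the following word) can be sketched with a regular expression. A real system would consult a POS tagger and a lexicon such as Lexique for the gender lookup; the feminine-ending heuristic below is a crude stand-in for that lookup, and the variant pattern is simplified:

```python
import re

# Simplified pattern over shorthand variants of the suis-construction
# (je suis, jsuis, je ne suis pas, je me suis, ...); captures the word
# that follows, whose form may mark the speaker's gender.
SUIS_RE = re.compile(r"\bj(?:e)?\s*(?:ne\s+)?(?:me\s+)?suis\s+(?:pas\s+)?(\w+)")

def female_suis_endings(tweet):
    """Return words following a suis-construction that look feminine.
    The ending check is a heuristic stand-in for the Lexique gender
    lookup described in the text."""
    words = SUIS_RE.findall(tweet.lower())
    return [w for w in words
            if w.endswith(("ée", "euse"))
            or (w.endswith("e") and not w.endswith("re"))]
```

Counting these hits per user gives the female suis-construction count used in the thresholding step.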

76 We evaluated a number of policies for assigning the user’s gender based on the relative values of TsFuis (u) and Tsuis (u). [sent-299, score-0.403]

77 In the end, however, the best performing threshold was T^F_suis(u) ≥ 1: simply labeling as female any user with at least one female suis-construction. [Table 4: The component-wise and overall accuracy of the combined suis-construction and SVM classifier.]

78 This threshold makes sense given the plausible intuition that females will (almost always) be the only users to employ a female suis-construction; however, it is quite sensitive to uses of female suis-constructions by males. [sent-306, score-0.813]

79 Since not all users had tweets which contained suis-constructions, we combined the SVM-based classifier used previously with the suis-construction-based classifier. [sent-308, score-0.524]

80 The SVM component was applied to any users who lacked suis-constructions entirely in their tweet history. [sent-309, score-0.4]
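The combination logic just described — rule-based labeling when a suis-construction is present, SVM fallback otherwise — can be sketched as below. The gender check on the following word is again a crude heuristic, and `svm_predict` is a hypothetical stand-in for the trained SVM component:

```python
import re

SUIS_RE = re.compile(r"\bj(?:e)?\s*suis\s+(\w+)")  # simplified variant pattern

def classify_gender(tweets, svm_predict):
    """Combined-classifier sketch: users whose tweets contain a
    suis-construction are labeled 'F' if any following word carries
    a feminine '-e' ending (stand-in for a lexicon lookup), 'M'
    otherwise; users with no suis-construction at all fall back to
    the SVM component."""
    following = [w for t in tweets for w in SUIS_RE.findall(t.lower())]
    if not following:
        return svm_predict(tweets)  # SVM fallback
    return "F" if any(w.endswith("e") for w in following) else "M"
```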

81 In spite of our concerns over the occurrence frequency and detectability of the suis-construction in tweets, our results show that suis-constructions were found in tweets belonging to nearly 75% of all users in the dataset. [sent-314, score-0.386]

82 In fact, when we looked through the tweets of users who were flagged as not having useful suis-constructions in their tweets, we discovered that many actually did. [sent-318, score-0.386]

83 On the set of users for which the suis-construction was detected, the classifier did very well, achieving an average accuracy of 90%. [sent-322, score-0.449]

84 This was largely due to female users being misclassified as males, indicating that females do not exclusively use female suis-constructions (this was confirmed via manual inspection of a number of female tweet histories). [sent-325, score-1.13]

85 Since forming the female form of an adjective or participle typically requires adding an additional character (or more) to the base of the word, this may reflect a tendency towards dropping gender modifiers in favor of typing less. [sent-327, score-0.833]

86 While the suis-construction classifier performed well, the SVM component did not do nearly as well on the Twitter users that could not be labeled using the suis-construction, achieving an average performance of 62%. [sent-329, score-0.399]

87 This result stands in opposition to our earlier finding that French users could be labeled with 75% accuracy. [sent-331, score-0.261]

88 This disparity suggests that the non-suis-construction users comprise a particularly difficult-to-classify group. [sent-332, score-0.261]

89 The finding that the SVM classifier performed poorly in the combined classification setting suggests that the suis-construction classifier is acting as a very effective filter for users that are hard for it to classify. [sent-334, score-0.618]

90 Such filters can decrease classification error by simply flagging those users who cannot be easily classified, leaving them to be handled more carefully by more powerful classifiers or human coding. [sent-336, score-0.261]

91 This result suggests a question for future work: whether it is possible to build classifiers that accurately label the sets of users that are discarded by the suis-construction classifier. [sent-338, score-0.261]

92 Despite the relatively poor performance of the SVM component, the accuracy of the combined classifier improved on the original SVM-only classifier by 8%, which is a substantial increase in accuracy. [sent-340, score-0.326]

93 With some additional focus on classifying the difficult users who could not be labeled by suis-construction usage, we feel that this accuracy can be increased upwards of 90%. [sent-341, score-0.311]

94 5 Discussion In this project, we have extended, for the first time, the latent attribute inference problem to users who tweet primarily in languages other than English. [sent-342, score-0.734]

95 While accuracy levels certainly vary across languages, overall an existing SVM-based classifier, when trained on users from a given language, can classify the gender of other users from that same language with accuracy comparable to performance reported for English. [sent-345, score-1.056]

96 The results obtained for French stand in contrast to various, relatively unsuccessful attempts to boost gender inference by incorporating syntactic features of English into the classifier (e. [sent-355, score-0.641]

97 Such studies could be radically scaled up in terms of the number of languages considered using a language-agnostic gender classifier. [sent-365, score-0.534]

98 Using first names as features for gender inference in Twitter. [sent-452, score-0.503]

99 Using social media to infer gender composition from commuter populations. [sent-460, score-0.483]

100 Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors. [sent-561, score-0.443]
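The per-sentence scores attached to the excerpts above come from a tf-idf model. A minimal scorer in that spirit — treating each sentence as its own document and summing tf-idf weights — might look like this; it is an illustrative sketch, not the actual mining pipeline:

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences):
    """Score each sentence by the summed tf-idf weight of its words,
    with each sentence treated as one document."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append(sum((tf[w] / len(d)) * math.log(n / df[w]) for w in tf))
    return scores
```

Sentences containing rarer words score higher, which is why content-bearing sentences float to the top of the summary.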


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gender', 0.403), ('users', 0.261), ('female', 0.242), ('male', 0.237), ('twitter', 0.228), ('men', 0.215), ('women', 0.181), ('japanese', 0.17), ('zamal', 0.162), ('suis', 0.155), ('classifier', 0.138), ('languages', 0.131), ('french', 0.131), ('tweets', 0.125), ('participle', 0.108), ('tweet', 0.107), ('je', 0.103), ('inference', 0.1), ('tsfuis', 0.097), ('indonesian', 0.096), ('attribute', 0.088), ('rao', 0.086), ('speaker', 0.085), ('conover', 0.081), ('ruths', 0.081), ('suisconstruction', 0.081), ('social', 0.08), ('adjective', 0.08), ('soccer', 0.079), ('males', 0.077), ('burger', 0.075), ('svm', 0.074), ('polite', 0.071), ('user', 0.07), ('pennacchiotti', 0.069), ('popescu', 0.068), ('tsuis', 0.065), ('masculine', 0.064), ('turkish', 0.062), ('forms', 0.058), ('past', 0.057), ('tumasjan', 0.056), ('tokenization', 0.055), ('pronouns', 0.054), ('feminine', 0.054), ('weblogs', 0.053), ('referring', 0.051), ('accuracy', 0.05), ('anglophone', 0.049), ('brazil', 0.049), ('lexique', 0.049), ('latent', 0.047), ('discriminating', 0.045), ('political', 0.045), ('verb', 0.044), ('profile', 0.043), ('tweeted', 0.042), ('participles', 0.042), ('calves', 0.042), ('mcgill', 0.042), ('quebec', 0.042), ('gon', 0.039), ('addressee', 0.039), ('talking', 0.039), ('pas', 0.038), ('usage', 0.037), ('pronounced', 0.037), ('extent', 0.036), ('sakaki', 0.036), ('females', 0.036), ('orthography', 0.036), ('canada', 0.035), ('liu', 0.035), ('akioka', 0.032), ('aller', 0.032), ('englishlanguage', 0.032), ('etre', 0.032), ('genetically', 0.032), ('heureuse', 0.032), ('heureux', 0.032), ('honorifics', 0.032), ('kuromoji', 0.032), ('microtext', 0.032), ('mislove', 0.032), ('mocanu', 0.032), ('ratkiewicz', 0.032), ('rences', 0.032), ('suisconstructions', 0.032), ('english', 0.032), ('existing', 0.031), ('arxiv', 0.031), ('machinery', 0.031), ('conventions', 0.03), ('hashtags', 0.029), ('study', 0.028), ('untokenized', 0.028), ('went', 0.028), ('honorific', 0.028), ('shorthand', 0.028), 
('bamman', 0.028)]
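Given per-paper tf-idf vectors like the one above, similar-paper rankings are typically produced by cosine similarity between vectors; a generic sketch (not the site's actual code):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors,
    each given as a word -> weight dict."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A paper compared with itself scores (up to floating-point error) 1.0, matching the same-paper entry at the top of the ranked lists.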

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

Author: Morgane Ciot ; Morgan Sonderegger ; Derek Ruths


2 0.35310543 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

Abstract: Different demographics, e.g., gender or age, can demonstrate substantial variation in their language use, particularly in informal contexts such as social media. In this paper we focus on learning gender differences in the use of subjective language in English, Spanish, and Russian Twitter data, and explore cross-cultural differences in emoticon and hashtag use for male and female users. We show that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian. Our results show statistically significant relative F-measure improvement over the gender-independent baseline of 1.5% and 1% for Russian, 2% and 0.5% for Spanish, and 2.5% and 5% for English for polarity and subjectivity classification.

3 0.17393439 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

Author: Qiming Diao ; Jing Jiang

Abstract: With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant messages. On the one hand, people tweet about their daily lives, and on the other hand, when major events happen, people also follow and tweet about them. Moreover, people’s posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the association between events and topics. Our experiments show that our model can accurately identify meaningful events and the event-topic affinity vectors are effective for event recommendation and grouping events by topics.

4 0.15647501 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?

Author: Shi Feng ; Le Zhang ; Binyang Li ; Daling Wang ; Ge Yu ; Kam-Fai Wong

Abstract: Extensive experiments have validated the effectiveness of the corpus-based method for classifying the word’s sentiment polarity. However, no work is done for comparing different corpora in the polarity classification task. Nowadays, Twitter has aggregated huge amount of data that are full of people’s sentiments. In this paper, we empirically evaluate the performance of different corpora in sentiment similarity measurement, which is the fundamental task for word polarity classification. Experiment results show that the Twitter data can achieve a much better performance than the Google, Web1T and Wikipedia based methods.

5 0.12881145 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel

Abstract: Work on authorship attribution has traditionally focused on long texts. In this work, we tackle the question of whether the author of a very short text can be successfully identified. We use Twitter as an experimental testbed. We introduce the concept of an author’s unique “signature”, and show that such signatures are typical of many authors when writing very short texts. We also present a new authorship attribution feature (“flexible patterns”) and demonstrate a significant improvement over our baselines. Our results show that the author of a single tweet can be identified with good accuracy in an array of flavors of the authorship attribution task.

6 0.087133497 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

7 0.085621715 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation

8 0.082750335 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

9 0.078131422 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization

10 0.077354722 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems

11 0.074795358 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

12 0.073261566 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

13 0.069490544 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

14 0.068905577 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

15 0.067282625 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet

16 0.066461354 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

17 0.066363551 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

18 0.060746502 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

19 0.059340261 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

20 0.05921318 151 emnlp-2013-Paraphrasing 4 Microblog Normalization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, 0.086), (2, -0.141), (3, -0.164), (4, 0.032), (5, -0.164), (6, -0.046), (7, -0.065), (8, 0.083), (9, 0.021), (10, -0.148), (11, 0.213), (12, 0.128), (13, -0.048), (14, 0.017), (15, -0.051), (16, 0.038), (17, -0.052), (18, -0.083), (19, 0.014), (20, -0.097), (21, 0.104), (22, 0.158), (23, 0.132), (24, -0.122), (25, -0.035), (26, -0.059), (27, -0.022), (28, -0.078), (29, 0.12), (30, -0.21), (31, 0.134), (32, 0.122), (33, -0.153), (34, -0.094), (35, -0.021), (36, -0.09), (37, -0.007), (38, 0.114), (39, -0.012), (40, 0.03), (41, -0.098), (42, -0.003), (43, 0.0), (44, 0.027), (45, 0.194), (46, -0.066), (47, -0.065), (48, 0.07), (49, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97581148 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

Author: Morgane Ciot ; Morgan Sonderegger ; Derek Ruths


2 0.72283185 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky


3 0.46376139 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems

Author: Julien Gaillard ; Marc El-Beze ; Eitan Altman ; Emmanuel Ethis

Abstract: Recommendation systems (RS) take advantage of products and users information in order to propose items to consumers. Collaborative, content-based and a few hybrid RS have been developed in the past. In contrast, we propose a new domain-independent semantic RS. By providing textually well-argued recommendations, we aim to give more responsibility to the end user in his decision. The system includes a new similarity measure keeping up both the accuracy of rating predictions and coverage. We propose an innovative way to apply a fast adaptation scheme at a semantic level, providing recommendations and arguments in phase with the very recent past. We have performed several experiments on films data, providing textually well-argued recommendations.

4 0.46127194 27 emnlp-2013-Authorship Attribution of Micro-Messages

Author: Roy Schwartz ; Oren Tsur ; Ari Rappoport ; Moshe Koppel


5 0.43800595 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?

Author: Shi Feng ; Le Zhang ; Binyang Li ; Daling Wang ; Ge Yu ; Kam-Fai Wong

Abstract: Extensive experiments have validated the effectiveness of the corpus-based method for classifying the word’s sentiment polarity. However, no work is done for comparing different corpora in the polarity classification task. Nowadays, Twitter has aggregated huge amount of data that are full of people’s sentiments. In this paper, we empirically evaluate the performance of different corpora in sentiment similarity measurement, which is the fundamental task for word polarity classification. Experiment results show that the Twitter data can achieve a much better performance than the Google, Web1T and Wikipedia based methods.

6 0.40123278 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

7 0.39956093 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

8 0.38134408 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

9 0.37716141 163 emnlp-2013-Sarcasm as Contrast between a Positive Sentiment and Negative Situation

10 0.36484474 131 emnlp-2013-Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs

11 0.3516078 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

12 0.34753016 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

13 0.33698529 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization

14 0.31396931 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet

15 0.30443835 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

16 0.29798326 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

17 0.26655924 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers

18 0.26473129 23 emnlp-2013-Animacy Detection with Voting Models

19 0.26186609 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

20 0.2585609 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.035), (10, 0.244), (13, 0.022), (18, 0.017), (22, 0.039), (30, 0.065), (45, 0.011), (47, 0.018), (50, 0.021), (51, 0.172), (53, 0.013), (55, 0.01), (66, 0.053), (71, 0.032), (75, 0.027), (77, 0.024), (90, 0.033), (96, 0.056)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81677866 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

Author: Joern Wuebker ; Stephan Peitz ; Felix Rietig ; Hermann Ney

Abstract: Automatically clustering words from a monolingual or bilingual training corpus into classes is a widely used technique in statistical natural language processing. We present a very simple and easy to implement method for using these word classes to improve translation quality. It can be applied across different machine translation paradigms and with arbitrary types of models. We show its efficacy on a small German→English and a larger French→German translation task with standard phrase-based and hierarchical phrase-based translation systems for a common set of models. Our results show that with word class models, the baseline can be improved by up to 1.4% BLEU and 1.0% TER on the French→German task and 0.3% BLEU and 1.1% TER on the German→English task.

same-paper 2 0.81271356 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

Author: Morgane Ciot ; Morgan Sonderegger ; Derek Ruths

Abstract: While much work has considered the problem of latent attribute inference for users of social media such as Twitter, little has been done on non-English-based content and users. Here, we conduct the first assessment of latent attribute inference in languages beyond English, focusing on gender inference. We find that the gender inference problem in quite diverse languages can be addressed using existing machinery. Further, accuracy gains can be made by taking language-specific features into account. We identify languages with complex orthography, such as Japanese, as difficult for existing methods, suggesting a valuable direction for future research.

3 0.78491789 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

Author: Oier Lopez de Lacalle ; Mirella Lapata

Abstract: In this paper we present an unsupervised approach to relational information extraction. Our model partitions tuples representing an observed syntactic relationship between two named entities (e.g., “X was born in Y” and “X is from Y”) into clusters corresponding to underlying semantic relation types (e.g., BornIn, Located). Our approach incorporates general domain knowledge which we encode as First Order Logic rules and automatically combine with a topic model developed specifically for the relation extraction task. Evaluation results on the ACE 2007 English Relation Detection and Categorization (RDC) task show that our model outperforms competitive unsupervised approaches by a wide margin and is able to produce clusters shaped by both the data and the rules.

4 0.65922201 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

Abstract: Different demographics, e.g., gender or age, can demonstrate substantial variation in their language use, particularly in informal contexts such as social media. In this paper we focus on learning gender differences in the use of subjective language in English, Spanish, and Russian Twitter data, and explore cross-cultural differences in emoticon and hashtag use for male and female users. We show that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian. Our results show statistically significant relative F-measure improvements over the gender-independent baseline of 1.5% and 1% for Russian, 2% and 0.5% for Spanish, and 2.5% and 5% for English for polarity and subjectivity classification.

5 0.63333809 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization, confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that people with similar academic, business, or social connections (e.g., co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

6 0.63250035 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

7 0.62729573 143 emnlp-2013-Open Domain Targeted Sentiment

8 0.6233561 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

9 0.62227893 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

10 0.62198228 197 emnlp-2013-Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems

11 0.62191796 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

12 0.62140572 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

13 0.62077475 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

14 0.61983275 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

15 0.61918294 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

16 0.61844945 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

17 0.61811328 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

18 0.61780024 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

19 0.61748272 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation

20 0.61725527 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction