acl acl2012 acl2012-6 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Apoorv Agarwal ; Adinoyi Omuya ; Aaron Harnly ; Owen Rambow
Abstract: Many researchers have attempted to predict the Enron corporate hierarchy from the data. This work, however, has been hampered by a lack of data. We present a new, large, and freely available gold-standard hierarchy. Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many researchers have attempted to predict the Enron corporate hierarchy from the data. [sent-7, score-0.294]
2 Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems. [sent-10, score-0.278]
3 1 Introduction Since the release of the Enron email corpus, many researchers have attempted to predict the Enron corporate hierarchy from the email data. [sent-11, score-1.189]
4 This work, however, has been hampered by a lack of data about the organizational hierarchy. [sent-12, score-0.195]
5 Most researchers have used the job titles assembled by Shetty and Adibi (2004), and then have attempted to predict the relative ranking of two people’s job titles (Rowe et al. [sent-13, score-0.211]
6 A major limitation of the list compiled by Shetty and Adibi (2004) is that it only covers those “core” employees for whom the complete email inboxes are available in the Enron dataset. [sent-16, score-0.745]
7 However, it is also interesting to determine whether we can predict the hierarchy of other employees, for whom we only have an incomplete set of emails (those that they sent to or received from the core employees). [sent-17, score-0.398]
8 This is difficult in particular because there are dominance relations between two employees such that no email between them is available in the Enron data set. [sent-18, score-1.284]
9 , 2011a) use 142 dominance pairs for training and testing. [sent-23, score-0.552]
10 Our gold standard contains 1,518 employees, and 13,724 dominance pairs (pairs of employees such that the first dominates the second in the hierarchy, not necessarily immediately). [sent-26, score-0.997]
11 All of the employees in the hierarchy are email correspondents in the Enron email database, though obviously many are not from the core group of about 158 Enron employees for whom we have the complete inbox. [sent-27, score-1.635]
12 The hierarchy is linked to a threaded representation of the Enron corpus using shared IDs for the employees who are participants in the email conversation. [sent-28, score-0.863]
13 We show the usefulness of this resource by investigating a simple predictor for hierarchy based on social network analysis (SNA), namely degree centrality of the social network induced by the email correspondence (Section 4). [sent-30, score-1.452]
14 Degree centrality is one of the features used by Rowe et al. [sent-32, score-0.205]
15 (2007), but they did not perform a quantitative evaluation, and to our knowledge there are no published experiments using only degree centrality. [sent-33, score-0.096]
16 Current systems using natural language processing (NLP) are restricted to making informed predictions on dominance pairs for which email exchange is available. [sent-34, score-1.1]
17 We show (Section 5) that the upper bound performance of such [sent-35, score-0.073]
18 NLP-based systems is much lower than our SNA-based system on the entire gold standard. [sent-37, score-0.091]
19 We also contrast the simple SNA-based system with a specific NLP system based on Gilbert (2012), and show that even if we restrict ourselves to pairs for which email exchange is available, our simple SNA-based system outperforms the NLP-based system. [sent-38, score-0.585]
20 2 Work on Enron Hierarchy Prediction The Enron email corpus was introduced by Klimt and Yang (2004). [sent-39, score-0.456]
21 Since then numerous researchers have analyzed the network formed by connecting people with email exchange links (Diesner et al. [sent-40, score-0.915]
22 (2007) use the email exchange network (and other features) to predict the dominance relations between people in the Enron email corpus. [sent-47, score-1.963]
23 (2011b) and Gilbert (2012) present NLP-based models to predict dominance relations between Enron employees. [sent-50, score-0.651]
24 Gilbert (2012) produces training and test data as follows: an email message is labeled upward only when every recipient outranks the sender. [sent-54, score-0.542]
25 An email message is labeled not-upward only when no recipient outranks the sender. [sent-55, score-0.508]
26 They use an n-gram based model with Support Vector Machines (SVM) to predict if an email is of class upward or not-upward. [sent-56, score-0.552]
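A minimal sketch of this labeling scheme (not Gilbert's released code); the outranks(a, b) predicate is assumed to be a lookup into a gold hierarchy:

```python
def label_message(sender, recipients, outranks):
    """Label an email 'upward' if every recipient outranks the sender,
    'not-upward' if no recipient outranks the sender, and None for mixed
    messages, which are excluded from training and test data in this sketch."""
    ranked_above = [r for r in recipients if outranks(r, sender)]
    if recipients and len(ranked_above) == len(recipients):
        return "upward"
    if not ranked_above:
        return "not-upward"
    return None
```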
27 We use their n-grams with SVM to predict dominance relations of employees in our gold standard and show that a simple SNA-based approach outperforms this baseline. [sent-58, score-1.007]
28 Moreover, Gilbert (2012) exploits dominance relations of only 132 people in the Enron corpus for creating their training and test data. [sent-59, score-0.714]
29 Our gold standard has dominance relations for 1,518 Enron employees. [sent-60, score-0.68]
30 3 The Enron Hierarchy Gold Standard Klimt and Yang (2004) introduced the Enron email corpus. [sent-61, score-0.456]
31 They reported a total of 619,446 emails taken from folders of 158 employees of the Enron corporation. [sent-62, score-0.359]
32 We created a database of organizational hierarchy relations by studying the original Enron organizational charts. [sent-63, score-0.562]
33 We found a few documents with organizational charts, which were always either Excel or Visio files. [sent-65, score-0.171]
34 We then searched all remaining emails for attachments of the same filetype, and exhaustively examined those for additional org charts. [sent-66, score-0.135]
35 We then manually transcribed the information contained in all org charts we found. [sent-67, score-0.053]
36 Our resulting gold standard has a total of 1,518 nodes (employees) which are described as being in immediate dominance relations (manager-subordinate). [sent-68, score-0.747]
37 There are 2,155 immediate dominance relations spread over 65 levels of dominance (CEO, manager, trader, etc.). [sent-69, score-1.135]
38 From these relations, we formed the transitive closure and obtained 13,724 hierarchical relations. [sent-70, score-0.106]
39 For example, if A immediately dominates B and B immediately dominates C, then the valid organizational dominance relations are A dominates B, B dominates C, and A dominates C. [sent-71, score-1.247]
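A minimal sketch of how the 13,724 pairs could be derived from the 2,155 immediate (manager-subordinate) relations; the function and variable names are illustrative, not from the authors' code:

```python
def transitive_closure(immediate_pairs):
    """Expand immediate dominance pairs (manager, subordinate) into the full
    set of (ancestor, descendant) dominance pairs via a simple fixed point."""
    closure = set(immediate_pairs)
    changed = True
    while changed:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        changed = not new_pairs <= closure
        closure |= new_pairs
    return closure

# Example from the text: A dominates B, B dominates C.
print(sorted(transitive_closure({("A", "B"), ("B", "C")})))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```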
40 This data set is much larger than any other data set used in the literature for predicting organizational hierarchy. [sent-72, score-0.198]
41 We link this representation of the hierarchy to the threaded Enron corpus created by Yeh and Harnly (2006). [sent-73, score-0.168]
42 They pre-processed the dataset by combining emails into threads and restoring some missing emails from their quoted form in other emails. [sent-74, score-0.261]
43 They also co-referenced multiple email addresses belonging to one person, and assigned unique identifiers and names to persons. [sent-75, score-0.502]
44 Therefore, each person is a priori associated with a set of email addresses and names (or name variants), but has only one unique identifier. [sent-76, score-0.499]
45 We use these unique identifiers to express our gold hierarchy. [sent-79, score-0.137]
46 This means that we can easily retrieve all emails associated with people in our gold hierarchy, and we can easily determine the hierarchical relation between the sender and receivers of any email. [sent-80, score-0.368]
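A minimal sketch of this lookup, assuming a person_id_of mapping from email addresses to the unique identifiers and a dominance set of (dominator, dominated) identifier pairs; both names are illustrative:

```python
def hierarchical_relation(sender_addr, recipient_addr, person_id_of, dominance):
    """Return 'sender_dominates', 'recipient_dominates', or None for a pair of
    email addresses, using the shared person identifiers of the gold standard."""
    s, r = person_id_of.get(sender_addr), person_id_of.get(recipient_addr)
    if s is None or r is None:
        return None
    if (s, r) in dominance:
        return "sender_dominates"
    if (r, s) in dominance:
        return "recipient_dominates"
    return None
```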
47 The whole set of person nodes is divided into two parts: core and non-core. [sent-81, score-0.125]
48 The set of core people consists of those whose inboxes were taken to create the Enron email network (a set of 158 people). [sent-82, score-0.876]
49 The set of non-core people consists of the remaining people in the network, who send an email to or receive an email from a member of the core group. [sent-83, score-1.457]
50 As expected, the email exchange network (the network induced from the emails) is densest among core people (density of 20.997% in the email exchange network), [sent-84, score-1.142]
51 and much less dense among the non-core people (density of 0. [sent-85, score-0.697]
52 4 A Hierarchy Predictor Based on the Social Network We construct the email exchange network as follows. [sent-89, score-0.731]
53 This network is represented as an undirected weighted graph. [sent-90, score-0.183]
54 We add a link between two employees if one sends at least one email to the other (who can be a TO, CC, or BCC recipient). [sent-92, score-0.695]
55 The weight is the number of emails exchanged between the two. [sent-93, score-0.12]
56 Our email exchange network consists of 407,095 weighted links and 93,421 nodes. [sent-94, score-0.748]
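A sketch of this construction using the networkx library; the emails iterator of (sender, recipients) pairs is an assumption about how the threaded corpus would be traversed, not part of the released resource:

```python
import networkx as nx

def build_exchange_network(emails):
    """emails: iterable of (sender_id, [recipient_ids]) pairs, where recipients
    include TO, CC, and BCC. Returns an undirected graph whose edge weights
    count the emails exchanged between the two people."""
    g = nx.Graph()
    for sender, recipients in emails:
        for recipient in recipients:
            if g.has_edge(sender, recipient):
                g[sender][recipient]["weight"] += 1
            else:
                g.add_edge(sender, recipient, weight=1)
    return g
```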
57 Our algorithm for predicting the dominance relation using a social network analysis metric is simple. [sent-95, score-0.842]
58 We calculate the degree centrality of every node in the email exchange network, and then rank the nodes by their degree centrality. [sent-96, score-0.942]
59 Recall that the degree centrality is the proportion of nodes in the network with which a node is connected. [sent-97, score-0.506]
60 For a discussion of the use of degree centrality as a valid indication of importance of nodes in a network, see Chuah and Coman (2009). [sent-99, score-0.276]
61 Let CD(n) be the degree centrality of node n, and let DOM be the dominance relation (transitive, not symmetric) induced by the organizational hierarchy. [sent-100, score-1.056]
62 We then simply assume that for two people p1 and p2, if CD(p1) > CD(p2), then DOM(p1, p2). [sent-101, score-0.125]
63 For every pair of people who are related with an organizational dominance relation in the gold standard, we then predict which person dominates the other. [sent-102, score-1.122]
64 Note that we do not predict if two people are in a dominance relation to begin with. [sent-103, score-0.749]
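A sketch of this predictor and of the evaluation described below, reusing a graph built as in the previous sketch; nx.degree_centrality computes exactly the proportion-of-nodes definition of CD(n), while the gold-pair format and the tie-handling convention (ties count as errors) are assumptions, since the text does not specify them:

```python
import networkx as nx

def predict_and_evaluate(graph, gold_pairs):
    """gold_pairs: iterable of (p1, p2) where p1 dominates p2 in the gold standard.
    Predict DOM(p1, p2) whenever CD(p1) > CD(p2) and report accuracy against
    the 50% random baseline."""
    cd = nx.degree_centrality(graph)  # proportion of other nodes each node touches
    gold_pairs = list(gold_pairs)
    correct = sum(1 for p1, p2 in gold_pairs if cd.get(p1, 0.0) > cd.get(p2, 0.0))
    return correct / len(gold_pairs)
```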
65 The task of predicting if two people are in a dominance relation is different and we do not address that task in this paper. [sent-104, score-0.152]
66 [Table 1 (columns: Type, # pairs, % Acc): Prediction accuracy by type of predicted organizational dominance pair; “Inter” means that one element of the pair is from the core and the other is not; a negative error reduction indicates an increase in error.] [sent-105, score-1.17]
67 Therefore, we restrict our evaluation to pairs of people (p1, p2) who are related hierarchically (i.e., [sent-106, score-0.162]
68 either DOM(p1, p2) or DOM(p2, p1) in the gold standard). [sent-108, score-0.091]
69 Since we only predict the directionality of the dominance relation of people given they are in a hierarchical relation, the random baseline for our task performs at 50%. [sent-109, score-0.749]
70 We have 13,724 such pairs of people in the gold standard. [sent-110, score-0.238]
71 When we use the network induced simply by the email exchanges, we get a remarkably high accuracy of 83. [sent-111, score-0.664]
72 In this paper, we also make an observation crucial for the task of hierarchy prediction, based on the distinction between the core and the non-core groups (see Section 3). [sent-114, score-0.216]
73 This distinction is crucial for this task since by definition the degree centrality measure (which depends on how accurately the underlying network expresses the communication network) suffers from missing email messages (for the non-core group). [sent-115, score-1.006]
74 Since we have a richer network for the core group, degree centrality is a better predictor for this group than for the non-core group. [sent-117, score-0.6]
75 We also note that the prediction accuracy is by far the highest for the inter hierarchical pairs. [sent-118, score-0.13]
76 The inter hierarchical pairs are those in which one node is from the core group of people and the other node is from the non-core group of people. [sent-119, score-0.429]
77 This is explained by the fact that the core group was chosen by law enforcement because their inboxes were most likely to contain information relevant to the legal proceedings against Enron. [sent-120, score-0.127]
78 Furthermore, because of the network characteristics described above (a relatively dense network), the core people are also more likely to have a high degree centrality, as compared to the non-core people. [sent-126, score-0.615]
79 Therefore, the correlation between degree centrality and hierarchical dominance will be high. [sent-127, score-0.874]
80 5 Using NLP and SNA In this section we compare and contrast the performance of NLP-based systems with that of SNA-based systems on the Enron hierarchy gold standard we introduce in this paper. [sent-128, score-0.255]
81 We first determine an upper bound for current NLP-based systems. [sent-131, score-0.073]
82 Current NLP-based systems predict dominance relations between a pair of people by using the language used in email exchanges between these people; if there is no email exchange, such methods cannot make a prediction. [sent-132, score-1.722]
83 Let G be the set of all dominance relations in the gold standard (|G| = 13,724). [sent-133, score-0.706]
84 We define T ⊂ G to be the set of pairs in the gold standard such that the people involved in the pair in T communicate with each other. [sent-134, score-0.258]
85 These are precisely the dominance relations in the gold standard which can be established using a current NLP-based approach. [sent-135, score-0.706]
86 Therefore, if we consider a perfect NLP system that correctly predicts the dominance of the 2,640 tuples in T and randomly guesses the dominance relation of the remaining 11,084 tuples, the system would achieve an accuracy of (2640 + 11084/2)/13724 ≈ 59.6%. [sent-137, score-1.147]
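Written out, this upper bound credits a perfect NLP system with all pairs in T and half of the remaining pairs:

\[
\mathrm{UB}_{\mathrm{NLP}} \;=\; \frac{|T| + \frac{1}{2}\,(|G| - |T|)}{|G|} \;=\; \frac{2640 + 11084/2}{13724} \;\approx\; 0.596
\]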
87 We refer to this number as the upper bound on the best performing NLP system for the gold standard. [sent-139, score-0.164]
88 27% absolute) than a simple SNA-based system (SNAG, explained in section 4) that predicts the dominance relation for all the tuples in the gold standard G. [sent-142, score-0.754]
89 164 As explained in section 2, we use the phrases provided by Gilbert (2012) to build an NLP-based model for predicting dominance relations of tuples in set T ⊂ G. [sent-143, score-0.691]
90 Note that we only use the tuples from the gold standard where the NLP-based system may hope to make a prediction (i.e., the tuples in T). [sent-144, score-0.2]
91 37% compared to the social network-based approach (SNAT) which achieves a higher accuracy of 87. [sent-148, score-0.07]
92 This comparison shows that the SNA-based approach outperforms the NLP-based approach even if we evaluate on a much smaller part of the gold standard, namely the part where an NLP-based approach does not suffer from having to make a random prediction for nodes that do not communicate via email. [sent-150, score-0.144]
93 6 Future Work One key challenge of the problem of predicting dominance relations of Enron employees based on their emails is that the underlying network is incomplete. [sent-155, score-0.628]
94 We hypothesize that SNA-based approaches are sensitive to the goodness with which the underlying network represents the true social network. [sent-156, score-0.253]
95 Part of the missing network may be recoverable by analyzing the content of emails. [sent-157, score-0.204]
96 Using sophisticated NLP techniques, we may be able to enrich the network and use standard SNA metrics to predict the dominance relations in the gold standard. [sent-158, score-0.951]
97 Segmentation and automated social hierarchy detection through email network analysis. [sent-186, score-0.847]
98 Communication networks from the Enron email corpus: It’s always about the people. [sent-196, score-0.799]
99 In Proceedings of the 2006 conference on Statistical network analysis, ICML’06, pages 179–181, Berlin, Heidelberg. [sent-213, score-0.183]
100 Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pages 109–117. [sent-222, score-0.253]
wordName wordTfidf (topN-words)
[('dominance', 0.53), ('email', 0.456), ('enron', 0.343), ('employees', 0.239), ('centrality', 0.205), ('network', 0.183), ('organizational', 0.171), ('hierarchy', 0.138), ('people', 0.125), ('emails', 0.12), ('rowe', 0.12), ('exchange', 0.092), ('gold', 0.091), ('gilbert', 0.089), ('dominates', 0.089), ('core', 0.078), ('sna', 0.074), ('degree', 0.071), ('social', 0.07), ('hierarchal', 0.068), ('shetty', 0.068), ('predict', 0.062), ('bramsen', 0.06), ('relations', 0.059), ('tuples', 0.055), ('dom', 0.054), ('lumbi', 0.054), ('adibi', 0.051), ('creamer', 0.051), ('diehl', 0.051), ('klimt', 0.051), ('namata', 0.045), ('bound', 0.044), ('charts', 0.038), ('recipient', 0.036), ('inter', 0.034), ('predictor', 0.034), ('chuah', 0.034), ('diesner', 0.034), ('exchanges', 0.034), ('hershkop', 0.034), ('inboxes', 0.034), ('mongodb', 0.034), ('nlpbased', 0.034), ('noncore', 0.034), ('palus', 0.034), ('snabased', 0.034), ('upward', 0.034), ('relation', 0.032), ('apoorv', 0.03), ('galileo', 0.03), ('salvatore', 0.03), ('threaded', 0.03), ('group', 0.029), ('upper', 0.029), ('prediction', 0.028), ('cd', 0.028), ('attempted', 0.028), ('titles', 0.028), ('lise', 0.027), ('co', 0.027), ('predicting', 0.027), ('standard', 0.026), ('quantitative', 0.025), ('identifiers', 0.025), ('shlomo', 0.025), ('researchers', 0.025), ('induced', 0.025), ('nodes', 0.025), ('corporate', 0.024), ('dense', 0.024), ('hampered', 0.024), ('columbia', 0.023), ('database', 0.023), ('aaron', 0.023), ('nlp', 0.022), ('node', 0.022), ('person', 0.022), ('pairs', 0.022), ('immediately', 0.021), ('missing', 0.021), ('density', 0.021), ('transitive', 0.021), ('unique', 0.021), ('explained', 0.02), ('communicate', 0.02), ('job', 0.02), ('ny', 0.02), ('communication', 0.019), ('formed', 0.017), ('messages', 0.017), ('links', 0.017), ('edu', 0.017), ('resource', 0.017), ('message', 0.016), ('immediate', 0.016), ('berlin', 0.016), ('limitation', 0.016), ('restrict', 0.015), ('org', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
Author: Apoorv Agarwal ; Adinoyi Omuya ; Aaron Harnly ; Owen Rambow
Abstract: Many researchers have attempted to predict the Enron corporate hierarchy from the data. This work, however, has been hampered by a lack of data. We present a new, large, and freely available gold-standard hierarchy. Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems.
2 0.081215598 73 acl-2012-Discriminative Learning for Joint Template Filling
Author: Einat Minkov ; Luke Zettlemoyer
Abstract: This paper presents a joint model for template filling, where the goal is to automatically specify the fields of target relations such as seminar announcements or corporate acquisition events. The approach models mention detection, unification and field extraction in a flexible, feature-rich model that allows for joint modeling of interdependencies at all levels and across fields. Such an approach can, for example, learn likely event durations and the fact that start times should come before end times. While the joint inference space is large, we demonstrate effective learning with a Perceptron-style approach that uses simple, greedy beam decoding. Empirical results in two benchmark domains demonstrate consistently strong performance on both mention de- tection and template filling tasks.
3 0.059138507 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System
Author: Wen-Tai Hsieh ; Chen-Ming Wu ; Tsun Ku ; Seng-cho T. Chou
Abstract: Social Event Radar is a new social networking-based service platform, that aim to alert as well as monitor any merchandise flaws, food-safety related issues, unexpected eruption of diseases or campaign issues towards to the Government, enterprises of any kind or election parties, through keyword expansion detection module, using bilingual sentiment opinion analysis tool kit to conclude the specific event social dashboard and deliver the outcome helping authorities to plan “risk control” strategy. With the rapid development of social network, people can now easily publish their opinions on the Internet. On the other hand, people can also obtain various opinions from others in a few seconds even though they do not know each other. A typical approach to obtain required information is to use a search engine with some relevant keywords. We thus take the social media and forum as our major data source and aim at collecting specific issues efficiently and effectively in this work. 163 Chen-Ming Wu Institute for Information Industry cmwu@ i i i .org .tw Seng-cho T. Chou Department of IM, National Taiwan University chou @ im .ntu .edu .tw 1
4 0.041673563 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
Author: Tsung-Ting Kuo ; San-Chuan Hung ; Wei-Shih Lin ; Nanyun Peng ; Shou-De Lin ; Wei-Fen Lin
Abstract: This paper brings a marriage of two seemly unrelated topics, natural language processing (NLP) and social network analysis (SNA). We propose a new task in SNA which is to predict the diffusion of a new topic, and design a learning-based framework to solve this problem. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Our framework is evaluated on real data collected from public domain. The experiments show 16% AUC improvement over baseline methods. The source code and dataset are available at http://www.csie.ntu.edu.tw/~d97944007/dif fusion/ 1 Background The diffusion of information on social networks has been studied for decades. Generally, the proposed strategies can be categorized into two categories, model-driven and data-driven. The model-driven strategies, such as independent cascade model (Kempe et al., 2003), rely on certain manually crafted, usually intuitive, models to fit the diffusion data without using diffusion history. The data-driven strategies usually utilize learning-based approaches to predict the future propagation given historical records of prediction (Fei et al., 2011; Galuba et al., 2010; Petrovic et al., 2011). Data-driven strategies usually perform better than model-driven approaches because the past diffusion behavior is used during learning (Galuba et al., 2010). Recently, researchers started to exploit content information in data-driven diffusion models (Fei et al., 2011; Petrovic et al., 2011; Zhu et al., 2011). 344 However, most of the data-driven approaches assume that in order to train a model and predict the future diffusion of a topic, it is required to obtain historical records about how this topic has propagated in a social network (Petrovic et al., 2011; Zhu et al., 2011). We argue that such assumption does not always hold in the real-world scenario, and being able to forecast the propagation of novel or unseen topics is more valuable in practice. For example, a company would like to know which users are more likely to be the source of ‘viva voce’ of a newly released product for advertising purpose. A political party might want to estimate the potential degree of responses of a half-baked policy before deciding to bring it up to public. To achieve such goal, it is required to predict the future propagation behavior of a topic even before any actual diffusion happens on this topic (i.e., no historical propagation data of this topic are available). Lin et al. also propose an idea aiming at predicting the inference of implicit diffusions for novel topics (Lin et al., 2011). The main difference between their work and ours is that they focus on implicit diffusions, whose data are usually not available. Consequently, they need to rely on a model-driven approach instead of a datadriven approach. On the other hand, our work focuses on the prediction of explicit diffusion behaviors. Despite the fact that no diffusion data of novel topics is available, we can still design a data- driven approach taking advantage of some explicit diffusion data of known topics. Our experiments show that being able to utilize such information is critical for diffusion prediction. 2 The Novel-Topic Diffusion Model We start by assuming an existing social network G = (V, E), where V is the set of nodes (or user) v, and E is the set of link e. 
The set of topics is Proce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi3c 4s4–348, denoted as T. Among them, some are considered as novel topics (denoted as N), while the rest (R) are used as the training records. We are also given a set of diffusion records D = {d | d = (src, dest, t) }, where src is the source node (or diffusion source), dest is the destination node, and t is the topic of the diffusion that belongs to R but not N. We assume that diffusions cannot occur between nodes without direct social connection; any diffusion pair implies the existence of a link e = (src, dest) ∈ E. Finally, we assume there are sets of keywords or tags that relevant to each topic (including existing and novel topics). Note that the set of keywords for novel topics should be seen in that of existing topics. From these sets of keywords, we construct a topicword matrix TW = (P(wordj | topici))i,j of which the elements stand for the conditional probabilities that a word appears in the text of a certain topic. Similarly, we also construct a user-word matrix UW= (P(wordj | useri))i,j from these sets of keywords. Given the above information, the goal is to predict whether a given link is active (i.e., belongs to a diffusion link) for topics in N. 2.1 The Framework The main challenge of this problem lays in that the past diffusion behaviors of new topics are missing. To address this challenge, we propose a supervised diffusion discovery framework that exploits the latent semantic information among users, topics, and their explicit / implicit interactions. Intuitively, four kinds of information are useful for prediction: • Topic information: Intuitively, knowing the signatures of a topic (e.g., is it about politics?) is critical to the success of the prediction. • User information: The information of a user such as the personality (e.g., whether this user is aggressive or passive) is generally useful. • User-topic interaction: Understanding the users' preference on certain topics can improve the quality of prediction. • Global information: We include some global features (e.g., topology info) of social network. Below we will describe how these four kinds of information can be modeled in our framework. 2.2 Topic Information We extract hidden topic category information to model topic signature. In particular, we exploit the 345 Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), which is a widely used topic modeling technique, to decompose the topic-word matrix TW into hidden topic categories: TW = TH * HW , where TH is a topic-hidden matrix, HW is hiddenword matrix, and h is the manually-chosen parameter to determine the size of hidden topic categories. TH indicates the distribution of each topic to hidden topic categories, and HW indicates the distribution of each lexical term to hidden topic categories. Note that TW and TH include both existing and novel topics. We utilize THt,*, the row vector of the topic-hidden matrix TH for a topic t, as a feature set. In brief, we apply LDA to extract the topic-hidden vector THt,* to model topic signature (TG) for both existing and novel topics. Topic information can be further exploited. To predict whether a novel topic will be propagated through a link, we can first enumerate the existing topics that have been propagated through this link. 
For each such topic, we can calculate its similarity with the new topic based on the hidden vectors generated above (e.g., using cosine similarity between feature vectors). Then, we sum up the similarity values as a new feature: topic similarity (TS). For example, a link has previously propagated two topics for a total of three times {ACL, KDD, ACL}, and we would like to know whether a new topic, EMNLP, will propagate through this link. We can use the topic-hidden vector to generate the similarity values between EMNLP and the other topics (e.g., {0.6, 0.4, 0.6}), and then sum them up (1.6) as the value of TS. 2.3 User Information Similar to topic information, we extract latent personal information to model user signature (the users are anonymized already). We apply LDA on the user-word matrix UW: UW = UM * MW , where UM is the user-hidden matrix, MW is the hidden-word matrix, and m is the manually-chosen size of hidden user categories. UM indicates the distribution of each user to the hidden user categories (e.g., age). We then use UMu,*, the row vector of UM for the user u, as a feature set. In brief, we apply LDA to extract the user-hidden vector UMu,* for both source and destination nodes of a link to model user signature (UG). 2.4 User-Topic Interaction Modeling user-topic interaction turns out to be non-trivial. It is not useful to exploit latent semantic analysis directly on the user-topic matrix UR = UQ * QR , where UR represents how many times each user is diffused for existing topic R (R ∈ T), because UR does not contain information of novel topics, and neither do UQ and QR. Given no propagation record about novel topics, we propose a method that allows us to still extract implicit user-topic information. First, we extract from the matrix TH (described in Section 2.2) a subset RH that contains only information about existing topics. Next we apply left division to derive another userhidden matrix UH: UH = (RH \ URT)T = ((RHT RH )-1 RHT URT)T Using left division, we generate the UH matrix using existing topic information. Finally, we exploit UHu,*, the row vector of the user-hidden matrix UH for the user u, as a feature set. Note that novel topics were included in the process of learning the hidden topic categories on RH; therefore the features learned here do implicitly utilize some latent information of novel topics, which is not the case for UM. Experiments confirm the superiority of our approach. Furthermore, our approach ensures that the hidden categories in topic-hidden and user-hidden matrices are identical. Intuitively, our method directly models the user’s preference to topics’ signature (e.g., how capable is this user to propagate topics in politics category?). In contrast, the UM mentioned in Section 2.3 represents the users’ signature (e.g., aggressiveness) and has nothing to do with their opinions on a topic. In short, we obtain the user-hidden probability vector UHu,* as a feature set, which models user preferences to latent categories (UPLC). 2.5 Global Features Given a candidate link, we can extract global social features such as in-degree (ID) and outdegree (OD). We tried other features such as PageRank values but found them not useful. Moreover, we extract the number of distinct topics (NDT) for a link as a feature. The intuition behind this is that the more distinct topics a user has diffused to another, the more likely the diffusion will happen for novel topics. 
346 2.6 Complexity Analysis The complexity to produce each feature is as below: (1) Topic information: O(I * |T| * h * Bt) for LDA using Gibbs sampling, where Iis # of the iterations in sampling, |T| is # of topics, and Bt is the average # of tokens in a topic. (2) User information: O(I * |V| * m * Bu) , where |V| is # of users, and Bu is the average # of tokens for a user. (3) User-topic interaction: the time complexity is O(h3 + h2 * |T| + h * |T| * |V|). (4) Global features: O(|D|), where |D| is # of diffusions. 3 Experiments For evaluation, we try to use the diffusion records of old topics to predict whether a diffusion link exists between two nodes given a new topic. 3.1 Dataset and Evaluation Metric We first identify 100 most popular topic (e.g., earthquake) from the Plurk micro-blog site between 01/201 1 and 05/201 1. Plurk is a popular micro-blog service in Asia with more than 5 million users (Kuo et al., 2011). We manually separate the 100 topics into 7 groups. We use topic-wise 4-fold cross validation to evaluate our method, because there are only 100 available topics. For each group, we select 3/4 of the topics as training and 1/4 as validation. The positive diffusion records are generated based on the post-response behavior. That is, if a person x posts a message containing one of the selected topic t, and later there is a person y responding to this message, we consider a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance). Our dataset contains a total of 1,642,894 positive instances out of 100 distinct topics; the largest and smallest topic contains 303,424 and 2,166 diffusions, respectively. Also, the same amount of negative instances for each topic (totally 1,642,894) is sampled for binary classification (similar to the setup in KDD Cup 2011 Track 2). The negative links of a topic t are sampled randomly based on the absence of responses for that given topic. The underlying social network is created using the post-response behavior as well. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) on at least one topic. Eventually we generated a social network of 163,034 nodes and 382,878 links. Furthermore, the sets of keywords for each topic are required to create the TW and UW matrices for latent topic analysis; we simply extract the content of posts and responses for each topic to create both matrices. We set the hidden category number h = m = 7, which is equal to the number of topic groups. We use area under ROC curve (AUC) to evaluate our proposed framework (Davis and Goadrich, 2006); we rank the testing instances based on their likelihood of being positive, and compare it with the ground truth to compute AUC. 3.2 Implementation and Baseline After trying many classifiers and obtaining similar results for all of them, we report only results from LIBLINEAR with c=0.0001 (Fan et al., 2008) due to space limitation. We remove stop-words, use SCWS (Hightman, 2012) for tokenization, and MALLET (McCallum, 2002) and GibbsLDA++ (Phan and Nguyen, 2007) for LDA. There are three baseline models we compare the result with. First, we simply use the total number of existing diffusions among all topics between two nodes as the single feature for prediction. Second, we exploit the independent cascading model (Kempe et al., 2003), and utilize the normalized total number of diffusions as the propagation probability of each link. 
Third, we try the heat diffusion model (Ma et al., 2008), set initial heat proportional to out-degree, and tune the diffusion time parameter until the best results are obtained. Note that we did not compare with any data-driven approaches, as we have not identified one that can predict diffusion of novel topics. 3.3 Results The result of each model is shown in Table 1. All except two features outperform the baseline. The best single feature is TS. Note that UPLC performs better than UG, which verifies our hypothesis that maintaining the same hidden features across different LDA models is better. We further conduct experiments to evaluate different combinations of features (Table 2), and found that the best one (TS + ID + NDT) results in about 16% improvement over the baseline, and outperforms the combination of all features. As stated in (Witten et al., 2011), 347 adding useless features may cause the performance of classifiers to deteriorate. Intuitively, TS captures both latent topic and historical diffusion information, while ID and NDT provide complementary social characteristics of users. 4 Conclusions The main contributions of this paper are as below: 1. We propose a novel task of predicting the diffusion of unseen topics, which has wide applications in real-world. 2. Compared to the traditional model-driven or content-independent data-driven works on diffusion analysis, our solution demonstrates how one can bring together ideas from two different but promising areas, NLP and SNA, to solve a challenging problem. 3. Promising experiment result (74% in AUC) not only demonstrates the usefulness of the proposed models, but also indicates that predicting diffusion of unseen topics without historical diffusion data is feasible. Acknowledgments This work was also supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC 100-291 1-I-002-001, and 101R7501. References David M. Blei, Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3.993-1022. Jesse Davis & Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, Pennsylvania. Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, XiangRui Wang & Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9.1871-74. Hongliang Fei, Ruoyi Jiang, Yuhao Yang, Bo Luo & Jun Huan. 2011. Content based social behavior prediction: a multi-task learning approach. Proceedings of the 20th ACM international conference on Information and knowledge management, Glasgow, Scotland, UK. Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic & Wolfgang Kellerer. 2010. Outtweeting the twitterers - predicting information cascades in microblogs. Proceedings of the 3rd conference on Online social networks, Boston, MA. Hightman. 2012. Simple Chinese Words Segmentation (SCWS). David Kempe, Jon Kleinberg & Eva Tardos. 2003. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C. Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Shou-De Lin, Ting-Chun Peng & Chia-Chun Shih. 2011. Assessing the Quality of Diffusion Models Using Real-World Social Network Data. Conference on Technologies and Applications of Artificial Intelligence, 2011. C.X. Lin, Q.Z. Mei, Y.L. Jiang, J.W. Han & S.X. Qi. 2011. 
Inferring the Diffusion and Evolution of Topics in Social Communities. Proceedings of the IEEE International Conference on Data Mining, 2011. Hao Ma, Haixuan Yang, Michael R. Lyu & Irwin King. 2008. Mining social networks using heat diffusion processes for marketing candidates selection. Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA. Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. Sasa Petrovic, Miles Osborne & Victor Lavrenko. 2011. RT to Win! Predicting Message Propagation in Twitter. International AAAI Conference on Weblogs and Social Media, 2011. 348 Xuan-Hieu Phan & Cam-Tu Nguyen. 2007. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA). Ian H. Witten, Eibe Frank & Mark A. Hall. 2011. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc. Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu & Ying Zhang. 2011. Statistically Modeling the Effectiveness of Disaster Information in Social Media. Proceedings of the 2011 IEEE Global Humanitarian Technology Conference.
5 0.039254718 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction
Author: Mark Johnson ; Katherine Demuth ; Michael Frank
Abstract: This paper uses an unsupervised model of grounded language acquisition to study the role that social cues play in language acquisition. The input to the model consists of (orthographically transcribed) child-directed utterances accompanied by the set of objects present in the non-linguistic context. Each object is annotated by social cues, indicating e.g., whether the caregiver is looking at or touching the object. We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. Exploiting social cues improves the performance of all models. Our models learn the relative importance of each social cue jointly with word-object mappings and collocation structure, consis- tent with the idea that children could discover the importance of particular social information sources during word learning.
6 0.035099797 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
7 0.032316685 117 acl-2012-Improving Word Representations via Global Context and Multiple Word Prototypes
8 0.030374004 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations
9 0.028916305 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
10 0.027928095 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
11 0.027569193 188 acl-2012-Subgroup Detector: A System for Detecting Subgroups in Online Discussions
12 0.026831239 191 acl-2012-Temporally Anchored Relation Extraction
13 0.026824489 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
14 0.025921732 153 acl-2012-Named Entity Disambiguation in Streaming Data
15 0.02539308 187 acl-2012-Subgroup Detection in Ideological Discussions
16 0.024899947 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
17 0.02443191 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
18 0.024377951 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
19 0.023804547 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
topicId topicWeight
[(0, -0.071), (1, 0.053), (2, -0.008), (3, 0.016), (4, 0.011), (5, 0.011), (6, -0.01), (7, 0.007), (8, 0.026), (9, 0.038), (10, 0.017), (11, 0.003), (12, -0.011), (13, 0.007), (14, 0.014), (15, 0.021), (16, -0.038), (17, 0.009), (18, -0.013), (19, -0.02), (20, -0.013), (21, -0.023), (22, 0.021), (23, 0.021), (24, -0.043), (25, 0.07), (26, -0.016), (27, -0.033), (28, 0.03), (29, -0.03), (30, 0.02), (31, -0.037), (32, 0.078), (33, 0.112), (34, 0.075), (35, 0.068), (36, 0.062), (37, -0.025), (38, -0.076), (39, -0.028), (40, 0.008), (41, 0.02), (42, -0.031), (43, -0.066), (44, 0.102), (45, -0.003), (46, -0.099), (47, -0.074), (48, 0.036), (49, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.93313974 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
Author: Apoorv Agarwal ; Adinoyi Omuya ; Aaron Harnly ; Owen Rambow
Abstract: Many researchers have attempted to predict the Enron corporate hierarchy from the data. This work, however, has been hampered by a lack of data. We present a new, large, and freely available gold-standard hierarchy. Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems.
2 0.61178446 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations
Author: JinYeong Bak ; Suin Kim ; Alice Oh
Abstract: In social psychology, it is generally accepted that one discloses more of his/her personal information to someone in a strong relationship. We present a computational framework for automatically analyzing such self-disclosure behavior in Twitter conversations. Our framework uses text mining techniques to discover topics, emotions, sentiments, lexical patterns, as well as personally identifiable information (PII) and personally embarrassing information (PEI). Our preliminary results illustrate that in relationships with high relationship strength, Twitter users show significantly more frequent behaviors of self-disclosure.
3 0.61114776 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs
Author: Lun-Wei Ku ; Cheng-Wei Sun ; Ya-Hsin Hsueh
Abstract: We present IlluMe, a software tool pack which creates a personalized ambient using the music and lighting. IlluMe includes an emotion analysis software, the small space ambient lighting, and a multimedia controller. The software analyzes emotional changes from instant message logs and corresponds the detected emotion to the best sound and light settings. The ambient lighting can sparkle with different forms of light and the smart phone can broadcast music respectively according to different atmosphere. All settings can be modified by the multimedia controller at any time and the new settings will be feedback to the emotion analysis software. The IlluMe system, equipped with the learning function, provides a link between residential situation and personal emotion. It works in a Chinese chatting environment to illustrate the language technology in life.
4 0.57291114 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System
Author: Wen-Tai Hsieh ; Chen-Ming Wu ; Tsun Ku ; Seng-cho T. Chou
Abstract: Social Event Radar is a new social networking-based service platform, that aim to alert as well as monitor any merchandise flaws, food-safety related issues, unexpected eruption of diseases or campaign issues towards to the Government, enterprises of any kind or election parties, through keyword expansion detection module, using bilingual sentiment opinion analysis tool kit to conclude the specific event social dashboard and deliver the outcome helping authorities to plan “risk control” strategy. With the rapid development of social network, people can now easily publish their opinions on the Internet. On the other hand, people can also obtain various opinions from others in a few seconds even though they do not know each other. A typical approach to obtain required information is to use a search engine with some relevant keywords. We thus take the social media and forum as our major data source and aim at collecting specific issues efficiently and effectively in this work. 163 Chen-Ming Wu Institute for Information Industry cmwu@ i i i .org .tw Seng-cho T. Chou Department of IM, National Taiwan University chou @ im .ntu .edu .tw 1
5 0.47922936 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction
Author: Mark Johnson ; Katherine Demuth ; Michael Frank
Abstract: This paper uses an unsupervised model of grounded language acquisition to study the role that social cues play in language acquisition. The input to the model consists of (orthographically transcribed) child-directed utterances accompanied by the set of objects present in the non-linguistic context. Each object is annotated by social cues, indicating e.g., whether the caregiver is looking at or touching the object. We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. Exploiting social cues improves the performance of all models. Our models learn the relative importance of each social cue jointly with word-object mappings and collocation structure, consis- tent with the idea that children could discover the importance of particular social information sources during word learning.
6 0.43172908 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
7 0.36739251 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum
8 0.36627012 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
9 0.35017231 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction
10 0.34815037 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
11 0.34562513 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench
12 0.34272268 73 acl-2012-Discriminative Learning for Joint Template Filling
13 0.34149891 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language
14 0.32842705 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT
15 0.32112998 129 acl-2012-Learning High-Level Planning from Text
16 0.31100872 160 acl-2012-Personalized Normalization for a Multilingual Chat System
17 0.30793324 195 acl-2012-The Creation of a Corpus of English Metalanguage
18 0.29527876 153 acl-2012-Named Entity Disambiguation in Streaming Data
19 0.29521245 216 acl-2012-Word Epoch Disambiguation: Finding How Words Change Over Time
20 0.29285556 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
topicId topicWeight
[(23, 0.011), (25, 0.013), (26, 0.038), (28, 0.03), (30, 0.02), (37, 0.035), (39, 0.077), (74, 0.013), (82, 0.032), (84, 0.015), (90, 0.062), (92, 0.052), (94, 0.43), (99, 0.046)]
simIndex simValue paperId paperTitle
1 0.9332093 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation
Author: Seung-Wook Lee ; Dongdong Zhang ; Mu Li ; Ming Zhou ; Hae-Chang Rim
Abstract: In this paper, we propose a novel method of reducing the size of translation model for hierarchical phrase-based machine translation systems. Previous approaches try to prune infrequent entries or unreliable entries based on statistics, but cause a problem of reducing the translation coverage. On the contrary, the proposed method try to prune only ineffective entries based on the estimation of the information redundancy encoded in phrase pairs and hierarchical rules, and thus preserve the search space of SMT decoders as much as possible. Experimental results on Chinese-toEnglish machine translation tasks show that our method is able to reduce almost the half size of the translation model with very tiny degradation of translation performance.
same-paper 2 0.81266594 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
Author: Apoorv Agarwal ; Adinoyi Omuya ; Aaron Harnly ; Owen Rambow
Abstract: Many researchers have attempted to predict the Enron corporate hierarchy from the data. This work, however, has been hampered by a lack of data. We present a new, large, and freely available gold-standard hierarchy. Using our new gold standard, we show that a simple lower bound for social network-based systems outperforms an upper bound on the approach taken by current NLP systems.
3 0.80207229 176 acl-2012-Sentence Compression with Semantic Role Constraints
Author: Katsumasa Yoshikawa ; Ryu Iida ; Tsutomu Hirao ; Manabu Okumura
Abstract: For sentence compression, we propose new semantic constraints to directly capture the relations between a predicate and its arguments, whereas the existing approaches have focused on relatively shallow linguistic properties, such as lexical and syntactic information. These constraints are based on semantic roles and superior to the constraints of syntactic dependencies. Our empirical evaluation on the Written News Compression Corpus (Clarke and Lapata, 2008) demonstrates that our system achieves results comparable to other state-of-the-art techniques.
4 0.7941969 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm
Author: Ashish Vaswani ; Liang Huang ; David Chiang
Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an ‘0 prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 B ).
5 0.78279006 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations
Author: JinYeong Bak ; Suin Kim ; Alice Oh
Abstract: In social psychology, it is generally accepted that one discloses more of his/her personal information to someone in a strong relationship. We present a computational framework for automatically analyzing such self-disclosure behavior in Twitter conversations. Our framework uses text mining techniques to discover topics, emotions, sentiments, lexical patterns, as well as personally identifiable information (PII) and personally embarrassing information (PEI). Our preliminary results illustrate that in relationships with high relationship strength, Twitter users show significantly more frequent behaviors of self-disclosure.
6 0.47820044 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes
7 0.42283183 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
8 0.40273753 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation
9 0.38718146 108 acl-2012-Hierarchical Chunk-to-String Translation
10 0.37423202 136 acl-2012-Learning to Translate with Multiple Objectives
11 0.36903828 140 acl-2012-Machine Translation without Words through Substring Alignment
12 0.36790454 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale
14 0.36038336 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?
15 0.35283902 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
16 0.35145196 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
17 0.35139534 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
18 0.34917748 83 acl-2012-Error Mining on Dependency Trees
19 0.34813234 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
20 0.34749609 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction