emnlp emnlp2012 emnlp2012-33 knowledge-graph by maker-knowledge-mining

33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections


Source: pdf

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads—singlylinked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. [sent-3, score-0.152]

2 As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. [sent-4, score-0.094]

3 To this end, we focus on extracting diverse sets of threads—singlylinked, coherent chains of important documents. [sent-6, score-0.09]

4 To illustrate, we extract research threads from citation graphs and construct timelines from news articles. [sent-7, score-0.922]

5 Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. [sent-8, score-0.129]

6 Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges. [sent-9, score-0.16]

7 In this work we propose a novel approach: threading structured document collections. [sent-13, score-0.198]

8 Consider a large graph, with documents as nodes and edges indicating relationships, as in Figure 1. [sent-14, score-0.136]

9 Our goal is to find a diverse set of paths (or threads) through the collection that are individually coherent and together cover the most salient parts. [sent-15, score-0.178]

10 For example, given a collection of academic papers, we might want to identify the most significant lines of research, threading the citation graph to produce chains of important papers. [sent-16, score-0.22]

11 Or, given news articles connected chronologically, we might want to extract threads of articles to form timelines describing the major events from the most significant news stories. [sent-17, score-1.177]

12 Top-tier news organizations like The New York Times and The Guardian regularly publish such timelines, but have so far been limited to creating them by hand. [sent-18, score-0.089]

13 We show how these kinds of threading tasks can be done efficiently, providing a simple, practical tool for representing graph-based data that offers new possibilities compared with existing models. [sent-20, score-0.134]

14 Several of the core tasks of TDT (Topic Detection and Tracking), like link detection, topic detection, and topic tracking, can be seen as subroutines for the threading problem. [sent-22, score-0.322]

15 We first build a graph from the collection, using measures of importance and relatedness to weight nodes (documents) and build edges (relationships). Then, from this graph, we extract a diverse, salient set of threads to represent the collection. [sent-27, score-0.902]

16 Then, The supplement contains a version of this figure for our real-world news dataset. [sent-29, score-0.118]

17 Our approach is based on structured determinantal point processes (SDPPs) (Kulesza and Taskar, 2010), which offer a natural probabilistic model over sets of structures (such as threads) where diversity is desired, and we incorporate k-DPP extensions to control the number of threads (Kulesza and Taskar, 2011). [sent-30, score-0.709]

18 We apply our model to two real-world datasets, extracting threads of research papers and timelines of news articles. [sent-31, score-0.919]

19 An example of news threads extracted using our model is shown in Figure 2. [sent-32, score-0.768]

20 Quantitative evaluation shows that our model significantly outperforms multiple baselines, including dynamic topic models, in comparisons with human-produced news summaries. [sent-33, score-0.218]

21 It also outperforms baseline methods in a user evaluation of thread coherence, and runs 75 times faster than a dynamic topic model. [sent-34, score-0.271]

22 2 Related Work. A variety of papers from the topic tracking literature are broadly related to our work (Mei and Zhai, 2005; Blei and Lafferty, 2006; Leskovec et al., 2009). [sent-36, score-0.17]

23 Blei and Lafferty (2006) recently introduced dynamic topic models (DTMs) . [sent-38, score-0.129]

24 Assuming a division of documents into time slices, a DTM draws in each slice a set of topics from a Gaussian distribution whose mean is determined by the topics from the previous slice. [sent-39, score-0.111]

25 We engineer a baseline for constructing document threads from DTM topic threads (see Section 6.2.2).

26 However, iDTMs still require placing documents into discrete epochs, and the issue of generating topic rather than document threads remains. [sent-47, score-0.888]

27 Swan and Jensen (2000) proposed a system for finding temporally clustered named entities in news text and presenting them on a timeline. [sent-50, score-0.089]

28 Allan, Gupta, and Khandelwal (2001) introduced the task of temporal summarization, which takes a stream of news articles on a particular topic and tries to extract sentences describing important events as they occur. [sent-51, score-0.284]

29 Here, we are interested not in extracting topically grouped entities or sentences, but instead in organizing a subset of the articles themselves into timelines, with topic identification as a side effect. [sent-53, score-0.195]

30 Above, the threads are shown on a timeline with the most salient words superimposed; below, the dates and headlines from the threads appearing at the bottom are listed. [sent-56, score-1.582]

31 Topic models are not designed for threading and often link together topically similar documents that do not constitute a coherent news story, as on the right. [sent-57, score-0.305]

32 Shahaf, Guestrin, and Horvitz (2012) recently proposed metro maps as alternative structured representations of related news stories. [sent-59, score-0.123]

33 Metro maps are effectively sets of non-chronological threads that are encouraged to intersect and thus create a “map” of events and topics. [sent-60, score-0.679]

34 Shahaf and Guestrin (2010), for example, assume the thread endpoints are specified, and Chieu and Lee (2004) require a set of query words. [sent-62, score-0.142]

35 We prove that even a logarithmic number of projections is sufficient to yield a close approximation to the original SDPP distribution. [sent-70, score-0.104]

36 We assume that the collection has been transformed into a directed graph G = (V, E) on n vertices, where each node corresponds to a document and each edge represents a relationship between documents; the semantics of these relationships depend on the task. [sent-72, score-0.204]

37 The feature map on a thread is then just a sum over the nodes in the thread: φ(y) = Σ_{t=1}^{T} φ(y^{(t)}). [sent-84, score-0.173]
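
As a concrete illustration of this additive map, here is a minimal sketch; the `node_features` lookup is a placeholder for whatever per-document features are used, not an interface from the paper:

```python
import numpy as np

def thread_features(thread, node_features):
    """Compute phi(y) = sum_{t=1}^{T} phi(y^(t)) for a thread y.

    thread: sequence of node ids (y^(1), ..., y^(T)).
    node_features: mapping from node id to a D-dimensional numpy vector.
    """
    return np.sum([node_features[t] for t in thread], axis=0)
```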

38 Given this framework, our goal is to develop a probabilistic model over sets of k threads of length T, favoring sets whose threads have large weight but are also distinct from one another with respect to φ. [sent-88, score-1.358]

39 In other words, a high-probability set under the model should include threads that are both salient and diverse. [sent-89, score-0.767]

40 This is a daunting problem, given that the number of possible sets of threads is O(n^{kT}); for the news data below, with n ≈ 34,504, k = 10, and T = 8, this is on the order of 10^{363} candidate sets. [sent-90, score-0.679]

41 Each item y_i is represented as a normalized D-dimensional feature vector φ(y_i), such that φ(y_i)⊤φ(y_j) ∈ [−1, 1] is a measure of similarity between items y_i and y_j. [sent-105, score-0.174]

42 To understand why this is the case, note that determinants are closely related to volumes; in particular, det(L_Y) is proportional to the volume spanned by the vectors q(y_i)φ(y_i) for y_i ∈ Y. [sent-107, score-0.224]

43 In our setting, Y contains all threads of length T, so each y_i ∈ Y is a sequence (y_i^{(1)}, …, y_i^{(T)}), where y_i^{(t)} is the document included in the thread at position t. [sent-111, score-1.014]
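
To make the determinant-as-volume intuition concrete, the following sketch scores a candidate set of threads by det(L_Y); how per-node quality scores aggregate into a thread quality (a sum here) is an assumption of this sketch, not the paper's exact construction:

```python
import numpy as np

def thread_set_score(threads, node_features, node_quality):
    """Unnormalized DPP score det(L_Y) for a set Y of threads, with
    L_ij = q(y_i) q(y_j) phi(y_i)^T phi(y_j)."""
    phis = np.stack([np.sum([node_features[t] for t in th], axis=0)
                     for th in threads])
    qs = np.array([np.sum([node_quality[t] for t in th]) for th in threads])
    L = np.outer(qs, qs) * (phis @ phis.T)
    # Near-parallel feature vectors shrink the spanned volume, so sets of
    # similar threads get low scores while diverse, salient sets score high.
    return np.linalg.det(L)
```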

44 However, it is possible (and efficient, due to the linear scaling) to allow longer threads, as well as threads of variable length. [sent-127, score-0.679]

45 The latter effect can be achieved by adding a single “dummy” node to the document graph, with incoming edges from all other documents and a single outgoing self-loop edge. [sent-128, score-0.242]

46 Shorter threads will simply transition to this dummy node when they are complete. [sent-129, score-0.718]
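
A sketch of that graph augmentation; the node name "DUMMY" is illustrative:

```python
def add_dummy_node(adjacency):
    """Augment a document graph so threads may end before T steps.

    adjacency: dict mapping node id -> list of successor node ids.
    Every document gains an edge into the dummy node, which has a single
    outgoing self-loop, so a finished thread simply idles there.
    """
    augmented = {v: list(succs) + ["DUMMY"] for v, succs in adjacency.items()}
    augmented["DUMMY"] = ["DUMMY"]
    return augmented
```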

47 5 Random projections. As described above, the time complexity for sampling sets from SDPPs is O(TrnD²). [sent-136, score-0.112]

48 More recently, Magen and Zouzias (2008) extended this idea to the preservation of volumes spanned by sets of points. [sent-142, score-0.112]

49 Here, we use a relationship between determinants and volumes to adapt the latter result. [sent-143, score-0.103]

50 Practically, Theorem 1 says that if we project φ down to a dimension d logarithmic in the number of documents and linear in thread length, the L1 variational distance between the true model and the projected model is bounded. [sent-163, score-0.311]
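
A sketch of such a projection, using a Gaussian random matrix as one standard Johnson-Lindenstrauss-style choice (the paper's exact projection construction may differ):

```python
import numpy as np

def project_features(Phi, d, seed=0):
    """Project an n x D feature matrix Phi down to d dimensions.

    Entries of the projection matrix are N(0, 1/d); Theorem 1 (not
    reproduced here) bounds the L1 variational distance of the projected
    model when d is logarithmic in n and linear in the thread length T.
    """
    rng = np.random.default_rng(seed)
    G = rng.normal(0.0, 1.0 / np.sqrt(d), size=(Phi.shape[1], d))
    return Phi @ G
```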

51 Displayed beside each thread are a few of its maximum-tfidf words. [sent-234, score-0.142]

52 Paper titles from two of the threads are shown to the right. [sent-235, score-0.679]

53 6 Experiments. We begin by showing the performance of random projections on a small, synthetic threading task where the exact model is tractable, with n = 600 and D = 150. [sent-236, score-0.202]

54 6.1 Cora citation graph. To qualitatively illustrate our model, we apply it to Cora (McCallum et al., 2000). [sent-240, score-0.086]

55 We construct a directed graph with papers as nodes and citations as edges; after removing papers with missing metadata or zero outgoing citations, our graph contains n = 28,155 papers. [sent-243, score-0.267]

56 We represent each document by the 1000 documents to which it is most similar according to NCS (normalized cosine similarity); this results in a binary φ of dimension D = n with exactly 1000 non-zeros. [sent-250, score-0.115]

57 The dot product between the similarity features of two documents is thus proportional to the fraction of top-1000 similar documents they have in common. [sent-251, score-0.102]
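
A sketch of these binary neighborhood features; handling of self-similarity and ties is glossed over here:

```python
import numpy as np

def neighbor_features(sim, k=1000):
    """phi[i, j] = 1 iff document j is among document i's top-k by NCS.

    sim: (n, n) similarity matrix. The dot product of rows i and j then
    counts shared top-k neighbors, i.e., it is proportional to the fraction
    of top-k similar documents the two documents have in common.
    """
    n = sim.shape[0]
    topk = np.argpartition(-sim, k, axis=1)[:, :k]  # indices of the k largest
    phi = np.zeros((n, n))
    phi[np.repeat(np.arange(n), k), topk.ravel()] = 1.0
    return phi
```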

58 The discovered threads occupy distinct regions of word-space, standing apart visually, and contain diverse salient terms. [sent-254, score-0.826]

59 6.2 News articles. For quantitative evaluation, we use newswire data. [sent-256, score-0.101]

60 Our dataset comprises over 200,000 articles from the New York Times, collected from 2005-2007 as part of the English Gigaword corpus (Graff and Cieri, 2009) . [sent-257, score-0.101]

61 We split the articles into six-month time periods, with an average of n = 34,504 articles per period. [sent-258, score-0.202]

62 For each time period, we generate a graph with articles as nodes. [sent-260, score-0.151]

63 We use LexRank for node weights and the top-1000 similar documents as similarity features φ, projecting to d = 50, as before (Section 6.1). [sent-265, score-0.09]
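
For reference, LexRank (Erkan and Radev, 2004) is essentially PageRank over a similarity graph; a minimal power-iteration sketch follows, with the damping constant and iteration count as assumptions of this sketch rather than the paper's settings:

```python
import numpy as np

def lexrank_weights(sim, damping=0.85, iters=50):
    """Stationary scores of a random walk on the document similarity graph.

    sim: (n, n) nonnegative similarity matrix with positive row sums.
    """
    P = sim / sim.sum(axis=1, keepdims=True)  # row-stochastic transitions
    n = sim.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1.0 - damping) / n + damping * (P.T @ r)
    return r
```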

64 For all of the following results, we use T = 8 and k = 10 so that the resulting timelines are of a manageable size for analysis. [sent-269, score-0.118]

65 6.2.1 Graph visualizations. The (very large) news graph for the first half of 2005 can be viewed interactively at http://zoom. [sent-274, score-0.139]

66 In this graph each node (dark circle) represents a news article, and is annotated with its headline. [sent-276, score-0.178]

67 The five colored paths indicate a set of threads sampled from the k-SDPP. [sent-279, score-0.708]

68 Headlines of the articles in each thread are colored to match the thread. [sent-280, score-0.272]

69 We provide a view of a small subgraph for illustration purposes in Figure 6, which shows the incoming and outgoing edges for a single node. [sent-283, score-0.088]

70 Figure 6: Snapshot of a single article node and all of its neighboring article nodes. [sent-330, score-0.101]

71 6.2.2 Baselines. k-means baseline: A simple baseline is to split each six-month period of articles into T equal time slices, then apply k-means clustering to each slice, using NCS to measure distance. [sent-335, score-0.101]

72 We then select the most central article from each cluster, and finally match the k articles from time slice i one-to-one with those from slice i + 1 by computing the pairing that maximizes the average NCS of the pairs. [sent-336, score-0.252]

73 The result is a set of k threads of length T, where no two threads contain the same article. [sent-339, score-1.358]
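
A sketch of this baseline pipeline; the array shapes and the use of the Hungarian algorithm for the one-to-one matching are assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def kmeans_baseline(slices, k):
    """Cluster each time slice, keep the most central article per cluster,
    then link consecutive slices one-to-one to maximize average NCS.

    slices: list of T arrays, each (n_i, D), rows L2-normalized so that dot
    products equal NCS. Returns a (k, T) array of article indices.
    """
    T = len(slices)
    central = []
    for X in slices:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        # Most central article: highest similarity to its cluster centroid.
        central.append(np.array([int(np.argmax(X @ c))
                                 for c in km.cluster_centers_]))
    threads = np.zeros((k, T), dtype=int)
    threads[:, 0] = central[0]
    for i in range(T - 1):
        A = slices[i][threads[:, i]]        # current endpoints of each thread
        B = slices[i + 1][central[i + 1]]   # next slice's central articles
        rows, cols = linear_sum_assignment(-(A @ B.T))  # maximize total NCS
        threads[rows, i + 1] = central[i + 1][cols]
    return threads
```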

74 DTM baseline: A more sophisticated baseline is the dynamic topic model (Blei and Lafferty, 2006), which explicitly attempts to find topics that are smooth through time. [sent-341, score-0.129]

75 We then choose, for each topic at each time step, the document with the highest per-word probability of being generated by that topic. [sent-343, score-0.158]

76 Documents from the same topic form a single thread. [sent-344, score-0.094]

77 6.2.3 Comparison to human summaries. We compare the threads generated by our baselines and sampled from the k-SDPP to a set of human-generated news summaries. [sent-360, score-0.839]

78 The human summaries are not threaded; they are flat, roughly daily news summaries published by Agence France-Presse and found in the Gigaword corpus, distinguished by their “multi” type tag. [sent-361, score-0.231]

79 We compute four statistics: • Cosine similarity: NCS (in percent) between the concatenated threads and the concatenated human summaries. [sent-365, score-0.679]
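
A sketch of this statistic under a tf-idf bag-of-words representation; the exact weighting behind the paper's NCS is not specified here, so treat the vectorizer choice as an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def concat_cosine_percent(thread_texts, summary_texts):
    """Cosine similarity (in percent) between the concatenated system
    threads and the concatenated human summaries."""
    docs = [" ".join(thread_texts), " ".join(summary_texts)]
    X = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized
    return 100.0 * float(X[0].multiply(X[1]).sum())
```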

80 The hyperparameters for all methods—such as the constant feature magnitude ρ for k-SDPPs and the parameter governing topic proportions for DTMs—were tuned to optimize cosine similarity on a development set from January-June 2005. [sent-366, score-0.094]

81 Under each measure, the k-SDPP threads more closely resemble human summaries. [sent-369, score-0.679]

82 Interlopers: average number of interloper articles identified (out of 2) . [sent-375, score-0.151]

83 To obtain a large-scale evaluation of thread coherence, we turn to Mechanical Turk. [sent-379, score-0.142]

84 We asked Turkers to read the headlines and first few sentences of each article in a timeline and then rate the overall narrative coherence of the timeline on a scale of 1 (“the articles are totally unrelated”) to 5 (“the articles tell a single clear story”). [sent-380, score-0.51]

85 We also had Turkers evaluate threads implicitly by performing a simple task. [sent-383, score-0.679]

86 We showed them timelines into which two additional “interloper” articles, selected at random, had been inserted, and asked them to identify the two articles that should be removed to “improve the flow of the timeline”. [sent-384, score-0.32]

87 Intuitively, the interlopers should be selected more often when the original timeline is coherent. [sent-386, score-0.139]

88 The average number of interloper articles correctly identified is shown in Table 2. [sent-387, score-0.151]

89 6.2.5 Runtimes. Finally, we report in Table 3 the time required to produce a complete set of threads for each method. [sent-393, score-0.679]

90 • Neither baseline directly models the document threads themselves. [sent-399, score-0.743]

91 This makes the k-SDPP a better choice for applications where, for instance, the coherence of individual threads is important. [sent-401, score-0.715]

92 • While the baselines seek threads that cover or explain as much of the dataset as possible, k-SDPPs are better suited for tasks where a balance between quality and diversity is key, since their hyperparameters correspond to weights on these quantities. [sent-402, score-0.709]

93 With news timelines, for example, we want not just topical diversity but also a focus on the most important stories. [sent-403, score-0.119]

94 Both baselines require input to be split into time slices, whereas the k-SDPP does not; this flexibility allows the k-SDPP to put multiple articles from a single time slice in a thread, or to build threads that span only part of the input period. [sent-404, score-0.84]

95 • While clustering and topic models rely on EM to approximately optimize their objectives, the k-SDPP comes with an exact, polynomial-time sampling algorithm. [sent-405, score-0.138]

96 The k-SDPP produces more consistent threads due to its use of graph information, while the DTM threads, though topic-focused, are less coherent as a story. [sent-407, score-0.76]

97 Furthermore, DTM threads span the entire time period, while our method selects threads covering only relevant spans. [sent-408, score-1.358]

98 7 Conclusion. We introduced the novel problem of finding diverse and salient threads in graphs of large document collections. [sent-410, score-0.89]

99 We developed a probabilistic approach, combining SDPPs and k-SDPPs, and showed how random projections make inference efficient and yield an approximate model with bounded variational distance to the original. [sent-411, score-0.113]

100 We then demonstrated that the method produces qualitatively reasonable results, and, relative to several baselines, reproduces human news summaries more faithfully, builds more coherent story threads, and is significantly faster. [sent-412, score-0.191]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('threads', 0.679), ('sdpps', 0.168), ('yx', 0.159), ('thread', 0.142), ('shahaf', 0.134), ('threading', 0.134), ('yi', 0.129), ('kulesza', 0.118), ('dtm', 0.118), ('dtms', 0.118), ('ncs', 0.118), ('timelines', 0.118), ('timeline', 0.105), ('articles', 0.101), ('topic', 0.094), ('news', 0.089), ('salient', 0.088), ('determinantal', 0.084), ('dpps', 0.084), ('kpk', 0.084), ('mobile', 0.072), ('guestrin', 0.072), ('slices', 0.072), ('summaries', 0.071), ('projections', 0.068), ('dpp', 0.067), ('idtms', 0.067), ('magen', 0.067), ('sdpp', 0.067), ('lemma', 0.065), ('document', 0.064), ('slice', 0.06), ('volumes', 0.06), ('diverse', 0.059), ('taskar', 0.058), ('lexrank', 0.058), ('chieu', 0.058), ('cora', 0.058), ('theorem', 0.056), ('edges', 0.054), ('pk', 0.054), ('det', 0.054), ('vol', 0.052), ('spanned', 0.052), ('documents', 0.051), ('baghdad', 0.05), ('clients', 0.05), ('hesterberg', 0.05), ('interloper', 0.05), ('turkers', 0.05), ('graph', 0.05), ('ahmed', 0.048), ('mar', 0.045), ('variational', 0.045), ('yj', 0.045), ('sampling', 0.044), ('determinants', 0.043), ('server', 0.043), ('erkan', 0.043), ('swan', 0.043), ('tracking', 0.043), ('xt', 0.041), ('blei', 0.04), ('gy', 0.039), ('mei', 0.039), ('leskovec', 0.039), ('node', 0.039), ('projected', 0.037), ('citations', 0.036), ('citation', 0.036), ('logarithmic', 0.036), ('coherence', 0.036), ('dynamic', 0.035), ('jan', 0.035), ('outgoing', 0.034), ('apr', 0.034), ('chronologically', 0.034), ('egt', 0.034), ('ekt', 0.034), ('feb', 0.034), ('gaza', 0.034), ('interlopers', 0.034), ('iraqi', 0.034), ('lij', 0.034), ('metro', 0.034), ('tdt', 0.034), ('zoomable', 0.034), ('zouzias', 0.034), ('papers', 0.033), ('allan', 0.032), ('yan', 0.032), ('article', 0.031), ('coherent', 0.031), ('nodes', 0.031), ('headlines', 0.031), ('discovering', 0.03), ('diversity', 0.03), ('graff', 0.029), ('supplement', 0.029), ('colored', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads—singlylinked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

2 0.12749316 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

3 0.096983835 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

Author: Ahmed Hassan ; Amjad Abu-Jbara ; Dragomir Radev

Abstract: A mixture of positive (friendly) and negative (antagonistic) relations exist among users in most social media applications. However, many such applications do not allow users to explicitly express the polarity of their interactions. As a result most research has either ignored negative links or was limited to the few domains where such relations are explicitly expressed (e.g. Epinions trust/distrust). We study text exchanged between users in online communities. We find that the polarity of the links between users can be predicted with high accuracy given the text they exchange. This allows us to build a signed network representation of discussions; where every edge has a sign: positive to denote a friendly relation, or negative to denote an antagonistic relation. We also connect our analysis to social psychology theories of balance. We show that the automatically predicted networks are consistent with those theories. Inspired by that, we present a technique for identifying subgroups in discussions by partitioning signed networks representing them.

4 0.074007519 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

Author: Keith Stevens ; Philip Kegelmeyer ; David Andrzejewski ; David Buttler

Abstract: We apply two new automated semantic evaluations to three distinct latent topic models. Both metrics have been shown to align with human evaluations and provide a balance between internal measures of information gain and comparisons to human ratings of coherent topics. We improve upon the measures by introducing new aggregate measures that allows for comparing complete topic models. We further compare the automated measures to other metrics for topic models, comparison to manually crafted semantic tests and document classification. Our experiments reveal that LDA and LSA each have different strengths; LDA best learns descriptive topics while LSA is best at creating a compact semantic representation ofdocuments and words in a corpus.

5 0.071763739 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

Author: Kristian Woodsend ; Mirella Lapata

Abstract: Multi-document summarization involves many aspects of content selection and surface realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

6 0.070783533 72 emnlp-2012-Joint Inference for Event Timeline Construction

7 0.067603722 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

8 0.064078838 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

9 0.063266829 19 emnlp-2012-An Entity-Topic Model for Entity Linking

10 0.063186973 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

11 0.062451191 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

12 0.058935832 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

13 0.05556035 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

14 0.050761107 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

15 0.047897089 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering

16 0.044602409 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

17 0.042844497 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

18 0.040199496 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

19 0.038348779 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

20 0.037335251 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.162), (1, 0.074), (2, 0.043), (3, 0.063), (4, -0.135), (5, 0.077), (6, -0.023), (7, -0.058), (8, 0.013), (9, 0.063), (10, 0.02), (11, -0.031), (12, 0.02), (13, -0.013), (14, -0.021), (15, 0.018), (16, -0.022), (17, -0.026), (18, 0.175), (19, -0.004), (20, -0.097), (21, 0.027), (22, 0.188), (23, -0.042), (24, 0.185), (25, 0.018), (26, -0.085), (27, -0.088), (28, 0.045), (29, -0.114), (30, 0.123), (31, 0.107), (32, -0.057), (33, -0.061), (34, 0.055), (35, -0.004), (36, 0.055), (37, 0.051), (38, -0.001), (39, 0.085), (40, 0.102), (41, 0.052), (42, -0.067), (43, 0.278), (44, -0.004), (45, 0.147), (46, 0.005), (47, 0.115), (48, 0.048), (49, 0.188)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93977189 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads—singlylinked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

2 0.52615976 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

Author: Ahmed Hassan ; Amjad Abu-Jbara ; Dragomir Radev

Abstract: A mixture of positive (friendly) and negative (antagonistic) relations exist among users in most social media applications. However, many such applications do not allow users to explicitly express the polarity of their interactions. As a result most research has either ignored negative links or was limited to the few domains where such relations are explicitly expressed (e.g. Epinions trust/distrust). We study text exchanged between users in online communities. We find that the polarity of the links between users can be predicted with high accuracy given the text they exchange. This allows us to build a signed network representation of discussions; where every edge has a sign: positive to denote a friendly relation, or negative to denote an antagonistic relation. We also connect our analysis to social psychology theories of balance. We show that the automatically predicted networks are consistent with those theories. Inspired by that, we present a technique for identifying subgroups in discussions by partitioning signed networks representing them.

3 0.47966129 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

Author: David McClosky ; Christopher D. Manning

Abstract: We present a distantly supervised system for extracting the temporal bounds of fluents (relations which only hold during certain times, such as attends school). Unlike previous pipelined approaches, our model does not assume independence between each fluent or even between named entities with known connections (parent, spouse, employer, etc.). Instead, we model what makes timelines of fluents consistent by learning cross-fluent constraints, potentially spanning entities as well. For example, our model learns that someone is unlikely to start a job at age two or to marry someone who hasn’t been born yet. Our system achieves a 36% error reduction over a pipelined baseline.

4 0.45769393 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

Author: Stephen Roller ; Michael Speriosu ; Sarat Rallapalli ; Benjamin Wing ; Jason Baldridge

Abstract: The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. We also provide a better way of choosing the locations for pseudo-documents. We evaluate these strategies on existing Wikipedia and Twitter corpora, as well as a new, larger Twitter corpus. The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. The two grid constructions can also be combined to produce consistently strong results across all training sets.

5 0.42354792 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.32881895 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

7 0.32109776 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics

8 0.31768441 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

9 0.29995561 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model

10 0.29909429 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation

11 0.29232183 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

12 0.26950809 72 emnlp-2012-Joint Inference for Event Timeline Construction

13 0.24271798 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

14 0.24024661 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

15 0.23789866 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

16 0.23697983 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

17 0.22325456 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

18 0.22254543 19 emnlp-2012-An Entity-Topic Model for Entity Linking

19 0.21137363 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

20 0.20593624 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.017), (11, 0.012), (16, 0.035), (25, 0.026), (34, 0.077), (45, 0.013), (60, 0.069), (63, 0.073), (64, 0.021), (65, 0.019), (70, 0.014), (73, 0.015), (74, 0.03), (76, 0.048), (78, 0.353), (80, 0.039), (86, 0.022), (95, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76006657 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads—singlylinked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

2 0.38265964 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

Author: Jianxing Yu ; Zheng-Jun Zha ; Tat-Seng Chua

Abstract: This paper proposes to generate appropriate answers for opinion questions about products by exploiting the hierarchical organization of consumer reviews. The hierarchy organizes product aspects as nodes following their parent-child relations. For each aspect, the reviews and corresponding opinions on this aspect are stored. We develop a new framework for opinion Questions Answering, which enables accurate question analysis and effective answer generation by making use the hierarchy. In particular, we first identify the (explicit/implicit) product aspects asked in the questions and their sub-aspects by referring to the hierarchy. We then retrieve the corresponding review fragments relevant to the aspects from the hierarchy. In order to gener- ate appropriate answers from the review fragments, we develop a multi-criteria optimization approach for answer generation by simultaneously taking into account review salience, coherence, diversity, and parent-child relations among the aspects. We conduct evaluations on 11 popular products in four domains. The evaluated corpus contains 70,359 consumer reviews and 220 questions on these products. Experimental results demonstrate the effectiveness of our approach.

3 0.37700668 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum

Abstract: We propose the weakly supervised MultiExperts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.

4 0.37425175 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms ofweak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependencyparsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-theart accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

5 0.37373936 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.37272936 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

7 0.37099022 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

8 0.37030718 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM

9 0.36934397 97 emnlp-2012-Natural Language Questions for the Web of Data

10 0.36871013 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

11 0.36839843 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

12 0.36815363 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

13 0.36591056 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

14 0.36541623 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints

15 0.36522424 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

16 0.36493945 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

17 0.36442307 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

18 0.36349031 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

19 0.36344397 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries

20 0.36310732 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT