acl acl2013 acl2013-126 knowledge-graph by maker-knowledge-mining

126 acl-2013-Diverse Keyword Extraction from Conversations


Source: pdf

Author: Maryam Habibi ; Andrei Popescu-Belis

Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Diverse Keyword Extraction from Conversations Maryam Habibi Idiap Research Institute and EPFL, Rue Marconi 19, CP 592, 1920 Martigny, Switzerland [sent-1, score-0.066]

2 Abstract A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. [sent-3, score-0.806]

3 Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. [sent-4, score-0.296]

4 The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. [sent-5, score-0.165]

5 1 Introduction The goal of keyword extraction from texts is to provide a set of words that are representative of the semantic content of the texts. [sent-7, score-0.542]

6 In the application intended here, keywords are automatically extracted from transcripts of conversation fragments, and are used to formulate queries to a just-in-time document recommender system. [sent-8, score-0.431]

7 It is thus important that the keyword set preserves the diversity of topics from the conversation. [sent-9, score-0.65]

8 In this paper, we propose a new method for keyword extraction that rewards both word similarity, to extract the most representative words, and word diversity, to cover several topics if necessary. [sent-11, score-0.673]

9 In Section 2 we review existing methods for keyword extraction. [sent-13, score-0.392]

10 In Section 3 we describe our proposal, which relies on topic modeling and a novel topic-aware diverse keyword extraction algorithm. [sent-14, score-0.648]

11 2 State of the Art in Keyword Extraction Numerous studies have been conducted to automatically extract keywords from a text or a transcribed conversation. [sent-19, score-0.229]

12 These approaches do not consider word meaning, so they may ignore low-frequency words which together indicate a highly salient topic (Nenkova and McKeown, 2012). [sent-22, score-0.118]

13 Semantic relations between words can be obtained from a manually constructed thesaurus such as WordNet, or from Wikipedia, or from an automatically built thesaurus using latent topic modeling techniques. [sent-24, score-0.19]

14 Harwath and Hazen (2012) used topic modeling with PLSA to build a thesaurus, which they used to rank words based on topical similarity to the topics of a transcribed conversation. [sent-27, score-0.299]

15 However, although they considered topical similarity, the above methods did not explicitly reward diversity and might miss secondary topics. [sent-42, score-0.309]

16 Supervised methods have been used to learn a model for extracting keywords with various learning algorithms (Turney, 1999; Frank et al. [sent-43, score-0.229]

17 These approaches, however, rely on the availability of in-domain training data, and the objective functions they use for learning do not yet consider the diversity of keywords. [sent-45, score-0.164]

18 3 Diverse Keyword Extraction We propose to build a topical representation of a conversation fragment, and then to select keywords using topical similarity while also rewarding the diversity of topic coverage, inspired by recent summarization methods (Lin and Bilmes, 2011; Li et al. [sent-46, score-0.925]

19 1 Representing Topic Information Topic models such as Probabilistic Latent Semantic Analysis (PLSA) or Latent Dirichlet Allocation (LDA) can be used to determine the distribution over topics z of a word w, noted p(z|w), from a large amount of training documents. [sent-49, score-0.184]

20 The distribution of each topic z in a given conversation fragment t, noted p(z|t), can be computed by averaging over all the probabilities p(z|w) of the N words w spoken in the fragment: p(z|t) = (1/N) Σ_{w∈t} p(z|w). [sent-52, score-0.517]
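
As an illustration of the averaging step above (a sketch, not the authors' code), the following Python function computes p(z|t) from per-word topic distributions; `p_z_given_w`, a lookup obtained from a trained PLSA/LDA model, is a hypothetical name.

```python
from collections import defaultdict

def fragment_topic_distribution(words, p_z_given_w):
    """Return p(z|t) = (1/N) * sum of p(z|w) over the N words w in fragment t."""
    if not words:
        return {}
    p_z_given_t = defaultdict(float)
    n = len(words)
    for w in words:
        # p_z_given_w[w] is assumed to map topic ids z to probabilities p(z|w).
        for z, p in p_z_given_w.get(w, {}).items():
            p_z_given_t[z] += p / n
    return dict(p_z_given_t)
```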

21 2 Selecting Keywords The problem of keyword extraction with maximal topic coverage is formulated as follows. [sent-54, score-0.59]

22 To find a monotone submodular function for keyword extraction, we drew inspiration from recent work on extractive summarization methods (Lin and Bilmes, 2011; Li et al. [sent-61, score-0.531]

23 , 2012), which proposed a square root function for diverse selection of sentences to cover the maximum number of key concepts of a given document. [sent-62, score-0.058]

24 The function rewards diversity by increasing the gain of selecting a sentence including a concept that was not yet covered by a previously selected sentence. [sent-63, score-0.164]

25 This must be adapted for keyword extraction by defining an appropriate reward function. [sent-64, score-0.53]

26 We first introduce r_{S,z}, the topical similarity with respect to topic z of the keyword set S selected from the fragment t, defined as follows: r_{S,z} = Σ_{w∈S} p(z|w) · p(z|t). [sent-65, score-0.728]

27 We then propose the following reward function for each topic, where p(z|t) is the importance of the topic and λ is a parameter between 0 and 1: f : r_{S,z} → p(z|t) · r_{S,z}^λ. [sent-66, score-0.176]

28 This is clearly a submodular function with diminishing returns as rS,z increases. [sent-67, score-0.134]

29 Finally, the keywords S ⊆ t, with |S| ≤ k, are chosen by maximizing the cumulative reward function over all the topics, formulated as follows: R(S) = Σ_{z∈Z} p(z|t) · r_{S,z}^λ. [sent-68, score-0.262]
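
For readability, the selection objective reconstructed across sentences 26-29 can be restated in one place (same definitions, nothing new); since R is monotone submodular, greedy maximization inherits the classical (1 − 1/e) approximation guarantee analyzed by Nemhauser et al. (cited in sentence 99).

```latex
\max_{S \subseteq t,\; |S| \le k} \; R(S) \;=\; \sum_{z \in Z} p(z \mid t)\, r_{S,z}^{\lambda},
\qquad r_{S,z} \;=\; \sum_{w \in S} p(z \mid w)\, p(z \mid t), \qquad 0 < \lambda \le 1 .
```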

30 Since R(S) is submodular, the greedy algorithm for maximizing R(S) is shown as Algorithm 1 on the next page, with r_{{w},z} being similar to r_{S,z} with S = {w}. [sent-69, score-0.107]

31 If λ = 1, the reward function is linear and only measures the topical similarity of words with the main topics of t. [sent-70, score-0.239]

32 However, when 0 < λ < 1, as soon as a word is selected from a topic, other words from the same topic start having diminishing gains. [sent-71, score-0.151]
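
A minimal Python sketch of this greedy selection, reusing the hypothetical `p_z_given_w` / `p_z_given_t` structures from the snippet above (it follows the textual description, not the paper's Algorithm 1 verbatim):

```python
from collections import defaultdict

def diverse_keywords(candidates, p_z_given_w, p_z_given_t, k, lam=0.75):
    """Greedily pick up to k keywords maximizing R(S) = sum_z p(z|t) * r_{S,z}^lam."""
    selected = []
    r = defaultdict(float)  # r[z] accumulates r_{S,z} for the current set S
    for _ in range(k):
        best_word, best_gain = None, 0.0
        for w in candidates:
            if w in selected:
                continue
            # Marginal gain of adding w:
            # sum_z p(z|t) * ((r_z + p(z|w) p(z|t))^lam - r_z^lam)
            gain = 0.0
            for z, pw in p_z_given_w.get(w, {}).items():
                pt = p_z_given_t.get(z, 0.0)
                gain += pt * ((r[z] + pw * pt) ** lam - r[z] ** lam)
            if gain > best_gain:
                best_word, best_gain = w, gain
        if best_word is None:
            break  # no remaining candidate adds anything
        selected.append(best_word)
        for z, pw in p_z_given_w.get(best_word, {}).items():
            r[z] += pw * p_z_given_t.get(z, 0.0)
    return selected
```

With lam=1 the gain reduces to plain topical similarity, i.e., the TS baseline of Section 5; with lam<1 each extra word from an already-covered topic brings a diminishing gain, which is what enforces diversity.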

33 Image (a) represents the conversation fragment better than (b). [sent-74, score-0.333]

34 Image (b) represents the conversation fragment better than (a). [sent-76, score-0.333]

35 Figure 1: Example of a HIT based on an AMI discussion about the impact on sales of some features of remote controls (the conversation transcript is given in the Appendix). [sent-81, score-0.374]

36 The word cloud was generated using WordleTM from the list produced by the diverse keyword extraction method with λ = 0.75 (D(.75)) [sent-82, score-0.597]

37 for image (a) and by a topic similarity method (TS) for image (b). [sent-84, score-0.05]

38 TS over-represents the topic “color” by selecting three words related to it, but misses other topics such as “remote control”, “losing a device” and “buying a device” which are also representative of the fragment. [sent-85, score-0.282]

39 The former corpus contains about 11,000 topic-labeled telephone conversations, on 40 pre-selected topics (one per conversation). [sent-88, score-0.094]

40 We created a 40-topic model using Mallet over two thirds of the Fisher Corpus, given its large number of single-topic documents. [sent-89, score-0.118]
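
As a hypothetical stand-in for this training step (the paper used Mallet; this only illustrates the shape of the pipeline), the same model can be sketched with gensim's LdaModel:

```python
from gensim import corpora, models

def train_topic_model(docs, num_topics=40):
    """docs: list of tokenized documents (lists of word strings)."""
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary
```

The per-word distributions p(z|w) required by the extraction step can then be read off the trained model (gensim exposes them via get_term_topics, for instance).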

41 The remaining data is used to build 11 artificial “conversations” (1-2 minutes long) for testing, by concatenating, 11 times, three fragments about three different topics. [sent-90, score-0.113]

42 The AMI Corpus contains 171 half-hour meetings about remote control design, which include several topics each so they cannot be directly used for learning topic models. [sent-91, score-0.33]

43 We selected for testing 8 conversation fragments of 2-3 minutes each, and trained topic models on a subset of the English Wikipedia (10% or 124,684 articles). [sent-92, score-0.433]

44 Following several previous studies, the number of topics was set to 100 (Boyd-Graber et al. [sent-93, score-0.094]

45 To evaluate the relevance (or representativeness) of extracted keywords with respect to a conversation fragment, we designed comparison tasks. [sent-96, score-0.514]

46 In each task, a fragment is shown, followed by three control questions about its content, and then by two lists of nine keywords each, from two different extraction methods. [sent-97, score-0.488]

47 To improve readability, the keyword lists are presented to the judges using a word cloud representation generated by WordleTM (http://www.wordle. [sent-98, score-0.493]

48 net), in which the words ranked higher are emphasized in the word cloud (see example in Figure 1). [sent-100, score-0.067]

49 The judges had to read the conversation transcript, answer the control questions, and then decide which word cloud better represents the content of the conversation. [sent-101, score-0.351]

50 One of them is exemplified in Figure 1, without the control questions, and the respective conversation transcript is given in the Appendix. [sent-103, score-0.352]

51 After collecting judgments, the comparative relevance values were computed by first applying a qualification control factor to the human judgments, and then averaging results over all judgments (Habibi and Popescu-Belis, 2012). [sent-106, score-0.232]

52 Moreover, to verify the diversity of the keywords [sent-107, score-0.164]

53 Figure 2: Average α-NDCG over the 11 conversations from the Fisher Corpus, for 1 to 15 extracted keywords. [sent-113, score-0.076]

54 , 2008) proposed for information retrieval, which rewards a mixture of relevance and diversity with equal weights when α = .5. [sent-115, score-0.284]

55 We only apply α-NDCG to the three-topic conversation fragments from the Fisher Corpus, relevance of a keyword being set to 1 when it belongs to the fragment corresponding to the topic. [sent-117, score-0.921]

56 A higher value indicates that keywords are more uniformly distributed across the three topics. [sent-118, score-0.229]
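
A sketch of the α-NDCG computation under the setup just described (binary per-topic relevance, α = .5 weighting relevance and diversity equally); `topics_of`, mapping each keyword to the fragments/topics it belongs to, is an illustrative name, not from the paper:

```python
import math

def alpha_dcg(ranked, topics_of, alpha=0.5):
    seen = {}  # how many earlier keywords already covered each topic
    score = 0.0
    for i, kw in enumerate(ranked, start=1):
        gain = sum((1 - alpha) ** seen.get(z, 0) for z in topics_of.get(kw, ()))
        score += gain / math.log2(i + 1)
        for z in topics_of.get(kw, ()):
            seen[z] = seen.get(z, 0) + 1
    return score

def alpha_ndcg(ranked, topics_of, alpha=0.5):
    # Normalize by a greedily built ideal ordering, the usual approximation
    # since computing the exact ideal ranking is itself intractable.
    pool, seen, ideal = set(ranked), {}, []
    while pool:
        best = max(pool, key=lambda kw: sum(
            (1 - alpha) ** seen.get(z, 0) for z in topics_of.get(kw, ())))
        ideal.append(best)
        pool.remove(best)
        for z in topics_of.get(best, ()):
            seen[z] = seen.get(z, 0) + 1
    denom = alpha_dcg(ideal, topics_of, alpha)
    return alpha_dcg(ranked, topics_of, alpha) / denom if denom else 0.0
```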

57 – 5 Experimental Results We have compared several versions of the diverse keyword extraction method, noted D(λ), for λ ∈ {.5, .75}, with two other methods. [sent-119, score-0.596]

58 The first one uses only word frequency (not including stopwords) and is noted WF. [sent-122, score-0.066]

59 We did not use TFIDF because it sets low weights on keywords that are repeated in many fragments but which are nevertheless important to extract. [sent-123, score-0.342]

60 The second method is based on topical similarity (noted TS) but does not specifically enforce diversity (Harwath and Hazen, 2012). [sent-124, score-0.251]

61 In fact TS coincides with D(1), so it is noted TS. [sent-125, score-0.066]

62 First of all, we compared the four methods with respect to the diversity constraint over the con- [sent-130, score-0.164]

63 Table 1: Number of answers for each of the four options of the comparative evaluation task, from ten human judges. [sent-132, score-0.046]

64 Table 2: Comparative relevance scores of keyword extraction methods based on human judgments. [sent-146, score-0.555]

65 catenated fragments of the Fisher Corpus, by using α-NDCG to measure how evenly the extracted keywords were distributed across the three topics. [sent-147, score-0.342]

66 Figure 2 shows results averaged over the 11 conversations for various sizes of the keyword set (1–15). [sent-148, score-0.392]

67 The values for TS are quite low, and only increase for a large number of keywords, demonstrating that TS does not cope well with topic diversity, but on the contrary first selects keywords from the dominant topic. [sent-152, score-0.347]

68 The values for WF are more uniform as it does not consider topics at all. [sent-153, score-0.094]

69 To measure the overall representativeness of keywords, we performed binary comparisons between the outputs of each method, using crowdsourcing, over 11 fragments from the Fisher Corpus and 8 fragments from AMI. [sent-154, score-0.267]

70 AMT workers compared two lists of nine keywords each, with four options: X more representative or relevant than Y, or vice-versa, or both relevant, or both irrelevant. [sent-156, score-0.343]

71 Table 1 shows the judgments collected when comparing the output of D(. [sent-157, score-0.055]

72 Workers disagreed for the first two HITs, but then found that the keywords extracted by D(. [sent-159, score-0.229]

73 The consolidated relevance (Habibi and Popescu-Belis, 2012) is 78% for D(. [sent-161, score-0.05]

74 The averaged relevance values for all comparisons needed to rank the four methods are shown in Table 2 separately for the Fisher and AMI Corpora. [sent-164, score-0.083]

75 Although the exact differences vary, the human judgments over the two corpora both indicate the following ranking: D(. [sent-165, score-0.055]

76 75, and with this value, our diversity-aware method extracts more representative keyword sets than TS and WF. [sent-169, score-0.462]

77 The differences between methods are larger for the Fisher Corpus, due to the artificial fragments that concatenate three topics, but they are still visible on the natural fragments of the AMI Corpus. [sent-170, score-0.226]

78 The weaker results of D(.5) are found to be due, upon inspection, to the low relevance of keywords. [sent-172, score-0.083]

79 6 Conclusion The diverse keyword extraction method with λ = .75 [sent-178, score-0.53]

80 provides the keyword sets that are judged most representative of the conversation fragments (two conversational datasets) by a large number of human judges recruited via AMT, and has the highest α-NDCG value. [sent-179, score-0.883]

81 Therefore, enforcing both relevance and diversity brings an effective improvement to keyword extraction. [sent-180, score-0.606]

82 In the future, we will use keywords to retrieve documents from a repository and recommend them to conversation participants by formulating topically-separate queries. [sent-184, score-0.431]

83 Appendix: Conversation transcript of AMI ES2005a meeting (00:00:5-00:01:52) The following transcript of a four-party conversation (speakers noted A through D) was submitted to our keyword extraction method and a baseline one, generating respectively the two word clouds shown in Figure 1. [sent-185, score-0.818]

84 A: I've— So the only— I only usually used the— the ones that come with the television, and they're uh— D: Yeah. [sent-186, score-0.19]

85 remote controls with the— they're fairly basic. [sent-188, score-0.467]

86 D: Yeah, I was thinking that as well. I think the— the only ones that I've seen that you buy are the sort of one-for-all type things where they're— yeah. [sent-191, score-0.395]

87 D: Yeah, yeah. Uh, 'cause I mean, what, uh, twenty-five Euros, that's about, I dunno, fifteen Pounds or so? [sent-199, score-0.507]

88 D: And that's quite a lot for a remote control. [sent-201, score-0.488]

89 C: Mm. Um, well, my first thoughts would be most remote controls are grey or black. As you said, they come with the TV, so it's normally just your basic grey-black remote control functions, so maybe we could think about colour? [sent-203, score-1.055]

90 Um, and as you say we need to have some kind of gimmick, so um I thought maybe something like if you lose it and you can whistle, you know those things? [sent-205, score-0.37]

91 C: Because we always lose our remote control. [sent-214, score-0.508]

92 B: Uh yeah, uh, being as a Marketing Expert I will like to say, like, before deciding the cost of this remote control or any other things, we must see the market potential for this product, like what is the competition in the market? [sent-215, score-1.724]

93 What are the available prices of the other remote controls in the prices? [sent-216, score-0.726]

94 What speciality other remote controls are having, and how complicated it is to use these remote controls as compared to other remote controls available in the market. [sent-217, score-1.708]

95 Using crowdsourcing to compare document recommendation strategies for conversations. [sent-263, score-0.069]

96 Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech. [sent-268, score-0.035]

97 Unsupervised approaches for automatic keyword extraction using meeting transcripts. [sent-291, score-0.472]

98 Keyword extraction from a single document using word co-occurrence statistical information. [sent-307, score-0.08]

99 An analysis of approximations for maximizing submodular set functions. [sent-326, score-0.134]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('keyword', 0.392), ('yeah', 0.31), ('keywords', 0.229), ('remot', 0.225), ('conversation', 0.202), ('ami', 0.201), ('cont', 0.175), ('diversity', 0.164), ('fisher', 0.161), ('ike', 0.155), ('fragment', 0.131), ('uh', 0.123), ('ts', 0.122), ('topic', 0.118), ('fragments', 0.113), ('transcript', 0.102), ('submodular', 0.101), ('habibi', 0.1), ('topics', 0.094), ('othe', 0.088), ('rol', 0.088), ('topical', 0.087), ('thi', 0.084), ('relevance', 0.083), ('extraction', 0.08), ('conversations', 0.076), ('harwath', 0.075), ('okay', 0.075), ('remote', 0.07), ('representative', 0.07), ('cloud', 0.067), ('ly', 0.067), ('ro', 0.067), ('maryam', 0.066), ('rue', 0.066), ('noted', 0.066), ('salton', 0.061), ('hits', 0.061), ('market', 0.06), ('reward', 0.058), ('diverse', 0.058), ('keyphrase', 0.055), ('judgments', 0.055), ('things', 0.052), ('plsa', 0.051), ('fore', 0.051), ('avai', 0.05), ('evance', 0.05), ('hazen', 0.05), ('idiap', 0.05), ('marconi', 0.05), ('martigny', 0.05), ('omething', 0.05), ('playe', 0.05), ('pri', 0.05), ('wenyi', 0.05), ('wordletm', 0.05), ('yabin', 0.05), ('image', 0.05), ('control', 0.048), ('comparative', 0.046), ('amt', 0.046), ('wf', 0.046), ('liu', 0.045), ('um', 0.045), ('idi', 0.044), ('nemhauser', 0.044), ('workers', 0.044), ('tfidf', 0.043), ('mallet', 0.041), ('mihalcea', 0.041), ('zhiyuan', 0.041), ('cau', 0.041), ('representativeness', 0.041), ('csomai', 0.041), ('cieri', 0.041), ('matsuo', 0.038), ('ye', 0.038), ('summarization', 0.038), ('recruited', 0.037), ('hoffman', 0.037), ('rithm', 0.037), ('tarau', 0.037), ('rewards', 0.037), ('crowdsourcing', 0.036), ('thesaurus', 0.036), ('ay', 0.035), ('switzerland', 0.035), ('andrei', 0.035), ('maosong', 0.035), ('conversational', 0.035), ('judges', 0.034), ('maximizing', 0.033), ('playing', 0.033), ('gordon', 0.033), ('een', 0.033), ('maybe', 0.033), ('recommendation', 0.033), ('diminishing', 0.033), ('mm', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 126 acl-2013-Diverse Keyword Extraction from Conversations

Author: Maryam Habibi ; Andrei Popescu-Belis

Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.

2 0.19076806 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

Author: Sanjika Hewavitharana ; Dennis Mehay ; Sankaranarayanan Ananthakrishnan ; Prem Natarajan

Abstract: We describe a translation model adaptation approach for conversational spoken language translation (CSLT), which encourages the use of contextually appropriate translation options from relevant training conversations. Our approach employs a monolingual LDA topic model to derive a similarity measure between the test conversation and the set of training conversations, which is used to bias translation choices towards the current context. A significant novelty of our adaptation technique is its incremental nature; we continuously update the topic distribution on the evolving test conversation as new utterances become available. Thus, our approach is well-suited to the causal constraint of spoken conversations. On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST. Interestingly, the incremental approach outperforms a non-incremental oracle that has up-front knowledge of the whole conversation.

3 0.12179676 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot

Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. More, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval perfor- mances tend to be better when using topics with higher semantic coherence.

4 0.10559049 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent top- ics own better adaptability and stability performance.

5 0.10330193 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Abstract: Kabir Yadav Emergency Medicine Department The George Washington University Washington, DC, USA kyadav@ gwu . edu Hyeong-Ah Choi Computer Science Department The George Washington University Washington, DC, USA hcho i gwu . edu @ such as recommending the need for a certain medical test while avoiding intrusive tests or medical Electronic health records (EHRs) contain important clinical information about pa- tients. Some of these data are in the form of free text and require preprocessing to be able to used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. And, finally an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.

6 0.093600087 121 acl-2013-Discovering User Interactions in Ideological Discussions

7 0.086015649 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

8 0.084904477 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

9 0.081910782 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization

10 0.078979775 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

11 0.078579292 290 acl-2013-Question Analysis for Polish Question Answering

12 0.07644663 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

13 0.07262294 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

14 0.069156758 333 acl-2013-Summarization Through Submodularity and Dispersion

15 0.063292436 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

16 0.061887383 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

17 0.060290515 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

18 0.059840735 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

19 0.056656063 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

20 0.054246679 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, 0.076), (2, 0.037), (3, -0.062), (4, 0.059), (5, -0.037), (6, 0.095), (7, -0.052), (8, -0.13), (9, -0.051), (10, 0.015), (11, 0.058), (12, 0.035), (13, 0.116), (14, 0.027), (15, -0.015), (16, -0.004), (17, 0.007), (18, -0.017), (19, -0.02), (20, -0.023), (21, 0.001), (22, -0.017), (23, -0.028), (24, -0.002), (25, -0.013), (26, -0.037), (27, -0.073), (28, -0.009), (29, -0.004), (30, 0.008), (31, 0.024), (32, 0.006), (33, -0.018), (34, 0.017), (35, -0.034), (36, -0.002), (37, 0.021), (38, 0.091), (39, -0.026), (40, 0.011), (41, 0.04), (42, 0.008), (43, 0.044), (44, 0.007), (45, 0.027), (46, -0.082), (47, -0.03), (48, 0.008), (49, 0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94027692 126 acl-2013-Diverse Keyword Extraction from Conversations

Author: Maryam Habibi ; Andrei Popescu-Belis

Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.

2 0.83435929 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot

Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. More, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval perfor- mances tend to be better when using topics with higher semantic coherence.

3 0.82173902 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Abstract: Kabir Yadav Emergency Medicine Department The George Washington University Washington, DC, USA kyadav@ gwu . edu Hyeong-Ah Choi Computer Science Department The George Washington University Washington, DC, USA hcho i gwu . edu @ such as recommending the need for a certain medical test while avoiding intrusive tests or medical Electronic health records (EHRs) contain important clinical information about pa- tients. Some of these data are in the form of free text and require preprocessing to be able to used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. And, finally an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.

4 0.77675605 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization

Author: Jiwei Li ; Sujian Li

Abstract: Timeline summarization aims at generating concise summaries and giving readers a faster and better access to understand the evolution of news. It is a new challenge which combines salience ranking problem with novelty detection. Previous researches in this field seldom explore the evolutionary pattern of topics such as birth, splitting, merging, developing and death. In this paper, we develop a novel model called Evolutionary Hierarchical Dirichlet Process(EHDP) to capture the topic evolution pattern in time- line summarization. In EHDP, time varying information is formulated as a series of HDPs by considering time-dependent information. Experiments on 6 different datasets which contain 3 156 documents demonstrates the good performance of our system with regard to ROUGE scores.

5 0.76035953 54 acl-2013-Are School-of-thought Words Characterizable?

Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge

Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-ofthought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above. 1

6 0.74151981 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

7 0.70901966 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

8 0.70474595 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

9 0.6973598 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

10 0.69431865 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

11 0.65258956 257 acl-2013-Natural Language Models for Predicting Programming Comments

12 0.65114802 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

13 0.59452969 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

14 0.58446568 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

15 0.5833106 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection

16 0.56286234 121 acl-2013-Discovering User Interactions in Ideological Discussions

17 0.54900599 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

18 0.54529202 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

19 0.53801358 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

20 0.51452321 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (6, 0.034), (11, 0.056), (15, 0.01), (19, 0.19), (24, 0.06), (26, 0.035), (28, 0.02), (35, 0.088), (42, 0.028), (46, 0.045), (48, 0.044), (61, 0.012), (64, 0.015), (70, 0.052), (72, 0.011), (80, 0.011), (88, 0.031), (90, 0.071), (95, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84653097 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision

Author: Michael Speriosu ; Jason Baldridge

Abstract: Toponym resolvers identify the specific locations referred to by ambiguous placenames in text. Most resolvers are based on heuristics using spatial relationships between multiple toponyms in a document, or metadata such as population. This paper shows that text-driven disambiguation for toponyms is far more effective. We exploit document-level geotags to indirectly generate training instances for text classifiers for toponym resolution, and show that textual cues can be straightforwardly integrated with other commonly used ones. Results are given for both 19th century texts pertaining to the American Civil War and 20th century newswire articles.

same-paper 2 0.80836314 126 acl-2013-Diverse Keyword Extraction from Conversations

Author: Maryam Habibi ; Andrei Popescu-Belis

Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.

3 0.72559434 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval

Author: Xuchen Yao ; Benjamin Van Durme ; Peter Clark

Abstract: Information Retrieval (IR) and Answer Extraction are often designed as isolated or loosely connected components in Question Answering (QA), with repeated overengineering on IR, and not necessarily performance gain for QA. We propose to tightly integrate them by coupling automatically learned features for answer extraction to a shallow-structured IR model. Our method is very quick to implement, and significantly improves IR for QA (measured in Mean Average Precision and Mean Reciprocal Rank) by 10%-20% against an uncoupled retrieval baseline in both document and passage retrieval, which further leads to a downstream 20% improvement in QA F1.

4 0.71298975 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations

Author: Aliaksei Severyn ; Massimo Nicosia ; Alessandro Moschitti

Abstract: Measuring semantic textual similarity (STS) is at the cornerstone of many NLP applications. Different from the majority of approaches, where a large number of pairwise similarity features are used to represent a text pair, our model features the following: (i) it directly encodes input texts into relational syntactic structures; (ii) relies on tree kernels to handle feature engineering automatically; (iii) combines both structural and feature vector representations in a single scoring model, i.e., in Support Vector Regression (SVR); and (iv) delivers significant improvement over the best STS systems.

5 0.70958376 4 acl-2013-A Context Free TAG Variant

Author: Ben Swanson ; Elif Yamangil ; Eugene Charniak ; Stuart Shieber

Abstract: We propose a new variant of TreeAdjoining Grammar that allows adjunction of full wrapping trees but still bears only context-free expressivity. We provide a transformation to context-free form, and a further reduction in probabilistic model size through factorization and pooling of parameters. This collapsed context-free form is used to implement efficient gram- mar estimation and parsing algorithms. We perform parsing experiments the Penn Treebank and draw comparisons to TreeSubstitution Grammars and between different variations in probabilistic model design. Examination of the most probable derivations reveals examples of the linguistically relevant structure that our variant makes possible.

6 0.683074 317 acl-2013-Sentence Level Dialect Identification in Arabic

7 0.63231516 176 acl-2013-Grounded Unsupervised Semantic Parsing

8 0.62937868 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

9 0.61099678 139 acl-2013-Entity Linking for Tweets

10 0.60821462 250 acl-2013-Models of Translation Competitions

11 0.60525793 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

12 0.60384738 172 acl-2013-Graph-based Local Coherence Modeling

13 0.60362816 107 acl-2013-Deceptive Answer Prediction with User Preference Graph

14 0.60349971 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

15 0.60322249 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

16 0.60260767 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

17 0.60215783 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

18 0.59999454 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

19 0.59875888 360 acl-2013-Translating Italian connectives into Italian Sign Language

20 0.59760976 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval