351 acl-2013-Topic Modeling Based Classification of Clinical Reports


Source: pdf

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Abstract: Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. Finally, an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic-based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Topic Modeling Based Classification of Clinical Reports Efsun Sarioglu Computer Science Department The George Washington University Washington, DC, USA efsun@gwu. [sent-1, score-0.083]

2 edu Abstract Kabir Yadav Emergency Medicine Department The George Washington University Washington, DC, USA kyadav@gwu. [sent-2, score-0.115]

3 edu Hyeong-Ah Choi Computer Science Department The George Washington University Washington, DC, USA hchoi@gwu. [sent-3, score-0.115]

4 edu Electronic health records (EHRs) contain important clinical information about patients. [sent-4, score-0.551]

5 Some of these data are in the form of free text and require preprocessing to be used in automated systems. [sent-5, score-0.14]

6 As a case study, we analyzed classification of CT imaging reports into binary categories. [sent-7, score-0.489]

7 In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. [sent-8, score-0.718]

8 Topic modeling of the corpora provides interpretable themes that exist in these reports. [sent-9, score-0.203]

9 Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. [sent-10, score-1.1]

10 A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. [sent-11, score-1.155]

11 Finally, an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. [sent-12, score-1.287]

12 Our proposed topic-based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation. [sent-13, score-0.741]

13 1 Introduction Large amounts of medical data are now stored as electronic health records (EHRs). [sent-14, score-0.185]

14 Some of these data are in the form of free text and they need to be processed and coded for better utilization in automatic or semi-automatic systems. [sent-15, score-0.131]

15 One possible utilization is to support clinical decision-making, such as recommending the need for a certain medical test while avoiding intrusive tests or medical costs. [sent-16, score-0.304]

16 This type of automated analysis of patient reports can help medical professionals make clinical decisions much faster with more confidence by providing predicted outcomes. [sent-17, score-0.773]

17 In this study, we developed several topic modeling based classification systems for clinical reports. [sent-18, score-0.929]

18 Topic modeling is an unsupervised technique that can automatically identify themes from a given set of documents and find topic distributions of each document. [sent-19, score-0.741]

19 Representing reports according to their topic distributions is more compact and can be processed faster than raw text in subsequent automated processing. [sent-20, score-1.047]

20 , 2005) and nouns, compared to other parts of speech, tend to specialize into topics (Griffiths et al. [sent-22, score-0.144]

21 Therefore, topic model output of patient reports could contain very useful clinical information. [sent-24, score-0.971]

22 2 Background This study utilized prospective patient data previously collected for a traumatic orbital fracture project (Yadav et al. [sent-25, score-0.398]

23 Staff radiologists dictated each CT report and the outcome of acute orbital fracture was extracted by a trained data abstractor. [sent-27, score-0.288]

24 Among the 3,705 reports, 3,242 had negative outcome while 463 had positive. [sent-28, score-0.087]

25 A random subset of 507 CT reports were double-coded, and inter-rater analysis revealed excellent agreement between the data abstractor and study physician, with Cohen’s kappa of 0. [sent-29, score-0.228]

26 1 Bag-of-Words (BoW) Representation Text data need to be converted to a suitable format for automated processing. [sent-32, score-0.068]

27 One way of doing this is bag-of-words (BoW) representation where each document becomes a vector of its words/tokens. [sent-33, score-0.107]

28 Student Research Workshop, pages 67–73, © 2013 Association for Computational Linguistics. The entries in this matrix could be binary, stating the existence or absence of a word in a document, or weighted, such as the number of times a word occurs in a document. [sent-36, score-0.142]
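The two weighting schemes mentioned for the document-term matrix can be illustrated with a minimal sketch. The whitespace tokenizer, the function names `build_vocabulary` and `bow_vector`, and the toy report snippets are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of the distinct tokens across all documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def bow_vector(document, vocabulary, binary=False):
    """Map a document to a vector aligned with `vocabulary`.

    binary=True  -> 1 if the word is present, else 0
    binary=False -> number of times the word occurs
    """
    counts = Counter(document.lower().split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Toy, made-up snippets (hypothetical; not from the study's data).
docs = ["no acute fracture seen", "acute orbital fracture fracture"]
vocab = build_vocabulary(docs)
count_matrix = [bow_vector(d, vocab) for d in docs]
binary_matrix = [bow_vector(d, vocab, binary=True) for d in docs]
```

Each row of `count_matrix` or `binary_matrix` is one document; each column is one vocabulary term, so the row length grows with the vocabulary size.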

29 2 Topic Modeling Topic modeling is an unsupervised learning algorithm that can automatically discover themes of a document collection. [sent-38, score-0.242]

30 PLSA is considered a probabilistic version of LSA where an unobserved class variable zk ∈ {z1, . [sent-46, score-0.238]

31 PLSA solves the polysemy problem; however, it is not considered a fully generative model of documents and it is known to overfit (Blei et al. [sent-51, score-0.144]

32 , 2003), defines a topic as a distribution over a fixed vocabulary, where each document can exhibit topics in different proportions. [sent-55, score-0.526]

33 For each word in the document: (a) Randomly choose a topic from the distribution over topics. [sent-59, score-0.472]

34 The probability of generating the word $w_j$ from document $d_i$ can be calculated as below: $P(w_j \mid d_i; \theta, \phi) = \sum_{k=1}^{K} P(w_j \mid z_k; \phi_z)\, P(z_k \mid d_i; \theta_d)$, where $\theta$ is sampled from a Dirichlet distribution for each document $d_i$ and $\phi$ is sampled from a Dirichlet distribution for each topic $z_k$. [sent-61, score-0.771]
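As a small worked illustration of this mixture, the sketch below sums the per-topic word probabilities weighted by the document's topic proportions. The function name `word_probability` and all numeric values are hypothetical; in practice the two distributions would come from a trained topic model.

```python
def word_probability(word_index, doc_topic_probs, topic_word_probs):
    """Compute P(w_j | d_i) = sum over k of P(w_j | z_k) * P(z_k | d_i)."""
    return sum(
        topic_word_probs[k][word_index] * doc_topic_probs[k]
        for k in range(len(doc_topic_probs))
    )

# Hypothetical values for K = 2 topics and a 3-word vocabulary.
theta_d = [0.7, 0.3]          # P(z_k | d_i) for one document
phi = [
    [0.5, 0.4, 0.1],          # P(w | z_1)
    [0.1, 0.2, 0.7],          # P(w | z_2)
]

p_w0 = word_probability(0, theta_d, phi)   # 0.5 * 0.7 + 0.1 * 0.3
total = sum(word_probability(j, theta_d, phi) for j in range(3))
```

Because each row of `phi` and the vector `theta_d` sum to one, the word probabilities for the document also sum to one.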

35 , 2009) can be used to train a topic model based on LDA. [sent-63, score-0.431]

36 LDA performs better than PLSA for small datasets since it avoids overfitting and it supports polysemy (Blei et al. [sent-64, score-0.162]

37 3 Text Classification Text classification is a supervised learning algorithm where documents’ categories are learned from a pre-labeled set of documents. [sent-68, score-0.17]

38 Support vector machines (SVM) is a popular classification algorithm that attempts to find a decision boundary between classes that is the farthest from any point in the training dataset. [sent-69, score-0.17]

39 , N where $x_t \in \mathbb{R}^M$ and $y_t \in \{1, -1\}$, SVM tries to find a separating hyperplane with the maximum margin (Platt, 1998). [sent-73, score-0.083]

40 To evaluate the classification performance, precision, recall, and F-score measures are typically used (Manning et al. [sent-78, score-0.226]
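For reference, these three measures can be computed as in the sketch below. It reports scores for the positive class only, with labels encoded as 1 and -1; that encoding, the function name, and the toy labels are assumptions, and the paper may have reported averaged figures instead.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy labels (hypothetical): 3 positive and 5 negative reports.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here two of the three predicted positives are correct and two of the three true positives are found, so all three scores equal 2/3.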

41 3 Related Work For text classification, topic modeling techniques have been utilized in various ways. [sent-80, score-0.639]

42 , 2008), it is used as a keyword selection mechanism by selecting the top words from topics based on their entropy. [sent-82, score-0.144]

43 In our study, we removed the most frequent and infrequent words to have a manageable vocabulary size but we did not utilize topic model output for this purpose. [sent-83, score-0.469]

44 , 2012) and (Sriurai, 2011) compare BoW representation to topic model representation for classification using varying and fixed number of topics respectively. [sent-85, score-0.851]

45 This is similar to our topic vector classification results with SVM; however, (Sriurai, 2011) uses a fixed number of topics, whereas we evaluated different numbers of topics since typically this is not known in advance. [sent-86, score-0.745]

46 In (Banerjee, 2008), topics are used as additional features to BoW features for the purpose of classification. [sent-87, score-0.144]

47 In our approaches, we used the topic vector representation as an alternative to BoW rather than as an addition to it. [sent-88, score-0.484]

48 , 2011) developed a resampling approach based on topic modeling when the class distributions are not balanced. [sent-91, score-0.725]

49 In this study, resampling approaches are also utilized to compare skewed dataset results to datasets with equal class distributions; however, we used randomized resampling approaches for this purpose. [sent-92, score-0.671]

50 4 Experiments Figure 1 shows the three approaches that use the topic model of the clinical reports to classify them; each is explained below. [sent-93, score-0.911]

51 1 Preprocessing During preprocessing, all protected health information was removed to meet Institutional Review Board requirements. [sent-95, score-0.103]

52 Medical record numbers from each report were replaced by observation numbers, which are sequence numbers that are automatically assigned to each report. [sent-96, score-0.067]

53 2 Topic Modeling LDA was chosen to generate the topic models of clinical reports due to its being a generative probabilistic system for documents and its robustness to overfitting. [sent-100, score-1.053]

54 The Stanford Topic Modeling Toolbox (TMT)¹, an open-source tool that provides ways to train and infer topic models for text data, was used to conduct the experiments. [sent-101, score-0.469]

55 3 Topic Vectors Topic modeling of reports produces a topic distribution for each report which can be used to represent them as topic vectors. [sent-103, score-1.207]

56 This is an alternative representation to BoW where terms are replaced [footnote 1: http://nlp. ... 4/] [sent-104, score-0.088]

57 with topics, and the entries for each report show the probability of a specific topic for that report. [sent-107, score-0.609]

58 This representation is more compact than BoW as the vocabulary for a text collection usually has thousands of entries whereas a topic model is typically built with a maximum of hundreds of topics. [sent-108, score-0.633]

59 4 Supervised Classification SVM was chosen as the classification algorithm as it was shown that it performs well in text classification tasks (Joachims, 1998; Yang and Liu, 1999) and it is robust to overfitting (Sebastiani, 2002). [sent-110, score-0.478]

60 Weka, a collection of machine learning algorithms for data mining tasks written in Java, was used to conduct classification (Hall et al. [sent-111, score-0.17]

61 Accordingly, the raw text of the reports and topic vectors are compiled into individual files with their corresponding outcomes in ARFF and then classified with SVM. [sent-114, score-0.851]
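A minimal sketch of how topic vectors and their outcomes might be serialized to ARFF follows. The relation name, the attribute names (`topic_0`, `outcome`), the label strings, and the four-decimal formatting are all assumptions chosen for illustration; the paper does not describe the exact file layout.

```python
def topic_vectors_to_arff(topic_vectors, labels, relation="reports"):
    """Render topic vectors plus a nominal class attribute as ARFF text."""
    num_topics = len(topic_vectors[0])
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per topic probability (names are hypothetical).
    for k in range(num_topics):
        lines.append(f"@attribute topic_{k} numeric")
    # Nominal class attribute listing the distinct outcome labels.
    class_values = ",".join(sorted(set(labels)))
    lines.append(f"@attribute outcome {{{class_values}}}")
    lines += ["", "@data"]
    for vector, label in zip(topic_vectors, labels):
        values = ",".join(f"{v:.4f}" for v in vector)
        lines.append(f"{values},{label}")
    return "\n".join(lines) + "\n"

# Two made-up reports described by two topic probabilities each.
arff_text = topic_vectors_to_arff(
    [[0.9, 0.1], [0.2, 0.8]], ["negative", "positive"]
)
```

The resulting text can be written to a `.arff` file and loaded by a tool that reads that format.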

62 5 Aggregate Topic Classifier (ATC) With this approach, a representative topic vector for each class was composed by averaging their corresponding topic distributions in the training dataset. [sent-116, score-1.063]

63 A discriminative topic was then chosen so that the difference between positive and negative representative vectors is maximum. [sent-117, score-0.615]

64 The reports in the test datasets were then classified by analyzing the values of this topic and a threshold was chosen to determine the predicted class. [sent-118, score-0.899]

65 This threshold could be chosen automatically based on class distributions if the dataset is skewed or cross validation methods can be applied to pick a threshold that gives the best classification performance in a validation dataset. [sent-119, score-0.652]

66 This approach is called Aggregate Topic Classifier (ATC) since training labels were utilized in an aggregate fashion using an average function and not individually. [sent-120, score-0.168]
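The ATC steps described above (average the topic vectors per class, pick the topic with the largest gap, threshold that topic's value) can be sketched as follows. Several details are assumptions: the gap is taken as positive mean minus negative mean, the threshold is a nearest-rank quantile of the chosen topic's training values, and the names `train_atc` and `predict_atc` are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def train_atc(topic_vectors, labels, quantile=0.85, positive=1):
    """Return (discriminative topic index, threshold) from training data."""
    pos = [v for v, y in zip(topic_vectors, labels) if y == positive]
    neg = [v for v, y in zip(topic_vectors, labels) if y != positive]
    pos_mean, neg_mean = mean_vector(pos), mean_vector(neg)
    # Assumption: choose the topic where the positive average most exceeds
    # the negative average, so larger values point to the positive class.
    gaps = [p - n for p, n in zip(pos_mean, neg_mean)]
    topic = max(range(len(gaps)), key=lambda k: gaps[k])
    # Assumption: nearest-rank quantile of the chosen topic's values.
    values = sorted(v[topic] for v in topic_vectors)
    threshold = values[min(len(values) - 1, int(quantile * len(values)))]
    return topic, threshold

def predict_atc(vector, topic, threshold, positive=1, negative=-1):
    """Classify one report from the value of the discriminative topic."""
    return positive if vector[topic] >= threshold else negative

# Toy training set (hypothetical): three negative reports, one positive.
train_vectors = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.2, 0.8]]
train_labels = [-1, -1, -1, 1]
topic, threshold = train_atc(train_vectors, train_labels, quantile=0.75)
predictions = [predict_atc(v, topic, threshold) for v in [[0.1, 0.9], [0.6, 0.4]]]
```

Only the class averages enter the topic selection, which matches the description of labels being used in aggregate rather than per report.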

67 6 Binary Topic Classification (BTC) Topic modeling of the data with two topics was also analyzed as an unsupervised classification technique. [sent-122, score-0.46]

68 In this approach, binary topics were assumed to correspond to the binary classes. [sent-123, score-0.252]

69 After topic model was learned, the topic with the higher probability was assigned as the predicted class for each document. [sent-124, score-1.002]

70 If the dataset is skewed, the topic-to-class correspondence was determined by checking the predicted class proportions. [sent-125, score-0.742]

71 For datasets with equal class distributions, each of the possible assignments was checked and the one with the better classification performance was chosen. (Figure 1: System overview.) [sent-126, score-0.381]
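A minimal sketch of the binary-topic idea for the skewed case is given below: each report is assigned to its higher-probability topic, and the topic that wins on fewer reports is taken to be the minority (positive) class. That mapping rule is an assumption about how "checking predicted class proportions" might work, and the function names and toy vectors are hypothetical.

```python
def dominant_topic(vector):
    """Index (0 or 1) of the topic with the higher probability."""
    return 0 if vector[0] >= vector[1] else 1

def infer_positive_topic_skewed(two_topic_vectors):
    """Assume the topic that dominates fewer reports is the positive class."""
    wins = [0, 0]
    for vector in two_topic_vectors:
        wins[dominant_topic(vector)] += 1
    return 0 if wins[0] < wins[1] else 1

def btc_predict(two_topic_vectors, positive_topic, positive=1, negative=-1):
    """Label each report by whether its dominant topic is the positive one."""
    return [
        positive if dominant_topic(v) == positive_topic else negative
        for v in two_topic_vectors
    ]

# Toy two-topic distributions (hypothetical) for four reports.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.3, 0.7]]
positive_topic = infer_positive_topic_skewed(vectors)
btc_labels = btc_predict(vectors, positive_topic)
```

For a balanced dataset the proportion check is uninformative, which is why the text falls back to trying both assignments.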

72 These training and test datasets were randomized and stratified to make sure each subset is a good representation of the original dataset. [sent-129, score-0.169]

73 For ATC, we evaluated different quantile points: 75, 80, 82, 85, 87 as threshold and picked the one that gives the best classification performance. [sent-130, score-0.205]

74 These were chosen as candidates based on the positive class ratio of original dataset of 12%. [sent-131, score-0.221]

75 Best classification performance was achieved with 15 topics for ATC and 100 topics for SVM. [sent-132, score-0.458]

76 As the number of topics increased, it became harder to find a single, highly discriminative topic, so ATC’s performance degraded, whereas SVM’s performance improved because more topics gave it more information. [sent-134, score-1.011]

77 However, using topic vectors to represent reports still provided great dimension reduction as raw text of the reports had 1,296 terms and made the subsequent classification with SVM faster. [sent-135, score-1.251]

78 We analyzed the performance of classification using binary topics with three datasets: original, undersampled, and oversampled. [sent-138, score-0.405]

79 In the undersampled dataset, excess negative cases were removed and the resulting dataset consisted of 463 documents for each class. [sent-139, score-0.315]

80 For the oversampled dataset, positive cases were oversampled while keeping the total number of documents the same. [sent-140, score-0.272]

81 This approach produced a dataset consisting of 1,895 positive and 1,810 negative cases. [sent-141, score-0.113]

82 With the original dataset, we could see the performance on a highly skewed real dataset and with the resampled datasets, we could see the performance on data with equal class distributions. [sent-142, score-0.341]

83 Balanced datasets performed better compared to skewed original dataset using this approach. [sent-146, score-0.278]

84 This is also due to the fact that skewed dataset had a higher baseline compared to the undersampled and oversampled datasets. [sent-147, score-0.42]

85 In Table 3, the best performance of each technique for the original dataset is summarized. (Figure 2: Precision. Figure 3: Recall.) [sent-148, score-0.079]

86 datasets with equal class distribution, for the original skewed dataset, it got worse results than the baseline. [sent-152, score-0.445]

87 ATC, on the other hand, got comparable results with SVM using both topic vectors and raw text. [sent-153, score-0.661]

88 In addition, ATC used fewer topics than SVM for its best performance. [sent-154, score-0.144]

89 Table 3: Overall classification performance [sent-155, score-0.17]

90 6 Conclusion In this study, topic modeling of clinical reports is utilized in different ways with the end goal of classification. [sent-158, score-1.081]

91 Firstly, bag-of-words representation is replaced with topic vectors, which provide good dimensionality reduction while still achieving comparable classification performance. [sent-159, score-0.746]

92 In aggregate topic classifier, representative topic vectors for positive and negative classes are composed and used as a guide to classify the reports in the test dataset. [sent-160, score-1.298]

93 This approach was competitive with classification with SVM using raw text and topic vectors. [sent-161, score-0.703]

94 In addition, it required few topics to get the best performance. [sent-162, score-0.144]

95 And finally, in the unsupervised setting, binary topic models are built for each dataset with the assumption that each topic corresponds to a class. [sent-163, score-1.064]

96 For datasets with equal class distribution, this approach showed improvement over baseline approaches. [sent-164, score-0.211]

97 Improving text classification accuracy using topic modeling over an additional corpus. [sent-171, score-0.715]

98 Exploiting probabilistic topic models to improve text categorization under class imbalance. [sent-186, score-0.661]

99 Improved identification of noun phrases in clinical radiology reports using a highperformance statistical natural language parser augmented with the UMLS specialist lexicon. [sent-230, score-0.48]

100 Derivation of a clinical risk score for traumatic orbital fracture. [sent-261, score-0.424]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('topic', 0.431), ('atc', 0.303), ('clinical', 0.252), ('reports', 0.228), ('classification', 0.17), ('bow', 0.159), ('yadav', 0.144), ('topics', 0.144), ('skewed', 0.125), ('svm', 0.119), ('zk', 0.11), ('got', 0.109), ('kabir', 0.108), ('orbital', 0.108), ('oversampled', 0.108), ('sarioglu', 0.108), ('undersampled', 0.108), ('plsa', 0.097), ('utilized', 0.094), ('class', 0.092), ('lsa', 0.085), ('washington', 0.085), ('gwu', 0.083), ('medical', 0.082), ('blei', 0.08), ('dataset', 0.079), ('themes', 0.079), ('modeling', 0.076), ('aggregate', 0.074), ('datasets', 0.074), ('arff', 0.072), ('btc', 0.072), ('efsun', 0.072), ('fracture', 0.072), ('sriurai', 0.072), ('tmt', 0.072), ('automated', 0.068), ('distributions', 0.066), ('health', 0.065), ('lda', 0.065), ('hofmann', 0.064), ('categorization', 0.064), ('raw', 0.064), ('ehrs', 0.064), ('traumatic', 0.064), ('asuncion', 0.064), ('deerwester', 0.063), ('patient', 0.06), ('resampling', 0.06), ('thomas', 0.057), ('vectors', 0.057), ('documents', 0.056), ('wj', 0.056), ('acute', 0.055), ('classifier', 0.054), ('binary', 0.054), ('document', 0.054), ('representation', 0.053), ('outcome', 0.053), ('utilization', 0.052), ('dc', 0.051), ('overfitting', 0.05), ('latent', 0.05), ('chosen', 0.05), ('predicted', 0.048), ('griffiths', 0.048), ('interpretable', 0.048), ('di', 0.047), ('equal', 0.045), ('dirichlet', 0.044), ('representative', 0.043), ('yt', 0.043), ('randomized', 0.042), ('distribution', 0.041), ('compact', 0.041), ('xk', 0.041), ('processed', 0.041), ('xt', 0.04), ('george', 0.039), ('records', 0.038), ('text', 0.038), ('polysemy', 0.038), ('removed', 0.038), ('analyzed', 0.037), ('steyvers', 0.037), ('weka', 0.037), ('ct', 0.037), ('probabilistic', 0.036), ('built', 0.036), ('replaced', 0.035), ('subsequent', 0.035), ('faster', 0.035), ('threshold', 0.035), ('negative', 0.034), ('preprocessing', 0.034), ('entries', 0.034), ('classified', 0.033), ('unsupervised', 0.033), ('edu', 0.032), ('bers', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

2 0.24003029 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot

Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. More, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performances tend to be better when using topics with higher semantic coherence.

3 0.23072645 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics own better adaptability and stability performance.

4 0.19852756 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

Author: Sanjika Hewavitharana ; Dennis Mehay ; Sankaranarayanan Ananthakrishnan ; Prem Natarajan

Abstract: We describe a translation model adaptation approach for conversational spoken language translation (CSLT), which encourages the use of contextually appropriate translation options from relevant training conversations. Our approach employs a monolingual LDA topic model to derive a similarity measure between the test conversation and the set of training conversations, which is used to bias translation choices towards the current context. A significant novelty of our adaptation technique is its incremental nature; we continuously update the topic distribution on the evolving test conversation as new utterances become available. Thus, our approach is well-suited to the causal constraint of spoken conversations. On an English-to-Iraqi CSLT task, the proposed approach gives significant improvements over a baseline system as measured by BLEU, TER, and NIST. Interestingly, the incremental approach outperforms a non-incremental oracle that has up-front knowledge of the whole conversation.

5 0.18095234 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

Author: Xiaoming Lu ; Lei Xie ; Cheung-Chi Leung ; Bin Ma ; Haizhou Li

Abstract: We present an efficient approach for broadcast news story segmentation using a manifold learning algorithm on latent topic distributions. The latent topic distribution estimated by Latent Dirichlet Allocation (LDA) is used to represent each text block. We employ Laplacian Eigenmaps (LE) to project the latent topic distributions into low-dimensional semantic representations while preserving the intrinsic local geometric structure. We evaluate two approaches employing LDA and probabilistic latent semantic analysis (PLSA) distributions respectively. The effects of different amounts of training data and different numbers of latent topics on the two approaches are studied. Experimental results show that our proposed LDA-based approach can outperform the corresponding PLSA-based approach. The proposed approach provides the best performance with the highest F1-measure of 0.7860.

6 0.15412909 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

7 0.15091746 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

8 0.1506404 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

9 0.13998403 121 acl-2013-Discovering User Interactions in Ideological Discussions

10 0.13615207 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

11 0.12932143 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

12 0.12288976 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules

13 0.11690657 257 acl-2013-Natural Language Models for Predicting Programming Comments

14 0.10330193 126 acl-2013-Diverse Keyword Extraction from Conversations

15 0.10111313 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction

16 0.098295279 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

17 0.080542132 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization

18 0.080522373 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection

19 0.079363152 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

20 0.078447267 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.142), (2, 0.04), (3, -0.042), (4, 0.104), (5, -0.089), (6, 0.135), (7, -0.013), (8, -0.215), (9, -0.075), (10, 0.134), (11, 0.1), (12, 0.124), (13, 0.149), (14, 0.007), (15, -0.077), (16, -0.136), (17, 0.114), (18, -0.029), (19, -0.046), (20, -0.039), (21, 0.051), (22, -0.055), (23, 0.004), (24, 0.006), (25, 0.032), (26, -0.051), (27, -0.16), (28, -0.049), (29, -0.04), (30, 0.066), (31, 0.025), (32, 0.037), (33, 0.037), (34, -0.036), (35, 0.017), (36, 0.004), (37, 0.046), (38, 0.091), (39, 0.02), (40, -0.02), (41, 0.009), (42, 0.007), (43, -0.004), (44, -0.052), (45, 0.015), (46, -0.031), (47, -0.029), (48, 0.018), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98528713 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

2 0.87094069 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?

Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot

Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. More, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performances tend to be better when using topics with higher semantic coherence.

3 0.8306098 126 acl-2013-Diverse Keyword Extraction from Conversations

Author: Maryam Habibi ; Andrei Popescu-Belis

Abstract: A new method for keyword extraction from conversations is introduced, which preserves the diversity of topics that are mentioned. Inspired from summarization, the method maximizes the coverage of topics that are recognized automatically in transcripts of conversation fragments. The method is evaluated on excerpts of the Fisher and AMI corpora, using a crowdsourcing platform to elicit comparative relevance judgments. The results demonstrate that the method outperforms two competitive baselines.

4 0.81626689 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

Author: Xiaoming Lu ; Lei Xie ; Cheung-Chi Leung ; Bin Ma ; Haizhou Li

Abstract: We present an efficient approach for broadcast news story segmentation using a manifold learning algorithm on latent topic distributions. The latent topic distribution estimated by Latent Dirichlet Allocation (LDA) is used to represent each text block. We employ Laplacian Eigenmaps (LE) to project the latent topic distributions into low-dimensional semantic representations while preserving the intrinsic local geometric structure. We evaluate two approaches employing LDA and probabilistic latent semantic analysis (PLSA) distributions respectively. The effects of different amounts of training data and different numbers of latent topics on the two approaches are studied. Experimental results show that our proposed LDA-based approach can outperform the corresponding PLSA-based approach. The proposed approach provides the best performance with the highest F1-measure of 0.7860.

5 0.80992919 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

Author: Jun Zhu ; Xun Zheng ; Bo Zhang

Abstract: Supervised topic models with a logistic likelihood have two issues that potentially limit their practical use: 1) response variables are usually over-weighted by document word counts; and 2) existing variational inference methods make strict mean-field assumptions. We address these issues by: 1) introducing a regularization constant to better balance the two parts based on an optimization formulation of Bayesian inference; and 2) developing a simple Gibbs sampling algorithm by introducing auxiliary Polya-Gamma variables and collapsing out Dirichlet variables. Our augment-and-collapse sampling algorithm has analytical forms of each conditional distribution without making any restricting assumptions and can be easily parallelized. Empirical results demonstrate significant improvements on prediction performance and time efficiency.

6 0.78295732 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

7 0.76252788 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

8 0.74511713 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information

9 0.74428248 257 acl-2013-Natural Language Models for Predicting Programming Comments

10 0.73108232 54 acl-2013-Are School-of-thought Words Characterizable?

11 0.69048154 147 acl-2013-Exploiting Topic based Twitter Sentiment for Stock Prediction

12 0.67329824 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

13 0.65510273 142 acl-2013-Evolutionary Hierarchical Dirichlet Process for Timeline Summarization

14 0.62519759 121 acl-2013-Discovering User Interactions in Ideological Discussions

15 0.59965295 220 acl-2013-Learning Latent Personas of Film Characters

16 0.58982271 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords

17 0.58781952 350 acl-2013-TopicSpam: a Topic-Model based approach for spam detection

18 0.57539195 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

19 0.55041748 182 acl-2013-High-quality Training Data Selection using Latent Topics for Graph-based Semi-supervised Learning

20 0.54098332 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.061), (6, 0.03), (11, 0.041), (15, 0.012), (24, 0.065), (26, 0.068), (35, 0.136), (42, 0.032), (48, 0.065), (58, 0.2), (70, 0.057), (88, 0.049), (90, 0.022), (91, 0.014), (95, 0.071)]
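
The listing does not state how each simValue is derived from the topic weights. As a minimal sketch, assuming the score is a cosine similarity between sparse LDA topic vectors (an assumption, not confirmed by this page), the comparison could look like the following. The name `other_topics` and its weights are hypothetical and only illustrate the call.

```python
import math

# Sparse (topicId, topicWeight) pairs copied from the listing above.
paper_topics = [
    (0, 0.061), (6, 0.03), (11, 0.041), (15, 0.012), (24, 0.065),
    (26, 0.068), (35, 0.136), (42, 0.032), (48, 0.065), (58, 0.2),
    (70, 0.057), (88, 0.049), (90, 0.022), (91, 0.014), (95, 0.071),
]

# Hypothetical topic weights for a second paper (illustrative values only).
other_topics = [(35, 0.2), (58, 0.1), (95, 0.05)]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors of (id, weight) pairs.

    Assumption: cosine is one common choice for comparing topic
    distributions; the page may use a different measure.
    """
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty vector is treated as sharing nothing.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity(paper_topics, other_topics))
```

Storing only the non-zero topics keeps the vectors short, which matches the sparse format shown in the listing.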

similar papers list:

simIndex simValue paperId paperTitle

1 0.90481949 1 acl-2013-"Let Everything Turn Well in Your Wife": Generation of Adult Humor Using Lexical Constraints

Author: Alessandro Valitutti ; Hannu Toivonen ; Antoine Doucet ; Jukka M. Toivanen

Abstract: We propose a method for automated generation of adult humor by lexical replacement and present empirical evaluation results of the obtained humor. We propose three types of lexical constraints as building blocks of humorous word substitution: constraints concerning the similarity of sounds or spellings of the original word and the substitute, a constraint requiring the substitute to be a taboo word, and constraints concerning the position and context of the replacement. Empirical evidence from extensive user studies indicates that these constraints can increase the effectiveness of humor generation significantly.

same-paper 2 0.82715183 351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Abstract: Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing to be used in automated systems. Efficient and effective use of this data could be vital to the speed and quality of health care. As a case study, we analyzed classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach with the assumption that each topic corresponds to a class. Finally, an aggregate topic classifier was built where reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.

3 0.80154383 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Generative probabilistic models have been used for content modelling and template induction, and are typically trained on small corpora in the target domain. In contrast, vector space models of distributional semantics are trained on large corpora, but are typically applied to domain-general lexical disambiguation tasks. We introduce Distributional Semantic Hidden Markov Models, a novel variant of a hidden Markov model that integrates these two approaches by incorporating contextualized distributional semantic vectors into a generative model as observed emissions. Experiments in slot induction show that our approach yields improvements in learning coherent entity clusters in a domain. In a subsequent extrinsic evaluation, we show that these improvements are also reflected in multi-document summarization.

4 0.73503411 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing

Author: Guangyou Zhou ; Jun Zhao

Abstract: This paper is concerned with the problem of heterogeneous dependency parsing. In this paper, we present a novel joint inference scheme, which is able to leverage the consensus information between heterogeneous treebanks in the parsing phase. Different from stacked learning methods (Nivre and McDonald, 2008; Martins et al., 2008), which process the dependency parsing in a pipelined way (e.g., a second level uses the first level outputs), in our method, multiple dependency parsing models are coordinated to exchange consensus information. We conduct experiments on Chinese Dependency Treebank (CDT) and Penn Chinese Treebank (CTB); experimental results show that joint inference can bring significant improvements to all state-of-the-art dependency parsers.

5 0.71726823 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

Author: Angeliki Lazaridou ; Ivan Titov ; Caroline Sporleder

Abstract: We propose a joint model for unsupervised induction of sentiment, aspect and discourse information and show that by incorporating a notion of latent discourse relations in the model, we improve the prediction accuracy for aspect and sentiment polarity on the sub-sentential level. We deviate from the traditional view of discourse, as we induce types of discourse relations and associated discourse cues relevant to the considered opinion analysis task; consequently, the induced discourse relations play the role of opinion and aspect shifters. The quantitative analysis that we conducted indicated that the integration of a discourse model increased the prediction accuracy results with respect to the discourse-agnostic approach and the qualitative analysis suggests that the induced representations encode a meaningful discourse structure.

6 0.71015012 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

7 0.70511967 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics

8 0.70449477 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

9 0.70373613 318 acl-2013-Sentiment Relevance

10 0.70367497 172 acl-2013-Graph-based Local Coherence Modeling

11 0.70285547 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval

12 0.70260876 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

13 0.70009649 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

14 0.69794321 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval

15 0.6973291 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

16 0.69681555 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

17 0.69662446 224 acl-2013-Learning to Extract International Relations from Political Context

18 0.69611585 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions

19 0.69539678 175 acl-2013-Grounded Language Learning from Video Described with Sentences

20 0.69528455 238 acl-2013-Measuring semantic content in distributional vectors