
351 acl-2013-Topic Modeling Based Classification of Clinical Reports

Source: pdf

Author: Efsun Sarioglu ; Kabir Yadav ; Hyeong-Ah Choi

Affiliations: Kabir Yadav, Emergency Medicine Department, The George Washington University, Washington, DC, USA (kyadav@gwu.edu); Hyeong-Ah Choi, Computer Science Department, The George Washington University, Washington, DC, USA (hchoi@gwu.edu)

Abstract: Electronic health records (EHRs) contain important clinical information about patients. Some of these data are in the form of free text and require preprocessing before they can be used in automated systems. Efficient and effective use of these data could be vital to the speed and quality of health care. As a case study, we analyzed the classification of CT imaging reports into binary categories. In addition to regular text classification, we utilized topic modeling of the entire dataset in various ways. Topic modeling of the corpora provides interpretable themes that exist in these reports. Representing reports according to their topic distributions is more compact than a bag-of-words representation and can be processed faster than raw text in subsequent automated processes. A binary topic model was also built as an unsupervised classification approach, under the assumption that each topic corresponds to a class. Finally, an aggregate topic classifier was built in which reports are classified based on a single discriminative topic that is determined from the training dataset. Our proposed topic-based classifier system is shown to be competitive with existing text classification techniques and provides a more efficient and interpretable representation.
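The idea of classifying reports from a single discriminative topic can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that per-report topic distributions have already been produced by some topic model, and it picks the topic by the largest gap between class-wise mean weights, which is only one plausible selection criterion. The function names (`pick_discriminative_topic`, `classify`) and the toy numbers are hypothetical.

```python
from statistics import mean


def pick_discriminative_topic(topic_dists, labels):
    """Choose the topic whose mean weight differs most between the two classes.

    topic_dists: list of per-report topic distributions (lists of floats).
    labels: list of 0/1 class labels, one per report.
    Returns (topic_index, threshold, positive_is_high).
    Assumption: "largest gap between class means" is an illustrative
    criterion, not necessarily the one used in the paper.
    """
    n_topics = len(topic_dists[0])
    best = None
    for k in range(n_topics):
        pos = mean(d[k] for d, y in zip(topic_dists, labels) if y == 1)
        neg = mean(d[k] for d, y in zip(topic_dists, labels) if y == 0)
        gap = abs(pos - neg)
        if best is None or gap > best[0]:
            # Threshold at the midpoint between the two class means.
            best = (gap, k, (pos + neg) / 2, pos > neg)
    _, k, threshold, positive_is_high = best
    return k, threshold, positive_is_high


def classify(dist, k, threshold, positive_is_high):
    """Label a report using only its weight on the chosen topic k."""
    is_high = dist[k] > threshold
    return int(is_high == positive_is_high)


# Toy training set: four reports over three topics (made-up numbers).
train = [
    [0.80, 0.15, 0.05],
    [0.70, 0.10, 0.20],
    [0.10, 0.30, 0.60],
    [0.20, 0.20, 0.60],
]
labels = [1, 1, 0, 0]

k, threshold, positive_is_high = pick_discriminative_topic(train, labels)
print("chosen topic:", k)
print("prediction:", classify([0.75, 0.05, 0.20], k, threshold, positive_is_high))
```

Because each report is reduced to a single number at prediction time, a rule of this kind is easy to inspect, which matches the interpretability argument made in the abstract.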

reference text

Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee-Whye Teh. 2009. On smoothing and inference for topic models. In UAI.
Somnath Banerjee. 2008. Improving text classification accuracy using topic modeling over an additional corpus. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 867–868.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res., 3:993–1022.
Enhong Chen, Yanggang Lin, Hui Xiong, Qiming Luo, and Haiping Ma. 2011. Exploiting probabilistic topic models to improve text categorization under class imbalance. Inf. Process. Manage., 47(2):202–214.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235.
Thomas L. Griffiths, Mark Steyvers, David M. Blei, and Joshua B. Tenenbaum. 2004. Integrating Topics and Syntax. In NIPS, pages 537–544.
Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18.
Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In UAI.
Thomas Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn., 42(1-2):177–196.
Yang Huang, Henry J Lowe, Dan Klein, and Russell J Cucina. 2005. Improved identification of noun phrases in clinical radiology reports using a high-performance statistical natural language parser augmented with the UMLS specialist lexicon. J Am Med Inform Assoc, 12(3):275–285.
Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
John C. Platt. 1998. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.
Efsun Sarioglu, Kabir Yadav, and Hyeong-Ah Choi. 2012. Clinical Report Classification Using Natural Language Processing and Topic Modeling. 11th International Conference on Machine Learning and Applications (ICMLA), pages 204–209.
Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47.
Wongkot Sriurai. 2011. Improving Text Categorization by Using a Topic Model. Advanced Computing: An International Journal (ACIJ), 2(6).
Kabir Yadav, Ethan Cowan, Jason S Haukoos, Zachary Ashwell, Vincent Nguyen, Paul Gennis, and Stephen P Wall. 2012. Derivation of a clinical risk score for traumatic orbital fracture. J Trauma Acute Care Surg, 73(5):1313–1318.
Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 42–49.
Zhiwei Zhang, Xuan-Hieu Phan, and Susumu Horiguchi. 2008. An Efficient Feature Selection Using Hidden Topic in Text Categorization. In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications - Workshops, AINAW ’08, pages 1223–1228.