
246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

Source: pdf

Author: Minwoo Jeong ; Ivan Titov

Abstract: Documents often have inherently parallel structure: they may consist of a text and commentaries, or an abstract and a body, or parts presenting alternative views on the same problem. Revealing relations between the parts by jointly segmenting and predicting links between the segments would help to visualize such documents and construct friendlier user interfaces. To address this problem, we propose an unsupervised Bayesian model for joint discourse segmentation and alignment. We apply our method to the “English as a second language” podcast dataset, where each episode is composed of two parallel parts: a story and an explanatory lecture. The predicted topical links uncover hidden relations between the stories and the lectures. In this domain, our method achieves competitive results, rivaling those of a previously proposed supervised technique.

reference text

Doug Beeferman, Adam Berger, and John Lafferty. 1999. Statistical models for text segmentation. Computational Linguistics, 34(1–3):177–210.
David M. Blei, Andrew Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
Hal Daumé and Daniel Marcu. 2004. A phrase-based HMM approach to document/abstract alignment. In Proceedings of EMNLP, pages 137–144.
Hal Daumé and Daniel Marcu. 2006. Bayesian query-focused summarization. In Proceedings of ACL, pages 305–312.
Jacob Eisenstein and Regina Barzilay. 2008. Bayesian unsupervised topic segmentation. In Proceedings of EMNLP, pages 334–343.
Thomas S. Ferguson. 1973. A Bayesian analysis of some non-parametric problems. Annals of Statistics, 1:209–230.
Michel Galley, Kathleen R. McKeown, Eric Fosler-Lussier, and Hongyan Jing. 2003. Discourse segmentation of multi-party conversation. In Proceedings of ACL, pages 562–569.
M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman.
Marti Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.
Hongyan Jing. 2002. Using hidden Markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527–543.
Igor Malioutov and Regina Barzilay. 2006. Minimum cut model for spoken lecture segmentation. In Proceedings of ACL, pages 25–32.
Daniel Marcu. 1999. The automatic construction of large-scale corpora for summarization research. In Proceedings of ACM SIGIR, pages 137–144.
Hyungjong Noh, Minwoo Jeong, Sungjin Lee, Jonghoon Lee, and Gary Geunbae Lee. 2010. Script-description pair extraction from text documents of English as second language podcast. In Proceedings of the 2nd International Conference on Computer Supported Education.
Lev Pevzner and Marti Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28(1):19–36.
Jayaram Sethuraman. 1994. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.
Bingjun Sun, Prasenjit Mitra, C. Lee Giles, John Yen, and Hongyuan Zha. 2007. Topic segmentation with shared topic detection and alignment of multiple documents. In Proceedings of ACM SIGIR, pages 199–206.
Masao Utiyama and Hitoshi Isahara. 2001. A statistical model for domain-independent text segmentation. In Proceedings of ACL, pages 491–498.
Kai Yu, Shipeng Yu, and Volker Tresp. 2005. Dirichlet enhanced latent semantic analysis. In Proceedings of AISTATS.