
124 emnlp-2013-Leveraging Lexical Cohesion and Disruption for Topic Segmentation

Source: pdf

Author: Anca-Roxana Simon; Guillaume Gravier; Pascale Sébillot

Abstract: Topic segmentation classically relies on one of two criteria, either finding areas with coherent vocabulary use or detecting discontinuities. In this paper, we propose a segmentation criterion combining both lexical cohesion and disruption, enabling a trade-off between the two. We provide the mathematical formulation of the criterion and an efficient graph-based decoding algorithm for topic segmentation. Experimental results on standard textual data sets and on a more challenging corpus of automatically transcribed broadcast news shows demonstrate the benefit of such a combination. Gains were observed in all conditions, with segments of either regular or varying length and abrupt or smooth topic shifts. Long segments benefit more than short segments. However, the algorithm has proven robust on automatic transcripts with short segments and limited vocabulary reoccurrences.

reference text

Doug Beeferman, Adam Berger, and John Lafferty. 1997. Text segmentation using exponential models. In 2nd Conference on Empirical Methods in Natural Language Processing, pages 35–46.

Gillian Brown and George Yule. 1983. Discourse analysis. Cambridge University Press.

Freddy Y. Y. Choi. 2000. Advances in domain independent linear text segmentation. In 1st International Conference of the North American Chapter of the Association for Computational Linguistics, pages 26–33.

Vincent Claveau and Sébastien Lefèvre. 2011. Topic segmentation of TV-streams by mathematical morphology and vectorization. In 12th International Conference of the International Speech Communication Association, pages 1105–1108.

Manolis Delakis, Guillaume Gravier, and Patrick Gros. 2008. Audiovisual integration with segment models for tennis video parsing. Computer Vision and Image Understanding, 111(2):142–154.

Jacob Eisenstein and Regina Barzilay. 2008. Bayesian unsupervised topic segmentation. In Conference on Empirical Methods in Natural Language Processing, pages 334–343.

Olivier Ferret, Brigitte Grau, and Nicolas Masson. 1998. Thematic segmentation of texts: Two methods for two kinds of texts. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 392–396.

Jean-Luc Gauvain, Lori Lamel, and Gilles Adda. 2002. The LIMSI broadcast news transcription system. Speech Communication, 37(1–2):89–108.

Barbara J. Grosz and Candace L. Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175–204.

Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. 1995. Centering: a framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–225, June.

Camille Guinaudeau, Guillaume Gravier, and Pascale Sébillot. 2012. Enhancing lexical cohesion measure with confidence measures, semantic relations and language model interpolation for multimedia spoken content topic segmentation. Computer Speech and Language, 26(2):90–104.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Nicolas Hernandez and Brigitte Grau. 2002. Analyse thématique du discours : segmentation, structuration, description et représentation. In 5e colloque international sur le document électronique, pages 277–285.

Stéphane Huet, Guillaume Gravier, and Pascale Sébillot. 2010. Morpho-syntactic post-processing of n-best lists for improved French automatic speech recognition. Computer Speech and Language, 24(4):663–684.

Xiang Ji and Hongyuan Zha. 2003. Domain-independent text segmentation using anisotropic diffusion and dynamic programming. In 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 322–329.

Diane J. Litman and Rebecca J. Passonneau. 1995. Combining multiple knowledge sources for discourse segmentation. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 108–115.

Igor Malioutov and Regina Barzilay. 2006. Minimum cut model for spoken lecture segmentation. In 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 25–32.

Hemant Misra, François Yvon, Joemon M. Jose, and Olivier Cappe. 2009. Text segmentation via topic modeling: an analytical study. In Proc. ACM conference on Information and knowledge management, pages 1553–1556.

Marie-Francine Moens and Rik De Busser. 2001. Generic topic segmentation of document texts. In 24th International Conference on Research and Development in Information Retrieval, pages 418–419.

John Niekrasz and Johanna D. Moore. 2010. Unbiased discourse segmentation evaluation. In Spoken Language Technology, pages 43–48.

Mari Ostendorf, Vassilios V. Digalakis, and Owen A. Kimball. 1996. From HMM’s to segment models: a unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5):360–378.

Lev Pevzner and Marti A. Hearst. 2002. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28:19–36.

Jeffrey C. Reynar. 1994. An automatic method of finding topic boundaries. In 32nd Annual Meeting on Association for Computational Linguistics, pages 331–333.

Anca Simon, Guillaume Gravier, and Pascale Sébillot. 2013. Un modèle segmental probabiliste combinant cohésion lexicale et rupture lexicale pour la segmentation thématique. In 20e conférence Traitement Automatique des Langues Naturelles, pages 202–214.

Masao Utiyama and Hitoshi Isahara. 2001. A statistical model for domain-independent text segmentation. In 39th Annual Meeting on the Association for Computational Linguistics, pages 499–506.

Kenneth H. Walker, Dallas W. Hall, and Willis J. Hurst. 1990. Clinical Methods: The History, Physical, and Laboratory Examinations. Butterworths.