
Chinese Comma Disambiguation for Discourse Analysis (ACL 2012)

Source: pdf

Authors: Yaqin Yang; Nianwen Xue

Abstract: The Chinese comma signals the boundary of discourse units and also anchors discourse relations between adjacent text spans. In this work, we propose a discourse structure-oriented classification of the comma that can be automatically extracted from the Chinese Treebank based on syntactic patterns. We then experimented with two supervised learning methods that automatically disambiguate the Chinese comma based on this classification. The first method integrates comma classification into parsing, and the second method adopts a “post-processing” approach that extracts features from automatic parses to train a classifier. The experimental results show that the second approach compares favorably against the first approach.

reference text

L Carlson, D Marcu, M E Okurowski. 2002. RST Discourse Treebank. Linguistic Data Consortium 2002.
Caroline Sporleder, Mirella Lapata. 2005. Discourse chunking and its application to sentence compression. In Proceedings of HLT/EMNLP 2005.
Livia Polanyi, Chris Culy, Martin Van Den Berg, Gian Lorenzo Thione and David Ahn. 2004. Sentential structure and discourse parsing. In Proceedings of the ACL 2004 Workshop on Discourse Annotation 2004.
Hen-Hsen Huang and Hsin-Hsi Chen. 2011. Chinese Discourse Relation Recognition. In Proceedings of the 5th International Joint Conference on Natural Language Processing 2011, pages 1442-1446.
Daniel Marcu and Abdessamad Echihabi. 2002. An Unsupervised Approach to Recognizing Discourse Relations. In Proceedings of the ACL, July 6-12, 2002, Philadelphia, PA, USA.
Radu Soricut and Daniel Marcu. 2003. Sentence Level Discourse Parsing using Syntactic and Lexical Information. In Proceedings of the ACL 2003.
Eleni Miltsakaki, Rashmi Prasad, Aravind Joshi and Bonnie Webber. 2004. The Penn Discourse Treebank. In Proceedings of LREC 2004.
Nianwen Xue and Yaqin Yang. 2011. Chinese sentence segmentation as comma classification. In Proceedings of ACL 2011.
Nianwen Xue, Fei Xia, Fu-Dong Chiou and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11(2):207-238.
Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Proceedings of HLT-NAACL 2007.
E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 306-311.
Mann, William C. and Sandra A. Thompson. 1988. Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).
Meixun Jin, Mi-Young Kim, Dong-Il Kim, and Jong-Hyeok Lee. 2004. Segmentation of Chinese Long Sentences Using Commas. In Proceedings of the SIGHAN Workshop on Chinese Language Processing.
Xing Li, Chengqing Zong, and Rile Hu. 2005. A Hierarchical Parsing Approach with Punctuation Processing for Long Chinese Sentences. In Proceedings of the Second International Joint Conference on Natural Language Processing: Companion Volume including Posters/Demos and Tutorial Abstracts.
Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
Church, K., and Hanks, P. 1989. Word Association Norms, Mutual Information and Lexicography. Association for Computational Linguistics, Vancouver, Canada.