66 acl-2011-Chinese sentence segmentation as comma classification

Author: Nianwen Xue ; Yaqin Yang

Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
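The gap between the two figures in the abstract comes from how the metrics are defined: accuracy counts every comma, while F1 only scores the class of commas that signal sentence boundaries. A minimal sketch of that relationship follows. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function names are likewise assumptions.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all commas labeled correctly, whichever class they belong to."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class only."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (NOT from the paper): the positive class is
# "comma signals a sentence boundary", and it is the minority class.
counts = {"tp": 30, "fp": 10, "fn": 10, "tn": 150}

acc = accuracy(**counts)
f1 = f1_score(counts["tp"], counts["fp"], counts["fn"])
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")  # accuracy = 0.90, F1 = 0.75
```

Because true negatives raise accuracy but never enter the F1 computation, a classifier can label most commas correctly and still score noticeably lower on the boundary class when that class is the minority.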

reference text

E. Black, S. Abney, D. Flickinger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 306–311.

Yuqing Guo, Haifeng Wang, and Josef Van Genabith. 2010. A Linguistically Inspired Statistical Model for Chinese Punctuation Generation. ACM Transactions on Asian Language Processing, 9(2).

Meixun Jin, Mi-Young Kim, Dong-Il Kim, and Jong-Hyeok Lee. 2004. Segmentation of Chinese Long Sentences Using Commas. In Proceedings of the SIGHAN Workshop on Chinese Language Processing.

Xing Li, Chengqing Zong, and Rile Hu. 2005. A Hierarchical Parsing Approach with Punctuation Processing for Long Chinese Sentences. In Proceedings of the Second International Joint Conference on Natural Language Processing: Companion Volume including Posters/Demos and Tutorial Abstracts.

Wei Lu and Hwee Tou Ng. 2010. Better Punctuation Prediction with Dynamic Conditional Random Fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts.

M. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Slav Petrov and Dan Klein. 2007. Improved Inference for Unlexicalized Parsing. In Proceedings of HLT-NAACL.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP), Washington, D.C.

Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering, 11(2):207–238.