48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Source: pdf

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words that appear more than once inside a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.
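
To make the notion of "string statistics derived from unlabeled data" concrete, here is a minimal sketch of one such statistic, a simplified accessor variety in the spirit of Feng et al. (2004). The abstract does not say which statistics the paper actually uses, so the choice of accessor variety, the function name `accessor_variety`, the `max_len` parameter, and the `<S>` / `</S>` boundary markers are all assumptions made for illustration only.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Simplified accessor variety for every substring of up to max_len characters.

    For a string s, the score is min(L, R), where L is the number of distinct
    characters seen immediately before s and R the number seen immediately
    after it. Sentence boundaries count as one extra distinct neighbour on
    each side (a simplification chosen for this sketch).
    """
    left = defaultdict(set)   # substring -> set of distinct left neighbours
    right = defaultdict(set)  # substring -> set of distinct right neighbours
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                left[s].add(sent[i - 1] if i > 0 else "<S>")
                right[s].add(sent[j] if j < n else "</S>")
    return {s: min(len(left[s]), len(right[s])) for s in left}


# Toy "unlabeled corpus": the string "bc" occurs in three different contexts.
scores = accessor_variety(["abcd", "xbcy", "bc"])
```

A string that appears in many different contexts gets a high score and is more likely to be a word. A character-based segmenter could, for instance, bucket such scores and attach them as extra features to each character position. How the paper itself encodes its statistics is not stated in the abstract.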

reference text

Haodi Feng, Kang Chen, Xiaotie Deng, and Weimin Zheng. 2004. Accessor variety criteria for Chinese word extraction. Comput. Linguist., 30:75–93.

Wenbin Jiang, Liang Huang, and Qun Liu. 2009. Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging – a case study. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 522–530. Association for Computational Linguistics, Suntec, Singapore.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of ACL-08: HLT, pages 595–603. Association for Computational Linguistics, Columbus, Ohio.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML ’01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Zhongguo Li and Maosong Sun. 2009. Punctuation as implicit annotations for Chinese word segmentation. Comput. Linguist., 35:505–512.

Scott Miller, Jethran Guinness, and Alex Zamanian. 2004. Name tagging with word clusters and discriminative training. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Main Proceedings, pages 337–342. Association for Computational Linguistics, Boston, Massachusetts, USA.

Naoaki Okazaki. 2007. CRFsuite: a fast implementation of conditional random fields (CRFs).

Valentin I. Spitkovsky, Daniel Jurafsky, and Hiyan Alshawi. 2010. Profiting from mark-up: Hypertext annotations for guided parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1278–1287. Association for Computational Linguistics, Uppsala, Sweden.

Weiwei Sun. 2010. Word-based and character-based word segmentation models: Comparison and combination. In Coling 2010: Posters, pages 1211–1219. Coling 2010 Organizing Committee, Beijing, China.

Weiwei Sun. 2011. A stacked sub-word model for joint Chinese word segmentation and part-of-speech tagging. In Proceedings of the ACL 2011 Conference. Association for Computational Linguistics, Portland, Oregon, United States.

Xu Sun, Yaozhong Zhang, Takuya Matsuzaki, Yoshimasa Tsuruoka, and Jun’ichi Tsujii. 2009. A discriminative latent variable Chinese segmenter with hybrid word/character information. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 56–64. Association for Computational Linguistics, Boulder, Colorado.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.

Joseph Turian, Lev-Arie Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics, Uppsala, Sweden.

Jia Xu, Jianfeng Gao, Kristina Toutanova, and Hermann Ney. 2008. Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 1017–1024. Coling 2008 Organizing Committee, Manchester, UK.

Nianwen Xue. 2003. Chinese word segmentation as character tagging. In International Journal of Computational Linguistics and Chinese Language Processing.