99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high-quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.
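
To make the idea of segmenting a noun phrase "according to the statistical behavior of its supposed constituents in text" concrete, here is a minimal sketch. It is not the authors' model or their hybrid type-based sampler. It assumes a unigram Dirichlet process language model (the predictive rule described by Goldwater et al., 2009), treats word counts from text as fixed, and replaces sampling with exhaustive enumeration of candidate segmentations, which is only practical for short phrases. The function names, the base distribution, the parameters `alpha`, `n_chars`, and `p_stop`, and the toy counts are all hypothetical.

```python
from collections import Counter
from itertools import product
from math import exp, log


def base_logprob(word, n_chars=5000, p_stop=0.5):
    """Log P0(word): uniform characters with geometric length (assumed base measure)."""
    k = len(word)
    return k * log(1.0 / n_chars) + (k - 1) * log(1.0 - p_stop) + log(p_stop)


def segmentation_logprob(words, text_counts, alpha=1.0):
    """Log probability of a word sequence under the DP unigram predictive rule.

    Each word w is scored as (n_w + alpha * P0(w)) / (n + alpha), where n_w and n
    include the fixed text counts plus the words already scored in this phrase.
    """
    total = sum(text_counts.values())
    seen = Counter()
    logp = 0.0
    for i, w in enumerate(words):
        n_w = text_counts.get(w, 0) + seen[w]
        denom = total + i + alpha
        if n_w > 0:
            logp += log(n_w + alpha * exp(base_logprob(w))) - log(denom)
        else:
            # Stay in log space so very long unseen words do not underflow to zero.
            logp += log(alpha) + base_logprob(w) - log(denom)
        seen[w] += 1
    return logp


def segmentations(phrase):
    """Yield every way of splitting a non-empty phrase into contiguous words."""
    for bits in product([0, 1], repeat=len(phrase) - 1):
        words, start = [], 0
        for i, boundary in enumerate(bits, start=1):
            if boundary:
                words.append(phrase[start:i])
                start = i
        words.append(phrase[start:])
        yield words


def best_segmentation(phrase, text_counts, alpha=1.0):
    """Return the highest-scoring segmentation by brute force (stand-in for sampling)."""
    return max(
        segmentations(phrase),
        key=lambda ws: segmentation_logprob(ws, text_counts, alpha),
    )


# Toy counts standing in for word frequencies observed in text (made-up numbers).
counts = Counter({"東京": 50, "大学": 40})
print(best_segmentation("東京大学", counts))
```

With these toy counts the phrase is split into its two frequent constituents, because any segmentation containing an unseen word must pay the small base-distribution probability. A sampler such as the one proposed in the paper would explore this space stochastically and update the counts as it goes, rather than enumerating all candidates.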

reference text

Masayuki Asahara and Yuji Matsumoto. 2000. Extended models and tools for high-performance part-of-speech tagger. In Proc. of COLING 2000, pages 21–27.
Masayuki Asahara and Yuji Matsumoto. 2004. Japanese unknown word identification by character-based chunking. In Proc. of COLING 2004, pages 459–465.
Phil Blunsom, Trevor Cohn, Sharon Goldwater, and Mark Johnson. 2009. A note on the implementation of hierarchical Dirichlet processes. In Proc. of ACL-IJCNLP 2009: Short Papers, pages 337–340.
Michael D. Escobar and Mike West. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577–588.
Thomas S. Ferguson. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating between types and tokens by estimating power-law generators. In NIPS 18, pages 459–466.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian framework for word segmentation: Exploring the effects of context. Cognition, 112(1):21–54.
Mark Johnson and Katherine Demuth. 2010. Unsupervised phonemic Chinese word segmentation using adaptor grammars. In Proc. of COLING 2010, pages 528–536.
Mark Johnson. 2008. Using adaptor grammars to identify synergies in the unsupervised acquisition of linguistic structure. In Proc. of ACL 2008, pages 398–406.
Jun’ichi Kazama and Kentaro Torisawa. 2008. Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations. In Proc. of ACL 2008, pages 407–415, June.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proc. of EMNLP 2004, pages 230–237.
Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proc. of The International Workshop on Sharable Natural Language Resources, pages 22–38.
Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2010. Simple type-level unsupervised POS tagging. In Proc. of EMNLP 2010, pages 853–861.
Zhongguo Li and Maosong Sun. 2009. Punctuation as implicit annotations for Chinese word segmentation. Computational Linguistics, 35(4):505–512.
Percy Liang, Michael I. Jordan, and Dan Klein. 2010. Type-based MCMC. In Proc. of NAACL 2010, pages 573–581.
Jin Kiat Low, Hwee Tou Ng, and Wenyuan Guo. 2005. A maximum entropy approach to Chinese word segmentation. In Proc. of the 4th SIGHAN Workshop, pages 161–164.
Daichi Mochihashi, Takeshi Yamada, and Naonori Ueda. 2009. Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. In Proc. of ACL-IJCNLP 2009, pages 100–108.
Yugo Murawaki and Sadao Kurohashi. 2008. Online acquisition of Japanese unknown morphemes using morphological constraints. In Proc. of EMNLP 2008, pages 429–437.
Masaaki Nagata. 1996. Automatic extraction of new words from Japanese texts using generalized forward-backward search. In Proc. of EMNLP 1996, pages 48–59.
Tetsuji Nakagawa and Yuji Matsumoto. 2006. Guessing parts-of-speech of unknown words using global information. In Proc. of COLING-ACL 2006, pages 705–712.
Graham Neubig and Shinsuke Mori. 2010. Word-based partial annotation for efficient corpus construction. In Proc. of LREC 2010.
ThuyLinh Nguyen, Stephan Vogel, and Noah A. Smith. 2010. Nonparametric word segmentation for machine translation. In Proc. of COLING 2010, pages 815–823.
Fuchun Peng, Fangfang Feng, and Andrew McCallum. 2004. Chinese segmentation and new word detection using conditional random fields. In Proc. of COLING 2004, pages 562–568.
Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.
Yuta Tsuboi, Hisashi Kashima, Shinsuke Mori, Hiroki Oda, and Yuji Matsumoto. 2008. Training conditional random fields using incomplete annotations. In Proc. of COLING 2008, pages 897–904.
Kiyotaka Uchimoto, Satoshi Sekine, and Hitoshi Isahara. 2001. The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary. In Proc. of EMNLP 2001, pages 91–99.
Jia Xu, Jianfeng Gao, Kristina Toutanova, and Hermann Ney. 2008. Bayesian semi-supervised Chinese word segmentation for statistical machine translation. In Proc. of COLING 2008, pages 1017–1024.
Nianwen Xue. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.
Toshio Yokoi. 1995. The EDR electronic dictionary. Communications of the ACM, 38(11):42–44.