
Fully Unsupervised Word Segmentation with BVE and MDL (ACL 2011, paper 140)


Authors: Daniel Hewlett; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
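The generate-then-select idea in the abstract can be illustrated with a minimal sketch. It assumes a simple two-part code (bits to spell out the lexicon plus bits to encode the corpus as a sequence of lexicon entries); this is an illustrative choice, not necessarily the description-length formulation used in the paper. The names `description_length` and `select_by_mdl` and the toy candidates are hypothetical.

```python
import math
from collections import Counter


def description_length(segmented_corpus):
    """Two-part description length, in bits, of a segmented corpus.

    `segmented_corpus` is a list of utterances, each a list of word strings.
    Assumption: a simple lexicon-plus-corpus code chosen for illustration,
    not the exact formulation from the paper.
    """
    words = [w for utterance in segmented_corpus for w in utterance]
    word_counts = Counter(words)
    total_words = len(words)

    # Corpus cost: each token is coded with -log2 of its relative frequency.
    corpus_bits = -sum(
        c * math.log2(c / total_words) for c in word_counts.values()
    )

    # Lexicon cost: each word type is spelled out once, character by
    # character, using character frequencies measured within the lexicon.
    char_counts = Counter(ch for w in word_counts for ch in w)
    total_chars = sum(char_counts.values())
    lexicon_bits = -sum(
        c * math.log2(c / total_chars) for c in char_counts.values()
    )

    return corpus_bits + lexicon_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


# Toy candidates for the same four unsegmented utterances.
unsegmented = [["thedog"], ["thecat"], ["adog"], ["acat"]]
word_level = [["the", "dog"], ["the", "cat"], ["a", "dog"], ["a", "cat"]]
char_level = [list("thedog"), list("thecat"), list("adog"), list("acat")]

best = select_by_mdl([unsegmented, word_level, char_level])
print(best)
```

On this toy input the word-level candidate has the smallest description length under the assumed code: leaving utterances unsegmented inflates the lexicon, while splitting into single characters inflates the corpus encoding. In practice the candidates would come from a segmentation algorithm run under different parameter settings rather than being written by hand.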

reference text

Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. 2004. Efficient Unsupervised Recursive Word Segmentation Using Minimum Description Length. In Proceedings of the 20th International Conference on Computational Linguistics, Morristown, NJ, USA. Association for Computational Linguistics.
Nan Bernstein Ratner. 1987. The phonology of parent-child speech, pages 159–174. Erlbaum, Hillsdale, NJ.
Michael R. Brent. 1999. An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning, (34):71–105.
Jimming Cheng and Michael Mitzenmacher. 2005. The Markov Expert for Finding Episodes in Time Series. In Proceedings of the Data Compression Conference, pages 454–454. IEEE.
Paul Cohen and Niall Adams. 2001. An algorithm for segmenting categorical time series into meaningful episodes. In Proceedings of the Fourth Symposium on Intelligent Data Analysis.
Margaret M. Fleck. 2008. Lexicalized phonotactic word segmentation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 130–138, Columbus, Ohio, USA. Association for Computational Linguistics.
John J. Godfrey and Ed Holliman. 1993. Switchboard-1 Transcripts.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2009. A Bayesian Framework for Word Segmentation: Exploring the Effects of Context. Cognition, 112(1):21–54.
Sharon Goldwater. 2007. Nonparametric Bayesian models of lexical acquisition. Ph.D. dissertation, Brown University.
Zellig S. Harris. 1955. From Phoneme to Morpheme. Language, 31(2):190–222.
Daniel Hewlett and Paul Cohen. 2009. Bootstrap Voting Experts. In Proceedings of the Twenty-first International Joint Conference on Artificial Intelligence.
J. E. Hopcroft and J. D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.
Chu-Ren Huang. 2007. Tagged Chinese Gigaword (Catalog LDC2007T03). Linguistic Data Consortium, Philadelphia.
Howard Johnson and Joel Martin. 2003. Unsupervised learning of morphology for English and Inuktitut. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (HLT-NAACL 2003), pages 43–45.
Brian MacWhinney. 2000. The CHILDES project: Tools for analyzing talk. Lawrence Erlbaum Associates, Mahwah, NJ, 3rd edition.
Matthew Miller and Alexander Stoytchev. 2008. Hierarchical Voting Experts: An Unsupervised Algorithm for Hierarchical Sequence Segmentation. In Proceedings of the 7th IEEE International Conference on Development and Learning, pages 186–191.
Jorma Rissanen. 1983. A Universal Prior for Integers and Estimation by Minimum Description Length. The Annals of Statistics, 11(2):416–431.
Kumiko Tanaka-Ishii and Zhihui Jin. 2006. From Phoneme to Morpheme: Another Verification Using a Corpus. In Proceedings of the 21st International Conference on Computer Processing of Oriental Languages, pages 234–244.
Anand Venkataraman. 2001. A procedure for unsupervised lexicon learning. In Proceedings of the Eighteenth International Conference on Machine Learning.
Hua Yu. 2000. Unsupervised Word Induction using MDL Criterion. In Proceedings of the International Symposium of Chinese Spoken Language Processing, Beijing, China.
Valentin Zhikov, Hiroya Takamura, and Manabu Okumura. 2010. An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 832–842, Cambridge, MA. MIT Press.