369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Source: pdf

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.

reference text

Taylor Berg-Kirkpatrick and Dan Klein. 2010. Phylogenetic grammar induction. In Proceedings of the ACL, pages 1288–1297. Association for Computational Linguistics.

Alexandre Bouchard-Côté, David Hall, Thomas L Griffiths, and Dan Klein. 2013. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229.

Christos Christodoulopoulos, Sharon Goldwater, and Mark Steedman. 2011. A Bayesian mixture model for part-of-speech induction using multiple features. In Proceedings of EMNLP, pages 638–647. Association for Computational Linguistics.

Shay B Cohen and Noah A Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of NAACL, pages 74–82. Association for Computational Linguistics.

Shay B Cohen, Dipanjan Das, and Noah A Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of EMNLP, pages 50–61. Association for Computational Linguistics.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721–741.

Young-Bum Kim and Benjamin Snyder. 2012. Universal grapheme-to-phoneme prediction over Latin alphabets. In Proceedings of EMNLP, pages 332–343, Jeju Island, South Korea, July. Association for Computational Linguistics.

Young-Bum Kim, João V Graça, and Benjamin Snyder. 2011. Universal morphological analysis using structured nearest neighbor prediction. In Proceedings of EMNLP, pages 322–332. Association for Computational Linguistics.

Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of COLING/ACL, pages 499–506. Association for Computational Linguistics.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2010. Simple type-level unsupervised POS tagging. In Proceedings of EMNLP, pages 853–861. Association for Computational Linguistics.

Percy Liang, Michael I Jordan, and Dan Klein. 2010. Type-based MCMC. In Proceedings of NAACL, pages 573–581. Association for Computational Linguistics.

David JC MacKay. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research, 36(1):341–385.

Radford M Neal. 2003. Slice sampling. Annals of Statistics, 31:705–741.

Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proceedings of the ACL, pages 1048–1057. Association for Computational Linguistics.