
Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets (EMNLP 2012, paper 132)

Source: pdf

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
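
The abstract reports accuracy as an F1-measure over grapheme-phoneme pairs. As a rough illustration only, the sketch below computes precision, recall, and F1 over sets of (grapheme, phoneme) pairs. The function name pair_f1, the set-based scoring, and the toy data are assumptions made for illustration; the abstract does not spell out the exact evaluation protocol.

```python
def pair_f1(predicted, gold):
    """Precision, recall, and F1 over sets of (grapheme, phoneme) pairs.

    Assumption: a predicted pair counts as correct only if the identical
    pair appears in the gold set. This is not taken from the paper.
    """
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data for one language (not from the paper's data set).
gold = {("c", "k"), ("c", "s"), ("a", "a"), ("t", "t")}
predicted = {("c", "k"), ("a", "a"), ("t", "t"), ("t", "d")}

p, r, f = pair_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy case three of the four predicted pairs are in the gold set and three of the four gold pairs are recovered, so precision, recall, and F1 all come out to 0.75.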

References

Taylor Berg-Kirkpatrick and Dan Klein. 2010. Phylogenetic grammar induction. In Proceedings of the ACL, pages 1288–1297, Uppsala, Sweden, July. Association for Computational Linguistics.
P. Blunsom, T. Cohn, C. Dyer, and M. Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 782–790. Association for Computational Linguistics.
David Burkett, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of CoNLL.
G.N. Clements. 2003. Feature economy in sound systems. Phonology, 20(3):287–333.
Shay B. Cohen and Noah A. Smith. 2009. Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction. In Proceedings of the NAACL/HLT.
Ido Dagan, Alon Itai, and Ulrike Schwall. 1991. Two languages are more informative than one. In Proceedings of the ACL, pages 130–137.
P.T. Daniels and W. Bright. 1996. The world's writing systems, volume 198. Oxford University Press, New York, NY.
James Dougherty, Ron Kohavi, and Mehran Sahami. 1995. Supervised and unsupervised discretization of continuous features. In ICML, pages 194–202.
K. Dwyer and G. Kondrak. 2009. Reducing the annotation effort for letter-to-phoneme conversion. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 127–135. Association for Computational Linguistics.
Usama M. Fayyad and Keki B. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the International Joint Conference on Uncertainty in AI, pages 1022–1027.
M. Haspelmath and H.J. Bibiko. 2005. The world atlas of language structures, volume 1. Oxford University Press, USA.
R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Journal of Natural Language Engineering, 11(3):311–325.
S. Jiampojamarn and G. Kondrak. 2010. Letter-phoneme alignment: An exploration. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 780–788. Association for Computational Linguistics.
M.J. Kenstowicz and C.W. Kisseberth. 1979. Generative phonology. Academic Press, San Diego, CA.
Young-Bum Kim, João Graça, and Benjamin Snyder. 2011. Universal morphological analysis using structured nearest neighbor prediction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.
J. Liljencrants and B. Lindblom. 1972. Numerical simulation of vowel quality systems: the role of perceptual contrast. Language, pages 839–862.
D.C. Liu and J. Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528.
K.P. Murphy, Y. Weiss, and M.I. Jordan. 1999. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467–475. Morgan Kaufmann Publishers Inc.
Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: two unsupervised approaches. Journal of Artificial Intelligence Research, 36(1):341–385.
Sebastian Padó and Mirella Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of ACL, pages 1161–1168.
G. Penn and T. Choma. 2006. Quantitative methods for classifying writing systems. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 117–120. Association for Computational Linguistics.
S. Ravi and K. Knight. 2009. Learning phoneme mappings for transliteration without parallel data. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 37–45. Association for Computational Linguistics.
S. Reddy and J. Goldsmith. 2010. An MDL-based approach to extracting subword units for grapheme-to-phoneme conversion. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 713–716. Association for Computational Linguistics.
Philip Resnik and David Yarowsky. 1997. A perspective on word sense disambiguation methods and their evaluation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, pages 79–86.
B. Snyder and R. Barzilay. 2008a. Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL-08: HLT, pages 737–745.
Benjamin Snyder and Regina Barzilay. 2008b. Crosslingual propagation for morphological analysis. In Proceedings of the AAAI, pages 848–854.
Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of EMNLP, pages 1041–1050.
B. Snyder, T. Naseem, J. Eisenstein, and R. Barzilay. 2009a. Adding more languages improves unsupervised multilingual part-of-speech tagging: A Bayesian non-parametric approach. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 83–91. Association for Computational Linguistics.
Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009b. Unsupervised multilingual grammar induction. In Proceedings of the ACL, pages 73–81.
P. Spirtes, C.N. Glymour, and R. Scheines. 2000. Causation, prediction, and search, volume 81. The MIT Press.
R. Sproat, T. Tao, and C.X. Zhai. 2006. Named entity transliteration with comparable corpora. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 73–80. Association for Computational Linguistics.
R.W. Sproat. 2000. A computational theory of writing systems. Cambridge University Press.
David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the NAACL, pages 1–8.
David Yarowsky and Richard Wicentowski. 2000. Minimally supervised morphological analysis by multimodal alignment. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 207–216, Morristown, NJ, USA. Association for Computational Linguistics.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2000. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, pages 161–168.