
116 acl-2010-Finding Cognate Groups Using Phylogenies



Author: David Hall ; Dan Klein

Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.


reference text

Leonard Bloomfield. 1938. Language. Holt, New York.
Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In EMNLP.
Alexandre Bouchard-Côté, Thomas L. Griffiths, and Dan Klein. 2009. Improved reconstruction of protolanguage word forms. In NAACL, pages 65–73.
Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Conference of the Association for Computational Linguistics (ACL).
Hal Daumé III. 2009. Non-parametric Bayesian model areal linguistics. In NAACL.
Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In EMNLP, Singapore, August.
Jason Eisner. 2001. Expectation semirings: Flexible EM for finite-state transducers. In Gertjan van Noord, editor, FSMNLP.
Jason Eisner. 2002. Parameter estimation for probabilistic finite-state transducers. In ACL.
Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates can improve statistical translation models. In NAACL.
Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In NAACL.
Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.
Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In EMNLP.
John B. Lowe and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3):381–417.
Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In NAACL, pages 1–8. Association for Computational Linguistics.
Thomas P. Minka. 2001. Expectation propagation for approximate Bayesian inference. In UAI, pages 362–369.
Mehryar Mohri, Fernando Pereira, and Michael Riley. 1996. Weighted automata in text and speech processing. In ECAI-96 Workshop. John Wiley and Sons.
Mehryar Mohri. 2009. Handbook of Weighted Automata, chapter Weighted Automata Algorithms. Springer.
Andrea Mulloni. 2007. Automatic prediction of cognate orthography using support vector machines. In ACL, pages 25–30.
John Nerbonne. 2010. Measuring the diffusion of linguistic change. Philosophical Transactions of the Royal Society B: Biological Sciences.
Michael P. Oakes. 2000. Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Quantitative Linguistics, 7(3):233–243.
OED. 1989. “day, n.”. In The Oxford English Dictionary online. Oxford University Press.
John Ohala. 1993. Historical linguistics: Problems and perspectives, chapter The phonetics of sound change, pages 237–238. Longman.
Jose Oncina and Marc Sebban. 2006. Learning stochastic edit distance: Application in handwritten character recognition. Pattern Recognition, 39(9).
Don Ringe, Tandy Warnow, and Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society, 100(1):59–129.
Alan S.C. Ross. 1950. Philological probability problems. Journal of the Royal Statistical Society Series B.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2000. Inducing multilingual text analysis tools via robust projection across aligned corpora. In NAACL.

Footnote 9: Morphological noise and transcription errors contribute to the absolute error rate for this data set.