77 emnlp-2011-Large-Scale Cognate Recovery


Source: pdf

Authors: David Hall; Dan Klein

Abstract: We present a system for the large scale induction of cognate groups. Our model explains the evolution of cognates as a sequence of mutations and innovations along a phylogeny. On the task of identifying cognates from over 21,000 words in 218 different languages from the Oceanic language family, our model achieves a cluster purity score over 91%, while maintaining pairwise recall over 62%.
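The two figures quoted in the abstract, cluster purity and pairwise recall, can be illustrated with a minimal sketch. It assumes the common textbook definitions (purity: the share of words whose gold group is the majority gold group of their predicted cluster; pairwise recall: the share of gold cognate pairs that land in the same predicted cluster), which may differ in detail from the paper's own. The function names and the toy word IDs are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cluster_purity(predicted, gold):
    """Share of words whose gold group is the majority gold group of their predicted cluster."""
    gold_labels_by_cluster = defaultdict(list)
    for word, cluster_id in predicted.items():
        gold_labels_by_cluster[cluster_id].append(gold[word])
    # For each predicted cluster, count the words carrying its most frequent gold label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in gold_labels_by_cluster.values()
    )
    return majority_total / len(predicted)


def pairwise_recall(predicted, gold):
    """Share of gold cognate pairs that are placed in the same predicted cluster."""
    gold_pairs = [
        (a, b) for a, b in combinations(sorted(gold), 2) if gold[a] == gold[b]
    ]
    if not gold_pairs:
        return 1.0  # no gold pairs to recover; convention chosen for this sketch
    recovered = sum(1 for a, b in gold_pairs if predicted[a] == predicted[b])
    return recovered / len(gold_pairs)


# Toy data (hypothetical): five words, two gold cognate groups, two predicted clusters.
gold = {"w1": "A", "w2": "A", "w3": "A", "w4": "B", "w5": "B"}
predicted = {"w1": 0, "w2": 0, "w3": 1, "w4": 1, "w5": 1}

purity = cluster_purity(predicted, gold)   # 4 of 5 words match their cluster's majority
recall = pairwise_recall(predicted, gold)  # 2 of 4 gold pairs stay together
print(purity, recall)
```

In this toy case, splitting one word off gold group A lowers recall sharply while purity stays fairly high, which is the kind of trade-off the abstract's paired figures describe.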


reference text

Shane Bergsma and Greg Kondrak. 2007. Multilingual cognate identification using integer linear programming. In RANLP Workshop on Acquisition and Management of Multilingual Lexicons, Borovets, Bulgaria, September.

Leonard Bloomfield. 1938. Language. Holt, New York.

R. A. Blust. 2009. The Austronesian languages. Australian National University.

Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In EMNLP.

Alexandre Bouchard-Côté, Thomas L. Griffiths, and Dan Klein. 2009. Improved reconstruction of protolanguage word forms. In NAACL, pages 65–73.

L. L. Cavalli-Sforza and A. W. F. Edwards. 1965. Analysis of human evolution. In S. J. Geerts, editor, Genetics Today: Proceedings of XIth International Congress of Genetics, 1963, Vol. 3, pages 923–933.

Hal Daumé III and Lyle Campbell. 2007. A Bayesian model for discovering typological implications. In Conference of the Association for Computational Linguistics (ACL).

Hal Daumé III. 2009. Non-parametric Bayesian areal linguistics. In NAACL.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Markus Dreyer and Jason Eisner. 2009. Graphical models over multiple strings. In EMNLP, Singapore, August.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. 2006. Biological sequence analysis. Eleventh edition.

James S. Farris. 1973. On Comparing the Shapes of Taxonomic Trees. Systematic Zoology, 22(1):50–54, March.

J. Felsenstein. 1973. Maximum likelihood and minimum steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology, 23:240–249.

W. M. Fitch. 1971. Toward defining the course of evolution: minimal change for a specific tree topology. Systematic Zoology, 20:406–416.

S. J. Greenhill, R. Blust, and R. D. Gray. 2008. The Austronesian basic vocabulary database: from bioinformatics to lexomics. Evolutionary Bioinformatics, 4:271–283.

David Hall and Dan Klein. 2010. Finding cognates using phylogenies. In Association for Computational Linguistics (ACL).

Robert M. Keesing and Jonathan Fifi’i. 1969. Kwaio word tabooing in its cultural context. Journal of the Polynesian Society, 78(2):154–177.

Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In NAACL.

Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.

John B. Lowe and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3):381–417.

Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In NAACL.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 1996. Weighted automata in text and speech processing. In ECAI-96 Workshop. John Wiley and Sons.

Andrea Mulloni. 2007. Automatic prediction of cognate orthography using support vector machines. In ACL, pages 25–30.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453.

John Nerbonne. 2010. Measuring the diffusion of linguistic change. Philosophical Transactions of the Royal Society B: Biological Sciences.

J. Nichols. 1992. Linguistic diversity in space and time. University of Chicago Press.

Michael P. Oakes. 2000. Computer estimation of vocabulary in a protolanguage from word lists in four daughter languages. Quantitative Linguistics, 7(3):233–243.

Don Ringe, Tandy Warnow, and Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society, 100(1):59–129.

D. Sankoff and R. J. Cedergren. 1983. Simultaneous comparison of three or more sequences related by a tree, pages 253–263. Addison-Wesley, Reading, MA.

August Schleicher. 1861. A Compendium of the Comparative Grammar of the Indo-European, Sanskrit, Greek and Latin Languages.