
16 acl-2010-A Statistical Model for Lost Language Decipherment

Source: pdf

Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight

Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and high-level morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.

reference text

C. E. Antoniak. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2:1152–1174, November.
Alexandre Bouchard, Percy Liang, Thomas Griffiths, and Dan Klein. 2007. A probabilistic approach to diachronic phonology. In Proceedings of EMNLP, pages 887–896.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1).
Jesus-Luis Cunchillos, Juan-Pablo Vita, and Jose-Ángel Zamora. 2002. Ugaritic data bank. CD-ROM.
Gregorio del Olmo Lete and Joaquín Sanmartín. 2004. A Dictionary of the Ugaritic Language in the Alphabetic Tradition. Number 67 in Handbook of Oriental Studies. Section 1: The Near and Middle East. Brill.
Pascale Fung and Kathleen McKeown. 1997. Finding terminology translations from non-parallel corpora. In Proceedings of the Annual Workshop on Very Large Corpora, pages 192–202.
S. Geman and D. Geman. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:609–628.
Alan Groves and Kirk Lowery, editors. 2006. The Westminster Hebrew Bible Morphology Database. Westminster Hebrew Institute, Philadelphia, PA, USA.
Jacques B. M. Guy. 1994. An algorithm for identifying cognates in bilingual wordlists and its applicability to machine translation. Journal of Quantitative Linguistics, 1(1):35–42.
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the ACL/HLT, pages 771–779.
Robert Hetzron, editor. 1997. The Semitic Languages. Routledge.
H. Ishwaran and J. S. Rao. 2005. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773.
Kevin Knight and Richard Sproat. 2009. Writing systems, transliteration and decipherment. NAACL Tutorial.
K. Knight and K. Yamada. 1999. A computational approach to deciphering unknown scripts. In ACL Workshop on Unsupervised Learning in Natural Language Processing.
Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. 2006. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL, pages 499–506.
Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pages 9–16.
Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In Proceedings of NAACL, pages 1–8.
Grzegorz Kondrak. 2009. Identification of cognates and recurrent sound correspondences in word lists. Traitement Automatique des Langues, 50(2):201–235.
John B. Lowe and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics, 20(3):381–417.
Reinhard Rapp. 1999. Automatic identification of word translations from unrelated English and German corpora. In Proceedings of the ACL, pages 519–526.
Andrew Robinson. 2002. Lost Languages: The Enigma of the World's Undeciphered Scripts. McGraw-Hill.
William M. Schniedewind and Joel H. Hunt. 2007. A Primer on Ugaritic: Language, Culture and Literature. Cambridge University Press.
Mark S. Smith, editor. 1955. Untold Stories: The Bible and Ugaritic Studies in the Twentieth Century. Hendrickson Publishers.
Benjamin Snyder and Regina Barzilay. 2008. Crosslingual propagation for morphological analysis. In Proceedings of the AAAI, pages 848–854.
Wilfred Watson and Nicolas Wyatt, editors. 1999. Handbook of Ugaritic Studies. Brill.
David Yarowsky, Grace Ngai, and Richard Wicentowski. 2000. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT, pages 161–168.