emnlp emnlp2012 emnlp2012-96 emnlp2012-96-reference knowledge-graph by maker-knowledge-mining

96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation


Source: pdf

Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze

Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to referto persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.


reference text

Nicholas Andrews and Jason Eisner. 2011. Transformation process priors. In NIPS 2011 Workshop on Bayesian Nonparametrics: Hope or Hype?, Sierra Nevada, Spain, December. Extended abstract (3 pages). A. Bagga and B. Baldwin. 1998. Algorithms for scoring coreference chains. In LREC. Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proc. of NAACL-HLT, pages 16–23, Stroudsburg, PA, USA. C. H. Bennett, M. Li, , and B. Ma. 2003. Chain letters and evolutionary histories. Scientific American, 288(3):76– 81, June. More mathematical version available at http : / /www . cs .uwaterl oo . ca / ~ml i chain .html. / Shane Bergsma and Benjamin Van Durme. 2011. Learning bilingual lexicons using the visual similarity of labeled web images. In Proc. of IJCAI, pages 1764–1769, Barcelona, Spain. Mikhail Bilenko and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 39–48, New York, NY, USA. ACM. Alexandre Bouchard-Côté, Percy Liang, Thomas Griffiths, and Dan Klein. 2008. A probabilistic approach to language change. In Proc. of NIPS, pages 169–176. S. Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proc. of EMNLP. Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, and Andrew McCallum. 2007. Canonicalization of database records using adaptive similarity measures. In Proc. of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 201–209. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1): 1–38. Z. Dias, A. Rocha, and S. Goldenstein. 2012. Image phylogeny by minimal spanning trees. IEEE Trans. on Information Forensics and Security, 7(2):774–788, April. Markus Dreyer and Jason Eisner. 2011. Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In Proc. of EMNLP, pages 616–627. Supplementary material (9 pages) also available. Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latentvariable modeling of string transductions with finite-state methods. In Proc. of EMNLP, pages 1080–1089, Honolulu, Hawaii, October. Association for Computational Linguistics. Jacob Eisenstein, Tae Yano, William Cohen, Noah Smith, and Eric Xing. 2011. Structured databases of named entities from Bayesian nonparametrics. In Proc. ofthe Firstworkshop on Unsupervised Learning in NLP, pages 2–12, Edinburgh, Scotland, July. Association for Computational Linguistics. Jason Eisner. 2002. Transformational priors over grammars. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July. 354 Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press. Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proc. of ACL-08: HLT, pages 771–779. David Hall and Dan Klein. 2010. Finding cognates using phylogenies. In Association for Computational Linguistics (ACL). Rob Hall, Charles Sutton, and Andrew McCallum. 2008. Unsupervised deduplication using cross-field dependencies. In Proc. of the ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, KDD ’08, pages 310–317. Md. Enamul. Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. 2005. Malware phylogeny generation using permutations of code. Journal in Computer Virology, 1(1 2): 13–23. Alexandre Klementiev and Dan Roth. 2006. Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. In Proc. of COLING-ACL, pages 817– 824. K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24:599–612. Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrixtree theorem. In Proc. of EMNLP-CoNLL, pages 141–150. Jose Oncina and Marc Sebban. 2006. Using learned conditional distributions as edit distance. In Proc. of the 2006 Joint IAPR international Conference on Structural, Syntactic, and Statistical Pattern Recognition, SSPR’06/SPR’06, pages 403–41 1. Kristen Parton, Kathleen R. McKeown, James Allan, and Enrique Henestroza. 2008. Simultaneous multilingual search for translingual information retrieval. In Proceeding of the ACM conference on Information and Knowledge Management, CIKM ’08, pages 719–728. Eric Sven Ristad and Peter N. Yianilos. 1996. Learning string edit distance. Technical Report CS-TR-532-96, Princeton University, Department of Computer Science. Eric Sven Ristad and Peter N. Yianilos. 1998. Learning string edit distance. IEEE Transactions on Pattern Recognition and Machine Intelligence, 20(5):522–532, May. Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2011. An algorithm for unsupervised transliteration mining with an application to word alignment. In Proc. of ACL, pages 430– 439. Charles Schafer and David Yarowsky. 2002. Inducing translation lexicons via diverse similarity measures and bridge languages. In Proc. of CONLL, pages 146–152. Charles Schafer. 2006a. Novel probabilistic finite-state transducers for cognate and transliteration modeling. In 7th Biennial Conference of the Association for Machine Translation in the Americas (AMTA). Charles Schafer. 2006b. Translation Discovery Using Diverse Smilarity Measures. Ph.D. thesis, Johns Hopkins University. David A. Smith and Noah A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proc. of EMNLPCoNLL, pages 132–140. R. T. Smythe sive trees. and H. M. Mahmoud. 1995. A survey of recurTheory ofProbability andMathematical Statistics, 51(1–27). Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proc. of Benjamin Snyder, Regina ACL, pages 1048–1057. Koichiro Tamura, Daniel Peterson, Nicholas Peterson, Glen Stecher, Masatoshi Nei, and Sudhir Kumar. 2011. Mega5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution, 28(10):273 1– 2739. R E Tarjan. 1977. Finding optimum branchings. Networks, 7(1):25–35. W. Tutte. 1984. Graph Theory. Addison-Wesley. William E. Winkler. 1999. The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau. 355