
370 acl-2013-Unsupervised Transcription of Historical Documents

Source: pdf

Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein

Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open-source OCR system.
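The abstract reports its results as relative reductions in word error rate (WER). As a minimal sketch of what that metric means, the Python below computes WER as word-level edit distance divided by reference length, and then the relative reduction between two systems. The function names and the example figures are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # drop a reference word
                curr[j - 1] + 1,                      # extra hypothesis word
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline's error rate that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical figures, chosen only to illustrate the arithmetic.
example = relative_reduction(baseline_wer=0.40, system_wer=0.30)
```

With these made-up figures the absolute reduction is 10 points while the relative reduction is 25%, which is why the two should not be confused when reading the reported results.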

reference text

Kenning Arlitsch and John Herbert. 2004. Microfilm, paper, and OCR: Issues in newspaper digitization. The Utah digital newspapers program. Microform & Imaging Review.
Taylor Berg-Kirkpatrick and Dan Klein. 2011. Simple effective decipherment via combinatorial optimization. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.
Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Arthur Dempster, Nan Laird, and Donald Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2007. English Gigaword third edition. Linguistic Data Consortium, Catalog Number LDC2007T07.
Tin Kam Ho and George Nagy. 2000. OCR with no shape training. In Proceedings of the 15th International Conference on Pattern Recognition.
Rose Holley. 2010. Trove: Innovation in access to information in Australia. Ariadne.
Gary Huang, Erik G Learned-Miller, and Andrew McCallum. 2006. Cryptogram decoding for optical character recognition. University of Massachusetts Amherst Technical Report.
Fred Jelinek. 1998. Statistical methods for speech recognition. MIT Press.
Andrew Kae and Erik Learned-Miller. 2009. Learning on the fly: font-free approaches to difficult OCR problems. In Proceedings of the 2009 International Conference on Document Analysis and Recognition.
Andrew Kae, Gary Huang, Carl Doersch, and Erik Learned-Miller. 2010. Improving state-of-the-art OCR through high-precision document-specific modeling. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition.
Vladimir Kluzner, Asaf Tzadok, Yuval Shimony, Eugene Walach, and Apostolos Antonacopoulos. 2009. Word-based adaptive OCR for historical books. In Proceedings of the 2009 International Conference on Document Analysis and Recognition.
Vladimir Kluzner, Asaf Tzadok, Dan Chevion, and Eugene Walach. 2011. Hybrid approach to adaptive OCR for historical books. In Proceedings of the 2011 International Conference on Document Analysis and Recognition.
Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.
Philipp Koehn. 2004. Pharaoh: a beam search decoder for phrase-based statistical machine translation models. Machine translation: From real users to research.
Okan Kolak, William Byrne, and Philip Resnik. 2003. A generative probabilistic OCR model for NLP applications. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Gary Kopec and Mauricio Lomelin. 1996. Document-specific character template estimation. In Proceedings of the International Society for Optics and Photonics.
Gary Kopec, Maya Said, and Kris Popat. 2001. N-gram language models for document image decoding. In Proceedings of Society of Photographic Instrumentation Engineers.
Stephen Levinson. 1986. Continuously variable duration hidden Markov models for automatic speech recognition. Computer Speech & Language.
Dong C Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical programming.
Franz Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.
Slav Petrov, Aria Haghighi, and Dan Klein. 2008. Coarse-to-fine syntactic machine translation using language projections. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.
Sujith Ravi and Kevin Knight. 2008. Attacking decipherment problems optimally with low-order n-gram models. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.
Sujith Ravi and Kevin Knight. 2011. Bayesian inference for Zodiac and other homophonic ciphers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
Robert Shoemaker. 2005. Digital London: Creating a searchable web of interlinked sources on eighteenth century London. Electronic Library and Information Systems.
Ray Smith. 2007. An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition.
Benjamin Snyder, Regina Barzilay, and Kevin Knight. 2010. A statistical model for lost language decipherment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.
Georgios Vamvakas, Basilios Gatos, Nikolaos Stamatopoulos, and Stavros Perantonis. 2008. A complete optical character recognition methodology for historical documents. In The Eighth IAPR International Workshop on Document Analysis Systems.
Hao Zhang and Daniel Gildea. 2008. Efficient multipass decoding for synchronous context free grammars. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing.