
68 acl-2010-Conditional Random Fields for Word Hyphenation


Authors: Nikolaos Trogkanis; Charles Elkan

Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TeX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.
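The abstract describes the method only as an application of conditional random fields, which assign a label to each element of a sequence. The following is a minimal sketch of how hyphenation might be cast as letter-level sequence labeling. It assumes, without confirmation from the abstract, a binary label per letter (1 if a hyphen may follow that letter) and overlapping letter n-gram features. The function names `to_labels`, `from_labels`, and `window_features` are hypothetical and are not taken from the paper.

```python
def to_labels(hyphenated: str, sep: str = "-") -> tuple[str, list[int]]:
    """Split a hyphenated word into its letters and per-letter labels.

    Assumed encoding: labels[i] == 1 means a hyphen is allowed
    immediately after letter i; otherwise labels[i] == 0.
    """
    letters: list[str] = []
    labels: list[int] = []
    for ch in hyphenated:
        if ch == sep:
            if labels:  # ignore a separator before the first letter
                labels[-1] = 1
        else:
            letters.append(ch)
            labels.append(0)
    return "".join(letters), labels


def from_labels(word: str, labels: list[int], sep: str = "-") -> str:
    """Rebuild the hyphenated form from letters and per-letter labels."""
    out: list[str] = []
    for ch, lab in zip(word, labels):
        out.append(ch)
        if lab:
            out.append(sep)
    # A hyphen after the final letter is meaningless, so drop it.
    return "".join(out).rstrip(sep)


def window_features(word: str, i: int, max_n: int = 3) -> list[str]:
    """Letter n-grams (length 1..max_n) that cover position i.

    Each feature is tagged with the n-gram's start offset relative to
    letter i. '^' and '$' mark the word boundaries (an assumption made
    for this sketch).
    """
    padded = "^" + word + "$"
    j = i + 1  # index of letter i inside the padded string
    feats: list[str] = []
    for n in range(1, max_n + 1):
        for start in range(j - n + 1, j + 1):
            if start >= 0 and start + n <= len(padded):
                feats.append(f"{start - j}:{padded[start:start + n]}")
    return feats


word, labels = to_labels("hy-phen-ate")
print(word, labels)
print(from_labels(word, labels))
print(window_features("hyphen", 0))
```

A sequence model such as a CRF would then be trained to predict the label list from features like these, with training pairs derived from a lexicon of hyphenated words.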

reference text

Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2008. Automatic syllabification with structured SVMs for letter-to-phoneme conversion. Proceedings of ACL-08: HLT, pages 568–576.
Barbara Beeton. 2002. Hyphenation exception log. TUGboat, 23(3).
Léon Bottou. 2008. Stochastic gradient CRF software CRFSGD. Available at http://leon.bottou.org/projects/sgd.
Gosse Bouma. 2003. Finite state methods for hyphenation. Natural Language Engineering, 9(1):5–20, March.
Aron Culotta and Andrew McCallum. 2004. Confidence Estimation for Information Extraction. In Susan Dumais, Daniel Marcu, and Salim Roukos, editors, HLT-NAACL 2004: Short Papers, pages 109–112, Boston, Massachusetts, USA, May. Association for Computational Linguistics.
Fred J. Damerau. 1964. Automatic Hyphenation Scheme. U.S. patent 3537076 filed June 17, 1964, issued October 1970.
Gordon D. Friedlander. 1968. Automation comes to the printing and publishing industry. IEEE Spectrum, 5:48–62, April.
Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2):8–12.
Yannis Haralambous. 2006. New hyphenation techniques in Ω2. TUGboat, 27:98–103.
Steven L. Huyser. 1976. AUTO-MA-TIC WORD DIVI-SION. SIGDOC Asterisk Journal of Computer Documentation, 3(5):9–10.
Timo Jarvi. 2009. Computerized Typesetting and Other New Applications in a Publishing House. In History of Nordic Computing 2, pages 230–237. Springer.
Terje Kristensen and Dag Langmyhr. 2001. Two regimes of computer hyphenation–a comparison. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), volume 2, pages 1532–1535.
Taku Kudo. 2007. CRF++: Yet Another CRF Toolkit. Version 0.5 available at http://crfpp.sourceforge.net/.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282–289.
Franklin M. Liang and Peter Breitenlohner. 2008. PATtern GENeration Program for the TeX82 Hyphenator. Electronic documentation of PATGEN program version 2.3 from web2c distribution on CTAN, retrieved 2008.
Franklin M. Liang. 1983. Word Hy-phen-a-tion by Com-put-er. Ph.D. thesis, Stanford University.
Jorge Nocedal and Stephen J. Wright. 1999. Limited memory BFGS. In Numerical Optimization, pages 222–247. Springer.
Wolfgang A. Ocker. 1971. A program to hyphenate English words. IEEE Transactions on Engineering, Writing and Speech, 14(2):53–59, June.
Martin Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.
Terrence J. Sejnowski and Charles R. Rosenberg. 1988. NETtalk: A parallel network that learns to read aloud, pages 661–672. MIT Press, Cambridge, MA, USA.
Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 134–141.
Petr Sojka and Pavel Sevecek. 1995. Hyphenation in TeX–Quo Vadis? TUGboat, 16(3):280–289.
Christos Tsalidis, Giorgos Orphanos, Anna Iordanidou, and Aristides Vagelatos. 2004. Proofing Tools Technology at Neurosoft S.A. ArXiv Computer Science e-prints, (cs/0408059), August.
P.T.H. Tutelaers. 1999. Afbreken in TeX, hoe werkt dat nou? Available at ftp://ftp.tue.nl/pub/tex/afbreken/.
Antal van den Bosch, Ton Weijters, Jaap Van Den Herik, and Walter Daelemans. 1995. The profit of learning exceptions. In Proceedings of the 5th Belgian-Dutch Conference on Machine Learning (BENELEARN), pages 118–126.
Jaap C. Woestenburg. 2006. *TALO's Language Technology, November. Available at http://www.talo.nl/talo/download/documents/Language_Book.pdf.