emnlp emnlp2013 emnlp2013-19 emnlp2013-19-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jan A. Botha ; Phil Blunsom
Abstract: This paper contributes an approach for expressing non-concatenative morphological phenomena, such as stem derivation in Semitic languages, in terms of a mildly context-sensitive grammar formalism. This offers a convenient level of modelling abstraction while remaining computationally tractable. The nonparametric Bayesian framework of adaptor grammars is extended to this richer grammar formalism to propose a probabilistic model that can learn word segmentation and morpheme lexicons, including ones with discontiguous strings as elements, from unannotated data. Our experiments on Hebrew and three variants of Arabic data find that the additional expressiveness to capture roots and templates as atomic units improves the quality of concatenative segmentation and stem identification. We obtain 74% accuracy in identifying triliteral Hebrew roots, while performing morphological segmentation with an F1-score of 78.1.
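To make the notion of a root as a discontiguous string concrete, here is a minimal sketch of root-and-template interdigitation. It is only an illustration of the phenomenon the abstract describes, not the paper's grammar formalism or its learning model. The function name `interdigitate`, the slot symbol `C`, and the romanised Arabic forms are assumptions chosen for the example.

```python
def interdigitate(root: str, template: str, slot: str = "C") -> str:
    """Fill each slot symbol in the template with the next root consonant.

    Hypothetical helper for illustration: the root (e.g. "ktb") is a
    discontiguous string whose consonants are interleaved with the
    template's vowels and affixes.
    """
    # Require exactly one slot per root consonant so nothing is dropped.
    if template.count(slot) != len(root):
        raise ValueError("number of slots must equal number of root consonants")
    consonants = iter(root)
    # Copy template symbols through, replacing slots left to right.
    return "".join(next(consonants) if s == slot else s for s in template)


if __name__ == "__main__":
    print(interdigitate("ktb", "CaCaC"))    # katab
    print(interdigitate("ktb", "maCCuuC"))  # maktuub
```

In the output, the consonants k, t and b are not adjacent, which is why a purely concatenative segmentation cannot treat the root as a single lexicon entry.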
Aviad Albert, Brian MacWhinney, Bracha Nir, and Shuly Wintner. 2013. The Hebrew CHILDES corpus: transcription and morphological analysis. Language Resources and Evaluation, pages 1–33.
Mohamed Altantawy, Nizar Habash, Owen Rambow, and Ibrahim Saleh. 2010. Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach. In Proceedings of LREC, pages 851–858.
Kenneth R Beesley and Lauri Karttunen. 2003. Finite state morphology, volume 18. CSLI Publications, Stanford.
Abderrahim Boudlal, Rachid Belahbib, Abdelhak Lakhouaja, Azzeddine Mazroui, Abdelouafi Meziane, and Mohamed Bebah. 2009. A Markovian approach for Arabic Root Extraction. The International Arab Journal of Information Technology, 8(1):91–98.
Pierre Boullier. 2000. A cubic time extension of context-free grammars. Grammars, 3(2-3):111–131.
Tim Buckwalter. 2002. Arabic Morphological Analyzer. Technical report, Linguistic Data Consortium, Philadelphia.
Alexander Clark. 2001. Learning Morphology with Pair Hidden Markov Models. In Proceedings of the ACL Student Workshop, pages 55–60.
Yael Cohen-Sygal and Shuly Wintner. 2006. Finite-state registered automata for non-concatenative morphology. Computational Linguistics, 32(1):49–82.
Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing, 4(1):1–34.
Kareem Darwish. 2002. Building a shallow Arabic morphological analyzer in one day. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 47–54. Association for Computational Linguistics.
Ezra Daya, Dan Roth, and Shuly Wintner. 2008. Identifying Semitic Roots: Machine Learning with Linguistic Constraints. Computational Linguistics, 34(3):429–448.
Markus Dreyer and Jason Eisner. 2011. Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model. In Proceedings of EMNLP, pages 616–627, Edinburgh, Scotland.
Kais Dukes and Nizar Habash. 2010. Morphological Annotation of Quranic Arabic. In Proceedings of LREC.
Greg Durrett and John DeNero. 2013. Supervised Learning of Complete Morphological Paradigms. In Proceedings of NAACL-HLT, pages 1185–1195, Atlanta, Georgia, June. Association for Computational Linguistics.
Khaled Elghamry. 2005. A Constraint-based Algorithm for the Identification of Arabic Roots. In Proceedings of the Midwest Computational Linguistics Colloquium. Indiana University. Bloomington, IN.
Raphael Finkel and Gregory Stump. 2002. Generating Hebrew verb morphology by default inheritance hierarchies. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages. Association for Computational Linguistics.
Michelle A. Fullwood and Timothy J. O’Donnell. 2013. Learning non-concatenative morphology. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 21–27, Sofia, Bulgaria. Association for Computational Linguistics.
Michael Gasser. 2009. Semitic morphological analysis and generation using finite state transducers with feature structures. In Proceedings of EACL, pages 309–317. Association for Computational Linguistics.
Daniel Gildea. 2010. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems. In Proceedings of NAACL, pages 769–776. Association for Computational Linguistics.
Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. 2006. Interpolating Between Types and Tokens by Estimating Power-Law Generators. In Advances in Neural Information Processing Systems, Volume 18.
Harald Hammarström and Lars Borin. 2011. Unsupervised Learning of Morphology. Computational Linguistics, 37(2):309–350.
Yun Huang, Min Zhang, and Chew Lim Tan. 2011. Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars. In Proceedings of ACL (Short papers), pages 534–539.
Mark Johnson and Sharon Goldwater. 2009. Improving nonparameteric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. In Proceedings of NAACL-HLT, pages 317–325. Association for Computational Linguistics.
Mark Johnson, Thomas L. Griffiths, and Sharon Goldwater. 2007. Adaptor Grammars: A Framework for Specifying Compositional Nonparametric Bayesian Models. In Advances in Neural Information Processing Systems, volume 19, page 641. MIT.
Mark Johnson. 2008. Unsupervised word segmentation for Sesotho using Adaptor Grammars. In Proceedings of ACL Special Interest Group on Computational Morphology and Phonology (SigMorPhon), pages 20–27. Association for Computational Linguistics.
Aravind K. Joshi. 1985. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In D.R. Dowty, L. Karttunen, and A.M. Zwicky, editors, Natural Language Parsing, chapter 6, pages 206–250. Cambridge University Press.
Miriam Kaeshammer. 2013. Synchronous Linear Context-Free Rewriting Systems for Machine Translation. In Proceedings of the Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 68–77, Atlanta, Georgia. Association for Computational Linguistics.
Yuki Kato, Hiroyuki Seki, and Tadao Kasami. 2006. Stochastic Multiple Context-Free Grammar for RNA Pseudoknot Modeling. In Proceedings of the International Workshop on Tree Adjoining Grammar and Related Formalisms, pages 57–64.
George Anton Kiraz. 2000. Multitiered Nonlinear Morphology Using Multitape Finite Automata: A Case Study on Syriac and Arabic. Computational Linguistics, 26(1):77–105, March.
Kimmo Koskenniemi. 1984. A general computational model for word-form recognition and production. In Proceedings of the 10th international conference on Computational Linguistics, pages 178–181. Association for Computational Linguistics.
Mikko Kurimo, Sami Virpioja, Ville T. Turunen, Graeme W. Blackwood, and William Byrne. 2010. Overview and Results of Morpho Challenge 2009. In Multilingual Information Access Evaluation I. Text Retrieval Experiments, volume 6241 of Lecture Notes in Computer Science, pages 578–597. Springer Berlin / Heidelberg.
Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of CoNLL.
Wolfgang Maier. 2010. Direct Parsing of Discontinuous Constituents in German. In Proceedings of the NAACL-HLT Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 58–66. Association for Computational Linguistics.
Jim Pitman and Marc Yor. 1997. The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator. The Annals of Probability, 25(2):855–900.
Jim Pitman. 1995. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102:145–158.
Jean-François Prunet. 2006. External Evidence and the Semitic Root. Morphology, 16(1):41–67.
Paul Rodrigues and Damir Ćavar. 2007. Learning Arabic Morphology Using Statistical Constraint-Satisfaction Models. In Elabbas Benmamoun, editor, Perspectives on Arabic Linguistics: Proceedings of the 19th Arabic Linguistics Symposium, pages 63–75, Urbana, IL, USA. John Benjamins Publishing Company.
Nathan Schneider. 2010. Computational Cognitive Morphosemantics: Modeling Morphological Compositionality in Hebrew Verbs with Embodied Construction Grammar. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society, Berkeley, CA.
Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229.
Kairit Sirts and Sharon Goldwater. 2013. Minimally-Supervised Morphological Segmentation using Adaptor Grammars. Transactions of the ACL.
K. Vijay-Shanker, David J. Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proceedings of ACL, pages 104–111.