54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras

Abstract: Attempts to profile authors according to their characteristics extracted from textual data, including native language, have drawn attention in recent years, via various machine learning approaches utilising mostly lexical features. Drawing on the idea of contrastive analysis, which postulates that syntactic errors in a text are to some extent influenced by the native language of an author, this paper explores the usefulness of syntactic features for native language identification. We take two types of parse substructure as features (horizontal slices of trees, and the more general feature schemas from discriminative parse reranking) and show that using this kind of syntactic feature results in an accuracy score in classification of seven native languages of around 80%, an error reduction of more than 30%.

reference text

Ethem Alpaydin. 2004. Introduction to Machine Learning. MIT Press, Cambridge, MA, USA.
Harald Baayen, Hans van Halteren, and Fiona Tweedie. 1996. Outside the Cave of Shadows: Using Syntactic Annotation to Enhance Authorship Attribution. Literary and Linguistic Computing, 11(3):121–131.
Stephen Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc.
Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 173–180, Ann Arbor, Michigan.
Stephen Clark and James R. Curran. 2007. Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.
Michael Collins. 2000. Discriminative reranking for natural language processing. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML’00), Stanford, CA.
Stephen P. Corder. 1967. The significance of learners’ errors. International Review of Applied Linguistics in Language Teaching (IRAL), 5(4):161–170.
Dominique Estival, Tanja Gaustad, Son-Bao Pham, Will Radford, and Ben Hutchinson. 2007. Author profiling for English emails. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING), pages 263–272.
Ian Fette, Norman Sadeh, and Anthony Tomasic. 2007. Learning to detect phishing emails. In Proceedings of the 16th International World Wide Web Conference.
George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305.
Jennifer Foster, Joachim Wagner, and Josef van Genabith. 2008. Adapting a WSJ-trained parser to grammatically noisy text. In Proceedings of ACL-08: HLT, Short Papers, pages 221–224, Columbus, Ohio.
Julie Franck, Gabriella Vigliocco, and Janet Nicol. 2002. Subject-verb agreement errors in French and English: The role of syntactic hierarchy. Language and Cognitive Processes, 17(4):371–404.
Michael Gamon. 2004. Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), pages 611–617.
Sylviane Granger and Stephanie Tyson. 1996. Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes, 15(1):17–27.
Sylviane Granger, Estelle Dagneaux, Fanny Meunier, and Magali Paquot. 2009. International Corpus of Learner English (Version 2). Presses Universitaires de Louvain, Louvain-la-Neuve.
Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Mark Johnson and Ahmet Engin Ural. 2010. Reranking the Berkeley and Brown Parsers. In Proceedings of Human Language Technologies: the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL-10), pages 665–668, Los Angeles, CA, USA, June.
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic unification-based grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL’99), College Park, MD.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan.
Moshe Koppel, Jonathan Schler, and Kfir Zigdon. 2005. Automatically determining an anonymous author’s native language. In Intelligence and Security Informatics, volume 3495 of Lecture Notes in Computer Science, pages 209–217. Springer-Verlag.
Robert Lado. 1957. Linguistics Across Cultures: Applied Linguistics for Language Teachers. University of Michigan Press, Ann Arbor, MI, US.
Brian MacWhinney and Elizabeth Bates. 1989. The Crosslinguistic Study of Sentence Processing. Cambridge University Press, New York, NY, USA.
Andrew Mutton, Mark Dras, Stephen Wan, and Robert Dale. 2007. GLEU: Automatic evaluation of sentence-level fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 344–351, Prague, Czech Republic.
Steven Myers. 2007. Introduction to phishing. In Markus Jakobsson and Steven Myers, editors, Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft. John Wiley & Sons, Inc., Hoboken, NJ, USA.
Dominick Ng, Matthew Honnibal, and James R. Curran. 2010. Reranking a Wide-Coverage CCG Parser. In Proceedings of Australasian Language Technology Association Workshop (ALTA’10), pages 90–98, Melbourne, Australia.
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning Accurate, Compact, and Interpretable Tree Annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL’06), pages 433–440, Sydney, Australia, July.
Manfred Pienemann. 1998. Language Processing and Second Language Development: Processability Theory. John Benjamins, Amsterdam, The Netherlands.
Payam Refaeilzadeh, Lei Tang, and Huan Liu. 2009. Cross-validation. In Ling Liu and M. Tamer Özsu, editors, Encyclopedia of Database Systems, pages 532–538. Springer, US.
Jack C. Richards. 1971. A non-contrastive approach to error analysis. ELT Journal, 25(3):204–219.
Roumyana Slabakova. 2000. L1 transfer revisited: the L2 acquisition of telicity marking in English by Spanish and Bulgarian native speakers. Linguistics, 38(4):739–770.
Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, and Chin-Yew Lin. 2007. Detecting erroneous sentences using automatically mined sequential patterns. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 81–88, Prague, Czech Republic.
Michael Swan and Bernard Smith, editors. 2001. Learner English: A teacher’s guide to interference and other problems. Cambridge University Press, 2nd edition.
Joel Tetreault, Jennifer Foster, and Martin Chodorow. 2010. Using parse features for preposition selection and error detection. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, pages 353–358. Association for Computational Linguistics.
Oren Tsur and Ari Rappoport. 2007. Using classifier features for studying the effect of native language on the choice of written second language words. In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition, pages 9–16.
Hans van Halteren. 2008. Source language markers in EUROPARL translations. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 937–944.
Irena Vassileva. 1998. Who am I/how are we in academic writing? A contrastive analysis of authorial presence in English, German, French, Russian and Bulgarian. International Journal of Applied Linguistics, 8(2):163–185.
Gabriella Vigliocco, Brian Butterworth, and Merrill F. Garrett. 1996. Subject-verb agreement in Spanish and English: Differences in the role of conceptual constraints. Cognition, 61(3):261–298.
Joachim Wagner, Jennifer Foster, and Josef van Genabith. 2009. Judging grammaticality: Experiments in sentence classification. CALICO Journal, 26(3):474–490.
Richard Wardhaugh. 1970. The Contrastive Analysis Hypothesis. TESOL Quarterly, 4(2):123–130.
Sze-Meng Jojo Wong and Mark Dras. 2009. Contrastive analysis and native language identification. In Proceedings of the Australasian Language Technology Association Workshop 2009, pages 53–61, Sydney, Australia, December.
Sze-Meng Jojo Wong and Mark Dras. 2010. Parser features for sentence grammaticality classification. In Proceedings of the Australasian Language Technology Association Workshop 2010, pages 67–75, Melbourne, Australia, December.
Suying Yang and Yue-Yuan Huang. 2004. The impact of the absence of grammatical tense in L1 on the acquisition of the tense-aspect system in L2. International Review of Applied Linguistics in Language Teaching (IRAL), 42(1):49–70.
Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), pages 412–420.
Rong Zheng, Yi Qin, Zan Huang, and Hsinchun Chen. 2003. Authorship analysis in cybercrime investigation. In Intelligence and Security Informatics, volume 2665 of Lecture Notes in Computer Science, pages 59–73. Springer-Verlag.