62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input


Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.


reference text

Y. Benajiba, M. Diab, and P. Rosso. 2009. Arabic named entity recognition: A feature-driven study. In the special issue on Processing Morphologically Rich Languages of the IEEE Transaction on Audio, Speech and Language.
D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel. 1997. Nymble: a high-performance learning namefinder. In Proceedings of ANLP-97, pages 194–201.
A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. 1998. Exploiting diverse knowledge sources via maximum entropy in named entity recognition.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, J. C. Lai, and R. L. Mercer. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1), March.
S. Chen and R. Rosenfeld. 2000. A survey of smoothing techniques for ME models. IEEE Transaction on Speech and Audio Processing.
R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. 2003. Named entity recognition through classifier combination. In Conference on Computational Natural Language Learning - CoNLL-2003, Edmonton, Canada, May.
R. Florian, H. Hassan, A. Ittycheriah, H. Jing, N. Kambhatla, X. Luo, N. Nicolov, and S. Roukos. 2004. A statistical model for multilingual entity detection and tracking. In Proceedings of HLT-NAACL 2004, pages 1–8.
R. Florian, H. Jing, N. Kambhatla, and I. Zitouni. 2006. Factorizing complex models: A case study in mention detection. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 473–480, Sydney, Australia, July. Association for Computational Linguistics.
J. Goodman. 2002. Sequential conditional generalized iterative scaling. In Proceedings of ACL’02.
D. B. Han. 2010. Klue annotation guidelines - version 2.0. Technical Report RC25042, IBM Research, August.
A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. 1997. The DET curve in assessment of detection task performance. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 1895–1898. Rhodes, Greece.
D. Miller, S. Boisen, R. Schwartz, R. Stone, and R. Weischedel. 2000. Named entity extraction from noisy input: speech and OCR. In Proceedings of the sixth conference on Applied natural language processing, pages 316–324, Morristown, NJ, USA. Association for Computational Linguistics.
E. Minkov, R. C. Wang, and W. W. Cohen. 2005. Extracting personal names from email: Applying named entity recognition to informal text. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 443–450, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.
E. W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons.
J. F. Pitrelli, B. L. Lewis, E. A. Epstein, M. Franz, D. Kiecza, J. L. Quinn, G. Ramaswamy, A. Srivastava, and P. Virga. 2008. Aggregating Distributed STT, MT, and Information Extraction Engines: The GALE Interoperability-Demo System. In Interspeech. Brisbane, NSW, Australia.
J. M. Prager. 1999. Linguini: Language identification for multilingual documents. In Journal of Management Information Systems, pages 1–11.
I. Zitouni and R. Florian. 2008. Mention detection crossing the language barrier. In Proceedings of EMNLP’08, Honolulu, Hawaii, October.
I. Zitouni and R. Florian. 2009. Cross-language information propagation for Arabic mention detection. ACM Transactions on Asian Language Information Processing (TALIP), 8(4):1–21.
L. Ramshaw and M. Marcus. 1999. Text chunking using transformation-based learning. In S. Armstrong, K. W. Church, P. Isabelle, S. Manzi, E. Tzoukermann, and D. Yarowsky, editors, Natural Language Processing Using Very Large Corpora, pages 157–176. Kluwer.
E. F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2002, pages 155–158. Taipei, Taiwan.
J. Warmer and S. van Egmond. 1989. The implementation of the Amsterdam SGML parser. Electron. Publ. Origin. Dissem. Des., 2(2):65–90.
L. Yi, B. Liu, and X. Li. 2003. Eliminating noisy information in web pages for data mining. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 296–305, New York, NY, USA. ACM.
M. Zimmerman, D. Hakkani-Tur, J. Fung, N. Mirghafori, L. Gottlieb, E. Shriberg, and Y. Liu. 2006. The ICSI+ multilingual sentence segmentation system. In Interspeech, pages 117–120, Pittsburgh, Pennsylvania, September.