68 jmlr-2011-Natural Language Processing (Almost) from Scratch

Source: pdf

Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa

Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

Keywords: natural language processing, neural networks

reference text

R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research (JMLR), 6:1817–1953, 2005.
R. M. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. Technical report, AT&T Labs, 2007. http://www.research.att.com/~volinsky/netflix.
Y. Bengio and R. Ducharme. A neural probabilistic language model. In Advances in Neural Information Processing Systems (NIPS 13), 2001.
Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS 19), 2007.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning (ICML), 2009.
L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes. EC2, 1991.
L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
L. Bottou and P. Gallinari. A framework for the cooperation of learning algorithms. In Advances in Neural Information Processing Systems (NIPS 3). 1991.
L. Bottou, Y. LeCun, and Yoshua Bengio. Global training of document processing systems using graph transformer networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 489–493, 1997.
J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, Architectures and Applications, pages 227–236. NATO ASI Series, 1990.
P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992a.
P. F. Brown, V. J. Della Pietra, R. L. Mercer, S. A. Della Pietra, and J. C. Lai. An estimate of an upper bound for the entropy of english. Computational Linguistics, 18(1):31–41, 1992b.
C. J. C. Burges, R. Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS 19), pages 193–200. 2007.
R. Caruana. Multitask Learning. Machine Learning, 28(1):41–75, 1997.
O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. Adaptive computation and machine learning. MIT Press, Cambridge, Mass., USA, September 2006.
E. Charniak. A maximum-entropy-inspired parser. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 132–139, 2000.
H. L. Chieu. Named entity recognition with a maximum entropy approach. In Conference on Natural Language Learning (CoNLL), pages 160–163, 2003.
N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, September 1956.
S. Clémençon and N. Vayatis. Ranking the best instances. Journal of Machine Learning Research (JMLR), 8:2671–2699, 2007.
W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10:243–270, 1998.
T. Cohn and P. Blunsom. Semantic role labelling with tree conditional random fields. In Conference on Computational Natural Language (CoNLL), 2005.
M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.
R. Collobert. Large Scale Machine Learning. PhD thesis, Université Paris VI, 2004.
R. Collobert. Deep learning for efficient discriminative parsing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
T. Cover and R. King. A convergent gambling estimate of the entropy of english. IEEE Transactions on Information Theory, 24(4):413–421, July 1978.
R. Florian, A. Ittycheriah, H. Jing, and T. Zhang. Named entity recognition through classifier combination. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 168–171, 2003.
D. Gildea and D. Jurafsky. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288, 2002.
D. Gildea and M. Palmer. The necessity of parsing for predicate argument recognition. Meeting of the Association for Computational Linguistics (ACL), pages 239–246, 2002.
J. Giménez and L. Màrquez. SVMTool: A general POS tagger generator based on support vector machines. In Conference on Language Resources and Evaluation (LREC), 2004.
A. Haghighi, K. Toutanova, and C. D. Manning. A joint model for semantic role labeling. In Conference on Computational Natural Language Learning (CoNLL), June 2005.
Z. S. Harris. Mathematical Structures of Language. John Wiley & Sons Inc., 1968.
D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research (JMLR), 1:49–75, 2001.
G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
K. Hollingshead, S. Fisher, and B. Roark. Comparing and combining finite-state and context-free parsers. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP), pages 787–794, 2005.
F. Huang and A. Yates. Distributional representations for handling sparsity in supervised sequence-labeling. In Meeting of the Association for Computational Linguistics (ACL), pages 495–503, 2009.
F. Jelinek. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64(4):532–556, 1976.
T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning (ICML), 1999.
D. Klein and C. D. Manning. Natural language grammar induction using a constituent-context model. In Advances in Neural Information Processing Systems (NIPS 14), pages 35–42. 2002.
T. Koo, X. Carreras, and M. Collins. Simple semi-supervised dependency parsing. In Meeting of the Association for Computational Linguistics (ACL), pages 595–603, 2008.
P. Koomen, V. Punyakanok, D. Roth, and W. Yih. Generalized inference with multiple semantic role labeling systems (shared task paper). In Conference on Computational Natural Language Learning (CoNLL), pages 181–184, 2005.
T. Kudo and Y. Matsumoto. Chunking with support vector machines. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 1–8, 2001.
T. Kudoh and Y. Matsumoto. Use of support vector learning for chunk identification. In Conference on Natural Language Learning (CoNLL) and Second Learning Language in Logic Workshop (LLL), pages 142–144, 2000.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient based learning applied to document recognition. Proceedings of IEEE, 86(11):2278–2324, 1998.
Y. LeCun. A learning scheme for asymmetric threshold networks. In Proceedings of Cognitiva, pages 599–604, Paris, France, 1985.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G.B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, pages 9–50. Springer, 1998.
D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5:361–397, 2004.
P. Liang. Semi-supervised learning for natural language. Master's thesis, Massachusetts Institute of Technology, 2005.
P. Liang, H. Daumé III, and D. Klein. Structure compilation: trading structure for features. In International Conference on Machine Learning (ICML), pages 592–599, 2008.
D. Lin and X. Wu. Phrase clustering for discriminative learning. In Meeting of the Association for Computational Linguistics (ACL), pages 1030–1038, 2009.
N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. In Machine Learning, pages 285–318, 1988.
A. McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 188–191, 2003.
D. McClosky, E. Charniak, and M. Johnson. Effective self-training for parsing. Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), 2006.
R. McDonald, K. Crammer, and F. Pereira. Flexible text segmentation with structured multilabel classification. In Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP), pages 987–994, 2005.
S. Miller, H. Fox, L. Ramshaw, and R. Weischedel. A novel use of statistical parsing to extract information from text. Applied Natural Language Processing Conference (ANLP), 2000.
S. Miller, J. Guinness, and A. Zamanian. Name tagging with word clusters and discriminative training. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 337–342, 2004.
A. Mnih and G. E. Hinton. Three new graphical models for statistical language modelling. In International Conference on Machine Learning (ICML), pages 641–648, 2007.
G. Musillo and P. Merlo. Robust Parsing of the Proposition Bank. ROMAND 2006: Robust Methods in Analysis of Natural language Data, 2006.
R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer-Verlag, New York, 1996.
D. Okanohara and J. Tsujii. A discriminative language model with pseudo-negative samples. Meeting of the Association for Computational Linguistics (ACL), pages 73–80, 2007.
M. Palmer, D. Gildea, and P. Kingsbury. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106, 2005.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
D. C. Plaut and G. E. Hinton. Learning sets of filters using back-propagation. Computer Speech and Language, 2:35–61, 1987.
M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky. Shallow semantic parsing using support vector machines. Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), 2004.
S. Pradhan, K. Hacioglu, W. Ward, J. H. Martin, and D. Jurafsky. Semantic role chunking combining complementary syntactic views. In Conference on Computational Natural Language Learning (CoNLL), pages 217–220, 2005.
V. Punyakanok, D. Roth, and W. Yih. The necessity of syntactic parsing for semantic role labeling. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1117–1123, 2005.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
L. Ratinov and D. Roth. Design challenges and misconceptions in named entity recognition. In Conference on Computational Natural Language Learning (CoNLL), pages 147–155. Association for Computational Linguistics, 2009.
A. Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 133–142, 1996.
B. Rosenfeld and R. Feldman. Using Corpus Statistics on Entities to Improve Semi-supervised Relation Extraction from the Web. Meeting of the Association for Computational Linguistics (ACL), pages 600–607, 2007.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by backpropagating errors. In D.E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318–362. MIT Press, 1986.
H. Schütze. Distributional part-of-speech tagging. In Meeting of the Association for Computational Linguistics (ACL), pages 141–148, 1995.
H. Schwenk and J. L. Gauvain. Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 765–768, 2002.
F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 134–141, 2003.
C. E. Shannon. Prediction and entropy of printed english. Bell Systems Technical Journal, 30:50–64, 1951.
H. Shen and A. Sarkar. Voting between multiple data representations for text chunking. Advances in Artificial Intelligence, pages 389–400, 2005.
L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In Meeting of the Association for Computational Linguistics (ACL), 2007.
N. A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Meeting of the Association for Computational Linguistics (ACL), pages 354–362, 2005.
S. C. Suddarth and A. D. C. Holden. Symbolic-neural systems and the use of hints for developing complex systems. International Journal of Man-Machine Studies, 35(3):291–311, 1991.
X. Sun, L.-P. Morency, D. Okanohara, and J. Tsujii. Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference. In International Conference on Computational Linguistics (COLING), pages 841–848, 2008.
C. Sutton and A. McCallum. Joint parsing and semantic role labeling. In Conference on Computational Natural Language (CoNLL), pages 225–228, 2005a.
C. Sutton and A. McCallum. Composition of conditional random fields for transfer learning. Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT-EMNLP), pages 748–754, 2005b.
C. Sutton, A. McCallum, and K. Rohanimanesh. Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Journal of Machine Learning Research (JMLR), 8:693–723, 2007.
J. Suzuki and H. Isozaki. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), pages 665–673, 2008.
W. J. Teahan and J. G. Cleary. The entropy of english using ppm-based models. In Data Compression Conference (DCC), pages 53–62. IEEE Computer Society Press, 1996.
K. Toutanova, D. Klein, C. D. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Conference of the North American Chapter of the Association for Computational Linguistics & Human Language Technologies (NAACL-HLT), 2003.
J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Meeting of the Association for Computational Linguistics (ACL), pages 384–392, 2010.
N. Ueffing, G. Haffari, and A. Sarkar. Transductive learning for statistical machine translation. In Meeting of the Association for Computational Linguistics (ACL), pages 25–32, 2007.
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989.
J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In International Conference on Machine Learning (ICML), pages 1168–1175, 2008.