
86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction


Source: pdf

Author: Sida Wang; Mengqiu Wang; Stefan Wager; Percy Liang; Christopher D. Manning

Abstract: NLP models often have many sparse features, and regularization is key to balancing overfitting against underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs, and we tackle the key challenge of developing a dynamic program that computes the gradient of the regularizer efficiently. Because the regularizer is a sum over inputs, it can be estimated more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a >1% absolute performance gain over standard L2 regularization.
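
The reinterpretation described in the abstract can be made concrete in a few lines. The sketch below is ours, not the paper's implementation: for a single multiclass logistic-regression input it compares a Monte Carlo estimate of the noising penalty E[log Z(x + eps)] - log Z(x) with its closed-form second-order approximation. The choice of additive Gaussian noise (the paper also treats dropout-style noise), and all names and constants such as sigma and n_classes, are illustrative assumptions.

```python
# Minimal sketch, assuming additive zero-mean Gaussian feature noise.
# With zero-mean noise and linear scores, the expected noised loss exceeds
# the clean loss by exactly R = E[log Z(theta; x + eps)] - log Z(theta; x),
# so noising acts as a regularizer; a second-order Taylor expansion of
# log Z in the class scores gives R in closed form, with no fake data.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features, sigma = 4, 10, 0.1   # illustrative sizes and noise level

theta = rng.normal(size=(n_classes, n_features))  # one weight row per class
x = rng.normal(size=n_features)                   # a single input vector

def log_partition(scores):
    """Stable log-sum-exp over the last axis."""
    m = scores.max(axis=-1, keepdims=True)
    return m[..., 0] + np.log(np.exp(scores - m).sum(axis=-1))

scores = theta @ x
logZ = log_partition(scores)
mu = np.exp(scores - logZ)                        # p(y | x), class posteriors

# Monte Carlo estimate of the noising regularizer, eps ~ N(0, sigma^2 I).
eps = sigma * rng.normal(size=(200_000, n_features))
R_mc = log_partition((x + eps) @ theta.T).mean() - logZ

# Second-order formula: R ~= (1/2) tr(H Cov[noisy scores]), where
# H = diag(mu) - mu mu^T is the Hessian of log Z w.r.t. the score vector.
# For isotropic Gaussian noise, Cov[noisy scores] = sigma^2 theta theta^T, so
# R ~= (sigma^2 / 2) (sum_y mu_y ||theta_y||^2 - ||sum_y mu_y theta_y||^2).
R_approx = 0.5 * sigma**2 * (
    mu @ (theta ** 2).sum(axis=1) - np.sum((mu @ theta) ** 2)
)

print(f"Monte Carlo noising regularizer: {R_mc:.5f}")
print(f"Second-order approximation:      {R_approx:.5f}")
```

At small noise levels the two estimates agree closely, which is what allows training with the explicit regularizer in place of repeatedly generated noised copies of the data; the paper's contribution is extending this idea to structured outputs, where log Z and the regularizer's gradient require a dynamic program.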


reference text

Yaser S. Abu-Mostafa. 1990. Learning from hints in neural networks. Journal of Complexity, 6(2):192–198.
Chris M. Bishop. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116.
Robert Bryll, Ricardo Gutierrez-Osuna, and Francis Quek. 2003. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recognition, 36(6):1291–1302.
Chris J.C. Burges and Bernhard Schölkopf. 1997. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, pages 375–381.
Brad Efron and Robert Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall, New York.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370.
Yves Grandvalet and Yoshua Bengio. 2005. Entropy regularization. In Semi-Supervised Learning. Springer, United Kingdom.
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Feng Jiao, Shaojun Wang, Chi-Hoon Lee, Russell Greiner, and Dale Schuurmans. 2006. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 209–216.
Thorsten Joachims. 1999. Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning, pages 200–209.
Wei Li and Andrew McCallum. 2005. Semi-supervised sequence modeling with syntactic topic models. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, AAAI'05, pages 813–818.
Gideon S. Mann and Andrew McCallum. 2007. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the International Conference on Machine Learning.
Kiyotoshi Matsuoka. 1992. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man and Cybernetics, 22(3):436–440.
Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. 2011a. The manifold tangent classifier. Advances in Neural Information Processing Systems, 24:2294–2302.
Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. 2011b. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250.
Patrice Y. Simard, Yann A. Le Cun, John S. Denker, and Bernard Victorri. 2000. Transformation invariance in pattern recognition: tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181–197.
Andrew Smith, Trevor Cohn, and Miles Osborne. 2005. Logarithmic opinion pools for conditional random fields. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 18–25.
Charles Sutton, Michael Sindelar, and Andrew McCallum. 2005. Feature bagging: preventing weight undertraining in structured discriminative learning. Technical report, Center for Intelligent Information Retrieval, University of Massachusetts.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142–147.
Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q. Weinberger. 2013. Learning with marginalized corrupted features. In Proceedings of the International Conference on Machine Learning.
Stefan Wager, Sida Wang, and Percy Liang. 2013. Dropout training as adaptive regularization. arXiv preprint arXiv:1307.1493.
Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. 2013. Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning.
Sida Wang and Christopher D. Manning. 2013. Fast dropout training. In Proceedings of the International Conference on Machine Learning.