
Dropout Training as Adaptive Regularization


Authors: Stefan Wager, Sida Wang, Percy Liang

Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
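
To make the abstract's central claim concrete, here is a minimal numerical sketch (not the authors' code) for the logistic regression case: it compares a Monte Carlo estimate of the dropout-noised training loss with the plain loss plus an L2 penalty weighted by the estimated diagonal Fisher information. All names (X, beta, delta, the sample sizes) are illustrative assumptions, not quantities from the paper.

```python
# Sketch: dropout noising vs. Fisher-weighted L2 penalty for logistic regression.
# Assumes the quadratic (second-order) approximation described in the abstract.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
beta = rng.normal(scale=0.5, size=d)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)
delta = 0.3  # dropout probability (illustrative choice)

def nll(X, y, beta):
    """Average logistic negative log-likelihood."""
    z = X @ beta
    return np.mean(np.log1p(np.exp(z)) - y * z)

# Monte Carlo estimate of the dropout-noised objective: drop each feature with
# probability delta and rescale survivors by 1/(1 - delta) so E[x_tilde] = x.
mc = np.mean([
    nll(X * (rng.random(X.shape) > delta) / (1.0 - delta), y, beta)
    for _ in range(2000)
])

# Quadratic approximation: penalty = delta / (2(1 - delta)) * sum_j beta_j^2 * I_jj,
# where I_jj = mean_i p_i (1 - p_i) x_ij^2 estimates the diagonal Fisher information.
p = 1.0 / (1.0 + np.exp(-X @ beta))
fisher_diag = np.mean((p * (1.0 - p))[:, None] * X**2, axis=0)
penalty = delta / (2.0 * (1.0 - delta)) * np.sum(fisher_diag * beta**2)

print(f"dropout objective (Monte Carlo):   {mc:.4f}")
print(f"plain objective + L2 approximation: {nll(X, y, beta) + penalty:.4f}")
```

The two printed values should be close, which is the sense in which dropout acts as an L2 regularizer after features are rescaled by the estimated diagonal Fisher information.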


reference text

[1] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[2] Laurens van der Maaten, Minmin Chen, Stephen Tyree, and Kilian Q Weinberger. Learning with marginalized corrupted features. In Proceedings of the International Conference on Machine Learning, 2013.

[3] Sida I Wang and Christopher D Manning. Fast dropout training. In Proceedings of the International Conference on Machine Learning, 2013.

[4] Yaser S Abu-Mostafa. Learning from hints in neural networks. Journal of Complexity, 6(2):192–198, 1990.

[5] Chris J.C. Burges and Bernhard Schölkopf. Improving the accuracy and speed of support vector machines. In Advances in Neural Information Processing Systems, pages 375–381, 1997.

[6] Patrice Y Simard, Yann A Le Cun, John S Denker, and Bernard Victorri. Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181–197, 2000.

[7] Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. Advances in Neural Information Processing Systems, 24:2294–2302, 2011.

[8] Kiyotoshi Matsuoka. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):436–440, 1992.

[9] Chris M Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

[10] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.

[11] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[12] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 142–150. Association for Computational Linguistics, 2011.

[13] Sida I Wang, Mengqiu Wang, Stefan Wager, Percy Liang, and Christopher D Manning. Feature noising for log-linear structured prediction. In Empirical Methods in Natural Language Processing, 2013.

[14] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the International Conference on Machine Learning, 2013.

[15] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 90–94. Association for Computational Linguistics, 2012.

[16] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.

[17] Erich Leo Lehmann and George Casella. Theory of Point Estimation. Springer, 1998.

[18] Koby Crammer, Alex Kulesza, and Mark Dredze. Adaptive regularization of weight vectors. Advances in Neural Information Processing Systems, 22:414–422, 2009.

[19] Jiang Su, Jelber Sayyad Shirab, and Stan Matwin. Large scale text classification using semi-supervised multinomial naive Bayes. In Proceedings of the International Conference on Machine Learning, 2011.

[20] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103–134, May 2000.

[21] Guillaume Bouchard and Bill Triggs. The trade-off between generative and discriminative classifiers. In Proceedings of the International Conference on Computational Statistics, pages 721–728, 2004.

[22] Rajat Raina, Yirong Shen, Andrew Y Ng, and Andrew McCallum. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems, 2004.

[23] Jun Suzuki, Akinori Fujino, and Hideki Isozaki. Semi-supervised structured output learning based on a hybrid generative and discriminative approach. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007.

[24] Yves Grandvalet and Yoshua Bengio. Entropy regularization. In Semi-Supervised Learning. Springer, 2005.

[25] Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning, pages 200–209, 1999.