NIPS 2011, paper 271 — reference knowledge graph (maker-knowledge-mining)
Source: pdf
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data, which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g., gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests to determine when the batch size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch size has become equal to the size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability that an update is in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1-regularized coordinate descent algorithms: L1-regularized L2-loss SVMs, L1-regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable.
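As a rough illustration of the abstract's central mechanism – a frequentist test on the sign of a mini-batch gradient that triggers a batch-size increase when the evidence is too weak – here is a minimal Python sketch. It is not the paper's algorithm: the function names, the normal-approximation z-test, the `eps` threshold (standing in for the abstract's single parameter), the batch-doubling schedule, and the `per_example_grad` callback are all assumptions made for the sake of the example.

```python
import numpy as np
from scipy.stats import norm

def sign_is_reliable(grad_samples, eps=0.05):
    """Normal-approximation test on the sign of a mini-batch gradient.

    grad_samples: per-example partial derivatives for one coordinate.
    Returns True when the estimated probability that the full-data
    gradient has the opposite sign is below eps (an assumed threshold).
    """
    n = len(grad_samples)
    if n < 2:
        return False                       # too few samples to estimate variance
    mean = grad_samples.mean()
    se = grad_samples.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    if se == 0.0:
        return True
    # Under a CLT approximation, P(true sign differs) ~ Phi(-|mean| / se).
    return norm.cdf(-abs(mean) / se) < eps

def batch_gradient(per_example_grad, X, y, w, j, batch_size, eps=0.05, rng=None):
    """Grow the mini-batch until the sign test passes or all data is used.

    per_example_grad is a hypothetical callback returning one partial
    derivative per example for coordinate j at the current weights w.
    """
    rng = rng or np.random.default_rng()
    n_total = X.shape[0]
    while True:
        idx = rng.choice(n_total, size=min(batch_size, n_total), replace=False)
        g = per_example_grad(X[idx], y[idx], w, j)   # one value per example
        if batch_size >= n_total or sign_is_reliable(g, eps):
            return g.mean(), min(batch_size, n_total)
        batch_size *= 2   # evidence too weak: increase the batch size (assumed schedule)
```

Once the batch has grown to the full dataset and the test still fails, one natural reading of the abstract's stopping rule is that further updates to that coordinate are no longer statistically justified, which is how the method avoids overfitting.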
[1] L. Bottou and O. Bousquet. Learning using large datasets. In Mining Massive DataSets for Security, NATO ASI Workshop Series. IOS Press, Amsterdam, 2008.
[2] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information Processing Systems, 20:161–168, 2008.
[3] B. Yu. Embracing statistical challenges in the information technology age. Technometrics, American Statistical Association and the American Society for Quality, 49:237–248, 2007.
[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[5] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size. In Proceedings of the 25th international conference on Machine learning, pages 928–935. ACM, 2008.
[6] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. Twentieth International Conference on Machine Learning, 2003.
[7] P.L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. Advances in Neural Information Processing Systems, 21, 2007.
[8] A. Korattikara, L. Boyles, M. Welling, J. Kim, and H. Park. Statistical optimization of non-negative matrix factorization. AISTATS, 2011.
[9] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
[10] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[11] N. Le Roux, P.A. Manzagol, and Y. Bengio. Topmoumoute online natural gradient algorithm. In Neural Information Processing Systems (NIPS). Citeseer, 2007.
[12] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, page 886. Citeseer, 2005.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[14] Yoshimasa Tsuruoka, Jun’ichi Tsujii, and Sophia Ananiadou. Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 477–485, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[15] Navneet Dalal. Finding People in Images and Video. PhD thesis, Institut National Polytechnique de Grenoble / INRIA Grenoble, July 2006.
[16] PASCAL Large Scale Learning Challenge. http://largescale.ml.tu-berlin.de/workshop/, 2008.
[17] A. Bordes, L. Bottou, and P. Gallinari. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737–1754, 2009.