nips nips2012 nips2012-272 nips2012-272-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jasper Snoek, Hugo Larochelle, Ryan P. Adams
Abstract: The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a “black art” requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm’s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including latent Dirichlet allocation, structured SVMs and convolutional neural networks.
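The abstract describes GP-based Bayesian optimization of hyperparameters: fit a GP surrogate to the observed (hyperparameter, validation performance) pairs, then pick the next experiment by maximizing an acquisition function such as expected improvement. The sketch below is a minimal illustration of that loop, not the paper's implementation: it uses a Matérn-5/2 kernel with fixed, hand-picked hyperparameters (the paper instead integrates out GP hyperparameters), a 1-D grid of candidates, and a toy objective standing in for a learning algorithm's validation error.

```python
# Minimal sketch of GP-based Bayesian optimization with expected improvement (EI).
# Kernel hyperparameters, the toy objective, and the candidate grid are illustrative
# assumptions, not the paper's configuration.
import numpy as np
from scipy.stats import norm


def matern52(X1, X2, lengthscale=0.2, variance=1.0):
    """Matérn-5/2 covariance on 1-D inputs (a kernel choice the paper advocates)."""
    d = np.abs(X1[:, None] - X2[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(5) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5) * d)


def gp_posterior(X_obs, y_obs, X_new, noise=1e-6):
    """Posterior mean and std. dev. of a zero-mean GP at the candidate points."""
    K = matern52(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_s = matern52(X_obs, X_new)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(matern52(X_new, X_new)) - np.sum(v**2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mu, sigma, y_best):
    """EI for minimization: expected amount by which a point beats the incumbent."""
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def objective(x):
    # Hypothetical stand-in for a learning algorithm's validation error.
    return np.sin(3 * x) + 0.5 * x**2


rng = np.random.default_rng(0)
X_cand = np.linspace(-2, 2, 400)                    # candidate hyperparameter values
X_obs = rng.uniform(-2, 2, size=3)                  # a few random initial evaluations
y_obs = np.array([objective(x) for x in X_obs])

for _ in range(15):
    mu, sigma = gp_posterior(X_obs, y_obs, X_cand)
    ei = expected_improvement(mu, sigma, y_obs.min())
    x_next = X_cand[np.argmax(ei)]                  # run the next experiment where EI is largest
    X_obs = np.append(X_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

print("best x:", X_obs[np.argmin(y_obs)], "best value:", y_obs.min())
```

The cost-aware and parallel variants the abstract mentions modify only the acquisition step of this loop (e.g. expected improvement per second, or integrating over pending experiments); the surrogate-model machinery stays the same.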
[1] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117–129, 1978.
[2] D.R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
[3] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[4] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011.
[5] James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 25. 2011.
[6] Marc C. Kennedy and Anthony O’Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3), 2001.
[7] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.
[8] Nimalan Mahendran, Ziyu Wang, Firas Hamze, and Nando de Freitas. Adaptive MCMC with Bayesian optimization. In AISTATS, 2012.
[9] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[10] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. pre-print, 2010. arXiv:1012.2599.
[11] Carl E. Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[12] H. J. Kushner. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.
[13] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems 24, pages 1723–1731. 2010.
[14] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, 2005.
[15] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 22, 2008.
[16] David Ginsbourger and Rodolphe Le Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.
[17] Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 24, 2010.
[18] Kevin Miller, M. Pawan Kumar, Benjamin Packer, Danny Goodman, and Daphne Koller. Max-margin min-entropy models. In AISTATS, 2012.
[19] Chun-Nam John Yu and Thorsten Joachims. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[20] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 25. 2010.
[21] Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.
[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.
[23] Adam Coates and Andrew Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems 25. 2011.
[24] Dan Claudiu Ciresan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition, 2012.