jmlr jmlr2012 jmlr2012-95 jmlr2012-95-reference knowledge-graph by maker-knowledge-mining

95 jmlr-2012-Random Search for Hyper-Parameter Optimization


Source: pdf

Author: James Bergstra, Yoshua Bengio

Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent “High Throughput” methods achieve surprising success: they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
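The contrast the abstract draws can be made concrete with a small sketch. The code below is not from the paper; it uses a hypothetical validation-error function in which only one of two hyper-parameters (the learning rate) strongly affects the result, the regime the abstract describes. With the same budget of 16 trials, grid search probes only 4 distinct learning-rate values, while random search probes 16, so the hyper-parameter that matters is covered more densely.

import math
import random

# Minimal illustrative sketch (not code from the paper).
def validation_error(learning_rate, momentum):
    # Hypothetical objective: the error is dominated by the learning rate;
    # the momentum term barely matters.
    return (math.log10(learning_rate) - math.log10(0.03)) ** 2 \
           + 1e-6 * (momentum - 0.7) ** 2

# Grid search: 4 x 4 = 16 trials, but only 4 distinct learning-rate values.
grid_trials = [(lr, m)
               for lr in (0.001, 0.01, 0.1, 1.0)
               for m in (0.0, 0.5, 0.9, 0.99)]

# Random search: the same budget of 16 trials, but every trial draws a fresh
# learning rate, so the important hyper-parameter is sampled more densely.
rng = random.Random(0)
random_trials = [(10 ** rng.uniform(-3, 0), rng.uniform(0.0, 0.99))
                 for _ in range(16)]

for name, trials in (("grid", grid_trials), ("random", random_trials)):
    best = min(trials, key=lambda p: validation_error(*p))
    print(name, "best (lr, momentum):", best,
          "error:", validation_error(*best))

Running the sketch typically shows random search reaching a lower error for the same number of trials, because its 16 learning-rate values land closer to the (hypothetical) optimum than the 4 grid values can.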


reference text

I. A. Antonov and V. M. Saleev. An economic method of computing LPτ-sequences. USSR Computational Mathematics and Mathematical Physics, 19(1):252–256, 1979.
R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press, New Jersey, 1961.
Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. doi: 10.1561/2200000006.
Y. Bengio and X. Glorot. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10), pages 249–256, 2010.
J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral.
C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, London, UK, 1995.
P. Bratley, B. L. Fox, and H. Niederreiter. Implementation and tests of low-discrepancy sequences. Transactions on Modeling and Computer Simulation (TOMACS), 2(3):195–213, 1992.
R. E. Caflisch, W. Morokoff, and A. Owen. Valuation of mortgage backed securities using Brownian bridges to reduce effective dimension, 1997.
C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines, 2001.
I. Czogiel, K. Luebke, and C. Weihs. Response surface methodology for optimizing hyper parameters. Technical report, Universität Dortmund, Fachbereich Statistik, September 2005.
S. S. Drew and T. Homem de Mello. Quasi-Monte Carlo strategies for stochastic optimization. In Proc. of the 38th Conference on Winter Simulation, pages 774–782, 2006.
D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625–660, 2010.
J. H. Halton. On the efficiency of certain quasi-random sequences of points in evaluating multidimensional integrals. Numerische Mathematik, 2:84–90, 1960.
N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
G. E. Hinton. A practical guide to training restricted Boltzmann machines. Technical Report 2010-003, University of Toronto, 2010. Version 1.
G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.
F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech Report TR-2010-10.
S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
N. L. Kleinman, J. C. Spall, and D. Q. Naiman. Simulation-based optimization with stochastic approximation using common random numbers. Management Science, 45(11):1570–1578, November 1999. doi: 10.1287/mnsc.45.11.1570.
H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML’07), pages 473–480. ACM, 2007.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998a.
Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998b.
M. Galassi et al. GNU Scientific Library Reference Manual, 3rd edition, 2009.
M. D. McKay, R. J. Beckman, and W. J. Conover. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, May 1979. doi: 10.2307/1268522.
A. Nareyek. Choosing search heuristics by non-stationary reinforcement learning. Applied Optimization, 86:523–544, 2003.
R. M. Neal. Assessing relevance determination methods using DELVE. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 97–129. Springer-Verlag, 1998.
J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7:308–313, 1965.
M. J. D. Powell. A direct search optimization method that models the objective and constraint functions by linear interpolation. Advances in Optimization and Numerical Analysis, pages 51–67, 1994.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
I. Rechenberg. Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog, Stuttgart, 1973.
A. Srinivasan and G. Ramakrishnan. Parameter screening and optimisation for ILP using designed experiments. Journal of Machine Learning Research, 12:627–662, February 2011.
P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages 1096–1103. ACM, 2008.
T. Weise. Global Optimization Algorithms - Theory and Application. Self-published, second edition, 2009. Available online at http://www.it-weise.de/.