Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
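As a rough illustration of the two-point idea described in the abstract, the sketch below builds a gradient estimate from a pair of noisy function values at x and x + h·u, where u is a random perturbation direction, and plugs it into a stochastic-gradient-style loop. This is a minimal sketch under assumed choices (Gaussian perturbations, 1/√t step sizes, a fixed smoothing parameter h), not the paper's exact mirror-descent algorithm; all function and parameter names are illustrative.

```python
import numpy as np

def two_point_gradient_estimate(f, x, u, h):
    """Estimate a gradient of f at x from a pair of (possibly noisy) function
    values, using a random perturbation direction u and smoothing parameter h.
    Returns ((f(x + h*u) - f(x)) / h) * u."""
    return (f(x + h * u) - f(x)) / h * u

def gradient_free_sgd(f, x0, num_iters=1000, step_size=0.1, h=1e-3, seed=0):
    """Minimal sketch of a stochastic-gradient-style method driven only by
    pairs of function evaluations (no analytic gradients)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, num_iters + 1):
        u = rng.standard_normal(x.shape)           # random perturbation direction
        g = two_point_gradient_estimate(f, x, u, h)
        x = x - (step_size / np.sqrt(t)) * g       # decaying 1/sqrt(t) step sizes
    return x

# Example usage (hypothetical): minimize a noisy quadratic in d = 10 dimensions.
d = 10
target = np.ones(d)
noisy_quadratic = lambda x: 0.5 * np.sum((x - target) ** 2) + 0.01 * np.random.randn()
x_hat = gradient_free_sgd(noisy_quadratic, x0=np.zeros(d))
```

The factor-of-at-most-√d slowdown discussed in the abstract reflects the variance of such perturbation-based estimates relative to exact stochastic gradients.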
[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.
[2] A. Agarwal, D. P. Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, to appear, 2011. URL http://arxiv.org/abs/1107.1744.
[3] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, May 2012.
[4] K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1–58. MSRI Publications, 1997.
[5] P. L. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, and A. Tewari. High-probability regret bounds for bandit online linear optimization. In Proceedings of the Twenty First Annual Conference on Computational Learning Theory, 2008.
[6] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175, 2003.
[7] A. Ben-Tal, T. Margalit, and A. Nemirovski. The ordered subsets mirror descent optimization method with applications to tomography. SIAM Journal on Optimization, 12:79–108, 2001.
[8] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Variables and Random Processes, volume 188 of Translations of Mathematical Monographs. American Mathematical Society, 2000.
[9] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. In Advances in Neural Information Processing Systems 14, pages 359–366, 2002.
[10] A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Optimization, volume 8 of MPS-SIAM Series on Optimization. SIAM, 2009.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory, Second Edition. Wiley, 2006.
[12] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the Twenty Third Annual Conference on Computational Learning Theory, 2010.
[13] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.
[14] C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3), 2002.
[15] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, 1996.
[16] J. Kivinen and M. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, Jan. 1997.
[17] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, Second edition, 2003.
[18] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[19] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[20] Y. Nesterov. Random gradient-free minimization of convex functions. URL http://www.ecore.be/DPs/dp_1297333890.pdf, 2011.
[21] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley, 2003.
[22] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.
[23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[24] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
[25] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564–1599, 1999.
[26] B. Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423–435. Springer-Verlag, 1997.
[27] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.