nips2007-159
Source: pdf
Author: Jean-yves Audibert
Abstract: We consider the learning task of predicting as well as the best function in a finite reference set G, up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, then under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst · log|G| / n (1), where n denotes the size of the training set and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst / √n: it fails to achieve the expected Cst / n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.
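For concreteness, the progressive mixture rule referred to above is usually defined (see, e.g., [3, 5, 7, 9]) as the average of the exponentially weighted mixtures obtained after observing 0, 1, ..., n training examples. The sketch below illustrates this under the least square loss with a uniform prior over G; the function name progressive_mixture and the inverse temperature lam are illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

def progressive_mixture(G, X, Y, lam=1.0):
    """Minimal sketch of a progressive mixture rule for a finite set G.

    G   : list of prediction functions g: x -> float
    X, Y: training inputs and outputs (arrays of length n)
    lam : inverse temperature of the exponential weights (assumed parameter)
    """
    n = len(X)
    cum_loss = np.zeros(len(G))   # cumulative squared losses of each g in G
    weight_history = []           # normalized weights after 0, 1, ..., n examples

    for i in range(n + 1):
        # Exponential weights based on the first i examples (shifted for stability).
        w = np.exp(-lam * (cum_loss - cum_loss.min()))
        weight_history.append(w / w.sum())
        if i < n:
            preds = np.array([g(X[i]) for g in G])
            cum_loss += (preds - Y[i]) ** 2   # update with the (i+1)-th example

    # Averaging the n+1 weight vectors is equivalent to averaging the n+1
    # mixture predictors, since each mixture is linear in its weights.
    avg_w = np.mean(weight_history, axis=0)

    def g_hat(x):
        # Prediction of the aggregated rule: weight-averaged prediction over G.
        return float(np.dot(avg_w, [g(x) for g in G]))

    return g_hat

# Illustrative usage with two constant predictors (hypothetical data).
G = [lambda x: 0.0, lambda x: 1.0]
X = np.array([0.0, 1.0, 2.0])
Y = np.array([0.9, 1.1, 1.0])
print(progressive_mixture(G, X, Y, lam=2.0)(0.5))
```

This kind of rule is what satisfies the expectation bound (1); the paper's point is that its deviations around min_{g∈G} R(g) can still be of order 1/√n.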
[1] G. Lecué. Suboptimality of penalized empirical risk minimization in classification. In Proceedings of the 20th annual conference on Computational Learning Theory, 2007.
[2] A. Barron. Are Bayes rules consistent in information? In T.M. Cover and B. Gopinath, editors, Open Problems in Communication and Computation, pages 85–91. Springer, 1987.
[3] O. Catoni. A mixture approach to universal model selection. preprint LMENS 97-30, Available from http://www.dma.ens.fr/edition/preprints/Index.97.html, 1997.
[4] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Ann. Stat., 27(5):1564–1599, 1999.
[5] O. Catoni. Universal aggregation rules with exact bias bound. Preprint n.510, http://www.proba.jussieu.fr/mathdoc/preprints/index.html#1999, 1999.
[6] G. Blanchard. The progressive mixture estimator for regression trees. Ann. Inst. Henri Poincaré, Probab. Stat., 35(6):793–820, 1999.
[7] Y. Yang. Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74:135–161, 2000.
[8] F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical report, 2005.
[9] A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging. Preprint n.1034, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2005.
[10] J.-Y. Audibert. A randomized online learning algorithm for better variance control. In Proceedings of the 19th annual conference on Computational Learning Theory, pages 392–407, 2006.
[11] V.G. Vovk. Aggregating strategies. In Proceedings of the 3rd annual workshop on Computational Learning Theory, pages 371–386, 1990.
[12] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Trans. on Information Theory, 44(5):1906–1925, 1998.
[13] V.G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, pages 153–173, 1998.
[14] M. Wegkamp. Model selection in nonparametric regression. Ann. Stat., 31(1):252–273, 2003.
[15] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th annual conference on Computational Learning Theory, pages 173–187, 2005.
[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.
[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.