
159 nips-2007-Progressive mixture rules are deviation suboptimal


Source: pdf

Author: Jean-Yves Audibert

Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G, up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies

E R(ĝ) ≤ min_{g∈G} R(g) + Cst · (log |G|)/n,   (1)

where n denotes the size of the training set and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.
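To make the estimator in the abstract concrete, here is a minimal sketch of a progressive mixture rule for a finite reference set G under squared loss: it averages the Gibbs (exponentially weighted) predictors built on the growing prefixes of the training sample. The squared loss and the temperature parameter eta are assumptions made for this illustration; the paper's exact conditions on the loss and on the constant Cst are not reproduced here.

```python
import numpy as np

def progressive_mixture_predict(G, X_train, Y_train, x_new, eta=1.0):
    """Sketch of a progressive mixture prediction at x_new.

    G        : list of prediction functions g: x -> float (finite reference set)
    X_train  : sequence of n training inputs
    Y_train  : sequence of n training outputs (assumed bounded)
    eta      : temperature of the exponential weights (hypothetical choice;
               the theory ties it to the loss and boundedness constants)
    """
    n = len(X_train)
    # Per-example squared losses, one row per g in G, one column per training point.
    losses = np.array([[(Y_train[j] - g(X_train[j])) ** 2 for j in range(n)] for g in G])
    # Cumulative losses L_i(g) = sum_{j<=i} loss_j(g), for i = 0..n (column 0 is all zeros).
    cum = np.concatenate([np.zeros((len(G), 1)), np.cumsum(losses, axis=1)], axis=1)

    preds = np.array([g(x_new) for g in G])
    total = 0.0
    for i in range(n + 1):
        # Gibbs mixture over G based on the first i observations.
        w = np.exp(-eta * (cum[:, i] - cum[:, i].min()))  # shift for numerical stability
        w /= w.sum()
        total += w @ preds
    # Cesàro average of the n+1 Gibbs predictors: the progressive mixture prediction.
    return total / (n + 1)
```

For instance, with G = [lambda x: 0.0, lambda x: x] and a bounded regression sample, the returned value interpolates between the reference predictors according to their cumulative past performance; the paper's point is that this rule attains the Cst·(log |G|)/n rate in expectation but only a Cst/√n rate in deviation.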


reference text

[1] G. Lecué. Suboptimality of penalized empirical risk minimization in classification. In Proceedings of the 20th annual conference on Computational Learning Theory, 2007.

[2] A. Barron. Are Bayes rules consistent in information? In T.M. Cover and B. Gopinath, editors, Open Problems in Communication and Computation, pages 85–91. Springer, 1987.

[3] O. Catoni. A mixture approach to universal model selection. Preprint LMENS 97-30, available from http://www.dma.ens.fr/edition/preprints/Index.97.html, 1997.

[4] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Ann. Stat., 27(5):1564–1599, 1999.

[5] O. Catoni. Universal aggregation rules with exact bias bound. Preprint n.510, http://www.proba.jussieu.fr/mathdoc/preprints/index.html#1999, 1999.

[6] G. Blanchard. The progressive mixture estimator for regression trees. Ann. Inst. Henri Poincaré, Probab. Stat., 35(6):793–820, 1999.

[7] Y. Yang. Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74:135–161, 2000.

[8] F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical report, 2005.

[9] A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging. Preprint n.1034, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2005.

[10] J.-Y. Audibert. A randomized online learning algorithm for better variance control. In Proceedings of the 19th annual conference on Computational Learning Theory, pages 392–407, 2006.

[11] V.G. Vovk. Aggregating strategies. In Proceedings of the 3rd annual workshop on Computational Learning Theory, pages 371–386, 1990.

[12] D. Haussler, J. Kivinen, and M. K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.

[13] V.G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, pages 153–173, 1998.

[14] M. Wegkamp. Model selection in nonparametric regression. Ann. Stat., 31(1):252–273, 2003.

[15] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th annual conference on Computational Learning Theory, pages 173–187, 2005.

[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.