
NIPS 2012, Paper 217: Mixability in Statistical Learning


Source: pdf

Author: Tim van Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson

Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.
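As a quick orientation for the two notions named in the abstract, the displays below sketch the standard definitions in our own notation (loss $\ell$, prediction space $\mathcal{A}$, outcome $y$, model $\mathcal{F}$, data distribution $P$, learning rate $\eta > 0$); this is a hedged paraphrase for context, not a quotation of the paper's formal statements.

\[
\text{$\eta$-mixability of $\ell$:}\qquad
\forall\,\pi\;\exists\,a^{\star}\in\mathcal{A}:\quad
\ell(a^{\star},y)\;\le\;-\tfrac{1}{\eta}\log\!\int e^{-\eta\,\ell(a,y)}\,\pi(\mathrm{d}a)
\quad\text{for all outcomes } y,
\]

\[
\text{$\eta$-stochastic mixability of $(\ell,\mathcal{F},P)$:}\qquad
\mathbb{E}_{Z\sim P}\!\left[e^{-\eta\,(\ell(f,Z)-\ell(f^{*},Z))}\right]\;\le\;1
\quad\text{for all } f\in\mathcal{F},
\]

where $\pi$ ranges over distributions on predictions and $f^{*}\in\mathcal{F}$ minimizes the risk $\mathbb{E}_{P}[\ell(f,Z)]$. Under log-loss the second display is the martingale-type condition the abstract refers to, and with 0/1-loss it connects to the margin condition of Mammen and Tsybakov [7, 8].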


References

[1] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin/Heidelberg, 2004.

[2] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[3] O. Dekel and Y. Singer. Data-driven online to batch conversions. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS), pages 267–274, Cambridge, MA, 2006. MIT Press.

[4] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Conference on Learning Theory (COLT), 2009.

[5] Y. Kalnishkan and M. V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228–1244, 2008.

[6] V. Vovk. A game of prediction with expert advice. In Proceedings of the 8th Conference on Learning Theory (COLT), pages 51–60. ACM, 1995.

[7] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.

[8] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.

[9] J. L. Doob. Application of the theory of martingales. In Le Calcul des Probabilités et ses Applications. Colloques Internationaux du Centre National de la Recherche Scientifique, pages 23–27, Paris, 1949.

[10] A. Barron and T. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034–1054, 1991.

[11] T. Zhang. From ε-entropy to KL entropy: analysis of minimum information complexity density estimation. Annals of Statistics, 34(5):2180–2210, 2006.

[12] J. Li. Estimation of Mixture Models. PhD thesis, Yale University, 1999.

[13] B. Kleijn and A. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2), 2006.

[14] P. Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Conference on Learning Theory (COLT), 2011.

[15] A. Chernov, Y. Kalnishkan, F. Zhdanov, and V. Vovk. Supermartingales in prediction with expert advice. Theoretical Computer Science, 411:2647–2669, 2010.

[16] J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. Annals of Statistics, 37(4):1591–1646, 2009.

[17] E. Vernet, R. C. Williamson, and M. D. Reid. Composite multiclass losses. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.

[18] P. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.

[19] T. van Erven, M. Reid, and R. Williamson. Mixability is Bayes risk curvature relative to log loss. In Proceedings of the 24th Conference on Learning Theory (COLT), 2011.

[20] S. Arlot and P. L. Bartlett. Margin-adaptive model selection in statistical learning. Bernoulli, 17(2):687–713, 2011.

[21] T. Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.

[22] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[23] J.-Y. Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.

[24] O. Catoni. PAC-Bayesian Supervised Classification. Lecture Notes-Monograph Series. IMS, 2007.

[25] W. Lee, P. Bartlett, and R. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. Correction, Volume 54(9), 4395 (2008).

[26] A. N. Shiryaev. Probability. Springer-Verlag, 1996.

[27] J.-Y. Audibert. A better variance control for PAC-Bayesian classification. Preprint 905, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2004.