nips nips2007 nips2007-184 nips2007-184-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, as in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. The paper also illustrates their application to several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.
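For context alongside the abstract, the block below recalls, as a brief sketch, the standard definitions underlying the results described: uniform stability in the sense of Bousquet and Elisseeff [3], the β-mixing coefficient used for dependent observations (see Doukhan [5] and Yu [17]), and the i.i.d. stability bound of [3] that the paper's bounds strictly generalize. The symbols m (sample size), M (bound on the loss), β (stability coefficient), and δ (confidence parameter) follow the usual conventions of those references and are not taken from the abstract itself.

% Uniform stability [3]: a learning algorithm A has uniform stability \beta
% with respect to a loss \ell bounded by M if, for every training sample S
% of size m and every i in {1, ..., m},
\[
  \sup_{z}\,\bigl|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)\bigr| \;\le\; \beta,
\]
% where S^{\setminus i} denotes S with its i-th point removed.

% \beta-mixing coefficient of a stationary sequence (Z_t) [5, 17]:
\[
  \beta(k) \;=\; \sup_{n}\,
  \mathbb{E}_{Z_1^n}\Bigl[\,\sup_{A \in \sigma(Z_{n+k}^{\infty})}
  \bigl|\Pr[A \mid Z_1^n] - \Pr[A]\bigr|\Bigr],
\]
% which quantifies how quickly the dependence between observations weakens
% as the gap k between them grows.

% The i.i.d. uniform-stability bound of [3] that the paper generalizes:
% with probability at least 1 - \delta over an i.i.d. sample S of size m,
\[
  R(A_S) \;\le\; \widehat{R}(A_S) \;+\; 2\beta
  \;+\; (4m\beta + M)\sqrt{\frac{\ln(1/\delta)}{2m}},
\]
% where R is the generalization error and \widehat{R} the empirical error.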
[1] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Math. Ann., 97:1–59, 1927.
[2] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In NIPS 2000, 2001.
[3] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499–526, 2002.
[4] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25:601–604, 1979.
[5] P. Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.
[6] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152–162, 1997.
[7] L. Kontorovich and K. Ramanan. Concentration inequalities for dependent random variables via the martingale method, 2006.
[8] A. Lozano, S. Kulkarni, and R. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, 2006.
[9] D. Mattera and S. Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in kernel methods: support vector learning, pages 211–241. MIT Press, Cambridge, MA, 1999.
[10] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, 2000.
[11] D. Modha and E. Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117–133, 1998.
[12] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings of ICANN'97, LNCS, pages 999–1004. Springer, 1997.
[13] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML '98, pages 515–521. Morgan Kaufmann Publishers Inc., 1998.
[14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[15] V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[16] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer, 2003.
[17] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, Jan. 1994.