jmlr jmlr2010 jmlr2010-106 jmlr2010-106-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
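For context, a minimal sketch of the standard definitions behind the abstract's terminology, following Bousquet and Elisseeff (2002) for stability and Doukhan (1994) for mixing; this is an illustration of the usual formulations, and the exact variants used in the paper may differ in detail.

% Uniform (algorithmic) stability: an algorithm is beta-stable if changing a
% single training point changes the loss on any test point z by at most beta.
\[
\forall S,\ \forall i,\ \forall z:\quad
\bigl| c(h_S, z) - c(h_{S^i}, z) \bigr| \le \beta,
\]
% where h_S is the hypothesis returned on sample S and S^i differs from S in
% the i-th point only.
%
% Mixing coefficients for a stationary sequence (Z_t), with sigma-algebras
% sigma_1^t (the past up to time t) and sigma_{t+k}^infty (the future from t+k),
% and the sup over events B with Pr[B] > 0:
\[
\varphi(k) = \sup_{B \in \sigma_1^{t},\ A \in \sigma_{t+k}^{\infty}}
  \bigl| \Pr[A \mid B] - \Pr[A] \bigr|,
\qquad
\beta(k) = \mathbb{E}_{B \in \sigma_1^{t}}
  \Bigl[ \sup_{A \in \sigma_{t+k}^{\infty}} \bigl| \Pr[A \mid B] - \Pr[A] \bigr| \Bigr].
\]
% The sequence is phi-mixing (resp. beta-mixing) if phi(k) -> 0 (resp. beta(k) -> 0)
% as k -> infty, i.e., the dependence between observations weakens over time.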
Sergei Natanovich Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1–59, 1927.
Olivier Bousquet and André Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems, 2001.
Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
Jean-René Chazottes, Pierre Collet, Christof Külske, and Frank Redig. Concentration inequalities for random fields via coupling. Probability Theory and Related Fields, 137(1):201–225, 2007.
Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
Luc Devroye and T. J. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.
Paul Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.
Michael Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152–162, 1997.
Leo Kontorovich. Measure Concentration of Strongly Mixing Processes with Applications. PhD thesis, Carnegie Mellon University, 2007.
Leo Kontorovich and Kavita Ramanan. Concentration inequalities for dependent random variables via the martingale method. Annals of Probability, 36(6):2126–2158, 2008.
Aurélie Lozano, Sanjeev Kulkarni, and Robert Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In Advances in Neural Information Processing Systems, 2006.
Katalin Marton. Measure concentration for a class of random processes. Probability Theory and Related Fields, 110(3):427–439, 1998.
Davide Mattera and Simon Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in Kernel Methods: Support Vector Learning, pages 211–241. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.
Colin McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5–34, April 2000.
Dharmendra Modha and Elias Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117–133, 1998.
Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, 2007.
Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems (NIPS 2008), pages 1097–1104, Vancouver, Canada, 2009. MIT Press.
Klaus-Robert Müller, Alex Smola, Gunnar Rätsch, Bernhard Schölkopf, Jens Kohlmorgen, and Vladimir Vapnik. Predicting time series with support vector machines. In Proceedings of the International Conference on Artificial Neural Networks, Lecture Notes in Computer Science, pages 999–1004. Springer, 1997.
Paul-Marie Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. Annals of Probability, 28(1):416–461, 2000.
Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 515–521. Morgan Kaufmann Publishers Inc., 1998.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
Mathukumalli Vidyasagar. Learning and Generalization: with Applications to Neural Networks. Springer, 2003.
Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, January 1994.
Shuheng Zhou, John Lafferty, and Larry Wasserman. Time varying undirected graphs. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.