
252 nips-2013-Predictive PAC Learning and Process Decompositions


Source: pdf

Author: Cosma Shalizi, Aryeh Kontorovich

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest.
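As a rough sketch of the learnability notion described above (the notation here is ours, and this is a simplified, unconditional form; the paper's predictive-PAC definition conditions on the mixture component of the sample path rather than on past observations): a stationary process generating X_1, X_2, ... is learnable for a concept class \mathcal{F} of finite VC dimension when empirical means converge to their expectations uniformly over the class, in probability,

\[
  \sup_{f \in \mathcal{F}} \left| \frac{1}{n} \sum_{i=1}^{n} f(X_i) - \mathbb{E}[f(X_1)] \right| \;\xrightarrow{P}\; 0 \qquad (n \to \infty),
\]

with the expectation taken under the process's marginal distribution.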


reference text

[1] Mathukumalli Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag, Berlin, 1997.

[2] Patrizia Berti and Pietro Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics and Probability Letters, 32:385–391, 1997.

[3] Vladimir Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In Proceedings of the 2010 IEEE International Conference on Granular Computing (GrC 2010), pages 387–391, Los Alamitos, California, 2010. IEEE Computer Society. URL http://arxiv.org/abs/1006.1129.

[4] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22:94–116, 1994. URL http://projecteuclid.org/euclid.aop/1176988849.

[5] Mathukumalli Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, Berlin, second edition, 2003.

[6] Terrence M. Adams and Andrew B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. Annals of Probability, 38:1345–1367, 2010. URL http://arxiv.org/abs/1010.3162.

[7] Ramon van Handel. The universal Glivenko-Cantelli property. Probability Theory and Related Fields, 155:911–934, 2013. doi: 10.1007/s00440-012-0416-5. URL http://arxiv.org/abs/1009.4434.

[8] Dharmendra S. Modha and Elias Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44:117–133, 1998. doi: 10.1109/18.650998.

[9] Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39:5–34, 2000. URL http://www.ee.technion.ac.il/~rmeir/Publications/MeirTimeSeries00.pdf.

[10] Rajeeva L. Karandikar and Mathukumalli Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, 58:297–307, 2002. doi: 10.1016/S0167-7152(02)00124-4.

[11] David Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory, 49:338–345, 2003. doi: 10.1145/307400.307478.

[12] Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Y. Bengio, D. Schuurmans, John Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 [NIPS 2009], pages 1768–1776. MIT Press, Cambridge, Massachusetts, 2009. URL http://books.nips.cc/papers/files/nips22/NIPS2009_1061.pdf.

[13] Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11, 2010. URL http://www.jmlr.org/papers/v11/mohri10a.html.

[14] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-I.I.D. processes. In Daphne Koller, D. Schuurmans, Y. Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21 [NIPS 2008], pages 1097–1104, 2009. URL http://books.nips.cc/papers/files/nips21/NIPS2008_0419.pdf.

[15] Pierre Alquier and Olivier Wintenberger. Model selection for weakly dependent time series forecasting. Bernoulli, 18:883–913, 2012. doi: 10.3150/11-BEJ359. URL http://arxiv.org/abs/0902.2924.

[16] Ben London, Bert Huang, and Lise Getoor. Improved generalization bounds for large-scale structured prediction. In NIPS Workshop on Algorithmic and Statistical Approaches for Large Social Networks, 2012. URL http://linqs.cs.umd.edu/basilic/web/Publications/2012/london:nips12asalsn/.

[17] Ben London, Bert Huang, Benjamin Taskar, and Lise Getoor. Collective stability in structured prediction: Generalization from one example. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning [ICML-13], volume 28, pages 828–836, 2013. URL http://jmlr.org/proceedings/papers/v28/london13.html.

[18] Robert M. Gray. Probability, Random Processes, and Ergodic Properties. Springer-Verlag, New York, second edition, 2009. URL http://ee.stanford.edu/~gray/arp.html.

[19] E. B. Dynkin. Sufficient statistics and extreme points. Annals of Probability, 6:705–730, 1978. URL http://projecteuclid.org/euclid.aop/1176995424.

[20] Steffen L. Lauritzen. Extreme point models in statistics. Scandinavian Journal of Statistics, 11:65–91, 1984. URL http://www.jstor.org/pss/4615945. With discussion and response.

[21] Olav Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer-Verlag, New York, 2005.

[22] Persi Diaconis and David Freedman. De Finetti's theorem for Markov chains. Annals of Probability, 8:115–130, 1980. doi: 10.1214/aop/1176994828. URL http://projecteuclid.org/euclid.aop/1176994828.

[23] R. M. Dudley. A course on empirical processes. In École d'été de probabilités de Saint-Flour, XII–1982, volume 1097 of Lecture Notes in Mathematics, pages 1–142. Springer, Berlin, 1984.

[24] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36:929–965, 1989. doi: 10.1145/76359.76371. URL http://users.soe.ucsc.edu/~manfred/pubs/J14.pdf.

[25] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005. URL http://www.numdam.org/item?id=PS_2005__9__323_0.

[26] Frank B. Knight. A predictive view of continuous time processes. Annals of Probability, 3: 573–596, 1975. URL http://projecteuclid.org/euclid.aop/1176996302.

[27] William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity and learning. Neural Computation, 13:2409–2463, 2001. URL http://arxiv.org/abs/physics/0007070.

[28] Andrzej Lasota and Michael C. Mackey. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. Springer-Verlag, Berlin, 1994. First edition published as Probabilistic Properties of Deterministic Systems, Cambridge University Press, 1985.

[29] Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi, and Clémentine Prieur. Weak Dependence: With Examples and Applications. Springer, New York, 2007.

[30] Norbert Wiener. Nonlinear prediction and dynamics. In Jerzy Neyman, editor, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 3, pages 247–252, Berkeley, 1956. University of California Press. URL http://projecteuclid.org/euclid.bsmsp/1200502197.

[31] Paul Doukhan. Mixing: Properties and Examples. Springer-Verlag, New York, 1995.

[32] Richard C. Bradley. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005. URL http://arxiv.org/abs/math/0511078.

[33] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44:615–631, 1997. doi: 10.1145/263867.263927. URL http://tau.ac.il/~nogaa/PDFS/learn3.pdf.

[34] Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. Journal of Computer and System Sciences, 56:174–190, 1998. doi: 10.1006/jcss.1997.1557. URL http://www.phillong.info/publications/more_theorems.pdf.