jmlr jmlr2013 jmlr2013-35 jmlr2013-35-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning. Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence
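Illustration: The abstract describes the margin-adapted dimension as a simple function of the second-order statistics of the data distribution. As a rough sketch only (not the paper's verbatim procedure), the code below assumes the form k_gamma = min{k : k * gamma^2 >= sum_{i>k} lambda_i}, where lambda_1 >= lambda_2 >= ... are the eigenvalues of the uncentered second-moment matrix E[X X^T] in non-increasing order, and estimates this quantity from a finite sample; the function name and the synthetic data are illustrative assumptions.

import numpy as np

def margin_adapted_dimension(X, gamma):
    """Estimate the margin-adapted dimension from sample rows X (n x d).

    Assumes k_gamma = min{k : k * gamma^2 >= sum_{i>k} lambda_i}, with lambda_i
    the eigenvalues of the empirical uncentered second-moment matrix, sorted in
    decreasing order. Illustrative sketch, not the paper's exact definition.
    """
    n, d = X.shape
    second_moment = (X.T @ X) / n                   # empirical E[X X^T]
    lam = np.linalg.eigvalsh(second_moment)[::-1]   # eigenvalues, decreasing
    tail = np.cumsum(lam[::-1])[::-1]               # tail[k] = sum of lam[k:], 0-based
    for k in range(d + 1):
        tail_sum = tail[k] if k < d else 0.0        # eigenvalue mass beyond the top k
        if k * gamma**2 >= tail_sum:
            return k
    return d

# Usage example on anisotropic synthetic data with margin gamma = 1
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50)) * np.linspace(3.0, 0.1, 50)
print(margin_adapted_dimension(X, gamma=1.0))

The design intuition, per the abstract, is that directions of the second-moment spectrum with small eigenvalue mass relative to the margin do not contribute to the effective dimension, so k_gamma can be much smaller than the ambient dimension for anisotropic data.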
References:
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Machine Learning, 30(1):31–56, 1998.
Z. Bai and J. W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, second edition, 2010.
M. F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Journal of Computer and System Sciences, 75(1):78–89, 2009.
P. Bartlett. Lecture notes. http://www.cs.berkeley.edu/~bartlett/courses/281b-sp06/lecture25.ps, 2006. Unpublished.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? In Proceedings of the Twenty-First Annual Conference on Computational Learning Theory, pages 33–44, 2008.
G. M. Benedek and A. Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389, September 1991.
G. Bennett, V. Goodman, and C. M. Newman. Norms of random matrices. Pacific Journal of Mathematics, 59(2):359–365, 1975.
A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, pages 199–207, 2010.
O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
V. V. Buldygin and Y. V. Kozachenko. Metric Characterization of Random Variables and Random Processes. American Mathematical Society, 1998.
R. M. Dudley. Central limit theorems for empirical measures. The Annals of Probability, 6(6):899–929, 1978.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 139–154, August 1988.
C. Gentile and D. P. Helmbold. Improved lower bounds for learning from noisy examples: An information-theoretic approach. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT), pages 104–115, 1998.
M. J. Kearns and R. E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.
P. Liang and N. Srebro. On the interaction between norm and dimensionality: Multiple regimes in learning. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML), 2010.
S. Mendelson. Rademacher averages and phase transitions in Glivenko-Cantelli classes. IEEE Transactions on Information Theory, 48(1):251–263, 2002.
F. L. Nazarov and A. Podkorytov. Ball, Haagerup, and distribution functions. Operator Theory: Advances and Applications, 113 (Complex Analysis, Operators, and Related Topics):247–267, 2000.
A. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), 2004.
R. Paley and A. Zygmund. A note on analytic functions in the unit circle. Proceedings of the Cambridge Philosophical Society, 28:266–272, 1932.
D. Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984.
M. Rudelson and R. Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
M. Rudelson and R. Vershynin. The smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62:1707–1739, 2009.
S. Sabato, N. Srebro, and N. Tishby. Tight sample complexity of large-margin learning. In Advances in Neural Information Processing Systems 23 (NIPS), pages 2038–2046, 2010.
B. Schölkopf, J. Shawe-Taylor, A. J. Smola, and R. Williamson. Generalization bounds via eigenvalues of the Gram matrix. Technical Report NC2-TR-1999-035, NeuroCOLT2, 1999.
I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35(2):575–607, 2007.
V. N. Sudakov. Gaussian processes and measures of solid angles in Hilbert space. Soviet Mathematics Doklady, 12:412–415, 1971.
V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264–280, 1971.
V. N. Vapnik and A. Y. Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow, 1974. (In Russian).
N. Vayatis and R. Azencott. Distribution-dependent Vapnik-Chervonenkis bounds. In EuroCOLT '99, pages 230–240, London, UK, 1999. Springer-Verlag.
T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.