
45 jmlr-2007-Learnability of Gaussians with Flexible Variances


Source: pdf

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss.

Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number
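For orientation, here is a minimal sketch, in standard learning-theory notation, of the objects the abstract refers to. The symbols and normalizations below are illustrative choices, not copied from the paper; the paper's own definitions may differ in details such as the exact Gaussian exponent or how the minimum over the variance parameter is taken.

The Gaussian kernels with flexible variances and the union of their unit balls are

\[
K_\sigma(x, t) = \exp\!\left(-\frac{|x - t|^2}{\sigma^2}\right), \qquad \sigma \in (0, \infty),
\qquad
\mathcal{B} = \bigcup_{\sigma > 0} \left\{ f \in \mathcal{H}_{K_\sigma} : \|f\|_{K_\sigma} \le 1 \right\}.
\]

The uGC property asserts that empirical means converge to true means uniformly over this class and over all distributions:

\[
\sup_{f \in \mathcal{B}} \left| \mathbb{E}_\rho f - \frac{1}{m} \sum_{i=1}^{m} f(x_i) \right| \longrightarrow 0
\quad \text{in probability as } m \to \infty,
\]

uniformly over probability measures \(\rho\) on the input space. A multi-kernel regularization scheme over this family, of the kind the abstract analyzes, can be written schematically as

\[
f_{\mathbf{z}} = \arg\min_{\sigma > 0} \; \min_{f \in \mathcal{H}_{K_\sigma}}
\left\{ \frac{1}{m} \sum_{i=1}^{m} \ell\big(f(x_i), y_i\big) + \lambda \|f\|_{K_\sigma}^2 \right\},
\]

with \(\ell(f(x), y) = (f(x) - y)^2\) for least square regression and \(\ell(f(x), y) = (1 - y f(x))_+\) for classification with the hinge loss.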


Reference text

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence and learnability. Journal of the ACM, 44:615–631, 1997.
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
N. Cristianini, J. Shawe-Taylor, and C. Campbell. Dynamically adapting kernels in support vector machines. In Advances in Neural Information Processing Systems 11 (M. S. Kearns, S. A. Solla, and D. A. Cohn, eds.), MIT Press, 1999.
D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: Error analysis. Journal of Machine Learning Research, 5:1143–1175, 2004.
O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46:131–159, 2002.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.
F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, to appear, 2007.
E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5:59–85, 2006.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1997.
R. M. Dudley. Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63, Cambridge University Press, 1999.
R. M. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4:485–510, 1991.
T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the 10th SIGKDD Conference on Knowledge Discovery and Data Mining, 2004. http://www.cs.ucl.ac.uk/staff/M.Pontil/reading/mt-kdd.pdf
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47:1902–1914, 2001.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30:1–50, 2002.
G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, New York, 1991.
J. Li and A. Barron. Mixture density estimation. In Advances in Neural Information Processing Systems 12 (S. A. Solla, T. K. Leen, and K.-R. Müller, eds.), MIT Press, 2000.
P. Niyogi and F. Girosi. On the relationships between generalization error, hypothesis complexity and sample complexity for radial basis functions. Neural Computation, 8:819–842, 1996.
V. S. Pugachev and I. N. Sinitsyn. Lectures on Functional Analysis and Applications. World Scientific, Singapore, 1999.
A. Rakhlin, D. Panchenko, and S. Mukherjee. Risk bounds for mixture density estimation. ESAIM: Probability and Statistics, 9:220–229, 2005.
B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proceedings of the 14th Annual Conference on Computational Learning Theory, Lecture Notes in Artificial Intelligence 2111, pages 416–426, 2001.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44:1926–1940, 1998.
S. Smale and D. X. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1:17–41, 2003.
S. Smale and D. X. Zhou. Shannon sampling and function reconstruction from point values. Bulletin of the American Mathematical Society, 41:279–305, 2004.
S. Smale and D. X. Zhou. Shannon sampling II. Connections to learning theory. Applied and Computational Harmonic Analysis, 19:285–302, 2005.
E. M. Stein. Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton, New Jersey, 1970.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
I. Steinwart and C. Scovel. Fast rates for support vector machines. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.
A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32:135–166, 2004.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, 1996.
G. Wahba. Spline Models for Observational Data. SIAM, 1990.
Q. Wu, Y. Ying, and D. X. Zhou. Multi-kernel regularized classifiers. Journal of Complexity, forthcoming.
G. B. Ye and D. X. Zhou. Learning and approximation by Gaussians on Riemannian manifolds. Advances in Computational Mathematics, forthcoming.
K. Yosida. Functional Analysis. 6th edition, Springer-Verlag, 1980.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
D. X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739–767, 2002.
D. X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Transactions on Information Theory, 49:1743–1752, 2003.
D. X. Zhou and K. Jetter. Approximation with polynomial kernels and SVM classifiers. Advances in Computational Mathematics, 25:323–344, 2006.