
48 jmlr-2005-Learning the Kernel Function via Regularization


Source: pdf

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
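
To make the formulation concrete, here is a minimal sketch (not the authors' algorithm) of learning a convex combination of finitely many basic kernels under square-loss regularization. It assumes the standard fact that, for a fixed Gram matrix K, the optimal regularized value is proportional to y^T (K + mu I)^{-1} y, which is convex in K, so the kernel weights can be found by projected gradient descent over the simplex. The function names (learn_kernel_weights, simplex_projection) and all parameter values are hypothetical.

```python
# A minimal sketch (not the paper's algorithm) of learning a convex
# combination of basic kernels under square-loss regularization.
import numpy as np

def simplex_projection(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / j > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def learn_kernel_weights(kernels, y, mu=0.1, lr=0.01, iters=500):
    """Minimize Q(w) = y^T (sum_i w_i K_i + mu I)^{-1} y over the simplex.

    Q is convex in the kernel, so projected gradient descent converges
    to a global minimizer.

    kernels : list of (m, m) positive semi-definite Gram matrices
    y       : (m,) vector of training targets
    """
    m = len(y)
    w = np.full(len(kernels), 1.0 / len(kernels))   # start at the barycenter
    for _ in range(iters):
        K = sum(wi * Ki for wi, Ki in zip(w, kernels))
        c = np.linalg.solve(K + mu * np.eye(m), y)  # c = (K + mu I)^{-1} y
        grad = np.array([-(c @ Ki @ c) for Ki in kernels])  # dQ/dw_i
        w = simplex_projection(w - lr * grad)
    return w

# Toy usage: two Gaussian kernels of different widths on 1-D data.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(20, 1))
y = np.sin(3.0 * x[:, 0])
gram = lambda s: np.exp(-((x - x.T) ** 2) / s)
print(learn_kernel_weights([gram(0.1), gram(1.0)], y))
```

Consistent with the m + 2 bound in the abstract, one would expect many of the learned weights to vanish at the optimum; in this toy setting the weight vector simply indicates which kernel width better fits the data.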


Reference text

A. Argyriou, C. A. Micchelli and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. Proc. of the 18th Annual Conference on Learning Theory (COLT'05), Bertinoro, Italy, June 2005.

N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68: 337–404, 1950.

J. P. Aubin. Mathematical Methods of Game and Economic Theory. Studies in Mathematics and its Applications, Vol. 7, North-Holland, 1982.

F. R. Bach, G. R. G. Lanckriet and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. Proc. of the Int. Conf. on Machine Learning (ICML'04), 2004.

F. R. Bach, R. Thibaux and M. I. Jordan. Computing regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems, 17, 2004.

C. Bennett and R. Sharpley. Interpolation of Operators. Pure and Applied Mathematics, Vol. 129, Academic Press, Boston, 1988.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS (Canadian Mathematical Society), Springer-Verlag, New York, 2000.

O. Bousquet and A. Elisseeff. Stability and generalization. J. of Machine Learning Research, 2: 499–526, 2002.

O. Bousquet and D. J. L. Herrmann. On the complexity of learning the kernel matrix. Advances in Neural Information Processing Systems, 15, 2003.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1): 131–159, 2002.

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, J. Kandola. On kernel-target alignment. Advances in Neural Information Processing Systems, 14, T. G. Dietterich, S. Becker, Z. Ghahramani (eds.), 2002.

F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39(1): 1–49, 2002.

E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, A. Verri. Some properties of regularized kernel methods. J. of Machine Learning Research, 5(Oct): 1363–1390, 2004.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13: 1–50, 2000.

F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6): 1455–1480, 1998.

T. Graepel. Kernel matrix completion by semi-definite programming. Proc. of ICANN, pages 694–699, 2002.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2002.

T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. J. of Machine Learning Research, 5: 1391–1415, 2004.

M. Herbster. Learning additive models online with fast evaluating kernels. Proc. of the 14th Annual Conference on Computational Learning Theory (COLT), pages 444–460, 2001.

M. Herbster. Relative loss bounds and polynomial-time predictions for the K-LMS-NET algorithm. Proc. of the 15th Int. Conference on Algorithmic Learning Theory, October 2004.

T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. MIT AI-Lab Technical Report, 1999.

G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33: 82–95, 1971.

G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan. Learning the kernel matrix with semi-definite programming. In C. Sammut and A. Hoffmann (eds.), Proc. of the 19th Int. Conf. on Machine Learning, Sydney, Australia, Morgan Kaufmann, 2002.

G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, M. I. Jordan. Learning the kernel matrix with semi-definite programming. J. of Machine Learning Research, 5: 27–72, 2004.

Y. Lee, Y. Kim, S. Lee and J.-Y. Koo. Structured multicategory support vector machine with ANOVA decomposition. Technical Report No. 743, Department of Statistics, The Ohio State University, October 2004.

Y. Lin and H. H. Zhang. Component selection and smoothing in smoothing spline analysis of variance models – COSSO. Institute of Statistics Mimeo Series 2556, NCSU, January 2003.

O. L. Mangasarian. Nonlinear Programming. Classics in Applied Mathematics, SIAM, 1994.

A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, San Diego, 1979.

C. A. Micchelli. Saturation Classes and Iterates of Operators. PhD thesis, Stanford University, 1969.

C. A. Micchelli and M. Pontil. A function representation for learning in Banach spaces. Proc. of the 17th Annual Conf. on Learning Theory (COLT'04), Banff, Alberta, June 2004.

C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17: 177–204, 2005.

C. A. Micchelli, M. Pontil, Q. Wu, and D. X. Zhou. Error bounds for learning the kernel. Research Note 05/09, Dept. of Computer Science, University College London, June 2005.

S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, to appear, 2004.

C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. Advances in Neural Information Processing Systems, 15, S. Becker, S. Thrun, K. Obermayer (eds.), MIT Press, Cambridge, MA, 2003.

M. Pontil and A. Verri. Properties of support vector machines. Neural Computation, 10: 955–974, 1998.

R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.

H. L. Royden. Real Analysis. Macmillan Publishing Company, New York, 3rd edition, 1988.

I. J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39(4): 811–841, 1938.

B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, USA, 2002.

C. Scovel and I. Steinwart. Fast rates for support vector machines. Preprint, 2004.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

S. Smale and D. X. Zhou. Estimating the approximation error in learning theory. Anal. Appl., 1: 1–25, 2003.

D. M. J. Tax and R. P. W. Duin. Support vector domain description. Pattern Recognition Letters, 20: 1191–1199, 1999.

V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.

C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. Advances in Neural Information Processing Systems, 8: 598–604, D. S. Touretzky, M. C. Mozer, M. E. Hasselmo (eds.), MIT Press, Cambridge, MA, 1996.

Q. Wu, Y. Ying and D. X. Zhou. Multi-kernel regularization classifiers. Preprint, City University of Hong Kong, 2004.

Y. M. Ying and D. X. Zhou. Learnability of Gaussians with flexible variances. Preprint, City University of Hong Kong, 2004.

T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32: 56–85, 2004.

T. Zhang. On the dual formulation of regularized linear systems with convex risks. Machine Learning, 46: 91–129, 2002.

Z. Zhang, D.-Y. Yeung and J. T. Kwok. Bayesian inference for transductive learning of kernel matrix using the Tanner-Wong data augmentation algorithm. Proc. of the 21st Int. Conf. on Machine Learning (ICML-2004), pages 935–942, Banff, Alberta, Canada, July 2004.

D. X. Zhou. The covering number in learning theory. J. of Complexity, 18: 739–767, 2002.