Source: pdf
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features, and it encompasses the set of radial kernels. We show how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. We prove that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem, which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of the QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion whose size is determined by the number of cosine kernels rather than the number of support vectors, leading to a remarkable speedup at run time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method.
Keywords: kernel learning, translation invariant kernels
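The rank-two property of the cosine kernel mentioned in the abstract is what allows the QCQP sub-problems to be handled without loading full kernel matrices into memory. The following is a minimal numerical sketch (not the authors' implementation) of that property, assuming a single frequency vector w and synthetic data X, a: it checks that the Gram matrix of k_w(x, z) = cos(w^T(x - z)) factors as c c^T + s s^T with c_i = cos(w^T x_i) and s_i = sin(w^T x_i), so matrix-vector products need only O(n) memory.

import numpy as np

# Sketch of the rank-two structure of a single cosine kernel
# k_w(x, z) = cos(w^T (x - z)); X, w, and a below are hypothetical examples.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.standard_normal((n, d))        # synthetic inputs
w = rng.standard_normal(d)             # one frequency vector
a = rng.standard_normal(n)             # arbitrary coefficient vector

p = X @ w                              # projections w^T x_i
K = np.cos(p[:, None] - p[None, :])    # explicit n-by-n Gram matrix (O(n^2) memory)

# cos(u - v) = cos(u)cos(v) + sin(u)sin(v), so K = c c^T + s s^T (rank two).
c, s = np.cos(p), np.sin(p)
Ka_lowrank = c * (c @ a) + s * (s @ a)   # K @ a using only O(n) memory

print(np.linalg.matrix_rank(K))          # 2
print(np.allclose(K @ a, Ka_lowrank))    # True

The same identity underlies the run-time speedup noted in the abstract: a classifier built from m cosine kernels can be evaluated with 2m trigonometric features per test point, independent of the number of support vectors.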
S. Amari and S. Wu. Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12(6):783–789, 1999.
E.D. Andersen and K.D. Andersen. The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In H. Frenk, K. Roos, T. Terlaky, and S. Zhang, editors, High Performance Optimization, pages 197–232. Kluwer Academic Publishers, 2000.
A. Argyriou, C.A. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th Conference on Learning Theory, volume 18, pages 338–352, 2005.
A. Argyriou, R. Hauser, C.A. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning, volume 23, pages 338–352, Pittsburgh, PA, 2006.
F.R. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, volume 21, pages 41–48, Banff, Canada, 2004. Omnipress.
S. Bochner. Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Mathematische Annalen, 108:378–410, 1933.
S. Bochner. Harmonic Analysis and the Theory of Probability. University of California Press, Los Angeles, California, 1955.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
D.J. Crisp and C.J.C. Burges. A geometric interpretation of ν-SVM classifiers. In Advances in Neural Information Processing Systems, volume 12, pages 244–250, Cambridge, MA, 1999. MIT Press.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
N. Cristianini, C. Campbell, and J. Shawe-Taylor. Dynamically adapting kernels in support vector machines. In Advances in Neural Information Processing Systems 11, pages 204–210. MIT Press, 1998.
F. Cucker and D.X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, UK, 2007.
J.M. Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
P.V. Gehler and S. Nowozin. Infinite kernel learning. In Proceedings of the NIPS 2008 Workshop on "Kernel Learning: Automatic Selection of Optimal Kernels", pages 1–4, 2008.
M.G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299–312, 2001.
F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines. A.I. Memo 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, June 1993.
T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Computation, 17:2099–2105, 2005.
R. Hettich and K.O. Kortanek. Semi-infinite programming: theory, methods, and applications. SIAM Review, 35(3):380–429, 1993.
T. Joachims. Making large-scale support vector machine learning practical. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press, 1999.
E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, New York, 1989.
G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
D.C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989.
M.E. Mavroforakis and S. Theodoridis. A geometric approach to support vector machine (SVM) classification. IEEE Transactions on Neural Networks, 17(3):671–682, 2006.
C.A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
C.A. Micchelli, M. Pontil, Q. Wu, and D.X. Zhou. Error bounds for learning the kernel. Research Note RN/05/09, Department of Computer Science, University College London, June 2005.
J. Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.
J. Nocedal and S.J. Wright. Numerical Optimization, 2nd edition. Springer, New York, 2006.
C.S. Ong, A.J. Smola, and R.C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
J.C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, 1999.
A. Rakotomamonjy, F.R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
G. Rätsch, T. Onoda, and K.R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
R. Reemtsen and S. Görner. Numerical methods for semi-infinite programming: A survey. In R. Reemtsen and J.-J. Rückmann, editors, Semi-Infinite Programming, pages 195–275. Kluwer Academic Publishers, 1998.
R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, New Jersey, 1970.
H.L. Royden. Real Analysis, 3rd edition. Macmillan Publishing Company, New York, 1988.
W. Rudin. Functional Analysis, 2nd edition. McGraw-Hill, New York, 1991.
W. Rudin. Real and Complex Analysis, 3rd edition. McGraw-Hill, New York, 1987.
I.J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, 39(4):811–841, 1938.
B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, P. Knirsch, A. Smola, and C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. In DAGM Symposium Mustererkennung, Springer Lecture Notes in Computer Science, 1998.
S. Sonnenburg, G. Rätsch, and C. Schäfer. A general and efficient multiple kernel learning algorithm. In Advances in Neural Information Processing Systems, volume 18, pages 1275–1282, Cambridge, MA, 2005. MIT Press.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1567, 2006.
V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
V.N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the kernel in the empirical feature space. IEEE Transactions on Neural Networks, 16(2):460–474, 2005.
Z. Xu, R. Jin, I. King, and M.R. Lyu. An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems, volume 21, pages 1825–1832, British Columbia, Canada, 2008. MIT Press.
Y. Ying and D.X. Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8:249–276, 2007.