
49 jmlr-2005 - Learning the Kernel with Hyperkernels (Kernel Machines Section)


Source: pdf

Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson

Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of applying the method to common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach.

Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
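As a brief illustration of the construction described in the abstract, the following is a minimal sketch in LaTeX notation (symbols chosen here for exposition, not quoted verbatim from the paper). The kernel is itself the solution of a regularized problem over a hyper-reproducing kernel Hilbert space \underline{H} induced by a hyperkernel \underline{k} on X^2 \times X^2; for m training pairs, the problem and the corresponding representer theorem for the kernel take the form

    \min_{k \in \underline{H}} \; Q_{\mathrm{emp}}(k, X, Y) + \frac{\lambda_Q}{2} \, \|k\|_{\underline{H}}^2,
    \qquad
    k(x, x') = \sum_{i,j=1}^{m} \beta_{ij} \, \underline{k}\bigl((x_i, x_j), (x, x')\bigr),

where Q_emp is an empirical quality functional (for instance, the regularized risk of an SVM using kernel k) and the coefficients \beta_{ij} are optimized jointly with the SVM variables, leading to the semidefinite programming formulation mentioned above.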


reference text

A. Albert. Conditions for positive and nonnegative definiteness in terms of pseudoinverses. SIAM Journal on Applied Mathematics, 17(2):434–440, 1969.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1:23–34, 1992.
C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
O. Bousquet and D. Herrmann. On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems 15, pages 399–406, 2002.
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676–1695, 1990.
K. Crammer, J. Keshet, and Y. Singer. Kernel design using boosting. In Advances in Neural Information Processing Systems 15, pages 537–544, 2002.
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 367–373, Cambridge, MA, 2002. MIT Press.
N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical report, UC Davis Department of Statistics, 2003.
K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51:41–59, 2003.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, December 2001. http://www.jmlr.org.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the International Conference on Machine Learning, pages 148–156. Morgan Kaufmann Publishers, 1996.
D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.
R. Herbrich and R. C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175–212, 2002.
G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semidefinite programming. In Proceedings of the International Conference on Machine Learning, pages 323–330. Morgan Kaufmann, 2002.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.
J. Löfberg. YALMIP, yet another LMI parser, 2002. http://www.control.isy.liu.se/~johanl/yalmip.html.
A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition (in Russian). Technicheskaya Kibernetica, 3, 1969.
D. J. C. MacKay. Bayesian non-linear modelling for the energy prediction competition. ASHRAE Transactions, 4:448–472, 1994.
O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001.
D. Meyer, F. Leisch, and K. Hornik. The support vector machine under test. Neurocomputing, 55(1–2):169–186, 2003.
R. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
C. S. Ong and A. J. Smola. Machine learning using hyperkernels. In Proceedings of the International Conference on Machine Learning, pages 568–575, 2003.
C. S. Ong, A. J. Smola, and R. C. Williamson. Hyperkernels. In Neural Information Processing Systems, volume 15, pages 495–502. MIT Press, 2002.
M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 311–326, Cambridge, MA, 2000. MIT Press.
G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
B. Schölkopf, J. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
M. Seeger. Bayesian methods for support vector machines and Gaussian processes. Master's thesis, University of Edinburgh, Division of Informatics, 1999.
A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11(5):637–649, 1998.
J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11/12(1–4):625–653, 1999.
K. Tsuda, S. Akaho, and K. Asai. The EM algorithm for kernel matrix completion with auxiliary data. Journal of Machine Learning Research, 4:67–81, 2003.
L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.
V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press.
C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
T. Zhang. Some sparse approximation bounds for regression problems. In Proceedings of the 18th International Conference on Machine Learning, pages 624–631. Morgan Kaufmann, San Francisco, CA, 2001.