jmlr jmlr2011 jmlr2011-105 jmlr2011-105-reference knowledge-graph by maker-knowledge-mining

105 jmlr-2011-lp-Norm Multiple Kernel Learning

Source: pdf

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efﬁcient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artiﬁcial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN

reference text

T. Abeel, Y. Van de Peer, and Y. Saeys. Towards a gold standard for promoter prediction evaluation. Bioinformatics, 2009. J. Aﬂalo, A. Ben-Tal, C. Bhattacharyya, J. Saketha Nath, and S. Raman. Variable sparsity kernel learning—algorithms and applications. Journal of Machine Learning Research, 2011. To appear. 991 K LOFT, B REFELD , S ONNENBURG AND Z IEN A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008. F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 105–112, 2009. F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. 21st ICML. ACM, 2004. V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano. Promoter prediction analysis on the whole human genome. Nature Biotechnology, 22(11):1467–1473, 2004. P.L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, November 2002. D.P. Bertsekas. Nonlinear Programming, Second Edition. Athena Scientiﬁc, Belmont, MA, 1999. K. Bleakley, G. Biau, and J.-P. Vert. Supervised reconstruction of biological networks with local models. Bioinformatics, 23:i57–i65, 2007. O. Bousquet and D.J.L. Herrmann. On the complexity of learning the kernel matrix. In Advances in Neural Information Processing Systems, 2002. O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar R¨ tsch, editors, Advanced Lectures on Machine a Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin / Heidelberg, 2004. S. Boyd and L. Vandenberghe. Convex Optimization. Cambrigde University Press, Cambridge, UK, 2004. O. Chapelle. Training a support vector machine in the primal. Neural Computation, 2006. O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002. C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, and A. Rostamizadeh. Proceedings of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008. URL http: //www.cs.nyu.edu/learning_kernels. C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the International Conference on Uncertainty in Artiﬁcial Intelligence, 2009a. C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 396–404, 2009b. 992 ℓ p -N ORM M ULTIPLE K ERNEL L EARNING C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings, 27th ICML, 2010a. C. Cortes, M. Mohri, and A. Rostamizadeh. Two-stage learning kernel algorithms. In Proceedings of the 27th Conference on Machine Learning (ICML 2010), 2010b. N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel-target alignment. In Advances in Neural Information Processing Systems, 2002. L. Devroye, L. Gy¨ rﬁ, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 o in Applications of Mathematics. Springer, New York, 1996. R. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, and C.J. Lin. LIBLINEAR: A library for large linear classiﬁcation. Journal of Machine Learning Research, 9:1871–1874, 2008. R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using the second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005. V. Franc and S. Sonnenburg. OCAS optimized cutting plane algorithm for support vector machines. In Proceedings of the 25nd International Machine Learning Conference. ACM Press, 2008. P. V. Gehler and S. Nowozin. Inﬁnite kernel learning. In Proceedings of the NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008. A. Gelman. Causality and statistical learning. American Journal of Sociology, 0, 2010. G.H. Golub and C.F. van Loan. Matrix Computations. John Hopkins University Press, Baltimore, London, 3rd edition, 1996. M. G¨ nen and E. Alpaydin. Localized multiple kernel learning. In ICML ’08: Proceedings of the o 25th international conference on Machine learning, pages 352–359, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: http://doi.acm.org/10.1145/1390156.1390201. V.K. Ivanov, V.V. Vasin, and V.P. Tanana. Theory of Linear Ill-Posed Problems and its application. VSP, Zeist, 2002. S. Ji, L. Sun, R. Jin, and J. Ye. Multi-label multiple kernel learning. In Advances in Neural Information Processing Systems, 2009. T. Joachims. Making large–scale SVM learning practical. In B. Sch¨ lkopf, C.J.C. Burges, and o A.J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press. S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Applications of strong convexity–strong smoothness duality to learning with matrices. CoRR, abs/0910.0610, 2009. M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res, 32:D277–D280, 2004. G. Kimeldorf and G. Wahba. Some results on tchebychefﬁan spline functions. J. Math. Anal. Applic., 33:82–95, 1971. 993 K LOFT, B REFELD , S ONNENBURG AND Z IEN M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel learning. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Kernels, dec 2008. M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨ ller, and A. Zien. Efﬁcient and accurate u lp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 997–1005. MIT Press, 2009a. M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In W. L. Buntine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), pages 692–704, 2009b. M. Kloft, U. R¨ ckert, and P. L. Bartlett. A unifying view of multiple kernel learning. In Proceedu ings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2010. To appear. ArXiv preprint: http://arxiv.org/abs/1005.0437. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classiﬁers. Annals of Statistics, 30:1–50, 2002. G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-deﬁnite programming. JMLR, 5:27–72, 2004. D.C. Liu and J. Nocedal. On the limited memory method for large scale optimization. Mathematical Programming B, 45(3):503–528, 1989. P. C. Mahalanobis. On the generalised distance in statistics. In Proceedings National Institute of Science, India, volume 2, no. 1, April 1936. M. Markou and S. Singh. Novelty detection: a review – part 1: statistical approaches. Signal Processing, 83:2481–2497, 2003a. M. Markou and S. Singh. Novelty detection: a review – part 2: neural network based approaches. Signal Processing, 83:2499–2521, 2003b. C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005. K.-R. M¨ ller, S. Mika, G. R¨ tsch, K. Tsuda, and B. Sch¨ lkopf. An introduction to kernel-based u a o learning algorithms. IEEE Neural Networks, 12(2):181–201, May 2001. S. Nash and A. Sofer. Linear and Nonlinear Programming. McGraw-Hill, New York, NY, 1996. J. S. Nath, G. Dinesh, S. Ramanand, C. Bhattacharyya, A. Ben-Tal, and K. R. Ramakrishnan. On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 844–852, 2009. A. Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15:229–251, 2004. 994 ℓ p -N ORM M ULTIPLE K ERNEL L EARNING C. S. Ong and A. Zien. An Automated Combination of Kernels for Predicting Protein Subcellular Localization. In Proc. of the 8th Workshop on Algorithms in Bioinformatics, 2008. C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005. ¨ o u S. Oz¨ g¨ r-Aky¨ z and G.W. Weber. Learning with inﬁnitely many kernels via semi-inﬁnite prou gramming. In Proceedings of Euro Mini Conference on Continuous Optimization and Knowledge Based Technologies, 2008. J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Sch¨ lkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods — Support o Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press. A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. More efﬁciency in multiple kernel learning. In ICML, pages 775–782, 2007. A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008. R. M. Rifkin and R. A. Lippert. Value regularization and Fenchel duality. J. Mach. Learn. Res., 8: 441–479, 2007. V. Roth and B. Fischer. Improved functional prediction of proteins by learning kernel combinations in multilabel settings. BMC Bioinformatics, 8(Suppl 2):S12, 2007. ISSN 1471-2105. V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efﬁcient algorithms. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML 2008), volume 307, pages 848–855. ACM, 2008. E. Rubinstein. Support vector machines via advanced optimization techniques. Master’s thesis, Faculty of Electrical Engineering, Technion, 2005, Nov 2005. W. Rudin. Functional Analysis. McGraw-Hill, 1991. B. Sch¨ lkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. o B. Sch¨ lkopf, A.J. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998. B. Sch¨ lkopf, S. Mika, C.J.C. Burges, P. Knirsch, K.-R. M¨ ller, G. R¨ tsch, and A.J. Smola. Input o u a space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5): 1000–1017, September 1999. B. Sch¨ lkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Estimating the support o of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001. J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004. 995 K LOFT, B REFELD , S ONNENBURG AND Z IEN S. Sonnenburg, G. R¨ tsch, and C. Sch¨ fer. Learning interpretable SVMs for biological sequence a a classiﬁcation. In RECOMB 2005, LNBI 3500, pages 389–407. Springer-Verlag Berlin Heidelberg, 2005. S. Sonnenburg, G. R¨ tsch, C. Sch¨ fer, and B. Sch¨ lkopf. Large scale multiple kernel learning. a a o Journal of Machine Learning Research, 7:1531–1565, July 2006a. S. Sonnenburg, A. Zien, and G. R¨ tsch. Arts: Accurate recognition of transcription starts in human. a Bioinformatics, 22(14):e472–e480, 2006b. S. Sonnenburg, G. R¨ tsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, a and V. Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine Learning Research, 2010. J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, New York, NY, USA, 2004. ISBN 052154677X. M. Stone. Cross-validatory choice and assessment of statistical predictors (with discussion). Journal of the Royal Statistical Society, B36:111–147, 1974. Y. Suzuki, R. Yamashita, K. Nakai, and S. Sugano. dbTSS: Database of human transcriptional start sites and full-length cDNAs. Nucleic Acids Research, 30(1):328–331, 2002. M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. In Proceedings of the International Conference on Machine Learning, 2008. M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. Mach. Learn., 79(1-2):73–103, 2010. ISSN 0885-6125. doi: http://dx.doi.org/10.1007/s10994-009-5150-6. D.M.J. Tax and R.P.W. Duin. Support vector domain description. Pattern Recognition Letters, 20 (11–13):1191–1199, 1999. A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed problems. W. H. Winston, Washington, 1977. M. Varma and B. R. Babu. More generality in efﬁcient multiple kernel learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 1065–1072, New York, NY, USA, 2009. ACM. M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In IEEE 11th International Conference on Computer Vision (ICCV), pages 1–8, 2007. Z. Xu, R. Jin, I. King, and M. Lyu. An extended level method for efﬁcient multiple kernel learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1825–1832, 2009. Z. Xu, R. Jin, H. Yang, I. King, and M. Lyu. Simple and efﬁcient multiple kernel learning by group lasso. In Proceedings of the 27th Conference on Machine Learning (ICML 2010), 2010. Y. Yamanishi, , J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005. 996 ℓ p -N ORM M ULTIPLE K ERNEL L EARNING Y. Ying, C. Campbell, T. Damoulas, and M. Girolami. Class prediction from disparate biological data sources using an iterative multi-kernel algorithm. In Visakan Kadirkamanathan, Guido Sanguinetti, Mark Girolami, Mahesan Niranjan, and Josselin Noirel, editors, Pattern Recognition in Bioinformatics, volume 5780 of Lecture Notes in Computer Science, pages 427–438. Springer Berlin / Heidelberg, 2009. S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J. Suykens, B. De Moor, and Y. Moreau. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioinformatics, 11 (1):309, 2010. ISSN 1471-2105. M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006. A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th international conference on Machine learning (ICML), pages 1191–1198. ACM, 2007. 997