jmlr jmlr2012 jmlr2012-81 jmlr2012-81-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marius Kloft, Gilles Blanchard
Abstract: We derive an upper bound on the local Rademacher complexity of ℓp-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only, while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also prove a lower bound showing that the upper bound is tight, and derive consequences regarding the excess loss, namely fast convergence rates of the order O(n^{−α/(1+α)}), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity
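A worked reading of the rate claimed in the abstract (added for clarity; only the displayed rate O(n^{−α/(1+α)}) is from the paper, the endpoint evaluations below are plain arithmetic):

\[
\left.\frac{\alpha}{1+\alpha}\right|_{\alpha=1} = \frac{1}{2},
\qquad
\frac{\alpha}{1+\alpha} \longrightarrow 1 \quad \text{as } \alpha \to \infty,
\]

so the excess-risk rate interpolates between the classical global rate O(n^{−1/2}) for slowly decaying kernel eigenvalues (α near 1) and the fast rate O(n^{−1}) for rapidly decaying ones (α → ∞).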
References:
J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, and S. Raman. Variable sparsity kernel learning—algorithms and applications. Journal of Machine Learning Research, 12:565–592, Feb 2011.
F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. 21st ICML. ACM, 2004.
P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, Nov 2002.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. (Was Department of Statistics, U.C. Berkeley Technical Report number 638, 2003.)
G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Annals of Statistics, 36(2):489–531, 2008.
R. R. Bouckaert, E. Frank, M. A. Hall, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. WEKA—experiences with a Java open-source project. Journal of Machine Learning Research, 11:2533–2541, 2010.
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, March 2002. ISSN 1532-4435.
O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin/Heidelberg, 2004.
C. Cortes. Invited talk: Can learning kernels help performance? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1:1–1:1, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.
C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, and A. Rostamizadeh. Proceedings of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008. URL http://www.cs.nyu.edu/learning_kernels.
C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 2009.
C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings, 27th ICML, 2010.
P. V. Gehler and S. Nowozin. Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009.
A. Gelman. Causality and statistical learning. American Journal of Sociology, 0, 2010.
R. Ibragimov and S. Sharakhmetov. The best constant in the Rosenthal inequality for nonnegative random variables. Statistics & Probability Letters, 55(4):367–376, 2001. ISSN 0167-7152.
J.-P. Kahane. Some Random Series of Functions. Cambridge University Press, 2nd edition, 1985.
M. Kloft. ℓp-Norm Multiple Kernel Learning. PhD thesis, Berlin Institute of Technology, Oct 2011. URL http://opus.kobv.de/tuberlin/volltexte/2011/3239/.
M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel learning. In Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Kernels, Dec 2008.
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate ℓp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 997–1005. MIT Press, 2009.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. ℓp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953–997, Mar 2011.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593–2656, 2006.
V. Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 45(1):7–57, 2009.
V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. Annals of Statistics, 38(6):3660–3695, 2010.
S. Kwapień and W. A. Woyczyński. Random Series and Stochastic Integrals: Single and Multiple. Birkhäuser, Basel and Boston, MA, 1992.
G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27–72, 2004.
H. Leeb and B. M. Pötscher. Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics, 142:201–211, 2008.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, 1989 (Norwich, 1989), volume 141 of London Mathematical Society Lecture Note Series, pages 148–188. Cambridge University Press, Cambridge, 1989.
S. Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4:759–771, December 2003.
C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, May 2001.
G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. CoRR, abs/1008.3654, 2010.
H. Rosenthal. On the subspaces of Lp (p > 2) spanned by sequences of independent random variables. Israel Journal of Mathematics, 8:273–303, 1970.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
J. R. Searle. Minds, brains, and programs. Behavioral and Brain Sciences, 3(3):417–424, 1980. doi: 10.1017/S0140525X00005756. URL http://dx.doi.org/10.1017/S0140525X00005756.
S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, July 2006.
N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In G. Lugosi and H.-U. Simon, editors, COLT, volume 4005 of Lecture Notes in Computer Science, pages 169–183. Springer, 2006.
J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, New York, NY, USA, 2004. ISBN 052154677X.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
M. Stone. Cross-validatory choice and assessment of statistical predictors (with discussion). Journal of the Royal Statistical Society, B36:111–147, 1974.
T. Suzuki. Unifying framework for fast learning rate of non-sparse multiple kernel learning. In J. Shawe-Taylor, R. Zemel, P. L. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24. 2011. To appear.
M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'IHÉS, 81:73–205, 1995.
A. Tsybakov. Optimal rates of aggregation. In B. Schölkopf and M. Warmuth, editors, Computational Learning Theory and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes in Artificial Intelligence, pages 303–313. Springer, 2003.
S. van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Transactions on Information Theory, 47(6):2516–2532, 2001.
H. Xu, S. Mannor, and C. Caramanis. Sparse algorithms are not stable: A no-free-lunch theorem. In Proceedings of the 46th Annual Allerton Conference on Communication, Control, and Computing, pages 1299–1303, 2008.
Y. Ying and C. Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.