
156 nips-2013-Learning Kernels Using Local Rademacher Complexity


Source: pdf

Author: Corinna Cortes, Marius Kloft, Mehryar Mohri

Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks.
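For readers unfamiliar with DC programming, the sketch below illustrates the generic concave-convex procedure of Yuille and Rangarajan [16], the kind of technique the abstract's second algorithm relies on: an objective written as a difference of convex functions is minimized by repeatedly linearizing the concave part and solving the resulting convex surrogate. The toy objective, the names g, h, grad_h, and cccp, and the choice of L-BFGS-B as inner solver are illustrative assumptions only, not the kernel-learning formulation of the paper.

import numpy as np
from scipy.optimize import minimize

# Toy DC decomposition: f(x) = g(x) - h(x), with g and h both convex.
# This objective is a placeholder, NOT the paper's kernel-learning objective.
def g(x):                      # convex part: 0.5 * ||x||^2
    return 0.5 * np.sum(x ** 2)

def h(x):                      # h is convex, so -h is the concave part: ||x||_1
    return np.sum(np.abs(x))

def grad_h(x):                 # a subgradient of h
    return np.sign(x)

def cccp(x0, n_iter=50, tol=1e-8):
    """Concave-convex procedure: linearize h at the current iterate and
    minimize the convex surrogate g(z) - <grad_h(x_k), z> at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        lin = grad_h(x)
        res = minimize(lambda z: g(z) - lin @ z, x, method="L-BFGS-B")
        x_new = res.x
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

print(cccp(np.array([3.0, -2.0, 0.5])))   # converges to a stationary point of f

Each iteration only requires solving a convex problem, which is why DC formulations such as the paper's second algorithm can reuse standard convex solvers inside an outer loop.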


reference text

[1] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proc. 21st ICML, ACM, 2004.

[2] C. Cortes, M. Mohri, and A. Rostamizadeh, “Generalization bounds for learning kernels,” in Proc. 27th ICML, pp. 247–254, 2010.

[3] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “ℓp-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, Mar. 2011.

[4] G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan, “Learning the kernel matrix with semi-definite programming,” JMLR, vol. 5, pp. 27–72, 2004.

[5] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, “SimpleMKL,” J. Mach. Learn. Res., vol. 9, pp. 2491–2521, 2008.

[6] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, July 2006.

[7] P. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, pp. 463–482, Nov. 2002.

[8] V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of Statistics, vol. 30, pp. 1–50, 2002.

[9] N. Srebro and S. Ben-David, “Learning bounds for support vector machines with learned kernels,” in Proc. 19th COLT, pp. 169–183, 2006.

[10] Y. Ying and C. Campbell, “Generalization bounds for learning the kernel problem,” in COLT, 2009.

[11] P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local Rademacher complexities,” Ann. Stat., vol. 33, no. 4, pp. 1497–1537, 2005.

[12] V. Koltchinskii, “Local Rademacher complexities and oracle inequalities in risk minimization,” Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006.

[13] S. Mendelson, “On the performance of kernel classes,” J. Mach. Learn. Res., vol. 4, pp. 759–771, December 2003.

[14] B. Schölkopf and A. Smola, Learning with Kernels. Cambridge, MA: MIT Press, 2002.

[15] P. D. Tao and L. T. H. An, “A DC optimization algorithm for solving the trust-region subproblem,” SIAM Journal on Optimization, vol. 8, no. 2, pp. 476–505, 1998.

[16] A. L. Yuille and A. Rangarajan, “The concave-convex procedure,” Neural Computation, vol. 15, pp. 915–936, Apr. 2003.

[17] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.

[18] B. Boser, I. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. 5th Annual ACM Workshop on Computational Learning Theory (D. Haussler, ed.), pp. 144–152, 1992.

[19] M. Kloft and G. Blanchard, “On the convergence rate of ℓp-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 13, pp. 2465–2502, Aug. 2012.

[20] V. Koltchinskii and M. Yuan, “Sparsity in multiple kernel learning,” Ann. Stat., vol. 38, no. 6, pp. 3660–3695, 2010.

[21] T. Suzuki, “Unifying framework for fast learning rate of non-sparse multiple kernel learning,” in Advances in Neural Information Processing Systems 24, pp. 1575–1583, 2011.

[22] P. Gehler and S. Nowozin, “On feature combination for multiclass object classification,” in International Conference on Computer Vision, pp. 221–228, 2009.

[23] S. Sonnenburg, A. Zien, and G. Rätsch, “ARTS: Accurate recognition of transcription starts in human,” Bioinformatics, vol. 22, no. 14, pp. e472–e480, 2006.

[24] T. Abeel, Y. V. de Peer, and Y. Saeys, “Towards a gold standard for promoter prediction evaluation,” Bioinformatics, 2009.

[25] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc, “The SHOGUN Machine Learning Toolbox,” J. Mach. Learn. Res., 2010.

[26] F. Orabona and L. Jie, “Ultra-fast optimization algorithm for sparse multi kernel learning,” in Proceedings of the 28th International Conference on Machine Learning, 2011.

[27] A. Zien and C. S. Ong, “Multiclass multiple kernel learning,” in Proc. 24th ICML, pp. 1191–1198, ACM, 2007.

[28] T. Damoulas and M. A. Girolami, “Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection,” Bioinformatics, vol. 24, no. 10, pp. 1264–1270, 2008.

[29] P. Bartlett and S. Mendelson, “Empirical minimization,” Probab. Theory Related Fields, vol. 135, no. 3, pp. 311–334, 2006.

[30] A. B. Tsybakov, “Optimal aggregation of classifiers in statistical learning,” Ann. Stat., vol. 32, pp. 135–166, 2004.