jmlr jmlr2010 jmlr2010-74 jmlr2010-74-reference knowledge-graph by maker-knowledge-mining

74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

Source: pdf

Author: Pannagadatta K. Shivaswamy, Tony Jebara

Abstract: Leading classiﬁcation methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identiﬁes its sensitivity to afﬁne transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classiﬁcation function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classiﬁcation function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efﬁciency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classiﬁcation and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity

reference text

A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002. M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. In Artiﬁcial Intelligence and Statistics, 2005. Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing o Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007. O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Lecture Notes in Artiﬁcial Intelligence 3176, pages 169–207, Heidelberg, Germany, 2004. Springer. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003. G. C. Cawley and N. L. C Talbot. Efﬁcient leave-one-out cross-validation of kernel ﬁsher discriminant classiﬁers. Pattern Recognition, 36:2585–2592, 2003. N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM J. Comput., 34(3):640–668, 2005. M. Collins and B. Roark. Incremental parsing with the perceptron algorithm. In Annual Meeting of the Association for Computational Linguistics, pages 111–118, 2004. K. Crammer, M. Dredze, and F. Pereira. Exact convex conﬁdence-weighted learning. In Advances in Neural Information Processing Systems 21, Cambridge, MA, 2009a. MIT Press. K. Crammer, M. Mohri, and F. Pereira. Gaussian margin machines. In Proceedings of the Artiﬁcial Intelligence and Statistics, 2009b. D. Decoste and B. Sch¨ lkopf. Training invariant support vector machines. Machine Learning, pages o 161–190, 2002. M. Dredze, K. Crammer, and F. Pereira. Conﬁdence-weighted linear classiﬁcation. In Internationcal Conference on Machine Leaarning, 2008. 786 M AXIMUM R ELATIVE M ARGIN AND DATA -D EPENDENT R EGULARIZATION R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classiﬁcation. Wiley-Interscience Publication, 2000. P. Haffner. Escaping the convex hull with extrapolated vector machines. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 753–760. MIT Press, Cambridge, MA, 2001. R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001. T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems 11, 1999. T. Joachims. Transductive inference for text classiﬁcation using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, 1999. T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (KDD), pages 217–226, 2006. T. Joachims. Making large-scale support vector machine learning practical. In A. Smola B. Sch¨ lkopf, C. Burges, editor, Advances in Kernel Methods: Support Vector Machines. MIT o Press, Cambridge, MA, 1998. S.S. Keerthi. Efﬁcient tuning of svm hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13:1225–1229, 2002. V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classiﬁers. Annals of Statistics, 30, 2002. Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. S. Mika, G. Raetsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher discriminant analysis with kernels. In in Neural Networks for Signal Processing IX, pages 41–48, 1999. G. Raetsch, T. Onoda, and K.-R. Muller. Soft margins for adaboost. Machine Learning, 43:287– 320, 2001. R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:322–330, 1998. B. Sch¨ lkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, USA, 2002. o S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for svm. In International Conference on Machine Learning, 2007. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. 787 S HIVASWAMY AND J EBARA P. K. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In Proceedings of the Artiﬁcial Intelligence and Statistics, 2007. P. K. Shivaswamy and T. Jebara. Relative margin machines. In Advances in Neural Information Processing Systems 21, Cambridge, MA, 2009a. MIT Press. P. K. Shivaswamy and T. Jebara. Structured prediction with relative margin. In International Conference on Machine Learning and Applications, 2009b. P. K. Shivaswamy, C. Bhattacharyya, and A. J. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006. F. Sinz, O. Chapelle, A. Agarwal, and B. Sch¨ lkopf. An analysis of inference with the univero sum. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1369–1376. MIT Press, Cambridge, MA, 2008. N. Srebro, J. D. M. Rennie, and T. S. Jaakola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, 2005. B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. Max-margin parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, September 2005. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 668–674. MIT Press, Cambridge, MA, 2000. J. Weston, R. Collobert, F. H. Sinz, L. Bottou, and V. Vapnik. Inference with the universum. In Proceedings of the International Conference on Machine Learning, pages 1009–1016, 2006. B. Zhang, X. Chen, S. Shan, and W. Gao. Nonlinear face recognition based on maximum average margin criterion. In Computer Vision and Pattern Recognition, pages 554–559, 2005. 788