nips nips2008 nips2008-196 nips2008-196-reference knowledge-graph by maker-knowledge-mining

196 nips-2008-Relative Margin Machines


Source: pdf

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions of large data spread and is biased toward separating the classes by cutting along those large-spread directions. This article proposes a novel formulation that overcomes this sensitivity by maximizing the margin relative to the spread of the data. The proposed formulation can be solved efficiently, and experiments on digit datasets show drastic performance improvements over SVMs.
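The spread sensitivity described above can be seen in a toy experiment. The following is a minimal sketch, not the authors' RMM solver; it assumes numpy and scikit-learn are available, and the data and parameter values are illustrative. It rescales one feature direction and shows the linear SVM's normal drifting toward the large-spread direction, even though the labels are unchanged.

```python
# A toy illustration (assumed setup, not the authors' RMM solver): rescaling one
# feature direction changes the direction of a linear SVM's separating hyperplane,
# even though the labels are unchanged.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)

# Both features carry the same label information; only their spread will differ.
x1 = y + 0.3 * rng.standard_normal(n)
x2 = y + 0.3 * rng.standard_normal(n)

for scale in (1.0, 10.0, 100.0):
    X = np.column_stack([x1, scale * x2])          # stretch the second direction
    w = SVC(kernel="linear", C=1.0).fit(X, y).coef_.ravel()
    w /= np.linalg.norm(w)
    print(f"scale = {scale:6.1f}   unit normal w = {np.round(w, 3)}")

# As the spread of the second feature grows, the unit normal concentrates on that
# coordinate: the SVM increasingly separates the classes along the large-spread
# direction, which is the sensitivity the relative margin formulation targets.
```

In contrast, the relative margin formulation proposed in the paper measures the margin relative to the spread of the data, so it is designed not to be swayed by such rescaling.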


reference text

[1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.

[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.

[4] D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, pages 161–190, 2002.

[5] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.

[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[8] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[9] P. K. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In Proceedings of Artificial Intelligence and Statistics (AISTATS), 2007.

[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

[11] J. Weston, R. Collobert, F. H. Sinz, L. Bottou, and V. Vapnik. Inference with the universum. In Proceedings of the International Conference on Machine Learning, pages 1009–1016, 2006.

[12] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Neural Information Processing Systems, pages 668–674, 2000.

A Generalization Bound

In this section, we give the empirical Rademacher complexity [2, 8] for the function classes used by the SVM and by modified versions of the RMM and the Σ-SVM, which can be plugged into a generalization bound. Maximizing the margin can be seen as choosing a function $f(x) = w^\top x$ from the bounded class of functions $\mathcal{F}_E := \{x \mapsto w^\top x \mid \tfrac{1}{2}\|w\|^2 \le E\}$. For a technical reason, instead of bounding the projections on the training examples as in (5), we consider bounding the projections on an independent set of examples drawn from $\Pr(x)$, that is, a set $\mathcal{U} = \{u_1, u_2, \ldots, u_{n_u}\}$. Note that if we have an iid training set, it can be split into two parts; one part can be used exclusively to bound the projections and the other part exclusively for the classification constraints. Since the labels of the examples used to bound the projections do not matter, we denote this set by $\mathcal{U}$ and the other part by $(x_i, y_i)_{i=1}^n$.

We now consider the following function class, which is closely related to the RMM:
$$\mathcal{H}_{E,D} := \left\{x \mapsto w^\top x \;\middle|\; \tfrac{1}{2}w^\top w + \tfrac{D}{2}(w^\top u_i)^2 \le E \quad \forall\, 1 \le i \le n_u\right\},$$
where $D > 0$ trades off between a large margin and a small bound on the projections. Similarly, consider
$$\mathcal{G}_{E,D} := \left\{x \mapsto w^\top x \;\middle|\; \tfrac{1}{2}w^\top w + \tfrac{D}{2 n_u}\sum_{i=1}^{n_u}(w^\top u_i)^2 \le E\right\},$$
which is closely related to the class of functions considered by the Σ-SVM. The empirical Rademacher complexities of the three classes of functions are bounded as follows:
$$\hat{R}(\mathcal{F}_E) \le U_{\mathcal{F}_E} := \frac{2\sqrt{2E}}{n}\sqrt{\sum_{i=1}^{n} x_i^\top x_i}, \qquad \hat{R}(\mathcal{G}_{E,D}) \le U_{\mathcal{G}_{E,D}} := \frac{2\sqrt{2E}}{n}\sqrt{\sum_{i=1}^{n} x_i^\top \Sigma_D^{-1} x_i},$$
$$\hat{R}(\mathcal{H}_{E,D}) \le U_{\mathcal{H}_{E,D}} := \min_{\lambda \ge 0}\, \frac{2}{n}\sqrt{2E\left(\sum_{i=1}^{n_u}\lambda_i\right)\sum_{i=1}^{n} x_i^\top \Sigma_{\lambda,D}^{-1} x_i},$$
where $\Sigma_D = I + \frac{D}{n_u}\sum_{i=1}^{n_u} u_i u_i^\top$ and $\Sigma_{\lambda,D} = \left(\sum_{i=1}^{n_u}\lambda_i\right) I + D\sum_{i=1}^{n_u}\lambda_i u_i u_i^\top$. Note that the last upper bound is not a closed-form expression but a semi-definite optimization. Now, the upper bounds $U_{\mathcal{F}_E}$, $U_{\mathcal{G}_{E,D}}$, and $U_{\mathcal{H}_{E,D}}$ can be plugged into the following theorem in place of $\hat{R}(\mathcal{F})$ to obtain Rademacher-type generalization bounds.

Theorem 1 Fix $\gamma > 0$ and let $\mathcal{F}$ be the class of functions from $\mathbb{R}^m \times \{\pm 1\} \to \mathbb{R}$ given by $f(x, y) = -y g(x)$. Let $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ be drawn iid from a probability distribution $\mathcal{D}$. Then, with probability at least $1 - \delta$ over samples of size $n$, the following bound holds:
$$\Pr_{\mathcal{D}}\left[y \ne \operatorname{sign}(g(x))\right] \le \frac{\xi^\top \mathbf{1}}{n} + \frac{2\hat{R}(\mathcal{F})}{\gamma} + 3\sqrt{\frac{\ln(2/\delta)}{2n}},$$
where $\xi_i = \max(0, 1 - y_i g(x_i))$ are the so-called slack variables.
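As a concrete illustration of the complexity bounds above, the following minimal numpy sketch evaluates $U_{\mathcal{F}_E}$ and $U_{\mathcal{G}_{E,D}}$ in closed form and evaluates the objective of $U_{\mathcal{H}_{E,D}}$ at the uniform choice $\lambda_i = 1/n_u$, which upper-bounds the minimum over $\lambda \ge 0$ and, by construction, coincides with $U_{\mathcal{G}_{E,D}}$. The data and the values of $n$, $n_u$, $m$, $E$, and $D$ are illustrative placeholders, not quantities from the paper.

```python
# A minimal numpy sketch of the empirical Rademacher complexity upper bounds above.
# n, n_u, m, E, D and the random data are illustrative placeholders (assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, n_u, m = 100, 50, 5             # labeled points, projection points, dimension
E, D = 1.0, 0.5                    # margin bound E and trade-off parameter D

X = rng.standard_normal((n, m))    # x_1, ..., x_n
U = rng.standard_normal((n_u, m))  # u_1, ..., u_{n_u}

# U_F = (2 sqrt(2E) / n) * sqrt(sum_i x_i^T x_i)
U_F = 2.0 * np.sqrt(2.0 * E) / n * np.sqrt(np.sum(X * X))

# U_G: same form with x_i^T Sigma_D^{-1} x_i, where Sigma_D = I + (D/n_u) sum_i u_i u_i^T
Sigma_D = np.eye(m) + (D / n_u) * U.T @ U
U_G = 2.0 * np.sqrt(2.0 * E) / n * np.sqrt(np.sum(X * np.linalg.solve(Sigma_D, X.T).T))

# U_H evaluated at uniform weights lambda_i = 1/n_u; this upper bounds the minimum
# over lambda >= 0. With this choice Sigma_{lambda,D} equals Sigma_D and
# sum_i lambda_i = 1, so the value coincides with U_G.
lam = np.full(n_u, 1.0 / n_u)
Sigma_lam = lam.sum() * np.eye(m) + D * (U.T * lam) @ U
U_H_uniform = (2.0 / n) * np.sqrt(
    2.0 * E * lam.sum() * np.sum(X * np.linalg.solve(Sigma_lam, X.T).T)
)

print(f"U_F = {U_F:.4f}, U_G = {U_G:.4f}, U_H(uniform lambda) = {U_H_uniform:.4f}")
```

Computing the full bound $U_{\mathcal{H}_{E,D}}$ would additionally require minimizing over $\lambda \ge 0$, which, as noted above, is a semi-definite optimization rather than a closed-form expression.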