nips nips2001 nips2001-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Patrick Haffner
Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depend upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Maximum margin classifiers such as Support Vector Machines (SVMs) critically depend upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. [sent-4, score-0.72]
2 We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. [sent-5, score-0.131]
3 They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. [sent-7, score-0.272]
4 1 Introduction Both intuition and theory [9] seem to support that the best linear separation between two classes is the one that maximizes the margin. [sent-8, score-0.146]
5 In Fig. 1, the maximum margin hyperplane is $w_0$; however, most observers would say that the separating hyperplane $w_1$ has better chances to generalize, as it takes into account the expected location of additional training samples. [sent-11, score-0.272]
6 Figure 1: Example of separation where the large margin is undesirable. [sent-61, score-0.312]
7 The convex hull and the separation that correspond to the standard SVM use plain lines, while the extrapolated convex hulls and XVMs use dotted lines. [sent-62, score-0.868]
8 In this paper, we just use a very elementary form of extrapolation ("the poor man's variance") and show that it can be implemented in a new extension to SVMs that we call Extrapolated Vector Machines (XVMs). [sent-65, score-0.18]
9 2 Adding Extrapolation to Maximum Margin Constraints This section states extrapolation as a constrained optimization problem and computes a simpler dual form. [sent-66, score-0.281]
10 The $N$ training samples $\{(x_i, y_i);\ 1 \le i \le N\}$ are separated with a margin $\rho$ if there exists a set of weights $w$ such that $\|w\| = 1$ and $\forall k \in \{+,-\},\ \forall i \in C_k,\ y_k(w \cdot x_i + b) \ge \rho$ (1). SVMs offer techniques to find the weights $w$ which maximize the margin $\rho$. [sent-68, score-0.569]
11 Now, instead of imposing the margin constraint on each training point, suppose that for two points in the same class $C_k$, we require any possible extrapolation within a range factor $\eta_k \ge 0$ to be larger than the margin: $\forall i,j \in C_k,\ \forall \lambda \in [-\eta_k, 1+\eta_k],\ y_k\big(w \cdot (\lambda x_i + (1-\lambda) x_j) + b\big) \ge \rho$ (2). [sent-69, score-0.651]
12 It is sufficient to enforce the constraints at the ends of the extrapolation segments, i.e. $\forall i,j \in C_k,\ y_k\big(w \cdot ((1+\eta_k) x_i - \eta_k x_j) + b\big) \ge \rho$ (3). Keeping the constraint over each pair of points would result in $N^2$ Lagrange multipliers. [sent-70, score-0.335]
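To make constraints (2)-(3) concrete, here is a minimal sketch of the pairwise check at the segment endpoints; it is not from the paper, and the function name, the label encoding and the brute-force pair enumeration are illustrative assumptions.

```python
import numpy as np

def extrapolated_constraints_hold(X, y, w, b, rho, eta={+1: 1.0, -1: 1.0}):
    """Check constraint (3): y_k (w . ((1+eta_k) x_i - eta_k x_j) + b) >= rho
    for every pair (i, j) in the same class k.  Because the left-hand side is
    linear in lambda, checking the endpoint lambda = 1 + eta_k (the other
    endpoint is the same check with i and j swapped) covers constraint (2)."""
    for k in (+1, -1):
        m = X[y == k] @ w + b                                    # w . x_i + b for class k
        ext = (1 + eta[k]) * m[:, None] - eta[k] * m[None, :]    # all pairs (i, j)
        if np.min(k * ext) < rho:                                # multiply by y_k = k
            return False
    return True
```

This pairwise check is exactly the quadratic-size constraint set mentioned next.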
13 But we can reduce it to a double constraint applied to each single point. [sent-71, score-0.119]
14 From (4) and (5), the margin becomes $2\rho = \sum_k \big((\eta_k+1)\nu_k - \eta_k\mu_k\big)$ (6). [sent-80, score-0.272]
15 Our problem is to maximize the margin under the double constraint: $\forall i \in C_k,\ \nu_k \le y_k(w \cdot x_i + b) \le \mu_k$. [sent-82, score-0.319]
16 In other words, the extrapolated margin maximization is equivalent to squeezing the points belonging to a given class between two hyperplanes. [sent-84, score-0.69]
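The quantities involved only depend on the per-class extrema, which is what the "squeezing between two hyperplanes" picture says. A small sketch, assuming the margin expression (6) as reconstructed above (the function and variable names are not from the paper):

```python
import numpy as np

def class_margins(X, y, w, b, eta={+1: 1.0, -1: 1.0}):
    """Return the internal margins nu_k, external margins mu_k and the
    extrapolated primal margin 2*rho = sum_k ((eta_k+1) nu_k - eta_k mu_k)."""
    nu, mu = {}, {}
    for k in (+1, -1):
        m = k * (X[y == k] @ w + b)       # y_k (w . x_i + b) over class k
        nu[k], mu[k] = m.min(), m.max()   # the two hyperplanes squeezing class k
    two_rho = sum((eta[k] + 1) * nu[k] - eta[k] * mu[k] for k in (+1, -1))
    return nu, mu, two_rho
```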
17 Optimizing over $\mu_k$, $\nu_k$ and $w$ with Lagrangian techniques gives us the following dual problem (7). (Footnote 1: in this paper, it is necessary to index the outputs $y$ with the class $k$ rather than the more traditional sample index $i$, as extrapolation constraints require two examples to belong to the same class.) [sent-89, score-0.409]
18 Compared to the standard SVM formulation, we have two sets of support vectors. [sent-91, score-0.075]
19 Moreover, the Lagrange multipliers that we chose are normalized differently from the traditional SVM multipliers (note that this is one possible choice of notation, see Section. [sent-92, score-0.186]
20 They sum to 1 and allow an interesting geometric interpretation, developed in the next section. [sent-94, score-0.096]
21 3 Geometric Interpretation and Iterative Algorithm For each class $k$, we define the nearest point to the other class's convex hull along the direction of $w$: $N_k = \sum_{i \in C_k} \beta_i x_i$. [sent-95, score-0.516]
22 $N_k$ is a combination of the internal support vectors that belong to class $k$ with $\beta_i > 0$. [sent-96, score-0.202]
23 At the minimum of (7), because they correspond to non-zero Lagrange multipliers, they fall on the internal margin $y_k(w \cdot x_i) = \nu_k$; therefore, we obtain $\nu_k = y_k\, w \cdot N_k$. [sent-97, score-0.297]
24 Similarly, we define the furthest point $F_k = \sum_{i \in C_k} \tilde\beta_i x_i$. $F_k$ is a combination of the external support vectors, and we have $\mu_k = y_k\, w \cdot F_k$. [sent-98, score-0.223]
25 The dual problem is equivalent to the distance minimization problem $\min_{N_k, F_k \in \mathcal{H}_k} \big\| \sum_k y_k \big((\eta_k+1)N_k - \eta_k F_k\big) \big\|^2$, where $\mathcal{H}_k$ is the convex hull containing the examples of class $k$. [sent-100, score-0.482]
26 An interesting geometric interpretation is also offered in [3]. [sent-103, score-0.071]
27 All the aforementioned algorithms search for the points in the convex hulls of each class that are the nearest to each other ($N_{+1}$ and $N_{-1}$ on Fig. 1), the maximal margin weight vector being $w = N_{+1} - N_{-1}$. [sent-104, score-0.37]
28 XCHDM looks for nearest points in the extrapolated convex hulls ($X_{+1}$ and $X_{-1}$ on Fig. 1). [sent-105, score-0.958]
29 The extrapolated nearest points are $X_k = (\eta_k+1)N_k - \eta_k F_k$. Note that they can be outside the convex hull because we allow negative contributions from external support vectors. [sent-107, score-0.847]
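The bookkeeping of this geometric view can be sketched as follows; the convex-combination weights are assumed given (the actual XCHDM iterative update is not described in the paper), and all names are assumptions.

```python
import numpy as np

def extrapolated_nearest_points(X, y, beta_in, beta_out, eta={+1: 1.0, -1: 1.0}):
    """beta_in[k] and beta_out[k] are convex weights (non-negative, summing to 1)
    over the samples of class k.  Build N_k, F_k, the extrapolated points
    X_k = (eta_k+1) N_k - eta_k F_k, the direction w = X_+ - X_- and the
    dual margin ||X_+ - X_-||."""
    Xk = {}
    for k in (+1, -1):
        P = X[y == k]
        N_k = beta_in[k] @ P      # nearest point, a convex combination
        F_k = beta_out[k] @ P     # furthest point, a convex combination
        Xk[k] = (eta[k] + 1) * N_k - eta[k] * F_k   # may leave the convex hull
    w = Xk[+1] - Xk[-1]
    return w, np.linalg.norm(w)
```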
30 With the double set of Lagrange multipliers, the description of the XCHDM algorithm is beyond the scope of this paper. [sent-110, score-0.101]
31 Define the extrapolated primal margin $\gamma_2 = 2\rho = \sum_k \big((\eta_k+1)\nu_k - \eta_k\mu_k\big)$ and the dual margin $\gamma_2' = \|X_+ - X_-\|$. Convergence consists in reducing the duality gap $\gamma_2' - \gamma_2$ down to zero. [sent-113, score-1.14]
32 In the rest of the paper, we will measure convergence with the duality ratio $r = \gamma_2 / \gamma_2'$. [sent-114, score-0.166]
33 To determine the threshold $b$ used to compute the classifier output class $\mathrm{sign}(w \cdot x + b)$, two choices are possible. [sent-115, score-0.16]
34 We can require the separation to happen at the center of the primal margin, with the primal threshold $b_1$ (obtained by subtracting the per-class margin equations), [sent-117, score-0.228]
35 or at the center of the dual margin, with the dual threshold $b_2 = -\tfrac{1}{2}\, w \cdot (X_+ + X_-)$. [sent-120, score-0.138]
36 Our standard stopping heuristic is numerical: stop when the duality ratio gets over a fixed value (typically between 0.4 and 0.99, the range covered in Tables 1 and 3). [sent-123, score-0.359]
37 The only other stopping heuristic we have tried so far is based on the following idea. [sent-126, score-0.092]
38 Define the set of extrapolated pairs as $\{(\eta_k+1)x_i - \eta_k x_j;\ 1 \le i,j \le N\}$. [sent-127, score-0.329]
39 Convergence means that we find extrapolated support pairs that contain every extrapolated pair on the correct side of the margin. [sent-128, score-0.76]
40 We can relax this constraint and stop when the extrapolated support pairs contain every vector. [sent-129, score-0.577]
41 This means that $\gamma_2'$ must be lower than the primal true margin along $w$ (measured on the non-extrapolated data), $\gamma_1 = \nu_+ + \nu_-$. [sent-130, score-0.347]
42 This causes the XCHDM algorithm to stop long before $\gamma_2'$ reaches $\gamma_2$, and is called the hybrid stopping heuristic. [sent-131, score-0.258]
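Under the reading of $\gamma_1$, $\gamma_2$ and $\gamma_2'$ used above (the symbols are reconstructed, so treat them as assumptions), the two stopping tests can be sketched as follows; the 0.5 default mirrors the duality-ratio values reported in the tables.

```python
def should_stop(nu, mu, dual_margin, eta={+1: 1.0, -1: 1.0},
                ratio_threshold=0.5, hybrid=False):
    """Duality-ratio stop: primal/dual extrapolated margin above a threshold.
    Hybrid stop: the dual margin falls below the non-extrapolated primal
    margin gamma_1 = nu_+ + nu_-."""
    primal = sum((eta[k] + 1) * nu[k] - eta[k] * mu[k] for k in (+1, -1))
    if hybrid:
        return dual_margin < nu[+1] + nu[-1]
    return primal / dual_margin > ratio_threshold
```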
43 Kernel Machines consist of any classifier of the type $f(x) = \sum_i y_i \alpha_i K(x, x_i)$. [sent-133, score-0.096]
44 SVMs offer one solution among many others, with the constraint $\alpha_i > 0$. [sent-134, score-0.097]
45 While the algorithm described in Section 2 converges toward a solution where vectors act as supports of margins (internal and external), experiments show that the performance of XVMs can be significantly improved if we stop before full convergence. [sent-136, score-0.166]
46 In this case, the vectors with $\alpha_i \ne 0$ do not line up onto any type of margin, and should not be called support vectors. [sent-137, score-0.113]
47 The extrapolated margin contains terms which are caused by the extrapolation and are proportional to the width of each class along the direction of w. [sent-138, score-0.891]
48 We would observe the same phenomenon if we had trained the classifier using Maximum Likelihood Estimation (MLE) (replace class width with variance). [sent-139, score-0.206]
49 Note also that like the Fisher discriminant , XVMs look for the projection that maximizes the inter-class variance while minimizing the intra-class variances. [sent-142, score-0.059]
50 5 Experiments on MNIST The MNIST OCR database contains 60,000 handwritten digits for training and 10,000 for testing (the testing data can be extended to 60,000 but we prefer to keep unseen test data for final testing and comparisons). [sent-143, score-0.081]
51 Before computing the kernels, the input vectors are normalized to 1: $x \leftarrow \frac{x}{\|x\|}$. Good polynomial kernels are then easy to define as $K_p(x, y) = (x \cdot y)^p$. [sent-147, score-0.283]
52 We found these normalized kernels to outperform the unnormalized kernels $K_p(x, y) = (a (x \cdot y) + b)^p$. [sent-149, score-0.205]
53 We also define normalized Gaussian kernels: $K_p(x, y) = \exp\left(-\frac{1}{\sigma^2}\|x - y\|^2\right) = \left[\exp(x \cdot y - 1)\right]^{2/\sigma^2}$ (8). [sent-154, score-0.079]
54 Equation (8) shows how they relate to normalized polynomial kernels: when $x \cdot y$ is close to 1, $\exp(x \cdot y - 1) \approx x \cdot y$, so the Gaussian kernel behaves like a normalized polynomial kernel of degree $2/\sigma^2$. [sent-157, score-0.067]
55 We observed that on MNIST, the performance with $K_p$ is very similar to what is obtained with unnormalized Gaussian kernels $K_u(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$. [sent-159, score-0.106]
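A minimal numpy sketch of these kernels and of the kernel-expansion classifier described above; the exponents follow the reconstruction of Eq. (8), so the constants and names are assumptions rather than the paper's exact settings.

```python
import numpy as np

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def poly_kernel(X, Y, p=4):
    """Normalized polynomial kernel K_p(x, y) = (x . y)^p on unit-norm inputs."""
    return (normalize(X) @ normalize(Y).T) ** p

def gaussian_kernel(X, Y, sigma=1.0):
    """Normalized Gaussian kernel exp(-||x - y||^2 / sigma^2); on unit-norm
    inputs this equals [exp(x . y - 1)]^(2 / sigma^2)."""
    Xn, Yn = normalize(X), normalize(Y)
    sq = ((Xn[:, None, :] - Yn[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def kernel_classifier(alpha, y_train, X_train, x, kernel=poly_kernel):
    """Kernel-expansion decision value f(x) = sum_i y_i alpha_i K(x, x_i)."""
    return float((alpha * y_train) @ kernel(X_train, x[None, :]).ravel())
```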
56 MNIST contains 1 class per digit, so the total number of classes is M=10. [sent-161, score-0.095]
57 To combine binary classifiers to perform multiclass classification, the two most common approaches were considered. [sent-162, score-0.148]
58 • In the one-vs-others case (1vsR), we have one classifier per class c, with the positive examples taken from class c and the negative examples from the other classes. [sent-163, score-0.224]
59 Class c is recognized when the corresponding classifier yields the largest output . [sent-164, score-0.096]
60 • In the one-vs-one case (1vs1), each classifier only discriminates one class from another: we need a total of $\frac{M(M-1)}{2} = 45$ classifiers (a recombination sketch follows this list). [sent-165, score-0.16]
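A sketch of the two recombination schemes (a plain arg-max for 1vsR and a simple majority vote for 1vs1; the paper's actual 1vs1 recombination is optimized following [8] and is not reproduced here):

```python
import numpy as np

def predict_1vsR(scores):
    """scores: (n_samples, M) outputs of the one-vs-others classifiers.
    The predicted class is the one whose classifier output is largest."""
    return np.argmax(scores, axis=1)

def predict_1vs1(pair_scores, pairs, M=10):
    """pair_scores: (n_samples, M*(M-1)//2) outputs of the pairwise classifiers;
    pairs[j] = (c1, c2), a positive score voting for c1.  Majority vote."""
    votes = np.zeros((pair_scores.shape[0], M))
    for j, (c1, c2) in enumerate(pairs):
        win = pair_scores[:, j] > 0
        votes[win, c1] += 1
        votes[~win, c2] += 1
    return np.argmax(votes, axis=1)
```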
61 Despite the effort we spent on optimizing the recombination of the classifiers [8], 1vsR SVMs (Table 1) perform significantly better than 1vs1 SVMs (Table 2). [sent-166, score-0.123]
62 For each trial, the number of errors over the 10,000 test samples (#err) and the total number of support vectors (#SV) are reported. [sent-167, score-0.075]
63 For instance, 12,000 support vectors means that 20% of the 60,000 vectors are used as support. [sent-169, score-0.151]
64 Preliminary experiments to choose the value of $\eta_k$ with the hybrid criterion show that the results for $\eta_k = 1$ are better than for the other values tried. [sent-170, score-0.065]
65 This is surprising in the light of results published on other tasks [8], and would require further investigation beyond the scope of this paper. [sent-178, score-0.092]
66 In a transductive inference framework, treat the test example as a training example: for each of the $M$ possible labels, retrain the $M-1$ among $\frac{M(M-1)}{2}$ classifiers that use examples with such label. [sent-180, score-0.098]
67 The best label will be the one that causes the smallest increase in the multiclass margin $\rho$, which combines the classifier margins $\rho_c$ in the following manner: $\frac{1}{\rho^2} = \sum_{c \le M} \frac{1}{\rho_c^2}$. The fact that this margin predicts generalization is "justified" by Theorem 1 in [8]. [sent-181, score-0.75]
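A sketch of the margin combination, using the reconstructed formula above and reading "smallest increase" as keeping the label whose retrained combined margin is degraded the least (both readings are assumptions):

```python
import numpy as np

def combined_margin(rho_c):
    """Multiclass margin rho from binary margins rho_c: 1/rho^2 = sum_c 1/rho_c^2."""
    rho_c = np.asarray(rho_c, dtype=float)
    return 1.0 / np.sqrt((1.0 / rho_c ** 2).sum())

def transductive_label(retrained_margins):
    """retrained_margins[label] = binary margins rho_c after adding the test
    point with that label and retraining the classifiers that involve it.
    Keep the label whose combined margin remains largest."""
    return max(retrained_margins, key=lambda lab: combined_margin(retrained_margins[lab]))
```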
68 Table 1: SVMs on MNIST with 10 1vsR classifiers. Each row is one kernel; columns give #err / #SV at duality-ratio stops 0.40, 0.75 and 0.99. [sent-183, score-0.101]
69 Stop at 0.40: 136/8367, 127/8331, 125/8834, 136/13002, 147/9014, 125/8668, 125/8944. Stop at 0.75: 136/11132, 117/11807, 119/12786, 137/18784, 128/11663, 119/12222, 125/12852. [sent-184, score-0.2]
70 Stop at 0.99: 132/13762, 119/15746, 119/17868, 141/25953, 131/13918, 117/16604, 125/18085. [sent-185, score-0.098]
71 Table 2: SVM/XVM on MNIST with 45 1vs1 classifiers. SVM with duality-ratio stop at 0.99 (#err / #SV): K3 138/11952, K4 135/13526, K5 191/13526. XVM with hybrid stop: K3 117/17020, K4 110/16066, K5 114/15775. [sent-186, score-0.498]
72 The 103 errors obtained with K4 and r = 0.5 in Table 3 represent only about 1% error: this is the lowest error ever reported for any learning technique without a priori knowledge about the fact that the input data corresponds to a pixel map (the lowest reproducible error previously reported was 1. [sent-187, score-0.18]
73 The downside is that XVMs require 4 times as many support vectors as standard SVMs. [sent-190, score-0.151]
74 Table 3 compares stopping according to the duality ratio and the hybrid criterion. [sent-191, score-0.323]
75 With the duality ratio, the best performance is most often reached with r = 0.50. [sent-192, score-0.123]
76 (If this happens to be consistently true, validation data to decide when to stop could be spared.) [sent-193, score-0.101]
77 The hybrid criterion does not require validation data and yields errors that, while higher than the best XVM, are lower than SVMs, while requiring only a few more support vectors. [sent-194, score-0.216]
78 One way to interpret this hybrid stopping criterion is that we stop when interpolation in some (but not all) directions accounts for all non-interpolated vectors. [sent-196, score-0.293]
79 This suggests that extrapolating on a convex hull that contains several different classes (in the 1vsR case) may be undesirable. [sent-199, score-0.311]
80 Table 3: XVMs on MNIST with 10 1vsR classifiers. Each row is one kernel; columns give #err / #SV at duality-ratio stops 0.40, 0.50 and 0.75, and for the hybrid stop. [sent-201, score-0.301]
81 Stop at 0.40: 118/46662, 112/40274, 109/36912, 128/35809, 114/43909, 108/36980. Stop at 0.50: 111/43819, 103/43132, 106/44226, 126/39462, 114/46905, 111/40329. [sent-202, score-0.2]
82 Stop at 0.75: 116/50216, 110/52861, 110/49383, 131/50233, 114/53676, 114/51088. Hybrid: 125/20604, 107/18002, 107/17322, 125/19218, 119/20152, 108/16895. [sent-203, score-0.2]
83 6 The Soft Margin Case MNIST is characterized by the quasi-absence of outliers, so assuming that the data is fully separable does not impair performance at all. [sent-205, score-0.324]
84 To extend XVMs to non-separable data, we first considered the traditional approaches of adding slack variables to allow margin constraints to be violated. [sent-206, score-0.419]
85 The most commonly used approach with SVMs adds linear slack variables to the unitary margin. [sent-207, score-0.099]
86 Its application to the XVM requires giving up the weight normalization constraint, so that the usual unitary margin can be used in the constraints [9]. [sent-208, score-0.338]
87 Compared to standard SVMs, a new issue to tackle is the fact that each constraint corresponds to a pair of vectors: ideally, we should handle $N^2$ slack variables $\xi_{ij}$. [sent-209, score-0.163]
88 As in (3), the constraint on the extrapolation from any pair of points becomes $\forall i,j \in C_k,\ y_k\big(w \cdot ((1+\eta_k)x_i - \eta_k x_j) + b\big) \ge 1 - \xi_{ij}$, with the unitary margin. [sent-212, score-0.304]
89 By writing the pair slack as a combination of per-point slacks ($i \in C_k$, $j \in C_k$) (9), we obtain the simpler double constraint $\forall i \in C_k,\ \nu_k - \xi_i \le y_k(w \cdot x_i + b) \le \mu_k$. [sent-215, score-0.119]
90 Had we fixed $\mu_k$ (and thus $\nu_k$) instead of treating it as an optimization variable, it would amount to a standard SVM regression problem with $\{-1, +1\}$ outputs, the width of the asymmetric $\epsilon$-insensitive tube being $\mu_k - \nu_k$. [sent-220, score-0.111]
91 3% (in both cases, we use Gaussian kernels with parameter $\sigma = 1650$). [sent-225, score-0.066]
92 Experiments with SVMtorch on a variety of tasks where non-zero slacks are required to achieve optimal performance (Reuters, UCI/Forest, UCI/Breast cancer) have not shown significant improvement using the regression mode while we vary the width of the $\epsilon$-tube. [sent-228, score-0.079]
93 Many experiments on SVMs have reported that removing the outliers often gives efficient and sparse solutions. [sent-229, score-0.093]
94 The early stopping heuristics that we have presented for XVMs suggest strategies to avoid learning (or to unlearn) the outliers, and this is the approach we are currently exploring. [sent-230, score-0.092]
95 7 Concluding Remarks This paper shows that large margin classification on extrapolated data is equivalent to adding the minimization of a second, external margin to the standard SVM approach. [sent-231, score-0.967]
96 The associated optimization problem is solved efficiently with convex hull distance minimization algorithms. [sent-232, score-0.355]
97 A 1% error rate is obtained on the MNIST dataset: it is the lowest ever obtained without a priori knowledge about the data. [sent-233, score-0.069]
98 We are currently trying to identify what other types of datasets show similar gains over SVMs, to determine how dependent XVM performance is on the fact that the data is separable or has invariance properties. [sent-234, score-0.062]
99 Learning Theory bounds that would be a function of both the margin and some form of variance of the data would be necessary to predict XVM generalization and allow us to also consider the extrapolation factor $\eta$ as an optimization variable. [sent-237, score-0.509]
100 A fast iterative nearest point algorithm for support vector machine classifier design. [sent-276, score-0.233]
wordName wordTfidf (topN-words)
[('extrapolated', 0.329), ('xvms', 0.328), ('vk', 0.299), ('margin', 0.272), ('err', 0.2), ('mnist', 0.191), ('extrapolation', 0.18), ('xchdm', 0.177), ('xvm', 0.177), ('svms', 0.171), ('hull', 0.149), ('yk', 0.138), ('sv', 0.134), ('kp', 0.132), ('convex', 0.131), ('duality', 0.123), ('nk', 0.11), ('stop', 0.101), ('rjk', 0.101), ('ck', 0.1), ('classifiers', 0.098), ('classifier', 0.096), ('stopping', 0.092), ('hulls', 0.088), ('svm', 0.08), ('support', 0.075), ('primal', 0.075), ('constraint', 0.072), ('dual', 0.069), ('kernels', 0.066), ('yi', 0.066), ('xi', 0.065), ('hybrid', 0.065), ('slack', 0.064), ('class', 0.064), ('multipliers', 0.063), ('nearest', 0.062), ('lagrange', 0.059), ('outliers', 0.057), ('table', 0.054), ('svmtorch', 0.053), ('external', 0.051), ('furthest', 0.051), ('haffner', 0.051), ('kxj', 0.051), ('multiclass', 0.05), ('double', 0.047), ('width', 0.046), ('define', 0.046), ('ocr', 0.044), ('ratio', 0.043), ('minimization', 0.043), ('separation', 0.04), ('unnormalized', 0.04), ('kg', 0.04), ('mle', 0.04), ('machines', 0.04), ('lowest', 0.039), ('geometric', 0.039), ('vectors', 0.038), ('require', 0.038), ('eds', 0.037), ('iix', 0.037), ('mller', 0.037), ('invariance', 0.036), ('reported', 0.036), ('discriminant', 0.035), ('unitary', 0.035), ('interpolation', 0.035), ('polynomial', 0.034), ('fk', 0.033), ('nt', 0.033), ('regression', 0.033), ('normalized', 0.033), ('interpretation', 0.032), ('pc', 0.032), ('cortes', 0.032), ('optimization', 0.032), ('constraints', 0.031), ('classes', 0.031), ('rj', 0.031), ('ever', 0.03), ('margins', 0.028), ('traditionally', 0.028), ('vi', 0.028), ('scope', 0.028), ('traditional', 0.027), ('pair', 0.027), ('testing', 0.027), ('maximal', 0.027), ('min', 0.026), ('beyond', 0.026), ('separable', 0.026), ('maximized', 0.026), ('allow', 0.025), ('offer', 0.025), ('significantly', 0.025), ('internal', 0.025), ('points', 0.025), ('look', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
Author: Patrick Haffner
Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depend upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1
2 0.21391572 62 nips-2001-Duality, Geometry, and Support Vector Regression
Author: J. Bi, Kristin P. Bennett
Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1
3 0.14372608 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space).
4 0.14309511 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
Author: Manfred Opper, Robert Urbanczik
Abstract: Using methods of Statistical Physics, we investigate the role of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1
5 0.1351053 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
6 0.12856568 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
7 0.12204999 167 nips-2001-Semi-supervised MarginBoost
8 0.11705813 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
9 0.11514585 8 nips-2001-A General Greedy Approximation Algorithm with Applications
10 0.11243951 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
11 0.10731739 105 nips-2001-Kernel Machines and Boolean Functions
12 0.10325201 172 nips-2001-Speech Recognition using SVMs
13 0.10052975 139 nips-2001-Online Learning with Kernels
14 0.091925681 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
15 0.088640086 31 nips-2001-Algorithmic Luckiness
16 0.085535251 144 nips-2001-Partially labeled classification with Markov random walks
17 0.081669211 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
18 0.074899204 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
19 0.071639158 134 nips-2001-On Kernel-Target Alignment
20 0.071368903 60 nips-2001-Discriminative Direction for Kernel Classifiers
topicId topicWeight
[(0, -0.225), (1, 0.149), (2, -0.026), (3, 0.088), (4, 0.093), (5, 0.056), (6, 0.04), (7, -0.054), (8, -0.039), (9, 0.023), (10, 0.079), (11, 0.042), (12, 0.067), (13, 0.037), (14, 0.393), (15, 0.029), (16, -0.05), (17, -0.005), (18, 0.041), (19, 0.091), (20, -0.08), (21, -0.052), (22, -0.101), (23, 0.036), (24, 0.038), (25, -0.04), (26, 0.053), (27, 0.037), (28, -0.027), (29, -0.044), (30, 0.056), (31, -0.088), (32, 0.016), (33, 0.037), (34, 0.025), (35, -0.036), (36, 0.035), (37, 0.041), (38, 0.16), (39, 0.083), (40, -0.022), (41, -0.025), (42, 0.018), (43, -0.013), (44, 0.011), (45, 0.003), (46, -0.051), (47, 0.084), (48, -0.068), (49, -0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.95972973 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
Author: Patrick Haffner
Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depend upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1
2 0.76425958 62 nips-2001-Duality, Geometry, and Support Vector Regression
Author: J. Bi, Kristin P. Bennett
Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1
3 0.64033133 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space).
4 0.63155484 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schölkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1
5 0.54993755 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
6 0.5324834 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
7 0.51949191 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
8 0.51338851 120 nips-2001-Minimax Probability Machine
9 0.50444335 167 nips-2001-Semi-supervised MarginBoost
10 0.48735732 25 nips-2001-Active Learning in the Drug Discovery Process
11 0.48366734 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
12 0.41495085 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
13 0.38045484 180 nips-2001-The Concave-Convex Procedure (CCCP)
14 0.38034272 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
15 0.37308258 31 nips-2001-Algorithmic Luckiness
16 0.37214357 8 nips-2001-A General Greedy Approximation Algorithm with Applications
17 0.35593462 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
18 0.35041702 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
19 0.34374174 144 nips-2001-Partially labeled classification with Markov random walks
20 0.32667649 105 nips-2001-Kernel Machines and Boolean Functions
topicId topicWeight
[(14, 0.063), (17, 0.041), (19, 0.03), (20, 0.021), (27, 0.13), (30, 0.052), (36, 0.015), (38, 0.015), (44, 0.011), (59, 0.033), (63, 0.033), (72, 0.115), (75, 0.215), (79, 0.029), (83, 0.027), (91, 0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.85864437 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
Author: Patrick Haffner
Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depend upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1
2 0.69891459 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
3 0.69729698 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
4 0.69333446 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
5 0.69144291 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
6 0.69043243 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
7 0.6877557 62 nips-2001-Duality, Geometry, and Support Vector Regression
9 0.68696934 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
10 0.68558496 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
11 0.68414891 1 nips-2001-(Not) Bounding the True Error
12 0.6837219 155 nips-2001-Quantizing Density Estimators
13 0.68307352 40 nips-2001-Batch Value Function Approximation via Support Vectors
14 0.6823495 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
15 0.68233871 58 nips-2001-Covariance Kernels from Bayesian Generative Models
16 0.68058383 134 nips-2001-On Kernel-Target Alignment
17 0.67867929 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
18 0.67786598 114 nips-2001-Learning from Infinite Data in Finite Time
19 0.67670107 143 nips-2001-PAC Generalization Bounds for Co-training
20 0.67506605 95 nips-2001-Infinite Mixtures of Gaussian Process Experts