nips nips2001 nips2001-101 knowledge-graph by maker-knowledge-mining

101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms


Source: pdf

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non-linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real-world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation. The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space is meaningful in general (see [16]). (Footnote 1: For the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes.) A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation is that it has too much capacity compared to SVMs: the class of functions it can produce is too rich.
But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm. 2.1 Setting and definitions. The setting is that of a classical classification problem in R^n (the input space).

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Guided by an initial idea of building a complex (non-linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. [sent-9, score-0.794]

2 We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. [sent-10, score-0.05]

3 The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. [sent-11, score-0.214]

4 Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. [sent-12, score-0.064]

5 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. [sent-13, score-0.158]

6 The margin of SVMs has a nice geometric interpretation: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. [sent-14, score-0.708]

7 The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. [sent-15, score-0.77]

8 While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. [sent-16, score-0.74]

9 This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. [sent-18, score-0.44]

10 In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. [sent-20, score-0.664]

11 It is less clear whether maximizing the margin in this new space is meaningful in general (see [16]). [sent-21, score-0.1]

12 For the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. [sent-22, score-0.071]

13 A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). [sent-23, score-0.645]

14 We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. [sent-24, score-0.388]

15 Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. [sent-26, score-1.17]

16 Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? [sent-27, score-0.644]

17 Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! [sent-28, score-0.071]

18 One typical explanation is that it has too much capacity compared to SVMs: the class of functions it can produce is too rich. [sent-30, score-0.084]

19 But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well. [sent-31, score-0.023]

20 This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. [sent-34, score-0.099]

21 2.1 Setting and definitions. The setting is that of a classical classification problem in R^n (the input space). [sent-36, score-0.056]

23 We are given a training set of points and their corresponding class labels, taking values in {1, ..., M} where M is the number of different classes. [sent-45, score-0.191]

24 Barring duplicate inputs, the class labels associated to each training point define a partition of the training set into one subset per class. [sent-47, score-0.084]

25 The problem is to find a decision function that will generalize well on new points drawn from the same distribution as the training data. [sent-48, score-0.202]

26 (The indicator function referred to here takes the value 1 when its argument is true and 0 otherwise.) In the previous and following discussion, we often refer to the concept of a decision surface, also known as a decision boundary. [sent-52, score-0.268]

27 The decision function corresponding to a given algorithm defines, for any class c, two regions of the input space: the region of points assigned to class c and its complement. [sent-53, score-0.163]

28 The decision surface for class c is the “boundary” between those two regions, i.e. [sent-54, score-0.443]

29 the contour of that region, which can be seen as an (n-1)-dimensional manifold (a “surface” in R^n) possibly made of several disconnected components. [sent-56, score-0.192]

30 For simplicity, when we mention the decision surface in our discussion, we consider only the case of two-class discrimination, in which there is a single decision surface. [sent-57, score-0.624]

31 When we mention a test point, we mean a point that does not belong to the training set and for which the algorithm is to decide on a class. [sent-58, score-0.291]

32 The distance between a single point and a set of points is defined as the distance to the closest point of the set. [sent-62, score-0.168]

33 By K-Nearest Neighbor algorithm (KNN) we mean the following algorithm: the class of a test point is decided to be the same as the class appearing most frequently among the K-neighborhood of that point (its K nearest training points). [sent-65, score-0.32]
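
As a concrete illustration of the two definitions above (the distance from a single point to a set, and the KNN majority vote), here is a minimal NumPy sketch of the plain KNN rule. It is only an illustrative reconstruction; the function and variable names are ours, not the paper's.

```python
import numpy as np

def point_to_set_distance(x, S):
    """Distance from a single point x to a set of points S (rows of S):
    the Euclidean distance to the closest point of the set."""
    return np.min(np.linalg.norm(S - x, axis=1))

def knn_predict(x, X_train, y_train, K=1):
    """Plain K-Nearest Neighbor rule: the class of the test point x is the
    class appearing most frequently among its K nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)        # distances to all training points
    neighborhood = np.argsort(dists)[:K]               # indices of the K-neighborhood
    labels, counts = np.unique(y_train[neighborhood], return_counts=True)
    return labels[np.argmax(counts)]                   # majority vote
```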

34 Figure 1 illustrates a possible intuition about why SVMs outperform 1NN when we have a finite number of samples. [sent-68, score-0.072]

35 For classification tasks where the classes are considered to be mostly separable, we often like to think of each class as residing close to a lower-dimensional manifold (in the high-dimensional input space) which can reasonably be considered locally linear. [sent-69, score-0.379]

36 In the case of a finite number of samples, “missing” samples would appear as “holes” introducing artifacts in the decision surface produced by classical Nearest Neighbor algorithms. [sent-70, score-0.432]

37 Thus the decision surface, while having the largest possible local margin with regard to the training points, is likely to have a poor (small) local margin with respect to yet-unseen samples falling close to the locally linear manifold, and will thus result in poor generalization performance. [sent-71, score-0.62]

38 It is interesting to notice, if the assumption of locally linear class manifolds holds, how the 1NN solution approaches the SVM solution in the limit as we increase the number of samples. [sent-73, score-0.235]

39 To fix this problem, the idea is to somehow fantasize the missing points, based on a local linear approximation of the manifold of each class. [sent-74, score-0.23]

40 This leads to modified Nearest Neighbor algorithms described in the next sections. [sent-75, score-0.027]

41 By “mostly separable” we mean that the Bayes error is almost zero, and the optimal decision surface has not too many disconnected components. [sent-76, score-0.39]

42 That is, each class has a probability density with a “support” that is a lower-dimensional manifold, with the probability quickly fading away from this support. [sent-79, score-0.109]

43 Note that although we never generate the “fantasy” points explicitly, the proposed algorithms are really equivalent to classical 1NN with fantasized points. [sent-80, score-0.186]

44 We shall thus consider, for each class c, the local affine subspace that passes through the points of the K-c neighborhood of x, i.e. the K training points of class c that are nearest to x. [sent-83, score-0.295]

45 This affine subspace is typically (K-1)-dimensional or less, and we will somewhat abusively call it the “local hyperplane”. [sent-84, score-0.094]

46 Formally, the local hyperplane can be defined as LH_c(x) = { p | p = sum_{k=1..K} alpha_k N_k, sum_{k=1..K} alpha_k = 1 }, (1) where N_1, ..., N_K are the points of the K-c neighborhood of x. [sent-85, score-0.254]

47 Another way to define this hyperplane, which gets rid of the constraint that the coefficients sum to one, is to take a reference point within the hyperplane as an origin, for instance the centroid N_bar of the K neighbors. [sent-86, score-0.285]

48 This same hyperplane can then be expressed as LH_c(x) = { p | p = N_bar + sum_{k=1..K} alpha_k V_k }, (2) where V_k = N_k - N_bar and no constraint is left on the coefficients alpha_k. [sent-87, score-0.18]

49 Our modified nearest neighbor algorithm then associates a test point x to the class c whose local hyperplane LH_c(x) is closest to x. [sent-90, score-0.729]

50 Formally, the prediction is the class minimizing the distance d(x, LH_c(x)); this distance to the local hyperplane is logically called the K-local Hyperplane Distance, hence the name K-local Hyperplane Distance Nearest Neighbor algorithm (HKNN in short). [sent-91, score-0.228]

51 Computing, for each class c, the distance d(x, LH_c(x)) from the test point to the local hyperplane, i.e. the minimum over the coefficients alpha_k of ||x - N_bar - sum_k alpha_k V_k|| (3), [sent-92, score-0.084]

52 amounts to solving a linear system in the alpha_k, which can easily be expressed in matrix form (4), where alpha is the column vector of the K coefficients, x - N_bar is an n-dimensional column vector, and V is the n x K matrix whose columns are the vectors V_k defined earlier. [sent-93, score-0.129]

53 Strictly speaking, a hyperplane in an n-dimensional input space is an (n-1)-dimensional affine subspace, while our “local hyperplanes” can have fewer dimensions. [sent-94, score-0.278]

54 We could be using one of the neighbors as the reference point, but this formulation with the centroid will prove useful later. [sent-95, score-0.129]

55 But we are interested in the distance d(x, LH_c(x)), not in the coefficients alpha themselves, so any solution will do. [sent-97, score-0.026]
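
Putting equations (1)-(4) together, the following NumPy sketch shows one way the basic HKNN rule can be implemented: for each class, take the K nearest neighbors of x within that class, center them on their centroid, solve the least-squares system for the coefficients, and return the class whose local hyperplane is closest. This is our illustrative reading of the algorithm, not the authors' code, and the names (local_hyperplane_distance, hknn_predict) are ours.

```python
import numpy as np

def local_hyperplane_distance(x, N):
    """Distance from x to the local hyperplane through the rows of N (eq. 2):
    minimize ||(x - centroid) - V @ alpha|| over alpha (eqs. 3-4)."""
    centroid = N.mean(axis=0)
    V = (N - centroid).T                             # columns are the vectors V_k
    b = x - centroid
    alpha, *_ = np.linalg.lstsq(V, b, rcond=None)    # any least-squares solution will do
    return np.linalg.norm(b - V @ alpha)

def hknn_predict(x, X_train, y_train, K=10):
    """HKNN: assign x to the class whose local hyperplane, built from the
    K nearest training points of that class, is closest to x."""
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]   # the K-c neighborhood
        d = local_hyperplane_distance(x, Xc[idx])
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class
```

The least-squares solver mirrors the remark above: when the neighbors are nearly colinear the coefficients are not unique, but the distance (the residual) still is.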

56 Links with other paradigms. The proposed HKNN algorithm is very similar in spirit to the Tangent Distance Algorithm [13]. [sent-100, score-0.073]

57 The local hyperplane can be seen as a tangent hyperplane representing a set of local directions of transformation (any linear combination of the vectors V_k) that do not affect the class identity. [sent-101, score-0.508]

58 The main difference is that in HKNN these invariances are inferred directly from the local neighborhood in the training set, whereas in Tangent Distance, they are based on prior knowledge. [sent-103, score-0.228]

59 We should also mention close similarities between our approach and the recently proposed Local Linear Embedding [11] method for dimensionality reduction. [sent-106, score-0.047]

60 The idea of fantasizing points around the training points in order to define the decision surface is also very close to methods based on estimating the class-conditional input density [14, 4]. [sent-107, score-0.61]

61 From this point of view, the algorithm is very similar to the Nearest Feature Line (NFL) [8] method. [sent-109, score-0.076]

62 They differ in the fact that NFL considers all pairs of points within a class for its search rather than the local neighbors, thus looking at a large number of lines, i.e. [sent-110, score-0.142]

63 two-point, one-dimensional affine subspaces, rather than at a single higher-dimensional one. [sent-112, score-0.134]

64 3 Fixing the basic HKNN algorithm. [sent-113, score-0.079]

65 3.1 Problem arising for large K. One problem with the basic HKNN algorithm, as previously described, arises as we increase the value of K, i.e. [sent-114, score-0.056]

66 the number of points considered in the neighborhood of the test point. [sent-116, score-0.155]

67 In a typical high-dimensional setting, exact colinearities between input patterns are rare, which means that as soon as K exceeds the input dimension, any pattern of the input space (including nonsensical ones) can be produced by a linear combination of the neighbors. [sent-117, score-0.179]

68 The “actual” dimensionality of the manifold may be much less than K. [sent-118, score-0.094]

69 This is due to “near-colinearities” producing directions, associated with small eigenvalues of the covariance matrix, that are but noise; these can lead the algorithm to mistake noise directions for “invariances”, and may hurt its performance even for smaller values of K. [sent-119, score-0.106]

70 Another related issue is that the linear approximation of the class manifold by a hyperplane is valid only locally, so we might want to restrict the “fantasizing” of class members to a smaller region of the hyperplane. [sent-120, score-0.505]

71 3.2 The convex hull solution. One way to avoid the above-mentioned problems is to restrict ourselves to considering the convex hull of the neighbors, rather than the whole hyperplane they support (of which the convex hull is a subset). [sent-123, score-0.634]

72 Unlike the problem of computing the distance to the hyperplane, the distance to the convex hull cannot be found by solving a simple linear system, but typically requires solving a quadratic programming problem (very similar to the one of SVMs). [sent-125, score-0.432]

73 While this is more complex to implement, it should be mentioned that the problems to be solved are of a relatively small dimension, of order K, and that the time of the whole algorithm will very likely still be dominated by the search for the nearest neighbors within each class. [sent-127, score-0.319]

74 This algorithm will be referred to as K-local Convex Distance Nearest Neighbor Algorithm (CKNN in short). [sent-128, score-0.048]
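
The convex-hull distance can be sketched as the small quadratic program min_alpha ||x - N' alpha||^2 subject to alpha >= 0 and sum(alpha) = 1, here solved with SciPy's general-purpose SLSQP solver. This is only an assumed formulation for illustration; the paper does not prescribe a particular solver, and a dedicated QP package would be more robust.

```python
import numpy as np
from scipy.optimize import minimize

def convex_hull_distance(x, N):
    """Distance from x to the convex hull of the rows of N (the K neighbors):
    minimize ||x - N.T @ alpha|| subject to alpha >= 0 and sum(alpha) = 1."""
    K = N.shape[0]

    def objective(alpha):
        diff = x - N.T @ alpha
        return diff @ diff

    constraints = ({"type": "eq", "fun": lambda a: np.sum(a) - 1.0},)
    bounds = [(0.0, None)] * K
    alpha0 = np.full(K, 1.0 / K)                     # start at the centroid
    result = minimize(objective, alpha0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return np.linalg.norm(x - N.T @ result.x)
```

Dropping this function into the per-class loop of the HKNN sketch above, in place of local_hyperplane_distance, gives the corresponding CKNN classifier.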

75 3.3 The “weight decay” penalty solution. This consists in incorporating a penalty term into equation (3) to penalize large values of the coefficients alpha. [sent-130, score-0.1]

76 The resulting algorithm is a generalization of HKNN (basic HKNN corresponds to lambda = 0). [sent-134, score-0.048]
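
Assuming the penalty is a quadratic (ridge-style) term lambda * ||alpha||^2 added to the squared distance of equation (3), which is one natural reading of “weight decay”, the regularized coefficients solve a slightly modified linear system; the exact penalty used in the paper may differ, so treat this as a sketch.

```python
import numpy as np

def local_hyperplane_distance_wd(x, N, lam=1.0):
    """Weight-decay variant of the local hyperplane distance: penalizing large
    coefficients keeps noisy, nearly colinear directions from being mistaken
    for invariances. Solves (V'V + lam*I) alpha = V'(x - centroid);
    lam = 0 recovers the basic HKNN system (use the lstsq version above
    if V'V is singular in that case)."""
    centroid = N.mean(axis=0)
    V = (N - centroid).T
    b = x - centroid
    K = V.shape[1]
    alpha = np.linalg.solve(V.T @ V + lam * np.eye(K), V.T @ b)
    return np.linalg.norm(b - V @ alpha)
```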

77 4 Experimental results. We performed a number of experiments to highlight different properties of the algorithms: a first 2D toy example (see Figure 2) graphically illustrates the qualitative differences in the decision surfaces produced by KNN, linear SVM and CKNN. [sent-135, score-0.27]

78 Table 1 gives quantitative results on two real-world digit OCR tasks, allowing a comparison of the performance of the different old and new algorithms. [sent-136, score-0.046]

79 Figure 3 illustrates the problem arising with large K, mentioned in Section 3, and shows that the two proposed solutions, CKNN and HKNN with an added weight decay lambda, allow it to be overcome. [sent-137, score-0.131]

80 In our final experiment, we wanted to see if the good performance of the new algorithms absolutely depended on having all the training points at hand, as this has a direct impact on speed. [sent-138, score-0.134]

81 So we checked what performance we could get out of HKNN and CKNN when using only a small but representative subset of the training points, namely the set of support vectors found by a Gaussian Kernel SVM. [sent-139, score-0.06]

82 The results obtained for MNIST are given in Table 2, and look very encouraging. [sent-140, score-0.022]

83 HKNN appears to be able to perform as well as or better than SVMs without requiring more data points than they do. [sent-141, score-0.068]

84 Figure 2: 2D illustration of the decision surfaces produced by KNN (left, K=1), linear SVM (middle), and CKNN (right, K=2). [sent-142, score-0.244]

85 CKNN doesn’t suffer from this, but keeps the objective of maximizing the margin locally. [sent-144, score-0.1]

86 5 Conclusion. From a few geometrical intuitions, we have derived two modified versions of the KNN algorithm that look very promising. [sent-145, score-0.121]

87 HKNN is especially attractive: it is very simple to implement on top of a KNN system, as it only requires the additional step of solving a small and simple linear system, and appears to greatly boost the performance of standard KNN even above the level of SVMs. [sent-146, score-0.062]

88 The proposed algorithms share the advantages of KNN (no training required, ideal for fast adaptation, natural handling of the multi-class case) and its drawbacks (requires large memory, slow testing). [sent-147, score-0.066]

89 However, our latest results also indicate the possibility of substantially reducing the reference set in memory without losing accuracy. [sent-148, score-0.03]

90 This suggests that the algorithm indeed captures essential information in the data, and that our initial intuition on the nature of the flaw of KNN may well be at least partially correct. [sent-149, score-0.094]

91 The multi-class measure problem in nearest neighbour discrimination rules. [sent-205, score-0.193]

92 Transformation invariance in pattern recognition — tangent distance and tangent propagation. [sent-236, score-0.276]

93 Is the maximal margin hyperplane special in a feature space? [sent-249, score-0.32]

94 Table 1: Test-error obtained on the USPS and MNIST digit classification tasks by KNN, SVM (using a Gaussian Kernel), HKNN and CKNN. [sent-251, score-0.06]

95 (Figure 3 legend: CKNN, basic HKNN, HKNN with lambda=1, HKNN with lambda=10.) [sent-265, score-0.031]

96 As can be seen, the basic HKNN algorithm performs poorly for large values of K. [sent-276, score-0.109]

97 As expected, CKNN is relatively unaffected by this problem, and HKNN can be made robust through the added “weight decay” penalty controlled by lambda. [sent-277, score-0.037]

98 Table 2: Test-error obtained on MNIST with HKNN and CKNN when using a reduced training set made of the 16712 support vectors retained by the best Gaussian Kernel SVM. [sent-278, score-0.06]

99 This corresponds to 28% of the initial 60000 training patterns. [sent-279, score-0.039]

100 But here, the hyper-parameters K and lambda were chosen with the test set, as we did not have a separate validation set in this setting. [sent-281, score-0.078]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hknn', 0.543), ('knn', 0.362), ('cknn', 0.294), ('neighbor', 0.256), ('surface', 0.225), ('nearest', 0.193), ('hyperplane', 0.18), ('decision', 0.134), ('cc', 0.126), ('distance', 0.112), ('margin', 0.1), ('svm', 0.095), ('manifold', 0.094), ('svms', 0.086), ('mnist', 0.086), ('class', 0.084), ('tangent', 0.082), ('closest', 0.075), ('local', 0.074), ('classi', 0.074), ('points', 0.068), ('hull', 0.067), ('dimensional', 0.067), ('locally', 0.066), ('af', 0.06), ('neighbors', 0.052), ('geometrical', 0.051), ('convex', 0.05), ('invariances', 0.05), ('produced', 0.048), ('algorithm', 0.048), ('mention', 0.047), ('centroid', 0.047), ('intuition', 0.046), ('test', 0.045), ('fantasized', 0.045), ('fantasizing', 0.045), ('nfl', 0.045), ('ne', 0.043), ('neighborhood', 0.042), ('maximal', 0.04), ('holes', 0.039), ('training', 0.039), ('cation', 0.037), ('penalty', 0.037), ('tasks', 0.037), ('building', 0.036), ('euclidean', 0.036), ('ux', 0.036), ('bottou', 0.036), ('lambda', 0.036), ('modi', 0.036), ('hyper', 0.033), ('fixing', 0.033), ('usps', 0.033), ('linear', 0.033), ('nite', 0.032), ('rw', 0.031), ('disconnected', 0.031), ('decided', 0.031), ('decay', 0.031), ('input', 0.031), ('basic', 0.031), ('reference', 0.03), ('restrict', 0.03), ('poorly', 0.03), ('directions', 0.029), ('solving', 0.029), ('missing', 0.029), ('wx', 0.029), ('surfaces', 0.029), ('de', 0.028), ('point', 0.028), ('algorithms', 0.027), ('subspace', 0.027), ('solution', 0.026), ('illustrates', 0.026), ('mentioned', 0.026), ('transformation', 0.026), ('away', 0.025), ('kernel', 0.025), ('arising', 0.025), ('spirit', 0.025), ('classical', 0.025), ('smallest', 0.023), ('trick', 0.023), ('old', 0.023), ('separable', 0.023), ('inferred', 0.023), ('digit', 0.023), ('correctly', 0.023), ('capacity', 0.023), ('overcome', 0.023), ('non', 0.023), ('extending', 0.022), ('alternatively', 0.022), ('look', 0.022), ('really', 0.021), ('notion', 0.021), ('support', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm. 2.1 Setting and definitions. The setting is that of a classical classification problem in R^n (the input space).

2 0.22531559 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

Author: Carlotta Domeniconi, Dimitrios Gunopulos

Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets.

3 0.14372608 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

4 0.11339685 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

Author: Olivier Chapelle, Bernhard Schölkopf

Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1

5 0.10839516 46 nips-2001-Categorization by Learning and Combining Object Parts

Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio

Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.

6 0.10611688 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

7 0.10111759 144 nips-2001-Partially labeled classification with Markov random walks

8 0.10078795 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

9 0.099083081 105 nips-2001-Kernel Machines and Boolean Functions

10 0.095021755 62 nips-2001-Duality, Geometry, and Support Vector Regression

11 0.09193904 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.091785006 8 nips-2001-A General Greedy Approximation Algorithm with Applications

13 0.091621414 25 nips-2001-Active Learning in the Drug Discovery Process

14 0.083119579 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

15 0.082113177 60 nips-2001-Discriminative Direction for Kernel Classifiers

16 0.081354715 84 nips-2001-Global Coordination of Local Linear Models

17 0.076499715 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family

18 0.075886518 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.075655766 139 nips-2001-Online Learning with Kernels

20 0.074431241 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.221), (1, 0.122), (2, -0.038), (3, 0.082), (4, 0.026), (5, 0.077), (6, -0.038), (7, -0.052), (8, 0.01), (9, 0.04), (10, 0.088), (11, -0.035), (12, 0.034), (13, -0.007), (14, 0.281), (15, 0.015), (16, 0.0), (17, -0.017), (18, 0.022), (19, 0.048), (20, -0.106), (21, -0.015), (22, 0.137), (23, -0.041), (24, -0.036), (25, 0.064), (26, 0.054), (27, 0.086), (28, -0.032), (29, 0.015), (30, 0.017), (31, 0.085), (32, 0.051), (33, 0.088), (34, -0.016), (35, -0.184), (36, 0.037), (37, 0.045), (38, -0.014), (39, 0.09), (40, 0.013), (41, -0.116), (42, 0.046), (43, 0.07), (44, -0.027), (45, 0.04), (46, 0.07), (47, 0.046), (48, -0.05), (49, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94867194 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm. 2.1 Setting and definitions. The setting is that of a classical classification problem in R^n (the input space).

2 0.80290645 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

Author: Carlotta Domeniconi, Dimitrios Gunopulos

Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets.

3 0.68763024 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

4 0.56368977 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

Author: Olivier Chapelle, Bernhard Schölkopf

Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1

5 0.55340952 25 nips-2001-Active Learning in the Drug Discovery Process

Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen

Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.

6 0.52099293 144 nips-2001-Partially labeled classification with Markov random walks

7 0.50907522 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

8 0.48702231 62 nips-2001-Duality, Geometry, and Support Vector Regression

9 0.47390214 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

10 0.46981326 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

11 0.45654359 60 nips-2001-Discriminative Direction for Kernel Classifiers

12 0.42428511 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

13 0.40773737 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

14 0.40577409 84 nips-2001-Global Coordination of Local Linear Models

15 0.40167123 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

16 0.36694527 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

17 0.36545846 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

18 0.36158857 22 nips-2001-A kernel method for multi-labelled classification

19 0.35238105 185 nips-2001-The Method of Quantum Clustering

20 0.35142508 120 nips-2001-Minimax Probability Machine


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.04), (17, 0.016), (19, 0.033), (27, 0.101), (30, 0.061), (36, 0.012), (38, 0.019), (59, 0.029), (63, 0.011), (72, 0.453), (79, 0.024), (83, 0.021), (91, 0.098)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99160081 21 nips-2001-A Variational Approach to Learning Curves

Author: Dörthe Malzahn, Manfred Opper

Abstract: We combine the replica approach from statistical physics with a variational approach to analyze learning curves analytically. We apply the method to Gaussian process regression. As a main result we derive approximative relations between empirical error measures, the generalization error and the posterior variance.

2 0.96997434 53 nips-2001-Constructing Distributed Representations Using Additive Clustering

Author: Wheeler Ruml

Abstract: If the promise of computational modeling is to be fully realized in higherlevel cognitive domains such as language processing, principled methods must be developed to construct the semantic representations used in such models. In this paper, we propose the use of an established formalism from mathematical psychology, additive clustering, as a means of automatically constructing binary representations for objects using only pairwise similarity data. However, existing methods for the unsupervised learning of additive clustering models do not scale well to large problems. We present a new algorithm for additive clustering, based on a novel heuristic technique for combinatorial optimization. The algorithm is simpler than previous formulations and makes fewer independence assumptions. Extensive empirical tests on both human and synthetic data suggest that it is more effective than previous methods and that it also scales better to larger problems. By making additive clustering practical, we take a significant step toward scaling connectionist models beyond hand-coded examples.

3 0.9603188 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

Author: Jeff Bilmes, Gang Ji, Marina Meila

Abstract: In this work, we introduce an information-theoretic based correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.

same-paper 4 0.95170742 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering that it has infinite capacity (VC-dimension), 1NN still performs quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm. 2.1 Setting and definitions. The setting is that of a classical classification problem in R^n (the input space).

5 0.85742867 40 nips-2001-Batch Value Function Approximation via Support Vectors

Author: Thomas G. Dietterich, Xin Wang

Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1

6 0.80714887 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

7 0.76536119 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

8 0.74782813 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

9 0.73680484 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes

10 0.73011494 1 nips-2001-(Not) Bounding the True Error

11 0.7237587 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

12 0.7069695 79 nips-2001-Gaussian Process Regression with Mismatched Models

13 0.7033788 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

14 0.69479555 155 nips-2001-Quantizing Density Estimators

15 0.6917057 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

16 0.68778336 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

17 0.68704575 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

18 0.68653882 61 nips-2001-Distribution of Mutual Information

19 0.67955351 25 nips-2001-Active Learning in the Drug Discovery Process

20 0.67793584 128 nips-2001-Multiagent Planning with Factored MDPs