nips nips2001 nips2001-62 knowledge-graph by maker-knowledge-mining

62 nips-2001-Duality, Geometry, and Support Vector Regression


Source: pdf

Author: J. Bi, Kristin P. Bennett

Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We develop an intuitive geometric framework for support vector regression (SVR). [sent-4, score-0.257]

2 By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. [sent-5, score-0.3]

3 Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. [sent-6, score-1.529]

4 A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. [sent-7, score-0.469]

5 Maximizing the margin corresponds to shrinking the effective ε-tube. [sent-8, score-0.116]

6 Intuitive geometric formulations exist for the classification case addressing both the error metric and capacity control [1, 2]. [sent-11, score-0.182]

7 For linearly separable classification, the primal SVM finds the separating plane with maximum hard margin between two sets. [sent-12, score-0.94]

8 The equivalent dual SVM computes the closest points in the convex hulls of the data from each class. [sent-13, score-1.111]

9 For the inseparable case, the primal SVM optimizes the soft margin of separation between the two classes. [sent-14, score-0.448]

10 The corresponding dual SVM finds the closest points in the reduced convex hulls. [sent-15, score-0.819]

11 From the primal perspective, a linear function with no residuals greater than ε corresponds to an ε-tube constructed about the data in the space of the data attributes and the response variable [6] (see e. [sent-18, score-0.279]

12 The primary contribution of this work is a novel geometric interpretation of SVR from the dual perspective along with a mathematically rigorous derivation of the geometric concepts. [sent-21, score-0.453]

13 With duality analysis, the existence of a hard ε-tube depends on the separability of two sets. [sent-24, score-0.346]

14 The two sets consist of the training data augmented with the response variable shifted up and down by ε. [sent-25, score-0.204]

15 In the dual space, regression becomes the classification problem of distinguishing between these two sets. [sent-26, score-0.339]

16 The geometric formulations developed for the classification case [1] become applicable to the regression case. [sent-27, score-0.209]

17 We call the resulting formulation convex SVR (C-SVR) since it is based on convex hulls of the augmented training data. [sent-28, score-1.052]

18 Much like in SVM classification, to compute a hard ε-tube, C-SVR computes the nearest points in the convex hulls of the augmented classes. [sent-29, score-1.049]

19 The corresponding maximum margin (max-margin) planes define the effective ε-tube. [sent-30, score-0.168]

20 Similarly, to compute a soft ε-tube, reduced-convex SVR (RC-SVR) finds the closest points in the reduced convex hulls of the two augmented sets. [sent-32, score-1.153]

21 Figure 1: The (a) primal hard ε₀-tube, and dual cases: (b) dual strictly separable, (c) dual separable ε = ε₀, and (d) dual inseparable ε < ε₀. [sent-42, score-1.617]

22 For ε > 0, SVR constructs a regression model that minimizes some empirical risk measure regularized to control capacity. [sent-43, score-0.197]

23 Plotting the points in (x, y) space as in Figure 1(a), we see that for a “perfect” regression model the data fall in a hard ε-tube about the regression line. [sent-46, score-0.453]

24 Let (Xi , yi) be an example where i = 1, 2, · · · , m, Xi is the ith predictor vector, and yi is its response. [sent-47, score-0.15]

25 A hard ε-tube for a fixed ε > 0 is defined as a plane y = w x + b satisfying −εe ≤ y − Xw − be ≤ εe, where e is an m-dimensional vector of ones. [sent-49, score-0.532]

26 Clearly, for large enough ε such a tube always exists for finite data. [sent-51, score-0.328]

27 −εe ≤ y − Xw − be ≤ εe (1) Note that the smallest tube is typically not the ε-SVR solution. [sent-54, score-0.298]
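
As a quick illustration of condition (1), the following helper (a minimal Python/NumPy sketch, not part of the paper; the function name is hypothetical) checks whether a candidate plane y = w x + b is a hard ε-tube for a given data set:

```python
import numpy as np

def is_hard_eps_tube(X, y, w, b, eps):
    """Condition (1): -eps*e <= y - Xw - b*e <= eps*e, i.e. every residual
    has magnitude at most eps. Illustrative helper, not from the paper."""
    residuals = y - X @ w - b
    return bool(np.all(np.abs(residuals) <= eps))
```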

28 D+ = {(Xi , yi + ε), i = 1, · · · , m} and D− = {(Xi , yi − ε), i = 1, · · · , m}. [sent-57, score-0.248]

29 For any fixed ε > 0, there are three possible cases: ε > ε₀, in which strict hard ε-tubes exist; ε = ε₀, in which only ε₀-tubes exist; and ε < ε₀, in which no hard ε-tubes exist. [sent-59, score-0.423]

30 A strict hard ε-tube with no points on the edges of the tube only exists for ε > ε₀. [sent-60, score-0.655]

31 Figure 1(b-d) illustrates what happens in the dual space for each case. [sent-61, score-0.238]

32 The convex hulls of D+ and D− are drawn along with the max-margin plane in (b) and the supporting plane in (c) for separating the convex hulls. [sent-62, score-1.834]

33 Clearly, the existence of the tube is directly related to the separability of D + and D− . [sent-63, score-0.356]

34 If ε > ε₀, then a strict tube exists and the convex hulls of D+ and D− are strictly separable¹. [sent-64, score-1.104]

35 One can see that the max-margin plane separating D+ and D− corresponds to one such tube. [sent-66, score-0.478]

36 In fact this plane forms an ε̂-tube where ε > ε̂ ≥ ε₀. [sent-67, score-0.609]

37 If ε = ε₀, then the convex hulls of D+ and D− are separable but not strictly separable. [sent-68, score-0.815]

38 The plane that separates the two convex hulls forms the ε₀-tube. [sent-69, score-1.04]

39 It is easy to show by construction that if a hard ε-tube exists for a given ε > 0 then the convex hulls of D+ and D− will be separable. [sent-72, score-0.922]

40 If a hard ε-tube exists, then there exists (w, b) such that (y + εe) − Xw − be ≥ 0, (y − εe) − Xw − be ≤ 0. (2) [sent-73, score-0.251]

41 For any convex combination of D+, i.e., (X u, (y+εe) u) with e u = 1, u ≥ 0, of the points (Xi , yi + ε), i = 1, 2, · · · , m, we have (y + εe) u − w (X u) − b ≥ 0. [sent-74, score-0.505]

42 Similarly, for any convex combination of D−, i.e., (X v, (y−εe) v) with e v = 1, v ≥ 0, of the points (Xi , yi − ε), i = 1, 2, · · · , m, we have (y − εe) v − w (X v) − b ≤ 0. [sent-75, score-0.187]

43 Then the plane y = w x + b in the ε-tube separates the two convex hulls. [sent-76, score-0.687]

44 Note the separating plane and the ε-tube plane are the same. [sent-77, score-0.789]

45 If no separating plane exists, then there is no tube. [sent-78, score-0.445]

46 Theorem 1 (Conditions for existence of hard ε-tube): A hard ε-tube exists for a given ε > 0 if and only if the following system in (u, v) has no solution: X u = X v, e u = e v = 1, (y + εe) u − (y − εe) v < 0, u ≥ 0, v ≥ 0. (3) [sent-81, score-0.48]

47 Proof: A hard ε-tube exists if and only if System (2) has a solution. [sent-82, score-0.251]

48 ¹We use the following definitions of separation of convex sets. [sent-85, score-0.318]

49 A plane H = {x : w x = α} is said to separate D + and D− if w x ≥ α, ∀x ∈ D + and w x ≤ α, ∀x ∈ D − . [sent-87, score-0.372]

50 So as a consequence of this theorem, if D+ and D− are separable, then a hard ε-tube exists. [sent-92, score-0.188]
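
The condition in Theorem 1 can be tested numerically with a small linear program: minimize (y + εe) u − (y − εe) v over the constraint set of system (3) and check the sign of the optimum. The sketch below uses SciPy's linprog; it is one possible way to check tube existence under the stated setup, not an algorithm given in the paper, and the function name is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def hard_tube_exists(X, y, eps):
    """A hard eps-tube exists iff system (3) has NO solution, i.e. iff
    min (y+eps*e)'u - (y-eps*e)'v  s.t.  X'u = X'v, e'u = e'v = 1, u, v >= 0
    is nonnegative."""
    m, n = X.shape
    c = np.concatenate([y + eps, -(y - eps)])    # objective over z = [u; v]
    A_eq = np.zeros((n + 2, 2 * m))
    A_eq[:n, :m], A_eq[:n, m:] = X.T, -X.T       # X'u - X'v = 0
    A_eq[n, :m] = 1.0                            # e'u = 1
    A_eq[n + 1, m:] = 1.0                        # e'v = 1
    b_eq = np.concatenate([np.zeros(n), [1.0, 1.0]])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (2 * m))
    return res.status == 0 and res.fun >= -1e-9  # optimum >= 0 => (3) has no solution
```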

51 In hard-margin SVM classification, the dual SVM formulation constructs the max-margin plane by finding the two nearest points in the convex hulls of the two classes. [sent-99, score-1.422]

52 The max-margin plane is the plane bisecting these two points. [sent-100, score-0.688]

53 We know that the existence of the tube is linked to the separability of the shifted sets, D + and D− . [sent-101, score-0.457]

54 The key insight is that the regression problem can be regarded as a classification problem between D + and D− . [sent-102, score-0.138]

55 The only significant difference occurs along the y dimension, as the response variable y is shifted up by ε in D+ and down by ε in D−. [sent-104, score-0.167]

56 For ε > ε₀, the max-margin separating plane corresponds to a hard ε̂-tube where ε > ε̂ ≥ ε₀. [sent-105, score-0.666]

57 The resulting tube is smaller than ε but not necessarily the smallest tube. [sent-106, score-0.326]

58 Figure 1(b) shows the max-margin plane found for ε > ε₀. [sent-107, score-0.344]

59 As in classification, we will have a hard and a soft ε-tube case. [sent-109, score-0.344]

60 The soft ε-tube with ε ≤ ε₀ is used to obtain good generalization when there are outliers. [sent-110, score-0.156]

61 The hard ε-tube case: We now apply the dual convex hull method to constructing the max-margin plane for our augmented sets D+ and D−, assuming they are strictly separable, i.e., ε > ε₀. [sent-112, score-1.332]

62 The closest points of D+ and D− can be found by solving the following dual C-SVR quadratic program: min over (u, v) of (1/2) ‖ (X u, (y+εe) u) − (X v, (y−εe) v) ‖², subject to e u = 1, e v = 1, u ≥ 0, v ≥ 0. (4) [sent-116, score-0.456]

63 Let the closest points in the convex hulls of D+ and D− be c = (X û, (y+εe) û) and d = (X v̂, (y−εe) v̂), where (û, v̂) solves (4). [sent-118, score-0.848]

64 The max-margin separating plane bisects these two points. [sent-120, score-0.487]

65 The normal (ŵ, δ̂) of the plane is the difference between them, i.e., ŵ = X û − X v̂ and δ̂ = (y+εe) û − (y−εe) v̂. [sent-121, score-0.344]

66 The threshold b̂ is the distance from the origin to the point halfway between the two closest points along the normal: b̂ = ŵ (X û + X v̂)/2 + δ̂ (y û + y v̂)/2. [sent-124, score-0.203]

67 The separating plane has the equation ŵ x + δ̂ y − b̂ = 0. [sent-125, score-0.445]

68 Rescaling this plane yields the regression function. [sent-126, score-0.445]
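
The construction in sentences 62–68 can be written down directly as a small quadratic program over the two shifted point sets. The sketch below (Python with CVXPY; the helper name c_svr_fit is hypothetical, and strict separability, i.e. ε > ε₀, is assumed) finds the closest points in the convex hulls of D+ and D−, forms the bisecting plane, and rescales it into a regression function:

```python
import numpy as np
import cvxpy as cp

def c_svr_fit(X, y, eps):
    """Hard eps-tube C-SVR sketch: solve dual QP (4) for the closest points of
    conv(D+) and conv(D-), take their bisecting plane, and rescale it into the
    regression function y = w.x + b. Assumes eps > eps_0 (strict separability)."""
    m, n = X.shape
    Dp = np.hstack([X, (y + eps).reshape(-1, 1)])   # rows of D+ = (X_i, y_i + eps)
    Dm = np.hstack([X, (y - eps).reshape(-1, 1)])   # rows of D- = (X_i, y_i - eps)

    u = cp.Variable(m, nonneg=True)
    v = cp.Variable(m, nonneg=True)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(Dp.T @ u - Dm.T @ v)),
               [cp.sum(u) == 1, cp.sum(v) == 1]).solve()

    c, d = Dp.T @ u.value, Dm.T @ v.value           # closest points in the two hulls
    normal = c - d                                  # the normal (w_hat, delta_hat)
    w_hat, delta_hat = normal[:n], normal[n]
    b_hat = normal @ (c + d) / 2.0                  # threshold of the bisecting plane

    w, b = -w_hat / delta_hat, b_hat / delta_hat    # rescale w_hat.x + delta_hat*y = b_hat
    return w, b, (u.value, v.value, w_hat, delta_hat, b_hat)
```

The division by δ̂ is safe under the stated assumption: since D+ is D− translated upward by 2ε, the y-component of the normal joining the two closest points is strictly positive whenever the hulls are strictly separable.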

69 Dual C-SVR (4) can be derived by taking the Wolfe or Lagrangian dual [4] of primal C-SVR (5) and simplifying. [sent-137, score-0.414]

70 We prove that the optimal plane from C-SVR bisects the ε̂-tube. [sent-138, score-0.386]

71 The supporting planes for class D+ and class D− determine the lower and upper edges of the ε̂-tube, respectively. [sent-139, score-0.213]

72 The support vectors from D+ and D− correspond to the points along the lower and upper edges of the ε̂-tube. [sent-140, score-0.152]

73 Theorem 1 (C-SVR constructs ε̂-tube): Let the max-margin plane obtained by C-SVR (4) be ŵ x + δ̂ y − b̂ = 0, where ŵ = X û − X v̂, δ̂ = (y+εe) û − (y−εe) v̂, and b̂ = ŵ (X û + X v̂)/2 + δ̂ (y û + y v̂)/2. [sent-143, score-0.411]

74 If ε > ε₀, then the plane y = w x + b corresponds to an ε̂-tube of the training data (Xi , yi), i = 1, 2, · · · , m, where w = −ŵ/δ̂, b = b̂/δ̂, and ε̂ = ε − (α̂ − β̂)/(2δ̂) < ε. [sent-144, score-0.377]

75 By the Wolfe duality theorem [4], α̂ − β̂ > 0, since the objective value of (5) and the negative of the objective value of (4) are equal at optimality. [sent-146, score-0.162]

76 By complementarity, the closest points are right on the margin planes ŵ x + δ̂ y − α̂ = 0 and ŵ x + δ̂ y − β̂ = 0 respectively, so α̂ = ŵ X û + δ̂ (y + εe) û and β̂ = ŵ X v̂ + δ̂ (y − εe) v̂, with b̂ = (α̂ + β̂)/2. [sent-147, score-0.345]

77 Hence the plane y = w x + b is in the middle of the ε̂-tube, with ε̂ < ε. [sent-158, score-0.344]
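
Continuing that sketch, the quantities appearing in the proof can be recovered from the dual solution: α̂ and β̂ are the thresholds of the margin planes through the two closest points, and ε̂ = ε − (α̂ − β̂)/(2δ̂) is the effective tube width. The helper below is hypothetical (it consumes the tuple returned by the c_svr_fit sketch above) and simply checks that every residual lies inside the ε̂-tube:

```python
import numpy as np

def effective_tube(X, y, eps, u_hat, v_hat, w_hat, delta_hat, b_hat):
    """Compute the margin-plane thresholds and the effective tube width eps_hat
    from a dual C-SVR solution, then verify the eps_hat-tube property."""
    alpha_hat = w_hat @ (X.T @ u_hat) + delta_hat * ((y + eps) @ u_hat)
    beta_hat = w_hat @ (X.T @ v_hat) + delta_hat * ((y - eps) @ v_hat)
    eps_hat = eps - (alpha_hat - beta_hat) / (2.0 * delta_hat)
    w, b = -w_hat / delta_hat, b_hat / delta_hat
    # Every residual should be inside the eps_hat-tube (up to solver tolerance).
    assert np.all(np.abs(y - X @ w - b) <= eps_hat + 1e-6)
    return w, b, eps_hat
```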

78 The soft ε-tube case: For ε < ε₀, a hard ε-tube does not exist. [sent-160, score-0.344]

79 In soft-margin classification, outliers were handled by reducing the convex hulls. Figure 3: Soft ε̂-tube found by RC-SVR: left: dual space, right: primal space. [sent-162, score-0.203]

80 The same strategy works for soft ε-tubes; see Figure 3. [sent-164, score-0.156]

81 Instead of taking the full convex hulls of D + and D− , we reduce the convex hulls away from the difficult boundary cases. [sent-165, score-1.342]

82 RC-SVR computes the closest points in the reduced convex hulls by solving QP (6): the same objective as (4), subject to e u = 1, e v = 1, 0 ≤ u ≤ De, 0 ≤ v ≤ De. [sent-166, score-0.959]

83 Parameter D determines the robustness of the solution by reducing the convex hull. [sent-169, score-0.349]

84 In this example, every point in the reduced convex hull must depend on at least two data points, since Σ ui = 1 (i = 1, · · · , m) and 0 ≤ ui ≤ 1/2. [sent-174, score-0.806]

85 In general, every point in the reduced convex hull can be written as the convex combination of at least 1/D = ν ∗ m points. [sent-175, score-0.845]

86 Since these points are exactly the support vectors and there are two reduced convex hulls, 2 ∗ νm is a lower bound on the number of support vectors in RC-SVR. [sent-176, score-0.535]

87 By choosing ν sufficiently large, the inseparable case with ε ≤ ε₀ is transformed into a separable case where once again our nearest-points-in-the-convex-hull problem is well defined. [sent-177, score-0.181]
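
A corresponding sketch for the soft case replaces the convex hulls by reduced convex hulls, i.e. it adds the upper bounds u_i, v_i ≤ D with D = 1/(νm). Since the full statement of QP (6) is abridged in this summary, the exact constraint set below is an assumption inferred from the surrounding text (Python/CVXPY, hypothetical helper name):

```python
import numpy as np
import cvxpy as cp

def rc_svr_closest_points(X, y, eps, nu):
    """Soft eps-tube RC-SVR sketch: the same nearest-point problem as C-SVR (4),
    but over REDUCED convex hulls, with upper bounds u_i, v_i <= D = 1/(nu*m)."""
    m, n = X.shape
    D = 1.0 / (nu * m)   # each reduced-hull point mixes at least nu*m training points
    Dp = np.hstack([X, (y + eps).reshape(-1, 1)])
    Dm = np.hstack([X, (y - eps).reshape(-1, 1)])
    u = cp.Variable(m, nonneg=True)
    v = cp.Variable(m, nonneg=True)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(Dp.T @ u - Dm.T @ v)),
               [cp.sum(u) == 1, cp.sum(v) == 1, u <= D, v <= D]).solve()
    return Dp.T @ u.value, Dm.T @ v.value   # closest points in the two reduced hulls
```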

88 As in classification, the dual reduced convex hull problem corresponds to computing a soft ε-tube in the primal space. [sent-178, score-1.13]

89 Consider the following soft-tube version of the primal C-SVR (7), which has RC-SVR (6) as its Wolfe dual. [sent-179, score-0.638]

90 Theorem 2 (RC-SVR constructs soft ε̂-tube): Let the soft max-margin plane obtained by RC-SVR (6) be ŵ x + δ̂ y − b̂ = 0, where ŵ = X û − X v̂, δ̂ = (y + εe) û − (y − εe) v̂, and b̂ = ŵ (X û + X v̂)/2 + δ̂ (y û + y v̂)/2. [sent-184, score-0.723]

91 If 0 < ε ≤ ε₀, then the plane y = w x + b corresponds to a soft ε̂-tube of the training data (Xi , yi ), i = 1, 2, · · · , m, with ε̂ = ε − (α̃ − β̃)/(2δ̂) < ε. [sent-185, score-0.657]

92 That is, it is an ε̂-tube of the reduced convex hull of the training data, where w = −ŵ/δ̂, b = b̂/δ̂, and α̃ = ŵ X û + δ̂ (y + εe) û, β̃ = ŵ X v̂ + δ̂ (y − εe) v̂. [sent-187, score-0.527]

93 Notice that α̃ and β̃ determine the planes parallel to the regression plane and through the closest points in each reduced convex hull of the shifted data. [sent-188, score-1.373]

94 In the inseparable case, these planes are parallel but not necessarily identical to the planes obtained by the primal RC-SVR (7). [sent-189, score-0.521]

95 Both RC-SVR and ν-SVR can shrink or grow the tube according to desired robustness. [sent-222, score-0.265]

96 5 Conclusion and Discussion: By examining when ε-tubes exist, we showed that in the dual space SVR can be regarded as a classification problem. [sent-224, score-0.3]

97 Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. [sent-225, score-1.529]

98 We proposed RC-SVR based on choosing the soft max-margin plane between the two shifted datasets. [sent-226, score-0.625]

99 The max-margin determines how much the tube can shrink. [sent-228, score-0.296]

100 We can show ε-SVR is equivalent to finding the closest points in a reduced convex hull problem for certain C, but the equivalent problem utilizes a different metric in the objective function than RC-SVR. [sent-307, score-0.732]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hulls', 0.353), ('plane', 0.344), ('convex', 0.318), ('tube', 0.265), ('svr', 0.263), ('dual', 0.238), ('xw', 0.209), ('hard', 0.188), ('primal', 0.176), ('soft', 0.156), ('yi', 0.124), ('hull', 0.123), ('planes', 0.123), ('classi', 0.115), ('closest', 0.114), ('ui', 0.108), ('xi', 0.103), ('separating', 0.101), ('shifted', 0.101), ('regression', 0.101), ('svm', 0.1), ('separable', 0.086), ('reduced', 0.086), ('geometric', 0.083), ('inseparable', 0.071), ('cation', 0.069), ('constructs', 0.067), ('mse', 0.067), ('duality', 0.067), ('augmented', 0.063), ('std', 0.063), ('exists', 0.063), ('points', 0.063), ('wolfe', 0.062), ('vi', 0.062), ('strictly', 0.058), ('di', 0.055), ('exist', 0.05), ('separability', 0.05), ('allowable', 0.048), ('gale', 0.048), ('erence', 0.047), ('strict', 0.047), ('xj', 0.047), ('margin', 0.045), ('ective', 0.044), ('bisects', 0.042), ('existence', 0.041), ('min', 0.041), ('response', 0.04), ('theorem', 0.039), ('nearest', 0.039), ('intuitive', 0.039), ('shrinking', 0.038), ('nds', 0.037), ('regarded', 0.037), ('inequalities', 0.036), ('rescaling', 0.035), ('housing', 0.035), ('shrinks', 0.035), ('support', 0.034), ('yj', 0.033), ('geometrically', 0.033), ('corresponds', 0.033), ('smallest', 0.033), ('determines', 0.031), ('constructed', 0.03), ('vol', 0.03), ('bennett', 0.03), ('supporting', 0.03), ('nitely', 0.03), ('risk', 0.029), ('vj', 0.029), ('edges', 0.029), ('objective', 0.028), ('necessarily', 0.028), ('rm', 0.028), ('said', 0.028), ('uj', 0.028), ('outliers', 0.027), ('predictor', 0.026), ('su', 0.026), ('poor', 0.026), ('along', 0.026), ('respectively', 0.026), ('formulations', 0.025), ('examining', 0.025), ('computes', 0.025), ('separates', 0.025), ('erent', 0.025), ('parameter', 0.025), ('choosing', 0.024), ('capacity', 0.024), ('boston', 0.024), ('lkopf', 0.023), ('interpretation', 0.023), ('perfect', 0.023), ('solved', 0.023), ('let', 0.022), ('sch', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 62 nips-2001-Duality, Geometry, and Support Vector Regression

Author: J. Bi, Kristin P. Bennett

Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1

2 0.21391572 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

3 0.17969246 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

4 0.12065418 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

5 0.11992924 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama

Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1

6 0.10517617 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

7 0.1000285 137 nips-2001-On the Convergence of Leveraging

8 0.095021755 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

9 0.094307035 46 nips-2001-Categorization by Learning and Combining Object Parts

10 0.090862848 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

11 0.088039808 139 nips-2001-Online Learning with Kernels

12 0.087852143 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

13 0.086437188 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

14 0.084758066 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

15 0.084455311 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

16 0.082197458 60 nips-2001-Discriminative Direction for Kernel Classifiers

17 0.080995321 25 nips-2001-Active Learning in the Drug Discovery Process

18 0.080065541 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

19 0.077166751 105 nips-2001-Kernel Machines and Boolean Functions

20 0.07658112 180 nips-2001-The Concave-Convex Procedure (CCCP)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, 0.147), (2, -0.026), (3, 0.178), (4, 0.099), (5, 0.013), (6, 0.006), (7, -0.052), (8, 0.046), (9, -0.02), (10, 0.084), (11, 0.045), (12, 0.013), (13, 0.08), (14, 0.215), (15, -0.017), (16, -0.065), (17, -0.093), (18, 0.001), (19, 0.086), (20, -0.032), (21, -0.081), (22, -0.024), (23, 0.064), (24, -0.057), (25, -0.036), (26, -0.05), (27, 0.027), (28, 0.051), (29, -0.083), (30, 0.038), (31, -0.102), (32, 0.0), (33, 0.095), (34, 0.025), (35, -0.044), (36, 0.037), (37, -0.012), (38, 0.204), (39, 0.039), (40, -0.145), (41, -0.102), (42, -0.044), (43, -0.116), (44, -0.001), (45, -0.136), (46, -0.161), (47, 0.107), (48, -0.179), (49, -0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97065985 62 nips-2001-Duality, Geometry, and Support Vector Regression

Author: J. Bi, Kristin P. Bennett

Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1

2 0.71500731 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

3 0.55414903 8 nips-2001-A General Greedy Approximation Algorithm with Applications

Author: T. Zhang

Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.

4 0.46209577 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

Author: T. Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.

5 0.45565248 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space).

6 0.44516593 180 nips-2001-The Concave-Convex Procedure (CCCP)

7 0.42630783 120 nips-2001-Minimax Probability Machine

8 0.42201841 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

9 0.40667447 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models

10 0.37456325 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

11 0.36602315 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.3542206 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

13 0.34544826 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

14 0.3353039 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

15 0.31850958 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation

16 0.31823769 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

17 0.31660375 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

18 0.31087229 143 nips-2001-PAC Generalization Bounds for Co-training

19 0.30958229 167 nips-2001-Semi-supervised MarginBoost

20 0.30704451 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.062), (17, 0.03), (19, 0.037), (20, 0.011), (27, 0.111), (30, 0.063), (38, 0.01), (59, 0.038), (63, 0.307), (72, 0.086), (79, 0.016), (83, 0.024), (91, 0.12)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8503629 126 nips-2001-Motivated Reinforcement Learning

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful, experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation shows that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

same-paper 2 0.81858009 62 nips-2001-Duality, Geometry, and Support Vector Regression

Author: J. Bi, Kristin P. Bennett

Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1

3 0.62747163 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

4 0.5841459 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

5 0.55645466 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

6 0.551449 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.54992193 8 nips-2001-A General Greedy Approximation Algorithm with Applications

8 0.548998 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

9 0.54816979 1 nips-2001-(Not) Bounding the True Error

10 0.54794908 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

11 0.54767263 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

12 0.54712319 58 nips-2001-Covariance Kernels from Bayesian Generative Models

13 0.5459215 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

14 0.54582036 185 nips-2001-The Method of Quantum Clustering

15 0.54551011 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

16 0.54515648 22 nips-2001-A kernel method for multi-labelled classification

17 0.54348171 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

18 0.54343826 56 nips-2001-Convolution Kernels for Natural Language

19 0.54322821 60 nips-2001-Discriminative Direction for Kernel Classifiers

20 0.54303855 121 nips-2001-Model-Free Least-Squares Policy Iteration