nips nips2000 nips2000-12 knowledge-graph by maker-knowledge-mining

12 nips-2000-A Support Vector Method for Clustering


Source: pdf

Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present a novel method for clustering using the support vector machine approach. [sent-4, score-0.46]

2 Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. [sent-5, score-0.96]

3 The boundary of the sphere forms in data space a set of closed contours containing the data. [sent-6, score-0.794]

4 Data points enclosed by each contour are defined as a cluster. [sent-7, score-0.357]

5 As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. [sent-8, score-1.168]

6 The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. [sent-9, score-0.675]

7 As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. [sent-10, score-0.667]

8 The structure of the data is explored by varying the two parameters. [sent-11, score-0.085]

9 We investigate the dependence of our method on these parameters and apply it to several data sets. [sent-12, score-0.082]

10 1 Introduction Clustering is an ill-defined problem for which there exist numerous methods [1, 2]. [sent-13, score-0.05]

11 These can be based on parametric models or can be non-parametric. [sent-14, score-0.079]

12 Parametric algorithms are usually limited in their expressive power, i. [sent-15, score-0.107]

13 In this paper we propose a non-parametric clustering algorithm based on the support vector approach [3], which is usually employed for supervised learning. [sent-18, score-0.599]

14 In the papers [4, 5] an SV algorithm for characterizing the support of a high dimensional distribution was proposed. [sent-19, score-0.469]

15 As a by-product of the algorithm one can compute a set of contours which enclose the data points. [sent-20, score-0.659]

16 These contours were interpreted by us as cluster boundaries [6]. [sent-21, score-0.778]

17 In [6] the number of clusters was predefined, and the value of the kernel parameter was not determined as part of the algorithm. [sent-22, score-0.334]

18 The first stage of our Support Vector Clustering (SVC) algorithm consists of computing the sphere with minimal radius which encloses the data points when mapped to a high dimensional feature space. [sent-24, score-0.856]

19 This sphere corresponds to a set of contours which enclose the points in input space. [sent-25, score-0.897]

20 As the width parameter of the Gaussian kernel function that represents the map to feature space is decreased, this contour breaks into an increasing number of disconnected pieces. [sent-26, score-0.43]

21 The points enclosed by each separate piece are interpreted as belonging to the same cluster. [sent-27, score-0.454]

22 Since the contours characterize the support of the data, our algorithm identifies valleys in its probability distribution. [sent-28, score-0.891]

23 When we deal with overlapping clusters we have to employ a soft margin constant, allowing for "outliers". [sent-29, score-0.428]

24 In this parameter range our algorithm is similar to the space clustering method [7]. [sent-30, score-0.402]

25 The latter is based on a Parzen window estimate of the probability density, using a Gaussian kernel and identifying cluster centers with peaks of the estimator. [sent-31, score-0.487]

26 2 Describing Cluster Boundaries with Support Vectors In this section we describe an algorithm for representing the support of a probability distribution by a finite data set using the formalism of support vectors [5, 4]. [sent-32, score-0.548]

27 Using a nonlinear transformation $\Phi$ from $X$ to some high dimensional feature space, we look for the smallest enclosing sphere of radius $R$, described by the constraints: $\|\Phi(x_i) - a\|^2 \le R^2 \;\forall i$, where $\|\cdot\|$ [sent-35, score-0.663]

28 is the Euclidean norm and $a$ is the center of the sphere. [sent-36, score-0.042]

29 Soft constraints are incorporated by adding slack variables $\xi_j$: $\|\Phi(x_j) - a\|^2 \le R^2 + \xi_j$ (1), with $\xi_j \ge 0$. [sent-37, score-0.146]

30 To solve this problem we introduce the Lagrangian $L = R^2 - \sum_j (R^2 + \xi_j - \|\Phi(x_j) - a\|^2)\beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j$ (2), where $\beta_j \ge 0$ and $\mu_j \ge 0$ are Lagrange multipliers, $C$ is a constant, and $C \sum_j \xi_j$ is a penalty term. [sent-38, score-0.053]
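
To make the two stages above concrete (solving for the minimal enclosing sphere in feature space, then reading clusters off the contours), here is a minimal Python/NumPy sketch. It works with the Wolfe dual of the Lagrangian (2), uses a generic SciPy solver rather than the authors' own optimizer, and groups points by checking sampled line segments against the sphere radius; the kernel width q, the constant C, and all helper names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.sparse.csgraph import connected_components

def gaussian_kernel(X, Y, q):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def svc_fit(X, q=1.0, C=1.0):
    """Solve the Wolfe dual of (2): maximise sum_j beta_j K_jj - sum_ij beta_i beta_j K_ij
    subject to 0 <= beta_j <= C and sum_j beta_j = 1 (here as a generic NLP)."""
    n = len(X)
    K = gaussian_kernel(X, X, q)
    negative_dual = lambda b: b @ K @ b - b @ np.diag(K)
    res = minimize(negative_dual, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}])
    beta = res.x

    def r2(Z):
        # Squared feature-space distance of each row of Z to the sphere centre.
        return (np.ones(len(Z)) - 2.0 * gaussian_kernel(Z, X, q) @ beta
                + beta @ K @ beta)        # K(z, z) = 1 for the Gaussian kernel

    strict_sv = (beta > 1e-6) & (beta < C - 1e-6)
    R2 = r2(X[strict_sv]).mean() if strict_sv.any() else r2(X).max()
    return r2, R2

def svc_labels(X, r2, R2, n_seg=10):
    """Points share a cluster when the segment joining them stays inside the sphere,
    i.e. inside the same closed contour in data space."""
    n = len(X)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i, n):
            seg = np.linspace(X[i], X[j], n_seg)
            adj[i, j] = adj[j, i] = np.all(r2(seg) <= R2 + 1e-7)
    return connected_components(adj, directed=False)[1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
r2, R2 = svc_fit(X, q=2.0, C=1.0)
print(svc_labels(X, r2, R2))   # two well-separated blobs should come out as two clusters
```

Raising q plays the role of decreasing the Gaussian width in the abstract (contours tighten and split), while lowering C allows outliers and smooths the cluster boundaries.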


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('contours', 0.388), ('clustering', 0.286), ('sphere', 0.242), ('cluster', 0.225), ('support', 0.174), ('clusters', 0.174), ('aviv', 0.173), ('enclose', 0.173), ('enclosed', 0.173), ('tel', 0.173), ('valleys', 0.173), ('tj', 0.163), ('enclosing', 0.149), ('decreased', 0.117), ('soft', 0.115), ('outliers', 0.106), ('kernel', 0.1), ('boundaries', 0.097), ('points', 0.094), ('sv', 0.093), ('contour', 0.09), ('radius', 0.09), ('israel', 0.09), ('dimensional', 0.089), ('parametric', 0.079), ('tightly', 0.074), ('breaks', 0.074), ('complementarity', 0.074), ('dealt', 0.074), ('encloses', 0.074), ('faculty', 0.074), ('piece', 0.074), ('svc', 0.074), ('mapped', 0.071), ('forms', 0.069), ('interpreted', 0.068), ('horn', 0.067), ('labs', 0.067), ('management', 0.067), ('parzen', 0.067), ('predefined', 0.067), ('usa', 0.066), ('width', 0.063), ('expressive', 0.062), ('haifa', 0.062), ('schultz', 0.062), ('technion', 0.062), ('parameter', 0.06), ('margin', 0.06), ('formalism', 0.059), ('peaks', 0.059), ('red', 0.059), ('vladimir', 0.059), ('algorithm', 0.056), ('kkt', 0.055), ('geometrical', 0.055), ('centers', 0.055), ('identifies', 0.055), ('lagrangian', 0.055), ('high', 0.055), ('xj', 0.053), ('splitting', 0.053), ('bank', 0.053), ('closed', 0.053), ('aw', 0.053), ('multipliers', 0.053), ('penalty', 0.053), ('incorporated', 0.05), ('numerous', 0.05), ('smoother', 0.05), ('xd', 0.05), ('constraints', 0.049), ('identifying', 0.048), ('lagrange', 0.048), ('papers', 0.048), ('slack', 0.047), ('characterizing', 0.047), ('vapnik', 0.047), ('rd', 0.047), ('usually', 0.045), ('characterize', 0.045), ('belonging', 0.045), ('describing', 0.045), ('explored', 0.043), ('separating', 0.043), ('lab', 0.043), ('vectors', 0.043), ('feature', 0.043), ('data', 0.042), ('norm', 0.042), ('ie', 0.042), ('derivative', 0.041), ('employ', 0.041), ('school', 0.04), ('investigate', 0.04), ('smallest', 0.038), ('overlapping', 0.038), ('supervised', 0.038), ('nj', 0.037), ('constant', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 12 nips-2000-A Support Vector Method for Clustering

Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

2 0.15108427 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

Author: Naftali Tishby, Noam Slonim

Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provide a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examine the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
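
The relaxation idea above can be sketched in a few lines: turn the pairwise distances into a row-stochastic transition matrix and watch the mutual information between the starting point and the state after t steps decay. The exp(-d/scale) construction, the uniform prior, and the omission of the information-bottleneck extraction step are simplifying assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np

def transition_matrix(D, scale=1.0):
    """Turn a pairwise distance matrix into a row-stochastic transition matrix."""
    P = np.exp(-D / scale)
    return P / P.sum(axis=1, keepdims=True)

def mutual_information(Pt):
    """I(X_0; X_t) for a uniform prior over the starting point and t-step transitions Pt."""
    n = Pt.shape[0]
    joint = Pt / n
    px = np.full(n, 1.0 / n)
    py = joint.sum(axis=0)
    return np.nansum(joint * np.log2(joint / (px[:, None] * py[None, :])))

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
P = transition_matrix(D, scale=np.median(D))
Pt = np.eye(len(X))
for t in range(1, 6):
    Pt = Pt @ P                               # relax the chain for t steps
    print(t, round(mutual_information(Pt), 3))
```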

3 0.1244299 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates

Author: Trausti T. Kristjansson, Brendan J. Frey

Abstract: Condensation, a form of likelihood-weighted particle filtering, has been successfully used to infer the shapes of highly constrained

4 0.11579547 4 nips-2000-A Linear Programming Approach to Novelty Detection

Author: Colin Campbell, Kristin P. Bennett

Abstract: Novelty detection involves modeling the normal behaviour of a system hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are $x_i$ (with $i = 1, \ldots, m$) and the mapping is $x_i \mapsto \phi(x_i)$; then in the span of $\{\phi(x_i)\}$ we can expand a vector $w = \sum_j \alpha_j \phi(x_j)$. Hence we can define separating hyperplanes in feature space by $w \cdot \phi(x) + b = 0$. We will refer to $w \cdot \phi(x_i) + b$ as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function $f(z) = \mathrm{sgn}(w \cdot \phi(z) + b)$ (1), where $z$ is a new data point. The data appears in the form of an inner product in feature space, so we can implicitly define feature space by our choice of kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ (2). A number of choices for the kernel are possible, for example RBF kernels $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ (3). With the given kernel the decision function is therefore given by $f(z) = \mathrm{sgn}\big(\sum_j \alpha_j K(x_j, z) + b\big)$ (4). One approach to novelty detection is to find a hypersphere in feature space with a minimal radius $R$ and centre $a$ which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11]. The effect of outliers is reduced by using slack variables $e_i$ to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. $\min_{R, a, e}\; R^2 + \lambda \sum_i e_i$ s.t. $(x_i - a) \cdot (x_i - a) \le R^2 + e_i,\; e_i \ge 0$ (5). Since the data appears in the form of inner products, kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Schölkopf et al. [7].
Suppose we restricted our attention to RBF kernels (3); then the data lies on the surface of a hypersphere in feature space, since $\phi(x) \cdot \phi(x) = K(x, x) = 1$. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of $W(\alpha) = \frac{1}{2}\sum_{i,j=1}^{m} \alpha_i \alpha_j K(x_i, x_j)$ s.t. $0 \le \alpha_i \le C$, $\sum_{i=1}^{m} \alpha_i = 1$ (6). However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, $f(z) = 0$, of some nonlinear function. In feature space, $f(z) = \sum_i \alpha_i K(z, x_i) + b$, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., $\sum_i f(x_i)$. This is achieved by minimising $\sum_{i=1}^{m} f(x_i)$ (7) subject to $\sum_{j=1}^{m} \alpha_j K(x_i, x_j) + b \ge 0$ (8) and $\sum_{i=1}^{m} \alpha_i = 1$, $\alpha_i \ge 0$ (9). The bias $b$ is just treated as an additional parameter in the minimisation process, though unrestricted in sign. The added constraints (9) on $\alpha$ bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with $\alpha \ne 0$ exists. Other constraints on the class of functions are possible, e.g. $\|\alpha\|_1 = 1$ with no restriction on the sign of $\alpha_i$. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise $\sum_{i=1}^{m} f(x_i) + \lambda \sum_{i=1}^{m} e_i$ (10) subject to $\sum_{j=1}^{m} \alpha_j K(x_i, x_j) + b \ge -e_i$, $e_i \ge 0$ (11) and constraints (9). The parameter $\lambda$ controls the extent of margin errors (larger $\lambda$ means fewer outliers are ignored: $\lambda \to \infty$ corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory.
Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and $\sigma$
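
A small sketch of the hard-margin LP (7)-(9) described above, written with SciPy's linprog; the soft-margin version (10)-(11) would only add the slack variables e_i to the variable vector. The RBF width sigma, the function names, and the toy data are illustrative assumptions, while the decision rule (flag points with negative score as novel) follows the sign convention stated in the abstract.

```python
import numpy as np
from scipy.optimize import linprog

def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lp_novelty_fit(X, sigma=1.0):
    """Hard-margin LP (7)-(9): minimise sum_i [sum_j alpha_j K(x_i, x_j) + b]
    s.t. sum_j alpha_j K(x_i, x_j) + b >= 0, sum_j alpha_j = 1, alpha_j >= 0."""
    m = len(X)
    K = rbf_kernel(X, X, sigma)
    c = np.concatenate([K.sum(axis=0), [float(m)]])      # variables: [alpha_1..alpha_m, b]
    A_ub = -np.hstack([K, np.ones((m, 1))])               # -(K alpha + b) <= 0
    b_ub = np.zeros(m)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]   # sum_j alpha_j = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m + [(None, None)]           # alpha_j >= 0, b free in sign
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

def lp_novelty_score(Z, X, alpha, b, sigma=1.0):
    """Negative scores flag points outside the learned support region as novel."""
    return rbf_kernel(Z, X, sigma) @ alpha + b

X_train = np.random.default_rng(1).normal(size=(50, 2))   # "normal" behaviour
alpha, b = lp_novelty_fit(X_train, sigma=1.0)
print(lp_novelty_score(np.array([[0.0, 0.0], [6.0, 6.0]]), X_train, alpha, b))
```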

5 0.10571405 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
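
As a sanity check of the feature space just described, here is a deliberately brute-force version of the subsequence kernel (exponential in k and the string lengths, so usable only on tiny strings); the efficient dynamic-programming evaluation is the paper's contribution and is not reproduced here. The decay factor lam and the helper name are illustrative.

```python
from itertools import combinations

def ssk_bruteforce(s, t, k, lam=0.5):
    """K(s, t) = sum over common length-k subsequences u of
    lam^(span of u in s) * lam^(span of u in t), over all occurrences."""
    weights_t = {}
    for js in combinations(range(len(t)), k):
        u = "".join(t[j] for j in js)
        weights_t[u] = weights_t.get(u, 0.0) + lam ** (js[-1] - js[0] + 1)
    total = 0.0
    for idx in combinations(range(len(s)), k):
        u = "".join(s[i] for i in idx)
        total += lam ** (idx[-1] - idx[0] + 1) * weights_t.get(u, 0.0)
    return total

print(ssk_bruteforce("cat", "cart", 2))   # shared length-2 subsequences: "ca", "ct", "at"
```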

6 0.10411517 54 nips-2000-Feature Selection for SVMs

7 0.084040061 52 nips-2000-Fast Training of Support Vector Classifiers

8 0.0827007 121 nips-2000-Sparse Kernel Principal Component Analysis

9 0.081364773 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

10 0.079990007 134 nips-2000-The Kernel Trick for Distances

11 0.073991358 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

12 0.067778796 44 nips-2000-Efficient Learning of Linear Perceptrons

13 0.067544855 74 nips-2000-Kernel Expansions with Unlabeled Examples

14 0.066026144 18 nips-2000-Active Support Vector Machine Classification

15 0.065865695 133 nips-2000-The Kernel Gibbs Sampler

16 0.065587983 58 nips-2000-From Margin to Sparsity

17 0.063999429 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

18 0.063885957 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

19 0.063313723 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

20 0.063267648 62 nips-2000-Generalized Belief Propagation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.126), (2, -0.022), (3, 0.056), (4, -0.066), (5, 0.145), (6, 0.016), (7, 0.031), (8, -0.066), (9, 0.193), (10, -0.053), (11, 0.11), (12, 0.059), (13, -0.044), (14, 0.036), (15, -0.155), (16, -0.114), (17, -0.113), (18, 0.014), (19, -0.095), (20, -0.075), (21, 0.092), (22, -0.126), (23, 0.107), (24, 0.038), (25, -0.093), (26, -0.079), (27, -0.031), (28, 0.072), (29, 0.025), (30, -0.007), (31, 0.136), (32, 0.155), (33, -0.129), (34, 0.265), (35, 0.165), (36, 0.057), (37, -0.133), (38, -0.044), (39, -0.062), (40, -0.105), (41, 0.073), (42, 0.043), (43, -0.101), (44, 0.003), (45, -0.161), (46, 0.002), (47, -0.055), (48, 0.004), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.963817 12 nips-2000-A Support Vector Method for Clustering

Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

2 0.60146767 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

Author: Naftali Tishby, Noam Slonim

Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provide a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examine the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1

3 0.44177252 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

4 0.41098148 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates

Author: Trausti T. Kristjansson, Brendan J. Frey

Abstract: Condensation, a form of likelihood-weighted particle filtering, has been successfully used to infer the shapes of highly constrained

5 0.39802217 79 nips-2000-Learning Segmentation by Random Walks

Author: Marina Meila, Jianbo Shi

Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
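
A compact sketch of the random-walk reading of pairwise similarities described above: row-normalise the similarity matrix into a transition matrix and bipartition on its second eigenvector. The median threshold and the Gaussian similarity below are illustrative choices; the paper's contribution is the proof that this spectral recipe recovers the Normalized Cut criterion, which the sketch does not reproduce.

```python
import numpy as np

def random_walk_bipartition(S):
    """Row-normalise the similarity matrix into a transition matrix and split
    on the second-largest eigenvector of the resulting Markov chain."""
    P = S / S.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)                 # eigenvalue 1 comes first
    second = vecs[:, order[1]].real                # first non-trivial eigenvector
    return (second > np.median(second)).astype(int)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.5, (15, 2)), rng.normal(6.0, 0.5, (15, 2))])
D2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
labels = random_walk_bipartition(np.exp(-D2 / 2.0))   # Gaussian pairwise similarities
print(labels)
```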

6 0.36845988 4 nips-2000-A Linear Programming Approach to Novelty Detection

7 0.36049443 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra

8 0.33508724 52 nips-2000-Fast Training of Support Vector Classifiers

9 0.329153 130 nips-2000-Text Classification using String Kernels

10 0.32690465 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

11 0.32482699 62 nips-2000-Generalized Belief Propagation

12 0.31732431 44 nips-2000-Efficient Learning of Linear Perceptrons

13 0.30183563 116 nips-2000-Sex with Support Vector Machines

14 0.30173877 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

15 0.29706442 18 nips-2000-Active Support Vector Machine Classification

16 0.2959865 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

17 0.26203308 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

18 0.24760386 74 nips-2000-Kernel Expansions with Unlabeled Examples

19 0.23501103 134 nips-2000-The Kernel Trick for Distances

20 0.23252118 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.063), (17, 0.114), (32, 0.018), (33, 0.062), (35, 0.022), (41, 0.393), (54, 0.016), (55, 0.014), (62, 0.018), (67, 0.099), (76, 0.037), (90, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75822401 12 nips-2000-A Support Vector Method for Clustering

Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

2 0.40398026 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.39569286 21 nips-2000-Algorithmic Stability and Generalization Performance

Author: Olivier Bousquet, André Elisseeff

Abstract: We present a novel way of obtaining PAC-style bounds on the generalization error of learning algorithms, explicitly using their stability properties. A stable learner is one for which the learned solution does not change much with small changes in the training set. The bounds we obtain do not depend on any measure of the complexity of the hypothesis space (e.g. VC dimension) but rather depend on how the learning algorithm searches this space, and can thus be applied even when the VC dimension is infinite. We demonstrate that regularization networks possess the required stability property and apply our method to obtain new bounds on their generalization performance. 1

4 0.39554515 134 nips-2000-The Kernel Trick for Distances

Author: Bernhard Schölkopf

Abstract: A method is described which, like the kernel trick in support vector machines (SVMs), lets us generalize distance-based algorithms to operate in feature spaces, usually nonlinearly related to the input space. This is done by identifying a class of kernels which can be represented as norm-based distances in Hilbert spaces. It turns out that common kernel algorithms, such as SVMs and kernel PCA, are actually really distance based algorithms and can be run with that class of kernels, too. As well as providing a useful new insight into how these algorithms work, the present work can form the basis for conceiving new algorithms.
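
The core identity behind the class of kernels discussed above is that a positive definite kernel induces a squared distance in its feature space; a short sketch (with an assumed kernel callable k):

```python
import numpy as np

def kernel_distance_sq(x, y, k):
    """Squared feature-space distance induced by a positive definite kernel k:
    ||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    return k(x, x) - 2.0 * k(x, y) + k(y, y)

linear = lambda a, b: float(np.dot(a, b))
x, y = np.array([1.0, 0.0]), np.array([0.0, 2.0])
print(kernel_distance_sq(x, y, linear))   # 5.0, i.e. ||x - y||^2 for the linear kernel
```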

5 0.39416316 111 nips-2000-Regularized Winnow Methods

Author: Tong Zhang

Abstract: In theory, the Winnow multiplicative update has certain advantages over the Perceptron additive update when there are many irrelevant attributes. Recently, there has been much effort on enhancing the Perceptron algorithm by using regularization, leading to a class of linear classification methods called support vector machines. Similarly, it is also possible to apply the regularization idea to the Winnow algorithm, which gives methods we call regularized Winnows. We show that the resulting methods compare with the basic Winnows in a similar way that a support vector machine compares with the Perceptron. We investigate algorithmic issues and learning properties of the derived methods. Some experimental results will also be provided to illustrate different methods.
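
For reference, a sketch of the basic (unregularized) Winnow multiplicative update that the abstract contrasts with the Perceptron's additive update; the promotion factor, threshold, and toy target below are textbook-style assumptions, and the regularized variants introduced in the paper are not reproduced here.

```python
import numpy as np

def winnow_train(X, y, epochs=10, alpha=2.0):
    """Classic Winnow for {0,1} features/labels: multiplicative promotion/demotion
    of the weights of the active features whenever a mistake is made."""
    n = X.shape[1]
    w, theta = np.ones(n), n / 2.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi >= theta else 0
            if pred == 0 and yi == 1:       # false negative: promote active features
                w[xi == 1] *= alpha
            elif pred == 1 and yi == 0:     # false positive: demote active features
                w[xi == 1] /= alpha
    return w, theta

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] | X[:, 1]                        # target: OR of 2 out of 10 attributes
w, theta = winnow_train(X, y)
print(((X @ w >= theta).astype(int) == y).mean())
```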

6 0.39165771 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

7 0.38852021 74 nips-2000-Kernel Expansions with Unlabeled Examples

8 0.38476652 37 nips-2000-Convergence of Large Margin Separable Linear Classification

9 0.3811782 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

10 0.3794069 36 nips-2000-Constrained Independent Component Analysis

11 0.37754795 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

12 0.37726742 79 nips-2000-Learning Segmentation by Random Walks

13 0.37562716 4 nips-2000-A Linear Programming Approach to Novelty Detection

14 0.37299538 133 nips-2000-The Kernel Gibbs Sampler

15 0.3709718 130 nips-2000-Text Classification using String Kernels

16 0.36952218 75 nips-2000-Large Scale Bayes Point Machines

17 0.36858663 122 nips-2000-Sparse Representation for Gaussian Process Models

18 0.3685604 44 nips-2000-Efficient Learning of Linear Perceptrons

19 0.36836368 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

20 0.36810011 22 nips-2000-Algorithms for Non-negative Matrix Factorization