nips nips2000 nips2000-54 knowledge-graph by maker-knowledge-mining

54 nips-2000-Feature Selection for SVMs


Source: pdf

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a method of feature selection for Support Vector Machines. [sent-12, score-0.655]

2 The method is based upon finding those features which minimize bounds on the leave-one-out error. [sent-13, score-0.292]

3 This search can be efficiently performed via gradient descent. [sent-14, score-0.071]

4 The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data. [sent-15, score-0.833]

5 1 Introduction In many supervised learning problems feature selection is important for a variety of reasons: generalization performance, running time requirements, and constraints and interpretational issues imposed by the problem itself. [sent-16, score-0.856]

6 In classification problems we are given ℓ data points x_i ∈ ℝⁿ labeled y ∈ ±1, drawn i.i.d. [sent-17, score-0.047]

7 We would like to select a subset of features while preserving or improving the discriminative ability of a classifier. [sent-20, score-0.167]

8 As a brute-force search over all possible feature subsets is a combinatorial problem, one needs to take into account both the quality of the solution and the computational expense of any given algorithm. [sent-21, score-0.319]

9 Support vector machines (SVMs) have been extensively used as a classification tool with a great deal of success, from object recognition [5, 11] to classification of cancer morphologies [10] and a variety of other areas. [sent-22, score-0.362]

10 In this article we introduce feature selection algorithms for SVMs. [sent-24, score-0.797]

11 The methods are based on minimizing generalization bounds via gradient descent and are feasible to compute. [sent-25, score-0.376]

12 …(e.g. object recognition) and one can perform feature discovery (e.g. cancer diagnosis). [sent-27, score-0.271]

13 We also show how SVMs can perform poorly in the presence of many irrelevant features, a problem that is remedied by using our feature selection approach. [sent-29, score-0.73]

14 In section 2 we describe the feature selection problem, in section 3 we review SVMs and some of their generalization bounds, and in section 4 we introduce the new SVM feature selection method. [sent-31, score-1.43]

15 Section 5 then describes results on toy and real-life data indicating the usefulness of our approach. [sent-32, score-0.135]

16 2 The Feature Selection problem The feature selection problem can be addressed in the following two ways: (1) given a fixed m ≪ n, find the m features that give the smallest expected generalization error; or (2) given a maximum allowable generalization error γ, find the smallest m. [sent-33, score-1.214]

17 In both of these problems the expected generalization error is of course unknown, and thus must be estimated. [sent-34, score-0.216]

18 Note that choices of m in problem (1) can usually be reparameterized as choices of γ in problem (2). [sent-36, score-0.268]

19 In the literature one distinguishes between two types of method to solve this problem: the so-called filter and wrapper methods [2]. [sent-44, score-0.515]

20 Filter methods are defined as a preprocessing step to induction that can remove irrelevant attributes before induction occurs, and are thus intended to be valid for any set of functions f(x, α). [sent-45, score-0.714]

21 For example, one popular filter method is to use Pearson correlation coefficients. [sent-46, score-0.17]
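
To make the filter idea concrete, here is a minimal sketch (illustrative only; the function and variable names are ours, not the paper's) that ranks features by the absolute Pearson correlation between each feature column and the ±1 labels and keeps the top m:

```python
import numpy as np

def pearson_filter(X, y, m):
    """Rank features by |Pearson correlation| with labels y in {-1, +1}
    and return the indices of the m highest-scoring features."""
    Xc = X - X.mean(axis=0)                 # center each feature column
    yc = y - y.mean()                       # center the labels
    cov = Xc.T @ yc                         # per-feature covariance with y
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    scores = np.abs(cov / denom)            # |correlation| per feature
    return np.argsort(scores)[::-1][:m]     # indices of the top-m features
```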

22 The wrapper method, on the other hand, is defined as a search through the space of feature subsets using the estimated accuracy from an induction algorithm as a measure of goodness of a particular feature subset. [sent-47, score-1.093]

23 Thus, one approximates τ(σ, α) by minimizing τ_wrap(σ) = min_σ τ_alg(σ)   (2)   subject to σ ∈ {0, 1}ⁿ, where τ_alg is a learning algorithm trained on data preprocessed with fixed σ. [sent-48, score-0.192]
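
A minimal sketch of this wrapper idea is greedy forward selection, with cross-validated SVM accuracy standing in for τ_alg (scikit-learn is assumed to be available; this is an illustrative wrapper, not the gradient-based method the paper goes on to develop):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def greedy_wrapper(X, y, m, C=1.0):
    """Greedily grow a feature subset, at each step adding the feature whose
    inclusion gives the best cross-validated SVM accuracy (a proxy for tau_alg)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(m):
        best_feat, best_score = None, -np.inf
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(SVC(kernel="linear", C=C),
                                    X[:, cols], y, cv=5).mean()
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```

Each pass trains on the order of n SVMs, which illustrates the computational expense of wrapper methods that the approach in this paper is designed to avoid.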

24 In this article we introduce a feature selection algorithm for SVMs that takes advantage of the performance increase of wrapper methods whilst avoiding their computational complexity. [sent-50, score-1.192]

25 Note that some previous work on feature selection for SVMs does exist; however, results have been limited to linear kernels [3, 7] or linear probabilistic models [8]. [sent-51, score-0.532]

26 In order to describe this algorithm, we first review the SVM method and some of its properties. [sent-53, score-0.099]

27 3 Support Vector Learning Support Vector Machines [13] realize the following idea: they map x ∈ ℝⁿ into a high- (possibly infinite-) dimensional space and construct an optimal hyperplane in this space. [sent-54, score-0.13]

28 The optimal hyperplane is the one with the maximal distance (in the feature space ℋ) to the closest image Φ(x_i) from the training data (called the maximal margin). [sent-62, score-0.35]

29 For the non-separable case one can quadratically penalize errors with the modified kernel K ← K + (1/λ)·I, where I is the identity matrix and λ a constant penalizing the training errors (see [4] for reasons for this choice). [sent-67, score-0.467]
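
As a one-line sketch of this kernel modification (the exact scaling of the penalty constant is our reading of the garbled source and may differ from the paper's convention):

```python
import numpy as np

def quadratic_penalty_kernel(K, lam):
    """2-norm (quadratic) slack penalty via a diagonal shift of the Gram matrix:
    K <- K + (1/lam) * I.  Here larger lam means a heavier penalty on training
    errors; the precise scaling convention is an assumption (see the text above)."""
    return K + np.eye(K.shape[0]) / lam
```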

30 Suppose that the size of the maximal margin is M and the images Φ(x_1), … [sent-68, score-0.257]

31 …, Φ(x_ℓ) of the training vectors are within a sphere of radius R. [sent-71, score-0.043]

32 Theorem 1 If images of training data of size ℓ belonging to a [sent-73, score-0.106]

33 sphere of size R are separable with the corresponding margin M, then the expectation of the error probability has the bound EP_err ≤ (1/ℓ)·E{R²/M²} = (1/ℓ)·E{R²·W²(α⁰)}   (5), where the expectation is taken over sets of training data of size ℓ. [sent-74, score-0.459]
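
To make the quantities in Theorem 1 concrete, here is a rough numerical sketch (not the authors' implementation: the smallest enclosing sphere is crudely upper-bounded by a sphere centred at the feature-space centroid, and all names are illustrative) that evaluates R²·W²(α⁰)/ℓ from a trained SVM's dual coefficients and Gram matrix:

```python
import numpy as np

def w_squared(alpha, y, K):
    """W^2(alpha) = sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j) = 1 / M^2."""
    v = alpha * y
    return float(v @ K @ v)

def radius_squared_upper_bound(K):
    """Crude upper bound on R^2: the maximum squared distance of any Phi(x_i)
    to the feature-space centroid (a ball with this radius encloses all points,
    so it upper-bounds the smallest enclosing sphere's radius)."""
    diag = np.diag(K)
    dists = diag - 2.0 * K.mean(axis=1) + K.mean()
    return float(dists.max())

def radius_margin_bound(alpha, y, K):
    """(1/l) * R^2 * W^2(alpha): the quantity whose expectation bounds EP_err in (5)."""
    ell = len(y)
    return radius_squared_upper_bound(K) * w_squared(alpha, y, K) / ell
```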

34 This theorem justifies the idea that the performance depends on the ratio E{R²/M²} and not simply on the large margin M, where R is controlled by the mapping function Φ(·). [sent-75, score-0.216]

35 Other bounds also exist; in particular, Vapnik and Chapelle [4] derived an estimate using the concept of the span of support vectors. [sent-77, score-0.281]

36 Theorem 2 Under the assumption that the set of support vectors does not change when removing the example p, [sent-78, score-0.184]

37 E p_err^(ℓ−1) ≤ (1/ℓ)·E{ Σ_{p=1}^{ℓ} Ψ( α⁰_p / (K_SV⁻¹)_pp − 1 ) }   (6), where Ψ is the step function, K_SV is the matrix of dot products between support vectors, p_err^(ℓ−1) is the probability of test error for the machine trained on a sample of size ℓ − 1, and the expectations are taken over the random choice of the sample. [sent-79, score-0.316]
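
A minimal sketch of the leave-one-out count suggested by Theorem 2, assuming the support-vector coefficients α⁰ and the Gram matrix K_SV over the support vectors are already available and that K_SV is invertible (function and variable names are illustrative):

```python
import numpy as np

def span_loo_estimate(alpha_sv, K_sv, ell):
    """Estimate the leave-one-out error rate as
    (1/ell) * sum_p Psi(alpha_p / (K_sv^{-1})_pp - 1),
    where Psi is the step function: each positive argument flags one predicted
    LOO error.  Assumes the support vector set is unchanged by removing a point
    and that K_sv is invertible."""
    inv_diag = np.diag(np.linalg.inv(K_sv))
    flags = (alpha_sv / inv_diag - 1.0) > 0.0
    return flags.sum() / float(ell)
```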

38 4 Feature Selection for SVMs In the problem of feature selection we wish to minimize equation (1) over σ and α. [sent-80, score-0.75]

39 The support vector method attempts to find the function from the set f(x, w, b) = w · Φ(x) + b. [sent-81, score-0.213]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('selection', 0.312), ('wrapper', 0.297), ('svms', 0.245), ('talg', 0.223), ('feature', 0.22), ('induction', 0.19), ('article', 0.19), ('ku', 0.148), ('tt', 0.13), ('generalization', 0.124), ('filter', 0.122), ('support', 0.12), ('bounds', 0.116), ('chapelle', 0.116), ('cancer', 0.116), ('wish', 0.112), ('margin', 0.103), ('goodness', 0.095), ('maximal', 0.091), ('preprocessing', 0.091), ('toy', 0.087), ('svm', 0.085), ('irrelevant', 0.083), ('hyperplane', 0.08), ('vapnik', 0.08), ('features', 0.079), ('choices', 0.077), ('introduce', 0.075), ('reasons', 0.072), ('search', 0.071), ('smallest', 0.066), ('kernel', 0.064), ('allowable', 0.064), ('egham', 0.064), ('elementwise', 0.064), ('err', 0.064), ('georgia', 0.064), ('pedestrian', 0.064), ('penalizing', 0.064), ('quadratically', 0.064), ('surrey', 0.064), ('weston', 0.064), ('yik', 0.064), ('size', 0.063), ('theorem', 0.059), ('badly', 0.058), ('holloway', 0.058), ('barnhill', 0.058), ('savannah', 0.058), ('pearson', 0.058), ('brute', 0.058), ('dna', 0.058), ('iyi', 0.058), ('micro', 0.058), ('penalize', 0.058), ('preprocessed', 0.058), ('problem', 0.057), ('justifies', 0.054), ('diagnosis', 0.054), ('possibilities', 0.054), ('laboratories', 0.054), ('cbcl', 0.054), ('expense', 0.054), ('exist', 0.053), ('recognition', 0.052), ('errors', 0.051), ('review', 0.051), ('object', 0.051), ('avoiding', 0.05), ('red', 0.05), ('extensively', 0.05), ('realize', 0.05), ('minimize', 0.049), ('variety', 0.048), ('unknown', 0.048), ('methods', 0.048), ('xl', 0.048), ('method', 0.048), ('life', 0.048), ('irn', 0.048), ('separable', 0.048), ('imposed', 0.048), ('ep', 0.048), ('expectation', 0.047), ('problems', 0.047), ('subject', 0.046), ('minimizing', 0.045), ('vector', 0.045), ('span', 0.045), ('bank', 0.045), ('closest', 0.045), ('preserving', 0.045), ('array', 0.045), ('expectations', 0.045), ('error', 0.045), ('feasible', 0.043), ('io', 0.043), ('discriminative', 0.043), ('training', 0.043), ('trained', 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

2 0.15230072 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.13952592 58 nips-2000-From Margin to Sparsity

Author: Thore Graepel, Ralf Herbrich, Robert C. Williamson

Abstract: We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself. 1

4 0.13733146 130 nips-2000-Text Classification using String Kernels

Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins

Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1

5 0.13709989 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVM training algorithms.

6 0.13123235 37 nips-2000-Convergence of Large Margin Separable Linear Classification

7 0.12256552 4 nips-2000-A Linear Programming Approach to Novelty Detection

8 0.11875787 134 nips-2000-The Kernel Trick for Distances

9 0.11669762 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.10887054 75 nips-2000-Large Scale Bayes Point Machines

11 0.10668965 133 nips-2000-The Kernel Gibbs Sampler

12 0.10411517 12 nips-2000-A Support Vector Method for Clustering

13 0.098927163 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

14 0.098909549 74 nips-2000-Kernel Expansions with Unlabeled Examples

15 0.09790545 144 nips-2000-Vicinal Risk Minimization

16 0.097685724 121 nips-2000-Sparse Kernel Principal Component Analysis

17 0.093029842 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

18 0.092301212 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

19 0.09111543 120 nips-2000-Sparse Greedy Gaussian Process Regression

20 0.090229347 18 nips-2000-Active Support Vector Machine Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.288), (1, 0.216), (2, -0.076), (3, 0.053), (4, -0.072), (5, 0.024), (6, 0.026), (7, 0.004), (8, 0.011), (9, 0.029), (10, -0.036), (11, -0.006), (12, 0.061), (13, -0.033), (14, 0.129), (15, -0.041), (16, 0.04), (17, 0.016), (18, -0.184), (19, -0.008), (20, 0.022), (21, -0.081), (22, 0.001), (23, 0.069), (24, -0.011), (25, -0.137), (26, 0.161), (27, -0.083), (28, -0.096), (29, 0.013), (30, -0.044), (31, 0.136), (32, -0.042), (33, -0.118), (34, 0.097), (35, 0.02), (36, 0.139), (37, -0.106), (38, 0.02), (39, 0.066), (40, -0.003), (41, 0.007), (42, -0.024), (43, -0.022), (44, 0.005), (45, 0.018), (46, -0.087), (47, -0.064), (48, 0.014), (49, -0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96617424 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

2 0.64392537 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

Author: Sebastian Mika, Gunnar R채tsch, Klaus-Robert M체ller

Abstract: We investigate a new kernel-based classifier: the Kernel Fisher Discriminant (KFD). A mathematical programming formulation based on the observation that KFD maximizes the average margin permits an interesting modification of the original KFD algorithm yielding the sparse KFD. We find that both, KFD and the proposed sparse KFD, can be understood in an unifying probabilistic context. Furthermore, we show connections to Support Vector Machines and Relevance Vector Machines. From this understanding, we are able to outline an interesting kernel-regression technique based upon the KFD algorithm. Simulations support the usefulness of our approach.

3 0.58801246 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

Author: Gert Cauwenberghs, Tomaso Poggio

Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the Kuhn-Tucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental

4 0.53622037 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes O((p − 1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1 − α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVM training algorithms.

5 0.53161591 12 nips-2000-A Support Vector Method for Clustering

Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik

Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.

6 0.53114063 37 nips-2000-Convergence of Large Margin Separable Linear Classification

7 0.53014523 130 nips-2000-Text Classification using String Kernels

8 0.52673042 21 nips-2000-Algorithmic Stability and Generalization Performance

9 0.51783955 52 nips-2000-Fast Training of Support Vector Classifiers

10 0.51461577 18 nips-2000-Active Support Vector Machine Classification

11 0.4883796 74 nips-2000-Kernel Expansions with Unlabeled Examples

12 0.47596449 58 nips-2000-From Margin to Sparsity

13 0.4529312 4 nips-2000-A Linear Programming Approach to Novelty Detection

14 0.4478831 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

15 0.42872584 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

16 0.42650339 144 nips-2000-Vicinal Risk Minimization

17 0.4254739 44 nips-2000-Efficient Learning of Linear Perceptrons

18 0.41738319 120 nips-2000-Sparse Greedy Gaussian Process Regression

19 0.40523085 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers

20 0.40309918 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.061), (17, 0.697), (33, 0.032), (48, 0.01), (62, 0.023), (67, 0.02), (76, 0.035), (90, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9972837 121 nips-2000-Sparse Kernel Principal Component Analysis

Author: Michael E. Tipping

Abstract: 'Kernel' principal component analysis (PCA) is an elegant nonlinear generalisation of the popular linear data analysis method, where a kernel function implicitly defines a nonlinear transformation into a feature space wherein standard PCA is performed. Unfortunately, the technique is not 'sparse', since the components thus obtained are expressed in terms of kernels associated with every training vector. This paper shows that by approximating the covariance matrix in feature space by a reduced number of example vectors, using a maximum-likelihood approach, we may obtain a highly sparse form of kernel PCA without loss of effectiveness. 1

same-paper 2 0.99708021 54 nips-2000-Feature Selection for SVMs

Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik

Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.

3 0.99267536 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference

Author: James M. Coughlan, Alan L. Yuille

Abstract: Preliminary work by the authors made use of the so-called

4 0.98857999 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing

Author: Robert A. Legenstein, Wolfgang Maass

Abstract: We introduce total wire length as salient complexity measure for an analysis of the circuit complexity of sensory processing in biological neural systems and neuromorphic engineering. This new complexity measure is applied to a set of basic computational problems that apparently need to be solved by circuits for translation- and scale-invariant sensory processing. We exhibit new circuit design strategies for these new benchmark functions that can be implemented within realistic complexity bounds, in particular with linear or almost linear total wire length.

5 0.98696172 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes

Author: Te-Won Lee, Thomas Wachtler, Terrence J. Sejnowski

Abstract: The human visual system encodes the chromatic signals conveyed by the three types of retinal cone photoreceptors in an opponent fashion. This color opponency has been shown to constitute an efficient encoding by spectral decorrelation of the receptor signals. We analyze the spatial and chromatic structure of natural scenes by decomposing the spectral images into a set of linear basis functions such that they constitute a representation with minimal redundancy. Independent component analysis finds the basis functions that transform the spatiochromatic data such that the outputs (activations) are statistically as independent as possible, i.e. least redundant. The resulting basis functions show strong opponency along an achromatic direction (luminance edges), along a blue-yellow direction, and along a red-blue direction. Furthermore, the resulting activations have very sparse distributions, suggesting that the use of color opponency in the human visual system achieves a highly efficient representation of colors. Our findings suggest that color opponency is a result of the properties of natural spectra and not solely a consequence of the overlapping cone spectral sensitivities. 1 Statistical structure of natural scenes Efficient encoding of visual sensory information is an important task for information processing systems and its study may provide insights into coding principles of biological visual systems. An important goal of sensory information processing (Electronic version available at www.cnl.salk.edu/)

6 0.91784286 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications

7 0.85096401 130 nips-2000-Text Classification using String Kernels

8 0.84179562 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

9 0.8325386 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm

10 0.81976789 4 nips-2000-A Linear Programming Approach to Novelty Detection

11 0.81525427 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling

12 0.81122351 133 nips-2000-The Kernel Gibbs Sampler

13 0.80954301 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

14 0.80923867 51 nips-2000-Factored Semi-Tied Covariance Matrices

15 0.80118614 82 nips-2000-Learning and Tracking Cyclic Human Motion

16 0.79547226 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

17 0.7897523 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

18 0.78945935 36 nips-2000-Constrained Independent Component Analysis

19 0.78754985 118 nips-2000-Smart Vision Chip Fabricated Using Three Dimensional Integration Technology

20 0.78327191 74 nips-2000-Kernel Expansions with Unlabeled Examples