nips nips2000 nips2000-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gert Cauwenberghs, Tomaso Poggio
Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the KuhnTucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. [sent-6, score-0.466]
2 Adiabatic increments retain the KuhnTucker conditions on all previously seen training data, in a number of steps each computed analytically. [sent-7, score-0.649]
3 The incremental procedure is reversible, and decremental "unlearning" offers an efficient method to exactly evaluate leave-one-out generalization performance. [sent-8, score-1.051]
4 Interpretation of decremental unlearning in feature space sheds light on the relationship between generalization and geometry of the data. [sent-9, score-1.131]
5 1 Introduction Training a support vector machine (SVM) requires solving a quadratic programming (QP) problem in a number of coefficients equal to the number of training examples. [sent-10, score-0.624]
6 For very large datasets, standard numeric techniques for QP become infeasible. [sent-11, score-0.117]
7 Practical techniques decompose the problem into manageable subproblems over part of the data [7, 5] or, in the limit, perform iterative pairwise [8] or component-wise [3] optimization. [sent-12, score-0.445]
8 A disadvantage of these techniques is that they may give an approximate solution, and may require many passes through the dataset to reach a reasonable level of convergence. [sent-13, score-0.472]
9 An on-line alternative, that formulates the (exact) solution for £ + 1 training data in terms of that for £ data and one new data point, is presented here. [sent-14, score-0.464]
10 The incremental procedure is reversible, and decremental "unlearning" of each training sample produces an exact leave-one-out estimate of generalization performance on the training set. [sent-15, score-1.376]
11 2 Incremental SVM Learning Training an SVM "incrementally" on new data by discarding all previous data except their support vectors, gives only approximate results [II). [sent-16, score-0.402]
12 In what follows we consider incrementallearning as an exact on-line method to construct the solution recursively, one point at a time. [sent-17, score-0.21]
13 The key is to retain the Kuhn-Tucker (KT) conditions on all previously seen data, while "adiabatically" adding a new data point to the solution. [sent-18, score-0.541]
14 1 Kuhn-Thcker conditions In SVM classification, the optimal separating function reduces to a linear combination of kernels on the training data, f(x) = E j QjYjK(xj, x) + b, with training vectors Xi and corresponding labels Yi = ±l. [sent-20, score-0.632]
15 In the dual formulation of the training problem, the ·On sabbatical leave at CBCL in MIT while this work was performed. [sent-21, score-0.302]
16 " ,, a, , 0 " ,, , support vector , error vector Figure 1: Soft-margin classification SVM training. [sent-52, score-0.302]
17 coefficients ai are obtained by minimizing a convex quadratic objective function under constraints [12] min o<"'. [sent-53, score-0.426]
wordName wordTfidf (topN-words)
[('decremental', 0.467), ('unlearning', 0.35), ('incremental', 0.317), ('svm', 0.223), ('gert', 0.201), ('qp', 0.182), ('reversible', 0.182), ('retain', 0.169), ('cbcl', 0.169), ('training', 0.149), ('support', 0.118), ('hopkins', 0.101), ('increments', 0.101), ('manageable', 0.101), ('md', 0.101), ('sheds', 0.101), ('incrementally', 0.091), ('cauwenberghs', 0.091), ('discarding', 0.091), ('formulates', 0.091), ('poggio', 0.091), ('exact', 0.09), ('tp', 0.084), ('generalization', 0.083), ('previously', 0.08), ('decompose', 0.079), ('techniques', 0.077), ('coefficients', 0.075), ('quadratic', 0.073), ('vector', 0.071), ('aw', 0.071), ('conditions', 0.069), ('kt', 0.068), ('passes', 0.068), ('recursively', 0.068), ('po', 0.068), ('ai', 0.065), ('ox', 0.063), ('leave', 0.063), ('pairwise', 0.063), ('offers', 0.061), ('datasets', 0.059), ('disadvantage', 0.059), ('separating', 0.059), ('labels', 0.059), ('data', 0.057), ('reach', 0.057), ('recursive', 0.057), ('geometry', 0.055), ('convex', 0.055), ('solution', 0.053), ('dual', 0.053), ('massachusetts', 0.052), ('procedure', 0.052), ('seen', 0.049), ('min', 0.047), ('dataset', 0.046), ('interpretation', 0.046), ('approximate', 0.045), ('produces', 0.045), ('kernels', 0.044), ('programming', 0.043), ('adding', 0.043), ('iterative', 0.043), ('classification', 0.042), ('objective', 0.042), ('reasonable', 0.042), ('light', 0.042), ('become', 0.04), ('evaluate', 0.04), ('key', 0.04), ('machine', 0.039), ('vectors', 0.038), ('yi', 0.038), ('formulation', 0.037), ('practical', 0.037), ('xj', 0.036), ('minimizing', 0.036), ('reduces', 0.035), ('technology', 0.035), ('limit', 0.034), ('point', 0.034), ('except', 0.034), ('alternative', 0.033), ('relationship', 0.033), ('constraints', 0.033), ('construct', 0.033), ('steps', 0.032), ('solving', 0.031), ('exactly', 0.031), ('combination', 0.03), ('machines', 0.029), ('level', 0.027), ('institute', 0.027), ('ma', 0.026), ('require', 0.026), ('perform', 0.025), ('give', 0.025), ('equal', 0.025), ('sample', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
Author: Gert Cauwenberghs, Tomaso Poggio
Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the KuhnTucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental
2 0.11255595 18 nips-2000-Active Support Vector Machine Classification
Author: Olvi L. Mangasarian, David R. Musicant
Abstract: An active set strategy is applied to the dual of a simple reformulation of the standard quadratic program of a linear support vector machine. This application generates a fast new dual algorithm that consists of solving a finite number of linear equations, with a typically large dimensionality equal to the number of points to be classified. However, by making novel use of the Sherman-MorrisonWoodbury formula , a much smaller matrix of the order of the original input space is inverted at each step. Thus, a problem with a 32-dimensional input space and 7 million points required inverting positive definite symmetric matrices of size 33 x 33 with a total running time of 96 minutes on a 400 MHz Pentium II. The algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. 1
3 0.092301212 54 nips-2000-Feature Selection for SVMs
Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik
Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.
4 0.080643103 52 nips-2000-Fast Training of Support Vector Classifiers
Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez
Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. Platt's web page http://research.microsoft.comr jplatt/smo.html/; the Gauss-M data set is a two dimensional classification problem proposed in [3] to test neural networks, which comprises a gaussian random variable for each class, which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Ratsch web page http://svm.first.gmd.der raetschl. The selection of C and the RKHS has been done as indicated in [11] for Adult and Web data sets and in http://svm.first.gmd.derraetschl for Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been elected as the one that reduces the runtime complexity. Database Dim Adult6 Adult9 Adult! Web 1 Web7 Gauss-M Gauss-M Banana Banana Diabetes Splice 123 123 123 300 300 2 2 2 2 8 69 N Sampl. 11221 32562 1605 2477 24693 4000 4000 400 4900 768 2175 C a SV 1 1 1000 5 5 1 100 316.2 316.2 10 1000 10 10 10 10 10 1 1 1 1 2 70 4477 12181 630 224 1444 1736 1516 80 1084 409 525 q CPU time radit light radit light 150 130 100 100 150 70 100 40 70 40 150 40 70 10 10 10 10 10 70 40 10 20 118.2 1093.29 25.98 2.42 158.13 12.69 61.68 0.33 22.46 2.41 14.06 124.46 1097.09 113.54 2.36 124.57 48.28 3053.20 0.77 1786.56 6.04 49.19 Table 3: Several data sets runtime complexity, when solved with the short, and SV Mlight, light for short. s v c radit , radit for One can appreciate that the svcradit is faster than the SV M1ig ht for most data sets. For the Web data set, which is the only data set the SV Mlight is sligthly faster, the value of C is low and most training samples end up as support vector with (3i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as the SV Mlig ht does [6], because most training samples (3i will not be affected by the others training samples (3j value. But in those case the value of C increases the SV c radit samples selection strategy is a much more appropriate strategy than the one used in SV Mlight. 5 CONCLUSIONS In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster that the usual QP procedure, and simpler to implement, because the most difficult step is the linear equation system solution that can be easily obtained by LU decomposition means [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working is be large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows to work with large training data set. All these modifications have been concreted in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. 6 ACKNOWLEDGEMENTS We are sincerely grateful to Thorsten Joachims who has allowed and encouraged us to use his SV Mlight to test our IRWLS procedure, comparisons which could not have been properly done otherwise. References [1] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburg, U.S.A., 1992. [2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [3] S. Haykin. Neural Networks: A comprehensivefoundation. Prentice-Hall, 1994. [4] P. W. Holland and R. E. Welch. Robust regression using iterative re-weighted least squares. Communications of Statistics Theory Methods, A6(9):813-27, 1977. [5] T. Joachims. http://www-ai.infonnatik.uni-dortmund.de/forschung/verfahren Isvmlight Isvmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. [6] T. Joachims. Making Large Scale SVM Learning Practical, In Advances in Kernel Methods- Support Vector Learning, Editors SchOlkopf, B., Burges, C. 1. C. and Smola, A. 1., pages 169-184. M.I.T. Press, 1999. [7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A, 1997. [8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. [9] F. Perez-Cruz, A. Navia-Vazquez
5 0.075879902 75 nips-2000-Large Scale Bayes Point Machines
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires o (m 2 ) of memory and 0 (N . m2 ) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1
6 0.064915285 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
7 0.064399198 120 nips-2000-Sparse Greedy Gaussian Process Regression
8 0.063517943 58 nips-2000-From Margin to Sparsity
9 0.059646834 4 nips-2000-A Linear Programming Approach to Novelty Detection
10 0.059225451 111 nips-2000-Regularized Winnow Methods
11 0.05659461 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
12 0.056328956 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting
13 0.056111552 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
14 0.055597726 37 nips-2000-Convergence of Large Margin Separable Linear Classification
15 0.054560419 133 nips-2000-The Kernel Gibbs Sampler
16 0.054105546 74 nips-2000-Kernel Expansions with Unlabeled Examples
17 0.050351966 130 nips-2000-Text Classification using String Kernels
18 0.048973534 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
19 0.046321295 122 nips-2000-Sparse Representation for Gaussian Process Models
20 0.046122883 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
topicId topicWeight
[(0, 0.148), (1, 0.122), (2, -0.033), (3, 0.007), (4, -0.04), (5, 0.017), (6, 0.007), (7, 0.0), (8, -0.008), (9, 0.085), (10, -0.058), (11, 0.021), (12, 0.021), (13, -0.018), (14, 0.102), (15, -0.043), (16, -0.094), (17, -0.024), (18, 0.002), (19, -0.081), (20, 0.011), (21, -0.05), (22, 0.061), (23, 0.084), (24, -0.173), (25, -0.107), (26, -0.031), (27, 0.14), (28, -0.211), (29, -0.053), (30, 0.146), (31, 0.187), (32, -0.045), (33, -0.075), (34, -0.011), (35, -0.022), (36, 0.053), (37, -0.156), (38, 0.062), (39, 0.173), (40, 0.066), (41, 0.146), (42, 0.019), (43, 0.046), (44, 0.021), (45, 0.046), (46, 0.065), (47, 0.157), (48, 0.041), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.95462269 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
Author: Gert Cauwenberghs, Tomaso Poggio
Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the KuhnTucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental
2 0.72794425 52 nips-2000-Fast Training of Support Vector Classifiers
Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez
Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. Platt's web page http://research.microsoft.comr jplatt/smo.html/; the Gauss-M data set is a two dimensional classification problem proposed in [3] to test neural networks, which comprises a gaussian random variable for each class, which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Ratsch web page http://svm.first.gmd.der raetschl. The selection of C and the RKHS has been done as indicated in [11] for Adult and Web data sets and in http://svm.first.gmd.derraetschl for Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been elected as the one that reduces the runtime complexity. Database Dim Adult6 Adult9 Adult! Web 1 Web7 Gauss-M Gauss-M Banana Banana Diabetes Splice 123 123 123 300 300 2 2 2 2 8 69 N Sampl. 11221 32562 1605 2477 24693 4000 4000 400 4900 768 2175 C a SV 1 1 1000 5 5 1 100 316.2 316.2 10 1000 10 10 10 10 10 1 1 1 1 2 70 4477 12181 630 224 1444 1736 1516 80 1084 409 525 q CPU time radit light radit light 150 130 100 100 150 70 100 40 70 40 150 40 70 10 10 10 10 10 70 40 10 20 118.2 1093.29 25.98 2.42 158.13 12.69 61.68 0.33 22.46 2.41 14.06 124.46 1097.09 113.54 2.36 124.57 48.28 3053.20 0.77 1786.56 6.04 49.19 Table 3: Several data sets runtime complexity, when solved with the short, and SV Mlight, light for short. s v c radit , radit for One can appreciate that the svcradit is faster than the SV M1ig ht for most data sets. For the Web data set, which is the only data set the SV Mlight is sligthly faster, the value of C is low and most training samples end up as support vector with (3i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as the SV Mlig ht does [6], because most training samples (3i will not be affected by the others training samples (3j value. But in those case the value of C increases the SV c radit samples selection strategy is a much more appropriate strategy than the one used in SV Mlight. 5 CONCLUSIONS In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster that the usual QP procedure, and simpler to implement, because the most difficult step is the linear equation system solution that can be easily obtained by LU decomposition means [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working is be large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows to work with large training data set. All these modifications have been concreted in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. 6 ACKNOWLEDGEMENTS We are sincerely grateful to Thorsten Joachims who has allowed and encouraged us to use his SV Mlight to test our IRWLS procedure, comparisons which could not have been properly done otherwise. References [1] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburg, U.S.A., 1992. [2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [3] S. Haykin. Neural Networks: A comprehensivefoundation. Prentice-Hall, 1994. [4] P. W. Holland and R. E. Welch. Robust regression using iterative re-weighted least squares. Communications of Statistics Theory Methods, A6(9):813-27, 1977. [5] T. Joachims. http://www-ai.infonnatik.uni-dortmund.de/forschung/verfahren Isvmlight Isvmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. [6] T. Joachims. Making Large Scale SVM Learning Practical, In Advances in Kernel Methods- Support Vector Learning, Editors SchOlkopf, B., Burges, C. 1. C. and Smola, A. 1., pages 169-184. M.I.T. Press, 1999. [7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A, 1997. [8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. [9] F. Perez-Cruz, A. Navia-Vazquez
3 0.68984395 18 nips-2000-Active Support Vector Machine Classification
Author: Olvi L. Mangasarian, David R. Musicant
Abstract: An active set strategy is applied to the dual of a simple reformulation of the standard quadratic program of a linear support vector machine. This application generates a fast new dual algorithm that consists of solving a finite number of linear equations, with a typically large dimensionality equal to the number of points to be classified. However, by making novel use of the Sherman-MorrisonWoodbury formula , a much smaller matrix of the order of the original input space is inverted at each step. Thus, a problem with a 32-dimensional input space and 7 million points required inverting positive definite symmetric matrices of size 33 x 33 with a total running time of 96 minutes on a 400 MHz Pentium II. The algorithm requires no specialized quadratic or linear programming code, but merely a linear equation solver which is publicly available. 1
4 0.4964993 54 nips-2000-Feature Selection for SVMs
Author: Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, Vladimir Vapnik
Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA micro array data.
5 0.43476704 116 nips-2000-Sex with Support Vector Machines
Author: Baback Moghaddam, Ming-Hsuan Yang
Abstract: unkown-abstract
7 0.3941519 5 nips-2000-A Mathematical Programming Approach to the Kernel Fisher Algorithm
8 0.35654813 111 nips-2000-Regularized Winnow Methods
9 0.3545033 120 nips-2000-Sparse Greedy Gaussian Process Regression
10 0.32894626 12 nips-2000-A Support Vector Method for Clustering
11 0.3241393 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
12 0.30805352 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
13 0.30419555 44 nips-2000-Efficient Learning of Linear Perceptrons
14 0.29129732 133 nips-2000-The Kernel Gibbs Sampler
15 0.2828126 22 nips-2000-Algorithms for Non-negative Matrix Factorization
16 0.27875501 144 nips-2000-Vicinal Risk Minimization
17 0.26249132 75 nips-2000-Large Scale Bayes Point Machines
18 0.25593355 74 nips-2000-Kernel Expansions with Unlabeled Examples
19 0.24928656 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations
20 0.24763785 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice
topicId topicWeight
[(10, 0.033), (17, 0.117), (55, 0.011), (62, 0.016), (67, 0.014), (75, 0.523), (76, 0.037), (90, 0.115), (97, 0.02)]
simIndex simValue paperId paperTitle
1 0.82244337 30 nips-2000-Bayesian Video Shot Segmentation
Author: Nuno Vasconcelos, Andrew Lippman
Abstract: Prior knowledge about video structure can be used both as a means to improve the peiformance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulations is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.
same-paper 2 0.81658554 70 nips-2000-Incremental and Decremental Support Vector Machine Learning
Author: Gert Cauwenberghs, Tomaso Poggio
Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the KuhnTucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental
3 0.75564986 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
Author: Naftali Tishby, Noam Slonim
Abstract: We introduce a new, non-parametric and principled, distance based clustering method. This method combines a pairwise based approach with a vector-quantization method which provide a meaningful interpretation to the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examine the decay of mutual-information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and then are extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution. 1
4 0.44058707 79 nips-2000-Learning Segmentation by Random Walks
Author: Marina Meila, Jianbo Shi
Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1
5 0.35240221 12 nips-2000-A Support Vector Method for Clustering
Author: Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik
Abstract: We present a novel method for clustering using the support vector machine approach. Data points are mapped to a high dimensional feature space, where support vectors are used to define a sphere enclosing them. The boundary of the sphere forms in data space a set of closed contours containing the data. Data points enclosed by each contour are defined as a cluster. As the width parameter of the Gaussian kernel is decreased, these contours fit the data more tightly and splitting of contours occurs. The algorithm works by separating clusters according to valleys in the underlying probability distribution, and thus clusters can take on arbitrary geometrical shapes. As in other SV algorithms, outliers can be dealt with by introducing a soft margin constant leading to smoother cluster boundaries. The structure of the data is explored by varying the two parameters. We investigate the dependence of our method on these parameters and apply it to several data sets.
6 0.34147307 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script
7 0.33112711 52 nips-2000-Fast Training of Support Vector Classifiers
8 0.32628751 128 nips-2000-Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra
9 0.32485482 74 nips-2000-Kernel Expansions with Unlabeled Examples
10 0.31918341 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition
11 0.31612906 81 nips-2000-Learning Winner-take-all Competition Between Groups of Neurons in Lateral Inhibitory Networks
12 0.31486678 4 nips-2000-A Linear Programming Approach to Novelty Detection
13 0.31129211 3 nips-2000-A Gradient-Based Boosting Algorithm for Regression Problems
14 0.31051102 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
15 0.30174559 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
16 0.30016977 122 nips-2000-Sparse Representation for Gaussian Process Models
17 0.29215312 18 nips-2000-Active Support Vector Machine Classification
18 0.29205716 117 nips-2000-Shape Context: A New Descriptor for Shape Matching and Object Recognition
19 0.29194191 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
20 0.29045308 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling