nips nips2004 nips2004-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). [sent-6, score-0.616]
2 This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. [sent-7, score-0.594]
3 It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. [sent-8, score-0.534]
4 Recent results (Steinwart, 2004) indicate that the number k of SVs increases linearly with the number n of training examples. [sent-10, score-0.249]
5 More specifically, k/n −→ 2BK (1), where n is the number of training examples and BK is the smallest classification error achievable with the SVM kernel K. [sent-11, score-0.471]
6 When using a universal kernel such as the Radial Basis Function kernel, BK is the Bayes risk B, i.e. [sent-12, score-0.176]
7 the smallest classification error achievable with any decision function. [sent-14, score-0.151]
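This linear growth is easy to observe empirically on a noisy problem. The sketch below is illustrative only and is not taken from the paper; it assumes Python with numpy and scikit-learn, whose SVC class wraps LIBSVM, and uses two overlapping Gaussian classes so that the Bayes risk is non-zero:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
for n in (500, 1000, 2000, 4000):
    # two overlapping Gaussian clouds: the problem is not separable
    X = np.vstack([rng.randn(n // 2, 2) + 1.0, rng.randn(n // 2, 2) - 1.0])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
    print(n, clf.support_.size)  # the SV count grows roughly linearly with n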
8 The computational requirements of modern SVM training algorithms (Joachims, 1999; Chang and Lin, 2001) are very largely determined by the amount of memory required to store the active segment of the kernel matrix. [sent-15, score-0.35]
9 When this amount exceeds the available memory, the training time increases quickly because some kernel matrix coefficients must be recomputed multiple times. [sent-16, score-0.355]
10 During the final phase of the training process, the active segment always contains all the k(k + 1)/2 dot products between SVs. [sent-17, score-0.21]
11 Steinwart’s result (1) then suggests that the critical amount of memory scales at least like B²n². [sent-18, score-0.071]
12 This can be practically prohibitive for problems with either big training sets or large Bayes risk (noisy problems). [sent-19, score-0.317]
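For a rough sense of scale (our own back-of-the-envelope numbers, not figures from the paper), caching all k(k + 1)/2 kernel values between SVs at 8 bytes per entry gives:

for k in (1000, 10000, 100000):          # illustrative SV counts
    entries = k * (k + 1) // 2           # kernel values between all pairs of SVs
    print(k, entries * 8 / 1e9, "GB")    # about 0.4 GB at k = 10000, 40 GB at k = 100000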
13 Large numbers of SVs also penalize SVMs during the prediction stage as the computation of the decision function requires a time proportional to the number of SVs. [sent-20, score-0.195]
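This cost follows directly from the form of the kernel expansion: evaluating the decision function requires one kernel evaluation per SV. A minimal numpy sketch with made-up support vectors and coefficients (coef stands for alpha_i y_i; none of these names come from the paper):

import numpy as np

def rbf_decision(x, sv, coef, b, gamma=0.5):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b: one kernel evaluation per SV,
    # so prediction time is proportional to the number of SVs
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return float(np.dot(coef, k) + b)

sv = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.5]])   # three illustrative SVs
coef = np.array([0.7, -0.7, 0.3])
print(rbf_decision(np.array([0.2, 0.4]), sv, coef, b=0.1))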
14 In this paper, we propose to selectively remove examples from the training set using probabilistic estimates inspired by training set editing algorithms (Devijver and Kittler, 1982). [sent-26, score-0.826]
15 The removal procedure aims at creating a separable set of training examples without modifying the location of the decision boundary. [sent-27, score-0.594]
16 Making the problem separable breaks the linear dependency between the number of SVs and the number of training examples. [sent-28, score-0.342]
17 1 Related work. Salient facts about SVMs. We focus now on the C-SVM applied to the two-class pattern recognition problem. [sent-30, score-0.05]
18 Within each category, the possible values of the margins yi f (xi ) are prescribed by the Karush-Kuhn-Tucker optimality conditions. [sent-33, score-0.091]
19 - Examples such that α∗i = C are called bouncing SVs or margin errors and satisfy yi f (xi ) < 1. [sent-34, score-0.223]
20 The set of bouncing SVs includes all training examples misclassified by the SVM, i. [sent-35, score-0.467]
21 those which have a negative margin yi f (xi ) < 0. [sent-37, score-0.122]
22 - Examples such that 0 < α∗i < C are called ordinary SVs and satisfy yi f (xi ) = 1. [sent-38, score-0.091]
23 - Examples such that α∗i = 0 satisfy the relation yi f (xi ) > 1. [sent-39, score-0.091]
24 These examples play no role in the SVM decision function (2). [sent-40, score-0.271]
25 Retraining after discarding these examples would still yield the same SVM decision function (2). [sent-41, score-0.35]
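These three categories can be read off a trained SVM. A sketch under the assumption that scikit-learn's SVC (a LIBSVM wrapper) is used; its dual_coef_ attribute stores yi alpha_i for the support vectors only, and the data below are synthetic:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + 1.0, rng.randn(100, 2) - 1.0])
y = np.hstack([np.ones(100), -np.ones(100)])
C = 1.0
clf = SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y)
alpha = np.abs(clf.dual_coef_).ravel()                    # alpha_i, support vectors only
bouncing = clf.support_[np.isclose(alpha, C)]             # alpha_i = C: margin errors
ordinary = clf.support_[~np.isclose(alpha, C)]            # 0 < alpha_i < C: on the margin
non_sv = np.setdiff1d(np.arange(len(y)), clf.support_)    # alpha_i = 0: play no role
print(len(bouncing), len(ordinary), len(non_sv))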
26 These facts provide some insight into Steinwart’s result (1). [sent-42, score-0.05]
27 The SVM decision function, like any other decision rule, must asymptotically misclassify at least Bn examples, where B is the Bayes risk. [sent-43, score-0.346]
28 To illustrate dependence on the Bayes risk, we perform a linear classification task in two dimensions under varying amounts of class overlap. [sent-45, score-0.071]
29 Several techniques aim to reduce the prediction complexity of SVMs by expressing the SVM solution (2) with a smaller kernel expansion. [sent-51, score-0.069]
30 Since one must compute the SVM solution before applying these post-processing techniques, they are not suitable for reducing the complexity of the training stage. [sent-52, score-0.21]
31 Burges (Burges, 1996) proposes to construct new patterns zj in order to define a compact approximation of the decision function (2). [sent-54, score-0.115]
32 A small number of training examples of class y = −1 can still appear in such a region. [sent-57, score-0.4]
33 We say that they are located on the wrong side of the Bayes decision boundary. [sent-58, score-0.262]
34 Asymptotically, all such training examples belong to the condensed training set in order to ensure that they are properly recognized as members of class y = −1. [sent-59, score-0.739]
35 The Edited Nearest Neighbor rule (Wilson, 1972) suggests first discarding all training examples that are misclassified when applying the 1-NN rule using all n − 1 remaining examples as the training set. [sent-61, score-0.967]
36 It was shown that removing these examples improves the asymptotic performance of the nearest neighbor rule. [sent-62, score-0.34]
37 Whereas the 1-NN risk is asymptotically bounded by 2B, the Edited 1-NN risk is asymptotically bounded by 1. [sent-63, score-0.446]
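Wilson's editing step can be sketched as follows, assuming a leave-one-out application of the 1-NN rule (scikit-learn and numpy are our choice of tools, not the paper's, and the data below are synthetic):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def wilson_edit(X, y):
    # discard every example that 1-NN on the other n-1 examples misclassifies;
    # neighbor 0 of each point is the point itself (assumes no duplicate points)
    nn = KNeighborsClassifier(n_neighbors=2).fit(X, y)
    loo = nn.kneighbors(X, return_distance=False)[:, 1]
    keep = y[loo] == y
    return X[keep], y[keep]

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.where(X[:, 0] + 0.8 * rng.randn(300) > 0, 1.0, -1.0)   # noisy labels
Xe, ye = wilson_edit(X, y)
print(len(ye), "examples kept out of", len(y))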
38 The MULTIEDIT algorithm asymptotically discards all the training examples located on the wrong side of the Bayes decision boundary. [sent-66, score-0.797]
39 The asymptotic risk of the multi-edited nearest neighbor rule is the Bayes risk B. [sent-67, score-0.433]
40 1 Divide the training data randomly into s splits S1, . . . , Ss. [sent-69, score-0.21]
41 Let us call fi the 1-NN classifier that uses Si as the training set. [sent-73, score-0.21]
42 2 Classify all examples in Si using the classifier f(i+1) mod s and discard all misclassified examples. [sent-74, score-0.28]
43 3 Gather all the remaining examples and return to step 1 if any example has been discarded during the last T iterations. [sent-75, score-0.244]
44 4 The remaining examples constitute the multiedited training set. [sent-76, score-0.407]
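A direct transcription of these four steps into Python, using scikit-learn's 1-NN classifier; the number of splits s and the patience T are free parameters, and this is a sketch rather than the authors' code:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def multiedit(X, y, s=3, T=5, seed=0):
    rng = np.random.RandomState(seed)
    quiet = 0                                                 # iterations without any discard
    while quiet < T:
        splits = np.array_split(rng.permutation(len(y)), s)   # step 1: S_1 .. S_s
        keep = np.ones(len(y), dtype=bool)
        for i, Si in enumerate(splits):
            Sj = splits[(i + 1) % s]                          # classifier f_{(i+1) mod s}
            f = KNeighborsClassifier(n_neighbors=1).fit(X[Sj], y[Sj])
            keep[Si] = f.predict(X[Si]) == y[Si]              # step 2: discard misclassified
        quiet = quiet + 1 if keep.all() else 0                # step 3: stop once stable for T rounds
        X, y = X[keep], y[keep]
    return X, y                                               # step 4: the multi-edited training set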
45 By discarding examples located on the wrong side of the Bayes decision boundary, algorithm MULTIEDIT constructs a new training set whose apparent distribution has the same Bayes decision boundary as the original problem, but with Bayes risk equal to 0. [sent-77, score-0.984]
46 Devijver and Kittler claim that MULTIEDIT produces an ideal training set for CONDENSE. [sent-78, score-0.21]
47 Algorithm MULTIEDIT also discards some proportion of training examples located on the correct side of the Bayes decision boundary. [sent-79, score-0.631]
48 4 Editing algorithms and SVMs Training examples recognized with high confidence usually do not appear in the SVM solution (2) because they do not become support vectors. [sent-85, score-0.278]
49 On the other hand, outliers always become support vectors. [sent-86, score-0.06]
50 The mathematical proofs for the asymptotic properties of MULTIEDIT depend on the specific nature of the 1-NN classification rule. [sent-88, score-0.06]
51 Editing SVM training sets implicitly modifies the SVM loss function in a way that relates to robust statistics. [sent-92, score-0.268]
52 Editing alters the apparent distribution of training examples such that the class distributions P (x | y = 1) and P (x | y = −1) no longer overlap. [sent-93, score-0.455]
53 If the class distributions were known, this could be done by trimming the tails of the class distributions. [sent-94, score-0.068]
54 A similar effect could be obtained by altering the SVM loss function (the hinge loss) into a non-convex loss function that gives less weight to outliers. [sent-95, score-0.148]
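For intuition only (this construction is ours, not the paper's): the hinge loss grows without bound as the margin becomes more negative, whereas a truncated, non-convex "ramp" version caps the contribution of outliers.

import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def ramp(margin, s=-1.0):
    # truncated hinge: margins below s all receive the same bounded penalty
    return np.minimum(hinge(margin), 1.0 - s)

m = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])
print(hinge(m))   # [6.  2.  1.  0.5 0. ]
print(ramp(m))    # [2.  2.  1.  0.5 0. ]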
55 3 Cross-Training Cross-Training is a representative algorithm of such combinations of SVMs and editing algorithms. [sent-96, score-0.236]
56 It begins by creating s subsets of the training set with r examples each. [sent-97, score-0.464]
57 The decision functions of these SVMs are then used to discard two types of training examples: those which are confidently recognized, as in CONDENSE, and those which are misclassified, as in MULTIEDIT. [sent-99, score-0.449]
58 1 Create s subsets of size r by randomly drawing r/2 examples of each class. [sent-102, score-0.239]
59 2 Train SVMs f1, . . . , fs using each of the subsets as the training set. [sent-106, score-0.26]
60 3 For each training example (xi , yi ), estimate the margin average mi and variance vi: mi = (1/s) Σr yi fr (xi ) and vi = (1/s) Σr (mi − yi fr (xi ))², where the sums run over r = 1, . . . , s. 4 Discard all training examples for which mi + vi < 0. [sent-107, score-1.309]
61 5 Discard all training examples for which mi − vi > 1. [sent-108, score-0.479]
62 6 Train a final SVM on the remaining training examples. [sent-109, score-0.251]
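The six steps can be condensed into the following sketch, again assuming scikit-learn; the kernel, C, gamma and the class-balancing refinements discussed later are deliberately left as placeholders, and labels are assumed to lie in {-1, +1}:

import numpy as np
from sklearn.svm import SVC

def cross_train(X, y, s=5, r=2000, C=1.0, gamma=0.5, seed=0):
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(s):                                    # steps 1-2: s SVMs on balanced subsets
        idx = np.hstack([rng.choice(np.where(y == c)[0], r // 2, replace=False)
                         for c in (-1, 1)])
        models.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], y[idx]))
    F = np.column_stack([m.decision_function(X) for m in models])
    margins = y[:, None] * F                              # y_i f_r(x_i) for r = 1..s
    mi, vi = margins.mean(axis=1), margins.var(axis=1)    # step 3
    keep = (mi + vi >= 0) & (mi - vi <= 1)                # steps 4-5: drop both tails
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(X[keep], y[keep])   # step 6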
63 The apparent simplicity of this algorithm hides a lot of hyperparameters. [sent-110, score-0.055]
64 For the first stage SVMs, we choose the C parameter which yields the best performance on training sets of size r. [sent-112, score-0.289]
65 For the second stage SVMs, we choose the C parameter which yields the best overall performance measured on a separate validation set. [sent-113, score-0.082]
66 Furthermore, we discovered that the discarding steps tend to produce a final set of training examples with very different numbers of examples for each class. [sent-114, score-0.635]
67 Specific measures to alleviate this problem are discussed in section 4. [sent-115, score-0.045]
68 Further comfort comes from the knowledge that an SVM with the RBF kernel and without threshold term b implements the 1-NN rule when the RBF radius tends to zero. [sent-117, score-0.104]
69 Figure 3: Comparing LIBSVM and Cross-Training on a toy problem of two Gaussian clouds for increasing number of training points. [sent-126, score-0.334]
70 Cross-Training gives an almost constant number of support vectors (left figure) for increasing training set size, whereas in LIBSVM the number of support vectors increases linearly. [sent-127, score-0.369]
71 The error rates behave similarly (middle figure), and Cross-Training gives an improved training time (right figure). [sent-128, score-0.21]
72 1 Artificial Data We first constructed artificial data, by generating two classes from two Gaussian clouds in 10 dimensions with means (1, 1, 1, 1, 1, 0, 0, 0, 0, 0) and (−1, −1, −1, −1, −1, 0, 0, 0, 0, 0) and standard deviation 4. [sent-132, score-0.058]
73 We trained a linear SVM for differing numbers of training points, selecting C via cross-validation. [sent-133, score-0.25]
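The setup just described can be reproduced along the following lines; the C grid and the use of scikit-learn's GridSearchCV are our own choices and only stand in for the cross-validation mentioned above:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
mean = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

def make_data(n):
    # two Gaussian clouds with means +/- mean and standard deviation 4
    X = np.vstack([rng.randn(n // 2, 10) * 4 + mean, rng.randn(n // 2, 10) * 4 - mean])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    return X, y

X, y = make_data(4000)
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3).fit(X, y)
print(search.best_params_, search.best_estimator_.support_.size)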
74 The results given in figure 3 show a reduction in SVs and computation time using Cross-Training, with no loss in accuracy. [sent-135, score-0.058]
75 There are 11982 training examples and 1984 testing examples. [sent-139, score-0.366]
76 All experiments were carried out using LIBSVM’s ν-SVM (Chang and Lin, 2001) with the RBF kernel (γ = 0. [sent-140, score-0.106]
77 Cross-Training was carried out by splitting the 11982 training examples into 5 subsets. [sent-142, score-0.403]
78 Figure 4 reports our results for various amounts of label noise. [sent-143, score-0.082]
79 Since our label noise is artificial, we can also measure the misclassification rate on the unmodified testing set (right figure). [sent-146, score-0.077]
80 This measurement shows a slight loss of accuracy without statistical significance. [sent-147, score-0.09]
81 Experimental results were quite disappointing until we realized that the discarding steps tend to produce training sets with very different numbers of examples for each class. [sent-150, score-0.479]
82 To alleviate this problem, after training each SVM, we choose the value of the threshold b. [sent-151, score-0.255]
83 Figure 4: Number of SVs (left figure) and test error (middle figure) for varying amounts of label noise on the MNIST 3-8 discrimination task. [sent-174, score-0.117]
84 The x-axis in all graphs shows the amount of label noise; white squares correspond to LIBSVM; black circles to Cross-Training; dashed lines to bagging the first stage Cross-Training SVMs. [sent-175, score-0.125]
85 We also attempt to balance the final training set by re-inserting examples discarded during step [5] of the cross-training algorithm. [sent-179, score-0.413]
86 Experiments were carried out using RBF kernels with the kernel width reported in the literature. [sent-180, score-0.106]
87 In the SVM experiments, the value of parameter C was determined by cross-validation and then used for training an SVM on the full dataset. [sent-181, score-0.21]
88 In the cross-training experiments, we make a validation set by taking r/3 examples from the training set. [sent-182, score-0.402]
89 These examples are only used for choosing the values of C and for adjusting the SVM thresholds. [sent-183, score-0.156]
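The extract does not spell out how the thresholds are adjusted, so the following is only one plausible reading: pick the offset b that maximizes accuracy of the shifted decision function on the held-out validation examples.

import numpy as np

def adjust_threshold(scores_val, y_val):
    # scores_val: SVM decision values f(x) on the validation set; y_val in {-1, +1}
    candidates = np.sort(scores_val)
    accs = [(np.where(scores_val > b, 1, -1) == y_val).mean() for b in candidates]
    return candidates[int(np.argmax(accs))]

print(adjust_threshold(np.array([-1.2, -0.3, 0.4, 0.9]), np.array([-1, -1, 1, 1])))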
90 The columns in table 1 contain the dataset name, the size of the training set used for the experiment, the size of the test set, the SVM accuracy and number of SVs, the Cross-Training subset configuration, accuracy, and final number of SVs. [sent-203, score-0.308]
91 These numbers should be considered carefully because they are impacted by the discrete nature of the grid search for parameter C. [sent-205, score-0.075]
92 The general trend still indicates that Cross-Training causes a slight loss of accuracy but requires far fewer SVs. [sent-206, score-0.09]
93 Cross-Training with hyperparameter grid searches runs in two days. [sent-211, score-0.082]
94 Timing results would then depend on loosely controlled details of the hyperparameter grid search algorithms. [sent-217, score-0.082]
95 5 Discussion We have suggested to combine SVMs and training set editing techniques to break the linear relationship between number of support vectors and number of examples. [sent-218, score-0.475]
96 Such combinations raise interesting theoretical questions regarding the relative value of each of the training examples. [sent-219, score-0.21]
97 Experiments with a representative algorithm, namely Cross-Training, confirm that both the training and the recognition time are sharply reduced. [sent-220, score-0.288]
98 On the other hand, Cross-Training causes a minor loss of accuracy, comparable to that of reduced set methods (Burges, 1996), and seems to be more sensitive than SVMs in terms of parameter tuning. [sent-221, score-0.058]
99 Despite these drawbacks, Cross-Training provides a practical means to construct kernel classifiers with significantly larger training sets. [sent-222, score-0.279]
100 Asymptotic properties of the nearest neighbor rules using edited data. [sent-295, score-0.212]
wordName wordTfidf (topN-words)
[('svs', 0.348), ('svm', 0.295), ('multiedit', 0.269), ('libsvm', 0.226), ('training', 0.21), ('svms', 0.208), ('editing', 0.205), ('devijver', 0.168), ('kittler', 0.168), ('examples', 0.156), ('burges', 0.156), ('discard', 0.124), ('steinwart', 0.124), ('bayes', 0.118), ('asymptotically', 0.116), ('decision', 0.115), ('risk', 0.107), ('bouncing', 0.101), ('xtrain', 0.101), ('gure', 0.094), ('yi', 0.091), ('edited', 0.088), ('misclassi', 0.085), ('discarding', 0.079), ('kernel', 0.069), ('condense', 0.067), ('condensed', 0.067), ('crosstraining', 0.067), ('rbf', 0.066), ('neighbor', 0.065), ('separable', 0.065), ('located', 0.065), ('forest', 0.064), ('mi', 0.063), ('recognized', 0.062), ('support', 0.06), ('asymptotic', 0.06), ('vapnik', 0.059), ('nearest', 0.059), ('clouds', 0.058), ('loss', 0.058), ('train', 0.057), ('apparent', 0.055), ('chang', 0.054), ('adult', 0.053), ('discards', 0.053), ('xi', 0.05), ('facts', 0.05), ('nec', 0.05), ('wilson', 0.05), ('wrong', 0.05), ('subsets', 0.05), ('vi', 0.05), ('creating', 0.048), ('discarded', 0.047), ('joachims', 0.047), ('sharply', 0.047), ('lin', 0.046), ('stage', 0.046), ('nal', 0.045), ('alleviate', 0.045), ('fr', 0.045), ('labs', 0.045), ('selectively', 0.045), ('label', 0.042), ('hyperparameter', 0.041), ('america', 0.041), ('bk', 0.041), ('princeton', 0.041), ('grid', 0.041), ('remaining', 0.041), ('amounts', 0.04), ('classi', 0.039), ('increases', 0.039), ('sv', 0.038), ('timing', 0.038), ('amount', 0.037), ('machines', 0.037), ('breaks', 0.037), ('carried', 0.037), ('validation', 0.036), ('middle', 0.036), ('cybernetics', 0.036), ('achievable', 0.036), ('rule', 0.035), ('noise', 0.035), ('numbers', 0.034), ('memory', 0.034), ('class', 0.034), ('arti', 0.033), ('size', 0.033), ('nj', 0.032), ('side', 0.032), ('accuracy', 0.032), ('benchmark', 0.032), ('non', 0.032), ('cial', 0.031), ('representative', 0.031), ('margin', 0.031), ('dependency', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
2 0.20121174 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik
Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1
3 0.18484154 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.18409534 92 nips-2004-Kernel Methods for Implicit Surface Modeling
Author: Joachim Giesen, Simon Spalinger, Bernhard Schölkopf
Abstract: We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. 1
5 0.15749723 68 nips-2004-Face Detection --- Efficient and Rank Deficient
Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1
6 0.14291835 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.12382298 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.12355797 49 nips-2004-Density Level Detection is Classification
9 0.12302044 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
10 0.115391 19 nips-2004-An Application of Boosting to Graph Classification
11 0.11046686 115 nips-2004-Maximum Margin Clustering
12 0.10464341 82 nips-2004-Incremental Algorithms for Hierarchical Classification
13 0.10322272 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
14 0.10307293 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
15 0.090278283 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
16 0.083342731 168 nips-2004-Semigroup Kernels on Finite Sets
17 0.082741678 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
18 0.082593665 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
19 0.078799866 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
20 0.078045994 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
topicId topicWeight
[(0, -0.253), (1, 0.123), (2, -0.057), (3, 0.185), (4, -0.008), (5, 0.102), (6, 0.152), (7, -0.161), (8, 0.022), (9, -0.051), (10, -0.009), (11, 0.065), (12, -0.112), (13, -0.012), (14, -0.082), (15, -0.147), (16, -0.09), (17, 0.114), (18, -0.036), (19, 0.119), (20, -0.104), (21, 0.014), (22, -0.036), (23, -0.021), (24, -0.001), (25, -0.089), (26, -0.052), (27, 0.06), (28, 0.091), (29, -0.111), (30, 0.117), (31, 0.032), (32, 0.058), (33, -0.107), (34, -0.08), (35, -0.003), (36, -0.001), (37, 0.087), (38, -0.094), (39, -0.031), (40, -0.059), (41, -0.078), (42, -0.094), (43, -0.049), (44, -0.051), (45, -0.014), (46, 0.01), (47, -0.051), (48, -0.085), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.97027421 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
2 0.83426654 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik
Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1
3 0.62837392 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
4 0.60821044 68 nips-2004-Face Detection --- Efficient and Rank Deficient
Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1
5 0.5995062 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
6 0.59738344 92 nips-2004-Kernel Methods for Implicit Surface Modeling
7 0.58547938 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
8 0.55310947 82 nips-2004-Incremental Algorithms for Hierarchical Classification
9 0.54566014 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
10 0.54363906 69 nips-2004-Fast Rates to Bayes for Kernel Machines
11 0.52402657 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
12 0.50518948 19 nips-2004-An Application of Boosting to Graph Classification
13 0.49826542 115 nips-2004-Maximum Margin Clustering
14 0.49395362 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
15 0.4830018 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
16 0.46441036 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
17 0.45582658 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
18 0.41674635 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
19 0.41341469 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
20 0.4132871 178 nips-2004-Support Vector Classification with Input Data Uncertainty
topicId topicWeight
[(13, 0.061), (15, 0.104), (26, 0.539), (31, 0.016), (33, 0.127), (35, 0.015), (39, 0.017), (50, 0.025), (51, 0.011)]
simIndex simValue paperId paperTitle
1 0.95816576 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
same-paper 2 0.9411602 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
3 0.93108243 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
Author: Evan C. Smith, Michael S. Lewicki
Abstract: The representation of acoustic signals at the cochlear nerve must serve a wide range of auditory tasks that require exquisite sensitivity in both time and frequency. Lewicki (2002) demonstrated that many of the filtering properties of the cochlea could be explained in terms of efficient coding of natural sounds. This model, however, did not account for properties such as phase-locking or how sound could be encoded in terms of action potentials. Here, we extend this theoretical approach with algorithm for learning efficient auditory codes using a spiking population code. Here, we propose an algorithm for learning efficient auditory codes using a theoretical model for coding sound in terms of spikes. In this model, each spike encodes the precise time position and magnitude of a localized, time varying kernel function. By adapting the kernel functions to the statistics natural sounds, we show that, compared to conventional signal representations, the spike code achieves far greater coding efficiency. Furthermore, the inferred kernels show both striking similarities to measured cochlear filters and a similar bandwidth versus frequency dependence. 1
4 0.90454334 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
Author: Maria-florina Balcan, Avrim Blum, Ke Yang
Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1
5 0.72140956 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
Author: Alan L. Yuille
Abstract: This paper analyzes generalization of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and P C, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
6 0.5980221 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.59734249 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
8 0.59267288 48 nips-2004-Convergence and No-Regret in Multiagent Learning
9 0.58537263 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.58383501 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
11 0.58079022 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
12 0.56868756 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.55730617 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
14 0.55659294 103 nips-2004-Limits of Spectral Clustering
15 0.55623484 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern
16 0.55078524 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
17 0.54404825 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
18 0.53667307 28 nips-2004-Bayesian inference in spiking neurons
19 0.53614426 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity
20 0.53249341 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill