nips nips2004 nips2004-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). [sent-6, score-0.616]
2 This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. [sent-7, score-0.594]
3 It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. [sent-8, score-0.534]
4 Recent results (Steinwart, 2004) indicate that the number k of SVs increases linearly with the number n of training examples. [sent-10, score-0.249]
5 More specifically, k/n −→ 2BK (1), where n is the number of training examples and BK is the smallest classification error achievable with the SVM kernel K. [sent-11, score-0.471]
6 When using a universal kernel such as the Radial Basis Function kernel, BK is the Bayes risk B, i.e. [sent-12, score-0.176]
7 the smallest classification error achievable with any decision function. [sent-14, score-0.151]
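This linear growth is easy to observe empirically on a noisy problem. The sketch below is illustrative only and is not taken from the paper; it assumes Python with numpy and scikit-learn, whose SVC class wraps LIBSVM, and uses two overlapping Gaussian classes so that the Bayes risk is non-zero:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
for n in (500, 1000, 2000, 4000):
    # two overlapping Gaussian clouds: the problem is not separable
    X = np.vstack([rng.randn(n // 2, 2) + 1.0, rng.randn(n // 2, 2) - 1.0])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
    print(n, clf.support_.size)  # the SV count grows roughly linearly with n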
8 The computational requirements of modern SVM training algorithms (Joachims, 1999; Chang and Lin, 2001) are very largely determined by the amount of memory required to store the active segment of the kernel matrix. [sent-15, score-0.35]
9 When this amount exceeds the available memory, the training time increases quickly because some kernel matrix coefficients must be recomputed multiple times. [sent-16, score-0.355]
10 During the final phase of the training process, the active segment always contains all the k(k + 1)/2 dot products between SVs. [sent-17, score-0.21]
11 Steinwart’s result (1) then suggests that the critical amount of memory scales at least like B²n². [sent-18, score-0.071]
12 This can be practically prohibitive for problems with either big training sets or large Bayes risk (noisy problems). [sent-19, score-0.317]
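For a rough sense of scale (our own back-of-the-envelope numbers, not figures from the paper), caching all k(k + 1)/2 kernel values between SVs at 8 bytes per entry gives:

for k in (1000, 10000, 100000):          # illustrative SV counts
    entries = k * (k + 1) // 2           # kernel values between all pairs of SVs
    print(k, entries * 8 / 1e9, "GB")    # about 0.4 GB at k = 10000, 40 GB at k = 100000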
13 Large numbers of SVs also penalize SVMs during the prediction stage as the computation of the decision function requires a time proportional to the number of SVs. [sent-20, score-0.195]
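This cost follows directly from the form of the kernel expansion: evaluating the decision function requires one kernel evaluation per SV. A minimal numpy sketch with made-up support vectors and coefficients (coef stands for alpha_i y_i; none of these names come from the paper):

import numpy as np

def rbf_decision(x, sv, coef, b, gamma=0.5):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b: one kernel evaluation per SV,
    # so prediction time is proportional to the number of SVs
    k = np.exp(-gamma * np.sum((sv - x) ** 2, axis=1))
    return float(np.dot(coef, k) + b)

sv = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.5]])   # three illustrative SVs
coef = np.array([0.7, -0.7, 0.3])
print(rbf_decision(np.array([0.2, 0.4]), sv, coef, b=0.1))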
14 In this paper, we propose to selectively remove examples from the training set using probabilistic estimates inspired by training set editing algorithms (Devijver and Kittler, 1982). [sent-26, score-0.826]
15 The removal procedure aims at creating a separable set of training examples without modifying the location of the decision boundary. [sent-27, score-0.594]
16 Making the problem separable breaks the linear dependency between the number of SVs and the number of training examples. [sent-28, score-0.342]
17 1 Related work. Salient facts about SVMs. We focus now on the C-SVM applied to the two-class pattern recognition problem. [sent-30, score-0.05]
18 Within each category, the possible values of the margins yi f (xi ) are prescribed by the Karush-Kuhn-Tucker optimality conditions. [sent-33, score-0.091]
19 - Examples such that α∗i = C are called bouncing SVs or margin errors and satisfy yi f (xi ) < 1. [sent-34, score-0.223]
20 The set of bouncing SVs includes all training examples misclassified by the SVM, i. [sent-35, score-0.467]
21 those which have a negative margin yi f (xi ) < 0. [sent-37, score-0.122]
22 - Examples such that 0 < α∗i < C are called ordinary SVs and satisfy yi f (xi ) = 1. [sent-38, score-0.091]
23 - Examples such that α∗i = 0 satisfy the relation yi f (xi ) > 1. [sent-39, score-0.091]
24 These examples play no role in the SVM decision function (2). [sent-40, score-0.271]
25 Retraining after discarding these examples would still yield the same SVM decision function (2). [sent-41, score-0.35]
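These three categories can be read off a trained SVM. A sketch under the assumption that scikit-learn's SVC (a LIBSVM wrapper) is used; its dual_coef_ attribute stores yi alpha_i for the support vectors only, and the data below are synthetic:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2) + 1.0, rng.randn(100, 2) - 1.0])
y = np.hstack([np.ones(100), -np.ones(100)])
C = 1.0
clf = SVC(kernel="rbf", C=C, gamma=0.5).fit(X, y)
alpha = np.abs(clf.dual_coef_).ravel()                    # alpha_i, support vectors only
bouncing = clf.support_[np.isclose(alpha, C)]             # alpha_i = C: margin errors
ordinary = clf.support_[~np.isclose(alpha, C)]            # 0 < alpha_i < C: on the margin
non_sv = np.setdiff1d(np.arange(len(y)), clf.support_)    # alpha_i = 0: play no role
print(len(bouncing), len(ordinary), len(non_sv))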
26 These facts provide some insight into Steinwart’s result (1). [sent-42, score-0.05]
27 The SVM decision function, like any other decision rule, must asymptotically misclassify at least Bn examples, where B is the Bayes risk. [sent-43, score-0.346]
28 To illustrate dependence on the Bayes risk, we perform a linear classification task in two dimensions under varying amounts of class overlap. [sent-45, score-0.071]
29 Several techniques aim to reduce the prediction complexity of SVMs by expressing the SVM solution (2) with a smaller kernel expansion. [sent-51, score-0.069]
30 Since one must compute the SVM solution before applying these post-processing techniques, they are not suitable for reducing the complexity of the training stage. [sent-52, score-0.21]
31 Burges (Burges, 1996) proposes to construct new patterns zj in order to define a compact approximation of the decision function (2). [sent-54, score-0.115]
32 A small number of training examples of class y = −1 can still appear in such a region. [sent-57, score-0.4]
33 We say that they are located on the wrong side of the Bayes decision boundary. [sent-58, score-0.262]
34 Asymptotically, all such training examples belong to the condensed training set in order to ensure that they are properly recognized as members of class y = −1. [sent-59, score-0.739]
35 The Edited Nearest Neighbor rule (Wilson, 1972) suggests first discarding all training examples that are misclassified when applying the 1-NN rule using all n − 1 remaining examples as the training set. [sent-61, score-0.967]
36 It was shown that removing these examples improves the asymptotic performance of the nearest neighbor rule. [sent-62, score-0.34]
37 Whereas the 1-NN risk is asymptotically bounded by 2B, the Edited 1-NN risk is asymptotically bounded by 1. [sent-63, score-0.446]
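Wilson's editing step can be sketched as follows, assuming a leave-one-out application of the 1-NN rule (scikit-learn and numpy are our choice of tools, not the paper's, and the data below are synthetic):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def wilson_edit(X, y):
    # discard every example that 1-NN on the other n-1 examples misclassifies;
    # neighbor 0 of each point is the point itself (assumes no duplicate points)
    nn = KNeighborsClassifier(n_neighbors=2).fit(X, y)
    loo = nn.kneighbors(X, return_distance=False)[:, 1]
    keep = y[loo] == y
    return X[keep], y[keep]

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = np.where(X[:, 0] + 0.8 * rng.randn(300) > 0, 1.0, -1.0)   # noisy labels
Xe, ye = wilson_edit(X, y)
print(len(ye), "examples kept out of", len(y))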
38 The MULTIEDIT algorithm asymptotically discards all the training examples located on the wrong side of the Bayes decision boundary. [sent-66, score-0.797]
39 The asymptotic risk of the multi-edited nearest neighbor rule is the Bayes risk B. [sent-67, score-0.433]
40 1 Divide the training data randomly into s splits S1, . . . , Ss. [sent-69, score-0.21]
41 Let us call fi the 1-NN classifier that uses Si as the training set. [sent-73, score-0.21]
42 2 Classify all examples in Si using the classifier f(i+1) mod s and discard all misclassified examples. [sent-74, score-0.28]
43 3 Gather all the remaining examples and return to step 1 if any example has been discarded during the last T iterations. [sent-75, score-0.244]
44 4 The remaining examples constitute the multiedited training set. [sent-76, score-0.407]
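A direct transcription of these four steps into Python, using scikit-learn's 1-NN classifier; the number of splits s and the patience T are free parameters, and this is a sketch rather than the authors' code:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def multiedit(X, y, s=3, T=5, seed=0):
    rng = np.random.RandomState(seed)
    quiet = 0                                                 # iterations without any discard
    while quiet < T:
        splits = np.array_split(rng.permutation(len(y)), s)   # step 1: S_1 .. S_s
        keep = np.ones(len(y), dtype=bool)
        for i, Si in enumerate(splits):
            Sj = splits[(i + 1) % s]                          # classifier f_{(i+1) mod s}
            f = KNeighborsClassifier(n_neighbors=1).fit(X[Sj], y[Sj])
            keep[Si] = f.predict(X[Si]) == y[Si]              # step 2: discard misclassified
        quiet = quiet + 1 if keep.all() else 0                # step 3: stop once stable for T rounds
        X, y = X[keep], y[keep]
    return X, y                                               # step 4: the multi-edited training set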
45 By discarding examples located on the wrong side of the Bayes decision boundary, algorithm MULTIEDIT constructs a new training set whose apparent distribution has the same Bayes decision boundary as the original problem, but with Bayes risk equal to 0. [sent-77, score-0.984]
46 Devijver and Kittler claim that MULTIEDIT produces an ideal training set for CONDENSE. [sent-78, score-0.21]
47 Algorithm MULTIEDIT also discards some proportion of training examples located on the correct side of the Bayes decision boundary. [sent-79, score-0.631]
48 4 Editing algorithms and SVMs Training examples recognized with high confidence usually do not appear in the SVM solution (2) because they do not become support vectors. [sent-85, score-0.278]
49 On the other hand, outliers always become support vectors. [sent-86, score-0.06]
50 The mathematical proofs for the asymptotic properties of MULTIEDIT depend on the specific nature of the 1-NN classification rule. [sent-88, score-0.06]
51 Editing SVM training sets implicitly modifies the SVM loss function in a way that relates to robust statistics. [sent-92, score-0.268]
52 Editing alters the apparent distribution of training examples such that the class distributions P (x | y = 1) and P (x | y = −1) no longer overlap. [sent-93, score-0.455]
53 If the class distributions were known, this could be done by trimming the tails of the class distributions. [sent-94, score-0.068]
54 A similar effect could be obtained by altering the SVM loss function (the hinge loss) into a non-convex loss function that gives less weight to outliers. [sent-95, score-0.148]
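For intuition only (this construction is ours, not the paper's): the hinge loss grows without bound as the margin becomes more negative, whereas a truncated, non-convex "ramp" version caps the contribution of outliers.

import numpy as np

def hinge(margin):
    return np.maximum(0.0, 1.0 - margin)

def ramp(margin, s=-1.0):
    # truncated hinge: margins below s all receive the same bounded penalty
    return np.minimum(hinge(margin), 1.0 - s)

m = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])
print(hinge(m))   # [6.  2.  1.  0.5 0. ]
print(ramp(m))    # [2.  2.  1.  0.5 0. ]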
55 3 Cross-Training Cross-Training is a representative algorithm of such combinations of SVMs and editing algorithms. [sent-96, score-0.236]
56 It begins by creating s subsets of the training set with r examples each. [sent-97, score-0.464]
57 The decision functions of these SVMs are then used to discard two types of training examples: those which are confidently recognized, as in CONDENSE, and those which are misclassified, as in MULTIEDIT. [sent-99, score-0.449]
58 1 Create s subsets of size r by randomly drawing r/2 examples of each class. [sent-102, score-0.239]
59 2 Train SVMs f1, . . . , fs using each of the subsets as the training set. [sent-106, score-0.26]
60 3 For each training example (xi , yi ), estimate the margin average mi and variance vi: mi = (1/s) Σr yi fr (xi ) and vi = (1/s) Σr (mi − yi fr (xi ))², where the sums run over r = 1, . . . , s. 4 Discard all training examples for which mi + vi < 0. [sent-107, score-1.309]
61 5 Discard all training examples for which mi − vi > 1. [sent-108, score-0.479]
62 6 Train a final SVM on the remaining training examples. [sent-109, score-0.251]
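The six steps can be condensed into the following sketch, again assuming scikit-learn; the kernel, C, gamma and the class-balancing refinements discussed later are deliberately left as placeholders, and labels are assumed to lie in {-1, +1}:

import numpy as np
from sklearn.svm import SVC

def cross_train(X, y, s=5, r=2000, C=1.0, gamma=0.5, seed=0):
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(s):                                    # steps 1-2: s SVMs on balanced subsets
        idx = np.hstack([rng.choice(np.where(y == c)[0], r // 2, replace=False)
                         for c in (-1, 1)])
        models.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], y[idx]))
    F = np.column_stack([m.decision_function(X) for m in models])
    margins = y[:, None] * F                              # y_i f_r(x_i) for r = 1..s
    mi, vi = margins.mean(axis=1), margins.var(axis=1)    # step 3
    keep = (mi + vi >= 0) & (mi - vi <= 1)                # steps 4-5: drop both tails
    return SVC(kernel="rbf", C=C, gamma=gamma).fit(X[keep], y[keep])   # step 6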
63 The apparent simplicity of this algorithm hides a lot of hyperparameters. [sent-110, score-0.055]
64 For the first stage SVMs, we choose the C parameter which yields the best performance on training sets of size r. [sent-112, score-0.289]
65 For the second stage SVMs, we choose the C parameter which yields the best overall performance measured on a separate validation set. [sent-113, score-0.082]
66 Furthermore, we discovered that the discarding steps tend to produce a final set of training examples with very different numbers of examples for each class. [sent-114, score-0.635]
67 Specific measures to alleviate this problem are discussed in section 4. [sent-115, score-0.045]
68 Further comfort comes from the knowledge that an SVM with the RBF kernel and without threshold term b implements the 1-NN rule when the RBF radius tends to zero. [sent-117, score-0.104]
69 Figure 3: Comparing LIBSVM and Cross-Training on a toy problem of two Gaussian clouds for increasing number of training points. [sent-126, score-0.334]
70 Cross-Training gives an almost constant number of support vectors (left figure) for increasing training set size, whereas in LIBSVM the number of support vectors increases linearly. [sent-127, score-0.369]
71 The error rates behave similarly (middle figure), and Cross-Training gives an improved training time (right figure). [sent-128, score-0.21]
72 1 Artificial Data We first constructed artificial data, by generating two classes from two Gaussian clouds in 10 dimensions with means (1, 1, 1, 1, 1, 0, 0, 0, 0, 0) and (−1, −1, −1, −1, −1, 0, 0, 0, 0, 0) and standard deviation 4. [sent-132, score-0.058]
73 We trained a linear SVM for differing numbers of training points, selecting C via cross-validation. [sent-133, score-0.25]
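The setup just described can be reproduced along the following lines; the C grid and the use of scikit-learn's GridSearchCV are our own choices and only stand in for the cross-validation mentioned above:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
mean = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

def make_data(n):
    # two Gaussian clouds with means +/- mean and standard deviation 4
    X = np.vstack([rng.randn(n // 2, 10) * 4 + mean, rng.randn(n // 2, 10) * 4 - mean])
    y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])
    return X, y

X, y = make_data(4000)
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=3).fit(X, y)
print(search.best_params_, search.best_estimator_.support_.size)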
74 The results given in figure 3 show a reduction in SVs and computation time using Cross-Training, with no loss in accuracy. [sent-135, score-0.058]
75 There are 11982 training examples and 1984 testing examples. [sent-139, score-0.366]
76 All experiments were carried out using LIBSVM’s ν-SVM (Chang and Lin, 2001) with the RBF kernel (γ = 0. [sent-140, score-0.106]
77 Cross-Training was carried out by splitting the 11982 training examples into 5 subsets. [sent-142, score-0.403]
78 Figure 4 reports our results for various amounts of label noise. [sent-143, score-0.082]
79 Since our label noise is artificial, we can also measure the misclassification rate on the unmodified testing set (right figure). [sent-146, score-0.077]
80 This measurement shows a slight loss of accuracy without statistical significance. [sent-147, score-0.09]
81 Experimental results were quite disappointing until we realized that the discarding steps tend to produce training sets with very different numbers of examples for each class. [sent-150, score-0.479]
82 To alleviate this problem, after training each SVM, we choose the value of the threshold b. [sent-151, score-0.255]
83 Figure 4: Number of SVs (left figure) and test error (middle figure) for varying amounts of label noise on the MNIST 3-8 discrimination task. [sent-174, score-0.117]
84 The x-axis in all graphs shows the amount of label noise; white squares correspond to LIBSVM; black circles to Cross-Training; dashed lines to bagging the first stage Cross-Training SVMs. [sent-175, score-0.125]
85 We also attempt to balance the final training set by re-inserting examples discarded during step [5] of the cross-training algorithm. [sent-179, score-0.413]
86 Experiments were carried out using RBF kernels with the kernel width reported in the literature. [sent-180, score-0.106]
87 In the SVM experiments, the value of parameter C was determined by cross-validation and then used for training an SVM on the full dataset. [sent-181, score-0.21]
88 In the cross-training experiments, we make a validation set by taking r/3 examples from the training set. [sent-182, score-0.402]
89 These examples are only used for choosing the values of C and for adjusting the SVM thresholds. [sent-183, score-0.156]
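The extract does not spell out how the thresholds are adjusted, so the following is only one plausible reading: pick the offset b that maximizes accuracy of the shifted decision function on the held-out validation examples.

import numpy as np

def adjust_threshold(scores_val, y_val):
    # scores_val: SVM decision values f(x) on the validation set; y_val in {-1, +1}
    candidates = np.sort(scores_val)
    accs = [(np.where(scores_val > b, 1, -1) == y_val).mean() for b in candidates]
    return candidates[int(np.argmax(accs))]

print(adjust_threshold(np.array([-1.2, -0.3, 0.4, 0.9]), np.array([-1, -1, 1, 1])))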
90 The columns in table 1 contain the dataset name, the size of the training set used for the experiment, the size of the test set, the SVM accuracy and number of SVs, the Cross-Training subset configuration, accuracy, and final number of SVs. [sent-203, score-0.308]
91 These numbers should be considered carefully because they are impacted by the discrete nature of the grid search for parameter C. [sent-205, score-0.075]
92 The general trend still indicates that Cross-Training causes a slight loss of accuracy but requires far fewer SVs. [sent-206, score-0.09]
93 Cross-Training with hyperparameter grid searches runs in two days. [sent-211, score-0.082]
94 Timing results would then depend on loosely controlled details of the hyperparameter grid search algorithms. [sent-217, score-0.082]
95 5 Discussion We have suggested to combine SVMs and training set editing techniques to break the linear relationship between number of support vectors and number of examples. [sent-218, score-0.475]
96 Such combinations raise interesting theoretical questions regarding the relative value of each of the training examples. [sent-219, score-0.21]
97 Experiments with a representative algorithm, namely Cross-Training, confirm that both the training and the recognition time are sharply reduced. [sent-220, score-0.288]
98 On the other hand, Cross-Training causes a minor loss of accuracy, comparable to that of reduced set methods (Burges, 1996), and seems to be more sensitive than SVMs in terms of parameter tuning. [sent-221, score-0.058]
99 Despite these drawbacks, Cross-Training provides a practical means to construct kernel classifiers with significantly larger training sets. [sent-222, score-0.279]
100 Asymptotic properties of the nearest neighbor rules using edited data. [sent-295, score-0.212]
wordName wordTfidf (topN-words)
[('svs', 0.348), ('svm', 0.295), ('multiedit', 0.269), ('libsvm', 0.226), ('training', 0.21), ('svms', 0.208), ('editing', 0.205), ('devijver', 0.168), ('kittler', 0.168), ('examples', 0.156), ('burges', 0.156), ('discard', 0.124), ('steinwart', 0.124), ('bayes', 0.118), ('asymptotically', 0.116), ('decision', 0.115), ('risk', 0.107), ('bouncing', 0.101), ('xtrain', 0.101), ('gure', 0.094), ('yi', 0.091), ('edited', 0.088), ('misclassi', 0.085), ('discarding', 0.079), ('kernel', 0.069), ('condense', 0.067), ('condensed', 0.067), ('crosstraining', 0.067), ('rbf', 0.066), ('neighbor', 0.065), ('separable', 0.065), ('located', 0.065), ('forest', 0.064), ('mi', 0.063), ('recognized', 0.062), ('support', 0.06), ('asymptotic', 0.06), ('vapnik', 0.059), ('nearest', 0.059), ('clouds', 0.058), ('loss', 0.058), ('train', 0.057), ('apparent', 0.055), ('chang', 0.054), ('adult', 0.053), ('discards', 0.053), ('xi', 0.05), ('facts', 0.05), ('nec', 0.05), ('wilson', 0.05), ('wrong', 0.05), ('subsets', 0.05), ('vi', 0.05), ('creating', 0.048), ('discarded', 0.047), ('joachims', 0.047), ('sharply', 0.047), ('lin', 0.046), ('stage', 0.046), ('nal', 0.045), ('alleviate', 0.045), ('fr', 0.045), ('labs', 0.045), ('selectively', 0.045), ('label', 0.042), ('hyperparameter', 0.041), ('america', 0.041), ('bk', 0.041), ('princeton', 0.041), ('grid', 0.041), ('remaining', 0.041), ('amounts', 0.04), ('classi', 0.039), ('increases', 0.039), ('sv', 0.038), ('timing', 0.038), ('amount', 0.037), ('machines', 0.037), ('breaks', 0.037), ('carried', 0.037), ('validation', 0.036), ('middle', 0.036), ('cybernetics', 0.036), ('achievable', 0.036), ('rule', 0.035), ('noise', 0.035), ('numbers', 0.034), ('memory', 0.034), ('class', 0.034), ('arti', 0.033), ('size', 0.033), ('nj', 0.032), ('side', 0.032), ('accuracy', 0.032), ('benchmark', 0.032), ('non', 0.032), ('cial', 0.031), ('representative', 0.031), ('margin', 0.031), ('dependency', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
2 0.20121174 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik
Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1
3 0.18484154 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.18409534 92 nips-2004-Kernel Methods for Implicit Surface Modeling
Author: Joachim Giesen, Simon Spalinger, Bernhard Schölkopf
Abstract: We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes. 1
5 0.15749723 68 nips-2004-Face Detection --- Efficient and Rank Deficient
Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1
6 0.14291835 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.12382298 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.12355797 49 nips-2004-Density Level Detection is Classification
9 0.12302044 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
10 0.115391 19 nips-2004-An Application of Boosting to Graph Classification
11 0.11046686 115 nips-2004-Maximum Margin Clustering
12 0.10464341 82 nips-2004-Incremental Algorithms for Hierarchical Classification
13 0.10322272 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
14 0.10307293 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
15 0.090278283 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
16 0.083342731 168 nips-2004-Semigroup Kernels on Finite Sets
17 0.082741678 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
18 0.082593665 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
19 0.078799866 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
20 0.078045994 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
topicId topicWeight
[(0, -0.253), (1, 0.123), (2, -0.057), (3, 0.185), (4, -0.008), (5, 0.102), (6, 0.152), (7, -0.161), (8, 0.022), (9, -0.051), (10, -0.009), (11, 0.065), (12, -0.112), (13, -0.012), (14, -0.082), (15, -0.147), (16, -0.09), (17, 0.114), (18, -0.036), (19, 0.119), (20, -0.104), (21, 0.014), (22, -0.036), (23, -0.021), (24, -0.001), (25, -0.089), (26, -0.052), (27, 0.06), (28, 0.091), (29, -0.111), (30, 0.117), (31, 0.032), (32, 0.058), (33, -0.107), (34, -0.08), (35, -0.003), (36, -0.001), (37, 0.087), (38, -0.094), (39, -0.031), (40, -0.059), (41, -0.078), (42, -0.094), (43, -0.049), (44, -0.051), (45, -0.014), (46, 0.01), (47, -0.051), (48, -0.085), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.97027421 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
2 0.83426654 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
Author: Hans P. Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, Vladimir Vapnik
Abstract: We describe an algorithm for support vector machines (SVM) that can be parallelized efficiently and scales to very large problems with hundreds of thousands of training vectors. Instead of analyzing the whole training set in one optimization step, the data are split into subsets and optimized separately with multiple SVMs. The partial results are combined and filtered again in a ‘Cascade’ of SVMs, until the global optimum is reached. The Cascade SVM can be spread over multiple processors with minimal communication overhead and requires far less memory, since the kernel matrices are much smaller than for a regular SVM. Convergence to the global optimum is guaranteed with multiple passes through the Cascade, but already a single pass provides good generalization. A single pass is 5x – 10x faster than a regular SVM for problems of 100,000 vectors when implemented on a single processor. Parallel implementations on a cluster of 16 processors were tested with over 1 million vectors (2-class problems), converging in a day or two, while a regular SVM never converged in over a week. 1
3 0.62837392 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
4 0.60821044 68 nips-2004-Face Detection --- Efficient and Rank Deficient
Author: Wolf Kienzle, Matthias O. Franz, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: This paper proposes a method for computing fast approximations to support vector decision functions in the field of object detection. In the present approach we are building on an existing algorithm where the set of support vectors is replaced by a smaller, so-called reduced set of synthesized input space points. In contrast to the existing method that finds the reduced set via unconstrained optimization, we impose a structural constraint on the synthetic points such that the resulting approximations can be evaluated via separable filters. For applications that require scanning large images, this decreases the computational complexity by a significant amount. Experimental results show that in face detection, rank deficient approximations are 4 to 6 times faster than unconstrained reduced set systems. 1
5 0.5995062 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
6 0.59738344 92 nips-2004-Kernel Methods for Implicit Surface Modeling
7 0.58547938 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
8 0.55310947 82 nips-2004-Incremental Algorithms for Hierarchical Classification
9 0.54566014 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
10 0.54363906 69 nips-2004-Fast Rates to Bayes for Kernel Machines
11 0.52402657 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
12 0.50518948 19 nips-2004-An Application of Boosting to Graph Classification
13 0.49826542 115 nips-2004-Maximum Margin Clustering
14 0.49395362 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
15 0.4830018 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound
16 0.46441036 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
17 0.45582658 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data
18 0.41674635 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
19 0.41341469 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
20 0.4132871 178 nips-2004-Support Vector Classification with Input Data Uncertainty
topicId topicWeight
[(13, 0.061), (15, 0.104), (26, 0.539), (31, 0.016), (33, 0.127), (35, 0.015), (39, 0.017), (50, 0.025), (51, 0.011)]
simIndex simValue paperId paperTitle
1 0.95816576 185 nips-2004-The Convergence of Contrastive Divergences
Author: Alan L. Yuille
Abstract: This paper analyses the Contrastive Divergence algorithm for learning statistical parameters. We relate the algorithm to the stochastic approximation literature. This enables us to specify conditions under which the algorithm is guaranteed to converge to the optimal solution (with probability 1). This includes necessary and sufficient conditions for the solution to be unbiased.
same-paper 2 0.9411602 34 nips-2004-Breaking SVM Complexity with Cross-Training
Author: Léon Bottou, Jason Weston, Gökhan H. Bakir
Abstract: We propose to selectively remove examples from the training set using probabilistic estimates related to editing algorithms (Devijver and Kittler, 1982). This heuristic procedure aims at creating a separable distribution of training examples with minimal impact on the position of the decision boundary. It breaks the linear dependency between the number of SVs and the number of training examples, and sharply reduces the complexity of SVMs during both the training and prediction stages. 1
3 0.93108243 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
Author: Evan C. Smith, Michael S. Lewicki
Abstract: The representation of acoustic signals at the cochlear nerve must serve a wide range of auditory tasks that require exquisite sensitivity in both time and frequency. Lewicki (2002) demonstrated that many of the filtering properties of the cochlea could be explained in terms of efficient coding of natural sounds. This model, however, did not account for properties such as phase-locking or how sound could be encoded in terms of action potentials. Here, we extend this theoretical approach with algorithm for learning efficient auditory codes using a spiking population code. Here, we propose an algorithm for learning efficient auditory codes using a theoretical model for coding sound in terms of spikes. In this model, each spike encodes the precise time position and magnitude of a localized, time varying kernel function. By adapting the kernel functions to the statistics natural sounds, we show that, compared to conventional signal representations, the spike code achieves far greater coding efficiency. Furthermore, the inferred kernels show both striking similarities to measured cochlear filters and a similar bandwidth versus frequency dependence. 1
4 0.90454334 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice
Author: Maria-florina Balcan, Avrim Blum, Ke Yang
Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1
5 0.72140956 190 nips-2004-The Rescorla-Wagner Algorithm and Maximum Likelihood Estimation of Causal Parameters
Author: Alan L. Yuille
Abstract: This paper analyzes generalization of the classic Rescorla-Wagner (RW) learning algorithm and studies their relationship to Maximum Likelihood estimation of causal parameters. We prove that the parameters of two popular causal models, ∆P and P C, can be learnt by the same generalized linear Rescorla-Wagner (GLRW) algorithm provided genericity conditions apply. We characterize the fixed points of these GLRW algorithms and calculate the fluctuations about them, assuming that the input is a set of i.i.d. samples from a fixed (unknown) distribution. We describe how to determine convergence conditions and calculate convergence rates for the GLRW algorithms under these conditions.
6 0.5980221 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.59734249 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
8 0.59267288 48 nips-2004-Convergence and No-Regret in Multiagent Learning
9 0.58537263 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.58383501 144 nips-2004-Parallel Support Vector Machines: The Cascade SVM
11 0.58079022 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
12 0.56868756 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.55730617 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
14 0.55659294 103 nips-2004-Limits of Spectral Clustering
15 0.55623484 194 nips-2004-Theory of localized synfire chain: characteristic propagation speed of stable spike pattern
16 0.55078524 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting
17 0.54404825 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve
18 0.53667307 28 nips-2004-Bayesian inference in spiking neurons
19 0.53614426 153 nips-2004-Reducing Spike Train Variability: A Computational Theory Of Spike-Timing Dependent Plasticity
20 0.53249341 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill