nips nips2001 nips2001-16 knowledge-graph by maker-knowledge-mining

16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems


Source: pdf

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database yielded significant time improvement (time complexity appears empirically to grow locally linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Samy Bengio, IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland, bengio@idiap. [sent-4, score-0.068]

2 Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. [sent-8, score-0.339]

3 Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. [sent-9, score-0.175]

4 The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. [sent-10, score-0.403]

5 Experiments on a large benchmark dataset (Forest) as well as a difficult speech database yielded significant time improvement (time complexity appears empirically to grow locally linearly with the number of examples). [sent-11, score-0.376]

6 In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. [sent-12, score-0.026]

7 1 Introduction. Recently, a lot of work has been done on Support Vector Machines [9], mainly due to their impressive generalization performance on classification problems when compared to other algorithms such as artificial neural networks [3, 6]. [sent-13, score-0.119]

8 However, SVMs require solving a quadratic optimization problem which needs resources that are at least quadratic in the number of training examples, and it is thus hopeless to try solving problems having millions of examples using classical SVMs. [sent-14, score-0.481]

9 In order to overcome this drawback, we propose in this paper to use a mixture of several SVMs, each of them trained only on a part of the dataset. [sent-15, score-0.36]

10 We propose here a simple method to train such a mixture, and we will show that in practice this method is much faster than training only one SVM, and leads to results that are at least as good as one SVM. [sent-20, score-0.185]

11 We conjecture that the training time complexity of the proposed approach with respect to the number of examples is sub-quadratic for large data sets. [sent-21, score-0.325]

12 Moreover, this mixture can be easily parallelized, which could again significantly improve the training time. [sent-22, score-0.412]

13 In section 3 we present our mixture of SVMs, followed in section 4 by some comparisons to related models. [sent-24, score-0.275]

14 In section 5 we show some experimental results, first on a toy dataset, then on two large real-life datasets. [sent-25, score-0.048]

15 2 Introduction to Support Vector Machines. Support Vector Machines (SVMs) [9] have been applied to many classification problems, generally yielding good performance compared to other algorithms. [sent-27, score-0.103]

16 Therefore, to train an SVM, we need to solve a quadratic optimization problem, where the number of parameters is N. [sent-42, score-0.185]
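For reference, the quadratic optimization problem in question is the standard soft-margin SVM dual over the N coefficients α_i. This is the textbook formulation, restated here only because the extraction does not reproduce the paper's own earlier equations; here C denotes the usual box constraint, unrelated to the gater cost C of equation (7) further below.

```latex
\max_{\alpha}\;\; \sum_{i=1}^{N}\alpha_i
  \;-\;\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
       \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad\text{subject to}\quad
0 \le \alpha_i \le C,\qquad \sum_{i=1}^{N}\alpha_i y_i = 0 .
```

The number of variables is N, and the kernel values K(x_i, x_j) over all pairs are what make both memory and time grow at least quadratically.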

17 This makes the use of SVMs for large datasets difficult: computing K(x_i, x_j) for every training pair would require O(N^2) computation, and solving may take up to O(N^3). [sent-43, score-0.161]

18 Note however that current state-of-the-art algorithms appear to have a training time complexity scaling much closer to O(N^2) than to O(N^3) [2]. [sent-44, score-0.194]
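To make the quadratic barrier concrete, the sketch below (an illustration, not code from the paper; the RBF kernel choice, the sigma value, and the names are assumptions) shows that merely forming the kernel matrix already costs O(N^2) time and memory, before any optimization is run.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Dense RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    Building K touches every pair of examples, so both time and memory are
    O(N^2); the quadratic program solved on top of it can cost up to O(N^3).
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

# For N = 500,000 examples, K alone would need about 2 TB in float64,
# which is why a single SVM is impractical on datasets of this size.
```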

19 3 A New Conditional Mixture of SVMs. In this section we introduce a new type of mixture of SVMs. [sent-45, score-0.275]

20 Here each expert is an SVM, and we took a neural network for the gater in our experiments. [sent-47, score-0.66]
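The extracted sentences do not include the paper's definition of the mixture output f that appears in the cost below, so the standard conditional-mixture form is assumed here for readability: each expert output s_m(x) is weighted by the corresponding gater output w_m(x), the same weights w_m(x_i) used later in the reassignment step.

```latex
% Assumed gated-mixture output (standard conditional-mixture form,
% not quoted from the paper): expert outputs s_m weighted by the gater.
f(x) \;=\; \sum_{m=1}^{M} w_m(x)\, s_m(x)
```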

21 In the proposed model, the gater is trained to minimize the cost function $C = \sum_{i=1}^{N} \left[ f(x_i) - y_i \right]^2$ (7). [sent-48, score-0.546]

22 To train this model, we propose a very simple algorithm: [sent-49, score-0.091]

23 1. Divide the training set into M random subsets of size near N/M. [sent-50, score-0.178]

24 2. Train each expert separately on one of these subsets. [sent-52, score-0.223]

25 3. Keeping the experts fixed, train the gater to minimize (7) on the whole training set. [sent-54, score-1.073]

26 4. Reconstruct M subsets: for each example (x_i, y_i), • sort the experts in descending order according to the gater values w_m(x_i), • assign the example to the first expert in the list which has fewer than (N/M + c) examples*, in order to ensure a balance between the experts. [sent-56, score-0.496]

27 5. If a termination criterion is not fulfilled (such as a given number of iterations or the validation error going up), go back to step 2 (a minimal sketch of this loop is given below). [sent-58, score-0.224]
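A minimal sketch of steps 1-5 above, for illustration only: `train_svm`, `train_gater`, `expert_weights`, and the balance constant `c` are hypothetical placeholders rather than the paper's implementation, and the reassignment follows the sort-by-gater-weight rule of step 4.

```python
import random

def train_gated_svm_mixture(X, Y, M, n_iters, c=10):
    """Sketch of the outer training loop (steps 1-5); train_svm and
    train_gater stand in for an SVM trainer and a squared-error gater
    (MLP) trainer and are not defined here."""
    N = len(X)
    idx = list(range(N))
    random.shuffle(idx)
    subsets = [idx[m::M] for m in range(M)]   # step 1: M subsets of size ~N/M
    capacity = N // M + c                     # balance constraint per expert

    for _ in range(n_iters):
        # Step 2: train each expert on its own subset (parallelizable).
        experts = [train_svm([X[i] for i in s], [Y[i] for i in s]) for s in subsets]
        # Step 3: keeping the experts fixed, train the gater on the whole
        # training set to minimize the squared error of equation (7).
        gater = train_gater(X, Y, experts)
        # Step 4: reassign each example to the highest-ranked expert
        # (by gater weight w_m(x_i)) that still has room.
        subsets = [[] for _ in range(M)]
        for i in range(N):
            weights = gater.expert_weights(X[i])        # w_1(x_i) .. w_M(x_i)
            for m in sorted(range(M), key=lambda m: -weights[m]):
                if len(subsets[m]) < capacity:
                    subsets[m].append(i)
                    break
        # Step 5: a real implementation would stop here when a validation
        # error stops improving instead of always running n_iters passes.
    return experts, gater
```

In the paper's experiments the expert trainer is SVMTorch and the gater is an MLP; they are left abstract here so that the control flow of the loop stays visible.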

28 Note that step 2 of this algorithm can be easily implemented in parallel as each expert can be trained separately on a different computer. [sent-59, score-0.31]

29 Note also that step 3 can be an approximate minimization (as usually done when training neural networks). [sent-60, score-0.137]

30 Hence such an algorithm is quite demanding in terms of resources when the dataset is large, if training time scales like O(N^p) with p > 1. [sent-62, score-0.335]

31 In the more recent Support Vector Mixture model [5], the author shows how to replace the experts (typically neural networks) by SVMs and gives a learning algorithm for this model. [sent-63, score-0.32]

32 Once again the resulting mixture is trained jointly on the whole dataset, and hence does not overcome the quadratic barrier when the dataset is large. [sent-64, score-0.658]

33 Here, the algorithm does indeed train the experts separately on small datasets, as does the present algorithm, but there is no notion of a loop reassigning the examples to experts according to the gater's prediction of how well each expert performs on each example. [sent-66, score-1.575]

34 Our experiments suggest that this element is essential to the success of the algorithm. [sent-67, score-0.03]

35 Finally, the Bayesian Committee Machine [8] is a technique to partition the data into several subsets, train SVMs on the individual subsets and then use a specific combination scheme based on the covariance of the test data to combine the predictions. [sent-68, score-0.166]

36 This method scales linearly in the number of training examples. (*Here c is a small positive constant.) [sent-69, score-0.08]

37 As in the previous case, this algorithm assigns the examples randomly to the experts (although the Bayesian framework would in principle allow finding better assignments). [sent-76, score-0.4]

38 Regarding our proposed mixture of SVMs, if the number of experts grows with the number of examples, and the number of outer loop iterations is a constant, then the total training time of the experts scales linearly with the number of examples. [sent-77, score-1.445]
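The scaling claim can be made explicit with a short back-of-the-envelope calculation; the per-expert cost exponent p and the fixed subset size n_0 below are generic assumptions, not values taken from the paper.

```latex
% Per outer-loop iteration, M experts each train on about N/M examples
% with a trainer that costs O(n^p), p > 1 (p close to 2 for SVMs):
T_{\text{experts}} \;\propto\; M \left(\frac{N}{M}\right)^{p} \;=\; \frac{N^{p}}{M^{\,p-1}} .
% If M grows linearly with N, i.e. M = N / n_0 for a fixed subset size n_0,
% this becomes N \, n_0^{\,p-1}: linear in N per iteration, hence O(N) in
% total for a constant number k of outer-loop iterations.
```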

39 The actual total training time should however also include k times the training time of the gater, which may potentially grow more rapidly than O(N). [sent-80, score-0.413]

40 However, this did not appear to be the case in our experiments, thus yielding an apparently linear training time. [sent-81, score-0.168]

41 Future work will focus on methods to reduce the gater training time and guarantee linear training time per outer loop iteration. [sent-82, score-0.941]

42 5 Experiments. In this section, we present three sets of experiments comparing the new mixture of SVMs to other machine learning algorithms. [sent-83, score-0.305]

43 Note that all the SVMs in these experiments have been trained using SVMTorch [2]. [sent-84, score-0.092]

44 5.1 A Toy Problem. In the first series of experiments, we tested the mixture on an artificial toy problem for which we generated 10,000 training examples and 10,000 test examples. [sent-86, score-0.628]

45 In Figure 1 we show the decision surfaces obtained first by a linear SVM, then by a Gaussian SVM, and finally by the proposed mixture of SVMs. [sent-88, score-0.304]

46 Moreover, in the latter, the gater was a simple linear function and there were two linear SVMs in the mixture. [sent-89, score-0.759]

47 This artificial problem thus shows clearly that the algorithm seems to work, and is able to combine, even linearly, very simple models in order to produce a non-linear decision surface. [sent-90, score-0.048]

48 5.2 A Large-Scale Realistic Problem: Forest. For a more realistic problem, we ran a series of experiments on part of the UCI Forest dataset. [sent-92, score-0.06]

49 We modified the 7-class classification problem into a binary classification problem where the goal was to separate class 2 from the other 6 classes. [sent-93, score-0.144]

50 Each example was described by 54 input features, each normalized by dividing by the maximum found on the training set. [sent-94, score-0.163]
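A minimal sketch of this per-feature normalization, assuming nonnegative features stored row-wise and reusing the training-set maxima for held-out data; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def normalize_by_train_max(X_train, X_other):
    """Divide each of the 54 features by its maximum over the training set.

    The same training-set maxima are reused for the validation/test data
    so that no information leaks from those sets.
    """
    col_max = X_train.max(axis=0).astype(float)
    col_max[col_max == 0.0] = 1.0   # guard against constant-zero features (assumption)
    return X_train / col_max, X_other / col_max
```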

51 The dataset had more than 500,000 examples and this allowed us to prepare a series of experiments as follows: • We kept a separate test set of 50,000 examples to compare the best mixture of SVMs to other learning algorithms. [sent-95, score-0.635]

52 • We used a validation set of 10,000 examples to select the best mixture of SVMs, varying the number of experts and the number of hidden units in the gater. [sent-96, score-1.094]

53 • We trained our models on different training sets, using from 100,000 to 400,000 examples. [sent-97, score-0.199]

54 • The mixtures had from 10 to 50 expert SVMs with Gaussian kernel and the gater was an MLP with between 25 and 500 hidden units. [sent-98, score-0.843]

55 The Forest dataset is available on the UCI website at the following address: ftp://ftp. [sent-100, score-0.106]

56 Figure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a Gaussian SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy problem. [sent-105, score-0.424]

57 Note that since the number of examples was quite large, we selected the internal training parameters, such as the σ of the Gaussian kernel of the SVMs or the learning rate of the gater, using a held-out portion of the training set. [sent-106, score-0.914]

58 Table 1 gives the results of a first series of experiments with a fixed training set of 100,000 examples. [sent-108, score-0.197]

59 To select among the variants of the gated SVM mixture we considered performance over the validation set as well as training time. [sent-109, score-0.661]

60 The selected model had 50 experts and a gater with 150 hidden units. [sent-112, score-0.939]

61-62 A model with 500 hidden units would have given a performance of 8.1% over the test set, but would have taken 621 minutes on one machine (and 388 minutes on 50 machines). [sent-113, score-0.239] [sent-114, score-0.138]

63 Table 1 reports train error, test error, and time in minutes (on 1 CPU and on 50 CPUs) for one MLP, one SVM, a uniform SVM mixture, and the gated SVM mixture. [sent-115, score-0.671]

64 Table 1: Comparison of performance between an MLP (100 hidden units), a single SVM, a uniform SVM mixture where the gater always outputs the same value for each expert, and finally the mixture of SVMs proposed in this paper. [sent-123, score-1.17]

65 As can be seen, the gated SVM mixture outperformed all models in terms of training and test error. [sent-124, score-0.292]

66 Note that the training error of the single SVM is high because its hyper-parameters were selected to minimize the error on the validation set (other values could yield much lower training error but larger test error). [sent-125, score-0.583]

67 It was also much faster than the single SVM, even on one machine, and since the mixture can easily be parallelized (each expert can be trained separately), we also reported the time it took to train it on 50 machines. [sent-126, score-0.559]

68 We also did a series of experiments in order to see the influence of the number of hidden units of the gater as well as the number of experts in the mixture. [sent-130, score-1.155]

69 Figure 2 shows the validation error of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the number of experts varied from 10 to 50. [sent-131, score-0.863]

70 There is a clear performance improvement when the number of hidden units is increased, while the improvement with additional experts exists but is not as strong. [sent-132, score-0.637]

71 Note however that the training time also increases rapidly with the number of hidden units, while it decreases slightly with the number of experts if one uses one computer per expert. [sent-133, score-0.806]

72 Validation error as a function of the number of hidden units of the gater and of the number of experts. [sent-134, score-1.127]

73 Figure 2: Comparison of the validation error of different mixtures of SVMs with various numbers of hidden units and experts. [sent-135, score-1.196]

74 In order to find how the algorithm scaled with respect to the number of examples, we then compared the same mixture of experts (50 experts, 150 hidden units in the gater) on different training set sizes. [sent-136, score-0.997]

75 Table 3 shows the validation error of the mixture of SVMs trained on training sets of sizes from 100,000 to 400,000. [sent-137, score-0.634]

76 It seems that, at least in this range and for this particular dataset, the mixture of SVMs scales linearly with respect to the number of examples, and not quadratically as a classical SVM. [sent-138, score-0.487]

77 It is interesting to see for instance that the mixture of SVMs was able to solve a problem of 400,000 examples in less than 7 hours (on 50 computers) while it would have taken more than one month to solve the same problem with a single SVM. [sent-139, score-0.445]

78 Finally, figure 4 shows the evolution of the training and validation errors of a mixture of 50 SVMs gated by an MLP with 150 hidden units, during 5 iterations of the algorithm. [sent-140, score-0.809]

79 This should convince the reader that the reassignment loop of the algorithm is essential for obtaining good performance. [sent-141, score-0.057]

80 It is also clear that the empirical convergence of the outer loop is extremely rapid. [sent-142, score-0.113]

81 5.3 Verification on Another Large-Scale Problem. In order to verify that the results obtained on Forest were replicable on other large-scale problems, we tested the SVM mixture on a speech task. [sent-144, score-0.321]

82 Figure 4: Comparison of the training and validation errors of the mixture of SVMs as a function of the number of training iterations. [sent-146, score-0.703]

83 turned it into a binary classification problem where the task was to separate silence frames from non-silence frames . [sent-147, score-0.268]

84 The total number of frames was around 540,000. [sent-148, score-0.147]

85 The training set contained 100,000 randomly chosen frames out of the first 400,000 frames. [sent-149, score-0.259]

86 The disjoint validation set contained 10,000 randomly chosen frames out of the first 400,000 frames also. [sent-150, score-0.348]

87 Finally, the test set contained 50,000 randomly chosen frames out of the last 140,000 frames. [sent-151, score-0.156]

88 Note that the validation set was used here to select the number of experts in the mixture, the number of hidden units in the gater, and the σ of the SVMs. [sent-152, score-0.739]

89 Each frame was parameterized using standard methods used in speech recognition (J-RASTA coefficients with first and second temporal derivatives) and was thus described by 45 coefficients, but we in fact used an input window of three frames, yielding 135 input features per example. [sent-153, score-0.129]
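The three-frame input window (3 x 45 = 135 features) can be assembled as sketched below; the frame-matrix layout and the edge padding are assumptions made for illustration, since the extraction does not specify how boundaries were handled.

```python
import numpy as np

def stack_context(frames, left=1, right=1):
    """Stack each 45-dim frame with its left/right neighbours, giving
    (left + 1 + right) * 45 = 135 features per frame for left = right = 1.

    frames: array of shape (T, 45); edge frames are padded by repeating
    the first/last frame (an assumption, not stated in the paper).
    """
    padded = np.vstack([frames[:1]] * left + [frames] + [frames[-1:]] * right)
    return np.hstack([padded[i:i + len(frames)] for i in range(left + right + 1)])
```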

90 Table 2 shows a comparison between a single SVM and a mixture of SVMs on this dataset. [sent-154, score-0.333]

91 The number of experts in the mixture was set to 50, the number of hidden units of the gater was set to 50, and the σ of the SVMs was set to 3. [sent-155, score-1.37]

92 As can be seen, the mixture of SVMs was again many times faster than the single SVM (even on 1 CPU only) but yielded similar generalization performance. [sent-157, score-0.425]

93 Table 2: Comparison of performance (error and time in minutes on 1 CPU and 50 CPUs) between a single SVM and a mixture of SVMs on the speech dataset. [sent-162, score-0.347]

94 6 Conclusion. In this paper we have presented a new algorithm to train a mixture of SVMs that gave very good results compared to classical SVMs, either in terms of training time or generalization performance, on two large-scale, difficult databases. [sent-163, score-0.587]

95 Moreover, the algorithm appears to scale linearly with the number of examples, at least between 100,000 and 400,000 examples. [sent-164, score-0.095]

96 This suggests that the proposed method could allow training SVM-like models for very large, multi-million-example data sets in a reasonable time. [sent-174, score-0.137]

97 Future work will address several questions: how to guarantee linear training time for the gater as well as for the experts? [sent-176, score-0.656]

98 can better results be obtained by tuning the hyper-parameters of each expert separately? [sent-177, score-0.176]

99 Acknowledgments: RC would like to thank the Swiss NSF for financial support (project FN2100-061234). [sent-179, score-0.04]

100 Training support vector machines: an application to face detection. [sent-213, score-0.066]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gater', 0.484), ('svms', 0.403), ('experts', 0.32), ('mixture', 0.275), ('svm', 0.26), ('expert', 0.176), ('training', 0.137), ('units', 0.129), ('validation', 0.128), ('gated', 0.121), ('lue', 0.11), ('hidden', 0.11), ('mlp', 0.107), ('dataset', 0.106), ('forest', 0.101), ('frames', 0.098), ('train', 0.091), ('examples', 0.08), ('cpu', 0.077), ('lu', 0.074), ('classification', 0.072), ('montreal', 0.068), ('trained', 0.062), ('loop', 0.057), ('outer', 0.056), ('wm', 0.055), ('cp', 0.054), ('minutes', 0.052), ('collobert', 0.048), ('mixtures', 0.048), ('toy', 0.048), ('separately', 0.047), ('xi', 0.047), ('dirg', 0.046), ('exvell', 0.046), ('kr', 0.046), ('martigny', 0.046), ('parallelized', 0.046), ('quebec', 0.046), ('speech', 0.046), ('linearly', 0.046), ('machines', 0.043), ('bengio', 0.042), ('subsets', 0.041), ('whole', 0.041), ('hopeless', 0.04), ('ronan', 0.04), ('rue', 0.04), ('universite', 0.04), ('support', 0.04), ('iterations', 0.038), ('committee', 0.037), ('idiap', 0.037), ('lui', 0.037), ('quadratic', 0.036), ('time', 0.035), ('quadratically', 0.034), ('test', 0.034), ('scales', 0.034), ('svmtorch', 0.032), ('solve', 0.032), ('comparison', 0.032), ('error', 0.032), ('yielding', 0.031), ('experiments', 0.03), ('series', 0.03), ('ul', 0.029), ('surfaces', 0.029), ('canada', 0.028), ('divide', 0.028), ('du', 0.028), ('xj', 0.027), ('improvement', 0.026), ('number', 0.026), ('vector', 0.026), ('single', 0.026), ('input', 0.026), ('kernel', 0.025), ('conjecture', 0.025), ('classical', 0.025), ('parallel', 0.025), ('selected', 0.025), ('faster', 0.025), ('table', 0.025), ('difficult', 0.024), ('artificial', 0.024), ('contained', 0.024), ('datasets', 0.024), ('seems', 0.024), ('least', 0.023), ('grow', 0.023), ('overcome', 0.023), ('total', 0.023), ('problems', 0.023), ('rapidly', 0.023), ('transfer', 0.023), ('resources', 0.023), ('yielded', 0.022), ('varied', 0.022), ('complexity', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database yielded significant time improvement (time complexity appears empirically to grow locally linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

2 0.20862679 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

3 0.16529503 46 nips-2001-Categorization by Learning and Combining Object Parts

Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio

Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.

4 0.15766864 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama

Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1

5 0.14930472 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

Author: Manfred Opper, Robert Urbanczik

Abstract: Using methods of Statistical Physics, we investigate the role of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1

6 0.14153001 172 nips-2001-Speech Recognition using SVMs

7 0.13879672 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

8 0.13572589 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

9 0.1351053 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

10 0.13062367 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

11 0.11155788 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

12 0.10078795 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

13 0.091746211 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

14 0.083876006 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

15 0.083080932 58 nips-2001-Covariance Kernels from Bayesian Generative Models

16 0.080411702 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

17 0.077887334 154 nips-2001-Products of Gaussians

18 0.076440215 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.075068742 25 nips-2001-Active Learning in the Drug Discovery Process

20 0.074709855 15 nips-2001-A New Discriminative Kernel From Probabilistic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.234), (1, 0.132), (2, -0.059), (3, 0.008), (4, -0.125), (5, 0.132), (6, 0.084), (7, -0.118), (8, -0.078), (9, 0.014), (10, 0.117), (11, 0.061), (12, 0.067), (13, -0.133), (14, 0.252), (15, 0.076), (16, 0.042), (17, -0.024), (18, 0.096), (19, 0.011), (20, -0.047), (21, -0.061), (22, 0.023), (23, 0.011), (24, -0.011), (25, -0.067), (26, -0.02), (27, -0.037), (28, 0.237), (29, -0.132), (30, 0.087), (31, -0.096), (32, 0.151), (33, -0.119), (34, -0.072), (35, -0.012), (36, 0.007), (37, -0.009), (38, -0.08), (39, 0.014), (40, 0.141), (41, -0.039), (42, 0.116), (43, -0.057), (44, 0.047), (45, 0.032), (46, 0.103), (47, 0.106), (48, -0.035), (49, 0.114)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96272528 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database yielded significant time improvement (time complexity appears empirically to grow locally linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

2 0.68806601 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference

Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng

Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are .discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1

3 0.59815639 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

4 0.59721661 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

Author: Olivier Chapelle, Bernhard Schćžšlkopf

Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1

5 0.53108698 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

6 0.52230644 154 nips-2001-Products of Gaussians

7 0.52083176 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

8 0.48868701 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

9 0.44679809 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

10 0.4186078 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

11 0.41493514 172 nips-2001-Speech Recognition using SVMs

12 0.39497438 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

13 0.38203922 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data

14 0.37501618 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

15 0.36376312 25 nips-2001-Active Learning in the Drug Discovery Process

16 0.35629836 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

17 0.34347087 46 nips-2001-Categorization by Learning and Combining Object Parts

18 0.34193784 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.34147602 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

20 0.33569434 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.029), (17, 0.032), (19, 0.038), (27, 0.139), (30, 0.077), (38, 0.028), (44, 0.261), (59, 0.02), (72, 0.141), (79, 0.033), (83, 0.013), (91, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86140448 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller

Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called

same-paper 2 0.81136417 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems

Author: Ronan Collobert, Samy Bengio, Yoshua Bengio

Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1

3 0.68334323 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.

4 0.67814064 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

Author: Carlotta Domeniconi, Dimitrios Gunopulos

Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1

5 0.66410977 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

6 0.66059625 40 nips-2001-Batch Value Function Approximation via Support Vectors

7 0.65941006 1 nips-2001-(Not) Bounding the True Error

8 0.65768248 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

9 0.65242684 13 nips-2001-A Natural Policy Gradient

10 0.65218985 185 nips-2001-The Method of Quantum Clustering

11 0.65207219 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

12 0.64986998 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

13 0.64941514 155 nips-2001-Quantizing Density Estimators

14 0.64785737 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

15 0.64752179 60 nips-2001-Discriminative Direction for Kernel Classifiers

16 0.64742267 79 nips-2001-Gaussian Process Regression with Mismatched Models

17 0.64729214 8 nips-2001-A General Greedy Approximation Algorithm with Applications

18 0.64643526 128 nips-2001-Multiagent Planning with Factored MDPs

19 0.64599931 114 nips-2001-Learning from Infinite Data in Finite Time

20 0.64386111 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes