nips nips2000 nips2000-75 nips2000-75-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ralf Herbrich, Thore Graepel
Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set), also known as the Bayes point, exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample sizes because it requires O(m^2) memory and O(N·m^2) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows us to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximate) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error at a rejection rate of 10%.
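The abstract describes the Bayes point as an average over the classifiers in version space, approximated by samples obtained from the perceptron learning algorithm. The following minimal sketch (Python) illustrates that idea on a toy linear problem: train the classical perceptron on several random permutations of a linearly separable training set and average the normalised weight vectors. All function names and the synthetic data are illustrative assumptions; the paper's actual algorithm works with kernel perceptrons and kernel expansions, which this sketch omits.

# Minimal sketch of the Bayes point idea: average perceptron solutions
# obtained from random permutations of the training data. This is an
# assumed, simplified linear-space illustration, not the authors'
# kernel-space reference implementation.

import numpy as np

def perceptron(X, y, max_epochs=100):
    """Return a weight vector consistent with (X, y), or the last iterate."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # standard perceptron update
                mistakes += 1
        if mistakes == 0:                 # consistent with all training points
            break
    return w

def bayes_point(X, y, n_samples=50, rng=None):
    """Average normalised perceptron solutions over random data permutations."""
    rng = np.random.default_rng(rng)
    ws = []
    for _ in range(n_samples):
        perm = rng.permutation(len(y))
        w = perceptron(X[perm], y[perm])
        ws.append(w / np.linalg.norm(w))  # only the direction of w matters
    return np.mean(ws, axis=0)

if __name__ == "__main__":
    # Toy linearly separable problem (illustrative data, not MNIST).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])
    w_bp = bayes_point(X, y, n_samples=25, rng=1)
    acc = np.mean(np.sign(X @ w_bp) == y)
    print(f"training accuracy of averaged classifier: {acc:.3f}")

Increasing n_samples corresponds to drawing more samples from the posterior, which is the knob the abstract mentions for trading computation against the quality of the Bayes point approximation.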
[1] C. Cortes and V. Vapnik. Support Vector Networks. Machine Learning, 20:273-297, 1995.
[2] T. Graepel and R. Herbrich. The kernel Gibbs sampler. In Advances in Neural Information Processing Systems 13, 2001.
[3] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In Advances in Neural Information Processing Systems 13, 2001.
[4] R. Herbrich, T. Graepel, and C. Campbell. Robust Bayes Point Machines. In Proceedings of ESANN 2000, pages 49-54, 2000.
[5] D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230-234, Madison, Wisconsin, 1998.
[6] R. M. Neal. Markov chain Monte Carlo method based on 'slicing' the density function. Technical Report TR-9722, Department of Statistics, University of Toronto, 1997.
[7] A. Novikoff. On convergence proofs for perceptrons. In Report at the Symposium on Mathematical Theory of Automata, pages 24-26, Polytechnic Institute of Brooklyn, 1962.
[8] M. Opper and O. Winther. Gaussian processes for classification: Mean field algorithms. Neural Computation, 12(11), 2000.
[9] J. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61-74. MIT Press, 2000.
[10] P. Ruján and M. Marchand. Computing the Bayes kernel classifier. In Advances in Large Margin Classifiers, pages 329-348. MIT Press, 2000.
[11] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.
[12] A. J. Smola. Learning with Kernels. PhD thesis, Technische Universität Berlin, 1998.
[13] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[14] T. Watkin. Optimal learning with a neural network. Europhysics Letters, 21:871-877, 1993.
[15] C. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. Technical Report NCRG/97/012, Neural Computing Research Group, Aston University, 1997.