
160 nips-2007-Random Features for Large-Scale Kernel Machines


Source: pdf

Author: Ali Rahimi, Benjamin Recht

Abstract: To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks, linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
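To make the construction concrete, below is a minimal sketch, not the authors' released code, of the random Fourier feature idea the abstract describes: frequencies are sampled from the Fourier transform of a shift-invariant kernel so that inner products of the mapped data approximate the kernel. The Gaussian kernel, the bandwidth gamma, and the feature count D are illustrative assumptions.

```python
# Minimal sketch of random Fourier features for a Gaussian (RBF) kernel.
# Assumed choices for illustration: gamma (bandwidth), D (number of features).
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Exact kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def random_fourier_features(X, W, b, D):
    """z(x) = sqrt(2/D) * cos(W^T x + b), so that z(x) @ z(y) ~= k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
n, d, D, gamma = 200, 5, 4000, 0.5
X = rng.normal(size=(n, d))

# For exp(-gamma * ||x - y||^2), the frequencies are drawn from N(0, 2*gamma*I);
# the phases are uniform on [0, 2*pi].
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Z = random_fourier_features(X, W, b, D)
K_exact = gaussian_kernel(X, X, gamma)
K_approx = Z @ Z.T
print("max abs error:", np.abs(K_exact - K_approx).max())  # shrinks as D grows
```

Any fast linear learner (e.g., a linear SVM or ridge regression) can then be trained directly on Z in place of the original kernel machine.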


reference text

[1] T. Joachims. Training linear SVMs in linear time. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[2] M. C. Ferris and T. S. Munson. Interior-point methods for massive Support Vector Machines. SIAM Journal on Optimization, 13(3):783–804, 2003.

[3] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In International Conference on Machine Learning (ICML), 2007.

[4] D. DeCoste and D. Mazzoni. Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. In International Conference on Machine Learning (ICML), 2003.

[5] J. Platt. Using sparseness and analytic QP to speed training of Support Vector Machines. In Advances in Neural Information Processing Systems (NIPS), 1999.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] D. Achlioptas, F. McSherry, and B. Schölkopf. Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems (NIPS), 2001.

[8] A. Blum. Random projection, margins, kernels, and feature-selection. LNCS, 3940:52–68, 2006.

[9] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science (FOCS), pages 378–390, 1998.

[10] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. In Conference on Learning Theory (COLT), pages 323–337, 2005.

[11] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems (NIPS), 2004.

[12] Y. Shen, A. Y. Ng, and M. Seeger. Fast Gaussian process regression using KD-Trees. In Advances in Neural Information Processing Systems (NIPS), 2005.

[13] P. Indyk and N. Thaper. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, 2003.

[14] I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core Vector Machines: Fast SVM training on very large data sets. Journal of Machine Learning Research (JMLR), 6:363–392, 2005.

[15] W. Rudin. Fourier Analysis on Groups. Wiley Classics Library. Wiley-Interscience, New York, reprint edition, 1994.

[16] G. Loosli and S. Canu. Comments on the ‘Core Vector Machines: Fast SVM training on very large data sets’. Journal of Machine Learning Research (JMLR), 8:291–301, February 2007.

[17] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39:1–49, 2001.

A  Proofs

Lemma 1. Suppose a function $k(\Delta): \mathbb{R} \to \mathbb{R}$ is twice differentiable and has the form $k(\Delta) = \int_0^\infty p(\delta)\max(0, 1 - \Delta/\delta)\,d\delta$. Then $p(\delta) = \delta\ddot{k}(\delta)$.

Proof. We want $p$ so that
$$k(\Delta) = \int_0^\infty p(\delta)\max(0, 1 - \Delta/\delta)\,d\delta \qquad (3)$$
$$= \int_0^\Delta p(\delta)\cdot 0\,d\delta + \int_\Delta^\infty p(\delta)(1 - \Delta/\delta)\,d\delta = \int_\Delta^\infty p(\delta)\,d\delta - \Delta\int_\Delta^\infty \frac{p(\delta)}{\delta}\,d\delta. \qquad (4)$$
To solve for $p$, differentiate twice w.r.t. $\Delta$ to find that $\dot{k}(\Delta) = -\int_\Delta^\infty \frac{p(\delta)}{\delta}\,d\delta$ and $\ddot{k}(\Delta) = p(\Delta)/\Delta$.

Proof of Claim 1. Define $s(x, y) \equiv z(x)^\top z(y)$ and $f(x, y) \equiv s(x, y) - k(y, x)$. Since $f$ and $s$ are shift-invariant, we use $\Delta \equiv x - y \in \mathcal{M}_\Delta$ as their argument for notational simplicity. $\mathcal{M}_\Delta$ is compact and has diameter at most twice $\mathrm{diam}(\mathcal{M})$, so we can find an $\epsilon$-net that covers $\mathcal{M}_\Delta$ using at most $T = (4\,\mathrm{diam}(\mathcal{M})/r)^d$ balls of radius $r$ [17]. Let $\{\Delta_i\}_{i=1}^T$ denote the centers of these balls, and let $L_f$ denote the Lipschitz constant of $f$. We have $|f(\Delta)| < \epsilon$ for all $\Delta \in \mathcal{M}_\Delta$ if $|f(\Delta_i)| < \epsilon/2$ for all $i$ and $L_f < \frac{\epsilon}{2r}$. We bound the probability of these two events.

Since $f$ is differentiable, $L_f = \|\nabla f(\Delta^*)\|$, where $\Delta^* = \arg\max_{\Delta \in \mathcal{M}_\Delta} \|\nabla f(\Delta)\|$. We have
$$\mathbb{E}[L_f^2] = \mathbb{E}\|\nabla f(\Delta^*)\|^2 = \mathbb{E}\|\nabla s(\Delta^*)\|^2 - \mathbb{E}\|\nabla k(\Delta^*)\|^2 \le \mathbb{E}\|\nabla s(\Delta^*)\|^2 \le \mathbb{E}_p\|\omega\|^2 = \sigma_p^2,$$
so by Markov's inequality, $\Pr[L_f^2 \ge t] \le \mathbb{E}[L_f^2]/t$, or
$$\Pr\left[L_f \ge \frac{\epsilon}{2r}\right] \le \left(\frac{2r\sigma_p}{\epsilon}\right)^2. \qquad (5)$$
The union bound followed by Hoeffding's inequality applied to the anchors in the $\epsilon$-net gives
$$\Pr\left[\cup_{i=1}^T |f(\Delta_i)| \ge \epsilon/2\right] \le 2T\exp\left(-D\epsilon^2/8\right). \qquad (6)$$
Combining (5) and (6) gives a bound in terms of the free variable $r$:
$$\Pr\left[\sup_{\Delta \in \mathcal{M}_\Delta} |f(\Delta)| \le \epsilon\right] \ge 1 - 2\left(\frac{4\,\mathrm{diam}(\mathcal{M})}{r}\right)^d \exp\left(-D\epsilon^2/8\right) - \left(\frac{2r\sigma_p}{\epsilon}\right)^2. \qquad (7)$$
This has the form $1 - \kappa_1 r^{-d} - \kappa_2 r^2$. Setting $r = \left(\frac{\kappa_1}{\kappa_2}\right)^{\frac{1}{d+2}}$ turns this into $1 - 2\kappa_2^{\frac{d}{d+2}}\kappa_1^{\frac{2}{d+2}}$, and assuming that $\sigma_p\,\mathrm{diam}(\mathcal{M})/\epsilon \ge 1$ and $\mathrm{diam}(\mathcal{M}) \ge 1$, proves the first part of the claim. To prove the second part of the claim, pick any probability for the RHS and solve for $D$.

Proof of Claim 2. $\mathcal{M}$ can be covered by rectangles over each of which $z$ is constant. Let $\delta_{pm}$ be the pitch of the $p$th grid along the $m$th dimension. Each grid has at most $\lceil\mathrm{diam}(\mathcal{M})/\delta_{pm}\rceil$ bins, and $P$ overlapping grids produce at most $N_m = \sum_{p=1}^P \lceil\mathrm{diam}(\mathcal{M})/\delta_{pm}\rceil \le P + \mathrm{diam}(\mathcal{M})\sum_{p=1}^P \frac{1}{\delta_{pm}}$ partitions along the $m$th dimension. The expected value of the right-hand side is $P + P\,\mathrm{diam}(\mathcal{M})\alpha$. By Markov's inequality and the union bound,
$$\Pr\left[\forall_{m=1}^d\ N_m \le t\left(P + P\,\mathrm{diam}(\mathcal{M})\alpha\right)\right] \ge 1 - d/t.$$
That is, with probability $1 - d/t$, along every dimension we have at most $t(P + P\,\mathrm{diam}(\mathcal{M})\alpha)$ one-dimensional cells. Denote by $d_{mi}$ the width of the $i$th cell along the $m$th dimension and observe that $\sum_{i=1}^{N_m} d_{mi} \le \mathrm{diam}(\mathcal{M})$. We further subdivide these cells into smaller rectangles of some small width $r$ to ensure that the kernel $k$ varies very little over each of these cells. This results in at most $\sum_{i=1}^{N_m} \lceil d_{mi}/r\rceil \le N_m + \frac{\mathrm{diam}(\mathcal{M})}{r}$ small one-dimensional cells over each dimension. Plugging in the upper bound for $N_m$, setting $t \ge \frac{1}{\alpha P}$, and assuming $\alpha\,\mathrm{diam}(\mathcal{M}) \ge 1$, with probability $1 - d/t$, $\mathcal{M}$ can be covered with $T \le \left(\frac{3tP\alpha\,\mathrm{diam}(\mathcal{M})}{r}\right)^d$ rectangles of side $r$ centered at $\{x_i\}_{i=1}^T$.

The condition $|z(x, y) - k(x, y)| \le \epsilon$ on $\mathcal{M}\times\mathcal{M}$ holds if $|z(x_i, y_j) - k(x_i, y_j)| \le \epsilon - L_k r\sqrt{d}$ and $z(x)$ is constant throughout each rectangle. With $r\sqrt{d} = \frac{\epsilon}{2L_k}$, the union bound followed by Hoeffding's inequality gives
$$\Pr\left[\cup_{ij}\, |z(x_i, y_j) - k(x_i, y_j)| \ge \epsilon/2\right] \le 2T^2\exp\left(-P\epsilon^2/8\right). \qquad (8)$$
Combining this with the probability that $z(x)$ is constant in each cell gives a bound in terms of $t$:
$$\Pr\left[\sup_{x,y\in\mathcal{M}\times\mathcal{M}} |z(x, y) - k(x, y)| \le \epsilon\right] \ge 1 - \frac{d}{t} - 2\left(\frac{3tP\alpha\,\mathrm{diam}(\mathcal{M})\,2L_k\sqrt{d}}{\epsilon}\right)^d \exp\left(-P\epsilon^2/8\right).$$
This has the form $1 - \kappa_1 t^{-1} - \kappa_2 t^d$. To prove the claim, set $t = \left(\frac{\kappa_1}{2\kappa_2}\right)^{\frac{1}{d+1}}$, which results in a lower bound of $1 - 3\kappa_1^{\frac{d}{d+1}}\kappa_2^{\frac{1}{d+1}}$.
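As a worked illustration of Lemma 1 (an example added here, not taken from the paper), consider the Laplacian kernel $k(\Delta) = e^{-\Delta/\sigma}$ with an assumed bandwidth $\sigma > 0$. The lemma gives
$$p(\delta) = \delta\,\ddot{k}(\delta) = \frac{\delta}{\sigma^2}\,e^{-\delta/\sigma},$$
which is the Gamma density with shape 2 and scale $\sigma$, so the pitch of each random grid can be sampled from that Gamma distribution. Substituting back into (4) confirms the lemma for this kernel:
$$\int_\Delta^\infty \frac{\delta}{\sigma^2}e^{-\delta/\sigma}\,d\delta - \Delta\int_\Delta^\infty \frac{1}{\sigma^2}e^{-\delta/\sigma}\,d\delta = \left(\frac{\Delta}{\sigma} + 1\right)e^{-\Delta/\sigma} - \frac{\Delta}{\sigma}e^{-\Delta/\sigma} = e^{-\Delta/\sigma} = k(\Delta).$$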