nips nips2005 nips2005-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ^2 and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We consider regularized least-squares (RLS) with a Gaussian kernel. [sent-10, score-0.068]
2 The RLS problem is $\min_{f \in H} \frac{1}{n}\sum_{i=1}^{n}(f(x_i) - y_i)^2 + \lambda \|f\|_K^2$. Here, H is a Reproducing Kernel Hilbert Space (RKHS) [1] with associated kernel function K, $\|f\|_K^2$ is the squared norm in the RKHS, and λ is a regularization constant controlling the tradeoff between fitting the training set accurately and forcing smoothness of f. [sent-18, score-0.16]
3 The Representer Theorem [7] proves that the RLS solution will have the form $f(x) = \sum_{i=1}^{n} c_i K(x_i, x)$, and it is easy to show [5] that we can find the coefficients c by solving the linear system (K + λnI)c = y, (1) where K is the n by n matrix satisfying $K_{ij} = K(x_i, x_j)$. [sent-19, score-0.202]
4 We focus on the Gaussian kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. [sent-20, score-0.144]
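As a concrete illustration of Equation (1) and the Gaussian kernel above, here is a minimal sketch (not the authors' code; the NumPy-based helpers gaussian_kernel, rls_fit, and rls_predict are names introduced here for illustration) of training and evaluating Gaussian RLS:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)), via pairwise squared distances
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def rls_fit(X, y, sigma, lam):
    # Solve (K + lambda * n * I) c = y, i.e. Equation (1)
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def rls_predict(X, c, x0, sigma):
    # f(x0) = sum_i c_i K(x_i, x0) = c^t k
    k = gaussian_kernel(X, x0[None, :], sigma).ravel()
    return c @ k

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(50)
c = rls_fit(X, y, sigma=0.3, lam=1e-6)
print(rls_predict(X, c, np.array([0.5]), sigma=0.3))
```

In this form c is computed once from the training data and reused for every test point, which is the left-to-right association order discussed below.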
5 Our work was originally motivated by the empirical observation that on a range of benchmark classification tasks, we achieved surprisingly accurate classification using a Gaussian kernel with a very large σ and a very small λ (Figure 1; additional examples in [6]). [sent-21, score-0.066]
6 This prompted us to study the large-σ asymptotics of RLS. [sent-22, score-0.077]
7 As σ → ∞, K(xi , xj ) → 1 for arbitrary xi and xj . [sent-23, score-0.131]
8 RLS classification accuracy results for the UCI Galaxy dataset over a range of σ (along the x-axis) and λ (different lines) values. [sent-36, score-0.028]
9 The vertical labelled lines show m, the smallest entry in the kernel matrix for a given σ. [sent-37, score-0.167]
10 We see that when λ = 1e − 11, we can classify quite accurately when the smallest entry of the kernel matrix is . [sent-38, score-0.196]
11 then compute $f(x_0) = c^t k$ where k is the kernel vector, $k_i = K(x_i, x_0)$. [sent-40, score-0.099]
12 Combining the training and testing steps, we see that $f(x_0) = y^t (K + \lambda n I)^{-1} k$. [sent-41, score-0.03]
13 If we directly compute c = (K + λnI)−1 y, we will tend to wash out the effects of the ij term as σ becomes large. [sent-45, score-0.044]
14 If, instead, we compute f (x0 ) by associating to the right, first computing point affinities (K + λnI)−1 k, then the ij and j interact meaningfully; this interaction is crucial to our analysis. [sent-46, score-0.044]
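The effect of the association order can be checked directly; the sketch below (illustrative only, with hypothetical variable names) computes the same prediction both ways, which agree in exact arithmetic but can differ in floating point once σ is very large and λ is very small:

```python
import numpy as np

def gk(A, B, s):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * s**2))

rng = np.random.default_rng(1)
n, sigma, lam = 40, 1e3, 1e-9          # very large sigma, very small lambda
X = rng.uniform(0, 1, (n, 1))
y = np.sign(X[:, 0] - 0.5)
x0 = np.array([[0.3]])

K = gk(X, X, sigma)
k = gk(X, x0, sigma).ravel()
M = K + lam * n * np.eye(n)

# associate to the left: compute c first, risking wash-out of small kernel differences
c = np.linalg.solve(M, y)
f_left = c @ k

# associate to the right: compute the point affinities v = (K + lambda*n*I)^{-1} k first
v = np.linalg.solve(M, k)
f_right = y @ v

print(f_left, f_right)   # identical in exact arithmetic
```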
15 Our approach is to Taylor expand the kernel elements (and thus K and k) in 1/σ, noting that as σ → ∞, consecutive terms in the expansion differ enormously. [sent-47, score-0.094]
16 The asymptotic affinity formula can then be “transposed” to create an alternate expression for f (x0 ). [sent-49, score-0.035]
17 Our main result is that if we set $\sigma^2 = s^2$ and $\lambda = s^{-(2k+1)}$, then, as s → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. [sent-50, score-0.434]
18 Due to space restrictions, the proofs of supporting lemmas and corollaries are omitted; an expanded version containing all proofs is available [4]. [sent-52, score-0.035]
19 Let xi be a set of n + 1 points (0 ≤ i ≤ n) in a d dimensional space. [sent-54, score-0.078]
20 The scalar $x_{ia}$ denotes the value of the a-th vector component of the i-th point. [sent-55, score-0.173]
21 We think of X as the matrix of training data x1 , . [sent-57, score-0.075]
22 , xn, and x0 as a 1×d matrix consisting of the test point. [sent-60, score-0.045]
23 Let 1m , 1lm denote the m dimensional vector and l × m matrix with components all 1, similarly for 0m , 0lm . [sent-61, score-0.097]
24 For two l × m matrices, N, M , N M denotes the l × m matrix given by (N M )ij = Nij Mij . [sent-64, score-0.045]
25 $Y^I$ is the k dimensional vector given by $(Y^I)_i = \prod_{a=1}^{d} Y_{ia}^{I_a}$. [sent-68, score-0.052]
26 If h : Rd → R then h(Y ) is the k dimensional vector given by (h(Y ))i = h(Yi1 , . [sent-69, score-0.052]
27 The d canonical vectors, $e_a \in \mathbb{Z}^d_{\ge 0}$, are given by $(e_a)_b = \delta_{ab}$. [sent-73, score-0.114]
28 Any scalar function, f : R → R, applied to any matrix or vector, A, will be assumed to denote the elementwise application of f. [sent-74, score-0.08]
29 We will treat $y \mapsto e^y$ as a scalar function (we have no need of matrix exponentials in this work, so the notation is unambiguous). [sent-75, score-0.08]
30 Orthogonal polynomial bases: Let $V_c = \mathrm{span}\{X^I : |I| = c\}$ and $V_{\le c} = \sum_{a=0}^{c} V_a$, which can be thought of as the set of all d-variable polynomials of degree at most c, evaluated on the training data. [sent-77, score-0.361]
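One simple way to realize such a Q, sketched here under the assumption that a QR factorization of degree-ordered monomial features is an acceptable construction (the paper does not commit to a specific one); monomial_features and progressive_Q are names introduced for this illustration:

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_features(X, c):
    # columns X^I for all multi-indices I with |I| <= c, ordered by increasing degree
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, c + 1):
        for combo in combinations_with_replacement(range(d), deg):
            col = np.ones(n)
            for a in combo:
                col = col * X[:, a]
            cols.append(col)
    return np.column_stack(cols)

def progressive_Q(X, b):
    # orthonormal columns whose leading blocks span V_{<=0}, V_{<=1}, ..., V_{<=b}
    Phi = monomial_features(X, b)
    Q, _ = np.linalg.qr(Phi)     # reduced QR; upper-triangular R preserves the nesting
    return Q

X = np.random.default_rng(2).uniform(size=(20, 2))
Q = progressive_Q(X, 3)
print(Q.shape, np.allclose(Q.T @ Q, np.eye(Q.shape[1])))
```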
31 Generically, b is the smallest c such that $\binom{c+d}{d} \ge n$. [sent-79, score-0.03]
32 Let Q be an orthonormal matrix in $\mathbb{R}^{n \times n}$ whose columns progressively span the $V_{\le c}$ spaces, i. [sent-80, score-0.098]
33 Taking the σ → ∞ limit: We will begin with a few simple lemmas about the limiting solutions of linear systems. [sent-99, score-0.093]
34 At the end of this section we will arrive at the limiting form of suitably modified RLSC equations. [sent-100, score-0.058]
35 $\lim_{s \to 0} A^{-1}(s)\, y(s) = \begin{pmatrix} A_{00}(0) & 0 & \cdots & 0 \\ A_{10}(0) & A_{11}(0) & \cdots & 0 \\ \cdots & \cdots & \cdots & \cdots \\ A_{q0}(0) & A_{q1}(0) & \cdots & A_{qq}(0) \end{pmatrix}^{-1} \begin{pmatrix} b_0(0) \\ b_1(0) \\ \cdots \\ b_q(0) \end{pmatrix}$ We are now ready to state and prove the main result of this section, characterizing the limiting large-σ solution of Gaussian RLS. [sent-104, score-0.455]
36 Let q be an integer satisfying q < b, and let p = 2q + 1. [sent-106, score-0.048]
37 Changing bases with Q, $Q^t v = \lim_{\sigma \to \infty} \left( Q^t e^{\frac{1}{\sigma^2} X X^t} Q + nC\sigma^{-p}\, Q^t \operatorname{diag}\!\left(e^{\frac{1}{\sigma^2}\|X\|^2}\right) Q \right)^{-1} Q^t e^{\frac{1}{\sigma^2} X x_0^t}$. [sent-113, score-0.284]
38 Expanding via Taylor series and writing in block form (in the b × b block structure of Q), $Q^t e^{\frac{1}{\sigma^2} X X^t} Q = Q^t (X X^t)^{0} Q + \frac{1}{1!\,\sigma^2} Q^t (X X^t)^{1} Q + \frac{1}{2!\,\sigma^4} Q^t (X X^t)^{2} Q + \cdots$, with elementwise powers of $XX^t$. [sent-114, score-0.084]
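The claim that successive terms of this expansion differ enormously can be verified numerically; the following sketch (illustrative only) forms a few elementwise Taylor terms of $e^{XX^t/\sigma^2}$ and shows that each term is roughly a factor of σ² smaller than the last:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (10, 2))
G = X @ X.T                      # matrix of inner products x_i . x_j
sigma = 1e3

# elementwise Taylor terms of exp(G / sigma^2): term c is G^c / (c! sigma^(2c)) elementwise
terms = [G**c / (math.factorial(c) * sigma**(2 * c)) for c in range(4)]
for c, T in enumerate(terms):
    print(c, np.max(np.abs(T)))              # each term ~ sigma^2 smaller than the last

print(np.max(np.abs(sum(terms) - np.exp(G / sigma**2))))   # truncation error of 4 terms
```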
39 5 The classification function: When performing RLS, the actual prediction of the limiting classifier is given via $f_\infty(x_0) \equiv \lim_{\sigma \to \infty} y^t (K + nC\sigma^{-p} I)^{-1} k$. [sent-118, score-0.188]
40 Theorem 1 determines $v = \lim_{\sigma \to \infty} (K + nC\sigma^{-p} I)^{-1} k$, showing that $f_\infty(x_0)$ is a polynomial in the training data X. [sent-119, score-0.264]
41 In this section, we show that f∞ (x0 ) is, in fact, a polynomial in the test point x0 . [sent-120, score-0.234]
42 We continue to work with the orthonormal vectors $B_i$ as well as the auxiliary quantities $A_{ij}^{(c)}$ and $b_i^{(c)}$ from Theorem 1. [sent-121, score-0.176]
43 Theorem 1 shows that v ∈ V≤q : the point affinity function is a polynomial of degree q in the training data, determined by (7). [sent-122, score-0.29]
44 Since $B_c b_c^{(c)} = B_c B_c^t (X x_0^t)^{c}$, we can restate Equation 7 in an equivalent form: [sent-127, score-0.147]
45 $\begin{pmatrix} B_0^t \\ \cdots \\ B_q^t \end{pmatrix} v = \begin{pmatrix} 0!\, b_0^{(0)} \\ \cdots \\ q!\, b_q^{(q)} \end{pmatrix}$ (10) [sent-135, score-0.201]
46 Up to this point, our results hold for arbitrary training data X. [sent-139, score-0.03]
47 To proceed, we require a mild condition on our training set. [sent-140, score-0.03]
48 For generic X, the solution to Equation 7 (or equivalently, Equation 10) is determined by the conditions $\forall I : |I| \le q,\ (X^I)^t v = x_0^I$, where $v \in V_{\le q}$. [sent-147, score-0.097]
49 For generic data, let v be the solution to Equation 10. [sent-149, score-0.122]
50 For any $y \in \mathbb{R}^n$, $f(x_0) = y^t v = h(x_0)$, where $h(x) = \sum_{|I| \le q} a_I x^I$ is a multivariate polynomial of degree q minimizing $\|y - h(X)\|$. [sent-150, score-0.26]
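The minimizing polynomial h in this statement is just an ordinary least-squares fit over monomial features; a one-dimensional sketch (illustrative, with the helper name best_poly_prediction introduced here) is:

```python
import numpy as np

def best_poly_prediction(x, y, x0, q):
    # degree-q polynomial h minimizing ||y - h(x)|| (one-dimensional data), evaluated at x0
    Phi = np.vander(x, q + 1, increasing=True)     # columns 1, x, ..., x^q
    a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.polyval(a[::-1], x0)

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 150)
y = np.cos(3 * x) + 0.05 * rng.standard_normal(150)
print(best_poly_prediction(x, y, 0.5, q=2))
```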
51 We see that as σ → ∞, the RLS solution tends to the minimum empirical error kth order polynomial. [sent-151, score-0.2]
52 Figure 2 plots f, along with a 150 point dataset drawn by choosing $x_i$ uniformly in [0, 1], and choosing $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ is a Gaussian random variable with mean 0 and standard deviation . [sent-154, score-0.081]
53 Figure 2 also shows (in red) the best polynomial approximations to the data (not to the ideal f ) of various orders. [sent-156, score-0.278]
54 (We omit third order because it is nearly indistinguishable from second order. [sent-157, score-0.032]
55 ) According to Theorem 1, if we parametrize our system by a variable s, and solve a Gaussian regularized least-squares problem with $\sigma^2 = s^2$ and $\lambda = C s^{-(2k+1)}$ for some integer k, then, as s → ∞, we expect the solution to the system to tend to the kth-order data-based polynomial approximation to f. [sent-158, score-0.391]
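A sketch of that experiment, written as an assumption-laden illustration rather than the authors' actual code (the helper gk and the chosen values of s, k, and C are all illustrative), is:

```python
import numpy as np

def gk(A, B, s):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * s**2))

rng = np.random.default_rng(5)
n, k, C = 150, 2, 1.0
X = rng.uniform(0, 1, (n, 1))
y = np.cos(3 * X[:, 0]) + 0.05 * rng.standard_normal(n)
x0 = np.array([[0.5]])

# degree-k least-squares polynomial prediction at x0, for comparison
Phi = np.vander(X[:, 0], k + 1, increasing=True)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)
poly_pred = np.polyval(a[::-1], x0[0, 0])

for s in [10.0, 30.0, 100.0]:
    sigma, lam = s, C * s ** (-(2 * k + 1))        # sigma^2 = s^2, lambda = C s^-(2k+1)
    M = gk(X, X, sigma) + lam * n * np.eye(n)
    v = np.linalg.solve(M, gk(X, x0, sigma).ravel())    # point affinities
    print(s, y @ v, poly_pred)   # RLS prediction should drift toward the polynomial fit
```

As the Discussion below notes, the approach eventually runs into double-precision limits, so very large s values may not track the polynomial fit numerically.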
56 7 Discussion: Our result provides insight into the asymptotic behavior of RLS, and (partially) explains Figure 1: in conjunction with additional experiments not reported here, we believe that [sent-165, score-0.035]
[Figure 2 residue: scatter plot titled "f(x), Random Sample of f(x), and Polynomial Approximations"; y-axis labeled y; legend: f, 0th order, 1st order, 2nd order, 4th order, 5th order.]
60 95), a random dataset drawn from f (x) with added Gaussian noise, and data-based polynomial approximations to f . [sent-185, score-0.306]
61 we are recovering second-order polynomial behavior, with the drop-off in performance at various λ’s occurring at the transition to third-order behavior, which cannot be accurately recovered in IEEE double-precision floating-point. [sent-186, score-0.298]
62 Although we used the specific details of RLS in deriving our solution, we expect that in practice, a similar result would hold for Support Vector Machines, and perhaps for Tikhonov regularization with convex loss more generally. [sent-187, score-0.035]
63 An interesting implication of our theorem is that for very large σ, we can obtain various order polynomial classifications by sweeping λ. [sent-188, score-0.341]
64 Our work also has implications for approximations to the Gaussian kernel. [sent-191, score-0.044]
65 We showed empirically that the FGT becomes much more accurate at larger values of σ; however, at large-σ, it seems likely we are merely recovering low-order polynomial behavior. [sent-196, score-0.269]
66 We suggest that approximations to the Gaussian kernel must be checked carefully, to show that they produce sufficiently good results at moderate values of σ; this is a topic for future work. [sent-197, score-0.11]
67 As s → ∞, with $\sigma^2 = s^2$ and $\lambda = s^{-(2k+1)}$, the solution to Gaussian RLS approaches the kth order polynomial solution. [sent-281, score-0.395]
68 Asymptotic behaviors of support vector machines with gaussian kernel. [sent-284, score-0.105]
69 Efficient kernel machines using the improved fast Gauss transform. [sent-304, score-0.096]
wordName wordTfidf (topN-words)
[('rls', 0.402), ('qq', 0.373), ('xxt', 0.316), ('xx', 0.238), ('bc', 0.237), ('polynomial', 0.234), ('qt', 0.212), ('bq', 0.201), ('nc', 0.152), ('iq', 0.15), ('bi', 0.147), ('lim', 0.13), ('diag', 0.123), ('aqq', 0.115), ('ea', 0.114), ('lippert', 0.086), ('xia', 0.086), ('ni', 0.079), ('bj', 0.074), ('fgt', 0.068), ('regularized', 0.068), ('vc', 0.067), ('kernel', 0.066), ('solution', 0.066), ('kth', 0.063), ('aij', 0.063), ('limiting', 0.058), ('acj', 0.057), ('galaxy', 0.057), ('nij', 0.057), ('rlsc', 0.057), ('ross', 0.057), ('ryan', 0.057), ('zd', 0.057), ('successive', 0.056), ('xi', 0.053), ('yang', 0.052), ('asymptotics', 0.052), ('ci', 0.052), ('theorem', 0.052), ('oating', 0.05), ('colspan', 0.05), ('rifkin', 0.05), ('gaussian', 0.048), ('af', 0.046), ('nities', 0.046), ('matrix', 0.045), ('approximations', 0.044), ('ij', 0.044), ('block', 0.042), ('polynomials', 0.04), ('tends', 0.039), ('xj', 0.039), ('ba', 0.036), ('gauss', 0.036), ('rkhs', 0.036), ('lemmas', 0.035), ('recovering', 0.035), ('kij', 0.035), ('regularization', 0.035), ('scalar', 0.035), ('asymptotic', 0.035), ('mathematics', 0.033), ('ki', 0.033), ('order', 0.032), ('taylor', 0.032), ('generic', 0.031), ('pd', 0.031), ('reproducing', 0.031), ('bases', 0.031), ('lemma', 0.03), ('training', 0.03), ('smallest', 0.03), ('machines', 0.03), ('orthonormal', 0.029), ('massachusetts', 0.029), ('classi', 0.029), ('accurately', 0.029), ('equation', 0.028), ('dataset', 0.028), ('nity', 0.028), ('noting', 0.028), ('vector', 0.027), ('letting', 0.027), ('entry', 0.026), ('degree', 0.026), ('dimensional', 0.025), ('fresh', 0.025), ('ced', 0.025), ('acc', 0.025), ('ath', 0.025), ('factorizations', 0.025), ('prompted', 0.025), ('let', 0.025), ('span', 0.024), ('integer', 0.023), ('unambiguous', 0.023), ('sweeping', 0.023), ('bb', 0.023), ('evgeniou', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ^2 and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
2 0.16022241 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
3 0.11110757 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
4 0.094132878 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.088496946 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
6 0.077278405 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
7 0.076312415 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
8 0.067069888 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
9 0.066617072 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
10 0.06288071 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
11 0.060489055 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
12 0.056817178 127 nips-2005-Mixture Modeling by Affinity Propagation
13 0.052297045 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
14 0.048478633 114 nips-2005-Learning Rankings via Convex Hull Separation
15 0.047998428 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
16 0.046217218 105 nips-2005-Large-Scale Multiclass Transduction
17 0.045985833 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
18 0.044523899 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
19 0.043451853 117 nips-2005-Learning from Data of Variable Quality
20 0.04191836 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
topicId topicWeight
[(0, 0.15), (1, 0.077), (2, -0.032), (3, -0.086), (4, 0.02), (5, 0.051), (6, -0.004), (7, -0.022), (8, 0.015), (9, -0.003), (10, -0.024), (11, 0.011), (12, -0.03), (13, 0.023), (14, -0.04), (15, 0.135), (16, -0.034), (17, 0.065), (18, 0.075), (19, 0.003), (20, -0.022), (21, 0.002), (22, 0.042), (23, -0.031), (24, 0.004), (25, -0.031), (26, 0.056), (27, 0.068), (28, -0.123), (29, -0.037), (30, -0.014), (31, 0.174), (32, 0.036), (33, -0.057), (34, -0.054), (35, -0.0), (36, -0.2), (37, -0.071), (38, -0.411), (39, 0.029), (40, -0.294), (41, 0.073), (42, -0.103), (43, -0.076), (44, 0.048), (45, -0.002), (46, -0.032), (47, -0.056), (48, -0.028), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.93741268 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ^2 and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
2 0.83799738 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
3 0.60249203 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
4 0.41010031 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
5 0.40764505 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1
6 0.35025316 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
7 0.34051001 71 nips-2005-Fast Krylov Methods for N-Body Learning
8 0.31955141 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
9 0.30547529 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
10 0.27093446 114 nips-2005-Learning Rankings via Convex Hull Separation
11 0.25916141 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
12 0.25837472 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
13 0.25736374 146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?
14 0.25596386 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
15 0.24918987 2 nips-2005-A Bayes Rule for Density Matrices
16 0.24838132 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
17 0.24834453 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.24637721 103 nips-2005-Kernels for gene regulatory regions
19 0.24347751 117 nips-2005-Learning from Data of Variable Quality
20 0.24235092 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
topicId topicWeight
[(3, 0.052), (10, 0.043), (11, 0.015), (27, 0.03), (31, 0.022), (34, 0.135), (41, 0.016), (50, 0.016), (55, 0.023), (65, 0.016), (69, 0.033), (72, 0.368), (73, 0.028), (77, 0.018), (88, 0.068), (91, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.77018797 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial whose order is controlled by the relative rates of decay of σ^2 and λ: if λ = σ^{−(2k+1)}, then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
2 0.61486405 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
3 0.55943877 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
4 0.42408198 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
5 0.42290565 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
6 0.42156148 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
7 0.42005718 184 nips-2005-Structured Prediction via the Extragradient Method
8 0.41818887 105 nips-2005-Large-Scale Multiclass Transduction
9 0.41762158 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
10 0.41709518 50 nips-2005-Convex Neural Networks
11 0.41262817 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
12 0.41103297 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
13 0.4107793 177 nips-2005-Size Regularized Cut for Data Clustering
14 0.41076487 77 nips-2005-From Lasso regression to Feature vector machine
15 0.4105778 47 nips-2005-Consistency of one-class SVM and related algorithms
16 0.4094815 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
17 0.40880588 160 nips-2005-Query by Committee Made Real
18 0.40622675 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
19 0.40486649 195 nips-2005-Transfer learning for text classification
20 0.40477294 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm