nips nips2005 nips2005-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. [sent-9, score-0.256]
2 It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. [sent-10, score-0.316]
3 The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. [sent-11, score-0.148]
4 The paper also presents theoretical bounds on the stability of these methods. [sent-12, score-0.118]
5 1 Introduction Machine learning techniques based on similarity metrics have gained wide acceptance over the last few years. [sent-13, score-0.123]
6 Here one forms a Laplacian matrix L = D^{-1/2} W D^{-1/2}, where the entries of W measure the similarity between data points x_i ∈ X, i = 1, ..., N. [sent-15, score-0.189]
7 For example, a popular choice is to set the entries of W to w_ij = exp(-||x_i − x_j||^2 / σ), where σ is a user-specified parameter. [sent-19, score-0.076]
8 D is a normalizing diagonal matrix with entries d_i = Σ_j w_ij. [sent-20, score-0.081]
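A minimal NumPy sketch of the construction just described, forming the Gaussian affinity matrix W, the degrees d_i and L = D^{-1/2} W D^{-1/2}. This is illustrative only (data, bandwidth and function names are placeholders, not from the paper); the dense O(N^2) matrix built here is precisely what the fast methods discussed later avoid storing.

import numpy as np

def gaussian_affinity(X, sigma=0.5):
    # Pairwise squared distances ||x_i - x_j||^2, computed densely (O(N^2) memory).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    W = np.exp(-d2 / sigma)                              # w_ij = exp(-||x_i - x_j||^2 / sigma)
    d = W.sum(axis=1)                                    # degrees d_i = sum_j w_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]    # L = D^{-1/2} W D^{-1/2}
    return W, L

X = np.random.randn(500, 2)                              # placeholder data
W, L = gaussian_affinity(X, sigma=0.5)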
9 K-means generates better clusters on this nonlinear embedding of the data provided one adopts a suitable similarity metric. [sent-22, score-0.154]
10 In these settings, one is interested in either inverting the similarity matrix or finding some of its eigenvectors. [sent-24, score-0.14]
11 The computational cost of both of these operations is O(N^3) while the storage requirement is O(N^2). [sent-25, score-0.111]
12 These costs are prohibitively large in applications where one encounters massive quantities of data points or where one is interested in real-time solutions such as spectral image segmentation for mobile robots [8]. [sent-26, score-0.167]
13 In this paper, we present general numerical techniques for reducing the computational cost to O(N log N ), or even O(N ) in specific cases, and the storage cost to O(N ). [sent-27, score-0.194]
14 These reductions are achieved by combining Krylov subspace iterative solvers (such as Arnoldi, Lanczos, GMRES and conjugate gradients) with fast kernel density estimation (KDE) techniques (such as fast multipole expansions, the fast Gauss transform and dual tree recursions [9, 10, 11]). [sent-28, score-0.984]
15 Specific Krylov methods have been applied to kernel problems. [sent-29, score-0.07]
16 For example, [12] uses Lanczos for spectral clustering and [4] uses conjugate gradients for Gaussian processes. [sent-30, score-0.229]
17 However, the use of fast KDE methods, in particular fast multipole methods, to further accelerate these techniques has only appeared in the context of interpolation [13] and our paper on semi-supervised learning [8]. [sent-31, score-0.464]
18 Here, we go for a more general exposition and present several new examples, such as fast nonlinear embeddings and fast Gaussian processes. [sent-32, score-0.33]
19 More importantly, we attack the issue of stability of these methods. [sent-33, score-0.121]
20 Before doing so, we begin with a very brief review of Krylov solvers and fast KDE methods. [sent-38, score-0.208]
21 2 Krylov subspace iteration This section is a compressed overview of Krylov subspace iteration. [sent-39, score-0.243]
22 The main message is that Krylov methods are very efficient algorithms for solving linear systems and eigenvalue problems, but they require a matrix vector multiplication at each iteration. [sent-40, score-0.187]
23 In the next section, we replace this expensive matrix-vector multiplication with a call to fast KDE routines. [sent-41, score-0.26]
24 Readers happy with this message and familiar with Krylov methods, such as conjugate gradients and Lanczos, can skip the rest of this section. [sent-42, score-0.104]
25 For ease of presentation, let the similarity matrix be simply A = W ∈ R^{N×N}, with entries a_ij = a(x_i, x_j). [sent-43, score-0.226]
26 Typical measures of similarity include polynomial a(x_i, x_j) = (x_i x_j^T + b)^p, Gaussian a(x_i, x_j) = exp(−(x_i − x_j)(x_i − x_j)^T / σ) and sigmoid a(x_i, x_j) = tanh(α x_i x_j^T − β) kernels, where x_i x_j^T denotes a scalar inner product. [sent-45, score-0.258]
27 Our goal is to solve linear systems Ax = b and (possibly generalized) eigenvalue problems Ax = λx. [sent-46, score-0.042]
28 The former arise, for example, in semi-supervised learning and Gaussian processes, while the latter arise in spectral clustering and dimensionality reduction. [sent-47, score-0.153]
29 One could attack these problems with naive iterative methods such as the power method, Jacobi and Gauss-Seidel [14]. [sent-48, score-0.119]
30 The problem with these strategies is that the estimate x(t) , at iteration t, only depends on the previous estimate x(t−1) . [sent-49, score-0.059]
31 Hence, these methods typically take too many iterations to converge. [sent-50, score-0.04]
32 It is well accepted in the numerical computation field that Krylov methods [14, 15], which make use of the entire history of solutions {x(1), ..., x(t−1)}, are far more efficient. [sent-51, score-0.066]
33 The intuition behind Krylov subspace methods is to use the history of the solutions we have already computed. [sent-55, score-0.132]
34 Given a matrix A and a vector b, the associated Krylov matrix is: K = [b Ab A^2 b ... A^{t−1} b]. [sent-57, score-0.088]
35 (As in the power method, A^t b converges to the eigenvector corresponding to the largest eigenvalue of A.) [sent-64, score-0.042]
36 We therefore need to construct a well-conditioned orthogonal matrix Q(t) = [q(1) · · · q(t)], with q(i) ∈ R^N, that spans the Krylov space. [sent-65, score-0.044]
37 That is, the leading t columns of K and Q span the same space. [sent-66, score-0.042]
38 This is easily done using the QR-decomposition of K [14], yielding the following Arnoldi relation (augmented Schur factorization): AQ(t) = Q(t+1) H(t), where H(t) is the (t + 1) × t augmented Hessenberg matrix with entries h_{i,j} that vanish below the first subdiagonal (h_{i,j} = 0 for i > j + 1). [sent-67, score-0.065]
39 The eigenvalues of the smaller (t + 1) × t Hessenberg matrix approximate the eigenvalues of A as t increases. [sent-83, score-0.108]
40 These eigenvalues can be computed efficiently by applying the Arnoldi relation recursively as shown in Figure 1. [sent-84, score-0.072]
41 Notice that the matrix-vector multiplication v = Aq is the expensive step in the Arnoldi algorithm. [sent-86, score-0.139]
42 • Perform step t of the Arnoldi algorithm • Solve y(t) = argmin_y ||H(t) y − ||b|| e_1|| • Set x(t) = Q(t) y(t). Figure 1: The Arnoldi (left) and GMRES (right) algorithms. [sent-97, score-0.068]
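For concreteness, a compact sketch (an assumed implementation, not the authors' code) of the Arnoldi recursion summarized in Figure 1, followed by the GMRES least-squares step. Here matvec is the routine computing v = Aq, which is the step the fast KDE methods of the next section replace with an approximate call.

import numpy as np

def arnoldi(matvec, b, t):
    # t steps of Arnoldi: returns Q (N x (t+1)) and the augmented Hessenberg H ((t+1) x t).
    N = b.shape[0]
    Q = np.zeros((N, t + 1))
    H = np.zeros((t + 1, t))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(t):
        v = matvec(Q[:, k])                  # the expensive step: v = A q^(k)
        for i in range(k + 1):               # modified Gram-Schmidt orthogonalization
            H[i, k] = Q[:, i] @ v
            v = v - H[i, k] * Q[:, i]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] < 1e-12:              # happy breakdown: invariant subspace found
            return Q[:, :k + 2], H[:k + 2, :k + 1]
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H

def gmres(matvec, b, t):
    # x^(t) = Q^(t) y, where y minimizes ||H^(t) y - ||b|| e_1||.
    Q, H = arnoldi(matvec, b, t)
    rhs = np.zeros(H.shape[0]); rhs[0] = np.linalg.norm(b)
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return Q[:, :H.shape[1]] @ y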
43 One can minimize either the norm of the residual r(t) = b − Ax(t), leading to the GMRES and MINRES algorithms, or the A-norm of the error, leading to conjugate gradients (CG) [14]. [sent-98, score-0.104]
44 At step t of GMRES, we approximate the solution by the vector in the Krylov subspace x(t) ∈ K(t) that minimizes the norm of the residual. [sent-101, score-0.092]
45 Since x(t) is in the Krylov subspace, it can be written as a linear combination of the columns of the Krylov matrix K (t) . [sent-102, score-0.086]
46 As before, stability considerations force us to use the QR decomposition of K (t) . [sent-104, score-0.071]
47 That is, instead of using a linear combination of the columns of K(t) , we use a linear combination of the columns of Q(t) . [sent-105, score-0.084]
48 So our least squares problem becomes y(t) = argmin_y ||AQ(t) y − b||. [sent-106, score-0.102]
49 Since AQ(t) = Q(t+1) H(t), we only need to solve a problem of dimension (t + 1) × t: y(t) = argmin_y ||Q(t+1) H(t) y − b||. [sent-107, score-0.068]
50 Keeping in mind that the columns of the projection matrix Q are orthonormal, we can rewrite this least squares problem as min_y ||H(t) y − Q(t+1)T b||. [sent-108, score-0.12]
51 The final form of our least squares problem at iteration t is: y(t) = argmin_y ||H(t) y − ||b|| e_1||, with solution x(t) = Q(t) y(t). [sent-110, score-0.093]
52 Notice again that the expensive step in each iteration is the matrix-vector product v = Aq. [sent-113, score-0.093]
53 One important property of the Arnoldi relation is that the residuals are orthogonal to the space spanned by the columns of V = Q(t+1) H(t) . [sent-115, score-0.298]
54 That is, VT r(t) = H(t)T Q(t+1)T (b − Q(t+1) H(t) y(t)) = H(t)T ||b|| e_1 − H(t)T H(t) y(t) = 0. In the following section, we introduce methods to speed up the matrix-vector product v = Aq. [sent-116, score-0.04]
55 These methods will incur, at most, a pre-specified (tolerance) error e (t) at iteration t. [sent-117, score-0.099]
56 Later, we present theoretical bounds on how these errors affect the residuals and the orthogonality of the Krylov subspace. [sent-118, score-0.342]
57 3 Fast KDE The expensive step in Krylov methods is the operation v = Aq(t) . [sent-119, score-0.074]
58 This step requires that we solve O(N^2) kernel estimates of the form: v_i(t) = Σ_{j=1}^{N} q_j(t) a(x_i, x_j), i = 1, 2, ..., N. [sent-120, score-0.08]
59 It is possible to reduce the storage and computational cost to O(N) at the expense of a small specified error tolerance ε, say 10^−6, using the fast Gauss transform (FGT) algorithm [16, 17]. [sent-124, score-0.372]
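The overall recipe can be sketched with SciPy's iterative solvers: wrap the kernel matrix-vector product in a LinearOperator so the N × N matrix is never stored, and hand it to a Krylov routine such as CG. The block-evaluated kernel_matvec below is only a stand-in with O(N) storage but still O(N^2) time; in the scheme described in this paper it would be replaced by an FGT, IFGT or dual-tree call (none of which ship with SciPy). All sizes and parameters are placeholders.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def kernel_matvec(X, q, sigma, block=1024):
    # v_i = sum_j q_j exp(-||x_i - x_j||^2 / sigma), evaluated block-wise (O(N) storage).
    # A fast KDE routine (FGT / IFGT / dual-tree recursion) would replace this loop.
    N = X.shape[0]
    v = np.zeros(N)
    sq = np.sum(X**2, axis=1)
    for start in range(0, N, block):
        stop = min(start + block, N)
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
        v[start:stop] = np.exp(-d2 / sigma) @ q
    return v

# Solve (W + lam*I) x = b with conjugate gradients, never forming W explicitly.
N = 2000
X = np.random.randn(N, 3)
b = np.random.randn(N)
sigma, lam = 1.0, 1e-2
A = LinearOperator((N, N), matvec=lambda q: kernel_matvec(X, q, sigma) + lam * q)
x, info = cg(A, b, maxiter=200)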
60 This algorithm is an instance of more general fast multipole methods for solving N -body interactions [9]. [sent-125, score-0.277]
61 However, to attack larger dimensions one can adopt clustering-based partitions as in the improved fast Gauss transform (IFGT) [10]. [sent-127, score-0.249]
62 Fast multipole methods tend to work only in low dimensions and are specific to the choice of similarity metric. [sent-128, score-0.181]
63 Dual tree recursions based on KD-trees and ball trees [11, 18] overcome these difficulties, but on average cost O(N log N ). [sent-129, score-0.067]
64 4 Stability results The problem with replacing the matrix-vector multiplication at each iteration of the Krylov methods is that we do not know how the errors accumulate over successive iterations. [sent-131, score-0.16]
65 In this section, we will derive bounds that describe what factors influence these errors. [sent-132, score-0.047]
66 In particular, the bounds will state what properties of the similarity metric and measurable quantities affect the residuals and the orthogonality of the Krylov subspaces. [sent-133, score-0.411]
67 Several papers have addressed the issue of Krylov subspace stability [19, 20, 21]. [sent-134, score-0.163]
68 Let e(t) denote the errors introduced in the approximate matrix-vector multiplication at each iteration of Arnoldi. [sent-137, score-0.12]
69 For the purposes of upper-bounding, this is the tolerance of the fast KDE methods. [sent-138, score-0.258]
70 Then, the fast KDE methods change the Arnoldi relation to: AQ(t) + E(t) = [Aq(1) + e(1), ..., Aq(t) + e(t)] = Q(t+1) H(t), where E(t) = [e(1), ..., e(t)]. [sent-139, score-0.245]
71 The new true residuals are therefore: r(t) = b − Ax(t) = b − AQ(t) y(t) = b − Q(t+1) H(t) y(t) + E(t) y(t), while r̃(t) = b − Q(t+1) H(t) y(t) are the measured residuals. [sent-146, score-0.216]
72 We need to ensure two bounds when using fast KDE methods in Krylov iterations. [sent-147, score-0.252]
73 First, the measured residuals r̃(t) should not deviate too far from the true residuals r(t). [sent-148, score-0.432]
74 The deviation in residuals is given by r(t) − r̃(t) = E(t) y(t). [sent-151, score-0.24]
75 Taking norms, ||r(t) − r̃(t)|| = ||E(t) y(t)|| ≤ Σ_{k=1}^{t} |y_k| ||e(k)||  (1). The deviation from orthogonality can be upper-bounded in a similar fashion: ||VT r(t)|| = ||H(t)T Q(t+1)T (r̃(t) + E(t) y(t))|| = ||H(t)T Q(t+1)T E(t) y(t)|| ≤ ||H(t)|| Σ_{k=1}^{t} |y_k| ||e(k)||  (2). The following lemma provides a relation between the y_k and the measured residuals r̃(k−1). [sent-157, score-0.555]
76 Assume that t iterations of the inexact Arnoldi method have been carried out. [sent-160, score-0.068]
77 The proof of the lemma follows from the QR decomposition of H(t) , see [15, 21]. [sent-165, score-0.04]
78 Proposition 1 tells us that in order to keep the residuals bounded while ensuring bounded deviations from orthogonality at iteration k, we need to monitor the eigenvalues of H(t) and the measured residuals r̃(k−1). [sent-171, score-0.626]
79 If the residuals decrease, we can increase the tolerance of the fast KDE algorithms and vice versa. [sent-174, score-0.443]
80 The bounds do lead to a natural way of constructing adaptive algorithms for setting the tolerance of the fast KDE algorithms. [sent-175, score-0.274]
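One way such an adaptive scheme might look, under our reading of Proposition 1 and the remark above (the constants and the schedule are illustrative assumptions, not values given in the paper): relax the per-iteration KDE tolerance in inverse proportion to the measured residual, so coarse, cheap approximations are used once the iterate is close to convergence.

def kde_tolerance(target_residual, measured_residual, eps_min=1e-8, eps_max=1e-2, eta=0.1):
    # Relax the fast-KDE tolerance as the measured residual r~(k-1) shrinks, and
    # tighten it again if the residual grows (cf. the discussion of Proposition 1).
    if measured_residual <= 0.0:
        return eps_max
    eps = eta * target_residual / measured_residual
    return min(max(eps, eps_min), eps_max)

Inside the Krylov loop, the returned eps would be passed to the FGT/IFGT/dual-tree routine as its error tolerance before computing the next approximate product v = Aq.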
81 The plot on the right shows the computational gains obtained by using fast Krylov methods. [sent-180, score-0.236]
82 5 Experimental results The results of this section demonstrate that significant computational gains may be obtained by combining fast KDE methods with Krylov iterations. [sent-181, score-0.3]
83 We present results in three domains: spectral clustering and image segmentation [1, 12], Gaussian process regression [4] and stochastic neighbor embedding [6]. [sent-182, score-0.378]
84 1 Gaussian processes with large dimensional features In this experiment we use Gaussian processes to predict the labels of 128-dimensional SIFT features [22] for the purposes of object detection and localization as shown in Figure 2. [sent-184, score-0.129]
85 There are typically thousands of features per image, so it is of paramount importance to generate fast predictions. [sent-185, score-0.189]
86 The hard computational task here involves inverting the covariance matrix of the Gaussian process. [sent-186, score-0.107]
87 The figure shows that it is possible to do this efficiently, under the same ROC error, by combining conjugate gradients [4] with dual trees. [sent-187, score-0.172]
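For concreteness, a sketch of the computation referred to here, written as plain GP regression rather than the paper's SIFT-feature experiment: the predictive mean needs the solution of (K + noise^2 I) alpha = y, and conjugate gradients obtains it from matrix-vector products alone, which is exactly where a dual-tree or fast Gauss transform routine can be substituted. A dense K is formed below only to keep the sketch short; all names and hyperparameters are placeholders.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gp_predictive_mean(X, y, X_star, sigma=1.0, noise=0.1):
    # Solve (K + noise^2 I) alpha = y by CG, then predict with K_* alpha.
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / sigma)
    A = LinearOperator(K.shape, matvec=lambda v: K @ v + noise**2 * v)
    alpha, info = cg(A, y, maxiter=500)
    sq_s = np.sum(X_star**2, axis=1)
    K_star = np.exp(-(sq_s[:, None] + sq[None, :] - 2.0 * X_star @ X.T) / sigma)
    return K_star @ alpha

X = np.random.randn(1000, 5)                 # placeholder training features
y = np.sin(X[:, 0])                          # placeholder targets
X_star = np.random.randn(10, 5)              # placeholder test points
mean = gp_predictive_mean(X, y, X_star)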
88 2 Spectral clustering and image segmentation We applied spectral clustering to color image segmentation; a generalized eigenvalue problem. [sent-189, score-0.348]
89 We observed that fast Krylov methods run approximately twice as fast as the Nystrom method. [sent-192, score-0.37]
90 One should note that the result of Nystrom depends on the quality of sampling, while fast N-body methods enable us to work directly with the full matrix, so the solution is less sensitive. [sent-193, score-0.205]
91 Once again, fast KDE methods lead to significant computational improvements over Krylov algorithms (Lanczos in this case). [sent-194, score-0.241]
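A corresponding sketch for the eigenvalue side: SciPy's eigsh wraps an implicitly restarted Lanczos routine (ARPACK) and, like CG above, only requires matrix-vector products, so the same fast KDE substitution applies. A dense normalized affinity matrix is formed here purely to keep the example short; sizes, bandwidth and the number of eigenvectors are placeholders.

import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

X = np.random.randn(1500, 2)                               # placeholder data
sq = np.sum(X**2, axis=1)
W = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / 0.5)
d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
# L = D^{-1/2} W D^{-1/2}, exposed only through its matrix-vector product.
A = LinearOperator(W.shape, matvec=lambda v: d_inv_sqrt * (W @ (d_inv_sqrt * v)))

vals, vecs = eigsh(A, k=4, which='LA')                     # Lanczos: 4 leading eigenvectors
rows = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # rows fed to k-means for clustering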
92 3 Stochastic neighbor embedding Our final example is again a generalized eigenvalue problem arising in dimensionality reduction. [sent-196, score-0.202]
93 We use the stochastic neighbor embedding algorithm of [6] to project two 3-D structures to 2-D, as shown in Figure 4. [sent-197, score-0.16]
94 6 Conclusions We presented a general approach for combining Krylov solvers and fast KDE methods to accelerate machine learning techniques based on similarity metrics. [sent-200, score-0.403]
95 We demonstrated some of the methods on several datasets and presented results that shed light on the stability and convergence properties of these methods. [sent-201, score-0.143]
96 One important point to make is that these methods work better when there is structure in the data. [sent-202, score-0.04]
97 Another important avenue for further research is the application of the bounds presented in this paper in the design of adaptive algorithms. [sent-206, score-0.047]
98 Improved fast Gauss transform and efficient kernel density estimation. [sent-237, score-0.229]
99 A new error estimate of the fast Gauss transform. [sent-258, score-0.165]
100 Theory of inexact Krylov subspace methods and applications to scientific computing. [sent-270, score-0.2]
wordName wordTfidf (topN-words)
[('krylov', 0.672), ('kde', 0.277), ('arnoldi', 0.227), ('residuals', 0.216), ('gmres', 0.182), ('aq', 0.168), ('fast', 0.165), ('sne', 0.136), ('lanczos', 0.119), ('subspace', 0.092), ('ifgt', 0.09), ('gauss', 0.086), ('embedding', 0.085), ('orthogonality', 0.079), ('yk', 0.078), ('spectral', 0.074), ('multipole', 0.072), ('cg', 0.072), ('stability', 0.071), ('similarity', 0.069), ('inexact', 0.068), ('miny', 0.068), ('tolerance', 0.062), ('multiplication', 0.061), ('nystrom', 0.059), ('iteration', 0.059), ('segmentation', 0.056), ('ax', 0.055), ('gradients', 0.053), ('clustering', 0.051), ('conjugate', 0.051), ('attack', 0.05), ('xj', 0.05), ('siam', 0.049), ('storage', 0.048), ('bounds', 0.047), ('neighbor', 0.047), ('hessenberg', 0.045), ('minres', 0.045), ('dual', 0.044), ('matrix', 0.044), ('solvers', 0.043), ('columns', 0.042), ('eigenvalue', 0.042), ('scienti', 0.042), ('methods', 0.04), ('relation', 0.04), ('lemma', 0.04), ('qr', 0.04), ('greengard', 0.04), ('mahdaviani', 0.04), ('maryam', 0.04), ('recursions', 0.04), ('preconditioned', 0.04), ('xi', 0.039), ('vt', 0.039), ('entries', 0.037), ('image', 0.037), ('computational', 0.036), ('fraser', 0.036), ('fgt', 0.036), ('lang', 0.036), ('gains', 0.035), ('transform', 0.034), ('expensive', 0.034), ('squares', 0.034), ('nando', 0.034), ('presentation', 0.033), ('seconds', 0.033), ('gaussian', 0.032), ('eigenvalues', 0.032), ('shed', 0.032), ('accelerate', 0.032), ('purposes', 0.031), ('kernel', 0.03), ('sift', 0.03), ('british', 0.03), ('freitas', 0.03), ('columbia', 0.03), ('techniques', 0.03), ('iterative', 0.029), ('laplacian', 0.029), ('dimensionality', 0.028), ('stochastic', 0.028), ('yang', 0.028), ('domains', 0.027), ('cost', 0.027), ('inverting', 0.027), ('numerical', 0.026), ('ht', 0.026), ('ease', 0.026), ('ranking', 0.026), ('augmented', 0.025), ('processes', 0.025), ('running', 0.024), ('features', 0.024), ('deviation', 0.024), ('combining', 0.024), ('metrics', 0.024), ('deviations', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 71 nips-2005-Fast Krylov Methods for N-Body Learning
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
2 0.10619204 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng
Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
3 0.09315528 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
4 0.074794911 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
5 0.072185099 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
6 0.061427366 45 nips-2005-Conditional Visual Tracking in Kernel Space
7 0.057255134 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
8 0.056608584 102 nips-2005-Kernelized Infomax Clustering
9 0.056112055 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
10 0.055462264 114 nips-2005-Learning Rankings via Convex Hull Separation
11 0.055145554 126 nips-2005-Metric Learning by Collapsing Classes
12 0.054622054 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
13 0.051974531 178 nips-2005-Soft Clustering on Graphs
14 0.051514529 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
15 0.050114829 75 nips-2005-Fixing two weaknesses of the Spectral Method
16 0.048753493 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
17 0.048547201 105 nips-2005-Large-Scale Multiclass Transduction
18 0.045653533 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
19 0.043543674 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.043390706 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
topicId topicWeight
[(0, 0.16), (1, 0.078), (2, -0.035), (3, -0.02), (4, -0.098), (5, -0.017), (6, 0.002), (7, -0.01), (8, 0.024), (9, 0.033), (10, -0.004), (11, 0.03), (12, -0.017), (13, 0.047), (14, -0.023), (15, 0.065), (16, -0.057), (17, -0.059), (18, -0.066), (19, -0.033), (20, -0.026), (21, -0.066), (22, 0.042), (23, -0.07), (24, -0.005), (25, -0.005), (26, -0.023), (27, 0.08), (28, -0.034), (29, 0.053), (30, -0.045), (31, -0.028), (32, 0.078), (33, 0.016), (34, -0.001), (35, 0.098), (36, 0.063), (37, -0.07), (38, -0.081), (39, -0.014), (40, -0.037), (41, 0.024), (42, -0.009), (43, -0.03), (44, -0.078), (45, 0.019), (46, 0.032), (47, 0.032), (48, -0.079), (49, 0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.91327667 71 nips-2005-Fast Krylov Methods for N-Body Learning
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
2 0.70088392 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng
Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.
3 0.62811929 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory.
4 0.5951044 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters
Author: Pieter Abbeel, Varun Ganapathi, Andrew Y. Ng
Abstract: We consider the problem of modeling a helicopter’s dynamics based on state-action trajectories collected from it. The contribution of this paper is two-fold. First, we consider the linear models such as learned by CIFER (the industry standard in helicopter identification), and show that the linear parameterization makes certain properties of dynamical systems, such as inertia, fundamentally difficult to capture. We propose an alternative, acceleration based, parameterization that does not suffer from this deficiency, and that can be learned as efficiently from data. Second, a Markov decision process model of a helicopter’s dynamics would explicitly model only the one-step transitions, but we are often interested in a model’s predictive performance over longer timescales. In this paper, we present an efficient algorithm for (approximately) minimizing the prediction error over long time scales. We present empirical results on two different helicopters. Although this work was motivated by the problem of modeling helicopters, the ideas presented here are general, and can be applied to modeling large classes of vehicular dynamics. 1
5 0.5567109 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman
Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1
6 0.48595378 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.47954056 102 nips-2005-Kernelized Infomax Clustering
8 0.45401442 79 nips-2005-Fusion of Similarity Data in Clustering
9 0.44232532 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
10 0.43465713 45 nips-2005-Conditional Visual Tracking in Kernel Space
11 0.4056645 189 nips-2005-Tensor Subspace Analysis
12 0.39556009 126 nips-2005-Metric Learning by Collapsing Classes
13 0.39543337 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
14 0.38718525 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
15 0.37830496 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
16 0.37035924 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
17 0.37035564 159 nips-2005-Q-Clustering
18 0.36907989 84 nips-2005-Generalization in Clustering with Unobserved Features
19 0.36630222 114 nips-2005-Learning Rankings via Convex Hull Separation
20 0.36250204 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error
topicId topicWeight
[(3, 0.04), (10, 0.035), (27, 0.025), (31, 0.033), (34, 0.095), (39, 0.011), (55, 0.028), (65, 0.011), (69, 0.03), (73, 0.494), (88, 0.067), (91, 0.029)]
simIndex simValue paperId paperTitle
1 0.92989868 104 nips-2005-Laplacian Score for Feature Selection
Author: Xiaofei He, Deng Cai, Partha Niyogi
Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1
same-paper 2 0.92697871 71 nips-2005-Fast Krylov Methods for N-Body Learning
Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang
Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.
3 0.88071185 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
4 0.83837306 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
Author: Tong Zhang, Rie Kubota Ando
Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.
5 0.82528502 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey
Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1
6 0.625866 102 nips-2005-Kernelized Infomax Clustering
7 0.59263897 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering
8 0.57086384 189 nips-2005-Tensor Subspace Analysis
9 0.54670221 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
10 0.54394072 84 nips-2005-Generalization in Clustering with Unobserved Features
11 0.51424146 77 nips-2005-From Lasso regression to Feature vector machine
12 0.5128116 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
13 0.51210469 178 nips-2005-Soft Clustering on Graphs
14 0.50612342 98 nips-2005-Infinite latent feature models and the Indian buffet process
15 0.49515185 177 nips-2005-Size Regularized Cut for Data Clustering
16 0.47821754 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.46754515 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
18 0.46652991 133 nips-2005-Nested sampling for Potts models
19 0.46241903 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
20 0.45704421 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories