nips nips2005 nips2005-69 knowledge-graph by maker-knowledge-mining

69 nips-2005-Fast Gaussian Process Regression using KD-Trees


Source: pdf

Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng

Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The computation required for Gaussian process regression with n training examples is about O(n^3) during training and O(n) for each prediction. [sent-5, score-0.177]

2 This makes Gaussian process regression too slow for large datasets. [sent-6, score-0.07]

3 In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression. [sent-7, score-0.158]

4 However, a major drawback of GP models is that the computational cost of learning is about O(n^3), and the cost of making a single prediction is O(n), where n is the number of training examples. [sent-11, score-0.213]

5 In this paper, we address the scaling issue by recognizing that learning and predictions with a GP regression (GPR) model can be implemented using the matrix-vector multiplication (MVM) primitive z → K z. [sent-13, score-0.184]

6 Here, K ∈ R^{n×n} is the kernel matrix, and z ∈ R^n is an arbitrary vector. [sent-14, score-0.092]

7 For the wide class of so-called isotropic kernels, MVM can be approximated efficiently by arranging the dataset in a tree-type multiresolution data structure such as kd-trees [13], ball trees [11], or cover trees [1]. [sent-15, score-0.353]

8 This approximation can sometimes be made orders of magnitude faster than the direct computation, without sacrificing much in terms of accuracy. [sent-16, score-0.053]

9 Further, the storage requirement for the tree is O(n), while direct storage of the kernel matrix would require O(n^2) space. [sent-17, score-0.242]

10 We demonstrate the efficiency of the tree approach on several large datasets. [sent-18, score-0.073]

11 However, it is also completely straightforward to apply the ideas in this paper to other tree-type data structures, for example ball trees and cover trees, which typically scale significantly better to high dimensional data. [sent-20, score-0.078]

12 . . . , n}, xi ∈ X, yi ∈ R, sampled independently and identically distributed (i.i.d.). [sent-24, score-0.085]

13 Our goal is to predict the response y∗ on future test points x∗ with small mean-squared error under the data distribution. [sent-28, score-0.065]

14 Following the Bayesian paradigm, we place a prior distribution P (u(·)) on the function u(·) and use the posterior distribution P (u(·)|D) ∝ N (y|u, σ 2 I)P (u(·)) in order to predict y∗ on new points x∗ . [sent-30, score-0.065]

15 For a GPR model, the prior distribution is a (zero-mean) Gaussian process defined in terms of a positive definite kernel (or covariance) function K : X 2 → R. [sent-38, score-0.092]

16 In this paper, we focus on the problem of speeding up GPR under the assumption that the kernel is monotonic isotropic. [sent-41, score-0.23]

17 A kernel function K(x, x′) is called isotropic if it depends only on the Euclidean distance r = ||x − x′||_2 between the points, and it is monotonic isotropic if it can be written as a monotonic function of r. [sent-42, score-0.56]

18 Therefore, if p = M^{-1} y, the optimal prediction under the model is û∗ = k∗^T p, and the predictive variance (of P(u∗|D)) can be used to quantify our uncertainty in the prediction. [sent-50, score-0.06]

19 Once p is determined, making a prediction requires that we compute k∗^T p = Σ_{i=1}^{n} K(x∗, xi) pi = Σ_{i=1}^{n} wi pi (Equation 2), which is O(n) since it requires scanning through the entire training set and computing K(x∗, xi) for each xi in the training set. [sent-53, score-0.431]
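
As a concrete illustration of Equation 2, here is a minimal sketch of the direct O(n) prediction; the function and parameter names (rbf_kernel, d, p) are illustrative, not taken from the paper's code.

```python
import numpy as np

def rbf_kernel(x, x_prime, d=1.0):
    """Gaussian RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 d^2))."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * d ** 2))

def predict_direct(x_star, X_train, p, d=1.0):
    """Direct evaluation of k_*^T p = sum_i K(x_*, x_i) p_i (Equation 2): O(n) per query."""
    return sum(rbf_kernel(x_star, x_i, d) * p_i for x_i, p_i in zip(X_train, p))
```

The tree-based scheme described in the following sentences replaces this full scan with a pruned traversal.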

20 When the training set is very large, this becomes prohibitively slow. [sent-54, score-0.037]

21 In such situations, it is desirable to use a fast approximation instead of the exact direct implementation. [sent-55, score-0.169]

22 (Weighted Sum Approximation) The computations in Equation 2 can be thought of as a weighted sum, where wi = K(x∗, xi) is the weight on the i-th summand pi. [sent-57, score-0.325]

23 We observe that if the dataset is divided into groups where all data points in a group have similar weights, then it is possible to compute a fast approximation to the above weighted sum. [sent-58, score-0.268]

24 For example, let G be a set of data points that all have weights near some value w. [sent-59, score-0.109]

25 The contribution to the weighted sum by points in G is Σ_{i:xi∈G} wi pi = Σ_{i:xi∈G} w pi + Σ_{i:xi∈G} (wi − w) pi = w Σ_{i:xi∈G} pi + Σ_{i:xi∈G} εi pi, where εi = wi − w. [sent-60, score-0.555]

26 Assuming that Σ_{i:xi∈G} pi is known in advance, w Σ_{i:xi∈G} pi can then be computed in constant time and used as an approximation to Σ_{i:xi∈G} wi pi if Σ_{i:xi∈G} εi pi is small. [sent-61, score-0.558]
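
A minimal sketch of this grouping idea, with illustrative function names; it assumes the group's unweighted sum Σ pi has been precomputed.

```python
def group_contribution_exact(x_star, X_group, p_group, kernel):
    """Exact contribution of group G: sum over i in G of w_i p_i, with w_i = K(x_*, x_i)."""
    return sum(kernel(x_star, x_i) * p_i for x_i, p_i in zip(X_group, p_group))

def group_contribution_approx(w, unweighted_sum):
    """Constant-time approximation w * sum_i p_i; the error is sum_i eps_i p_i with eps_i = w_i - w."""
    return w * unweighted_sum
```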

27 We note that for a continuous isotropic kernel function, the weights wi = K(x∗, xi) and wj = K(x∗, xj) will be similar if xi and xj are close to each other. [sent-62, score-0.525]

28 (Figure 1: Example of bounding rectangles for nodes in the first three levels of a kd-tree.) [sent-63, score-0.129]

29 In addition, if the kernel function monotonically decreases to zero with increasing ||xi − xj||, then points that are far away from the query point x∗ will all have weights near zero. [sent-64, score-0.362]

30 Given a new query, we would like to automatically group points together that have similar weights. [sent-65, score-0.096]

31 But the weights are dependent on the query point and hence the best grouping of the data will also be dependent on the query point. [sent-66, score-0.358]

32 Thus, the problem we now face is, given a query point, how to quickly divide the dataset into groups such that data points in the same group have similar weights. [sent-67, score-0.25]

33 (The kd-tree algorithm) A kd-tree [13] is a binary tree that recursively partitions a set of data points. [sent-70, score-0.073]

34 Each node in the kd-tree contains a subset of the data, and records the bounding hyper-rectangle for this subset. [sent-71, score-0.183]

35 Any node that contains more than 1 data point has two child nodes, and the data points contained by the parent node are split among the children by cutting the parent node’s bounding hyper-rectangle in the middle of its widest dimension. [sent-73, score-0.562]

36 At a node ND whose set of data points is XND, in addition to the bounding box we also store: [sent-76, score-0.278]

37 1. NND = |XND|: the number of data points contained by ND. [sent-77, score-0.092]

38 2. S_ND^Unweighted = Σ_{xi∈XND} pi: the unweighted sum corresponding to the data contained by ND. Now, let S_ND^Weighted = Σ_{i:xi∈XND} K(x∗, xi) pi (Equation 3) be the weighted sum corresponding to node ND. [sent-80, score-0.708]
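
A sketch of a kd-tree node that caches the bounding box, the count N_ND, and the unweighted sum S_ND^Unweighted, splitting at the midpoint of the widest dimension as described; this is an illustrative reconstruction, not the authors' implementation.

```python
import numpy as np

class KDNode:
    def __init__(self, X, p, leaf_size=1):
        X, p = np.asarray(X, dtype=float), np.asarray(p, dtype=float)
        self.lo, self.hi = X.min(axis=0), X.max(axis=0)   # bounding hyper-rectangle
        self.count = X.shape[0]                           # N_ND
        self.unweighted_sum = p.sum()                     # sum_{x_i in X_ND} p_i
        self.X, self.p = X, p                             # kept so leaves can be evaluated exactly
        self.left = self.right = None
        if self.count > leaf_size:
            dim = int(np.argmax(self.hi - self.lo))       # widest dimension
            mid = 0.5 * (self.lo[dim] + self.hi[dim])     # cut in the middle
            mask = X[:, dim] <= mid
            if mask.any() and (~mask).any():              # guard against degenerate splits
                self.left = KDNode(X[mask], p[mask], leaf_size)
                self.right = KDNode(X[~mask], p[~mask], leaf_size)
```

Storing X and p at every node is wasteful and is done here only to keep the sketch short; a real implementation would keep them at the leaves.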

39 One way to calculate S_ND^Weighted is to simply have the two children of ND recursively compute S_Left(ND)^Weighted and S_Right(ND)^Weighted (where Left(ND) and Right(ND) are the two children of ND) and then sum the two results. [sent-81, score-0.066]

40 (There are numerous other possible kd-tree splitting criteria; ours is the same as the one used in [9] and [5].) [sent-82, score-0.103]

41 This takes O(n) time—same as the direct computation—since all O(n) nodes need to be processed. [sent-83, score-0.07]

42 However, if we only want an approximate result for the weighted sum, then we can cut off the recursion at nodes whose data points have nearly identical weights for the given query point. [sent-84, score-0.458]

43 Since each node maintains a bounding box of the data points that it owns, we can easily bound the maximum weight variation of the data points owned by a node (as in [9]). [sent-85, score-0.442]

44 The nearest and farthest points in the bounding box to the query point can be computed in O(input dimension) operations, and since the kernel function is isotropic monotonic, these points give us the maximum and minimum possible weights wmax and wmin of any data point in the bounding box. [sent-86, score-1.223]

45 Now, whenever the difference between wmax and wmin is small, we can cut off the recursion and approximate the weighted sum in Equation 3 by w · S_ND^Unweighted, where w = (wmin + wmax)/2. [sent-87, score-1.001]
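
A sketch of how the bounding box yields wmax and wmin for a monotonically decreasing isotropic kernel: the closest point of the box is the query clamped to the box, the farthest is the corner farthest from the query. It assumes the KDNode sketch above; kernel_of_r is any decreasing function of the distance r.

```python
import numpy as np

def weight_bounds(node, x_star, kernel_of_r):
    """Return (w_min, w_max) over all points in node's bounding box for a
    monotonically decreasing isotropic kernel kernel_of_r(r)."""
    x_star = np.asarray(x_star, dtype=float)
    nearest = np.clip(x_star, node.lo, node.hi)            # closest point of the box to x_star
    farthest = np.where(np.abs(x_star - node.lo) >= np.abs(x_star - node.hi),
                        node.lo, node.hi)                  # farthest corner of the box
    r_min = np.linalg.norm(x_star - nearest)
    r_max = np.linalg.norm(x_star - farthest)
    return kernel_of_r(r_max), kernel_of_r(r_min)          # (w_min, w_max)
```

Whenever the cutoff test fires, the node's weighted sum is approximated by 0.5 * (w_min + w_max) * node.unweighted_sum.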

46 The speed and accuracy of the approximation are highly dependent on the cutoff criterion. [sent-88, score-0.172]

47 The following cutoff rule was used in [9]: wmax − wmin ≤ 2ε(WSoFar + NND wmin). [sent-90, score-0.932]

48 Here, WSoFar is the weight accumulated so far in the computation and WSoFar +NND wmin serves as a lower bound on the total sum of weights involved in the regression. [sent-91, score-0.425]

49 In our experiments, we found that although the above cutoff rule ensures the error incurred at any particular data point in ND is small, the total error incurred by all the data points in ND can still be high if NND is very large. [sent-92, score-0.26]

50 In our experiments (not reported here), their method gave poor performance on the GPR task, in many cases incurring significant errors in the predictions (or, alternatively, running no faster than exact computation, if a sufficiently small ε is chosen to prevent the large accumulation of errors). [sent-93, score-0.154]

51 Hence, we chose instead the following cutoff rule: NND(wmax − wmin) ≤ 2ε(WSoFar + NND wmin), which also takes into account the total number of points contained in a node. [sent-94, score-0.857]

52 From the formula above, we see that the decision of whether to cut off computation at a node depends on the value of WSoFar (the total weight of all the points that have been added to the summation so far). [sent-95, score-0.34]

53 Thus it is desirable to quickly accumulate weights at the beginning of the computations, so that more of the later recursions can be cut off. [sent-96, score-0.07]

54 This can be accomplished by going into the child node that’s nearer to the query point first when we recurse into the children of a node that doesn’t meet the cutoff criteria. [sent-97, score-0.665]
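
Putting these pieces together, a sketch of the recursive weighted-sum computation that uses the modified cutoff rule NND(wmax − wmin) ≤ 2ε(WSoFar + NND wmin) and visits the nearer child first. It builds on the KDNode and weight_bounds sketches above and is an illustration, not the authors' code.

```python
import numpy as np

def weighted_sum(node, x_star, kernel_of_r, eps, state=None):
    """Approximate S_ND^Weighted = sum_i K(x_*, x_i) p_i over the points owned by `node`.
    `state` carries the running sum and W_SoFar across the recursion."""
    x_star = np.asarray(x_star, dtype=float)
    if state is None:
        state = {"sum": 0.0, "w_so_far": 0.0}
    w_min, w_max = weight_bounds(node, x_star, kernel_of_r)
    # Cutoff: the node's total weight spread is small relative to a lower bound
    # on the total weight, so use the midpoint weight for the whole node.
    if node.count * (w_max - w_min) <= 2.0 * eps * (state["w_so_far"] + node.count * w_min):
        w_mid = 0.5 * (w_min + w_max)
        state["sum"] += w_mid * node.unweighted_sum
        state["w_so_far"] += node.count * w_mid
    elif node.left is None:
        # Leaf (or unsplittable node): evaluate exactly.
        for x_i, p_i in zip(node.X, node.p):
            w_i = kernel_of_r(np.linalg.norm(x_star - x_i))
            state["sum"] += w_i * p_i
            state["w_so_far"] += w_i
    else:
        # Visit the nearer child first so that W_SoFar grows quickly.
        d_left = np.linalg.norm(x_star - np.clip(x_star, node.left.lo, node.left.hi))
        d_right = np.linalg.norm(x_star - np.clip(x_star, node.right.lo, node.right.hi))
        first, second = (node.left, node.right) if d_left <= d_right else (node.right, node.left)
        weighted_sum(first, x_star, kernel_of_r, eps, state)
        weighted_sum(second, x_star, kernel_of_r, eps, state)
    return state["sum"]
```

A prediction is then weighted_sum(root, x_star, kernel_of_r, eps) on a tree built with the entries of p.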

55 (In contrast, [9] always visits the children in left-right order, which in our experiments also gave significantly worse performance than our version. [sent-98, score-0.066]

56 However, in practice there are many ways to quickly obtain approximate solutions to linear systems. [sent-101, score-0.058]

57 Since the system matrix is symmetric positive definite, the conjugate gradient (CG) algorithm can be applied. [sent-102, score-0.036]

58 Briefly, CG ensures that z after iteration k is a maximizer of q over a (Krylov) subspace of dimension k. [sent-104, score-0.038]

59 Thus, z “converges” to p (the unconstrained maximizer of q) after n steps, but intermediate z can be used as approximate solutions. [sent-106, score-0.07]

60 Crucially, the only operation on M performed in each iteration of CG is a matrix-vector multiplication (MVM) with M. [sent-109, score-0.075]

61 Since M = K + σ^2 I, speeding up MVM with M is critically dependent on our ability to perform fast MVM with the kernel matrix K. [sent-110, score-0.239]

62 We can apply the algorithm from Section 3 to perform fast MVM. [sent-111, score-0.061]
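
A sketch of the training step: conjugate gradients on M = K + σ^2 I, where each multiplication Kv is delegated to a user-supplied routine (for example, n calls to the tree-based weighted sum, with the cached unweighted sums rebuilt for v). The use of scipy's cg and LinearOperator here is my choice for brevity, not something the paper specifies.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_for_p(y, kernel_mvm, sigma2):
    """Approximately solve (K + sigma2 * I) p = y with conjugate gradients.
    kernel_mvm(v) must return (an approximation of) K v."""
    n = len(y)
    M = LinearOperator((n, n), matvec=lambda v: kernel_mvm(v) + sigma2 * v, dtype=float)
    p, info = cg(M, np.asarray(y, dtype=float))
    return p
```

Because M is symmetric positive definite, CG is guaranteed to converge, and intermediate iterates already give usable approximations to p.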

63 Thus, ki has the same form as that of the vector k ∗ used in the prediction step. [sent-116, score-0.06]

64 Indeed, given a dataset, we need only ever find a single kd-tree structure for it, and the same kd-tree structure can then be used to make multiple predictions or multiple MVM operations. [sent-124, score-0.043]

65 . . . , n (to obtain the vector resulting from one MVM operation), we can also share the same pre-computed partial unweighted sums in the internal nodes of the tree. [sent-128, score-0.17]

66 Only when v (or p) changes do we need to change the partial unweighted sums (discussed in Section 3.2) [sent-129, score-0.125]

67 of v stored in the internal nodes (an O(n) operation). [sent-130, score-0.045]

68 (Performance Evaluation) We evaluate our kd-tree implementation of GPR against an implementation that uses direct computation for the inner products. [sent-131, score-0.058]

69 Our experiments were performed on the nine regression datasets in Table 1. [sent-132, score-0.07]

70 Data for the helicopter experiments come from an autonomous helicopter flight project [10], and the three tasks were to model three subdynamics of the helicopter, namely its yaw rate, forward velocity, and lateral velocity one timestep later, as a function of the helicopter's current state. [sent-133, score-0.402]

71 [Table 2 rows: Helicopter yaw rate, Helicopter x-velocity, Helicopter y-velocity, Mote 10 temperature, Mote 47 temperature, Mote 47 humidity, Housing income, Housing value, Housing age; of the numeric columns only the fragment "Exact cost 14." survived extraction.] [sent-135, score-0.581]

72 Table 2: Prediction performance on 9 regression problems. [sent-180, score-0.07]

73 The error reported is the mean absolute prediction error. [sent-185, score-0.06]

74 For all experiments, we used the Gaussian RBF kernel K(x, x′) = exp(−||x − x′||^2 / (2 d^2)), which is monotonic isotropic, with d and σ chosen to be reasonable values for each problem (via cross validation). [sent-186, score-0.173]

75 The parameter used in the cutoff rule was set to be 0. [sent-187, score-0.143]

76 (Prediction performance) Our first set of experiments compares the prediction time of the kd-tree algorithm with exact computation, given a precomputed p. [sent-190, score-0.143]

77 Our average prediction times are given in Table 2. [sent-191, score-0.06]

78 These numbers include the cost of building the kd-tree (but remain small since the cost is then amortized over all the examples in the test set). [sent-192, score-0.116]

79 Further, it incurs only a very small amount of additional error compared to the exact algorithm. [sent-196, score-0.083]

80 Our second set of experiments measures the training time (i.e., solving the system of Equations 4) using our kd-tree algorithm for the MVM operation, compared to exact computation. [sent-200, score-0.083]

81 For both approximate and exact MVM, conjugate gradient was used. (Footnote fragment: the Mote tasks predict a mote's readings from those of nearby motes.) [sent-201, score-0.151]

82 The housing experiments make use of data collected from the 1990 Census in California. [sent-202, score-0.2]

83 Table 3 data, reconstructed from the flattened row (the Speedup column was mostly lost in extraction; its first entry begins "82."): Helicopter yaw rate: Exact cost 22885, Tree cost 279; Helicopter x-velocity: 23412, 619; Helicopter y-velocity: 14341, 443; Mote 10 temperature: 2071, 253; Mote 47 temperature: 2531, 487; Mote 47 humidity: 2121, 398; Housing income: 1922, 581; Housing value: 997, 138; Housing age: 1496, 338. [sent-204, score-0.639]

84 Table 3: Training time on the 9 regression problems. [sent-231, score-0.07]

85 (Related Work) Multiresolution tree data structures have been used to speed up the computation of a wide variety of machine learning algorithms [9, 5, 7, 14]. [sent-237, score-0.106]

86 GP regression was introduced to the machine learning community by Rasmussen and Williams [19]. [sent-238, score-0.07]

87 The stability of Krylov subspace iterative solvers (such as CG) with approximate matrix-vector multiplication is discussed in [4]. [sent-240, score-0.1]

88 Sparse methods can typically be trained in O(n d2 ) (including the active forward selection of the subset) and require O(d) prediction time only. [sent-242, score-0.06]

89 Their approach, called the improved fast Gauss transform (IFGT), partitions the space with a k-centers clustering of D and uses a Taylor expansion of the RBF kernel in order to cache repeated computations. [sent-245, score-0.153]

90 The IFGT is limited to the RBF kernel, while our method can be used with all monotonic isotropic kernels. [sent-246, score-0.234]

91 As a topic for future work, we believe it may be possible to apply IFGT’s Taylor expansions at each node of the kd-tree’s query-dependent multiresolution clustering, to obtain an algorithm that enjoys the best properties of both. [sent-247, score-0.175]

92 (Isotropic Kernels) Recall that an isotropic kernel K(x, x′) can be written as a function of the Euclidean distance r = ||x − x′||. [sent-249, score-0.245]

93 While the RBF kernel of the form exp(−r^2) is the most frequently used isotropic kernel in machine learning, there are many other isotropic kernels to which our method can be applied without many changes (since the kd-tree cutoff criterion depends on the pairwise Euclidean distances only). [sent-250, score-0.677]

94 An interesting class of kernels is the Matérn model (see [17], Sect. [sent-251, score-0.084]

95 The errors reported in this table are identical to Table 2, since for the kd-tree results we always both trained and made predictions using the fast approximate method. [sent-255, score-0.166]

96 For ν = 1/2 we have the “random walk” Ornstein-Uhlenbeck kernel of the form exp(−αr), and the RBF kernel is obtained in the limit ν → ∞. [sent-257, score-0.06]
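
Because the tree cutoff only needs a monotonically decreasing function of the distance r, other isotropic kernels plug straight into the sketches above. Two illustrative examples follow; the parameter names d and alpha are assumptions matching the notation in this summary.

```python
import numpy as np

def rbf_of_r(r, d=1.0):
    """Gaussian RBF as a function of distance: exp(-r^2 / (2 d^2)); the nu -> infinity Matern limit."""
    return np.exp(-r ** 2 / (2.0 * d ** 2))

def ou_of_r(r, alpha=1.0):
    """Ornstein-Uhlenbeck (Matern nu = 1/2) kernel: exp(-alpha * r)."""
    return np.exp(-alpha * r)
```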

97 The RBF kernel forces u(·) to be very smooth, which can lead to bad predictions for training data with partly rough behaviour, and its uncritical usage is therefore discouraged in Geostatistics (where the use of GP models was pioneered). [sent-258, score-0.172]

98 We believe that our kd-trees approach holds rich promise for speeding up GPR with other isotropic kernels such as the Matérn and Ornstein-Uhlenbeck kernels. [sent-260, score-0.294]

99 Fast sparse Gaussian process methods: The informative vector machine. [sent-290, score-0.037]

100 Efficient kernel machines using the improved fast Gauss transform. [sent-342, score-0.153]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wmin', 0.311), ('helicopter', 0.291), ('wsofar', 0.263), ('mvm', 0.239), ('snd', 0.215), ('mote', 0.208), ('housing', 0.2), ('nnd', 0.191), ('wmax', 0.167), ('gpr', 0.167), ('isotropic', 0.153), ('cutoff', 0.143), ('pi', 0.129), ('query', 0.128), ('unweighted', 0.125), ('cg', 0.121), ('weighted', 0.111), ('median', 0.111), ('income', 0.104), ('node', 0.099), ('xnd', 0.096), ('gp', 0.092), ('kernel', 0.092), ('temperature', 0.087), ('xi', 0.085), ('bounding', 0.084), ('humidity', 0.083), ('yaw', 0.083), ('exact', 0.083), ('monotonic', 0.081), ('kt', 0.079), ('age', 0.079), ('multiresolution', 0.076), ('tree', 0.073), ('weightedsum', 0.072), ('regression', 0.07), ('rbf', 0.069), ('child', 0.068), ('andrew', 0.067), ('children', 0.066), ('points', 0.065), ('mat', 0.062), ('nearer', 0.062), ('fast', 0.061), ('prediction', 0.06), ('cost', 0.058), ('ifgt', 0.057), ('speeding', 0.057), ('house', 0.05), ('gaussian', 0.048), ('kan', 0.048), ('rooms', 0.048), ('sfarther', 0.048), ('snearer', 0.048), ('trees', 0.046), ('nodes', 0.045), ('kernels', 0.044), ('weights', 0.044), ('predictions', 0.043), ('wi', 0.042), ('krylov', 0.042), ('stanford', 0.041), ('rn', 0.04), ('multiplication', 0.038), ('ight', 0.038), ('maximizer', 0.038), ('operation', 0.037), ('training', 0.037), ('sparse', 0.037), ('sum', 0.037), ('conjugate', 0.036), ('speedup', 0.035), ('nd', 0.034), ('euclidean', 0.034), ('primitive', 0.033), ('recursion', 0.033), ('computation', 0.033), ('xj', 0.033), ('cover', 0.032), ('approximate', 0.032), ('group', 0.031), ('sensor', 0.031), ('solvers', 0.03), ('gauss', 0.03), ('table', 0.03), ('box', 0.03), ('international', 0.029), ('yang', 0.029), ('seeger', 0.029), ('eric', 0.029), ('dependent', 0.029), ('autonomous', 0.028), ('faster', 0.028), ('parent', 0.027), ('contained', 0.027), ('quickly', 0.026), ('taylor', 0.026), ('storage', 0.026), ('incurred', 0.026), ('direct', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng

Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.

2 0.21849461 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

Author: Pieter Abbeel, Varun Ganapathi, Andrew Y. Ng

Abstract: We consider the problem of modeling a helicopter’s dynamics based on state-action trajectories collected from it. The contribution of this paper is two-fold. First, we consider the linear models such as learned by CIFER (the industry standard in helicopter identification), and show that the linear parameterization makes certain properties of dynamical systems, such as inertia, fundamentally difficult to capture. We propose an alternative, acceleration based, parameterization that does not suffer from this deficiency, and that can be learned as efficiently from data. Second, a Markov decision process model of a helicopter’s dynamics would explicitly model only the one-step transitions, but we are often interested in a model’s predictive performance over longer timescales. In this paper, we present an efficient algorithm for (approximately) minimizing the prediction error over long time scales. We present empirical results on two different helicopters. Although this work was motivated by the problem of modeling helicopters, the ideas presented here are general, and can be applied to modeling large classes of vehicular dynamics. 1

3 0.11016823 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore

Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.

4 0.10647003 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while, in generalization it greatly outperforms the information gain approach proposed by Seeger et al, especially on the quality of predictive distributions. 1

5 0.10619204 71 nips-2005-Fast Krylov Methods for N-Body Learning

Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang

Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.

6 0.094638355 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

7 0.092907593 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

8 0.091538973 70 nips-2005-Fast Information Value for Graphical Models

9 0.090659007 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

10 0.080434062 33 nips-2005-Bayesian Sets

11 0.076502711 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

12 0.072395474 125 nips-2005-Message passing for task redistribution on sparse graphs

13 0.067705244 105 nips-2005-Large-Scale Multiclass Transduction

14 0.066935688 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

15 0.066158794 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

16 0.065709263 50 nips-2005-Convex Neural Networks

17 0.064836197 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

18 0.064116299 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

19 0.064021774 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

20 0.062652916 80 nips-2005-Gaussian Process Dynamical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.214), (1, 0.069), (2, -0.01), (3, -0.031), (4, 0.008), (5, -0.066), (6, -0.005), (7, 0.08), (8, 0.109), (9, 0.259), (10, 0.065), (11, -0.004), (12, 0.018), (13, 0.089), (14, 0.011), (15, 0.081), (16, -0.025), (17, -0.069), (18, -0.148), (19, -0.035), (20, -0.034), (21, -0.077), (22, -0.036), (23, -0.114), (24, -0.129), (25, -0.055), (26, -0.133), (27, 0.084), (28, 0.005), (29, 0.011), (30, -0.017), (31, -0.04), (32, 0.074), (33, 0.058), (34, 0.041), (35, 0.115), (36, 0.013), (37, -0.178), (38, -0.013), (39, -0.151), (40, -0.127), (41, 0.033), (42, 0.087), (43, -0.091), (44, -0.304), (45, 0.074), (46, -0.003), (47, 0.149), (48, -0.099), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9074465 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng

Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.

2 0.81768829 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

Author: Pieter Abbeel, Varun Ganapathi, Andrew Y. Ng

Abstract: We consider the problem of modeling a helicopter’s dynamics based on state-action trajectories collected from it. The contribution of this paper is two-fold. First, we consider the linear models such as learned by CIFER (the industry standard in helicopter identification), and show that the linear parameterization makes certain properties of dynamical systems, such as inertia, fundamentally difficult to capture. We propose an alternative, acceleration based, parameterization that does not suffer from this deficiency, and that can be learned as efficiently from data. Second, a Markov decision process model of a helicopter’s dynamics would explicitly model only the one-step transitions, but we are often interested in a model’s predictive performance over longer timescales. In this paper, we present an efficient algorithm for (approximately) minimizing the prediction error over long time scales. We present empirical results on two different helicopters. Although this work was motivated by the problem of modeling helicopters, the ideas presented here are general, and can be applied to modeling large classes of vehicular dynamics. 1

3 0.6301924 71 nips-2005-Fast Krylov Methods for N-Body Learning

Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang

Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.

4 0.48780343 59 nips-2005-Dual-Tree Fast Gauss Transforms

Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray

Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory. 1

5 0.43216106 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore

Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.

6 0.41702771 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

7 0.41564378 33 nips-2005-Bayesian Sets

8 0.36082765 70 nips-2005-Fast Information Value for Graphical Models

9 0.35566315 125 nips-2005-Message passing for task redistribution on sparse graphs

10 0.34120274 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

11 0.32349017 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

12 0.3221789 160 nips-2005-Query by Committee Made Real

13 0.31626347 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

14 0.31247494 105 nips-2005-Large-Scale Multiclass Transduction

15 0.29957888 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

16 0.29700837 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

17 0.29520559 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

18 0.28870985 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

19 0.28209564 195 nips-2005-Transfer learning for text classification

20 0.28183255 138 nips-2005-Non-Local Manifold Parzen Windows


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.047), (10, 0.043), (18, 0.012), (27, 0.014), (31, 0.058), (34, 0.093), (39, 0.012), (55, 0.062), (69, 0.055), (71, 0.328), (73, 0.071), (88, 0.082), (91, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83199376 39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects

Author: Jean-pascal Pfister, Wulfram Gerstner

Abstract: While classical experiments on spike-timing dependent plasticity analyzed synaptic changes as a function of the timing of pairs of pre- and postsynaptic spikes, more recent experiments also point to the effect of spike triplets. Here we develop a mathematical framework that allows us to characterize timing based learning rules. Moreover, we identify a candidate learning rule with five variables (and 5 free parameters) that captures a variety of experimental data, including the dependence of potentiation and depression upon pre- and postsynaptic firing frequencies. The relation to the Bienenstock-Cooper-Munro rule as well as to some timing-based rules is discussed. 1

same-paper 2 0.7619341 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

Author: Yirong Shen, Matthias Seeger, Andrew Y. Ng

Abstract: The computation required for Gaussian process regression with n training examples is about O(n3 ) during training and O(n) for each prediction. This makes Gaussian process regression too slow for large datasets. In this paper, we present a fast approximation method, based on kd-trees, that significantly reduces both the prediction and the training times of Gaussian process regression.

3 0.47192183 102 nips-2005-Kernelized Infomax Clustering

Author: David Barber, Felix V. Agakov

Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1

4 0.47188151 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia

Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.

5 0.46451241 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

Author: John D. Lafferty, Pradeep K. Ravikumar

Abstract: We present a family of approximation techniques for probabilistic graphical models, based on the use of graphical preconditioners developed in the scientific computing literature. Our framework yields rigorous upper and lower bounds on event probabilities and the log partition function of undirected graphical models, using non-iterative procedures that have low time complexity. As in mean field approaches, the approximations are built upon tractable subgraphs; however, we recast the problem of optimizing the tractable distribution parameters and approximate inference in terms of the well-studied linear systems problem of obtaining a good matrix preconditioner. Experiments are presented that compare the new approximation schemes to variational methods. 1

6 0.46230358 144 nips-2005-Off-policy Learning with Options and Recognizers

7 0.46218386 177 nips-2005-Size Regularized Cut for Data Clustering

8 0.45792189 23 nips-2005-An Application of Markov Random Fields to Range Sensing

9 0.45770997 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

10 0.45747337 178 nips-2005-Soft Clustering on Graphs

11 0.45693332 184 nips-2005-Structured Prediction via the Extragradient Method

12 0.45464832 78 nips-2005-From Weighted Classification to Policy Search

13 0.45356518 98 nips-2005-Infinite latent feature models and the Indian buffet process

14 0.45327571 30 nips-2005-Assessing Approximations for Gaussian Process Classification

15 0.45320821 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

16 0.45290595 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

17 0.45270622 50 nips-2005-Convex Neural Networks

18 0.45254597 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

19 0.45247728 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

20 0.45223567 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification