nips nips2009 nips2009-139 knowledge-graph by maker-knowledge-mining

139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems


Source: pdf

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body simulation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1].

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. [sent-14, score-0.509]

2 This class of problems includes the pairwise kernel summation used in kernel density estimation and kernel machines and the all-nearest-neighbors computation for classification and manifold learning. [sent-24, score-0.859]

3 In all the problems considered here, the magnitude of the effect of any reference r on a query decreases with the distance between them. [sent-31, score-0.22]

4 Therefore, the net effect on the query is dominated by references that are “close by”. [sent-37, score-0.353]

5 A space-partitioning tree divides the space containing the point set in a hierarchical fashion, allowing for variable resolution to identify major contributing points efficiently. [sent-39, score-0.263]

6 One approach for employing space-partitioning trees is to consider each query point separately – i.e., to answer one query at a time. [sent-41, score-0.452]

7 This approach lends itself to single-tree algorithms, in which the references are stored in a tree, and the tree is traversed once for each query. [sent-44, score-0.237]

8 By considering the distance between the query and a collection of references stored in a tree node, the effect of the references can be approximated or ignored if the distances involved are large enough, with appropriate accuracy guarantees for some methods. [sent-45, score-0.66]
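The pruning idea in the sentence above can be made concrete with a small sketch. The code below is illustrative only and assumes a generic ball-shaped space-partitioning node (it is not the paper's cover-tree FindNN); the Node class, dist helper, and single_tree_nn function are hypothetical names introduced here.

```python
import heapq
import math

class Node:
    """A node of a space-partitioning tree: a bounding ball plus points (stored at leaves)."""
    def __init__(self, center, radius, points=None, children=()):
        self.center = center
        self.radius = radius
        self.points = points or []
        self.children = list(children)

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_tree_nn(query, root):
    """Best-first single-tree nearest-neighbor search with triangle-inequality pruning."""
    best_pt, best_d = None, float("inf")
    # Lower bound on the distance from the query to any point inside a node's ball
    lower = lambda n: max(0.0, dist(query, n.center) - n.radius)
    heap = [(lower(root), id(root), root)]
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:
            continue  # prune: no point in this subtree can beat the current best
        for p in node.points:
            d = dist(query, p)
            if d < best_d:
                best_pt, best_d = p, d
        for child in node.children:
            heapq.heappush(heap, (lower(child), id(child), child))
    return best_pt, best_d
```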

9 The kd-tree structure [2] was developed to obtain the nearest-neighbors of a given query in expected logarithmic time and has also been used for efficient kernel summations [3, 4]. [sent-48, score-0.745]

10 Navigating nets guarantee O(log N) time per query for nearest neighbor search for low-intrinsic-dimensional data. [sent-54, score-0.558]

11 The cover tree data structure [1] improves over these two results by both guaranteeing a worst-case runtime for nearest-neighbor and providing efficient computation in practice relative to kd-trees. [sent-57, score-0.784]

12 The approach described above can be applied to every single query to improve the O(N²) cost of the brute-force computation. [sent-62, score-0.353]

13 A dual-tree approach exploits the structure of both 𝒬 and ℛ by constructing a space-partitioning tree on each set. [sent-67, score-0.237]

14 Both trees are descended, allowing the contribution from a distant reference node to be pruned for an entire node of query points. [sent-68, score-0.877]
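As a rough illustration of the dual-tree descent and node-level pruning just described (a sketch under simplifying assumptions, not the paper's cover-tree FindAllNN), the code below reuses the Node class and dist helper from the earlier sketch and assumes points are stored only at leaves and partitioned among children.

```python
def subtree_points(node):
    """All points stored anywhere under `node` (recomputed for clarity, not for speed)."""
    pts = list(node.points)
    for c in node.children:
        pts.extend(subtree_points(c))
    return pts

def dual_tree_all_nn(qnode, rnode, best):
    """Dual-tree all-nearest-neighbors sketch; best[q] = (distance, reference) per query tuple q."""
    qpts = subtree_points(qnode)
    if not qpts:
        return
    # Worst current best distance over every query under qnode
    worst_best = max(best[tuple(q)][0] for q in qpts)
    # Lower bound on any query-under-qnode to reference-under-rnode distance
    lower = max(0.0, dist(qnode.center, rnode.center) - qnode.radius - rnode.radius)
    if lower >= worst_best:
        return  # prune this reference node for the entire query node
    if not qnode.children and not rnode.children:
        for q in qpts:  # base case: compare the points directly
            for r in rnode.points:
                d = dist(q, r)
                if d < best[tuple(q)][0]:
                    best[tuple(q)] = (d, r)
        return
    for qc in (qnode.children or [qnode]):  # descend whichever tree still splits
        for rc in (rnode.children or [rnode]):
            dual_tree_all_nn(qc, rc, best)
```

A caller would initialize best = {tuple(q): (float("inf"), None) for q in queries} and invoke dual_tree_all_nn(query_root, reference_root, best).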

15 These dual-tree algorithms have been shown to be significantly more efficient in practice than the corresponding single-tree algorithms for nearest neighbor search and kernel summations [9, 10, 11]. [sent-69, score-0.627]

16 Although dual-tree algorithms appear empirically to exhibit O(N) growth, they lack rigorous, general runtime bounds. [sent-71, score-0.331]

17 The fast multipole method (FMM)[7] for particle simulations, considered one of the breakthrough algorithms of the 20th century, has a non-rigorous runtime analysis based on the uniform distribution. [sent-76, score-0.451]

18 A bound was given in [12], but it concerned the construction time of the tree and not the querying time. [sent-80, score-0.237]

19 Existing analyses give O(N) runtime bounds for the monochromatic case, but it is not clear how to extend the analysis to a bichromatic problem. [sent-83, score-0.664]

20 In Beygelzimer et al., 2006 [1], the authors conjecture, but do not prove, that the cover tree data structure using a dual-tree algorithm can compute the monochromatic all-nearest-neighbors problem in O(N) time. [sent-88, score-0.646]

21 We prove O(N) runtime bounds for several important instances of the dual-tree algorithms for the first time using the cover tree data structure [1]. [sent-92, score-0.768]

22 We prove the first worst-case bounds for any practical kernel summation algorithms. [sent-93, score-0.379]

23 We also provide the first general runtime proofs for dual-tree algorithms on bichromatic problems. [sent-94, score-0.492]

24 In Section 2, we review the cover tree data structure and state the lemmas necessary for the remainder of the paper. [sent-133, score-0.409]

25 In Section 4, we state the absolute and relative error guarantees for kernel summations and again prove the linear running time of the proposed algorithms. [sent-136, score-0.708]

26 In the same section, we apply the kernel summation result to the N-body simulation problem. [sent-137, score-0.34]

27 The cover tree has two different representations: the implicit representation consists of infinitely many levels C_i indexed by the integers i. [sent-192, score-0.409]

28 The explicit representation is required to store the tree in O(N) space. [sent-200, score-0.289]

29 It coalesces all nodes in the tree for which the only child is the self-child. [sent-202, score-0.271]

30 Given a cover tree T built on a set ℛ, the nearest neighbor of a query q can be found with the FindNN subroutine in Algorithm 1. [sent-293, score-0.558]

31 The nearest neighbor of a query q can be found with the FindNN subroutine in Algorithm 1. [sent-294, score-0.256]

32 The algorithm uses the triangle inequality to prune away portions of the tree that contain points distant from q. [sent-295, score-0.327]
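A minimal sketch of the level-wise candidate-set refinement behind FindNN is given below. It is an approximation under stated assumptions, not the published pseudocode: the CTNode class is hypothetical, and the pruning threshold assumes the explicit-representation invariant that every descendant of a node at scale i lies within 2^(i+1) of that node's point.

```python
class CTNode:
    """Hypothetical explicit cover tree node: a point, its scale i, and its children."""
    def __init__(self, point, scale, children=()):
        self.point = point
        self.scale = scale
        self.children = list(children)

def cover_tree_find_nn(query, root):
    """Refine a candidate set level by level, pruning nodes whose subtree cannot hold the NN."""
    candidates = [root]
    while any(n.children for n in candidates):
        kids = []
        for n in candidates:
            kids.extend(n.children or [n])  # a childless node stands for itself
        # Distance to some actual point, hence an upper bound on the NN distance
        d_min = min(dist(query, n.point) for n in kids)
        # Keep a node only if some descendant could still be closer than d_min
        candidates = [n for n in kids
                      if dist(query, n.point) <= d_min + 2 ** (n.scale + 1)]
    best = min(candidates, key=lambda n: dist(query, n.point))
    return best.point, dist(query, best.point)
```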

33 The following theorem provides a runtime bound for the single point search. [sent-297, score-0.466]

34 Batch Query: The dual-tree algorithm for all-nearest-neighbors (FindAllNN subroutine in Algorithm 1) using cover trees is provided in Beygelzimer et al. [sent-308, score-0.775]

35 Runtime Analysis of All-Nearest-Neighbors: In the bichromatic case, the performance of the FindAllNN algorithm (or any dual-tree algorithm) will depend on the degree of difference between the query and reference sets. [sent-311, score-0.765]

36 If the sets are nearly identical, then the runtime will be close to the monochromatic case. [sent-312, score-0.531]

37 If the inter-point distances in the query set are very large relative to those between references, then the algorithm may have to descend to the leaves of the query tree before making any descents in the reference tree. [sent-313, score-1.375]

38 In order to quantify this difference in scale for our runtime analysis, we introduce the degree of bichromaticity: Definition 3. [sent-315, score-0.41]

39 The tree with the larger scale is always descended. [sent-325, score-0.237]

40 In the monochromatic case, the trees are identical and the traversal alternates between them. [sent-331, score-0.323]

41 As the difference in scales of the two data sets increases, more descents in the query tree become necessary, giving a higher degree of bichromaticity. [sent-334, score-0.754]
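The rule "descend the tree with the larger scale" and the counting behind the degree of bichromaticity can be illustrated on two hypothetical lists of explicit scales, one per tree. This is only a schematic of the definition (the real quantity is defined over the dual-tree recursion), and the function name and scale lists are made up for illustration.

```python
def max_query_descends(query_scales, reference_scales):
    """Largest run of query-tree descends between consecutive reference-tree descends,
    when the tree whose current scale is larger is always descended (ties go to the query tree)."""
    qi, ri = 0, 0
    runs, run = [], 0
    while qi < len(query_scales) - 1 or ri < len(reference_scales) - 1:
        can_q = qi < len(query_scales) - 1
        can_r = ri < len(reference_scales) - 1
        if can_q and (not can_r or query_scales[qi] >= reference_scales[ri]):
            qi += 1          # query descend
            run += 1
        else:
            ri += 1          # reference descend: close the current run of query descends
            runs.append(run)
            run = 0
    runs.append(run)
    return max(runs)

# Identical scale sequences alternate, giving 1 (the monochromatic case);
# max_query_descends([5, 4, 3, 0], [5, 2, 1]) returns 2.
```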

42 Given a (𝒬, ℛ) pair, the FindAllNN subroutine of Algorithm 1 computes the nearest neighbor in ℛ of each point in 𝒬. [sent-348, score-0.551]

43 The computation at Line 3 is done for each of the query nodes at most once, hence takes O(…) time. [sent-358, score-0.387]

44 The traversal of a reference node is duplicated over the set of queries only if the query tree is descended just before the reference tree descent. [sent-363, score-1.432]

45 For every query descent, there would be at most O(c⁴) duplications (width bound) for every reference node traversal. [sent-364, score-0.353]

46 At most O(c⁴) duplications (width bound) occur for every reference node traversal. [sent-366, score-0.275]

47 Algorithm 1: Single-tree and batch query algorithms for Nearest Neighbor search and Approximate Kernel summation (FindNN over an ℛ-tree T, …). [sent-368, score-0.824]

48 The number of query-tree descents between any two reference-tree descents is upper bounded by κ. [sent-668, score-0.475]

49 The number of explicit reference nodes is O(N). [sent-669, score-0.281]

50 Hence the total number of reference nodes considered in Line 5 in the whole algorithm is at most O(κN). [sent-670, score-0.312]

51 The width bound limits |C_i|, and the maximum depth of any point in the explicit ℛ-tree is O(c² log N). [sent-682, score-0.354]

52 The traversal down the query tree causes duplication of ℛ nodes, and the duplication of any reference node is upper bounded by O(…). [sent-692, score-1.005]

53 Line 9 is executed just once for each of the explicit nodes of the query tree and hence takes at most O(N) time. [sent-707, score-0.676]

54 Hence, the FindAllNN subroutine of Algorithm 1 has a runtime bound of O(N). [sent-908, score-0.649]

55 In the monochromatic case, κ = 1 since the query and the reference tree are the same. [sent-923, score-0.785]

56 For a kernel K(⋅), the exact computation of kernel summations is infeasible without O(N²) operations. [sent-928, score-0.366]

57 Approximate kernel summation is more computationally intensive than nearest neighbors because pruning is not based on the distances alone but also on the analytical properties of the kernel (i.e., its smoothness and extent). [sent-986, score-0.754]

58 Therefore, we require a more extensive runtime analysis, especially for kernels with an infinite extent, such as the Gaussian kernel. [sent-989, score-0.331]

59 We first prove logarithmic running time for the single-query kernel sum problem under an absolute error bound and then show linear running time for the dual-tree algorithm. [sent-990, score-0.624]

60 Single Tree Approximate Kernel Summations Under Absolute Error: The algorithm for computing the approximate kernel summation under absolute error is shown in the KernelSum subroutine of Algorithm 1. [sent-993, score-0.821]

61 This amounts to limiting the error per kernel evaluation to be less than ϵ. [sent-1045, score-0.273]

62 By the triangle inequality, the approximate kernel sum is then within the required error of the true kernel sum. [sent-1048, score-0.281]
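A sketch of the absolute-error pruning test discussed in the last two sentences appears below. It is illustrative only; it reuses Node, dist, and subtree_points from the earlier sketches, assumes a monotonically decreasing kernel, and is not the paper's KernelSum. A node is pruned when the kernel varies by at most 2ϵ over the range of possible distances, so every pruned evaluation is off by at most ϵ.

```python
import math

def gaussian(d, bandwidth=1.0):
    return math.exp(-d * d / (2.0 * bandwidth * bandwidth))

def kernel_sum_single(query, node, eps, kernel=gaussian):
    """Single-query approximate kernel summation with a per-evaluation absolute error cap of eps."""
    d_center = dist(query, node.center)
    d_lo = max(0.0, d_center - node.radius)   # closest any point in the node can be
    d_hi = d_center + node.radius             # farthest any point in the node can be
    if kernel(d_lo) - kernel(d_hi) <= 2.0 * eps:
        # Prune: approximate every evaluation in the subtree by the midpoint kernel value
        return len(subtree_points(node)) * 0.5 * (kernel(d_lo) + kernel(d_hi))
    if not node.children:
        return sum(kernel(dist(query, r)) for r in node.points)  # exact at leaves
    return (sum(kernel(dist(query, r)) for r in node.points)
            + sum(kernel_sum_single(query, c, eps, kernel) for c in node.children))
```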

63 The following theorem proves the runtime of the single-query kernel summation with smooth and monotonically decreasing kernels using a cover tree. [sent-1055, score-0.947]

64 The theorem assumes an error tolerance ϵ and a monotonically decreasing smooth non-negative kernel function K(⋅). [sent-1062, score-0.234]

65 Under these conditions on x ∈ (h, ∞) for some h > 0, the KernelSum subroutine of Algorithm 1 computes the kernel summation at a query q in O(log N) time. [sent-1066, score-1.013]

66 The bound follows from the runtime proof of the single-tree nearest neighbor algorithm using cover trees in Beygelzimer et al. [sent-1459, score-0.982]

67 Dual Tree Approximate Kernel Summations Under Absolute Error: An algorithm for the computation of kernel sums for multiple queries is shown in the AllKernelSum subroutine of Algorithm 1, analogous to FindAllNN for batch nearest-neighbor query. [sent-1477, score-0.615]

68 The dual-tree version of the algorithm requires a stricter pruning rule to ensure correctness for all the queries in a query subtree. [sent-1478, score-0.519]

69 For each query node, a quantity is maintained that accumulates the postponed kernel contribution for all query points under the subtree rooted at that node. [sent-1484, score-0.604]
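The accumulation of postponed contributions can be sketched as below (again a simplified illustration reusing the helpers from the earlier sketches, not the paper's AllKernelSum): a pruned reference node deposits a single summed contribution on the query node, which is later pushed down to every query point in that subtree.

```python
def all_kernel_sum(qnode, rnode, eps, kernel, postponed, results):
    """Dual-tree approximate kernel summation sketch with postponed (delayed) contributions."""
    d_centers = dist(qnode.center, rnode.center)
    d_lo = max(0.0, d_centers - qnode.radius - rnode.radius)
    d_hi = d_centers + qnode.radius + rnode.radius
    if kernel(d_lo) - kernel(d_hi) <= 2.0 * eps:
        # Prune for the whole query subtree: postpone one summed contribution on qnode
        contribution = len(subtree_points(rnode)) * 0.5 * (kernel(d_lo) + kernel(d_hi))
        postponed[id(qnode)] = postponed.get(id(qnode), 0.0) + contribution
        return
    if not qnode.children and not rnode.children:
        for q in qnode.points:  # base case: exact evaluation between the two leaves
            results[tuple(q)] += sum(kernel(dist(q, r)) for r in rnode.points)
        return
    for qc in (qnode.children or [qnode]):
        for rc in (rnode.children or [rnode]):
            all_kernel_sum(qc, rc, eps, kernel, postponed, results)

def push_down(qnode, postponed, results, carried=0.0):
    """Finalize: push every postponed contribution down to the query points below it."""
    carried += postponed.get(id(qnode), 0.0)
    for q in qnode.points:
        results[tuple(q)] += carried
    for c in qnode.children:
        push_down(c, postponed, results, carried)
```

A caller would set results = {tuple(q): 0.0 for q in queries} and postponed = {}, run all_kernel_sum(query_root, reference_root, eps, gaussian, postponed, results), then push_down(query_root, postponed, results).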

70 The following theorem proves the correctness of the AllKernelSum subroutine of Algorithm 1. [sent-1487, score-0.369]

71 For each query in 𝒬, the AllKernelSum subroutine of Algorithm 1 computes an approximation to the kernel sum. [sent-1492, score-0.32]

72 The pruning rule guarantees at most ϵ absolute error for each kernel value, and the result follows by the triangle inequality. [sent-1546, score-0.415]

73 Based on the runtime analysis of the batch nearest neighbor, the runtime bound of AllKernelSum is given by the following theorem. [sent-1547, score-0.914]

74 Let K(⋅) be a monotonically-decreasing smooth non-negative kernel function that is concave for x ∈ (0, h). [sent-1562, score-0.237]

75 Given ϵ, the AllKernelSum subroutine of Algorithm 1 computes an approximation to the kernel sum for each query. [sent-1566, score-0.32]

76 Therefore, the methodology of the runtime analysis of batch nearest neighbor gives the O(N) runtime bound for AllKernelSum. [sent-1633, score-0.601]

77 Approximations Under Relative Error: We now extend the analysis for absolute error bounds to cover approximations under the relative error criterion given in Definition 4. [sent-1636, score-0.474]

78 An approximation algorithm for a relative error bound is similar to the KernelSum subroutine of Algorithm 1 except for the definition of the pruning quantity. [sent-1644, score-0.496]

79 … the set of reference points that are not pruned at the given level i. [sent-1648, score-0.298]

80 … is the scale of the root of the reference cover tree in the explicit representation. [sent-1729, score-0.689]

81 Then, the KernelSum subroutine of Algorithm 1 with Line 5 redefined as Eqn. … computes the kernel summation under the relative error criterion. [sent-1735, score-0.256]
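For the relative-error variant, the only change in the sketch is the pruning test: a node is pruned when replacing every kernel value inside it by the smallest possible value changes each value by a relative factor of at most ϵ. This is an illustrative stand-in for the redefined Line 5, reusing the helpers from the earlier sketches, not the paper's rule.

```python
def kernel_sum_relative(query, node, eps, kernel=gaussian):
    """Single-query kernel summation sketch with a per-evaluation relative error cap of eps."""
    d_center = dist(query, node.center)
    d_lo = max(0.0, d_center - node.radius)
    d_hi = d_center + node.radius
    # If K(d_lo) - K(d_hi) <= eps * K(d_lo), then using K(d_hi) for every point in the node
    # incurs relative error at most eps on each kernel value (K is monotone decreasing).
    if kernel(d_lo) - kernel(d_hi) <= eps * kernel(d_lo):
        return len(subtree_points(node)) * kernel(d_hi)
    if not node.children:
        return sum(kernel(dist(query, r)) for r in node.points)
    return (sum(kernel(dist(query, r)) for r in node.points)
            + sum(kernel_sum_relative(query, c, eps, kernel) for c in node.children))
```

As the text notes, requiring every pruned kernel value to meet the tolerance individually is conservative in practice.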

82 This amounts to limiting the error per kernel evaluation to be less than ϵ. [sent-1787, score-0.273]

83 By the triangle inequality, the approximate kernel sum is then within the required error of the true kernel sum. [sent-1797, score-0.281]

84 Since the relative error is an instance of the absolute error, the algorithm also runs in O(log N) time. [sent-1814, score-0.242]

85 Here the upper bound used for pruning is larger, taking into account both the maximum possible distance from the root of the query tree to its descendants and the maximum possible distance from the root of the reference tree to its descendants. [sent-1829, score-1.16]

86 The relevant scales are those of the roots of the query and reference cover trees, respectively, in the explicit representations. [sent-1884, score-0.845]

87 Given ϵ, the AllKernelSum subroutine of Algorithm 1 with Line 11 redefined as Eq. … computes approximations under the relative error criterion. [sent-1894, score-0.256]

88 The algorithm guarantees the ϵ relative error bound with a runtime bound of O(N). [sent-1900, score-0.564]

89 The pruning rules shown above are sub-optimal in practice, because they require every pairwise kernel value that is pruned to be within the prescribed error tolerance. [sent-1904, score-0.32]

90 N-body potential summation is an instance of the kernel summation problem that arises in computational physics and chemistry. [sent-1910, score-0.551]

91 This kernel is infinite at zero distance and has no inflection point (i.e., it is convex everywhere for r > 0). [sent-1916, score-0.27]
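To reuse the earlier single-tree sketch for the N-body potential, only the kernel changes; the softening constant below is this sketch's assumption for avoiding a division by zero at coincident points, not something taken from the paper.

```python
def coulomb(d, softening=1e-12):
    """1/r potential kernel: monotonically decreasing, infinite at zero distance, no inflection point."""
    return 1.0 / (d + softening)

# Same machinery as before, e.g. an approximate potential at one query point:
# phi_q = kernel_sum_single(query, reference_root, eps, kernel=coulomb)
```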

92 Nevertheless, it is possible to obtain the runtime behavior using the results shown in the previous sections. [sent-1920, score-0.331]

93 For the potential kernel K(r), the KernelSum subroutine of Algorithm 1 computes the potential summation at a query q in O(log N) time. [sent-1946, score-0.84]

94 The nearest neighbor of a query can be found in O(log N) time using the single-tree algorithm for nearest neighbor described in Beygelzimer et al. [sent-2002, score-0.242]

95 Applying the same theorem to the KernelSum subroutine of Algorithm 1 with the aforementioned kernel, we prove the O(log N) running time. [sent-2053, score-0.342]

96 The runtime analysis for the batch case of the algorithm follows naturally. [sent-2055, score-0.433]

97 The result requires a query set 𝒬 with a bounded degree of bichromaticity for the (𝒬, ℛ) pair. [sent-2066, score-0.226]

98 For the potential kernel K(r), the AllKernelSum subroutine of Algorithm 1 approximates the potential summation ∀q ∈ 𝒬 in O(N) time. [sent-2074, score-0.423]

99 So far, the improvements in runtimes have only been empirical with no rigorous runtime bounds [2, 8, 9, 17, 18]. [sent-2111, score-0.378]

100 Previous work has provided algorithms with rough linear runtime arguments for certain instances of these problems [14, 5, 13], but these results only apply to the monochromatic case. [sent-2112, score-0.584]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('query', 0.353), ('runtime', 0.331), ('subroutine', 0.256), ('allkernelsum', 0.245), ('tree', 0.237), ('kernelsum', 0.222), ('kernel', 0.208), ('monochromatic', 0.2), ('reference', 0.195), ('findallnn', 0.178), ('cover', 0.172), ('summations', 0.158), ('bichromatic', 0.133), ('bichromaticity', 0.133), ('summation', 0.132), ('beygelzimer', 0.126), ('nearest', 0.125), ('descends', 0.117), ('expansion', 0.107), ('absolute', 0.096), ('neighbor', 0.08), ('node', 0.08), ('trees', 0.073), ('pruned', 0.07), ('karger', 0.067), ('error', 0.065), ('batch', 0.065), ('computes', 0.064), ('running', 0.064), ('bound', 0.062), ('particle', 0.056), ('explicit', 0.052), ('gray', 0.051), ('traversal', 0.05), ('computations', 0.049), ('queries', 0.049), ('runtimes', 0.047), ('degree', 0.047), ('theorem', 0.047), ('bounded', 0.046), ('triangle', 0.046), ('pruning', 0.045), ('barnes', 0.044), ('duplication', 0.044), ('electrostatic', 0.044), ('findnn', 0.044), ('fmm', 0.044), ('gravitational', 0.044), ('ruhl', 0.044), ('relative', 0.044), ('physics', 0.044), ('subtree', 0.043), ('log', 0.043), ('pairwise', 0.042), ('balls', 0.042), ('line', 0.041), ('prove', 0.039), ('descend', 0.039), ('depth', 0.039), ('intrinsic', 0.038), ('algorithm', 0.037), ('max', 0.037), ('lee', 0.037), ('distance', 0.036), ('corollary', 0.036), ('agrees', 0.036), ('descended', 0.036), ('multipole', 0.036), ('krauthgamer', 0.036), ('navigating', 0.036), ('neighbors', 0.036), ('potential', 0.035), ('correctness', 0.035), ('else', 0.035), ('trivially', 0.034), ('nodes', 0.034), ('guarantees', 0.034), ('radius', 0.034), ('level', 0.033), ('root', 0.033), ('nition', 0.032), ('rede', 0.032), ('proves', 0.031), ('width', 0.03), ('constant', 0.03), ('concave', 0.029), ('algorithms', 0.028), ('nets', 0.028), ('approximate', 0.027), ('prune', 0.027), ('simulation', 0.027), ('monotonically', 0.026), ('distant', 0.026), ('point', 0.026), ('logarithmic', 0.026), ('problems', 0.025), ('end', 0.025), ('proximity', 0.025), ('kakade', 0.025), ('parent', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body simulation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1].

2 0.20646319 95 nips-2009-Fast subtree kernels on graphs

Author: Nino Shervashidze, Karsten M. Borgwardt

Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n² 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime.
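A compact sketch of the Weisfeiler-Lehman relabelling idea mentioned in this abstract is shown below; it is a generic illustration (the function names and the disjoint-union trick are this sketch's choices, not the authors' implementation).

```python
from collections import Counter

def wl_labels(adj, labels, h):
    """Return node-label lists for WL iterations 0..h; each new label compresses
    (own label, sorted multiset of neighbour labels) into a shared compressed id."""
    out = [list(labels)]
    table = {}
    for _ in range(h):
        sigs = [(labels[v], tuple(sorted(labels[u] for u in adj[v]))) for v in range(len(adj))]
        labels = [table.setdefault(s, ("wl", len(table))) for s in sigs]
        out.append(list(labels))
    return out

def wl_subtree_kernel(adj1, labels1, adj2, labels2, h=3):
    """Subtree kernel sketch: relabel the disjoint union of both graphs (so compressed labels
    agree), then sum dot products of per-graph label histograms over the h+1 iterations."""
    n1 = len(adj1)
    adj = [list(nb) for nb in adj1] + [[u + n1 for u in nb] for nb in adj2]
    labels = list(labels1) + list(labels2)
    value = 0
    for lab in wl_labels(adj, labels, h):
        h1, h2 = Counter(lab[:n1]), Counter(lab[n1:])
        value += sum(h1[key] * h2[key] for key in h1)
    return value

# Example: a labeled triangle compared with itself gives 3*3 per iteration, so
# wl_subtree_kernel([[1, 2], [0, 2], [0, 1]], "aaa", [[1, 2], [0, 2], [0, 1]], "aaa", h=2) == 27
```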

3 0.1961055 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

Author: Parikshit Ram, Dongryeol Lee, Hua Ouyang, Alexander G. Gray

Abstract: The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ϵ) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.

4 0.12996611 166 nips-2009-Noisy Generalized Binary Search

Author: Robert Nowak

Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1
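The greedy query-selection rule described here (pick the query that most evenly splits the surviving hypotheses) is easy to sketch. The code below shows the noiseless version only and uses hypothetical names; the paper's contribution is precisely how to make this robust to noisy responses, which this sketch does not attempt.

```python
def most_even_split(hypotheses, queries):
    """Pick the query whose predicted +1/-1 responses split the hypothesis set most evenly."""
    return min(queries, key=lambda q: abs(sum(h(q) for h in hypotheses)))

def noiseless_gbs(hypotheses, queries, oracle, max_steps=100):
    """Classic (noise-free) generalized binary search: query, then discard inconsistent hypotheses.

    Each hypothesis is a callable mapping a query to +1 or -1; `oracle` returns the true response.
    """
    active = list(hypotheses)
    for _ in range(max_steps):
        if len(active) <= 1:
            break
        q = most_even_split(active, queries)
        y = oracle(q)
        active = [h for h in active if h(q) == y]
    return active
```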

5 0.12448674 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V, the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important.

6 0.11507623 129 nips-2009-Learning a Small Mixture of Trees

7 0.10740061 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

8 0.10683411 190 nips-2009-Polynomial Semantic Indexing

9 0.10611022 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

10 0.099583738 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

11 0.088589475 31 nips-2009-An LP View of the M-best MAP problem

12 0.082166478 128 nips-2009-Learning Non-Linear Combinations of Kernels

13 0.081782281 87 nips-2009-Exponential Family Graph Matching and Ranking

14 0.080432966 74 nips-2009-Efficient Bregman Range Search

15 0.076516345 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

16 0.075047031 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

17 0.073566236 119 nips-2009-Kernel Methods for Deep Learning

18 0.072835341 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

19 0.066092558 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

20 0.065831766 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.188), (1, 0.106), (2, -0.054), (3, 0.051), (4, -0.072), (5, -0.048), (6, -0.125), (7, 0.152), (8, -0.074), (9, -0.136), (10, -0.07), (11, -0.053), (12, 0.162), (13, 0.224), (14, 0.034), (15, -0.044), (16, 0.076), (17, 0.114), (18, 0.048), (19, 0.062), (20, 0.135), (21, 0.071), (22, 0.063), (23, 0.127), (24, 0.07), (25, -0.099), (26, 0.023), (27, -0.008), (28, 0.087), (29, -0.099), (30, -0.042), (31, -0.017), (32, -0.127), (33, -0.064), (34, -0.166), (35, -0.052), (36, -0.06), (37, -0.021), (38, 0.138), (39, -0.047), (40, 0.033), (41, 0.037), (42, 0.032), (43, -0.039), (44, 0.01), (45, -0.074), (46, -0.097), (47, 0.019), (48, 0.016), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9609955 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body simulation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. 1
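To make the two pairwise problems named in this abstract concrete, here is a minimal sketch using scikit-learn's standard tree-based routines (KDTree for all-nearest-neighbors, KernelDensity for kernel summation). It illustrates the problem setting only, not the cover-tree algorithms analyzed in the paper; the query set Q and reference set R are hypothetical synthetic data.

```python
# Sketch of the two pairwise problems: all-nearest-neighbors and kernel
# summation, in the bichromatic setting (separate query and reference sets).
import numpy as np
from sklearn.neighbors import KDTree, KernelDensity

rng = np.random.default_rng(0)
R = rng.normal(size=(1000, 3))   # reference set
Q = rng.normal(size=(200, 3))    # query set

# All-nearest-neighbors: the nearest reference for every query point.
tree = KDTree(R)
dist, idx = tree.query(Q, k=1)

# Kernel summation: KDE evaluates, for each query, a normalized sum of
# Gaussian-kernel contributions from all references; score_samples
# returns the log-density at each query.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(R)
log_density = kde.score_samples(Q)
```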

2 0.81964988 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions

Author: Parikshit Ram, Dongryeol Lee, Hua Ouyang, Alexander G. Gray

Abstract: The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ε) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior. 1
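As a simplified illustration of controlling accuracy in terms of ranks (not the stratified, tree-based algorithm of the paper), the sketch below samples just enough reference points, uniformly with replacement, so that the best point in the sample lies within the top tau true ranks with probability at least 1 - alpha; all names and sizes are hypothetical.

```python
# Rank-approximate NN via uniform sampling: the chance that none of the
# top-tau references appears in m draws (with replacement) is
# (1 - tau/N)^m, so choosing m with (1 - tau/N)^m <= alpha suffices.
import numpy as np

def rank_approx_nn(query, references, tau, alpha, rng):
    N = len(references)
    m = int(np.ceil(np.log(alpha) / np.log(1.0 - tau / N)))
    m = min(m, N)
    sample_idx = rng.choice(N, size=m, replace=True)
    d = np.linalg.norm(references[sample_idx] - query, axis=1)
    return sample_idx[np.argmin(d)]

rng = np.random.default_rng(0)
R = rng.normal(size=(100000, 10))
q = rng.normal(size=10)
approx_idx = rank_approx_nn(q, R, tau=100, alpha=0.05, rng=rng)
```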

3 0.61164731 95 nips-2009-Fast subtree kernels on graphs

Author: Nino Shervashidze, Karsten M. Borgwardt

Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n^2 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
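The following sketch re-implements the Weisfeiler-Lehman relabeling idea the abstract refers to: each round compresses every node's (label, sorted neighbor labels) pair into a new label shared across both graphs, and the kernel is the dot product of the accumulated label histograms. It is an illustrative reconstruction (using Python sorting rather than the counting sort needed for the stated O(mh) bound), not the authors' code.

```python
# Weisfeiler-Lehman subtree kernel sketch. Graphs are adjacency lists
# with integer node labels; labels from different iterations are kept
# as distinct features by tagging them with the iteration index.
from collections import Counter

def wl_subtree_kernel(adj1, lab1, adj2, lab2, h=3):
    def refine(adj, labels, table):
        out = []
        for v, nbrs in enumerate(adj):
            sig = (labels[v], tuple(sorted(labels[u] for u in nbrs)))
            out.append(table.setdefault(sig, len(table)))
        return out

    l1, l2 = list(lab1), list(lab2)
    c1 = Counter((0, l) for l in l1)
    c2 = Counter((0, l) for l in l2)
    for it in range(1, h + 1):
        table = {}                    # fresh compression table per round,
        l1 = refine(adj1, l1, table)  # shared by both graphs so identical
        l2 = refine(adj2, l2, table)  # subtree patterns get identical labels
        c1.update((it, l) for l in l1)
        c2.update((it, l) for l in l2)
    return sum(c1[k] * c2[k] for k in c1)  # dot product of histograms

# Example: a 3-node path vs a triangle, all nodes labeled 0.
path = [[1], [0, 2], [1]]
triangle = [[1, 2], [0, 2], [0, 1]]
print(wl_subtree_kernel(path, [0, 0, 0], triangle, [0, 0, 0], h=2))
```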

4 0.5893389 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

5 0.53894699 74 nips-2009-Efficient Bregman Range Search

Author: Lawrence Cayton

Abstract: We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences. 1
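For concreteness, here is a naive linear-scan baseline for the range-search task described above, using the KL divergence (relative entropy) as the Bregman divergence; the paper's contribution is the tree-based method that avoids this full scan, which is not reproduced here, and the data below are hypothetical points on the probability simplex.

```python
# Naive Bregman range search: return every reference whose KL divergence
# from the query is at most the given radius, by scanning all points.
import numpy as np

def kl_divergence(p, Q):
    """KL(p || q) for one query p against each row q of Q; all entries
    are assumed strictly positive and each row sums to 1."""
    return np.sum(p * (np.log(p) - np.log(Q)), axis=1)

def bregman_range_search_naive(p, references, radius):
    d = kl_divergence(p, references)
    return np.nonzero(d <= radius)[0]

rng = np.random.default_rng(0)
R = rng.dirichlet(np.ones(5), size=1000)   # reference points on the simplex
q = rng.dirichlet(np.ones(5))
hits = bregman_range_search_naive(q, R, radius=0.1)
```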

6 0.53690398 48 nips-2009-Bootstrapping from Game Tree Search

7 0.53351009 166 nips-2009-Noisy Generalized Binary Search

8 0.52632248 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

9 0.51663762 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

10 0.42055514 129 nips-2009-Learning a Small Mixture of Trees

11 0.41486147 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

12 0.41482887 128 nips-2009-Learning Non-Linear Combinations of Kernels

13 0.40902817 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

14 0.39201066 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

15 0.38208723 119 nips-2009-Kernel Methods for Deep Learning

16 0.38016847 24 nips-2009-Adapting to the Shifting Intent of Search Queries

17 0.3799299 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

18 0.37623888 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

19 0.36971292 190 nips-2009-Polynomial Semantic Indexing

20 0.35521242 31 nips-2009-An LP View of the M-best MAP problem


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.062), (25, 0.065), (35, 0.065), (36, 0.121), (39, 0.037), (48, 0.017), (58, 0.062), (61, 0.015), (71, 0.041), (75, 0.246), (81, 0.023), (86, 0.147), (91, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82320708 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems

Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray

Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body simulation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. 1

2 0.67505491 119 nips-2009-Kernel Methods for Deep Learning

Author: Youngmin Cho, Lawrence K. Saul

Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
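One closed form commonly associated with this family is the degree-one arc-cosine kernel; the sketch below implements that form as an illustration, under the assumption that it is representative of the kernels discussed, and is not claimed to be the paper's exact construction.

```python
# Degree-one arc-cosine kernel sketch:
# k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(t) + (pi - t) * cos(t)),
# where t is the angle between x and y.
import numpy as np

def arc_cosine_kernel_deg1(X, Y):
    nx = np.linalg.norm(X, axis=1, keepdims=True)      # (n, 1)
    ny = np.linalg.norm(Y, axis=1, keepdims=True).T    # (1, m)
    cos_t = np.clip(X @ Y.T / (nx * ny), -1.0, 1.0)
    t = np.arccos(cos_t)
    return (nx * ny / np.pi) * (np.sin(t) + (np.pi - t) * cos_t)

# Example: a small Gram matrix, e.g. for a precomputed-kernel SVM; deeper
# "multilayer" variants would compose this kernel with itself.
X = np.random.default_rng(0).normal(size=(5, 3))
K = arc_cosine_kernel_deg1(X, X)
```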

3 0.67007422 104 nips-2009-Group Sparse Coding

Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow

Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification. 1
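The mechanism behind the image-level sparsity described here is the mixed l1/l2 norm, whose proximal step zeroes out whole groups of coefficients at once. The sketch below shows that generic group-soft-thresholding step only, not the paper's full dictionary-learning procedure; the toy codes and groups are hypothetical.

```python
# Proximal operator of lam * sum_g ||codes[g]||_2: each group is shrunk
# toward zero and dropped entirely when its norm falls below lam.
import numpy as np

def group_soft_threshold(codes, groups, lam):
    out = np.zeros_like(codes)
    for g in groups:                       # g: index array for one group
        norm = np.linalg.norm(codes[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * codes[g]
    return out

codes = np.array([0.9, -0.2, 0.05, 0.02, 1.5, 0.7])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(group_soft_threshold(codes, groups, lam=0.3))  # middle group is zeroed
```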

4 0.66922164 137 nips-2009-Learning transport operators for image manifolds

Author: Benjamin Culpepper, Bruno A. Olshausen

Abstract: We describe an unsupervised manifold learning algorithm that represents a surface through a compact description of operators that traverse it. The operators are based on matrix exponentials, which are the solution to a system of first-order linear differential equations. The matrix exponents are represented by a basis that is adapted to the statistics of the data so that the infinitesimal generator for a trajectory along the underlying manifold can be produced by linearly composing a few elements. The method is applied to recover topological structure from low dimensional synthetic data, and to model local structure in how natural images change over time and scale. 1
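A minimal sketch of the transport idea: a point is moved by a matrix exponential whose generator is a sparse linear combination of dictionary elements. The dictionary Psi and coefficients c below are random placeholders standing in for the learned quantities, so this only illustrates the form of the model.

```python
# Transport a point x by exp(sum_k c_k * Psi_k), the solution of a
# first-order linear ODE with the composed generator.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, K = 4, 3
Psi = rng.normal(scale=0.1, size=(K, d, d))   # placeholder dictionary of generators
c = np.array([0.5, 0.0, -0.3])                # sparse combination coefficients

A = np.tensordot(c, Psi, axes=1)              # infinitesimal generator, shape (d, d)
x = rng.normal(size=d)
x_transported = expm(A) @ x                   # point moved along the modeled manifold
```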

5 0.66788834 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

Author: Liefeng Bo, Cristian Sminchisescu

Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low-dimensional feature space and average the resulting vectors to form a set-level feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art in three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1
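As a rough sketch of the "map local features to a low-dimensional space, then average" recipe, the code below uses random Fourier features for a Gaussian kernel between descriptors and averages them into one vector per image; the paper instead learns the feature maps, so this is only an approximation of the idea with hypothetical data, not the EMK construction itself.

```python
# Random Fourier feature map whose inner products approximate a Gaussian
# kernel between local descriptors; averaging per image gives a set-level
# feature whose dot products approximate the average match kernel.
import numpy as np

def make_feature_map(dim_in, dim_out, sigma, rng):
    W = rng.normal(scale=1.0 / sigma, size=(dim_out, dim_in))
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim_out)
    return lambda X: np.sqrt(2.0 / dim_out) * np.cos(X @ W.T + b)

def image_embedding(local_features, feature_map):
    return feature_map(local_features).mean(axis=0)   # one vector per image

rng = np.random.default_rng(0)
phi = make_feature_map(dim_in=128, dim_out=256, sigma=0.5, rng=rng)
img1 = rng.normal(size=(300, 128))   # 300 hypothetical local descriptors
img2 = rng.normal(size=(450, 128))
similarity = image_embedding(img1, phi) @ image_embedding(img2, phi)
```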

6 0.6646654 2 nips-2009-3D Object Recognition with Deep Belief Nets

7 0.66457987 87 nips-2009-Exponential Family Graph Matching and Ranking

8 0.66239256 151 nips-2009-Measuring Invariances in Deep Networks

9 0.66110998 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

10 0.66018301 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

11 0.6591872 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

12 0.65821761 3 nips-2009-AUC optimization and the two-sample problem

13 0.65658098 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

14 0.65586913 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

15 0.65563649 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

16 0.6550805 72 nips-2009-Distribution Matching for Transduction

17 0.65358025 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

18 0.65300018 129 nips-2009-Learning a Small Mixture of Trees

19 0.65285397 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

20 0.65041476 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions