nips nips2009 nips2009-8 knowledge-graph by maker-knowledge-mining

8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Source: pdf

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

Abstract: A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufﬁciently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difﬁculty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an inﬁnite weighted sum of χ2 random variables. Prior ﬁnite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and ﬁtting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artiﬁcial example, high dimensional multivariate data, and text.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 In using this distance as a statistic for a test of whether two samples are from different distributions, a major difﬁculty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an inﬁnite weighted sum of χ2 random variables. [sent-12, score-0.72]

2 The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. [sent-14, score-0.622]

3 A proof of consistency of this estimate is provided. [sent-15, score-0.093]

4 The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artiﬁcial example, high dimensional multivariate data, and text. [sent-16, score-0.608]

5 1 Introduction Learning algorithms based on kernel methods have enjoyed considerable success in a wide range of supervised learning tasks, such as regression and classiﬁcation [25]. [sent-17, score-0.156]

6 One reason for the popularity of these approaches is that they solve difﬁcult non-parametric problems by representing the data points in high dimensional spaces of features, speciﬁcally reproducing kernel Hilbert spaces (RKHSs), in which linear algorithms can be brought to bear. [sent-18, score-0.201]

7 While classical kernel methods have addressed the mapping of individual points to feature space, more recent developments [14, 29, 28] have focused on the embedding of probability distributions in RKHSs. [sent-19, score-0.23]

8 When the embedding is injective, the RKHS is said to be characteristic [11, 29, 12], and the distance between feature mappings constitutes a metric on distributions. [sent-20, score-0.079]

9 One well-deﬁned application of the MMD is in testing whether two samples are drawn from two different distributions (i. [sent-22, score-0.127]

10 For instance, we might wish to ﬁnd whether DNA microarrays obtained on the same tissue type by different labs are distributed identically, or whether differences in lab procedure are such that the data have dissimilar distributions (and cannot be aggregated) [8]. [sent-25, score-0.117]

11 A major challenge when using the MMD in two-sample testing is in obtaining a signiﬁcance threshold, which the MMD should exceed with small probability when the null hypothesis (that the samples share the same generating distribution) is satisﬁed. [sent-27, score-0.512]

12 Following [14, Section 4], we deﬁne this threshold as an upper quantile of the asymptotic distribution of the MMD under the null hypothesis. [sent-28, score-0.474]

13 Unfortunately this null distribution takes the form of an inﬁnite weighted sum of χ2 random variables. [sent-29, score-0.381]

14 Thus, obtaining a consistent ﬁnite sample estimate of this threshold — that is, an estimate that converges to the true threshold in the inﬁnite sample limit — is a signiﬁcant challenge. [sent-30, score-0.373]

15 The main contribution of the present study is a consistent ﬁnite sample estimate of the null distribution (not based on bootstrap), and a proof that this estimate converges to the true null distribution in the inﬁnite sample limit. [sent-32, score-1.031]

16 Brieﬂy, the inﬁnite sequence of weights that deﬁnes the null distribution is identical to the sequence of normalized eigenvalues obtained in kernel PCA [26, 27, 7]. [sent-33, score-0.626]

17 Thus, we show that the null distribution deﬁned using ﬁnite sample estimates of these eigenvalues converges to the population distribution, using only convergence results on certain statistics of the eigenvalues. [sent-34, score-0.624]

18 In experiments, our new estimate of the test threshold has a smaller computational cost than that of resampling-based approaches such as the bootstrap, while providing performance as good as the alternatives for larger sample sizes. [sent-35, score-0.248]

19 We also review the maximum mean discrepancy as our chosen distance measure on these embeddings, and recall the asymptotic behaviour of its ﬁnite sample estimate. [sent-37, score-0.174]

20 In Section 3, we present both moment-based approximations to the null distribution of the MMD (which have no consistency guarantees); and our novel, consistent estimate of the null distribution, based on the spectrum of the kernel matrix over the aggregate sample. [sent-38, score-1.167]

21 We also demonstrate the generality of a kernel-based approach by testing whether two samples of text are on the same topic, or on different topics. [sent-40, score-0.126]

22 2 Background In testing whether two samples are generated from the same distribution, we require both a measure of distance between probabilities, and a notion of whether this distance is statistically signiﬁcant. [sent-41, score-0.174]

23 For the former, we deﬁne an embedding of probability distributions in a reproducing kernel Hilbert space (RKHS), such that the distance between these embeddings is our test statistic. [sent-42, score-0.431]

24 For the latter, we give an expression for the asymptotic distribution of this distance measure, from which a signiﬁcance threshold may be obtained. [sent-43, score-0.169]

25 The inner product between feature mappings is given by the positive deﬁnite kernel function k(x, x′ ) := φ(x), φ(x′ ) F . [sent-45, score-0.156]

26 We assume in the following that the kernel k is bounded. [sent-46, score-0.156]

27 For MMD 2 to be a metric, we require that the kernel be characteristic [11, 29, 12]. [sent-51, score-0.156]

28 1 This criterion is satisﬁed for many common kernels, such as the Gaussian kernel (both on compact domains and on Rd ) and the B2l+1 spline kernel on Rd . [sent-52, score-0.312]

29 An unbiased estimate of MMD is the one-sample U-statistic MMD2 := u 1 m(m − 1) m h(zi , zj ), (1) i=j where zi := (xi , yi ) and h(zi , zj ) := k(xi , xj )+k(yi , yj )−k(xi , yj )−k(xj , yi ). [sent-63, score-0.122]

30 In other words we conduct a hypothesis test with null hypothesis H0 deﬁned as P = Q, and alternative hypothesis H1 as P = Q. [sent-67, score-0.512]

31 We must therefore specify a threshold that the empirical MMD will exceed with small probability, when P = Q. [sent-68, score-0.128]

32 For an asymptotic false alarm probability (Type I error) of α, an appropriate threshold is the 1 − α quantile of the asymptotic distribution of the empirical MMD assuming P = Q. [sent-69, score-0.205]

33 According to [14, Theorem 8], this distribution takes the form ∞ mMMD2 → u D 2 λl (zl − 2), (2) l=1 where → denotes convergence in distribution, zl ∼ N(0, 2) i. [sent-70, score-0.18]

34 , λi are the solutions to the eigenD value equation ˜ k(xi , xj )ψl (xi )dP := λl ψl (xj ), (3) X ˜ and k(xi , xj ) := k(xi , xj ) − Ex k(xi , x) − Ex k(x, xi ) + Ex,x′ k(x, x′ ). [sent-73, score-0.108]

35 Consistency in power of the resulting hypothesis test (that is, the convergence of its Type II error to zero for increasing m) is shown in [14]. [sent-74, score-0.122]

36 The eigenvalue problem (3) has been studied extensively in the context of kernel PCA [26, 27, 7]: this connection will be used in obtaining a ﬁnite sample estimate of the null distribution in (2), and we summarize certain important results. [sent-75, score-0.641]

37 (4) The eigenvalues λl of C are the solutions to the eigenvalue problem in (3) [19, Proposition 2]. [sent-77, score-0.089]

38 2511], empirical estimates of these eigenvalues are 1 ˆ λl = νl m where νl are the eigenvalues of the centered Gram matrix (5) K := HKH, Ki,j := k(xi , xj ), and H = I − is a centering matrix. [sent-81, score-0.282]

39 Finally, by subtracting mMMD2 from u ∞ mMMD2 , we observe that these differ by a quantity with expectation tr(C) = l=1 λl , and thus b 1 ⊤ m 11 ∞ mMMD2 → b D 2 λl zl . [sent-82, score-0.146]

40 l=1 1 Other interpretations of the MMD are also possible, for particular kernel choices. [sent-83, score-0.156]

41 The most closely related is the L2 distance between probability density estimates [1], although this requires the kernel bandwidth to decrease with increasing sample size. [sent-84, score-0.324]

42 3 3 Theory In the present section, we describe three approaches for approximating the null distribution of MMD. [sent-87, score-0.381]

43 We ﬁrst present the Pearson curve and Gamma-based approximations, which consist of parametrized families of distributions that we ﬁt by matching the low order moments of the empirical MMD. [sent-88, score-0.13]

44 Such approximations can be accurate in practice, although they remain heuristics with no consistency guarantees. [sent-89, score-0.121]

45 Second, we describe a null distribution estimate based on substituting the empirical estimates (5) of the eigenvalues into (2). [sent-90, score-0.584]

46 We prove that this estimate converges to its population counterpart in the large sample limit. [sent-91, score-0.169]

47 1 Moment-based null distribution estimates The Pearson curves and the Gamma approximation are both based on the low order moments of the empirical MMD. [sent-93, score-0.505]

48 = 2 m (m − 1)2 = (6) (7) Pearson curves take as arguments the variance, skewness and kurtosis As in [14], we replace the 2 + 1. [sent-95, score-0.084]

49 The following result shows that this empirical estimate of the null distribution converges in distribution to its population counterpart. [sent-104, score-0.563]

50 In other words, a test using the MMD statistic, with the threshold computed from quantiles of the null distribution estimate, is asymptotically consistent in level. [sent-105, score-0.608]

51 ) be the eigenvalues of C and C, respectively, in descending order. [sent-131, score-0.089]

52 Since E[tr(C)] is bounded when the kernel has bounded expectation, we again have the desired result by Chebyshev’s inequality. [sent-141, score-0.156]

53 We include the ﬁnal two approaches in the same cost category since even though the Pearson approach scales worse with m than the bootstrap (O(m3 ) vs O(m2 )), the bootstrap has a higher cost for sample sizes less than about 103 due the requirement to repeatedly re-compute the test statistic. [sent-151, score-0.581]

54 We also note that our result of large-sample consistency in level holds under a restrictive condition on the decay of the spectrum of the covariance operator, whereas the Gamma approximation calculations are straightforward and remain possible for any spectrum decay behaviour. [sent-152, score-0.223]

55 The Gamma approximation remains a heuristic, however, and we give an example of a distribution and kernel for which it performs less accurately than the spectrum-based estimate in the upper tail, which is of most interest for testing purposes. [sent-153, score-0.281]

56 4 Experiments In this section, we compare the four approaches to obtaining the null distribution, both in terms of the approximation error computed with respect to simulations from the true null, and when used in homogeneity testing. [sent-154, score-0.431]

57 Our approaches are denoted Gamma (the two-parameter Gamma approximation), Pears (the Pearson curves based on the ﬁrst three moments, using a lower bound for the kurtosis), Spec (our new approximation to the null distribution, using the Gram matrix eigenspectrum), and Boot (the bootstrap approach). [sent-155, score-0.486]

58 Artiﬁcial data: We ﬁrst provide an example of a distribution P for which the heuristics Gamma and Pears have difﬁculty in approximating the null distribution, whereas Spec converges. [sent-156, score-0.42]

59 44), and k as a Gaussian kernel with bandwidth ranging over σ = 2−4 , 2−3 , 2−2 , 2−1 , 20 , 21 , 22 . [sent-161, score-0.193]

60 The sample sizes were set to m = 5000, the total sample size hence being 10, 000, and the results were averaged over 50, 000 replications. [sent-162, score-0.177]

61 The eigenvalues of the Gram matrix were estimated in this experiment using [13], which is slower but more accurate than standard Matlab routines. [sent-163, score-0.089]

62 The true quantiles of the MMD null distribution, referred to as the oracle quantiles, were estimated by Monte Carlo simulations with 50, 000 runs. [sent-164, score-0.433]

63 We report the empirical performance of Spec compared to the oracle in terms of ∆q = 2 2 2 maxtr :q <1 |P(mM M Du > tr ) − Pm (mM M Du > tr )|, where tq is such that P(mM M Du > tq ) = q for q = 0. [sent-165, score-0.193]

64 9, and Pm is the Spec null distribution estimate obtained with m samples from each of P and Q. [sent-169, score-0.472]

65 This focuses the performance comparison on the quantiles corresponding to the upper tail of the null distribution, while still addressing uniform accuracy over a range of thresholds so as to ensure reliable p-values. [sent-171, score-0.433]

66 the Gamma (Gam), Spectrum (Spec), and Pearson (Pears) approximations to the null distribution, as the Gaussian kernel bandwidth parameter varies. [sent-205, score-0.575]

67 Benchmark data: We next demonstrate the performance of the MMD tests on a number of multivariate datasets, taken from [14, Table 1]. [sent-214, score-0.12]

68 In computing the null distributions for both the Spec and Pears cases, we drew 500 samples from the associated null distribution estimates, and computed the test thresholds using the resulting empirical quantiles. [sent-217, score-0.949]

69 For the Spec case, we computed the eigenspectrum on the gram matrix of the aggregate data from P and Q, retaining in all circumstances the maximum number 2m − 1 of nonzero eigenvalues of the empirical Gram matrix. [sent-218, score-0.439]

70 This is a conservative approach, given that the Gram matrix spectrum may decay rapidly [2, Appendix C], in which case it might be possible to safely discard the smallest eigenvalues. [sent-219, score-0.088]

71 For the bootstrap approach Boot, we aggregated points from the two samples, then assigned these randomly without replacement to P and Q. [sent-220, score-0.182]

72 In our experiments, we performed 500 such iterations, and used the resulting histogram of MMD values as our null distribution. [sent-221, score-0.347]

73 We used a Gaussian kernel in all cases, with the bandwidth set to the median distance between points in the aggregation of samples from P and Q. [sent-222, score-0.28]

74 We observe that the kernel tests perform extremely well on these data: the Type I error is in the great majority of cases close to its design value of 1 − α, and the Type II error is very low (and often zero). [sent-228, score-0.292]

75 The Spec test is occasionally slightly conservative, and has a lower Type I error than required: this is most pronounced in the Health Status dataset, for which the sample size m is low. [sent-229, score-0.144]

76 Note that for yet larger sample sizes, however, we expect the cost of Pears to exceed that of the remaining methods, due to its O(m3 ) cost requirement (vs O(m2 ) for the other approaches). [sent-231, score-0.167]

77 07 Table 1: Benchmarks for the kernel two-sample tests on high dimensional multivariate data. [sent-280, score-0.276]

78 As in the earlier work on dependence testing presented in [15], debate transcripts on the three topics of agriculture, ﬁsheries, and immigration were used. [sent-287, score-0.166]

79 Our goal was to distinguish samples on different topics, for instance P being drawn from transcripts on agriculture and Q from transcripts on immigration (in the null case, both samples were from the same topic). [sent-289, score-0.674]

80 We investigated two different kernels on text: the k-substring kernel of [22, 30] with k = 10, and a bag-of-words kernel. [sent-291, score-0.198]

81 We observe that in general, the MMD is very effective at distinguishing distributions of text fragments on different topics: for sample sizes above 30, all the test procedures are able to detect differences in distribution with zero Type II error, for both kernels. [sent-313, score-0.283]

82 We now investigate the fact that for sample sizes below m = 30 on the Hansard data, the Spec test has a much higher Type II error the alternatives. [sent-315, score-0.205]

83 The k-substring and bag-of-words kernels are diagonally dominant: thus for small sample sizes, the empirical estimate of the kernel spectrum is effectively truncated at a point where the eigenvalues remain large, introducing a bias (Figure 2). [sent-316, score-0.516]

84 By contrast, for the Neural data using a Gaussian kernel, this small sample bias is not observed, and the Spec test has equivalent Type II performance to the other three tests (see Figure 1 in the online supplement). [sent-318, score-0.193]

85 , where there are sufﬁcient samples to obtain a Type II error of less than 50%), the bias in the Spec test due to spectral truncation is negligible. [sent-321, score-0.131]

86 We emphasize that the speed advantage of the Spec test becomes important only for larger sample sizes (and the consistency guarantee is only meaningful in this regime). [sent-322, score-0.223]

87 5 Conclusion We have presented a novel method for estimating the null distribution of the RKHS distance between probability distribution embeddings, for use in a nonparametric test of homogeneity. [sent-323, score-0.514]

88 Unlike previous parametric heuristics based on moment matching, our new distribution estimate is consistent; moreover, it is computationally less costly than the bootstrap, which is the only alternative consistent approach. [sent-324, score-0.186]

89 We have demonstrated in experiments that our method performs well on high dimensional multivariate data and text, as well as for distributions where the parametric heuristics show inaccuracies. [sent-325, score-0.118]

90 We anticipate that our approach may also be generalized to kernel independence tests [15], and to homogeneity tests based on the kernel Fisher discriminant [18]. [sent-326, score-0.523]

91 We thank Choon-Hui Teo for generating the Gram matrices for the text data, Malte Rasch for his assistance in the experimental evaluation, and Karsten Borgwardt for his assistance with the microarray data. [sent-328, score-0.179]

92 Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. [sent-343, score-0.099]

93 On the asymptotic properties of a nonparametric l1 -test statistic of homogeneity. [sent-371, score-0.12]

94 Integrating o structured biological data by kernel maximum mean discrepancy. [sent-390, score-0.156]

95 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-403, score-0.201]

96 Permutation tests for equality of distributions in high-dimensional settings. [sent-456, score-0.115]

97 The spectrum kernel: A string kernel for SVM protein classiﬁcation. [sent-491, score-0.244]

98 Nonlinear component analysis as a kernel eigenvalue o u problem. [sent-518, score-0.156]

99 On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. [sent-525, score-0.321]

100 A multivariate two-sample test based on the concept of minimum energy. [sent-564, score-0.099]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmd', 0.405), ('spec', 0.371), ('null', 0.347), ('pears', 0.303), ('gram', 0.177), ('kernel', 0.156), ('zl', 0.146), ('pearson', 0.144), ('gamma', 0.142), ('boot', 0.141), ('bootstrap', 0.139), ('gretton', 0.103), ('eigenspectrum', 0.101), ('gam', 0.101), ('fukumizu', 0.096), ('eigenvalues', 0.089), ('spectrum', 0.088), ('quantiles', 0.086), ('hansard', 0.081), ('mmdb', 0.081), ('transcripts', 0.081), ('ii', 0.081), ('type', 0.08), ('statistic', 0.079), ('tests', 0.078), ('rkhs', 0.067), ('hilbert', 0.064), ('microarray', 0.063), ('sizes', 0.061), ('sheries', 0.061), ('subtype', 0.061), ('sample', 0.058), ('embeddings', 0.057), ('status', 0.057), ('vs', 0.057), ('test', 0.057), ('moments', 0.056), ('homogeneity', 0.055), ('rasch', 0.052), ('threshold', 0.052), ('sch', 0.052), ('kurtosis', 0.049), ('consistency', 0.047), ('estimate', 0.046), ('drew', 0.045), ('testing', 0.045), ('samples', 0.045), ('reproducing', 0.045), ('replacement', 0.043), ('sriperumbudur', 0.043), ('tr', 0.043), ('kernels', 0.042), ('health', 0.042), ('multivariate', 0.042), ('distance', 0.042), ('sec', 0.041), ('asymptotic', 0.041), ('borgwardt', 0.041), ('teo', 0.041), ('assistance', 0.04), ('immigration', 0.04), ('zi', 0.04), ('exceed', 0.039), ('heuristics', 0.039), ('embedding', 0.037), ('bandwidth', 0.037), ('ex', 0.037), ('empirical', 0.037), ('distributions', 0.037), ('xj', 0.036), ('hypothesis', 0.036), ('population', 0.036), ('text', 0.036), ('aggregate', 0.035), ('costly', 0.035), ('agriculture', 0.035), ('canadian', 0.035), ('ez', 0.035), ('generalisation', 0.035), ('skewness', 0.035), ('tq', 0.035), ('cost', 0.035), ('approximations', 0.035), ('nite', 0.035), ('mm', 0.034), ('distribution', 0.034), ('discrepancy', 0.033), ('du', 0.033), ('harchaoui', 0.032), ('calibrate', 0.032), ('chebyshev', 0.032), ('zp', 0.032), ('consistent', 0.032), ('operator', 0.031), ('estimates', 0.031), ('lkopf', 0.031), ('cybernetics', 0.03), ('french', 0.03), ('error', 0.029), ('converges', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

2 0.32340088 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of ﬁnite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classiﬁer, thus forming a natural link between the distance between distributions and their ease of classiﬁcation. An important consequence is that a kernel must be characteristic to guarantee classiﬁability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive deﬁnite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classiﬁer. The generalized MMD is shown to have consistent ﬁnite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1

3 0.12373457 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is ﬁrst learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the ﬁrst stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efﬁciency of the method. 1

4 0.10646361 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions deﬁned on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V , the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important. 1 Motivation and Outline Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f (x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law ∝ n−α with an exponent α ≤ 1 that depends on the dimensionality n of the space as well as the smoothness properties of the function f (x) as encoded in the covariance function. But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f (x) some measure of secondary structure or biological function), a research paper (with f (x) related to impact), a web page (with f (x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this 1 problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x )2 /(2σ 2 )], but we ﬁnd that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions. 2 Kernels on graphs and trees We assume that we are trying to learn a function deﬁned on the vertices of a graph. Vertices are labelled by i = 1 . . . V , instead of the generic input label x we used in the introduction, and the associated function values are denoted fi ∈ R. By taking the prior P (f ) over these functions f = (f1 , . . . , fV ) as a (zero mean) Gaussian process we are saying that P (f ) ∝ exp(− 1 f T C −1 f ). 2 The covariance function or kernel C is then, in our graph setting, just a positive deﬁnite V × V matrix. The graph structure is characterized by a V × V adjacency matrix, with Aij = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that Aij = Aji , V and there are no self-loops (Aii = 0). The degree of each node is then deﬁned as di = j=1 Aij . The covariance kernels we discuss in this paper are the natural generalizations of the squaredexponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, deﬁned as L = 1 − D −1/2 AD −1/2 , where D is a diagonal matrix with entries d1 , . . . , dV and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0,2] (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2) C ∝ (1 − a−1 L)p = 1 − a−1 1 + a−1 D −1/2 AD −1/2 p (1) while the diffusion kernel is given by 1 C ∝ exp − 2 σ 2 L ∝ exp 1 2 −1/2 AD −1/2 2σ D (2) We will always normalize these so that (1/V ) i Cii = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity. To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then Aij /dj . The transition matrix after s steps follows as (AD −1 )s : its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1): p p ( p )a−s (1−a−1 )p−s (D −1/2 AD −1/2 )s = D −1/2 s C∝ s=0 ( p )a−s (1−a−1 )p−s (AD −1 )s D 1/2 s s=0 (3) Thus C is essentially a random walk transition matrix, averaged over the number of steps s with s ∼ Binomial(p, 1/a) 2 (4) a=2, d=3 K1 1 1 Cl,p 0.9 p=1 p=2 p=3 p=4 p=5 p=10 p=20 p=50 p=100 p=200 p=500 p=infty 0.8 0.6 0.4 d=3 0.8 0.7 0.6 a=2, V=infty a=2, V=500 a=4, V=infty a=4, V=500 0.5 0.4 0.3 0.2 0.2 ln V / ln(d-1) 0.1 0 0 5 10 l 0 15 1 10 p/a 100 1000 Figure 1: (Left) Random walk kernel C ,p plotted vs distance along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C ,p→∞ = 1. (Right) Numerical results for average covariance K1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”. To obtain the actual C the resulting averaged transition matrix is premultiplied by D −1/2 and postmultiplied by D 1/2 , which ensures that the kernel C is symmetric. For the diffusion kernel, one ﬁnds an analogous result but the number of random walk steps is now distributed as s ∼ Poisson(σ 2 /2). This implies in particular that the diffusion kernel is the limit of the p-step kernel for p, a → ∞ at constant p/a = σ 2 /2. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases. In the limit of a large number of steps s, the random walk on a graph will reach its stationary distribution p∞ ∝ De where e = (1, . . . , 1). (This form of p∞ can be veriﬁed by checking that it remains unchanged after multiplication with the transition matrix AD −1 .) The s-step transition matrix for large s is then p∞ eT = DeeT because we converge from any starting vertex to the stationary distribution. It follows that for large p or σ 2 the covariance kernel becomes C ∝ D 1/2 eeT D 1/2 , i.e. Cij ∝ (di dj )1/2 . This is consistent with the interpretation of σ or (p/a)1/2 as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel Cij is essentially independent of the distance (along the graph) between the vertices i and j, and the function f becomes fully correlated across the graph. (Explicitly f = vD 1/2 e under the prior, with v a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as p or σ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices A which are regular in the sense that they give for each vertex the same degree, di = d. A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V ) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance between two nodes i and j. Denoting this kernel value C ,p for a p-step random walk kernel, one has then C ,p=0 = δ ,0 and γp+1 C0,p+1 γp+1 C ,p+1 = = 1− 1 ad C 1 a C0,p + −1,p 1 a + 1− C1,p 1 a C (5) ,p + d−1 ad C +1,p for ≥1 (6) where γp is chosen to achieve the desired normalization C0,p = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more longranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as C ,p→∞ = [1 + (d − 1)/d](d − 1)− /2 (7) 3 and is also plotted in the ﬁgure, showing good agreement with the numerical iteration. There are (at least) two ways of obtaining the result (7). One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)− /2 ). Another route is to ﬁnd the steady state of the recursion for C ,p . This is easy to do but requires as input the unknown steady state value of γp . To determine this, one can map from C ,p to the total random walk probability S ,p in each “shell” of vertices at distance from the starting vertex, changing variables to S0,p = C0,p and S ,p = d(d − 1) −1 C ,p ( ≥ 1). Omitting the factors γp , this results in a recursion for S ,p that simply describes a biased random walk on = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current , probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point = 0 is a reﬂecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S ,p for large p is peaked around the average position of the walk, = p(d − 2)/(ad). For smaller than this S ,p has a tail behaving as ∝ (d − 1) /2 , and converting back to C ,p gives the large- scaling of C ,p→∞ ∝ (d − 1)− /2 ; this in turn ﬁxes the value of γp→∞ and so eventually gives (7). The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C ,p→∞ = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth has V = 1 + d(d − 1) −1 nodes. So a regular graph can be tree-like at most out to ≈ ln(V )/ ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when p/a ≈ ln(V )/ ln(d − 1) (8) To check this prediction, we measure the analogue of C1,p on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C1,p on a tree is K1 = Cij / Cii Cjj for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we ﬁnd the results in Fig. 1(right). The results for K1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of K1 , which eventually – now that p/a is large enough for the kernel to “notice” the loops - approach the fully correlated limit K1 = 1. 3 Learning curves We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f ∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (iµ , fi∗ + ξµ ) where ξµ is i.i.d. Gaussian noise µ of variance σ 2 ; the distribution of training inputs iµ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . , n takes place using the prior deﬁned by C and a Gaussian likelihood with noise variance σ 2 . We thus assume an inference model that is matched to the data generating process. This is obviously an over-simpliﬁcation but is appropriate for the present ﬁrst exploration of learning curves on random graphs. We emphasize that as n is increased we see more and more function values from the same graph, which is ﬁxed by the problem domain; the graph does not grow. ˆ The generalization error is the squared difference between the estimated function fi and the target fi∗ , averaged across the (uniform) input distribution, the posterior distribution of f ∗ given D, the distribution of datasets D, and ﬁnally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1] (n) = V −1 Cii − k(i)T Kk−1 (i) i 4 D,graphs (9) where the average is over data sets and over graphs, K is an n × n matrix with elements Kµµ = Ciµ ,iµ + σ 2 δµµ and k(i) is a vector with entries kµ (i) = Ci,iµ . The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as speciﬁed by p, a and σ 2 . We ﬁx d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d. Exact prediction of learning curves by analytical calculation is very difﬁcult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by ﬁrst expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6] =g n + σ2 V , g(h) = (λ−1 + h)−1 α (10) α=1 This equation for has to be solved self-consistently because also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a ﬁxed graph is unchanged from [3], provided the kernel eigenvalues λα appearing in the function g(h) are deﬁned appropriately, by the eigenfunction condition Cij φj = λφi ; the average here is over the input distribution, i.e. . . . = V −1 j . . . From the deﬁnition (1) of the p-step kernel, we see that then λα = κV −1 (1 − λL /a)p in terms of the corα responding eigenvalue of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention α λα = Cjj = 1. Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14] L ρ(λ ) = 4(d−1) − (λL − 1)2 d2 2πdλL (2 − λL ) (11) in the range λL ∈ [λL , λL ], λL = 1 + 2d−1 (d − 1)1/2 , where the term under the square root is ± + − positive. (There are also two isolated eigenvalues λL = 0, 2 but these have weight 1/V each and so can be ignored for large V .) Rewriting (10) as = V −1 α [(V λα )−1 + (n/V )( + σ 2 )−1 ]−1 and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve: = dλL ρ(λL )[κ−1 (1 − λL /a)−p + ν/( + σ 2 )]−1 (12) with κ determined from κ dλL ρ(λL )(1 − λL /a)p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V , i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where σ 2 and the generalization error will be essentially 2 independent of σ ; and another for large ν where σ 2 so that can be neglected on the r.h.s. and one has a fully explicit expression for . We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct to the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately. To see why, we exploit that for large n (where σ 2 ) the approximation (9) effectively neglects ﬂuctuations in the training input “density” of a randomly drawn set of training inputs [3, 6]. This is justiﬁed in the graph case for large ν = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ), has negligible relative ﬂuctuations away from its mean ν. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large n. Fig. 2(right) illustrates that for larger a the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps p/a of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values (a → ∞, Cij = δij ). In that limit and for low σ 2 and large V the 5 V=500 (filled) & 1000 (empty), d=3, a=2, p=10 V=500, d=3, a=4, p=10 0 0 10 10 ε ε -1 -1 10 10 -2 10 -2 10 2 σ = 0.1 2 σ = 0.1 2 -3 10 σ = 0.01 2 σ = 0.01 -3 10 2 σ = 0.001 2 σ = 0.001 2 -4 10 2 σ = 0.0001 σ = 0.0001 -4 10 2 σ =0 -5 2 σ =0 -5 10 0.1 1 ν=n/V 10 10 0.1 1 ν=n/V 10 Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500 (small ﬁlled circles) and V = 1000 (empty circles) vertices. Plotting generalization error versus ν = n/V superimposes the results for both values of V , as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4. 2 V=500, d=3, a=2, p=20 0 0 V=500, d=3, a=2, p=200, σ =0.1 10 10 ε ε simulation -1 2 10 1/(1+n/σ ) theory (tree) theory (eigenv.) -1 10 -2 10 2 σ = 0.1 -3 10 -4 10 -2 10 2 σ = 0.01 2 σ = 0.001 2 σ = 0.0001 -3 10 2 σ =0 -5 10 -4 0.1 1 ν=n/V 10 10 1 10 100 n 1000 10000 Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500, and kernel parameters a = 2, p = 20; noise level σ 2 as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for a single random graph, with σ 2 = 0.1. Dotted line: naive estimate = 1/(1 + n/σ 2 ). Dashed line: approximation (10) using the tree spectrum and the large p-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues λL as input. α true learning curve is = exp(−ν), reﬂecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict = max{1 − ν, 0}, i.e. a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error. Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ∝ 1/ν in this regime. This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ− = κV −1 (1−λL /a)p : for ν σ 2 /(V λ− ), the square bracket + 6 in (12) can then be approximated by ν/( +σ 2 ) and one gets (because also regime) ≈ σ 2 /ν. σ 2 in the asymptotic However, once p becomes reasonably large, V λ− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10−30 . The “terminal” asymptotic regime ≈ σ 2 /ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields ∝ (cσ 2 /ν) ln3/2 (ν/(cσ 2 )), c ∝ p−3/2 (13) This shows that there are logarithmic corrections to the naive σ 2 /ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefﬁcient c with p, which implies that to reach a speciﬁed (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ 2 ∝ p−3/2 σ 2 . Even though the covariance kernel C ,p – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C ,p converges to the limit (7) only at ﬁxed , while in the tail ∝ p, it continues to change. For ﬁnite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, Cij ∝ (di dj )1/2 which in the regular case simpliﬁes to Cij = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λα=1 = 1, one would predict = 1/(1 + n/σ 2 ). Fig. 3(right) shows, however, that this signiﬁcantly underestimates the actual generalization error, even though for this graph λα=1 = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one ﬁnds indeed (see Appendix, Eq. (17)) that ∝ (c σ 2 /n) ln3/2 (n/c σ 2 ) where now n enters rather than ν = n/V , c being a constant dependent only on p and a: informally, the function to be learned only has a ﬁnite (rather than ∝ V ) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data Fig. 3(right), as the dashed line in the ﬁgure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned. 4 Summary and Outlook We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree d. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when p – or the corresponding lengthscale σ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of p around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is ν = n/V , the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν 1. Quantitatively, we found that the learning curves decay as ∝ σ 2 /ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν −α with α < 1, as in the Euclidean case, do not appear. 7 We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ 2 /n with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artiﬁcial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ≈ σ 2 , which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters. Appendix We sketch here how to derive (13) from (12) for large p. Eq. (12) writes = g(νV /( + σ 2 )) with λL + g(h) = dλL ρ(λL )[κ−1 (1 − λL /a)−p + hV −1 ]−1 (14) λL − and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning ﬁrst to g(0), the factor (1 − λL /a)p decays quickly to zero as λL increases above λL . One can then approximate this factor according to (1 − λL /a)p [(a − λL )/(a − λL )]p ≈ − − − (1 − λL /a)p exp[−(λL − λL )p/(a − λL )]. In the regime near λL one can also approximate the − − − − spectral density (11) by its leading square-root increase, ρ(λL ) = r(λL − λL )1/2 , with r = (d − − 1)1/4 d5/2 /[π(d − 2)2 ]. Switching then to a new integration variable y = (λL − λL )p/(a − λL ) and − − extending the integration limit to ∞ gives ∞ √ 1 = g(0) = κr(1 − λL /a)p [p/(a − λL )]−3/2 dy y e−y (15) − − 0 and this ﬁxes κ. Proceeding similarly for h > 0 gives ∞ g(h) = κr(1−λL /a)p [p/(a−λL )]−3/2 F (hκV −1 (1−λL /a)p ), − − − F (z) = √ dy y (ey +z)−1 0 (16) Dividing by g(0) = 1 shows that simply g(h) = F (hV −1 c−1 )/F (0), where c = 1/[κ(1 − σ2 λL /a)p ] = rF (0)[p/(a − λL )]−3/2 which scales as p−3/2 . In the asymptotic regime − − 2 2 we then have = g(νV /σ ) = F (ν/(cσ ))/F (0) and the desired result (13) follows from the large-z behaviour of F (z) ≈ z −1 ln3/2 (z). One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get (V + hκ)−1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (hκV −1 (1 − λL /a)p ) − − − g(h) = (17) V −1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (0) − − The denominator here is κ−1 and the two terms are proportional respectively to the covariance kernel eigenvalue λ1 , corresponding to λL = 0 and the constant eigenfunction, and to 1−λ1 . Dropping the 1 ﬁrst terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where λ1 is close to unity, we have κ ≈ V and so g(h) ≈ (1 + h)−1 + rV (1 − λL /a)p [p/(a − λL )]−3/2 F (h(1 − λL /a)p ) (18) − − − The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to ﬁne detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: = g(n/σ 2 ) ∝ F (n/(c σ 2 )) with c = (1 − λL /a)−p independent of V . The large-n tail of the learning curve in − Fig. 3(right) is consistent with this form. 8 References [1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. [2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997. [3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press. [5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000. [6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press. [7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press. [8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002. [9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press. [10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artiﬁcial Intelligence, pages 199–210. 2005. [11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM. [12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Sch¨ lkopf, o editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003. [13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann. [14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. Americal Mathematical Society, 1997. [15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999. [16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999. [17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic brownian motion. J. Phys. A, 29(10):2399–2409, 1996. [18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008. 9

5 0.094979838 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.

6 0.091000974 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation

7 0.079348385 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

8 0.077663504 78 nips-2009-Efficient Moments-based Permutation Tests

9 0.073777556 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

10 0.067732356 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models

11 0.067381456 72 nips-2009-Distribution Matching for Transduction

12 0.064477101 95 nips-2009-Fast subtree kernels on graphs

13 0.062329847 119 nips-2009-Kernel Methods for Deep Learning

14 0.061887793 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis

15 0.059105642 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

16 0.058018405 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

17 0.057679985 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

18 0.057054147 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

19 0.056602471 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

20 0.056574967 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.184), (1, 0.079), (2, -0.038), (3, 0.092), (4, -0.051), (5, -0.1), (6, -0.029), (7, 0.121), (8, -0.1), (9, 0.002), (10, 0.103), (11, -0.122), (12, 0.066), (13, 0.029), (14, -0.028), (15, -0.035), (16, -0.104), (17, 0.09), (18, 0.07), (19, 0.029), (20, -0.002), (21, 0.11), (22, -0.118), (23, -0.084), (24, -0.191), (25, 0.175), (26, 0.051), (27, 0.008), (28, -0.127), (29, 0.155), (30, -0.068), (31, -0.218), (32, -0.061), (33, -0.008), (34, 0.064), (35, 0.14), (36, 0.02), (37, -0.016), (38, -0.029), (39, -0.001), (40, -0.036), (41, 0.032), (42, 0.104), (43, -0.0), (44, -0.028), (45, -0.012), (46, 0.019), (47, -0.033), (48, -0.023), (49, -0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94174832 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

2 0.81757998 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

3 0.59410417 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

4 0.55678874 128 nips-2009-Learning Non-Linear Combinations of Kernels

Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh

5 0.53300184 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

Author: Liefeng Bo, Cristian Sminchisescu

Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classiﬁers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their signiﬁcant computational cost. To address this problem, we propose efﬁcient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a setlevel feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the speciﬁed kernel function. Classiﬁers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efﬁcient and achieve the current state of the art in three difﬁcult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1

6 0.48196107 119 nips-2009-Kernel Methods for Deep Learning

7 0.45189592 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

8 0.44661519 78 nips-2009-Efficient Moments-based Permutation Tests

9 0.43061104 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

10 0.42886776 206 nips-2009-Riffled Independence for Ranked Data

11 0.41141146 95 nips-2009-Fast subtree kernels on graphs

12 0.40707186 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

13 0.39602867 72 nips-2009-Distribution Matching for Transduction

14 0.38608763 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

15 0.38457426 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations

16 0.35985988 33 nips-2009-Analysis of SVM with Indefinite Kernels

17 0.34051996 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

18 0.33877078 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

19 0.3370806 69 nips-2009-Discrete MDL Predicts in Total Variation

20 0.33575562 11 nips-2009-A General Projection Property for Distribution Families

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.012), (24, 0.053), (25, 0.069), (35, 0.064), (36, 0.079), (39, 0.039), (49, 0.312), (58, 0.097), (61, 0.014), (71, 0.05), (81, 0.015), (86, 0.104)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79816538 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

Author: Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

2 0.73756462 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games

Author: Marc Lanctot, Kevin Waugh, Martin Zinkevich, Michael Bowling

Abstract: Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efﬁcient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-speciﬁc augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-speciﬁc versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR on expectation. Then, we introduce two sampling schemes: outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR’s bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games. 1

3 0.57293367 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur

4 0.53723902 104 nips-2009-Group Sparse Coding

Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow

Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classiﬁcation dataset show that when compact image or dictionary representations are needed for computational efﬁciency, the proposed approach yields better mean average precision in classiﬁcation. 1

5 0.53372777 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

6 0.53363526 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

7 0.53026253 137 nips-2009-Learning transport operators for image manifolds

8 0.52866507 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

9 0.52824759 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

10 0.52793908 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

11 0.52775121 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

12 0.52594775 113 nips-2009-Improving Existing Fault Recovery Policies

13 0.52520251 100 nips-2009-Gaussian process regression with Student-t likelihood

14 0.52494681 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

15 0.52427006 70 nips-2009-Discriminative Network Models of Schizophrenia

16 0.52361494 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

17 0.52324164 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

18 0.5231446 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

19 0.52305037 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

20 0.52287 87 nips-2009-Exponential Family Graph Matching and Ranking