nips nips2009 nips2009-148 knowledge-graph by maker-knowledge-mining

148 nips-2009-Matrix Completion from Power-Law Distributed Samples


Source: pdf

Author: Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Abstract: The low-rank matrix completion problem is a fundamental problem with many important applications. Recently, [4],[13] and [5] obtained the first non-trivial theoretical results for the problem assuming that the observed entries are sampled uniformly at random. Unfortunately, most real-world datasets do not satisfy this assumption, but instead exhibit power-law distributed samples. In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. Our method is simpler to analyze than previous methods with the analysis reducing to computing the threshold for complete cascades in random graphs, a problem of independent interest. By analyzing the graph theoretic problem, we show that our method achieves exact recovery when the observed entries are sampled from the Chung-Lu-Vu model, which can generate power-law distributed graphs. We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. Furthermore, our method is easy to implement and is substantially faster than existing methods. We demonstrate the effectiveness of our method on random instances where the low-rank matrix is sampled according to the prevalent random graph models for complex networks and present promising preliminary results on the Netflix challenge dataset. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The low-rank matrix completion problem is a fundamental problem with many important applications. [sent-4, score-0.265]

2 In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. [sent-7, score-0.49]

3 We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. [sent-10, score-0.613]

4 Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. [sent-14, score-0.265]

5 Here we address the matrix completion problem under the natural assumption that the underlying matrix is low-rank. [sent-15, score-0.324]

6 Formally, for an unknown matrix M ∈ R^(m×n) of rank at most k, given Ω ⊆ [m] × [n], PΩ(M) and k, the low-rank matrix completion problem is to find a matrix X ∈ R^(m×n) such that rank(X) ≤ k and PΩ(X) = PΩ(M). [sent-16, score-0.462]

7 For Ω ⊆ [m]×[n], let the sampling graph GΩ = (U, V, Ω) be the bipartite graph with vertices U = {u1 , . [sent-25, score-0.517]

8 Then, assumption [A2] can be reformulated as follows: A3 The sampling graph GΩ is an Erdős-Rényi random graph. [sent-32, score-0.348]

9 A prominent feature of Erdős-Rényi graphs is that the degrees of vertices are Poisson distributed and are sharply concentrated about their mean. [sent-33, score-0.646]

10 However, for most large real-world graphs such as the World Wide Web ([1]), the degree distribution deviates significantly from the Poisson distribution and has high variance. [sent-35, score-0.24]

11 That is, the number of vertices of degree d is proportional to d^(−β) for a constant β (Figure 1). [sent-38, score-0.265]

12 In this paper, we overcome some of the shortcomings of assumption [A3] above by considering more realistic random graph models for the sampling graph GΩ . [sent-39, score-0.253]

13 We propose a natural graph theoretic approach for matrix completion (referred to as ICMC for information cascading matrix completion) that we prove can handle sampling graphs with power-law distributed degrees. [sent-40, score-1.011]

14 Our approach is motivated by the models for information cascading in social networks proposed by Kempe et al. [sent-41, score-0.305]

15 Moreover, the analysis of ICMC reduces to the problem of finding density thresholds for complete cascades in random graphs - a problem of independent interest. [sent-43, score-0.27]

16 By analyzing the threshold for complete cascades in the random graph model of Chung, Lu & Vu [6] (CLV model), we show that ICMC solves the matrix completion problem for sampling graphs drawn from the CLV model. [sent-44, score-0.743]

17 On the other hand, for Erdős-Rényi graphs the density requirements for ICMC are stronger than those of the above papers. [sent-47, score-0.227]

18 We also empirically investigate the threshold for complete cascading in other popular random graph models such as the preferential attachment model [1], the forest-fire model [17] and the affiliation networks model [16]. [sent-48, score-0.671]

19 The empirical estimates we obtain for the threshold for complete cascading in the preferential attachment model strongly suggest that ICMC solves the exact matrix-completion problem from an optimal number of entries for sampling procedures with preferential attachment. [sent-49, score-0.89]

20 Our experiments demonstrate that for sampling graphs drawn from more realistic models such as the preferential attachment, forest-fire and affiliation network models, ICMC outperforms - both in accuracy and time - the methods of [4, 5, 3, 13] by an order of magnitude. [sent-50, score-0.386]

21 In summary, our main contributions are: • We formulate the sampling process in matrix completion as generating random graphs (GΩ ) and demonstrate that the sampling assumption [A3] does not hold for real-world datasets. [sent-51, score-0.599]

22 • We propose a novel graph theoretic approach to matrix completion (ICMC) that extensively uses the link structure of the sampling graph. [sent-52, score-0.455]

23 • We prove that our method solves the matrix completion problem exactly for sampling graphs generated from the CLV model which can generate power-law distributed graphs. [sent-54, score-0.593]

24 2 Previous Work and Preliminaries The Netflix challenge has recently drawn much attention to the low-rank matrix completion problem. [sent-56, score-0.293]

25 We consider the Erdős-Rényi model, where each edge (ui, vj) is present in E independently with probability p for (i, j) ∈ [m] × [n], and p is the density parameter. [sent-59, score-0.352]

26 al. do not apply to the case of matrix completion, as the constraints in matrix completion do not satisfy the restricted isometry property they assume. [sent-64, score-0.53]

27 [13] also obtained similar results for matrix completion using different techniques that generalize the works of Friedman et al. [sent-69, score-0.265]

28 However, the results of [13] crucially rely on the regularity of Erdős-Rényi graphs and do not extend to sampling graphs with skewed degree distributions, even for rank-one matrices. [sent-71, score-0.804]

29 and Feige and Ofek on the spectral gap of Erdős-Rényi graphs do not hold for graph models with skewed expected degrees (see [6, 19]). [sent-73, score-0.538]

30 A similar observation was made in [19] and [10], which address the problem of re-weighting the edges of graphs with skewed degrees in the context of LSA. [sent-75, score-0.316]

31 Random Graph Models: We focus on four popular models of random graphs, all of which can generate graphs with power-law distributed degrees. [sent-77, score-0.396]

32 al [6], and refer to the original papers for the preferential attachment [1], forest-fire [17] and affiliation networks [16] models. [sent-80, score-0.247]

33 The CLV model [6] generates graphs with arbitrary expected degree sequences, p1 , . [sent-81, score-0.24]

34 , vn } is generated by independently placing an edge between vertices ui , vj with probability pi qj /w for all i ∈ [m], j ∈ [n]. [sent-100, score-0.488]

35 The CLV model is more general than the standard Erdős-Rényi model, with the case pi = np, qj = mp corresponding to the standard Erdős-Rényi model with density p for bipartite random graphs. [sent-105, score-0.528]
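
To make the CLV sampling model described above concrete, here is a minimal NumPy sketch of generating a bipartite CLV graph. The function name clv_bipartite, the choice w = sum(p), and the assumption that the expected degree sequences satisfy sum(p) = sum(q) and pi*qj ≤ w are our own fill-ins for details the extracted sentences do not spell out; this is a sketch, not the paper's implementation.

    import numpy as np

    def clv_bipartite(p, q, rng=None):
        # p: expected degrees of the left vertices u_1..u_m
        # q: expected degrees of the right vertices v_1..v_n
        # Edge (u_i, v_j) is placed independently with probability p_i * q_j / w.
        # We take w = sum(p) (assumed equal to sum(q)); probabilities are clipped to [0, 1].
        rng = np.random.default_rng() if rng is None else rng
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        w = p.sum()
        prob = np.clip(np.outer(p, q) / w, 0.0, 1.0)   # m x n matrix of edge probabilities
        return rng.random(prob.shape) < prob           # boolean m x n adjacency matrix

Setting pi = np and qj = mp in this sketch recovers the bipartite Erdős-Rényi model with density p, as noted in the sentence above.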

36 Consider the following standard formulation of the low-rank matrix completion problem: Given k, Ω, PΩ (M ) for a rank k matrix M , find X, Y such that PΩ (XY T ) = PΩ (M ), X ∈ Rm×k , Y ∈ Rn×k . [sent-108, score-0.403]

37 A rank-k matrix Z is non-degenerate if there exist X ∈ R^(m×k), Y ∈ R^(n×k) with Z = XY^T such that any k rows of X are linearly independent and any k rows of Y are linearly independent. [sent-116, score-0.26]

38 Call a vertex ui ∈ U infected if the i'th row of X has been computed (the term infected is used to reflect that infection spreads by contact, as in an epidemic). [sent-122, score-0.833]

39 Similarly, call a vertex vj ∈ V infected if the j'th row of Y has been computed. [sent-123, score-0.493]

40 Casting the condition in terms of the sampling graph GΩ, y_j^T can be computed and vertex vj ∈ V marked as infected if there are at least k edges from vj to infected vertices in L. [sent-128, score-1.377]

41 Analogously, x_i^T can be computed and the vertex ui ∈ U marked as infected if there are at least k edges from ui to previously infected vertices in R. [sent-129, score-1.149]

42 This suggests the following cascading procedure for infecting vertices in GΩ and progressively computing the rows of X, Y . [sent-132, score-0.587]

43 ICMC(GΩ , PΩ (M ), L0 ): 1 Start with initially infected sets L = L0 ⊆ U , R = ∅. [sent-134, score-0.351]

44 2 Repeat until convergence: (a) Mark as infected all uninfected vertices in V that have at least k edges to previously infected vertices in L, and add the newly infected vertices to R. [sent-136, score-1.818]

45 (b) For each newly infected vertex vj ∈ R, compute the j’th row of Y using the observed entries of M corresponding to edges from vj to L. [sent-137, score-0.715]

46 (c) Mark as infected all uninfected vertices in U that have at least k edges to previously infected vertices in R, and add the newly infected vertices to L. [sent-138, score-1.818]

47 (d) For each newly infected vertex ui ∈ L, compute the i'th row of X using the observed entries of M corresponding to edges from ui to R. 3 Output M′ = XY^T. [sent-139, score-0.693]
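
A compact Python/NumPy sketch of the ICMC loop in steps 1-3 above is given below. Everything not stated in the extracted sentences is our assumption: the dictionary representation of the observed entries, the use of least squares (so the same code also works when more than k entries are available), and the initialization of the seed rows X(L0, :) to the identity, which is valid up to the usual change-of-basis ambiguity in the factorization M = XY^T.

    import numpy as np
    from collections import defaultdict

    def icmc(m, n, k, M_obs, L0):
        # M_obs: dict mapping observed index pairs (i, j) -> M[i, j] (this plays the role of P_Omega(M))
        # L0: list of k seed row indices (the initially infected subset of U)
        row_nbrs, col_nbrs = defaultdict(list), defaultdict(list)
        for (i, j) in M_obs:                       # adjacency lists of the sampling graph G_Omega
            row_nbrs[i].append(j)
            col_nbrs[j].append(i)

        X, Y = np.zeros((m, k)), np.zeros((n, k))
        L, R = set(L0), set()
        for r, i in enumerate(L0):                 # assumed seed initialization: X(L0, :) = I_k
            X[i, r] = 1.0

        changed = True
        while changed:                             # "repeat until convergence"
            changed = False
            for j in range(n):                     # steps (a)-(b): infect columns, solve for rows of Y
                if j in R:
                    continue
                obs = [i for i in col_nbrs[j] if i in L]
                if len(obs) >= k:
                    A = X[obs, :]
                    b = np.array([M_obs[(i, j)] for i in obs])
                    Y[j, :] = np.linalg.lstsq(A, b, rcond=None)[0]
                    R.add(j)
                    changed = True
            for i in range(m):                     # steps (c)-(d): infect rows, solve for rows of X
                if i in L:
                    continue
                obs = [j for j in row_nbrs[i] if j in R]
                if len(obs) >= k:
                    A = Y[obs, :]
                    b = np.array([M_obs[(i, j)] for j in obs])
                    X[i, :] = np.linalg.lstsq(A, b, rcond=None)[0]
                    L.add(i)
                    changed = True
        return X, Y                                # M' = X @ Y.T on the infected rows/columns

If the cascade stops before infecting all of U and V, the rows of X and Y for the remaining vertices are left uncomputed, matching the discussion of U\L and V\R later in the text.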

48 We abstract the cascading procedure from above using the framework of Kempe et al. [sent-140, score-0.281]

49 The influence of a set A ⊆ W, σG,k(A), is the number of vertices infected by the cascading process upon termination when starting at A. [sent-145, score-0.837]

50 We say A is completely cascading of order k if σG,k (A) = |W |. [sent-147, score-0.323]

51 We remark that using a variant of the standard depth-first search algorithm, the cascading process above can be computed in linear time for any set A. [sent-148, score-0.309]
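
A minimal sketch of computing the influence σG,k(A) in time linear in the number of edges is shown below. The extracted sentence mentions a depth-first-search variant; the version here uses a queue with per-vertex counters, an equivalent linear-time formulation chosen by us rather than taken from the paper.

    from collections import deque, defaultdict

    def influence(adj, A, k):
        # adj: dict mapping each vertex of W to a list of its neighbours
        # A: iterable of seed vertices; k: infection threshold
        infected = set(A)
        counts = defaultdict(int)        # number of infected neighbours seen so far
        queue = deque(infected)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in infected:
                    continue
                counts[v] += 1
                if counts[v] >= k:       # v has at least k infected neighbours: infect it
                    infected.add(v)
                    queue.append(v)
        return len(infected)             # sigma_{G,k}(A)

    def completely_cascading(adj, A, k):
        return influence(adj, A, k) == len(adj)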

52 From the discussion preceding ICMC it follows that ICMC recovers M exactly if the cascading process starting at L0 infects all vertices of GΩ and we get the following theorem. [sent-149, score-0.547]

53 Then, given GΩ = (U, V, Ω), PΩ (M ) and L0 ⊆ U with |L0 | = k, ICMC(GΩ , PΩ (M ), L0 ) recovers the matrix M exactly if L0 is a completely cascading set of order k in GΩ . [sent-152, score-0.416]

54 Thus, we have reduced the matrix-completion problem to the graph-theoretic problem of finding a completely cascading set (if it exists) in a graph. [sent-153, score-0.323]

55 However, it appears that for most reasonable random graph models, the highest degree vertices have large influence with high probability. [sent-157, score-0.389]

56 In the following we investigate completely cascading sets in random graphs and show that for CLV graphs, the k highest degree vertices form a completely cascading set with high probability. [sent-158, score-1.127]

57 Information Cascading in Random Graphs: We now show that for sufficiently dense CLV graphs and fixed k, the k highest degree vertices form a completely cascading set with high probability. [sent-159, score-0.804]

58 Then, for G = (U, V, E) generated from the model, the k highest degree vertices of U form a completely cascading set of order k with probability at least 1 − n^(−γ). [sent-169, score-0.624]

59 Proof sketch We will show that the highest weight vertices L0 = {u1 , . [sent-170, score-0.241]

60 , uk } form a completely cascading set with high probability; the theorem follows from the above statement and the observation that the highest degree vertices of G will almost surely correspond to vertices with large weights in the model; we omit these details for lack of space. [sent-173, score-0.86]

61 Fix a vertex ui ∉ L0 and consider an arbitrary vertex vj ∈ V. [sent-175, score-0.273]

62 Let P_j^i be the indicator variable that is 1 if (ui, vj) ∈ E and vj is connected to all vertices of L0. [sent-176, score-0.351]

63 Note that vertex ui will be infected after two rounds by the cascading process starting at L0 if Σ_j P_j^i ≥ k. [sent-177, score-0.763]

64 $E[P_1^i + \cdots + P_n^i] = \sum_{j=1}^{n} \frac{p_i q_j}{w} \prod_{l \le k} \frac{p_l q_j}{w} = \frac{p_i \prod_{l \le k} p_l}{w^{k+1}} \cdot \sum_{j=1}^{n} q_j^{k+1}$. [sent-181, score-0.228]

65 , um, the probability that there is an uninfected vertex in the left partition after two steps of cascading starting from L0 is at most 1/(2m^γ). [sent-221, score-0.441]

66 The theorem now follows by observing that if the left partition is completely infected, then for a suitably large constant c(γ) all vertices in the right partition will be infected with probability at least 1 − 1/(2m^γ), as qj ≥ c(γ) k log n. [sent-222, score-0.690]
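
For reference, the extracted proof-sketch steps assemble as follows; this is our reading of the argument in LaTeX, and the concentration step that turns the expectation bound into the per-vertex failure probability is implicit in the sentences above rather than quoted from them.

    \[
    \mathbb{E}\Big[\sum_{j=1}^{n} P_j^i\Big]
      = \frac{p_i \prod_{l \le k} p_l}{w^{k+1}} \sum_{j=1}^{n} q_j^{k+1},
    \qquad
    \Pr\big[\exists\, u_i \notin L_0 \text{ uninfected after two rounds}\big] \le \tfrac{1}{2m^{\gamma}},
    \]
    \[
    \Pr\big[\exists\, v_j \text{ uninfected} \mid U \text{ fully infected}\big] \le \tfrac{1}{2m^{\gamma}}
    \quad\text{(using } q_j \ge c(\gamma)\, k \log n\text{)},
    \]
    \[
    \Pr\big[L_0 \text{ is completely cascading of order } k\big] \ge 1 - \tfrac{1}{m^{\gamma}}.
    \]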

67 1 we obtain exact matrix-completion for sampling graphs drawn from the CLV model. [sent-224, score-0.257]

68 Then, for sampling graphs GΩ generated from a CLV model satisfying the conditions of Theorem 4. [sent-227, score-0.257]

69 However, when the expected degrees qj are skewed, say with a power-law distribution, it should be possible to obtain much better bounds than those of Equation (4. [sent-234, score-0.226]

70 Thus, in a sense the Erdős-Rényi graphs are the worst-case examples for our analysis. [sent-236, score-0.363]

71 Our empirical simulations also suggest that completely cascading sets are more likely to exist in random graph models with power-law distributed expected degrees as compared to Erdős-Rényi graphs. [sent-237, score-0.672]

72 • In graphs with power-law distributed degrees, the high degree vertices have much higher degrees than the average degree of the graph. [sent-239, score-0.583]

73 So, infecting the highest degree vertices is more likely to infect more vertices in the first step. [sent-240, score-0.586]

74 • More importantly, as observed in the seminal work of Kleinberg [14], in most real-world graphs there is a small number of vertices (hubs) that have much higher connectivity than most vertices. [sent-241, score-0.385]

75 In particular, as strongly supported by experiments (see Figure 3), we hypothesize that ICMC solves exact matrix completion from an almost optimal number of entries for sampling graphs drawn from the preferential attachment model. [sent-244, score-0.87]

76 For G = (U, V, E) generated from the preferential attachment model with parameters m, n, k1 , k2 , the k highest degree vertices of U form a completely cascading set of order k with high probability. [sent-247, score-0.871]

77 Then, for sampling graphs GΩ generated from a PA model with parameters k1 , k2 ≥ Ck, ICMC recovers the matrix M exactly with high probability. [sent-252, score-0.35]

78 Remark: To solve the matrix completion problem we need to sample at least (m + n)k entries. [sent-253, score-0.265]
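
The remark's lower bound can be seen from a standard parameter-counting argument, sketched here rather than quoted from the paper: a rank-k factorization M = XY^T has

    \[
    \underbrace{mk}_{\text{entries of } X} + \underbrace{nk}_{\text{entries of } Y}
      - \underbrace{k^2}_{\text{ambiguity } (X,Y)\mapsto(XA^{-1},\, YA^{T})}
      = (m+n)k - k^2
    \]

free parameters, so on the order of (m + n)k observed entries are needed to pin M down.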

79 Moreover, the bounds above are stronger than those obtainable - even information theoretically - for Erdős-Rényi graphs, as for Erdős-Rényi graphs we need to sample Ω(n log n) entries even for k = 1. [sent-255, score-0.612]

80 Experimental Results: We first demonstrate that for many real-world matrix completion datasets, the observed entries are far from being sampled uniformly, with the sampling graph having power-law distributed degrees. [sent-256, score-0.532]

81 We then use various random graph models to compare our method against the trace-norm based singular value thresholding algorithm of [3], the spectral matrix completion algorithm (SMC) of [13] and the regularized alternating least squares minimization (ALS) heuristic. [sent-257, score-0.381]

82 Let Lj be the set of vertices in L that have an edge to vj, let L_j^k be any size-k subset of Lj, and let X(L_j^k, :) be the sub-matrix of X containing rows corresponding to vertices in L_j^k. [sent-262, score-0.544]

83 If the underlying matrix is indeed low-rank and there is no noise in the observed entries, then for a newly infected vertex vj, the corresponding row of Y, y_j^T, can be computed by solving the following linear system of equations: M(L_j^k, j) = X(L_j^k, :) y_j. [sent-263, score-0.627]
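
In code, the noiseless row update above is just a k×k solve; the helper below (the names X_sub and m_sub are ours) illustrates it. With noisy observations one would instead use all of Lj and a least-squares solve, as in the ICMC sketch given earlier.

    import numpy as np

    def compute_y_row(X_sub, m_sub):
        # X_sub: the k x k block X(L_j^k, :); m_sub: the k observed entries M(L_j^k, j)
        # Solves X(L_j^k, :) y_j = M(L_j^k, j) for the k unknowns in y_j.
        return np.linalg.solve(X_sub, m_sub)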

84 If L ⊊ U or R ⊊ V, then rows of X and Y will not be computed for vertices in U\L and V\R. [sent-268, score-0.266]

85 Let X = [X_L, X_L̃], where X_L is the set of computed rows of X (for vertices in L) and X_L̃ denotes the remaining rows of X. [sent-269, score-0.327]

86 Table 1: Time required (in seconds) by various methods on synthetic datasets for fixed sampling density, with the sampling graph coming from different graph models: (a) Erdős-Rényi model, (b) Chung-Lu-Vu model, (c) preferential attachment model, and (d) forest-fire model. [sent-344, score-0.590]

87 For both datasets we form the corresponding bipartite sampling graphs and plot the left (users) and right (movies/artists) cumulative degree distributions of the bipartite sampling graphs. [sent-348, score-0.512]

88 The figure clearly shows that the sampling graphs for the Netflix and Yahoo Music datasets are far from regular as assumed in [4],[5],[13] and have power-law distributed degrees. [sent-351, score-0.293]

89 [Figure 3 axis labels: p (Sampling Density); k (Rank of the Matrix); Fraction of infected rows & columns] [sent-355, score-0.412]

90 Figure 3: Left: Fraction of infected nodes as edge density increases. [sent-365, score-0.398]

91 Right: Fraction of infected rows and columns using ICMC for the Netflix challenge dataset. [sent-371, score-0.44]

92 1, and the sampling graphs are generated from the four random graph models. [sent-376, score-0.345]

93 Threshold for Complete Cascading: Here we investigate the threshold for complete cascading in the random graph models. [sent-382, score-0.424]

94 Besides being interesting on its own, the existence of completely cascading sets is closely tied to the success of ICMC by Theorem 3. [sent-383, score-0.323]

95 Figure 3 shows the fraction of vertices infected by the cascading process starting from the k highest degree vertices for graphs generated from the random graph models as the edge density increases. [sent-385, score-1.486]

96 The left plot of Figure 3 shows the existence of a clear threshold for the density p, beyond which the fraction of infected vertices is almost surely one. [sent-386, score-0.722]
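
One way to reproduce this cascade-threshold experiment, reusing the clv_bipartite and influence sketches given earlier, is shown below; the vertex labelling and the density sweep are our own choices, not the authors' experimental code.

    import numpy as np

    def infected_fraction(p_seq, q_seq, k, rng=None):
        # One CLV sample: run the k-threshold cascade from the k highest-degree left vertices
        # and return the fraction of all vertices that end up infected.
        A = clv_bipartite(p_seq, q_seq, rng)
        m, n = A.shape
        adj = {("u", i): [("v", j) for j in range(n) if A[i, j]] for i in range(m)}
        adj.update({("v", j): [("u", i) for i in range(m) if A[i, j]] for j in range(n)})
        seeds = [("u", int(i)) for i in np.argsort(A.sum(axis=1))[-k:]]
        return influence(adj, seeds, k) / (m + n)

    # Sweeping a scale factor on the expected degrees traces out a threshold curve like
    # the left plot of Figure 3 (base_p, base_q and scales are placeholders):
    # fractions = [infected_fraction(s * base_p, s * base_q, k) for s in scales]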

97 As was explained in Section 4, the threshold is larger for the Erdős-Rényi graph model. [sent-388, score-0.326]

98 The right plot of Figure 3 shows the threshold value (the minimum value above which the infected fraction is almost surely one) for k1, k2 as a function of k in the PA model. [sent-389, score-0.470]

99 The rightmost table in Figure 3 shows the fraction of rows and columns infected by ICMC on the dataset for several values of the rank parameter k. [sent-394, score-0.552]

100 Also, for rank 30 the fraction of infected rows and columns drops to almost zero, suggesting that the sampling density of the matrix is below the sampling threshold for rank 30. [sent-396, score-0.918]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('icmc', 0.433), ('infected', 0.351), ('cascading', 0.281), ('clv', 0.254), ('completion', 0.206), ('vertices', 0.205), ('svt', 0.187), ('erd', 0.183), ('nyi', 0.183), ('graphs', 0.18), ('smc', 0.151), ('als', 0.147), ('ix', 0.14), ('preferential', 0.129), ('attachment', 0.118), ('qj', 0.092), ('net', 0.089), ('graph', 0.088), ('rmse', 0.081), ('rank', 0.079), ('xl', 0.078), ('sampling', 0.077), ('vj', 0.073), ('liation', 0.07), ('vertex', 0.069), ('kempe', 0.067), ('uninfected', 0.067), ('recht', 0.067), ('entries', 0.066), ('yr', 0.064), ('ui', 0.062), ('rows', 0.061), ('degree', 0.06), ('matrix', 0.059), ('bipartite', 0.059), ('pi', 0.056), ('threshold', 0.055), ('lj', 0.052), ('yahoo', 0.052), ('music', 0.05), ('edges', 0.049), ('density', 0.047), ('users', 0.046), ('candes', 0.046), ('skewed', 0.045), ('pn', 0.044), ('jon', 0.043), ('keshavan', 0.043), ('cascades', 0.043), ('pa', 0.043), ('completely', 0.042), ('chung', 0.042), ('degrees', 0.042), ('af', 0.042), ('yj', 0.041), ('incomparable', 0.04), ('infect', 0.04), ('infecting', 0.04), ('pji', 0.04), ('poisson', 0.038), ('qn', 0.038), ('kleinberg', 0.038), ('nk', 0.036), ('distributed', 0.036), ('highest', 0.036), ('feige', 0.035), ('solves', 0.035), ('recovers', 0.034), ('newly', 0.034), ('fraction', 0.033), ('pr', 0.032), ('lu', 0.032), ('artists', 0.032), ('hubs', 0.032), ('surely', 0.031), ('pm', 0.03), ('christos', 0.03), ('emmanuel', 0.03), ('ck', 0.03), ('xy', 0.03), ('challenge', 0.028), ('remark', 0.028), ('dataset', 0.028), ('minimization', 0.028), ('clauset', 0.027), ('erdos', 0.027), ('infects', 0.027), ('mihail', 0.027), ('netflix', 0.027), ('ofek', 0.027), ('renyi', 0.027), ('yehuda', 0.027), ('theoretic', 0.025), ('corr', 0.025), ('rm', 0.024), ('lk', 0.024), ('um', 0.024), ('pl', 0.024), ('social', 0.024), ('movies', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

Author: Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Abstract: The low-rank matrix completion problem is a fundamental problem with many important applications. Recently, [4],[13] and [5] obtained the first non-trivial theoretical results for the problem assuming that the observed entries are sampled uniformly at random. Unfortunately, most real-world datasets do not satisfy this assumption, but instead exhibit power-law distributed samples. In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. Our method is simpler to analyze than previous methods with the analysis reducing to computing the threshold for complete cascades in random graphs, a problem of independent interest. By analyzing the graph theoretic problem, we show that our method achieves exact recovery when the observed entries are sampled from the Chung-Lu-Vu model, which can generate power-law distributed graphs. We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. Furthermore, our method is easy to implement and is substantially faster than existing methods. We demonstrate the effectiveness of our method on random instances where the low-rank matrix is sampled according to the prevalent random graph models for complex networks and present promising preliminary results on the Netflix challenge dataset. 1

2 0.11829069 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

Author: John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, Yi Ma

Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision. 1

3 0.11558165 147 nips-2009-Matrix Completion from Noisy Entries

Author: Raghunandan Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here OptSpace. We prove performance guarantees that are order-optimal in a number of circumstances. 1

4 0.093267664 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V , the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important. 1 Motivation and Outline Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f (x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law ∝ n−α with an exponent α ≤ 1 that depends on the dimensionality n of the space as well as the smoothness properties of the function f (x) as encoded in the covariance function. But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f (x) some measure of secondary structure or biological function), a research paper (with f (x) related to impact), a web page (with f (x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this 1 problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x )2 /(2σ 2 )], but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. 
A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions. 2 Kernels on graphs and trees We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by i = 1 . . . V , instead of the generic input label x we used in the introduction, and the associated function values are denoted fi ∈ R. By taking the prior P (f ) over these functions f = (f1 , . . . , fV ) as a (zero mean) Gaussian process we are saying that P (f ) ∝ exp(− 1 f T C −1 f ). 2 The covariance function or kernel C is then, in our graph setting, just a positive definite V × V matrix. The graph structure is characterized by a V × V adjacency matrix, with Aij = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that Aij = Aji , V and there are no self-loops (Aii = 0). The degree of each node is then defined as di = j=1 Aij . The covariance kernels we discuss in this paper are the natural generalizations of the squaredexponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as L = 1 − D −1/2 AD −1/2 , where D is a diagonal matrix with entries d1 , . . . , dV and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0,2] (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2) C ∝ (1 − a−1 L)p = 1 − a−1 1 + a−1 D −1/2 AD −1/2 p (1) while the diffusion kernel is given by 1 C ∝ exp − 2 σ 2 L ∝ exp 1 2 −1/2 AD −1/2 2σ D (2) We will always normalize these so that (1/V ) i Cii = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity. To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then Aij /dj . The transition matrix after s steps follows as (AD −1 )s : its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1): p p ( p )a−s (1−a−1 )p−s (D −1/2 AD −1/2 )s = D −1/2 s C∝ s=0 ( p )a−s (1−a−1 )p−s (AD −1 )s D 1/2 s s=0 (3) Thus C is essentially a random walk transition matrix, averaged over the number of steps s with s ∼ Binomial(p, 1/a) 2 (4) a=2, d=3 K1 1 1 Cl,p 0.9 p=1 p=2 p=3 p=4 p=5 p=10 p=20 p=50 p=100 p=200 p=500 p=infty 0.8 0.6 0.4 d=3 0.8 0.7 0.6 a=2, V=infty a=2, V=500 a=4, V=infty a=4, V=500 0.5 0.4 0.3 0.2 0.2 ln V / ln(d-1) 0.1 0 0 5 10 l 0 15 1 10 p/a 100 1000 Figure 1: (Left) Random walk kernel C ,p plotted vs distance along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C ,p→∞ = 1. 
(Right) Numerical results for average covariance K1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”. To obtain the actual C the resulting averaged transition matrix is premultiplied by D −1/2 and postmultiplied by D 1/2 , which ensures that the kernel C is symmetric. For the diffusion kernel, one finds an analogous result but the number of random walk steps is now distributed as s ∼ Poisson(σ 2 /2). This implies in particular that the diffusion kernel is the limit of the p-step kernel for p, a → ∞ at constant p/a = σ 2 /2. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases. In the limit of a large number of steps s, the random walk on a graph will reach its stationary distribution p∞ ∝ De where e = (1, . . . , 1). (This form of p∞ can be verified by checking that it remains unchanged after multiplication with the transition matrix AD −1 .) The s-step transition matrix for large s is then p∞ eT = DeeT because we converge from any starting vertex to the stationary distribution. It follows that for large p or σ 2 the covariance kernel becomes C ∝ D 1/2 eeT D 1/2 , i.e. Cij ∝ (di dj )1/2 . This is consistent with the interpretation of σ or (p/a)1/2 as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel Cij is essentially independent of the distance (along the graph) between the vertices i and j, and the function f becomes fully correlated across the graph. (Explicitly f = vD 1/2 e under the prior, with v a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as p or σ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices A which are regular in the sense that they give for each vertex the same degree, di = d. A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V ) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance between two nodes i and j. Denoting this kernel value C ,p for a p-step random walk kernel, one has then C ,p=0 = δ ,0 and γp+1 C0,p+1 γp+1 C ,p+1 = = 1− 1 ad C 1 a C0,p + −1,p 1 a + 1− C1,p 1 a C (5) ,p + d−1 ad C +1,p for ≥1 (6) where γp is chosen to achieve the desired normalization C0,p = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more longranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as C ,p→∞ = [1 + (d − 1)/d](d − 1)− /2 (7) 3 and is also plotted in the figure, showing good agreement with the numerical iteration. There are (at least) two ways of obtaining the result (7). 
One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)− /2 ). Another route is to find the steady state of the recursion for C ,p . This is easy to do but requires as input the unknown steady state value of γp . To determine this, one can map from C ,p to the total random walk probability S ,p in each “shell” of vertices at distance from the starting vertex, changing variables to S0,p = C0,p and S ,p = d(d − 1) −1 C ,p ( ≥ 1). Omitting the factors γp , this results in a recursion for S ,p that simply describes a biased random walk on = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current , probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point = 0 is a reflecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S ,p for large p is peaked around the average position of the walk, = p(d − 2)/(ad). For smaller than this S ,p has a tail behaving as ∝ (d − 1) /2 , and converting back to C ,p gives the large- scaling of C ,p→∞ ∝ (d − 1)− /2 ; this in turn fixes the value of γp→∞ and so eventually gives (7). The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C ,p→∞ = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth has V = 1 + d(d − 1) −1 nodes. So a regular graph can be tree-like at most out to ≈ ln(V )/ ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when p/a ≈ ln(V )/ ln(d − 1) (8) To check this prediction, we measure the analogue of C1,p on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C1,p on a tree is K1 = Cij / Cii Cjj for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we find the results in Fig. 1(right). The results for K1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of K1 , which eventually – now that p/a is large enough for the kernel to “notice” the loops - approach the fully correlated limit K1 = 1. 3 Learning curves We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f ∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (iµ , fi∗ + ξµ ) where ξµ is i.i.d. Gaussian noise µ of variance σ 2 ; the distribution of training inputs iµ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . 
, n takes place using the prior defined by C and a Gaussian likelihood with noise variance σ 2 . We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs. We emphasize that as n is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow. ˆ The generalization error is the squared difference between the estimated function fi and the target fi∗ , averaged across the (uniform) input distribution, the posterior distribution of f ∗ given D, the distribution of datasets D, and finally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1] (n) = V −1 Cii − k(i)T Kk−1 (i) i 4 D,graphs (9) where the average is over data sets and over graphs, K is an n × n matrix with elements Kµµ = Ciµ ,iµ + σ 2 δµµ and k(i) is a vector with entries kµ (i) = Ci,iµ . The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as specified by p, a and σ 2 . We fix d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d. Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6] =g n + σ2 V , g(h) = (λ−1 + h)−1 α (10) α=1 This equation for has to be solved self-consistently because also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues λα appearing in the function g(h) are defined appropriately, by the eigenfunction condition Cij φj = λφi ; the average here is over the input distribution, i.e. . . . = V −1 j . . . From the definition (1) of the p-step kernel, we see that then λα = κV −1 (1 − λL /a)p in terms of the corα responding eigenvalue of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention α λα = Cjj = 1. Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14] L ρ(λ ) = 4(d−1) − (λL − 1)2 d2 2πdλL (2 − λL ) (11) in the range λL ∈ [λL , λL ], λL = 1 + 2d−1 (d − 1)1/2 , where the term under the square root is ± + − positive. (There are also two isolated eigenvalues λL = 0, 2 but these have weight 1/V each and so can be ignored for large V .) Rewriting (10) as = V −1 α [(V λα )−1 + (n/V )( + σ 2 )−1 ]−1 and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve: = dλL ρ(λL )[κ−1 (1 − λL /a)−p + ν/( + σ 2 )]−1 (12) with κ determined from κ dλL ρ(λL )(1 − λL /a)p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V , i.e. the number of training examples per vertex. 
The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where σ 2 and the generalization error will be essentially 2 independent of σ ; and another for large ν where σ 2 so that can be neglected on the r.h.s. and one has a fully explicit expression for . We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct to the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately. To see why, we exploit that for large n (where σ 2 ) the approximation (9) effectively neglects fluctuations in the training input “density” of a randomly drawn set of training inputs [3, 6]. This is justified in the graph case for large ν = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ), has negligible relative fluctuations away from its mean ν. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large n. Fig. 2(right) illustrates that for larger a the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps p/a of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values (a → ∞, Cij = δij ). In that limit and for low σ 2 and large V the 5 V=500 (filled) & 1000 (empty), d=3, a=2, p=10 V=500, d=3, a=4, p=10 0 0 10 10 ε ε -1 -1 10 10 -2 10 -2 10 2 σ = 0.1 2 σ = 0.1 2 -3 10 σ = 0.01 2 σ = 0.01 -3 10 2 σ = 0.001 2 σ = 0.001 2 -4 10 2 σ = 0.0001 σ = 0.0001 -4 10 2 σ =0 -5 2 σ =0 -5 10 0.1 1 ν=n/V 10 10 0.1 1 ν=n/V 10 Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500 (small filled circles) and V = 1000 (empty circles) vertices. Plotting generalization error versus ν = n/V superimposes the results for both values of V , as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4. 2 V=500, d=3, a=2, p=20 0 0 V=500, d=3, a=2, p=200, σ =0.1 10 10 ε ε simulation -1 2 10 1/(1+n/σ ) theory (tree) theory (eigenv.) -1 10 -2 10 2 σ = 0.1 -3 10 -4 10 -2 10 2 σ = 0.01 2 σ = 0.001 2 σ = 0.0001 -3 10 2 σ =0 -5 10 -4 0.1 1 ν=n/V 10 10 1 10 100 n 1000 10000 Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500, and kernel parameters a = 2, p = 20; noise level σ 2 as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for a single random graph, with σ 2 = 0.1. Dotted line: naive estimate = 1/(1 + n/σ 2 ). Dashed line: approximation (10) using the tree spectrum and the large p-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues λL as input. α true learning curve is = exp(−ν), reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict = max{1 − ν, 0}, i.e. 
a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error. Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ∝ 1/ν in this regime. This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ− = κV −1 (1−λL /a)p : for ν σ 2 /(V λ− ), the square bracket + 6 in (12) can then be approximated by ν/( +σ 2 ) and one gets (because also regime) ≈ σ 2 /ν. σ 2 in the asymptotic However, once p becomes reasonably large, V λ− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10−30 . The “terminal” asymptotic regime ≈ σ 2 /ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields ∝ (cσ 2 /ν) ln3/2 (ν/(cσ 2 )), c ∝ p−3/2 (13) This shows that there are logarithmic corrections to the naive σ 2 /ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient c with p, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ 2 ∝ p−3/2 σ 2 . Even though the covariance kernel C ,p – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C ,p converges to the limit (7) only at fixed , while in the tail ∝ p, it continues to change. For finite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, Cij ∝ (di dj )1/2 which in the regular case simplifies to Cij = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λα=1 = 1, one would predict = 1/(1 + n/σ 2 ). Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph λα=1 = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one finds indeed (see Appendix, Eq. (17)) that ∝ (c σ 2 /n) ln3/2 (n/c σ 2 ) where now n enters rather than ν = n/V , c being a constant dependent only on p and a: informally, the function to be learned only has a finite (rather than ∝ V ) number of degrees of freedom. 
The approximation (17) in fact provides a qualitatively accurate description of the data Fig. 3(right), as the dashed line in the figure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned. 4 Summary and Outlook We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree d. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when p – or the corresponding lengthscale σ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of p around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is ν = n/V , the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν 1. Quantitatively, we found that the learning curves decay as ∝ σ 2 /ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν −α with α < 1, as in the Euclidean case, do not appear. 7 We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ 2 /n with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ≈ σ 2 , which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters. 
Appendix We sketch here how to derive (13) from (12) for large p. Eq. (12) writes = g(νV /( + σ 2 )) with λL + g(h) = dλL ρ(λL )[κ−1 (1 − λL /a)−p + hV −1 ]−1 (14) λL − and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning first to g(0), the factor (1 − λL /a)p decays quickly to zero as λL increases above λL . One can then approximate this factor according to (1 − λL /a)p [(a − λL )/(a − λL )]p ≈ − − − (1 − λL /a)p exp[−(λL − λL )p/(a − λL )]. In the regime near λL one can also approximate the − − − − spectral density (11) by its leading square-root increase, ρ(λL ) = r(λL − λL )1/2 , with r = (d − − 1)1/4 d5/2 /[π(d − 2)2 ]. Switching then to a new integration variable y = (λL − λL )p/(a − λL ) and − − extending the integration limit to ∞ gives ∞ √ 1 = g(0) = κr(1 − λL /a)p [p/(a − λL )]−3/2 dy y e−y (15) − − 0 and this fixes κ. Proceeding similarly for h > 0 gives ∞ g(h) = κr(1−λL /a)p [p/(a−λL )]−3/2 F (hκV −1 (1−λL /a)p ), − − − F (z) = √ dy y (ey +z)−1 0 (16) Dividing by g(0) = 1 shows that simply g(h) = F (hV −1 c−1 )/F (0), where c = 1/[κ(1 − σ2 λL /a)p ] = rF (0)[p/(a − λL )]−3/2 which scales as p−3/2 . In the asymptotic regime − − 2 2 we then have = g(νV /σ ) = F (ν/(cσ ))/F (0) and the desired result (13) follows from the large-z behaviour of F (z) ≈ z −1 ln3/2 (z). One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get (V + hκ)−1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (hκV −1 (1 − λL /a)p ) − − − g(h) = (17) V −1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (0) − − The denominator here is κ−1 and the two terms are proportional respectively to the covariance kernel eigenvalue λ1 , corresponding to λL = 0 and the constant eigenfunction, and to 1−λ1 . Dropping the 1 first terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where λ1 is close to unity, we have κ ≈ V and so g(h) ≈ (1 + h)−1 + rV (1 − λL /a)p [p/(a − λL )]−3/2 F (h(1 − λL /a)p ) (18) − − − The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: = g(n/σ 2 ) ∝ F (n/(c σ 2 )) with c = (1 − λL /a)−p independent of V . The large-n tail of the learning curve in − Fig. 3(right) is consistent with this form. 8 References [1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. [2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997. [3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. 
In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press.
[5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000.
[6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press.
[7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press.
[8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002.
[9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press.
[10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005.
[11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, New York, NY, USA, 2005. ACM.
[12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Schölkopf, editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003.
[13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann.
[14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. American Mathematical Society, 1997.
[15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999.
[16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999.
[17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic Brownian motion. J. Phys. A, 29(10):2399–2409, 1996.
[18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008.
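For the appendix derivation above, the large-z behaviour F(z) ≈ z^{-1} ln^{3/2}(z) can be checked numerically. The short Python sketch below (an illustration, not part of the paper; the integration cutoff and the test values of z are ad hoc choices) evaluates F(z) = ∫_0^∞ √y (e^y + z)^{-1} dy by quadrature and prints the ratio F(z)·z/ln^{3/2}(z), which should settle towards a constant (2/3 for the leading term of the integral) as z grows.

```python
# Numerical sanity check (illustrative): F(z) = int_0^inf sqrt(y)/(e^y + z) dy
# scales as z^{-1} ln^{3/2}(z) for large z, with leading constant 2/3.
import numpy as np
from scipy.integrate import quad

def F(z):
    # integrate well past y = ln(z); beyond that the integrand decays like e^{-y}
    upper = np.log(z) + 60.0
    val, _ = quad(lambda y: np.sqrt(y) / (np.exp(y) + z), 0.0, upper, limit=200)
    return val

if __name__ == "__main__":
    for z in [1e2, 1e4, 1e6, 1e8]:
        ratio = F(z) * z / np.log(z) ** 1.5
        print(f"z = {z:.0e}   F(z)*z/ln^(3/2)(z) = {ratio:.4f}")  # approaches ~2/3
```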

5 0.086546041 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.

6 0.080767758 256 nips-2009-Which graphical models are difficult to learn?

7 0.074170299 95 nips-2009-Fast subtree kernels on graphs

8 0.072894678 122 nips-2009-Label Selection on Graphs

9 0.068084702 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

10 0.067408711 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

11 0.065400653 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

12 0.06264893 87 nips-2009-Exponential Family Graph Matching and Ranking

13 0.062498849 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

14 0.059882283 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

15 0.049032826 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

16 0.046879243 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

17 0.04505169 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

18 0.044054888 149 nips-2009-Maximin affinity learning of image segmentation

19 0.043438673 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

20 0.042110924 166 nips-2009-Noisy Generalized Binary Search


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.139), (1, 0.052), (2, -0.018), (3, 0.005), (4, -0.022), (5, -0.073), (6, 0.017), (7, -0.012), (8, -0.02), (9, -0.088), (10, -0.09), (11, 0.028), (12, 0.037), (13, -0.003), (14, -0.055), (15, 0.04), (16, 0.033), (17, -0.047), (18, -0.074), (19, -0.072), (20, 0.017), (21, 0.026), (22, 0.075), (23, 0.031), (24, -0.005), (25, -0.084), (26, 0.026), (27, -0.194), (28, -0.042), (29, -0.131), (30, 0.07), (31, -0.089), (32, 0.011), (33, -0.12), (34, 0.031), (35, -0.048), (36, 0.068), (37, -0.04), (38, 0.009), (39, 0.123), (40, -0.025), (41, 0.07), (42, -0.0), (43, -0.044), (44, 0.074), (45, 0.1), (46, 0.11), (47, -0.04), (48, 0.036), (49, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94686699 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

Author: Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Abstract: The low-rank matrix completion problem is a fundamental problem with many important applications. Recently, [4],[13] and [5] obtained the first non-trivial theoretical results for the problem assuming that the observed entries are sampled uniformly at random. Unfortunately, most real-world datasets do not satisfy this assumption, but instead exhibit power-law distributed samples. In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. Our method is simpler to analyze than previous methods with the analysis reducing to computing the threshold for complete cascades in random graphs, a problem of independent interest. By analyzing the graph theoretic problem, we show that our method achieves exact recovery when the observed entries are sampled from the Chung-Lu-Vu model, which can generate power-law distributed graphs. We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. Furthermore, our method is easy to implement and is substantially faster than existing methods. We demonstrate the effectiveness of our method on random instances where the low-rank matrix is sampled according to the prevalent random graph models for complex networks and present promising preliminary results on the Netflix challenge dataset. 1

2 0.63600671 147 nips-2009-Matrix Completion from Noisy Entries

Author: Raghunandan Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced in [1], based on a combination of spectral techniques and manifold optimization, that we call here OptSpace. We prove performance guarantees that are order-optimal in a number of circumstances. 1

3 0.60680342 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization

Author: John Wright, Arvind Ganesh, Shankar Rao, Yigang Peng, Yi Ma

Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized “robust principal component analysis” problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision. 1

4 0.55327064 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.

5 0.50596827 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

Abstract: We consider the question of computing Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). We present a randomized iterative algorithm based on simple local updates. The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by first, picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. Somewhat surprisingly, we show that this algorithm finds a near optimal assignment within n log 2 n iterations with high probability for any n node pair-wise MRF with geometry (i.e. MRF graph with polynomial growth) with the approximation error depending on (in a reasonable manner) the geometric growth rate of the graph and the average radius of the local neighborhood – this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. Through extensive simulations, we show that our algorithm finds extremely good approximate solutions for various kinds of MRFs with geometry.

6 0.47512081 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

7 0.47176194 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

8 0.47099018 256 nips-2009-Which graphical models are difficult to learn?

9 0.46022651 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

10 0.38328317 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

11 0.37895548 46 nips-2009-Bilinear classifiers for visual recognition

12 0.37787408 95 nips-2009-Fast subtree kernels on graphs

13 0.37725794 180 nips-2009-On the Convergence of the Concave-Convex Procedure

14 0.36451137 122 nips-2009-Label Selection on Graphs

15 0.34692645 87 nips-2009-Exponential Family Graph Matching and Ranking

16 0.34549013 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

17 0.34183997 149 nips-2009-Maximin affinity learning of image segmentation

18 0.33957827 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

19 0.3361969 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation

20 0.33392853 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.011), (24, 0.056), (25, 0.055), (35, 0.026), (36, 0.062), (39, 0.168), (57, 0.014), (58, 0.103), (71, 0.035), (76, 0.229), (81, 0.027), (86, 0.095), (91, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82233417 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

Author: Raghu Meka, Prateek Jain, Inderjit S. Dhillon

Abstract: The low-rank matrix completion problem is a fundamental problem with many important applications. Recently, [4],[13] and [5] obtained the first non-trivial theoretical results for the problem assuming that the observed entries are sampled uniformly at random. Unfortunately, most real-world datasets do not satisfy this assumption, but instead exhibit power-law distributed samples. In this paper, we propose a graph theoretic approach to matrix completion that solves the problem for more realistic sampling models. Our method is simpler to analyze than previous methods with the analysis reducing to computing the threshold for complete cascades in random graphs, a problem of independent interest. By analyzing the graph theoretic problem, we show that our method achieves exact recovery when the observed entries are sampled from the Chung-Lu-Vu model, which can generate power-law distributed graphs. We also hypothesize that our algorithm solves the matrix completion problem from an optimal number of entries for the popular preferential attachment model and provide strong empirical evidence for the claim. Furthermore, our method is easy to implement and is substantially faster than existing methods. We demonstrate the effectiveness of our method on random instances where the low-rank matrix is sampled according to the prevalent random graph models for complex networks and present promising preliminary results on the Netflix challenge dataset. 1

2 0.70662785 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

3 0.70421886 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis

Author: Gunhee Kim, Antonio Torralba

Abstract: This paper proposes a fast and scalable alternating optimization technique to detect regions of interest (ROIs) in cluttered Web images without labels. The proposed approach discovers highly probable regions of object instances by iteratively repeating the following two functions: (1) choose the exemplar set (i.e. a small number of highly ranked reference ROIs) across the dataset and (2) refine the ROIs of each image with respect to the exemplar set. These two subproblems are formulated as ranking in two different similarity networks of ROI hypotheses by link analysis. The experiments with the PASCAL 06 dataset show that our unsupervised localization performance is better than one of state-of-the-art techniques and comparable to supervised methods. Also, we test the scalability of our approach with five objects in Flickr dataset consisting of more than 200K images. 1

4 0.69939417 110 nips-2009-Hierarchical Mixture of Classification Experts Uncovers Interactions between Brain Regions

Author: Bangpeng Yao, Dirk Walther, Diane Beck, Li Fei-fei

Abstract: The human brain can be described as containing a number of functional regions. These regions, as well as the connections between them, play a key role in information processing in the brain. However, most existing multi-voxel pattern analysis approaches either treat multiple regions as one large uniform region or several independent regions, ignoring the connections between them. In this paper we propose to model such connections in a Hidden Conditional Random Field (HCRF) framework, where the classifier of one region of interest (ROI) makes predictions based on not only its voxels but also the predictions from ROIs that it connects to. Furthermore, we propose a structural learning method in the HCRF framework to automatically uncover the connections between ROIs. We illustrate this approach with fMRI data acquired while human subjects viewed images of different natural scene categories and show that our model can improve the top-level (the classifier combining information from all ROIs) and ROI-level prediction accuracy, as well as uncover some meaningful connections between ROIs. 1

5 0.69109428 54 nips-2009-Compositionality of optimal control laws

Author: Emanuel Todorov

Abstract: We present a theory of compositionality in stochastic optimal control, showing how task-optimal controllers can be constructed from certain primitives. The primitives are themselves feedback controllers pursuing their own agendas. They are mixed in proportion to how much progress they are making towards their agendas and how compatible their agendas are with the present task. The resulting composite control law is provably optimal when the problem belongs to a certain class. This class is rather general and yet has a number of unique properties – one of which is that the Bellman equation can be made linear even for non-linear or discrete dynamics. This gives rise to the compositionality developed here. In the special case of linear dynamics and Gaussian noise our framework yields analytical solutions (i.e. non-linear mixtures of LQG controllers) without requiring the final cost to be quadratic. More generally, a natural set of control primitives can be constructed by applying SVD to Green’s function of the Bellman equation. We illustrate the theory in the context of human arm movements. The ideas of optimality and compositionality are both very prominent in the field of motor control, yet they have been difficult to reconcile. Our work makes this possible.

6 0.66847354 109 nips-2009-Hierarchical Learning of Dimensional Biases in Human Categorization

7 0.66551358 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

8 0.64585459 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

9 0.63685566 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

10 0.62968719 21 nips-2009-Abstraction and Relational learning

11 0.62332863 112 nips-2009-Human Rademacher Complexity

12 0.62141794 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

13 0.61431551 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships

14 0.61299026 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

15 0.60884571 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

16 0.60623652 154 nips-2009-Modeling the spacing effect in sequential category learning

17 0.60535616 86 nips-2009-Exploring Functional Connectivities of the Human Brain using Multivariate Information Analysis

18 0.60420555 115 nips-2009-Individuation, Identification and Object Discovery

19 0.60209936 85 nips-2009-Explaining human multiple object tracking as resource-constrained approximate inference in a dynamic probabilistic model

20 0.59958315 121 nips-2009-Know Thy Neighbour: A Normative Theory of Synaptic Depression