nips nips2009 nips2009-143 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer
Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. [sent-9, score-0.482]
2 The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. [sent-10, score-0.942]
3 Given a failing test case, the model determines the least likely transitional pattern in the execution trace. [sent-11, score-0.589]
4 1 Introduction In today’s software projects, two types of source code are developed: product and test code. [sent-14, score-0.479]
5 In addition to product code, developers write test code that consists of small test programs, each testing a single procedure or module for compliance with the specification. [sent-19, score-0.451]
6 When a program is executed, its trace through the source code can be recorded. [sent-23, score-0.534]
7 An executed line of source code is identified by a code position s ∈ S. [sent-24, score-0.946]
8 The stream of code positions forms the trace t of a test case execution. [sent-25, score-0.617]
9 In addition to the passing tests we are given a single trace t of a failing test case. [sent-27, score-0.479]
10 The passing test traces and the trace of the failing case refer to the same code revision; hence, the semantics of each code position remain constant. [sent-28, score-1.422]
11 For the failing test case, the developer is to be provided with a ranking of code positions according to their likelihood of being defective. [sent-29, score-0.88]
12 The semantics of code positions may change across revisions, and modifications of code may impact the distribution of execution patterns in the modified as well as other locations of the code. [sent-30, score-1.096]
13 We focus on the problem of localizing defects within a current code revision. [sent-31, score-0.542]
14 After each defect is localized, the code is typically revised and the semantics of code positions changes. [sent-32, score-1.002]
15 * @return 10 */ public static int defect (int param) { int i = 0; while (i < 10) { if (param == 5) { return 100; } i++; } return i; } public static class TestDefect extends TestCase { public void testParam1() { assertEquals(10, defect(1)); } /** Failing test case. [sent-35, score-0.53]
16 */ public void testParam5() { assertEquals(10, defect(5)); } } public void testParam10() { assertEquals(10, defect(10)); } Figure 1: Example with product code (left) and test code (right). [sent-36, score-0.914]
17 cannot assume that any negative training data—that is, previous failing test cases of the same code revision—are available. [sent-37, score-0.613]
18 So far, Tarantula [1] is the standard reference model for localizing defects in execution traces. [sent-41, score-0.511]
19 The authors propose an interface widget for test case results in which a pixel represents a code position. [sent-42, score-0.38]
20 The hue value of the pixel is determined by the number of failing and passing traces that execute this position and correlates with the likelihood that s is faulty [1]. [sent-43, score-0.554]
21 Another approach [2] includes return values and flags for executed code blocks and builds on sensitivity and increase of failure probability. [sent-44, score-0.568]
22 [4] extend latent Dirichlet allocation (LDA) [5] to find bug patterns in recorded execution events. [sent-47, score-0.375]
23 Their probabilistic model captures low-signal bug patterns by explaining passing executions from a set of usage topics and failing executions from a mix of usage and bug topics. [sent-48, score-0.739]
24 This section’s main result is the closed-form solution for Bayesian inference of the likelihood of a transitional pattern in a test trace given example execution traces. [sent-52, score-0.468]
25 Furthermore, we discuss how to learn hyperparameters and smoothing coefficients from other revisions, despite the fragile semantics of code positions. [sent-53, score-0.376]
26 2 Bernoulli Graph Model The Bernoulli graph model is a probabilistic model that generates program execution graphs. [sent-57, score-0.467]
27 In contrast to an execution trace, the graph is a representation of an execution that abstracts from the number of iterations over code fragments. [sent-58, score-0.986]
28 The model allows for Bayesian inference of the likelihood of a transition between code positions within an execution, given previously seen executions. [sent-59, score-0.465]
29 The n-gram execution graph Gt = (Vt , Et , Lt ) of an execution t connects vertices Vt by edges Et ⊆ Vt × Vt . [sent-60, score-0.699]
30 Labeling function Lt : Vt → S (n−1) injectively maps vertices to n − 1-grams of code positions, where S is the alphabet of code positions. [sent-61, score-0.733]
31 In the bigram execution graph, each vertex v represents a code position Lt (v); each arc (u, v) indicates that code position Lt (v) has been executed directly after code position Lt (u) at least once during the program execution. [sent-62, score-1.841]
32 In n-gram execution graphs, each vertex v represents a fragment Lt (v) = s1 . [sent-63, score-0.629]
33 Vertices u and v can only be connected by an arc if the fragments are overlapping in all but the first code position of u and the last code position of v; that is, Lt (u) = s1 . [sent-67, score-0.929]
34 | 22| 18 | 24 Figure 2: Expanding vertex “22 18” in the generation of a tri-gram execution graph corresponding to the trace at the bottom. [sent-78, score-0.601]
35 For the example program in Figure 1 the tri-gram execution graph is given in Figure 2. [sent-84, score-0.439]
36 The Bernoulli graph model generates one graph Gm,t = (Vm,t , Em,t , Lm,t ) per execution t and procedure m. [sent-86, score-0.555]
37 The model starts the graph generation with an initial vertex representing a fragment of virtual code positions ε. [sent-87, score-0.975]
38 3 αψ βψ false βψ αψ Beta Beta Edge coin u V (u,v) b for each graph G for each code position s for each vertex u for each procedure m a) Bernoulli graph f ϕ for each fragment f for each procedure m Bernoulli E Code Pos. [sent-142, score-1.138]
39 Dirichlet Multinomial Procedure m t b for each trace t Fragment n for each fragment f S for each procedure m b) Bernoulli fragment f=si-1,. [sent-145, score-0.664]
40 s for each code position in t for each trace t c) Multinomial n-gram Figure 3: Generative models in directed factor graph notation with dashed rectangles indicating gates [6]. [sent-149, score-0.663]
41 (1) #u + αψ + βψ By definition, an execution graph G for an execution contains a vertex if its label is a substring of the execution’s trace t. [sent-157, score-0.856]
42 It follows1 that the predictive distribution can be reformulated as in Equation 2 to predict the probability of seeing the code position s = sn after a fragment of preceding statements ˜ ˜ f = s1 . [sent-159, score-0.983]
43 si−n+1 , T, αψ , βψ ) s s ˜ n=1 We can learn from different revisions by integrating multiple Bernoulli graphs models in a generative process, in which coin parameters are not shared across revisions and context lengths n. [sent-176, score-0.489]
44 This process generates a stream of statements with defect flags. [sent-177, score-0.328]
45 Having learned point estimates for αψ , βψ , and θ from other ˆ ˆ revisions in a leave-one-out fashion, statements s are scored by the complementary event of being ˜ ˜ normal for any preceding fragment f . [sent-180, score-0.652]
46 1 Tarantula Tarantula [1] scores the likelihood of a code position s being defective according to the proportions of failing F and passing traces T that execute this position (Equation 4). [sent-187, score-1.143]
47 In the unigram case, the Bernoulli graph model generates a graph in which all statements in an execution are directly linked to an empty start vertex. [sent-189, score-0.801]
48 Letting g(s) = #{t∈T |˜∈t} , the rank order of any two code s #{t∈T } #{t∈T } 1 1 positions s1 , s2 is determined by 1 − g(s1 ) > 1 − g(s2 ) or equivalently 1+g(s1 ) > 1+g(s2 ) which is Tarantula’s ranking criterion if #F is 1. [sent-192, score-0.5]
49 Given a fixed order n, the Bernoulli fragment model draws a coin parameter for each possible fragment f = s1 . [sent-196, score-0.629]
50 For each execution the fragment set is generated by tossing a fragment’s coin and including all fragments with outcome b = 1 (cf. [sent-200, score-0.696]
51 ψ The model deviates from reality in that it may generate fragments that may not be aggregateable into a consistent sequence of code positions. [sent-203, score-0.401]
52 In contrast to the Bernoulli graph model, the multinomial model takes the number of occurrences of a pattern within an execution into account. [sent-209, score-0.49]
53 It consists of a hierarchical process in which first a procedure m is drawn from multinomial distribution γ, then a code position s is drawn from the multinomial distribution φm ranging over all code positions Sm in the procedure. [sent-210, score-1.078]
54 The n-gram model is a well-known extension of the unigram multinomial model, where the distributions φ are conditioned on the preceding fragment of code positions f = s1 . [sent-211, score-0.985]
55 Using fixed symmetric Dirichlet distributions with parameter αγ and αφ as priors for the multinomial distributions, the probability for unseen code positions s ˜ ˜ is given in Equation 5. [sent-215, score-0.564]
56 Shorthand #T denotes how often statements in following on fragment f s∈m prodecure m are executed (summing over all traces t ∈ T in the training set); and #T 1 . [sent-216, score-0.724]
57 [3] propose an approach that relies on a stream of sampled boolean predicates P , each corresponding to an executed control flow branch starting at code position s. [sent-225, score-0.615]
58 The approach evaluates whether P being true increases the probability of failure in contrast to reaching the code position by chance. [sent-226, score-0.455]
59 Each code position is scored according to the importance of its predicate P which is the harmonic mean of sensitivity and increase in failure probability. [sent-227, score-0.482]
60 Shorthands Fe (P ) and Se (P ) refer to the failing/passing traces that executed the path P , where Fo (P ) and So (P ) refer to failing/passing traces that executed the start point of P . [sent-228, score-0.618]
61 2 Importance(P ) = log #F log Fe (P ) Fe (P ) Se (P )+Fe (P ) + − Fo (P ) So (P )+Fo (P ) −1 This scoring procedure is not applicable to cases where a path is executed in only one failing trace, as a division by zero occurs in the first term when Fe (P ) = 1. [sent-229, score-0.39]
62 Most topics may be used to explain passing and failing traces, where some topics are reserved to explain statements in the failing traces only. [sent-234, score-0.907]
63 This is obtained by running LDA with different Dirichlet priors on passing and failing traces. [sent-235, score-0.327]
64 Then statements j are ranked according to the confidence Sij = p(z = i|s = j) − maxk=i p(z = k|s = j) of being rather about a bug topic i than any other topic k. [sent-237, score-0.292]
65 4 Experimental Evaluation In this section we study empirically how accurately the Bernoulli graph model and the reference models discussed in Section 3 localize defects that occurred in two large-scale development projects. [sent-238, score-0.379]
66 The SIR repository [9] provides traces of small programs into which defects have been injected. [sent-240, score-0.305]
67 The Cooperative Bug Isolation project [11], on the other hand, collects execution data from real applications, but records only a random sample of 1% of the executed code positions; complete execution traces cannot be reconstructed. [sent-242, score-1.159]
68 From Rhino’s and AspectJ’s bug database, we select defects which are reproducable by a test case and identify corresponding revisions in the source code repository. [sent-245, score-0.859]
69 For such revisions, the test code contains a test case that fails in one revision, but passes in the following revision. [sent-246, score-0.42]
70 We use the code positions that were modified between the two revisions as ground truth for the defective code positions D. [sent-247, score-1.276]
71 For AspectJ, these are one or two lines of code; the Rhino project contains larger code changes. [sent-248, score-0.34]
72 In the same manner, the failing trace t (in which the defective code is to be identified) is recorded. [sent-250, score-0.859]
73 The AspectJ data set consists of 41 defective revisions and a total of 45 failing traces. [sent-251, score-0.579]
74 Each failing trace has a length of up to 2,000,000 executed statements covering approx. [sent-252, score-0.641]
75 10,000 different code positions (of the 75,000 lines in the project), spread across 300 to 600 files and 1,000 to 4,000 procedures. [sent-253, score-0.465]
76 Rhino consists of 15 defective revisions with one failing trace per bug. [sent-256, score-0.691]
77 Failing traces have an average length of 3,500,000 executed statements, covering approx. [sent-257, score-0.309]
78 6 Top k% Figure 4: Recall of defective code positions within the 1% highest scored statements for AspectJ (top) and Rhino (bottom), for windows of h = 0, h = 1, and h = 10 code lines. [sent-318, score-1.145]
79 code positions, spread across 70 files and 650 procedures. [sent-319, score-0.34]
80 Following the evaluation in [1], we evaluate how well the models are able to guide the user into the vicinity of a defective code position. [sent-327, score-0.563]
81 The models return a ranked list of code positions. [sent-328, score-0.404]
82 Envisioning that the developer can navigate from the ranking into the source code to inspect a code line within its context, we evaluate the rank k at which a line of code occurs that lies within a window of ±h lines of code of a defective line. [sent-329, score-1.768]
83 We plot relative ranks; that is, absolute ranks divided by the number of covered code lines, corresponding to the fraction of code that the developer has to walk through in order to find the defect. [sent-330, score-0.787]
84 We examine the recall@k%, that is the fraction of successfully localized defects over the fraction of code the user has to inspect. [sent-331, score-0.493]
85 In order to study the helpfulness of each generative model, we evaluate smoothed models with maximum length N = 5 for each the multinomial, Bernoulli fragment and Bernoulli graph model. [sent-337, score-0.445]
86 Using the 7 Bernoulli graph model, a developer finds nearly every second bug in the top 1% in both data sets, where ranking a failing trace takes between 10 and 20 seconds. [sent-355, score-0.778]
87 From revisions where one model drastically outperformed the others we identified different categories of suspicious code areas. [sent-362, score-0.586]
88 In some cases, the defective procedures were executed in very few or no passing trace; we refer such code as being insufficiently covered. [sent-363, score-0.765]
89 Another category refers to defective code lines in the vicinity of branching points such as if-statements. [sent-364, score-0.594]
90 If code before the branch point is executed in many passing traces, but code in one of the branches only rarely, we call this a suspicious branch point. [sent-365, score-1.091]
91 The Bernoulli fragment model treats both kinds of suspicious code areas in a similar way. [sent-366, score-0.69]
92 Revisions in which Bernoulli fragment outperformed Bernoulli graph contained defects in insufficiently covered areas. [sent-372, score-0.565]
93 Revisions in which Bernoulli graph outperformed Bernoulli fragment contained bugs at suspicious branching points. [sent-374, score-0.578]
94 In contrast to the Bernoulli-style models, the multinomial models take the number of occurrences of a code position within a trace into account. [sent-375, score-0.626]
95 Presumably, multiple occurrences of code lines within a trace do not indicate their defectiveness. [sent-376, score-0.452]
96 Compared to execution traces, execution graphs abstract from the number of iterations that sequences of code positions have been executed for. [sent-379, score-1.167]
97 The model allows for Bayesian inference of the likelihood of transitional patterns in a new trace, given execution traces of passing test cases. [sent-380, score-0.602]
98 The Bernoulli graph model outperforms the reference models with closed-form solution with respect to giving a high rank to code positions that lie in close vicinity of the actual defect. [sent-387, score-0.704]
99 In order to find every second defect in the release history of Rhino, the Bernoulli graph model walks the developer through approximately 0. [sent-388, score-0.404]
100 5% of the code positions and 1% in the AspectJ project. [sent-389, score-0.465]
wordName wordTfidf (topN-words)
[('code', 0.34), ('fragment', 0.276), ('bernoulli', 0.275), ('execution', 0.255), ('failing', 0.233), ('aspectj', 0.214), ('rhino', 0.183), ('defective', 0.174), ('revisions', 0.172), ('tarantula', 0.168), ('defect', 0.161), ('executed', 0.157), ('defects', 0.153), ('traces', 0.152), ('statements', 0.139), ('graph', 0.136), ('positions', 0.125), ('bug', 0.12), ('sn', 0.115), ('trace', 0.112), ('unigram', 0.107), ('developer', 0.107), ('multinomial', 0.099), ('vertex', 0.098), ('passing', 0.094), ('revision', 0.08), ('lda', 0.077), ('coin', 0.077), ('position', 0.075), ('delta', 0.074), ('lt', 0.074), ('suspicious', 0.074), ('fe', 0.067), ('software', 0.065), ('fragments', 0.061), ('bugs', 0.061), ('liblit', 0.061), ('transitional', 0.061), ('public', 0.057), ('reference', 0.054), ('vertices', 0.053), ('localizing', 0.049), ('vicinity', 0.049), ('gm', 0.048), ('program', 0.048), ('holmes', 0.046), ('executions', 0.046), ('param', 0.046), ('andrzejewski', 0.046), ('assertequals', 0.046), ('dallmeier', 0.046), ('debugging', 0.046), ('dietz', 0.046), ('saarbruecken', 0.046), ('zeller', 0.046), ('sm', 0.044), ('branch', 0.043), ('vm', 0.041), ('void', 0.04), ('test', 0.04), ('failure', 0.04), ('preceding', 0.038), ('arc', 0.038), ('ben', 0.037), ('vt', 0.036), ('development', 0.036), ('semantics', 0.036), ('graphs', 0.035), ('projects', 0.035), ('top', 0.035), ('ranking', 0.035), ('scheffer', 0.034), ('source', 0.034), ('ranked', 0.033), ('generative', 0.033), ('dirichlet', 0.032), ('fo', 0.031), ('branching', 0.031), ('return', 0.031), ('aiken', 0.031), ('chilimbi', 0.031), ('developers', 0.031), ('distr', 0.031), ('naik', 0.031), ('navigate', 0.031), ('valentin', 0.031), ('germany', 0.03), ('topics', 0.028), ('int', 0.028), ('generates', 0.028), ('scored', 0.027), ('cooperative', 0.027), ('mayur', 0.027), ('inspect', 0.027), ('laura', 0.027), ('potsdam', 0.027), ('tobias', 0.027), ('tossing', 0.027), ('usage', 0.026), ('insuf', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer
Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1
2 0.071842797 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
Author: Maxim Raginsky, Svetlana Lazebnik
Abstract: This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.
3 0.06908454 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
4 0.061257459 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
Author: Peter Sollich, Matthew Urry, Camille Coti
Abstract: We investigate how well Gaussian process regression can learn functions defined on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V , the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important. 1 Motivation and Outline Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f (x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law ∝ n−α with an exponent α ≤ 1 that depends on the dimensionality n of the space as well as the smoothness properties of the function f (x) as encoded in the covariance function. But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f (x) some measure of secondary structure or biological function), a research paper (with f (x) related to impact), a web page (with f (x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this 1 problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x )2 /(2σ 2 )], but we find that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions. 2 Kernels on graphs and trees We assume that we are trying to learn a function defined on the vertices of a graph. Vertices are labelled by i = 1 . . . V , instead of the generic input label x we used in the introduction, and the associated function values are denoted fi ∈ R. By taking the prior P (f ) over these functions f = (f1 , . . . , fV ) as a (zero mean) Gaussian process we are saying that P (f ) ∝ exp(− 1 f T C −1 f ). 2 The covariance function or kernel C is then, in our graph setting, just a positive definite V × V matrix. The graph structure is characterized by a V × V adjacency matrix, with Aij = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that Aij = Aji , V and there are no self-loops (Aii = 0). The degree of each node is then defined as di = j=1 Aij . The covariance kernels we discuss in this paper are the natural generalizations of the squaredexponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, defined as L = 1 − D −1/2 AD −1/2 , where D is a diagonal matrix with entries d1 , . . . , dV and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0,2] (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2) C ∝ (1 − a−1 L)p = 1 − a−1 1 + a−1 D −1/2 AD −1/2 p (1) while the diffusion kernel is given by 1 C ∝ exp − 2 σ 2 L ∝ exp 1 2 −1/2 AD −1/2 2σ D (2) We will always normalize these so that (1/V ) i Cii = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity. To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then Aij /dj . The transition matrix after s steps follows as (AD −1 )s : its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1): p p ( p )a−s (1−a−1 )p−s (D −1/2 AD −1/2 )s = D −1/2 s C∝ s=0 ( p )a−s (1−a−1 )p−s (AD −1 )s D 1/2 s s=0 (3) Thus C is essentially a random walk transition matrix, averaged over the number of steps s with s ∼ Binomial(p, 1/a) 2 (4) a=2, d=3 K1 1 1 Cl,p 0.9 p=1 p=2 p=3 p=4 p=5 p=10 p=20 p=50 p=100 p=200 p=500 p=infty 0.8 0.6 0.4 d=3 0.8 0.7 0.6 a=2, V=infty a=2, V=500 a=4, V=infty a=4, V=500 0.5 0.4 0.3 0.2 0.2 ln V / ln(d-1) 0.1 0 0 5 10 l 0 15 1 10 p/a 100 1000 Figure 1: (Left) Random walk kernel C ,p plotted vs distance along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C ,p→∞ = 1. (Right) Numerical results for average covariance K1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”. To obtain the actual C the resulting averaged transition matrix is premultiplied by D −1/2 and postmultiplied by D 1/2 , which ensures that the kernel C is symmetric. For the diffusion kernel, one finds an analogous result but the number of random walk steps is now distributed as s ∼ Poisson(σ 2 /2). This implies in particular that the diffusion kernel is the limit of the p-step kernel for p, a → ∞ at constant p/a = σ 2 /2. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases. In the limit of a large number of steps s, the random walk on a graph will reach its stationary distribution p∞ ∝ De where e = (1, . . . , 1). (This form of p∞ can be verified by checking that it remains unchanged after multiplication with the transition matrix AD −1 .) The s-step transition matrix for large s is then p∞ eT = DeeT because we converge from any starting vertex to the stationary distribution. It follows that for large p or σ 2 the covariance kernel becomes C ∝ D 1/2 eeT D 1/2 , i.e. Cij ∝ (di dj )1/2 . This is consistent with the interpretation of σ or (p/a)1/2 as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel Cij is essentially independent of the distance (along the graph) between the vertices i and j, and the function f becomes fully correlated across the graph. (Explicitly f = vD 1/2 e under the prior, with v a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as p or σ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices A which are regular in the sense that they give for each vertex the same degree, di = d. A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V ) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance between two nodes i and j. Denoting this kernel value C ,p for a p-step random walk kernel, one has then C ,p=0 = δ ,0 and γp+1 C0,p+1 γp+1 C ,p+1 = = 1− 1 ad C 1 a C0,p + −1,p 1 a + 1− C1,p 1 a C (5) ,p + d−1 ad C +1,p for ≥1 (6) where γp is chosen to achieve the desired normalization C0,p = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more longranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as C ,p→∞ = [1 + (d − 1)/d](d − 1)− /2 (7) 3 and is also plotted in the figure, showing good agreement with the numerical iteration. There are (at least) two ways of obtaining the result (7). One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)− /2 ). Another route is to find the steady state of the recursion for C ,p . This is easy to do but requires as input the unknown steady state value of γp . To determine this, one can map from C ,p to the total random walk probability S ,p in each “shell” of vertices at distance from the starting vertex, changing variables to S0,p = C0,p and S ,p = d(d − 1) −1 C ,p ( ≥ 1). Omitting the factors γp , this results in a recursion for S ,p that simply describes a biased random walk on = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current , probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point = 0 is a reflecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S ,p for large p is peaked around the average position of the walk, = p(d − 2)/(ad). For smaller than this S ,p has a tail behaving as ∝ (d − 1) /2 , and converting back to C ,p gives the large- scaling of C ,p→∞ ∝ (d − 1)− /2 ; this in turn fixes the value of γp→∞ and so eventually gives (7). The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C ,p→∞ = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth has V = 1 + d(d − 1) −1 nodes. So a regular graph can be tree-like at most out to ≈ ln(V )/ ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when p/a ≈ ln(V )/ ln(d − 1) (8) To check this prediction, we measure the analogue of C1,p on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C1,p on a tree is K1 = Cij / Cii Cjj for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we find the results in Fig. 1(right). The results for K1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of K1 , which eventually – now that p/a is large enough for the kernel to “notice” the loops - approach the fully correlated limit K1 = 1. 3 Learning curves We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f ∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (iµ , fi∗ + ξµ ) where ξµ is i.i.d. Gaussian noise µ of variance σ 2 ; the distribution of training inputs iµ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . , n takes place using the prior defined by C and a Gaussian likelihood with noise variance σ 2 . We thus assume an inference model that is matched to the data generating process. This is obviously an over-simplification but is appropriate for the present first exploration of learning curves on random graphs. We emphasize that as n is increased we see more and more function values from the same graph, which is fixed by the problem domain; the graph does not grow. ˆ The generalization error is the squared difference between the estimated function fi and the target fi∗ , averaged across the (uniform) input distribution, the posterior distribution of f ∗ given D, the distribution of datasets D, and finally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1] (n) = V −1 Cii − k(i)T Kk−1 (i) i 4 D,graphs (9) where the average is over data sets and over graphs, K is an n × n matrix with elements Kµµ = Ciµ ,iµ + σ 2 δµµ and k(i) is a vector with entries kµ (i) = Ci,iµ . The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as specified by p, a and σ 2 . We fix d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d. Exact prediction of learning curves by analytical calculation is very difficult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by first expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6] =g n + σ2 V , g(h) = (λ−1 + h)−1 α (10) α=1 This equation for has to be solved self-consistently because also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a fixed graph is unchanged from [3], provided the kernel eigenvalues λα appearing in the function g(h) are defined appropriately, by the eigenfunction condition Cij φj = λφi ; the average here is over the input distribution, i.e. . . . = V −1 j . . . From the definition (1) of the p-step kernel, we see that then λα = κV −1 (1 − λL /a)p in terms of the corα responding eigenvalue of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention α λα = Cjj = 1. Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14] L ρ(λ ) = 4(d−1) − (λL − 1)2 d2 2πdλL (2 − λL ) (11) in the range λL ∈ [λL , λL ], λL = 1 + 2d−1 (d − 1)1/2 , where the term under the square root is ± + − positive. (There are also two isolated eigenvalues λL = 0, 2 but these have weight 1/V each and so can be ignored for large V .) Rewriting (10) as = V −1 α [(V λα )−1 + (n/V )( + σ 2 )−1 ]−1 and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve: = dλL ρ(λL )[κ−1 (1 − λL /a)−p + ν/( + σ 2 )]−1 (12) with κ determined from κ dλL ρ(λL )(1 − λL /a)p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V , i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where σ 2 and the generalization error will be essentially 2 independent of σ ; and another for large ν where σ 2 so that can be neglected on the r.h.s. and one has a fully explicit expression for . We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct to the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately. To see why, we exploit that for large n (where σ 2 ) the approximation (9) effectively neglects fluctuations in the training input “density” of a randomly drawn set of training inputs [3, 6]. This is justified in the graph case for large ν = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ), has negligible relative fluctuations away from its mean ν. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large n. Fig. 2(right) illustrates that for larger a the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps p/a of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values (a → ∞, Cij = δij ). In that limit and for low σ 2 and large V the 5 V=500 (filled) & 1000 (empty), d=3, a=2, p=10 V=500, d=3, a=4, p=10 0 0 10 10 ε ε -1 -1 10 10 -2 10 -2 10 2 σ = 0.1 2 σ = 0.1 2 -3 10 σ = 0.01 2 σ = 0.01 -3 10 2 σ = 0.001 2 σ = 0.001 2 -4 10 2 σ = 0.0001 σ = 0.0001 -4 10 2 σ =0 -5 2 σ =0 -5 10 0.1 1 ν=n/V 10 10 0.1 1 ν=n/V 10 Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500 (small filled circles) and V = 1000 (empty circles) vertices. Plotting generalization error versus ν = n/V superimposes the results for both values of V , as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4. 2 V=500, d=3, a=2, p=20 0 0 V=500, d=3, a=2, p=200, σ =0.1 10 10 ε ε simulation -1 2 10 1/(1+n/σ ) theory (tree) theory (eigenv.) -1 10 -2 10 2 σ = 0.1 -3 10 -4 10 -2 10 2 σ = 0.01 2 σ = 0.001 2 σ = 0.0001 -3 10 2 σ =0 -5 10 -4 0.1 1 ν=n/V 10 10 1 10 100 n 1000 10000 Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500, and kernel parameters a = 2, p = 20; noise level σ 2 as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for a single random graph, with σ 2 = 0.1. Dotted line: naive estimate = 1/(1 + n/σ 2 ). Dashed line: approximation (10) using the tree spectrum and the large p-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues λL as input. α true learning curve is = exp(−ν), reflecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict = max{1 − ν, 0}, i.e. a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error. Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ∝ 1/ν in this regime. This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ− = κV −1 (1−λL /a)p : for ν σ 2 /(V λ− ), the square bracket + 6 in (12) can then be approximated by ν/( +σ 2 ) and one gets (because also regime) ≈ σ 2 /ν. σ 2 in the asymptotic However, once p becomes reasonably large, V λ− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10−30 . The “terminal” asymptotic regime ≈ σ 2 /ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields ∝ (cσ 2 /ν) ln3/2 (ν/(cσ 2 )), c ∝ p−3/2 (13) This shows that there are logarithmic corrections to the naive σ 2 /ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefficient c with p, which implies that to reach a specified (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ 2 ∝ p−3/2 σ 2 . Even though the covariance kernel C ,p – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C ,p converges to the limit (7) only at fixed , while in the tail ∝ p, it continues to change. For finite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, Cij ∝ (di dj )1/2 which in the regular case simplifies to Cij = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λα=1 = 1, one would predict = 1/(1 + n/σ 2 ). Fig. 3(right) shows, however, that this significantly underestimates the actual generalization error, even though for this graph λα=1 = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one finds indeed (see Appendix, Eq. (17)) that ∝ (c σ 2 /n) ln3/2 (n/c σ 2 ) where now n enters rather than ν = n/V , c being a constant dependent only on p and a: informally, the function to be learned only has a finite (rather than ∝ V ) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data Fig. 3(right), as the dashed line in the figure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned. 4 Summary and Outlook We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree d. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when p – or the corresponding lengthscale σ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of p around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is ν = n/V , the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν 1. Quantitatively, we found that the learning curves decay as ∝ σ 2 /ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν −α with α < 1, as in the Euclidean case, do not appear. 7 We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ 2 /n with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artificial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ≈ σ 2 , which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters. Appendix We sketch here how to derive (13) from (12) for large p. Eq. (12) writes = g(νV /( + σ 2 )) with λL + g(h) = dλL ρ(λL )[κ−1 (1 − λL /a)−p + hV −1 ]−1 (14) λL − and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning first to g(0), the factor (1 − λL /a)p decays quickly to zero as λL increases above λL . One can then approximate this factor according to (1 − λL /a)p [(a − λL )/(a − λL )]p ≈ − − − (1 − λL /a)p exp[−(λL − λL )p/(a − λL )]. In the regime near λL one can also approximate the − − − − spectral density (11) by its leading square-root increase, ρ(λL ) = r(λL − λL )1/2 , with r = (d − − 1)1/4 d5/2 /[π(d − 2)2 ]. Switching then to a new integration variable y = (λL − λL )p/(a − λL ) and − − extending the integration limit to ∞ gives ∞ √ 1 = g(0) = κr(1 − λL /a)p [p/(a − λL )]−3/2 dy y e−y (15) − − 0 and this fixes κ. Proceeding similarly for h > 0 gives ∞ g(h) = κr(1−λL /a)p [p/(a−λL )]−3/2 F (hκV −1 (1−λL /a)p ), − − − F (z) = √ dy y (ey +z)−1 0 (16) Dividing by g(0) = 1 shows that simply g(h) = F (hV −1 c−1 )/F (0), where c = 1/[κ(1 − σ2 λL /a)p ] = rF (0)[p/(a − λL )]−3/2 which scales as p−3/2 . In the asymptotic regime − − 2 2 we then have = g(νV /σ ) = F (ν/(cσ ))/F (0) and the desired result (13) follows from the large-z behaviour of F (z) ≈ z −1 ln3/2 (z). One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get (V + hκ)−1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (hκV −1 (1 − λL /a)p ) − − − g(h) = (17) V −1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (0) − − The denominator here is κ−1 and the two terms are proportional respectively to the covariance kernel eigenvalue λ1 , corresponding to λL = 0 and the constant eigenfunction, and to 1−λ1 . Dropping the 1 first terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where λ1 is close to unity, we have κ ≈ V and so g(h) ≈ (1 + h)−1 + rV (1 − λL /a)p [p/(a − λL )]−3/2 F (h(1 − λL /a)p ) (18) − − − The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to fine detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: = g(n/σ 2 ) ∝ F (n/(c σ 2 )) with c = (1 − λL /a)−p independent of V . The large-n tail of the learning curve in − Fig. 3(right) is consistent with this form. 8 References [1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. [2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997. [3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press. [5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000. [6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press. [7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press. [8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002. [9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press. [10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artificial Intelligence, pages 199–210. 2005. [11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM. [12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Sch¨ lkopf, o editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003. [13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann. [14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. Americal Mathematical Society, 1997. [15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999. [16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999. [17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic brownian motion. J. Phys. A, 29(10):2399–2409, 1996. [18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008. 9
5 0.053026017 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process
Author: Finale Doshi-velez, Shakir Mohamed, Zoubin Ghahramani, David A. Knowles
Abstract: Nonparametric Bayesian models provide a framework for flexible probabilistic modelling of complex datasets. Unfortunately, the high-dimensional averages required for Bayesian methods can be slow, especially with the unbounded representations used by nonparametric models. We address the challenge of scaling Bayesian inference to the increasingly large datasets found in real-world applications. We focus on parallelisation of inference in the Indian Buffet Process (IBP), which allows data points to have an unbounded number of sparse latent features. Our novel MCMC sampler divides a large data set between multiple processors and uses message passing to compute the global likelihoods and posteriors. This algorithm, the first parallel inference scheme for IBP-based models, scales to datasets orders of magnitude larger than have previously been possible. 1
6 0.052506208 205 nips-2009-Rethinking LDA: Why Priors Matter
7 0.050987806 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
8 0.050406545 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
9 0.04988829 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
10 0.049032826 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
11 0.046321746 95 nips-2009-Fast subtree kernels on graphs
12 0.045703828 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
13 0.045509096 204 nips-2009-Replicated Softmax: an Undirected Topic Model
14 0.045396097 87 nips-2009-Exponential Family Graph Matching and Ranking
15 0.044668753 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
16 0.042896569 122 nips-2009-Label Selection on Graphs
17 0.042603217 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
18 0.042371344 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
19 0.041177806 260 nips-2009-Zero-shot Learning with Semantic Output Codes
20 0.03978828 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning
topicId topicWeight
[(0, -0.129), (1, -0.004), (2, -0.02), (3, -0.067), (4, 0.019), (5, -0.074), (6, -0.035), (7, -0.0), (8, -0.043), (9, 0.006), (10, -0.041), (11, 0.045), (12, 0.013), (13, 0.015), (14, -0.025), (15, -0.01), (16, -0.046), (17, -0.012), (18, -0.041), (19, -0.017), (20, 0.046), (21, 0.072), (22, 0.083), (23, 0.033), (24, 0.013), (25, 0.006), (26, 0.024), (27, -0.046), (28, -0.001), (29, -0.078), (30, 0.056), (31, -0.005), (32, -0.008), (33, 0.019), (34, 0.044), (35, -0.004), (36, 0.037), (37, 0.05), (38, -0.021), (39, 0.054), (40, 0.053), (41, -0.094), (42, -0.012), (43, 0.069), (44, -0.014), (45, 0.046), (46, 0.03), (47, 0.038), (48, -0.02), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.94928479 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer
Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1
2 0.56692451 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.
3 0.54191339 205 nips-2009-Rethinking LDA: Why Priors Matter
Author: Andrew McCallum, David M. Mimno, Hanna M. Wallach
Abstract: Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such “smoothing parameters” have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling. 1
4 0.53390634 256 nips-2009-Which graphical models are difficult to learn?
Author: Andrea Montanari, Jose A. Pereira
Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random fields) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random field develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random field µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is defined as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change significantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for specific classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results confirm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be definite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to find W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after fixing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is defined as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More specifically, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This first result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More specifically, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the fluctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a specific low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is defined from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our first result shows that Rlr(λ) indeed reconstructs G if θ is sufficiently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires significant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difficult to learn. We will instead show how to evaluate it on specific graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufficient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and define S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Define Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our first auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufficient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufficiently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisfied since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by definition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and define the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external field’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any fixed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) finite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simplifications arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random fields”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ fling and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9
5 0.53127843 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Author: Chong Wang, David M. Blei
Abstract: We present a nonparametric hierarchical Bayesian model of document collections that decouples sparsity and smoothness in the component distributions (i.e., the “topics”). In the sparse topic model (sparseTM), each topic is represented by a bank of selector variables that determine which terms appear in the topic. Thus each topic is associated with a subset of the vocabulary, and topic smoothness is modeled on this subset. We develop an efficient Gibbs sampler for the sparseTM that includes a general-purpose method for sampling from a Dirichlet mixture with a combinatorial number of components. We demonstrate the sparseTM on four real-world datasets. Compared to traditional approaches, the empirical results will show that sparseTMs give better predictive performance with simpler inferred models. 1
6 0.52448577 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
7 0.48821434 153 nips-2009-Modeling Social Annotation Data with Content Relevance using a Topic Model
8 0.48800138 122 nips-2009-Label Selection on Graphs
9 0.48437577 87 nips-2009-Exponential Family Graph Matching and Ranking
10 0.48400924 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units
11 0.45149329 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
12 0.45085669 204 nips-2009-Replicated Softmax: an Undirected Topic Model
13 0.44254938 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
14 0.43834746 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
15 0.42692298 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
16 0.41874638 56 nips-2009-Conditional Neural Fields
17 0.40735859 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
18 0.40494284 95 nips-2009-Fast subtree kernels on graphs
19 0.40477273 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
20 0.3974663 226 nips-2009-Spatial Normalized Gamma Processes
topicId topicWeight
[(7, 0.01), (24, 0.012), (25, 0.036), (35, 0.033), (36, 0.062), (39, 0.032), (55, 0.014), (58, 0.039), (61, 0.01), (71, 0.577), (81, 0.014), (86, 0.065), (91, 0.011)]
simIndex simValue paperId paperTitle
1 0.9584344 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall
Author: Richard Socher, Samuel Gershman, Per Sederberg, Kenneth Norman, Adler J. Perotte, David M. Blei
Abstract: We develop a probabilistic model of human memory performance in free recall experiments. In these experiments, a subject first studies a list of words and then tries to recall them. To model these data, we draw on both previous psychological research and statistical topic models of text documents. We assume that memories are formed by assimilating the semantic meaning of studied words (represented as a distribution over topics) into a slowly changing latent context (represented in the same space). During recall, this context is reinstated and used as a cue for retrieving studied words. By conceptualizing memory retrieval as a dynamic latent variable model, we are able to use Bayesian inference to represent uncertainty and reason about the cognitive processes underlying memory. We present a particle filter algorithm for performing approximate posterior inference, and evaluate our model on the prediction of recalled words in experimental data. By specifying the model hierarchically, we are also able to capture inter-subject variability. 1
same-paper 2 0.95728803 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models
Author: Laura Dietz, Valentin Dallmeier, Andreas Zeller, Tobias Scheffer
Abstract: We devise a graphical model that supports the process of debugging software by guiding developers to code that is likely to contain defects. The model is trained using execution traces of passing test runs; it reflects the distribution over transitional patterns of code positions. Given a failing test case, the model determines the least likely transitional pattern in the execution trace. The model is designed such that Bayesian inference has a closed-form solution. We evaluate the Bernoulli graph model on data of the software projects AspectJ and Rhino. 1
3 0.94175172 53 nips-2009-Complexity of Decentralized Control: Special Cases
Author: Martin Allen, Shlomo Zilberstein
Abstract: The worst-case complexity of general decentralized POMDPs, which are equivalent to partially observable stochastic games (POSGs) is very high, both for the cooperative and competitive cases. Some reductions in complexity have been achieved by exploiting independence relations in some models. We show that these results are somewhat limited: when these independence assumptions are relaxed in very small ways, complexity returns to that of the general case. 1
4 0.90269715 11 nips-2009-A General Projection Property for Distribution Families
Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári
Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1
5 0.81517184 56 nips-2009-Conditional Neural Fields
Author: Jian Peng, Liefeng Bo, Jinbo Xu
Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.
6 0.81277502 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
7 0.73580092 205 nips-2009-Rethinking LDA: Why Priors Matter
8 0.7216689 65 nips-2009-Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
9 0.62748879 204 nips-2009-Replicated Softmax: an Undirected Topic Model
10 0.6199128 150 nips-2009-Maximum likelihood trajectories for continuous-time Markov chains
11 0.60781831 206 nips-2009-Riffled Independence for Ranked Data
12 0.60722733 171 nips-2009-Nonparametric Bayesian Models for Unsupervised Event Coreference Resolution
13 0.59318966 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
14 0.59053409 96 nips-2009-Filtering Abstract Senses From Image Search Results
15 0.57499337 226 nips-2009-Spatial Normalized Gamma Processes
16 0.57399464 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory
17 0.56605256 154 nips-2009-Modeling the spacing effect in sequential category learning
18 0.56078267 260 nips-2009-Zero-shot Learning with Semantic Output Codes
19 0.55613381 107 nips-2009-Help or Hinder: Bayesian Models of Social Goal Inference
20 0.54704076 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization