nips nips2009 nips2009-141 knowledge-graph by maker-knowledge-mining

141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Source: pdf

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

Abstract: We consider the question of computing Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). We present a randomized iterative algorithm based on simple local updates. The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by ﬁrst, picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. Somewhat surprisingly, we show that this algorithm ﬁnds a near optimal assignment within n log 2 n iterations with high probability for any n node pair-wise MRF with geometry (i.e. MRF graph with polynomial growth) with the approximation error depending on (in a reasonable manner) the geometric growth rate of the graph and the average radius of the local neighborhood – this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. Through extensive simulations, we show that our algorithm ﬁnds extremely good approximate solutions for various kinds of MRFs with geometry.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider the question of computing Maximum A Posteriori (MAP) assignment in an arbitrary pair-wise Markov Random Field (MRF). [sent-5, score-0.149]

2 We present a randomized iterative algorithm based on simple local updates. [sent-6, score-0.28]

3 The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by ﬁrst, picking a random node, then selecting an (appropriately chosen) random local neighborhood and optimizing over this local neighborhood. [sent-7, score-0.526]

4 Somewhat surprisingly, we show that this algorithm ﬁnds a near optimal assignment within n log 2 n iterations with high probability for any n node pair-wise MRF with geometry (i. [sent-8, score-0.519]

5 MRF graph with polynomial growth) with the approximation error depending on (in a reasonable manner) the geometric growth rate of the graph and the average radius of the local neighborhood – this allows for a graceful tradeoff between the complexity of the algorithm and the approximation error. [sent-10, score-1.257]

6 In most applications, the primary inference question of interest is that of ﬁnding maximum a posteriori (MAP) solution – e. [sent-18, score-0.137]

7 These ”local update” algorithms start from an initial solution and proceed by making a series of changes which lead to solutions having lower energy (or better likelihood), and hence are also called ”move making algorithms”. [sent-26, score-0.141]

8 At each step, the algorithms search the space of all possible local changes that can be made to the current solution (also called move space), and choose the one which leads to the solution having the highest probability or lowest energy. [sent-27, score-0.306]

9 Its local update involves selecting (randomly or deterministically) a variable of the problem. [sent-29, score-0.159]

10 The size of the move space is the deﬁning characteristic of any such move making algorithm. [sent-33, score-0.264]

11 A large move space means that more extensive changes to the current solution can be made. [sent-34, score-0.163]

12 This makes the algorithm less prone to getting stuck in local minima and also results in a faster rate of convergence. [sent-35, score-0.225]

13 Expansion and Swap are move making algorithms which search for the optimal move in a move space of size 2n where n is the number of random variables. [sent-36, score-0.425]

14 For energy functions composed of metric pairwise potentials, the optimal move can be found in polynomial time by minimizing a submodular quadratic pseudo-boolean function [3] (or solving an equivalent minimum cost st-cut problem). [sent-37, score-0.365]

15 The last few years have seen a lot of interest in st-mincut based move algorithms for energy minimization. [sent-38, score-0.275]

16 They showed that expansion can be seen as solving the dual of a linear programming relaxation of the energy minimization problem. [sent-41, score-0.159]

17 Researchers have also proposed a number of novel move encoding strategies for solving particular forms of energy functions. [sent-42, score-0.214]

18 Veksler [18] proposed a move algorithm in which variables can choose any label from a range of labels. [sent-43, score-0.185]

19 They showed that this move space allowed them to obtain better minima of energy functions with truncated convex pairwise terms. [sent-44, score-0.288]

20 Kumar and Torr [10] have since shown that the range move algorithm achieves the same guarantees as the ones obtained by methods based on the standard linear programming relaxation. [sent-45, score-0.231]

21 In a sense, it can be viewed as an iterative algorithm that makes local updates based optimizing based on the immediate graphical structure. [sent-48, score-0.314]

22 There is a long list of literature on understanding the conditions under which max-product belief propagation algorithm ﬁnd correct solution. [sent-49, score-0.128]

23 Speciﬁcally, in recent years a sequence of results suggest that there is an intimate relation between the maxproduct algorithm and a natural linear programming relaxation – for example, see [1, 2, 8, 16, 21]. [sent-50, score-0.128]

24 We also note that Swendsen-Wang algorithm (SW) [17], a local ﬂipping algorithm, has a philosophy similar to ours in that it repeats a process of randomly partitioning the graph, and computing an assignment. [sent-51, score-0.226]

25 However, the graph partitioning of SW is fundamentally different from ours and there is no known guarantee for the error bound of SW. [sent-52, score-0.248]

26 In summary, all the approaches thus far with provable guarantees for local update based algorithm are primarily for linear or more generally convex optimization setup. [sent-53, score-0.263]

27 As the main result of this paper, we propose a randomized iterative local algorithm that is based on simple local updates. [sent-55, score-0.392]

28 The algorithm, starting with an arbitrary initial assignment, updates it in each iteration by ﬁrst picking a random node, then its (appropriate) random local neighborhood and optimizing over this local neighborhood. [sent-56, score-0.526]

29 Somewhat surprisingly, we show that this algorithm ﬁnds near optimal assignment within n log 2 n iterations with high probability for graphs with geometry – i. [sent-57, score-0.507]

30 graphs in which the neighborhood of each node within distance r grows no faster than a polynomial in r. [sent-59, score-0.402]

31 Such graphs can have arbitrarily structure subject to this polynomial growth structure. [sent-60, score-0.447]

32 We show that the approximation error depends gracefully on the average random radius of the local neighborhood and degree of polynomial growth of the graph. [sent-61, score-0.808]

33 Overall, our algorithm can provide an ε−approximation MAP with C(ε)n log 2 n total computation with C(ε) depending only on ε and the degree of polynomial growth. [sent-62, score-0.278]

34 The crucial novel feature of our algorithm is the appropriate selection of random local neighborhood rather than deterministic in order to achieve provable performance guarantee. [sent-63, score-0.338]

35 Speciﬁcally, we apply our algorithm to two popular setting: (a) a grid graph based pairwise MRF with varying node and edge interaction strengths, and (b) a grid graph based MRF on the weighted independent set (or hardcore) model. [sent-66, score-0.836]

36 We ﬁnd that with very small radius (within 3), we ﬁnd assignment which within 1% (0. [sent-67, score-0.245]

37 99 factor) of the MAP for a large range of parameters and upto graph of 1000 nodes. [sent-68, score-0.18]

38 1 In this paper, the question of interest is to ﬁnd the maximum a posteriori (MAP) assignment x ∗ ∈ Σn , i. [sent-83, score-0.255]

39 n x∈Σ Equivalently, from the optimization point of view, we wish to ﬁnd an optimal assignment of the problem maximize H(x) over x ∈ Σn , where ln Ψv (xv ) + ln Ψuv (xu , xv ). [sent-86, score-0.482]

40 Speciﬁcally, we call an assignment x as an ε-approximate MAP if (1 − ε)H(x∗ ) ≤ H(x) ≤ H(x∗ ). [sent-90, score-0.149]

41 To this end, a graph G = (V, E) induces a natural ‘graph metric’ on vertices V , denoted by d G : V × V → R+ with dG (v, u) as the length of the shortest path between u and v; with it deﬁned as ∞ if there is no path between them. [sent-93, score-0.223]

42 We call a graph G with polynomial growth of degree (or growth rate) ρ, if for any v ∈ V and r ∈ N, |BG (v, r)| ≤ C · rρ , where C > 0 is a universal constant and B G (v, r) = {w ∈ V |dG (w, v) < r}. [sent-95, score-0.81]

43 A large class of graph model naturally fall into the graphs with polynomial growth. [sent-96, score-0.404]

44 To begin with, the standard d-dimensional regular grid graphs have polynomial growth rate d – e. [sent-97, score-0.61]

45 More generally, in recent years in the context of computational geometry and metric embedding, the graphs with ﬁnite doubling dimensions have become popular object of study [6, 7]. [sent-100, score-0.259]

46 3 It can be checked that a graph with doubling dimension ρ is also a graph with polynomial growth rate ρ. [sent-102, score-0.804]

47 Finally, the popular geometric graph model where nodes are placed arbitrarily on a two dimensional surface with minimum distance separation and two nodes have an edge between them if they are within certain ﬁnite distance, then it is indeed a graph with ﬁnite polynomial growth rate. [sent-103, score-0.991]

48 The main result of this paper is a randomized iterative algorithm based on simple local updates. [sent-105, score-0.28]

49 In each iteration, it picks a node, say v from all n nodes of V , uniformly at random and picks a random radius Q (as per speciﬁc distribution). [sent-108, score-0.325]

50 The algorithm re-assigns values to all nodes within distance Q of node v with respect to graph distance d G by ﬁnding the optimal assignment for this local neighborhood subject to keeping the assignment to all other nodes the same. [sent-109, score-1.003]

51 The algorithm L OC -A LGO described in Section 3 repeats this process for n log 2 n many times. [sent-110, score-0.128]

52 We show that L OC -A LGO ﬁnds near optimal solution with high probability as long as the graph has ﬁnite polynomial growth rate. [sent-111, score-0.624]

53 Given MRF based on graph G = (V, E) of n = |V | nodes with polynomial growth rate ρ and approximation parameter ε ∈ (0, 1), our algorithm L OC -A LGO with O (log(1/δ)n log n) iterations produces a solution x such that 1 . [sent-113, score-0.878]

54 Pr[H(x∗ ) − H(x) ≤ 2εH(x∗ )] ≥ 1 − δ − poly(n) And each iteration takes at most ζ(ε, ρ) computation, with ρ ζ(ε, ρ) ≤ |Σ|CK(ε,ρ) , where K(ε, ρ) is deﬁned as 8ρ K = K(ε, ρ) = log ϕ 8ρ ϕ + 4 4 1 log C + log + 2 ϕ ϕ ϕ with ϕ= ε . [sent-114, score-0.188]

55 On one hand, this result establishes that it is indeed possible to have polynomial (or almost linear) time approximation algorithm for arbitrary pair-wise MRF with polynomial growth. [sent-116, score-0.385]

56 Iteratively, at each step a vertex v ∈ V is chosen uniformly at random along with a random radius Q that is chosen independently as per distribution Q. [sent-122, score-0.364]

57 Then, select R ⊂ V , the local neighborhood (or ball) of radius Q around v as per graph distance d G , i. [sent-123, score-0.516]

58 Then while keeping the assignment of all nodes in V \R ﬁxed as per x = (ˆv )v∈V , ﬁnd MAP x assignment x∗,R restricted to nodes of R. [sent-126, score-0.515]

59 And, update the assignment of nodes in v ∈ R as per x∗,R . [sent-127, score-0.303]

60 Speciﬁcally, given parameters ε ∈ (0, 1) and the polynomial growth rate ρ ε (with constant C) of the graph, deﬁne ϕ = 5C2ρ , and 8ρ 4 4 1 8ρ log + log C + log + 2. [sent-132, score-0.53]

61 Through dynamic programming (or exhaustive computation) ﬁnd an exact MAP x∗,R for R while ﬁxing all the other assignment of x value outside R. [sent-140, score-0.255]

62 If we run the L OC -A LGO with (2n ln n) iterations, with probability at least 1 − 1/n, we have (1 − ε)H(x∗ ) ≤ E[H(x)] ≤ H(x∗ ). [sent-147, score-0.128]

63 Deﬁne T = 2 log(1/δ), and consider L OC A LGO with (2T n ln n) iterations. [sent-149, score-0.128]

64 2 (2) Note that (2) is true for any initial assignment of L OC -A LGO. [sent-151, score-0.149]

65 Hence for each 1 ≤ t ≤ T , after (2tn ln n) iterations, (2) holds independently with probability 1 − 1/n. [sent-152, score-0.158]

66 Hence, H(x ∗ ) − H(x) > 2εH(x∗ ) holds after (2T n ln n) iterations only if the same holds after (2tn ln n) iterations for all 1 ≤ t ≤ T . [sent-154, score-0.384]

67 For the total computation bound in Theorem 1, note that each iteration of L OC -A LGO involves dynamic programming over a local neighborhood of radius at most K = K(ε, ρ) around a node. [sent-156, score-0.442]

68 5 This involves, due to the polynomial growth condition, at most CK ρ nodes. [sent-157, score-0.374]

69 see [4]), it follows that after 2n ln n iterations, with probability at least 1 − 1/n, all the vertices of V will be chosen as ‘ball centers’ at least once. [sent-163, score-0.207]

70 Now we prove that if all the vertices of V are chosen as ‘ball centers’ at least once, the answer x generated by L OC -A LGO after 2n ln n iterations, is indeed an ε-approximation on average. [sent-165, score-0.207]

71 Then for each vertex v ∈ V, we assign the largest iteration number t such that the chosen ball R at the iteration t contains w. [sent-168, score-0.293]

72 Clearly, this is well deﬁned algorithm is run till each node is chosen as the ‘ball center’ at least once. [sent-170, score-0.174]

73 Now consider graph G = (V, E\B) obtained by removing edges B from G. [sent-172, score-0.18]

74 Given two MRFs X1 and X2 on the same graph G = (V, E) with identical edge potentials {ψij (·, ·)}, (i, j) ∈ E but distinct node potentials {φ 1 (·)}, {φ2 (·)}, i ∈ V respectively. [sent-177, score-0.395]

75 Finally, for ∈ {1, 2} and any x ∈ Σ n , i i i deﬁne H (x) = i∈V φi (xi ) + (i,j)∈E ψij (xi , xj ), with x∗, being a MAP assignment of MRF D x . [sent-179, score-0.149]

76 Given MRF X deﬁned on G (as in (1)), the algorithm L OC -A LGO produces output x such that ⎛ ⎞ |H(x∗ ) − H(x)| ≤ 5 ⎝ U L ψij − ψij ⎠ , (i,j)∈B U where B is the (random) imaginary boundary set of L OC -A LGO, ψ ij L ψij minσ,σ ∈Σ ψij (σ, σ ). [sent-182, score-0.243]

77 maxσ,σ ∈Σ ψij (σ, σ ), and Now we obtain the following lemma that utilizes the fact that the distribution Q follows a geometric distribution with rate (1 − ϕ) – its proof is omitted. [sent-183, score-0.202]

78 (3) (i,j)∈E Finally, we establish the following lemma that bounds (i,j)∈E U L ψij − ψij – its proof is omitted. [sent-187, score-0.134]

79 If G has maximum vertex degree d ∗ , then U L ψij − ψij ≤ (d∗ + 1)H(x∗ ). [sent-189, score-0.138]

80 (4) (i,j)∈E Now recall that the maximum vertex degree d ∗ of G is less than 2ρ C by the deﬁnition of polynomially growing graph. [sent-190, score-0.231]

81 6 5 Experiments Our algorithm provides a provable approximation for any MRF on a polynomially growing graph. [sent-193, score-0.227]

82 In this section, we present experimental evaluations of our algorithm for two popular models: (a) synthetic Ising model, and (b) hardcore (independent set) model. [sent-194, score-0.17]

83 Σ = {0, 1}) MRF on an n 1 × n2 grid G = (V, E): ⎛ ⎞ Pr(x) ∝ exp ⎝ θij xi xj ⎠ , for x ∈ {0, 1}n1n2 . [sent-198, score-0.13]

84 For each (i, j) ∈ E, choose θ ij independently from U[−α, α]. [sent-202, score-0.183]

85 5 1 2 4 8 16 32 64 Figure 3: (A) plots the error of local update algorithm for a random Ising model in the grid graph of size 10 × 10, and (B) plots the error in the grid of size 100 × 10. [sent-224, score-0.821]

86 To compare the effectiveness of our algorithm for each size of the local updates, in our simulation, we ﬁx the square size as a constant instead of choosing it from a distribution. [sent-225, score-0.165]

87 We run the simulation for the local square size r×r with r = 1, 2, 3, where r = 1 is the case when each square consists of a single vertex. [sent-226, score-0.163]

88 We computed an exact MAP assignment x ∗ by dynamic programming, and computed the output x of our local update algorithm for each r, by doing 4n 1 n2 log(n1 n2 ) many local updates for n1 × n2 grid graph. [sent-227, score-0.688]

89 The Figure 3(A) plots the error for the grid of size 10 × 10, while Figure 3(B) plots the error for the grid of size 100 × 10. [sent-230, score-0.4]

90 As the simulation result suggests, for any graph and any range of α, the error of the local update algorithm decreases dramatically as r increases. [sent-233, score-0.484]

91 Moreover, when r is comparably small as r = 3, the output of the local update algorithm achieves remarkably good approximation. [sent-234, score-0.239]

92 We consider the vertex weighted independent set model deﬁned on a grid graph. [sent-237, score-0.203]

93 Speciﬁcally, consider a binary MRF on an n 1 × n2 grid G = (V, E): ⎞ ⎛ Pr(x) ∝ exp ⎝ Ψ(xi xj )⎠ , for x ∈ {0, 1}n1n2 . [sent-239, score-0.13]

94 For this model, we did simulations for grid graphs of size 10×10, 30×10, and 100×10 respectively. [sent-245, score-0.229]

95 As the result shows, our local update algorithm achieves remarkably good approximation of the MAP or equivalently in this setup the maximum weight independent set, even with very small r values ! [sent-248, score-0.365]

96 6 Conclusion We considered the question of designing simple, iterative algorithm with local updates for ﬁnding MAP in any pair-wise MRF. [sent-259, score-0.279]

97 Our results are somewhat surprising given that thus far the known theoretical justiﬁcation for such local algorithms strongly dependended on some form of convexity of the ‘energy’ function. [sent-262, score-0.14]

98 We believe that our algorithm will be of great practical interest in near future as a large class of problems that utilize MRF based modeling and inference in practice have the underlying graphical structure possessing some form of geometry naturally. [sent-264, score-0.247]

99 A new framework for approximate labeling via graph cuts. [sent-314, score-0.18]

100 Graph cut based optimization for mrfs with truncated convex priors. [sent-358, score-0.135]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('oc', 0.396), ('lgo', 0.395), ('mrf', 0.364), ('growth', 0.223), ('graph', 0.18), ('ij', 0.153), ('polynomial', 0.151), ('assignment', 0.149), ('move', 0.132), ('grid', 0.13), ('ln', 0.128), ('local', 0.112), ('map', 0.105), ('shah', 0.099), ('pr', 0.098), ('radius', 0.096), ('neighborhood', 0.093), ('lemma', 0.092), ('mrfs', 0.088), ('geometry', 0.088), ('node', 0.085), ('hardcore', 0.085), ('energy', 0.082), ('dg', 0.077), ('xv', 0.077), ('uv', 0.073), ('graphs', 0.073), ('vertex', 0.073), ('nodes', 0.072), ('iteration', 0.065), ('iterations', 0.064), ('setup', 0.064), ('iterative', 0.059), ('bayati', 0.056), ('kyomin', 0.056), ('randomized', 0.056), ('updates', 0.055), ('ball', 0.054), ('algorithm', 0.053), ('polynomially', 0.053), ('provable', 0.051), ('simulation', 0.051), ('ising', 0.05), ('devavrat', 0.049), ('komodakis', 0.049), ('update', 0.047), ('ck', 0.047), ('truncated', 0.047), ('programming', 0.046), ('edge', 0.046), ('veksler', 0.045), ('belief', 0.044), ('vertices', 0.043), ('description', 0.042), ('proof', 0.042), ('posteriori', 0.042), ('potentials', 0.042), ('error', 0.041), ('log', 0.041), ('nds', 0.04), ('growing', 0.04), ('near', 0.039), ('keeping', 0.038), ('theorem', 0.037), ('doubling', 0.037), ('sw', 0.037), ('imaginary', 0.037), ('chosen', 0.036), ('geometric', 0.035), ('per', 0.035), ('graphical', 0.035), ('repeats', 0.034), ('degree', 0.033), ('rate', 0.033), ('picks', 0.032), ('kumar', 0.032), ('interest', 0.032), ('maximum', 0.032), ('statement', 0.032), ('popular', 0.032), ('expansion', 0.031), ('solution', 0.031), ('picking', 0.031), ('propagation', 0.031), ('exhaustive', 0.03), ('dynamic', 0.03), ('approximation', 0.03), ('independently', 0.03), ('random', 0.029), ('centers', 0.029), ('essence', 0.029), ('years', 0.029), ('plots', 0.029), ('foundations', 0.028), ('somewhat', 0.028), ('start', 0.028), ('minima', 0.027), ('partitioning', 0.027), ('remarkably', 0.027), ('simulations', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

2 0.13054755 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

Abstract: We consider the problem of ﬁnding the M assignments with maximum probability in a probabilistic graphical model. We show how this problem can be formulated as a linear program (LP) on a particular polytope. We prove that, for tree graphs (and junction trees in general), this polytope has a particularly simple form and diﬀers from the marginal polytope in a single inequality constraint. We use this characterization to provide an approximation scheme for non-tree graphs, by using the set of spanning trees over such graphs. The method we present puts the M -best inference problem in the context of LP relaxations, which have recently received considerable attention and have proven useful in solving diﬃcult inference problems. We show empirically that our method often ﬁnds the provably exact M best conﬁgurations for problems of high tree-width. A common task in probabilistic modeling is ﬁnding the assignment with maximum probability given a model. This is often referred to as the MAP (maximum a-posteriori) problem. Of particular interest is the case of MAP in graphical models, i.e., models where the probability factors into a product over small subsets of variables. For general models, this is an NP-hard problem [11], and thus approximation algorithms are required. Of those, the class of LP based relaxations has recently received considerable attention [3, 5, 18]. In fact, it has been shown that some problems (e.g., ﬁxed backbone protein design) can be solved exactly via sequences of increasingly tighter LP relaxations [13]. In many applications, one is interested not only in the MAP assignment but also in the M maximum probability assignments [19]. For example, in a protein design problem, we might be interested in the M amino acid sequences that are most stable on a given backbone structure [2]. In cases where the MAP problem is tractable, one can devise tractable algorithms for the M best problem [8, 19]. Speciﬁcally, for low tree-width graphs, this can be done via a variant of max-product [19]. However, when ﬁnding MAPs is not tractable, it is much less clear how to approximate the M best case. One possible approach is to use loopy max-product to obtain approximate max-marginals and use those to approximate the M best solutions [19]. However, this is largely a heuristic and does not provide any guarantees in terms of optimality certiﬁcates or bounds on the optimal values. LP approximations to MAP do enjoy such guarantees. Speciﬁcally, they provide upper bounds on the MAP value and optimality certiﬁcates. Furthermore, they often work for graphs with large tree-width [13]. The goal of the current work is to leverage the power of LP relaxations to the M best case. We begin by focusing on the problem of ﬁnding the second best solution. We show how it can be formulated as an LP over a polytope we call the “assignment-excluding marginal polytope”. In the general case, this polytope may require an exponential number of inequalities, but we prove that when the graph is a tree it has a very compact representation. We proceed to use this result to obtain approximations to the second best problem, and show how these can be tightened in various ways. Next, we show how M best assignments can be found by relying on algorithms for 1 second best assignments, and thus our results for the second best case can be used to devise an approximation algorithm for the M best problem. We conclude by applying our method to several models, showing that it often ﬁnds the exact M best assignments. 1 The M-best MAP problem and its LP formulation Consider a function on n variables deﬁned as: f (x1 , . . . , xn ; θ) = θij (xi , xj ) + ij∈E θi (xi ) (1) i∈V where V and E are the vertices and nodes of a graph G with n nodes. We shall be interested in the M assignments with largest f (x; θ) value.1 Denote these by x(1) , . . . , x(M) , so that x(1) is the assignment that maximizes f (x; θ), x(2) is the 2nd best assignment, etc. The MAP problem (i.e., ﬁnding x(1) ) can be formulated as an LP as follows [15]. Let µ be a vector of distributions that includes {µij (xi , xj )}ij∈E over edge variables and {µi (xi )}i∈V over nodes. The set of µ that arise from some joint distribution is known as the marginal polytope [15] and is denoted by M(G). Formally: M(G) = {µ | ∃p(x) ∈ ∆ s.t. p(xi , xj ) = µij (xi , xj ) , p(xi ) = µi (xi )} . where ∆ is the set of distributions on x. The MAP problem can then be shown to be equivalent to the following LP:2 max f (x; θ) = max µ · θ , (2) x µ∈M(G) It can be shown that this LP always has a maximizing µ that is a vertex of M(G) and is integral. Furthermore, this µ corresponds to the MAP assignment x(1) . Although the number of variables in this LP is only O(|E| + |V |), the diﬃculty comes from an exponential number of linear inequalities generally required to describe the marginal polytope M(G). We shall ﬁnd it useful to deﬁne a mapping between assignments x and integral vertices of the polytope. Given an integral vertex v ∈ M(G), deﬁne x(v) to be the assignment that maximizes vi (xi ). And, given an assignment z deﬁne v(z) to be the integral vertex in M(G) corresponding to the assignment z. Thus the LP in Eq. 2 will be maximized by v(x(1) ). One simple outer bound of the marginal polytope is the local polytope ML (G), which only enforces pairwise constraints between variables:     µi (xi ) = 1 (3) µij (xi , xj ) = µj (xj ), µij (xi , xj ) = µi (xi ), ML (G) = µ ≥ 0   x x x i j i The LP relaxation is then to maximize µ · θ where µ ∈ ML (G). For tree structured graphs, ML (G) = M(G) [15] and thus the LP relaxation yields the exact MAP x(1) . An LP Formulation for the 2nd -best MAP 2 Assume we found the MAP assignment x(1) and are now interested in ﬁnding x(2) . Is there a simple LP whose solution yields x(2) ? We begin by focusing on the case where G is a tree so that the local LP relaxation is exact. We ﬁrst treat the case of a connected tree. To construct an LP whose solution is x(2) , a natural approach is to use the LP for x(1) (i.e., the LP in Eq. 2) but somehow eliminate the solution x(1) using additional constraints. This, however, is somewhat trickier than it sounds. The key diﬃculty is that the new constraints should not generate fractional vertices, so that the resulting LP is still exact. We begin by deﬁning the polytope over which we need to optimize in order to obtain x(2) . 1 2 This is equivalent to ﬁnding P maximum probability assignments for a model p(x) ∝ ef (x;θ) . the P P P We use the notation µ · θ = ij∈E xi ,xj µij (xi , xj )θij (xi , xj ) + i xi µi (xi )θi (xi ) 2 Deﬁnition 1. The assignment-excluding marginal polytope is deﬁned as: ˆ M(G, z) = {µ | ∃p(x) ∈ ∆ s.t. p(z) = 0, p(xi , xj ) = µij (xi , xj ), p(xi ) = µi (xi )} . ˆ M(G, z) is simply the convex hull of all (integral) vectors v(x) for x = z. (4) ˆ The following result shows that optimizing over M(G, x(1) ) will yield the second best soluˆ tion x(2) , so that we refer to M(G, x(1) ) as the second-best marginal polytope. Lemma 1. The 2nd best solution is obtained via the following LP: maxx=x(1) f (x; θ) = maxµ∈M(G,x(1) ) µ · θ. Furthermore, the µ that maximizes the LP on ˆ the right is integral and corresponds to the second-best MAP assignment x(2) . The proof is similar to that of Eq. 2: instead of optimizing over x, we optimize over distributions p(x), while enforcing that p(x(1) ) = 0 so that x(1) is excluded from the maximization. The key question which we now address is how to obtain a simple characterization of ˆ ˆ M(G, z). Intuitively, it would seems that M(G, z) should be “similar” to M(G), such that it can be described as M(G) plus some constraints that “block” the assignment z. To illustrate the diﬃculty in ﬁnding such “blocking” constraints, consider the following constraint, originally suggested by Santos [10]: i µi (zi ) ≤ n − 1. This inequality is not satisﬁed by µ = v(z) since v(z) attains the value n for the LHS of the above. Furthermore, for any x = z and µ = v(x), the LHS would be n − 1 or less. Thus, this inequality separates ˆ v(z) from all other integral vertices. One might conclude that we can deﬁne M(G, z) by adding this inequality to M(G). The diﬃculty is that the resulting polytope has fractional vertices,3 and maximizing over it won’t generally yield an integral solution. It turns out that there is a diﬀerent inequality that does yield an exact characterization of ˆ M(G, z) when G is a tree. We now deﬁne this inequality and state our main theorem. Deﬁnition 2. Consider the functional I(µ, z) (which is linear in µ): (1 − di )µi (zi ) + I(µ, z) = i µij (zi , zj ) (5) ij∈E where di is the degree of node i in the tree graph G. ˆ Theorem 1. Adding the single inequality I(µ, z) ≤ 0 to M(G) yields M(G, z). ˆ M(G, z) = {µ | µ ∈ M(G), I(µ, z) ≤ 0 } (6) The theorem is proved in the appendix. Taken together with Lemma 1, it implies that x(2) may be obtained via an LP that is very similar to the MAP-LP, but has an additional constraint. We note the interesting similarity between I(µ, z) and the Bethe entropy [20]. The only diﬀerence is that in Bethe, µi , µij are replaced by H(Xi ), H(Xi , Xj ) respectively.4 The theorem also generalizes to the case where G is not a tree, but we have a junction tree for G. In this case, the theorem still holds if we deﬁne a generalized I(µ, z) inequality as: (1 − dS )µS (zS ) + S∈S µC (zC ) ≤ 0 (7) C∈C where C and S are the junction tree cliques and their separators, respectively, and dS is the number of cliques that intersect on separator S. In this case, the marginal polytope should enforce consistency between marginals µC (zC ) and their separators µS (zS ). However, such a characterization requires variables whose cardinality is exponential in the tree-width and is thus tractable only for graphs of low tree-width. In the next section, we address approximations for general graphs. A corresponding result exists for the case when G is a forest. In this case, the inequality in Eq. 6 is modiﬁed to: I(µ, z) ≤ |P | − 1, where |P | denotes the number of connected components of G. Interestingly, for a graph without edges, this gives the Santos inequality. 3 Consider the case of a single edge between 2 nodes where the MAP assignment is (0, 0). Adding the inequality µ1 (0) + µ2 (0) ≤ 1 produces the fractional vertex (0.5, 0.5). 4 The connection to Bethe can be more clearly understood from a duality-based proof of Theorem 1. We will cover this in an extended version of the manuscript. 3 2nd best LPs for general graphs - Spanning tree inequalities 3 When the graph G is not a tree, the marginal polytope M(G) generally requires an exponential number of inequalities. However, as mentioned above, it does have an exact description in terms of marginals over cliques and separators of a junction tree. Given such marginals on ˆ junction tree cliques, we also have an exact characterization of M(G, z) via the constraint in Eq. 7. However, in general, we cannot aﬀord to be exponential in tree-width. Thus a common strategy [15] is to replace M(G) with an outer bound that enforces consistency between marginals on overlapping sets of variables. The simplest example is ML (G) in Eq. 3. ˆ In what follows, we describe an outer-bound approximation scheme for M(G, z). We use ML (G) as the approximation for M(G) (more generally ML (G) can enforce consistency between any set of small regions, e.g., triplets). When G is not a tree, the linear constraint in ˆ Eq. 6 will no longer suﬃce to derive M(G, z). Moreover, direct application of the inequality will incorrectly remove some integral vertices. An alternative approach is to add inequalities that separate v(z) from the other integral vertices. This will serve to eliminate more and more fractional vertices, and if enough constraints are added, this may result in an integral solution. One obvious family of such constraints are those corresponding to spanning trees in G and have the form of Eq. 5. Deﬁnition 3. Consider any T that is a spanning tree of G. Deﬁne the functional I T (µ, z): (1 − dT )µi (zi ) + i I T (µ, z) = i µij (zi , zj ) (8) ij∈T where dT is the degree of i in T . We refer to I T (µ, z) ≤ 0 as a spanning tree inequality. i For any sub-tree T of G, the corresponding spanning tree inequality separates the vertex v(z) from the other vertices. This can be shown via similar arguments as in the proof of Theorem 1. Note, however, that the resulting polytope may still have fractional vertices. The above argument shows that any spanning tree provides a separating inequality for ˆ M(G, z). In principle, we would like to use as many such inequalities as possible. Deﬁnition 4. The spanning tree assignment-excluding marginal polytope is deﬁned as: ˆ MST (G, z) = µ | µ ∈ ML (G), L ∀ tree T ⊆ E I T (µ, z) ≤ 0 (9) where the ST notation indicates the inclusion of all spanning tree inequalities for G.5 Thus, we would actually like to perform the following optimization problem: max ˆ µ∈MST (G,z) L µ·θ ˆ as an approximation to optimization over M(G, z); i.e., we seek the optimal µ subject to all spanning tree inequalities for G with the ambition that this µ be integral and thus provide the non-z MAP assignment, with a certiﬁcate of optimality. Although the number of spanning trees is exponential in n, it turns out that all spanning inequalities can be used in practice. One way to achieve this is via a cutting plane algorithm [12] that ﬁnds the most violated spanning tree inequality and adds it to the LP. To implement this eﬃciently, we note that for a particular µ and a spanning tree T , the value of I T (µ, z) can be decomposed into a sum over the edges in T (and a T -independent constant): I T (µ, z) = µi (zi ) µij (zi , zj ) − µi (zi ) − µj (zj ) + (10) i ij∈T The tree maximizing the above is the maximum-weight spanning tree with edge-weights wij = µij (zi , zj ) − µi (zi ) − µj (zj ). It can thus be found eﬃciently. The cutting plane algorithm proceeds as follows. We start by adding an arbitrary spanning tree. Then, as long as the optimal µ is fractional, we ﬁnd the spanning tree inequality that µ most violates (where this is implemented via the maximum-weight spanning tree). This constraint will necessarily remove µ from the polytope. If there are no violated inequalities 5 ˆ ˆL Note that M(G, z) ⊆ MST (G, z) ⊂ ML (G). 4 but µ is still fractional, then spanning tree inequalities do not suﬃce to ﬁnd an integral solution (but see below on hypertree constraints to add in this case). In practice, we found that only a relatively small number of inequalities are needed to successfully yield an integral solution, or determine that all such inequalities are already satisﬁed. An alternative approach for solving the all spanning-tree problem is to work via the dual. The dual variables roughly correspond to points in the spanning tree polytope [16], optimization over which can be done in polynomial time, e.g., via the ellipsoid algorithm. We do not pursue this here since the cutting plane algorithm performed well in our experiments. ˆ As mentioned earlier, we can exactly characterize M(G, z) using Eq. 7, albeit at a cost exponential in the tree-width of the graph. A practical compromise would be to use inequalities over clique trees of G, where the cliques are relatively small, e.g., triplets. The corresponding constraint (Eq. 7 with the small cliques and their separators) will necessarily separate v(z) from the other integral vertices. Finding the maximally violated such inequality is an NP-hard problem, equivalent to a prize collecting Steiner tree problem, but recent work has found that such problems are often exactly solvable in practice [7]. It thus might be practical to include all such trees as constraints using a cutting plane algorithm. 4 From 2nd -best to M-best Thus far, we only dealt with the 2nd best case. As we show now, it turns out that the 2nd -best formalism can be used to devise an algorithm for M best. We begin by describing an algorithm for the exact M best and then show how it can be used to approximate those via the approximations for 2nd best described above. Fig. 1 describes our scheme, which we call Partitioning for Enumerating Solutions (or PES) for solving the M best problem. The scheme is general and only assumes that MAP-“like” problems can be solved. It is inspired by several pre-existing M best solution schemes [4, 6, 8, 19] but diﬀers from them in highlighting the role of ﬁnding a second best solution within a given subspace. for m ← 1 to M do if m = 1 then Run MAP solver to obtain the best assignment: x(1) ≡ arg max f (x; θ) CONSTRAINTS1 ← ∅ else k ←− arg max ′ k′ ∈{1,...,m−1} f (y(k ) ; θ) // sub-space containing mth best assignment x(m) ← y(k) // mth best assignment // A variable choice that distinguishes x(m) from x(k) : (m) (v, a) ← any member of the set {(i, xi (m) ) : xi (k) = xi } CONSTRAINTSm ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(k) (as MAP) from subspace m CONSTRAINTSk ← CONSTRAINTSk ∪ {xv = a} // Eliminate x(m) (as 2nd -best) from subspace k y(k) ← CalcNextBestSolution(CONSTRAINTSk , x(k) ) end y(m) ← CalcNextBestSolution(CONSTRAINTSm , x(m) ) end return {x(m) }M m=1 /* Find next best solution in sub-space defined by CONSTRAINTS */ Function CalcNextBestSolution(CONSTRAINTS, x(∗) ) // x(∗) is the MAP in the sub-space defined by CONSTRAINTS: Run MAP solver to obtain the second-best solution: y ≡ arg max f (x; θ), and return y. x=x(∗) ,CONSTRAINTS end Figure 1: Pseudocode for the PES algorithm. The modus operandi of the PES algorithm is to eﬃciently partition the search space while systematically excluding all previously determined assignments. Signiﬁcantly, any MAP 5 Attractive Grids Ranks Run-times 1 50 Mixed Grids Ranks Run-times 1 50 0.5 0 S N B 0 Hard Protein SCP Ranks Run-times 1 50 0.5 S N B 0 0 S+R N+R B+R 0.5 S+R N+R B+R 0 S+R B B+R 0 S+R B B+R Figure 2: Number of best ranks and normalized run-times for the attractive and mixed grids, and the more diﬃcult protein SCP problems. S, N, and B denote the STRIPES, Nilsson, and BMMF algorithms. Algorithms marked with +R denote that regions of variables were added for those runs. solver can be plugged into it, on the condition that it is capable of solving the arg max in the CalcNextBestSolution subroutine. The correctness of PES can be shown by observing that at the M th stage, all previous best solutions are excluded from the optimization and no other assignment is excluded. Of note, this simple partitioning scheme is possible due to the observation that the ﬁrst-best and second-best MAP assignments must diﬀer in the assignment of at least one variable in the graph. The main computational step of the PES algorithm is to maximize f (x; θ) subject to x = x(∗) and x ∈ CONSTRAINTS (see the CalcNextBestSolution subroutine). The CONSTRAINTS set merely enforces that some of the coordinates of x are either equal to or diﬀerent from speciﬁed values.6 Within the LP, these can be enforced by setting µi (xi = a) = 1 or µi (xi = a) = 0. It can be shown that if one optimizes µ · θ with ˆ these constraints and µ ∈ M(G, x(∗) ), the solution is integral. Thus, the only element ˆ requiring approximation in the general case is the description of M(G, x(∗) ). We choose as ˆ this approximation the polytope MST (G, x(∗) ) in Eq. 9. We call the resulting approximaL tion algorithm Spanning TRee Inequalities and Partitioning for Enumerating Solutions, or STRIPES. In the next section, we evaluate this scheme experimentally. 5 Experiments We compared the performance of STRIPES to the BMMF algorithm [19] and the Lawler/Nilsson algorithm [6, 8]. Nilsson’s algorithm is equivalent to PES where the 2nd best assignment is obtained from maximizations within O(n) partitions, so that its runtime is O(n) times the cost of ﬁnding a single MAP. Here we approximated each MAP with its LP relaxation (as in STRIPES), so that both STRIPES and Nilsson come with certiﬁcates of optimality when their LP solutions are integral. BMMF relies on loopy BP to approximate the M best solutions.7 We used M = 50 in all experiments. To compare the algorithms, we pooled all their solutions, noting the 50 top probabilities, and then counted the fraction of these that any particular algorithm found (its solution rank). For run-time comparisons, we normalized the times by the longest-running algorithm for each example. We begin by considering pairwise MRFs on binary grid graphs of size 10 × 10. In the ﬁrst experiment, we used an Ising model with attractive (submodular) potentials, a setting in which the pairwise LP relaxation is exact [14]. For each grid edge ij, we randomly chose Jij ∈ [0, 0.5], and local potentials were randomized in the range ±0.5. The results for 25 graphs are shown in Fig. 2. Both the STRIPES and Nilsson algorithms obtained the 50 optimal solutions (as learned from their optimality certiﬁcates), while BMMF clearly fared less well for some of the graphs. While the STRIPES algorithm took < 0.5 to 2 minutes to run, the Nilsson algorithm took around 13 minutes. On the other hand, BMMF was quicker, taking around 10 seconds per run, while failing to ﬁnd a signiﬁcant portion of the top solutions. Overall, the STRIPES algorithm was required to employ up to 19 spanning tree inequalities per calculation of second-best solution. 6 This is very diﬀerent from the second best constraint, since setting x1 = 1 blocks all assignments with this value, as opposed to setting x = 1 which blocks only the assignment with all ones. 7 For BMMF, we used the C implementation at http://www.cs.huji.ac.il/~ talyam/ inference.html. The LPs for STRIPES and Nilsson were solved using CPLEX. 6 Next, we studied Ising models with mixed interaction potentials (with Jij and the local potentials randomly chosen in [−0.5, 0.5]). For almost all of the 25 models, all three algorithms were not able to successfully ﬁnd the top solutions. Thus, we added regions of triplets (two for every grid face) to tighten the LP relaxation (for STRIPES and Nilsson) and to perform GBP instead of BP (for BMMF). This resulted in STRIPES and Nilsson always provably ﬁnding the optimal solutions, and BMMF mostly ﬁnding these solutions (Fig. 2). For these more diﬃcult grids, however, STRIPES was the fastest of the algorithms, taking 0.5 - 5 minutes. On the other hand, the Nilsson and BMMF algorithms took 18 minutes and 2.5 7 minutes, respectively. STRIPES added up to 23 spanning tree inequalities per iteration. The protein side-chain prediction (SCP) problem is to to predict the placement of amino acid side-chains given a protein backbone [2, 18]. Minimization of a protein energy function corresponds to ﬁnding a MAP assignment for a pairwise MRF [19]. We employed the dataset of [18] (up to 45 states per variable, mean approximate tree-width 50), running all algorithms to calculate the optimal side-chain conﬁgurations. For 315 of 370 problems in the dataset, the ﬁrst MAP solution was obtained directly as a result of the LP relaxation having an integral solution (“easy” problems). STRIPES provably found the subsequent top 50 solutions within 4.5 hours for all but one of these cases (up to 8 spanning trees per calculation), and BMMF found the same 50 solutions for each case within 0.5 hours; note that only STRIPES provides a certiﬁcate of optimality for these solutions. On the other hand, only for 146 of the 315 problems was the Nilsson method able to complete within ﬁve days; thus, we do not compare its performance here. For the remaining 55 (“hard”) problems (Fig. 2), we added problem-speciﬁc triplet regions using the MPLP algorithm [13]. We then ran the STRIPES algorithm to ﬁnd the optimal solutions. Surprisingly, it was able to exactly ﬁnd the 50 top solutions for all cases, using up to 4 standard spanning tree inequalities per second-best calculation. The STRIPES run-times for these problems ranged from 6 minutes to 23 hours. On the other hand, whether running BMMF without these regions (BP) or with the regions (GBP), it did not perform as well as STRIPES in terms of the number of high-ranking solutions or its speed. To summarize, STRIPES provably found the top 50 solutions for 369 of the 370 protein SCP problems. 6 Conclusion ˆ In this work, we present a novel combinatorial object M(G, z) and show its utility in obtaining the M best MAP assignments. We provide a simple characterization of it for tree structured graphs, and show how it can be used for approximations in non-tree graphs. As with the marginal polytope, many interesting questions arise about the properties of ˆ M(G, z). For example, in which non-tree cases can we provide a compact characterization (e.g., as for the cut-polytope for planar graphs [1]). Another compelling question is in which problems the spanning tree inequalities are provably optimal. An interesting generalization of our method is to predict diverse solutions satisfying some local measure of “distance” from each other, e.g., as in [2]. Here we studied the polytope that results from excluding one assignment. An intriguing question is to characterize the polytope that excludes M assignments. We have found that it does not simply correspond to adding M constraints I(µ, z i ) ≤ 0 for i = 1, . . . , M , so its ˆ geometry is apparently more complicated than that of M(G, z). Here we used LP solvers to solve for µ. Such generic solvers could be slow for large-scale problems. However, in recent years, specialized algorithms have been suggested for solving MAP-LP relaxations [3, 5, 9, 17]. These use the special form of the constraints to obtain local-updates and more scalable algorithms. We intend to apply these schemes to our method. Finally, our empirical results show that our method indeed leverages the power of LP relaxations and yields exact M best optimal solutions for problems with large tree-width. Acknowledgements We thank Nati Linial for his helpful discussions and Chen Yanover and Talya Meltzer for their insight and help in running BMMF. We also thank the anonymous reviewers for their useful advice. 7 A Proof of Theorem 1 Recall that for any µ ∈ M(G), there exists a probability density p(x) s.t. µ = x p(x)v(x). Denote pµ (z) as the minimal value of p(z) among all p(x) that give µ. We prove that ˆ pµ (z) = max(0, I(µ, z)), from which the theorem follows (since pµ (z) = 0 iﬀ µ ∈ M(G, z)). The proof is by induction on n. For n = 1, the node has degree 0, so I(µ, z) = µ1 (z1 ). Clearly, pµ (z) = µ1 (z1 ), so pµ (z) = I(µ, z). For n > 1, there must exist a leaf in G ˆ (assume that its index is n and its neighbor’s is n − 1). Denote G as the tree obtained ˆ by removing node n and its edge with n − 1. For any assignment x, denote x as the corresponding sub-assignment for the ﬁrst n − 1 variables. Also, any µ can be derived by ˆ ˆ adding appropriate coordinates to a unique µ ∈ M(G). For an integral vertex µ = v(x), ˆˆ ˆ ˆ ˆ ˆ x denote its projected µ as v (ˆ ). Denote by I(µ, z ) the functional in Eq. 5 applied to G. For ˆ any µ and its projected µ, it can be seen that: ˆˆ ˆ I(µ, z) = I(µ, z ) − α (11) where we deﬁne α = xn =zn µn−1,n (zn−1 , xn ) (so 0 ≤ α ≤ 1). The inductive assumption ˆ ˆ ˆ gives a p(ˆ ) that has marginals µ and also p(ˆ ) = max(0, I(µ, z )). We next use p(ˆ ) to ˆx ˆz ˆx construct a p(x) that has marginals µ and the desired minimal pµ (z). Consider three cases: ˆˆ ˆ I. I(µ, z) ≤ 0 and I(µ, z ) ≤ 0. From the inductive assumption, pµ (ˆ ) = 0, so we deﬁne: ˆˆ z µn−1,n (xn−1 , xn ) p(x) = p(ˆ ) ˆx (12) µn−1 (xn−1 ) which indeed marginalizes to µ, and p(z) = 0 so that pµ (z) = 0 as required. If µn−1 (xn−1 ) = 0, then p(ˆ ) is necessarily 0, in which case we deﬁne p(x) = 0. Note that this construction ˆx is identical to that used in proving that ML (G) = M(G) for a tree graph G. ˆˆ ˆ II. I(µ, z) > 0. Based on Eq. 11 and α ≥ 0, we have I(µ, z ) > 0. Applying the inductive ˆ µ, z ) = pµ (ˆ ) > 0. Now, deﬁne p(x) so that p(z) = I(µ, z): ˆ assumption to µ, we obtain I( ˆ ˆ ˆˆ z xl , l ≤ n − 2 δ(xn−1 = zn−1 ) δ(xn = zn ) p(x) no constraint 0 no constraint As in Eq. 12 0 0 ∃ l x l = zl 1 ∀ l x l = zl 1 µn−1,n (zn−1 , xn ) 1 1 p(ˆ ) ˆx 0 I(µ, z) Simple algebra shows that p(x) is non-negative and has µ as marginals. We now show that p(z) is minimal. Based on the inductive assumption and Eq. 11, it can easily be shown that I(v(z), z) = 1, I(v(x), z) ≤ 0 for x = z. For any p(x) s.t. µ = x p(x)v(x), from linearity, I(µ, z) = p(z) + x=z p(x)I(v(x), z) ≤ p(z) (since I(v(x), z) ≤ 0 for x = z). Since the p(z) we deﬁne achieves this lower bound, it is clearly minimal. ˆˆ ˆ ˆ III. I(µ, z) ≤ 0 but I(µ, z ) > 0. Applying the inductive assumption to µ, we see that ˆ µ, z ) > 0; Eq. 11 implies α − I(µ, z ) ≥ 0. Deﬁne β = µn−1 (zn−1 ) − pµ (ˆ ), which ˆˆ ˆ ˆˆ z pµ (ˆ ) = I( ˆ ˆ ˆˆ z ˆ is non-negative since µn−1 (zn−1 ) = µn−1 (ˆ n−1 ) and p marginalizes to µ. Deﬁne p(x) as: ˆ z ˆ xl , l ≤ n − 2 δ(xn−1 = zn−1 ) δ(xn = zn ) no constraint 0 no constraint ∃ l x l = zl As in Eq. 12 0 ˆ ˆ z µ (z ,x ) p(ˆ ) n−1,n βn−1 n α−I(µ,ˆ ) ˆx α µ (z ,z ) p(ˆ ) n−1,n βn−1 n ˆx (z ,x ) ˆˆ ˆ µ I(µ, z ) n−1,n αn−1 n 1 0 0 1 1 ∀ l x l = zl p(x) 1 which indeed marginalizes to µ, and p(z) = 0 so that pµ (z) = 0, as required. 8 References [1] F. Barahona. On cuts and matchings in planar graphs. Math. Program., 60(1):53–68, 1993. [2] M. Fromer and C. Yanover. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins: Structure, Function, and Bioinformatics, 75:682–705, 2009. [3] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 21. MIT Press, Cambridge, MA, 2007. [4] E. Kloppmann, G. M. Ullmann, and T. Becker. An extended dead-end elimination algorithm to determine gap-free lists of low energy states. Journal of Comp. Chem., 28:2325–2335, 2007. [5] N. Komodakis and N. Paragios. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, pages 806–820, Heidelberg, Germany, 2008. Springer. [6] E. L. Lawler. A procedure for computing the K best solutions to discrete optimization problems and its application to the shortest path problem. Management Science, 18(7):401–405, 1972. [7] I. Ljubic, R. Weiskircher, U. Pferschy, G. W. Klau, P. Mutzel, and M. Fischetti. An algorithmic framework for the exact solution of the prize-collecting steiner tree problem. Mathematical Programming, 105:427–449, Feb 2006. [8] D. Nilsson. An eﬃcient algorithm for ﬁnding the M most probable conﬁgurations in probabilistic expert systems. Statistics and Computing, 8:159–173, Jun 1998. [9] P. Ravikumar, A. Agarwal, and M. Wainwright. Message-passing for graph-structured linear programs: proximal projections, convergence and rounding schemes. In Proc. of the 25th international conference on Machine learning, pages 800–807, New York, NY, USA, 2008. ACM. [10] E. Santos. On the generation of alternative explanations with implications for belief revision. In Proc. of the 7th Annual Conference on Uncertainty in Artiﬁcial Intelligence, 1991. [11] Y. Shimony. Finding the MAPs for belief networks is NP-hard. 68(2):399–410, 1994. Aritiﬁcal Intelligence, [12] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1393–1400. MIT Press, Cambridge, MA, 2007. [13] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proc. of the 24th Annual Conference on Uncertainty in Artiﬁcial Intelligence, pages 503–510, 2008. [14] B. Taskar, S. Lacoste-Julien, and M. I. Jordan. Structured prediction, dual extragradient and bregman projections. J. Mach. Learn. Res., 7:1627–1653, 2006. [15] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, 2008. [16] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005. [17] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Trans. Pattern Anal. Mach. Intell., 29(7):1165–1179, 2007. [18] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. Journal of Machine Learning Research, 7:1887–1907, 2006. [19] C. Yanover and Y. Weiss. Finding the M most probable conﬁgurations using loopy belief propagation. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [20] J. Yedidia, W. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282– 2312, 2005. 9

3 0.11813272 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random ﬁelds) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random ﬁeld develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random ﬁeld µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is deﬁned as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change signiﬁcantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for speciﬁc classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results conﬁrm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be deﬁnite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to ﬁnd W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after ﬁxing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is deﬁned as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More speciﬁcally, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This ﬁrst result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More speciﬁcally, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the ﬂuctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a speciﬁc low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is deﬁned from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our ﬁrst result shows that Rlr(λ) indeed reconstructs G if θ is sufﬁciently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires signiﬁcant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difﬁcult to learn. We will instead show how to evaluate it on speciﬁc graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufﬁcient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and deﬁne S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Deﬁne Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our ﬁrst auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufﬁcient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufﬁciently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisﬁed since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by deﬁnition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and deﬁne the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external ﬁeld’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any ﬁxed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) ﬁnite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simpliﬁcations arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random ﬁelds”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ ﬂing and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9

4 0.10650568 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

Author: Peter Sollich, Matthew Urry, Camille Coti

Abstract: We investigate how well Gaussian process regression can learn functions deﬁned on graphs, using large regular random graphs as a paradigmatic example. Random-walk based kernels are shown to have some non-trivial properties: within the standard approximation of a locally tree-like graph structure, the kernel does not become constant, i.e. neighbouring function values do not become fully correlated, when the lengthscale σ of the kernel is made large. Instead the kernel attains a non-trivial limiting form, which we calculate. The fully correlated limit is reached only once loops become relevant, and we estimate where the crossover to this regime occurs. Our main subject are learning curves of Bayes error versus training set size. We show that these are qualitatively well predicted by a simple approximation using only the spectrum of a large tree as input, and generically scale with n/V , the number of training examples per vertex. We also explore how this behaviour changes for kernel lengthscales that are large enough for loops to become important. 1 Motivation and Outline Gaussian processes (GPs) have become a standard part of the machine learning toolbox [1]. Learning curves are a convenient way of characterizing their capabilities: they give the generalization error as a function of the number of training examples n, averaged over all datasets of size n under appropriate assumptions about the process generating the data. We focus here on the case of GP regression, where a real-valued output function f (x) is to be learned. The general behaviour of GP learning curves is then relatively well understood for the scenario where the inputs x come from a continuous space, typically Rn [2, 3, 4, 5, 6, 7, 8, 9, 10]. For large n, the learning curves then typically decay as a power law ∝ n−α with an exponent α ≤ 1 that depends on the dimensionality n of the space as well as the smoothness properties of the function f (x) as encoded in the covariance function. But there are many interesting application domains that involve discrete input spaces, where x could be a string, an amino acid sequence (with f (x) some measure of secondary structure or biological function), a research paper (with f (x) related to impact), a web page (with f (x) giving a score used to rank pages), etc. In many such situations, similarity between different inputs – which will govern our prior beliefs about how closely related the corresponding function values are – can be represented by edges in a graph. One would then like to know how well GP regression can work in such problem domains; see also [11] for a related online regression algorithm. We study this 1 problem here theoretically by focussing on the paradigmatic example of random regular graphs, where every node has the same connectivity. Sec. 2 discusses the properties of random-walk inspired kernels [12] on such random graphs. These are analogous to the standard radial basis function kernels exp[−(x − x )2 /(2σ 2 )], but we ﬁnd that they have surprising properties on large graphs. In particular, while loops in large random graphs are long and can be neglected for many purposes, by approximating the graph structure as locally tree-like, here this leads to a non-trivial limiting form of the kernel for σ → ∞ that is not constant. The fully correlated limit, where the kernel is constant, is obtained only because of the presence of loops, and we estimate when the crossover to this regime takes place. In Sec. 3 we move on to the learning curves themselves. A simple approximation based on the graph eigenvalues, using only the known spectrum of a large tree as input, works well qualitatively and predicts the exact asymptotics for large numbers of training examples. When the kernel lengthscale is not too large, below the crossover discussed in Sec. 2 for the covariance kernel, the learning curves depend on the number of examples per vertex. We also explore how this behaviour changes as the kernel lengthscale is made larger. Sec. 4 summarizes the results and discusses some open questions. 2 Kernels on graphs and trees We assume that we are trying to learn a function deﬁned on the vertices of a graph. Vertices are labelled by i = 1 . . . V , instead of the generic input label x we used in the introduction, and the associated function values are denoted fi ∈ R. By taking the prior P (f ) over these functions f = (f1 , . . . , fV ) as a (zero mean) Gaussian process we are saying that P (f ) ∝ exp(− 1 f T C −1 f ). 2 The covariance function or kernel C is then, in our graph setting, just a positive deﬁnite V × V matrix. The graph structure is characterized by a V × V adjacency matrix, with Aij = 1 if nodes i and j are connected by an edge, and 0 otherwise. All links are assumed to be undirected, so that Aij = Aji , V and there are no self-loops (Aii = 0). The degree of each node is then deﬁned as di = j=1 Aij . The covariance kernels we discuss in this paper are the natural generalizations of the squaredexponential kernel in Euclidean space [12]. They can be expressed in terms of the normalized graph Laplacian, deﬁned as L = 1 − D −1/2 AD −1/2 , where D is a diagonal matrix with entries d1 , . . . , dV and 1 is the V × V identity matrix. An advantage of L over the unnormalized Laplacian D − A, which was used in the earlier paper [13], is that the eigenvalues of L (again a V × V matrix) lie in the interval [0,2] (see e.g. [14]). From the graph Laplacian, the covariance kernels we consider here are constructed as follows. The p-step random walk kernel is (for a ≥ 2) C ∝ (1 − a−1 L)p = 1 − a−1 1 + a−1 D −1/2 AD −1/2 p (1) while the diffusion kernel is given by 1 C ∝ exp − 2 σ 2 L ∝ exp 1 2 −1/2 AD −1/2 2σ D (2) We will always normalize these so that (1/V ) i Cii = 1, which corresponds to setting the average (over vertices) prior variance of the function to be learned to unity. To see the connection of the above kernels to random walks, assume we have a walker on the graph who at each time step selects randomly one of the neighbouring vertices and moves to it. The probability for a move from vertex j to i is then Aij /dj . The transition matrix after s steps follows as (AD −1 )s : its ij-element gives the probability of being on vertex i, having started at j. We can now compare this with the p-step kernel by expanding the p-th power in (1): p p ( p )a−s (1−a−1 )p−s (D −1/2 AD −1/2 )s = D −1/2 s C∝ s=0 ( p )a−s (1−a−1 )p−s (AD −1 )s D 1/2 s s=0 (3) Thus C is essentially a random walk transition matrix, averaged over the number of steps s with s ∼ Binomial(p, 1/a) 2 (4) a=2, d=3 K1 1 1 Cl,p 0.9 p=1 p=2 p=3 p=4 p=5 p=10 p=20 p=50 p=100 p=200 p=500 p=infty 0.8 0.6 0.4 d=3 0.8 0.7 0.6 a=2, V=infty a=2, V=500 a=4, V=infty a=4, V=500 0.5 0.4 0.3 0.2 0.2 ln V / ln(d-1) 0.1 0 0 5 10 l 0 15 1 10 p/a 100 1000 Figure 1: (Left) Random walk kernel C ,p plotted vs distance along graph, for increasing number of steps p and a = 2, d = 3. Note the convergence to a limiting shape for large p that is not the naive fully correlated limit C ,p→∞ = 1. (Right) Numerical results for average covariance K1 between neighbouring nodes, averaged over neighbours and over randomly generated regular graphs. This shows that 1/a can be interpreted as the probability of actually taking a step at each of p “attempts”. To obtain the actual C the resulting averaged transition matrix is premultiplied by D −1/2 and postmultiplied by D 1/2 , which ensures that the kernel C is symmetric. For the diffusion kernel, one ﬁnds an analogous result but the number of random walk steps is now distributed as s ∼ Poisson(σ 2 /2). This implies in particular that the diffusion kernel is the limit of the p-step kernel for p, a → ∞ at constant p/a = σ 2 /2. Accordingly, we discuss mainly the p-step kernel in this paper because results for the diffusion kernel can be retrieved as limiting cases. In the limit of a large number of steps s, the random walk on a graph will reach its stationary distribution p∞ ∝ De where e = (1, . . . , 1). (This form of p∞ can be veriﬁed by checking that it remains unchanged after multiplication with the transition matrix AD −1 .) The s-step transition matrix for large s is then p∞ eT = DeeT because we converge from any starting vertex to the stationary distribution. It follows that for large p or σ 2 the covariance kernel becomes C ∝ D 1/2 eeT D 1/2 , i.e. Cij ∝ (di dj )1/2 . This is consistent with the interpretation of σ or (p/a)1/2 as a lengthscale over which the random walk can diffuse along the graph: once this lengthscale becomes large, the covariance kernel Cij is essentially independent of the distance (along the graph) between the vertices i and j, and the function f becomes fully correlated across the graph. (Explicitly f = vD 1/2 e under the prior, with v a single Gaussian random variable.) As we next show, however, the approach to this fully correlated limit as p or σ are increased is non-trivial. We focus in this paper on kernels on random regular graphs. This means we consider adjacency matrices A which are regular in the sense that they give for each vertex the same degree, di = d. A uniform probability distribution is then taken across all A that obey this constraint [15]. What will the above kernels look like on typical samples drawn from this distribution? Such random regular graphs will have long loops, of length of order ln(V ) or larger if V is large. Their local structure is then that of a regular tree of degree d, which suggests that it should be possible to calculate the kernel accurately within a tree approximation. In a regular tree all nodes are equivalent, so the kernel can only depend on the distance between two nodes i and j. Denoting this kernel value C ,p for a p-step random walk kernel, one has then C ,p=0 = δ ,0 and γp+1 C0,p+1 γp+1 C ,p+1 = = 1− 1 ad C 1 a C0,p + −1,p 1 a + 1− C1,p 1 a C (5) ,p + d−1 ad C +1,p for ≥1 (6) where γp is chosen to achieve the desired normalization C0,p = 1 of the prior variance for every p. Fig. 1(left) shows results obtained by iterating this recursion numerically, for a regular graph (in the tree approximation) with degree d = 3, and a = 2. As expected the kernel becomes more longranged initially as p increases, but eventually it is seen to approach a non-trivial limiting form. This can be calculated as C ,p→∞ = [1 + (d − 1)/d](d − 1)− /2 (7) 3 and is also plotted in the ﬁgure, showing good agreement with the numerical iteration. There are (at least) two ways of obtaining the result (7). One is to take the limit σ → ∞ of the integral representation of the diffusion kernel on regular trees given in [16] (which is also quoted in [13] but with a typographical error that effectively removes the factor (d − 1)− /2 ). Another route is to ﬁnd the steady state of the recursion for C ,p . This is easy to do but requires as input the unknown steady state value of γp . To determine this, one can map from C ,p to the total random walk probability S ,p in each “shell” of vertices at distance from the starting vertex, changing variables to S0,p = C0,p and S ,p = d(d − 1) −1 C ,p ( ≥ 1). Omitting the factors γp , this results in a recursion for S ,p that simply describes a biased random walk on = 0, 1, 2, . . ., with a probability of 1 − 1/a of remaining at the current , probability 1/(ad) of moving to the left and probability (d − 1)/(ad) of moving to the right. The point = 0 is a reﬂecting barrier where only moves to the right are allowed, with probability 1/a. The time evolution of this random walk starting from = 0 can now be analysed as in [17]. As expected from the balance of moves to the left and right, S ,p for large p is peaked around the average position of the walk, = p(d − 2)/(ad). For smaller than this S ,p has a tail behaving as ∝ (d − 1) /2 , and converting back to C ,p gives the large- scaling of C ,p→∞ ∝ (d − 1)− /2 ; this in turn ﬁxes the value of γp→∞ and so eventually gives (7). The above analysis shows that for large p the random walk kernel, calculated in the absence of loops, does not approach the expected fully correlated limit; given that all vertices have the same degree, the latter would correspond to C ,p→∞ = 1. This implies, conversely, that the fully correlated limit is reached only because of the presence of loops in the graph. It is then interesting to ask at what point, as p is increased, the tree approximation for the kernel breaks down. To estimate this, we note that a regular tree of depth has V = 1 + d(d − 1) −1 nodes. So a regular graph can be tree-like at most out to ≈ ln(V )/ ln(d − 1). Comparing with the typical number of steps our random walk takes, which is p/a from (4), we then expect loop effects to appear in the covariance kernel when p/a ≈ ln(V )/ ln(d − 1) (8) To check this prediction, we measure the analogue of C1,p on randomly generated [15] regular graphs. Because of the presence of loops, the local kernel values are not all identical, so the appropriate estimate of what would be C1,p on a tree is K1 = Cij / Cii Cjj for neighbouring nodes i and j. Averaging over all pairs of such neighbours, and then over a number of randomly generated graphs we ﬁnd the results in Fig. 1(right). The results for K1 (symbols) accurately track the tree predictions (lines) for small p/a, and start to deviate just around the values of p/a expected from (8), as marked by the arrow. The deviations manifest themselves in larger values of K1 , which eventually – now that p/a is large enough for the kernel to “notice” the loops - approach the fully correlated limit K1 = 1. 3 Learning curves We now turn to the analysis of learning curves for GP regression on random regular graphs. We assume that the target function f ∗ is drawn from a GP prior with a p-step random walk covariance kernel C. Training examples are input-output pairs (iµ , fi∗ + ξµ ) where ξµ is i.i.d. Gaussian noise µ of variance σ 2 ; the distribution of training inputs iµ is taken to be uniform across vertices. Inference from a data set D of n such examples µ = 1, . . . , n takes place using the prior deﬁned by C and a Gaussian likelihood with noise variance σ 2 . We thus assume an inference model that is matched to the data generating process. This is obviously an over-simpliﬁcation but is appropriate for the present ﬁrst exploration of learning curves on random graphs. We emphasize that as n is increased we see more and more function values from the same graph, which is ﬁxed by the problem domain; the graph does not grow. ˆ The generalization error is the squared difference between the estimated function fi and the target fi∗ , averaged across the (uniform) input distribution, the posterior distribution of f ∗ given D, the distribution of datasets D, and ﬁnally – in our non-Euclidean setting – the random graph ensemble. Given the assumption of a matched inference model, this is just the average Bayes error, or the average posterior variance, which can be expressed explicitly as [1] (n) = V −1 Cii − k(i)T Kk−1 (i) i 4 D,graphs (9) where the average is over data sets and over graphs, K is an n × n matrix with elements Kµµ = Ciµ ,iµ + σ 2 δµµ and k(i) is a vector with entries kµ (i) = Ci,iµ . The resulting learning curve depends, in addition to n, on the graph structure as determined by V and d, and the kernel and noise level as speciﬁed by p, a and σ 2 . We ﬁx d = 3 throughout to avoid having too many parameters to vary, although similar results are obtained for larger d. Exact prediction of learning curves by analytical calculation is very difﬁcult due to the complicated way in which the random selection of training inputs enters the matrix K and vector k in (9). However, by ﬁrst expressing these quantities in terms of kernel eigenvalues (see below) and then approximating the average over datasets, one can derive the approximation [3, 6] =g n + σ2 V , g(h) = (λ−1 + h)−1 α (10) α=1 This equation for has to be solved self-consistently because also appears on the r.h.s. In the Euclidean case the resulting predictions approximate the true learning curves quite reliably. The derivation of (10) for inputs on a ﬁxed graph is unchanged from [3], provided the kernel eigenvalues λα appearing in the function g(h) are deﬁned appropriately, by the eigenfunction condition Cij φj = λφi ; the average here is over the input distribution, i.e. . . . = V −1 j . . . From the deﬁnition (1) of the p-step kernel, we see that then λα = κV −1 (1 − λL /a)p in terms of the corα responding eigenvalue of the graph Laplacian L. The constant κ has to be chosen to enforce our normalization convention α λα = Cjj = 1. Fortunately, for large V the spectrum of the Laplacian of a random regular graph can be approximated by that of the corresponding large regular tree, which has spectral density [14] L ρ(λ ) = 4(d−1) − (λL − 1)2 d2 2πdλL (2 − λL ) (11) in the range λL ∈ [λL , λL ], λL = 1 + 2d−1 (d − 1)1/2 , where the term under the square root is ± + − positive. (There are also two isolated eigenvalues λL = 0, 2 but these have weight 1/V each and so can be ignored for large V .) Rewriting (10) as = V −1 α [(V λα )−1 + (n/V )( + σ 2 )−1 ]−1 and then replacing the average over kernel eigenvalues by an integral over the spectral density leads to the following prediction for the learning curve: = dλL ρ(λL )[κ−1 (1 − λL /a)−p + ν/( + σ 2 )]−1 (12) with κ determined from κ dλL ρ(λL )(1 − λL /a)p = 1. A general consequence of the form of this result is that the learning curve depends on n and V only through the ratio ν = n/V , i.e. the number of training examples per vertex. The approximation (12) also predicts that the learning curve will have two regimes, one for small ν where σ 2 and the generalization error will be essentially 2 independent of σ ; and another for large ν where σ 2 so that can be neglected on the r.h.s. and one has a fully explicit expression for . We compare the above prediction in Fig. 2(left) to the results of numerical simulations of the learning curves, averaged over datasets and random regular graphs. The two regimes predicted by the approximation are clearly visible; the approximation works well inside each regime but less well in the crossover between the two. One striking observation is that the approximation seems to predict the asymptotic large-n behaviour exactly; this is distinct to the Euclidean case, where generally only the power-law of the n-dependence but not its prefactor come out accurately. To see why, we exploit that for large n (where σ 2 ) the approximation (9) effectively neglects ﬂuctuations in the training input “density” of a randomly drawn set of training inputs [3, 6]. This is justiﬁed in the graph case for large ν = n/V , because the number of training inputs each vertex receives, Binomial(n, 1/V ), has negligible relative ﬂuctuations away from its mean ν. In the Euclidean case there is no similar result, because all training inputs are different with probability one even for large n. Fig. 2(right) illustrates that for larger a the difference in the crossover region between the true (numerically simulated) learning curves and our approximation becomes larger. This is because the average number of steps p/a of the random walk kernel then decreases: we get closer to the limit of uncorrelated function values (a → ∞, Cij = δij ). In that limit and for low σ 2 and large V the 5 V=500 (filled) & 1000 (empty), d=3, a=2, p=10 V=500, d=3, a=4, p=10 0 0 10 10 ε ε -1 -1 10 10 -2 10 -2 10 2 σ = 0.1 2 σ = 0.1 2 -3 10 σ = 0.01 2 σ = 0.01 -3 10 2 σ = 0.001 2 σ = 0.001 2 -4 10 2 σ = 0.0001 σ = 0.0001 -4 10 2 σ =0 -5 2 σ =0 -5 10 0.1 1 ν=n/V 10 10 0.1 1 ν=n/V 10 Figure 2: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500 (small ﬁlled circles) and V = 1000 (empty circles) vertices. Plotting generalization error versus ν = n/V superimposes the results for both values of V , as expected from the approximation (12). The lines are the quantitative predictions of this approximation. Noise level as shown, kernel parameters a = 2, p = 10. (Right) As on the left but with V = 500 only and for larger a = 4. 2 V=500, d=3, a=2, p=20 0 0 V=500, d=3, a=2, p=200, σ =0.1 10 10 ε ε simulation -1 2 10 1/(1+n/σ ) theory (tree) theory (eigenv.) -1 10 -2 10 2 σ = 0.1 -3 10 -4 10 -2 10 2 σ = 0.01 2 σ = 0.001 2 σ = 0.0001 -3 10 2 σ =0 -5 10 -4 0.1 1 ν=n/V 10 10 1 10 100 n 1000 10000 Figure 3: (Left) Learning curves for GP regression on random regular graphs with degree d = 3 and V = 500, and kernel parameters a = 2, p = 20; noise level σ 2 as shown. Circles: numerical simulations; lines: approximation (12). (Right) As on the left but for much larger p = 200 and for a single random graph, with σ 2 = 0.1. Dotted line: naive estimate = 1/(1 + n/σ 2 ). Dashed line: approximation (10) using the tree spectrum and the large p-limit, see (17). Solid line: (10) with numerically determined graph eigenvalues λL as input. α true learning curve is = exp(−ν), reﬂecting the probability of a training input set not containing a particular vertex, while the approximation can be shown to predict = max{1 − ν, 0}, i.e. a decay of the error to zero at ν = 1. Plotting these two curves (not displayed here) indeed shows the same “shape” of disagreement as in Fig. 2(right), with the approximation underestimating the true generalization error. Increasing p has the effect of making the kernel longer ranged, giving an effect opposite to that of increasing a. In line with this, larger values of p improve the accuracy of the approximation (12): see Fig. 3(left). One may ask about the shape of the learning curves for large number of training examples (per vertex) ν. The roughly straight lines on the right of the log-log plots discussed so far suggest that ∝ 1/ν in this regime. This is correct in the mathematical limit ν → ∞ because the graph kernel has a nonzero minimal eigenvalue λ− = κV −1 (1−λL /a)p : for ν σ 2 /(V λ− ), the square bracket + 6 in (12) can then be approximated by ν/( +σ 2 ) and one gets (because also regime) ≈ σ 2 /ν. σ 2 in the asymptotic However, once p becomes reasonably large, V λ− can be shown – by analysing the scaling of κ, see Appendix – to be extremely (exponentially in p) small; for the parameter values in Fig. 3(left) it is around 4 × 10−30 . The “terminal” asymptotic regime ≈ σ 2 /ν is then essentially unreachable. A more detailed analysis of (12) for large p and large (but not exponentially large) ν, as sketched in the Appendix, yields ∝ (cσ 2 /ν) ln3/2 (ν/(cσ 2 )), c ∝ p−3/2 (13) This shows that there are logarithmic corrections to the naive σ 2 /ν scaling that would apply in the true terminal regime. More intriguing is the scaling of the coefﬁcient c with p, which implies that to reach a speciﬁed (low) generalization error one needs a number of training examples per vertex of order ν ∝ cσ 2 ∝ p−3/2 σ 2 . Even though the covariance kernel C ,p – in the same tree approximation that also went into (12) – approaches a limiting form for large p as discussed in Sec. 2, generalization performance thus continues to improve with increasing p. The explanation for this must presumably be that C ,p converges to the limit (7) only at ﬁxed , while in the tail ∝ p, it continues to change. For ﬁnite graph sizes V we know of course that loops will eventually become important as p increases, around the crossover point estimated in (8). The approximation for the learning curve in (12) should then break down. The most naive estimate beyond this point would be to say that the kernel becomes nearly fully correlated, Cij ∝ (di dj )1/2 which in the regular case simpliﬁes to Cij = 1. With only one function value to learn, and correspondingly only one nonzero kernel eigenvalue λα=1 = 1, one would predict = 1/(1 + n/σ 2 ). Fig. 3(right) shows, however, that this signiﬁcantly underestimates the actual generalization error, even though for this graph λα=1 = 0.994 is very close to unity so that the other eigenvalues sum to no more than 0.006. An almost perfect prediction is obtained, on the other hand, from the approximation (10) with the numerically calculated values of the Laplacian – and hence kernel – eigenvalues. The presence of the small kernel eigenvalues is again seen to cause logarithmic corrections to the naive ∝ 1/n scaling. Using the tree spectrum as an approximation and exploiting the large-p limit, one ﬁnds indeed (see Appendix, Eq. (17)) that ∝ (c σ 2 /n) ln3/2 (n/c σ 2 ) where now n enters rather than ν = n/V , c being a constant dependent only on p and a: informally, the function to be learned only has a ﬁnite (rather than ∝ V ) number of degrees of freedom. The approximation (17) in fact provides a qualitatively accurate description of the data Fig. 3(right), as the dashed line in the ﬁgure shows. We thus have the somewhat unusual situation that the tree spectrum is enough to give a good description of the learning curves even when loops are important, while (see Sec. 2) this is not so as far as the evaluation of the covariance kernel itself is concerned. 4 Summary and Outlook We have studied theoretically the generalization performance of GP regression on graphs, focussing on the paradigmatic case of random regular graphs where every vertex has the same degree d. Our initial concern was with the behaviour of p-step random walk kernels on such graphs. If these are calculated within the usual approximation of a locally tree-like structure, then they converge to a non-trivial limiting form (7) when p – or the corresponding lengthscale σ in the closely related diffusion kernel – becomes large. The limit of full correlation between all function values on the graph is only reached because of the presence of loops, and we have estimated in (8) the values of p around which the crossover to this loop-dominated regime occurs; numerical data for correlations of function values on neighbouring vertices support this result. In the second part of the paper we concentrated on the learning curves themselves. We assumed that inference is performed with the correct parameters describing the data generating process; the generalization error is then just the Bayes error. The approximation (12) gives a good qualitative description of the learning curve using only the known spectrum of a large regular tree as input. It predicts in particular that the key parameter that determines the generalization error is ν = n/V , the number of training examples per vertex. We demonstrated also that the approximation is in fact more useful than in the Euclidean case because it gives exact asymptotics for the limit ν 1. Quantitatively, we found that the learning curves decay as ∝ σ 2 /ν with non-trivial logarithmic correction terms. Slower power laws ∝ ν −α with α < 1, as in the Euclidean case, do not appear. 7 We attribute this to the fact that on a graph there is no analogue of the local roughness of a target function because there is a minimum distance (one step along the graph) between different input points. Finally we looked at the learning curves for larger p, where loops become important. These can still be predicted quite accurately by using the tree eigenvalue spectrum as an approximation, if one keeps track of the zero graph Laplacian eigenvalue which we were able to ignore previously; the approximation shows that the generalization error scales as σ 2 /n with again logarithmic corrections. In future work we plan to extend our analysis to graphs that are not regular, including ones from application domains as well as artiﬁcial ones with power-law tails in the distribution of degrees d, where qualitatively new effects are to be expected. It would also be desirable to improve the predictions for the learning curve in the crossover region ≈ σ 2 , which should be achievable using iterative approaches based on belief propagation that have already been shown to give accurate approximations for graph eigenvalue spectra [18]. These tools could then be further extended to study e.g. the effects of model mismatch in GP regression on random graphs, and how these are mitigated by tuning appropriate hyperparameters. Appendix We sketch here how to derive (13) from (12) for large p. Eq. (12) writes = g(νV /( + σ 2 )) with λL + g(h) = dλL ρ(λL )[κ−1 (1 − λL /a)−p + hV −1 ]−1 (14) λL − and κ determined from the condition g(0) = 1. (This g(h) is the tree spectrum approximation to the g(h) of (10).) Turning ﬁrst to g(0), the factor (1 − λL /a)p decays quickly to zero as λL increases above λL . One can then approximate this factor according to (1 − λL /a)p [(a − λL )/(a − λL )]p ≈ − − − (1 − λL /a)p exp[−(λL − λL )p/(a − λL )]. In the regime near λL one can also approximate the − − − − spectral density (11) by its leading square-root increase, ρ(λL ) = r(λL − λL )1/2 , with r = (d − − 1)1/4 d5/2 /[π(d − 2)2 ]. Switching then to a new integration variable y = (λL − λL )p/(a − λL ) and − − extending the integration limit to ∞ gives ∞ √ 1 = g(0) = κr(1 − λL /a)p [p/(a − λL )]−3/2 dy y e−y (15) − − 0 and this ﬁxes κ. Proceeding similarly for h > 0 gives ∞ g(h) = κr(1−λL /a)p [p/(a−λL )]−3/2 F (hκV −1 (1−λL /a)p ), − − − F (z) = √ dy y (ey +z)−1 0 (16) Dividing by g(0) = 1 shows that simply g(h) = F (hV −1 c−1 )/F (0), where c = 1/[κ(1 − σ2 λL /a)p ] = rF (0)[p/(a − λL )]−3/2 which scales as p−3/2 . In the asymptotic regime − − 2 2 we then have = g(νV /σ ) = F (ν/(cσ ))/F (0) and the desired result (13) follows from the large-z behaviour of F (z) ≈ z −1 ln3/2 (z). One can proceed similarly for the regime where loops become important. Clearly the zero Laplacian eigenvalue with weight 1/V then has to be taken into account. If we assume that the remainder of the Laplacian spectrum can still be approximated by that of a tree [18], we get (V + hκ)−1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (hκV −1 (1 − λL /a)p ) − − − g(h) = (17) V −1 + r(1 − λL /a)p [p/(a − λL )]−3/2 F (0) − − The denominator here is κ−1 and the two terms are proportional respectively to the covariance kernel eigenvalue λ1 , corresponding to λL = 0 and the constant eigenfunction, and to 1−λ1 . Dropping the 1 ﬁrst terms in the numerator and denominator of (17) by taking V → ∞ leads back to the previous analysis as it should. For a situation as in Fig. 3(right), on the other hand, where λ1 is close to unity, we have κ ≈ V and so g(h) ≈ (1 + h)−1 + rV (1 − λL /a)p [p/(a − λL )]−3/2 F (h(1 − λL /a)p ) (18) − − − The second term, coming from the small kernel eigenvalues, is the more slowly decaying because it corresponds to ﬁne detail of the target function that needs many training examples to learn accurately. It will therefore dominate the asymptotic behaviour of the learning curve: = g(n/σ 2 ) ∝ F (n/(c σ 2 )) with c = (1 − λL /a)−p independent of V . The large-n tail of the learning curve in − Fig. 3(right) is consistent with this form. 8 References [1] C E Rasmussen and C K I Williams. Gaussian processes for regression. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520, Cambridge, MA, 1996. MIT Press. [2] M Opper. Regression with Gaussian processes: Average case performance. In I K Kwok-Yee, M Wong, I King, and Dit-Yun Yeung, editors, Theoretical Aspects of Neural Computation: A Multidisciplinary Perspective, pages 17–23. Springer, 1997. [3] P Sollich. Learning curves for Gaussian processes. In M S Kearns, S A Solla, and D A Cohn, editors, Advances in Neural Information Processing Systems 11, pages 344–350, Cambridge, MA, 1999. MIT Press. [4] M Opper and F Vivarelli. General bounds on Bayes errors for regression with Gaussian processes. In M Kearns, S A Solla, and D Cohn, editors, Advances in Neural Information Processing Systems 11, pages 302–308, Cambridge, MA, 1999. MIT Press. [5] C K I Williams and F Vivarelli. Upper and lower bounds on the learning curve for Gaussian processes. Mach. Learn., 40(1):77–102, 2000. [6] D Malzahn and M Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In T K Leen, T G Dietterich, and V Tresp, editors, Advances in Neural Information Processing Systems 13, pages 273–279, Cambridge, MA, 2001. MIT Press. [7] D Malzahn and M Opper. A variational approach to learning curves. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 463–469, Cambridge, MA, 2002. MIT Press. [8] P Sollich and A Halees. Learning curves for Gaussian process regression: approximations and bounds. Neural Comput., 14(6):1393–1428, 2002. [9] P Sollich. Gaussian process regression with mismatched models. In T G Dietterich, S Becker, and Z Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 519–526, Cambridge, MA, 2002. MIT Press. [10] P Sollich. Can Gaussian process regression be made robust against model mismatch? In Deterministic and Statistical Methods in Machine Learning, volume 3635 of Lecture Notes in Artiﬁcial Intelligence, pages 199–210. 2005. [11] M Herbster, M Pontil, and L Wainer. Online learning over graphs. In ICML ’05: Proceedings of the 22nd international conference on Machine learning, pages 305–312, New York, NY, USA, 2005. ACM. [12] A J Smola and R Kondor. Kernels and regularization on graphs. In M Warmuth and B Sch¨ lkopf, o editors, Proc. Conference on Learning Theory (COLT), Lect. Notes Comp. Sci., pages 144–158. Springer, Heidelberg, 2003. [13] R I Kondor and J D Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 315–322, San Francisco, CA, USA, 2002. Morgan Kaufmann. [14] F R K Chung. Spectral graph theory. Number 92 in Regional Conference Series in Mathematics. Americal Mathematical Society, 1997. [15] A Steger and N C Wormald. Generating random regular graphs quickly. Combinator. Probab. Comput., 8(4):377–396, 1999. [16] F Chung and S-T Yau. Coverings, heat kernels and spanning trees. The Electronic Journal of Combinatorics, 6(1):R12, 1999. [17] C Monthus and C Texier. Random walk on the Bethe lattice and hyperbolic brownian motion. J. Phys. A, 29(10):2399–2409, 1996. [18] T Rogers, I Perez Castillo, R Kuehn, and K Takeda. Cavity approach to the spectral density of sparse symmetric random matrices. Phys. Rev. E, 78(3):031116, 2008. 9

5 0.10595332 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random ﬁelds (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.

6 0.097133353 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

7 0.093889974 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

8 0.093603827 129 nips-2009-Learning a Small Mixture of Trees

9 0.090778209 122 nips-2009-Label Selection on Graphs

10 0.086400092 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

11 0.085129589 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

12 0.084922664 187 nips-2009-Particle-based Variational Inference for Continuous Systems

13 0.081795014 95 nips-2009-Fast subtree kernels on graphs

14 0.078638576 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

15 0.078135915 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

16 0.077880695 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

17 0.077457592 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

18 0.073731162 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

19 0.072826579 87 nips-2009-Exponential Family Graph Matching and Ranking

20 0.070782855 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.231), (1, 0.075), (2, -0.017), (3, -0.021), (4, -0.052), (5, -0.021), (6, 0.015), (7, 0.04), (8, -0.039), (9, -0.154), (10, -0.145), (11, 0.051), (12, 0.102), (13, 0.051), (14, 0.018), (15, 0.057), (16, -0.018), (17, -0.102), (18, 0.061), (19, -0.072), (20, -0.025), (21, 0.011), (22, 0.116), (23, 0.058), (24, 0.032), (25, -0.024), (26, 0.145), (27, -0.048), (28, -0.06), (29, -0.01), (30, 0.017), (31, -0.013), (32, 0.037), (33, -0.015), (34, -0.127), (35, -0.033), (36, 0.088), (37, -0.057), (38, -0.158), (39, 0.007), (40, 0.043), (41, -0.009), (42, -0.1), (43, -0.013), (44, -0.001), (45, 0.008), (46, -0.025), (47, -0.033), (48, -0.022), (49, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95395726 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

2 0.71284854 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

3 0.63422042 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. The formula has a number of theoretical implications on LBP. It is applied to give a sufﬁcient condition that the Hessian of the Bethe free energy is positive deﬁnite, which shows non-convexity for graphs with multiple cycles. The formula clariﬁes the relation between the local stability of a ﬁxed point of LBP and local minima of the Bethe free energy. We also propose a new approach to the uniqueness of LBP ﬁxed point, and show various conditions of uniqueness. 1

4 0.62701184 31 nips-2009-An LP View of the M-best MAP problem

Author: Menachem Fromer, Amir Globerson

5 0.5984627 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our ﬁrst contribution is to show how to efﬁciently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efﬁciently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.

6 0.59313363 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

7 0.56846142 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

8 0.55176759 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

9 0.52184451 187 nips-2009-Particle-based Variational Inference for Continuous Systems

10 0.51670843 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares

11 0.5015232 180 nips-2009-On the Convergence of the Concave-Convex Procedure

12 0.49934918 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

13 0.49567354 149 nips-2009-Maximin affinity learning of image segmentation

14 0.48792052 143 nips-2009-Localizing Bugs in Program Executions with Graphical Models

15 0.48111978 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

16 0.47385383 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

17 0.4714568 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

18 0.46331131 87 nips-2009-Exponential Family Graph Matching and Ranking

19 0.45922881 188 nips-2009-Perceptual Multistability as Markov Chain Monte Carlo Inference

20 0.44966 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.012), (24, 0.07), (25, 0.065), (35, 0.06), (36, 0.086), (39, 0.049), (58, 0.077), (71, 0.06), (81, 0.331), (86, 0.095), (91, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98480231 200 nips-2009-Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)

Author: Tao Hu, Anthony Leonardo, Dmitri B. Chklovskii

Abstract: One of the central problems in neuroscience is reconstructing synaptic connectivity in neural circuits. Synapses onto a neuron can be probed by sequentially stimulating potentially pre-synaptic neurons while monitoring the membrane voltage of the post-synaptic neuron. Reconstructing a large neural circuit using such a “brute force” approach is rather time-consuming and inefficient because the connectivity in neural circuits is sparse. Instead, we propose to measure a post-synaptic neuron’s voltage while stimulating sequentially random subsets of multiple potentially pre-synaptic neurons. To reconstruct these synaptic connections from the recorded voltage we apply a decoding algorithm recently developed for compressive sensing. Compared to the brute force approach, our method promises significant time savings that grow with the size of the circuit. We use computer simulations to find optimal stimulation parameters and explore the feasibility of our reconstruction method under realistic experimental conditions including noise and non-linear synaptic integration. Multineuronal stimulation allows reconstructing synaptic connectivity just from the spiking activity of post-synaptic neurons, even when sub-threshold voltage is unavailable. By using calcium indicators, voltage-sensitive dyes, or multi-electrode arrays one could monitor activity of multiple postsynaptic neurons simultaneously, thus mapping their synaptic inputs in parallel, potentially reconstructing a complete neural circuit. 1 In tro d uc tio n Understanding information processing in neural circuits requires systematic characterization of synaptic connectivity [1, 2]. The most direct way to measure synapses between a pair of neurons is to stimulate potentially pre-synaptic neuron while recording intra-cellularly from the potentially post-synaptic neuron [3-8]. This method can be scaled to reconstruct multiple synaptic connections onto one neuron by combining intracellular recordings from the postsynaptic neuron with photo-activation of pre-synaptic neurons using glutamate uncaging [913] or channelrhodopsin [14, 15], or with multi-electrode arrays [16, 17]. Neurons are sequentially stimulated to fire action potentials by scanning a laser beam (or electrode voltage) over a brain slice, while synaptic weights are measured by recording post-synaptic voltage. Although sequential excitation of single potentially pre-synaptic neurons could reveal connectivity, such a “brute force” approach is inefficient because the connectivity among neurons is sparse. Even among nearby neurons in the cerebral cortex, the probability of connection is only about ten percent [3-8]. Connection probability decays rapidly with the 1 distance between neurons and falls below one percent on the scale of a cortical column [3, 8]. Thus, most single-neuron stimulation trials would result in zero response making the brute force approach slow, especially for larger circuits. Another drawback of the brute force approach is that single-neuron stimulation cannot be combined efficiently with methods allowing parallel recording of neural activity, such as calcium imaging [18-22], voltage-sensitive dyes [23-25] or multi-electrode arrays [17, 26]. As these techniques do not reliably measure sub-threshold potential but report only spiking activity, they would reveal only the strongest connections that can drive a neuron to fire [2730]. Therefore, such combination would reveal only a small fraction of the circuit. We propose to circumvent the above limitations of the brute force approach by stimulating multiple potentially pre-synaptic neurons simultaneously and reconstructing individual connections by using a recently developed method called compressive sensing (CS) [31-35]. In each trial, we stimulate F neurons randomly chosen out of N potentially pre-synaptic neurons and measure post-synaptic activity. Although each measurement yields only a combined response to stimulated neurons, if synaptic inputs sum linearly in a post-synaptic neuron, one can reconstruct the weights of individual connections by using an optimization algorithm. Moreover, if the synaptic connections are sparse, i.e. only K << N potentially pre-synaptic neurons make synaptic connections onto a post-synaptic neuron, the required number of trials M ~ K log(N/K), which is much less than N [31-35]. The proposed method can be used even if only spiking activity is available. Because multiple neurons are driven to fire simultaneously, if several of them synapse on the post-synaptic neuron, they can induce one or more spikes in that neuron. As quantized spike counts carry less information than analog sub-threshold voltage recordings, reconstruction requires a larger number of trials. Yet, the method can be used to reconstruct a complete feedforward circuit from spike recordings. Reconstructing neural circuit with multi-neuronal excitation may be compared with mapping retinal ganglion cell receptive fields. Typically, photoreceptors are stimulated by white-noise checkerboard stimulus and the receptive field is obtained by Reverse Correlation (RC) in case of sub-threshold measurements or Spike-Triggered Average (STA) of the stimulus [36, 37]. Although CS may use the same stimulation protocol, for a limited number of trials, the reconstruction quality is superior to RC or STA. 2 Ma pp i ng sy na pti c inp ut s o nto o n e ne uro n We start by formalizing the problem of mapping synaptic connections from a population of N potentially pre-synaptic neurons onto a single neuron, as exemplified by granule cells synapsing onto a Purkinje cell (Figure 1a). Our experimental protocol can be illustrated using linear algebra formalism, Figure 1b. We represent synaptic weights as components of a column vector x, where zeros represent non-existing connections. Each row in the stimulation matrix A represents a trial, ones indicating neurons driven to spike once and zeros indicating non-spiking neurons. The number of rows in the stimulation matrix A is equal to the number of trials M. The column vector y represents M measurements of membrane voltage obtained by an intra-cellular recording from the post-synaptic neuron: y = Ax. (1) In order to recover individual synaptic weights, Eq. (1) must be solved for x. RC (or STA) solution to this problem is x = (ATA)-1AT y, which minimizes (y-Ax)2 if M>N. In the case M << N for a sparse circuit. In this section we search computationally for the minimum number of trials required for exact reconstruction as a function of the number of non-zero synaptic weights K out of N potentially pre-synaptic neurons. First, note that the number of trials depends on the number of stimulated neurons F. If F = 1 we revert to the brute force approach and the number of measurements is N, while for F = N, the measurements are redundant and no finite number suffices. As the minimum number of measurements is expected to scale as K logN, there must be an optimal F which makes each measurement most informative about x. To determine the optimal number of stimulated neurons F for given K and N, we search for the minimum number of trials M, which allows a perfect reconstruction of the synaptic connectivity x. For each F, we generate 50 synaptic weight vectors and attempt reconstruction from sequentially increasing numbers of trials. The value of M, at which all 50 recoveries are successful (up to computer round-off error), estimates the number of trial needed for reconstruction with probability higher than 98%. By repeating this procedure 50 times for each F, we estimate the mean and standard deviation of M. We find that, for given N and K, the minimum number of trials, M, as a function of the number of stimulated neurons, F, has a shallow minimum. As K decreases, the minimum shifts towards larger F because more neurons should be stimulated simultaneously for sparser x. For the explored range of simulation parameters, the minimum is located close to 0.1N. Next, we set F = 0.1N and explore how the minimum number of measurements required for exact reconstruction depends on K and N. Results of the simulations following the recipe described above are shown in Figure 3a. As expected, when x is sparse, M grows approximately linearly with K (Figure 3b), and logarithmically with N (Figure 3c). N = 1000 180 25 160 20 140 120 15 100 10 80 5 150 250 400 650 1000 Number of potential connections (N) K = 30 220 200 (a) 200 220 Number of measurements (M) 30 Number of measurements (M) Number of actual connections (K) Number of necessary measurements (M) (b) 180 160 140 120 100 80 5 10 15 20 25 30 Number of actual connections (K) 210 (c) 200 190 180 170 160 150 140 130 120 2 10 10 3 Number of potential connections (N) Figure 3: a) Minimum number of measurements M required for reconstruction as a function of the number of actual synapses, K, and the number of potential synapses, N. b) For given N, we find M ~ K. c) For given K, we find M ~ logN (note semi-logarithmic scale in c). 4 R o b ust nes s o f re con st r uc t io n s t o noi se a n d v io la tio n o f si m pli fy in g a ss umpt io n s To make our simulation more realistic we now take into account three possible sources of noise: 1) In reality, post-synaptic voltage on a given synapse varies from trial to trial [4, 5, 46-52], an effect we call synaptic noise. Such noise detrimentally affects reconstructions because each row of A is multiplied by a different instantiation of vector x. 2) Stimulation of neurons may be imprecise exciting a slightly different subset of neurons than intended and/or firing intended neurons multiple times. We call this effect stimulation noise. Such noise detrimentally affects reconstructions because, in its presence, the actual measurement matrix A is different from the one used for recovery. 3) A synapse may fail to release neurotransmitter with some probability. Naturally, in the presence of noise, reconstructions cannot be exact. We quantify the 4 reconstruction x − xr l2 = error ∑ N i =1 by the normalized x − xr l2–error l2 / xl , where 2 ( xi − xri ) 2 . We plot normalized reconstruction error in brute force approach (M = N = 500 trials) as a function of noise, as well as CS and RC reconstruction errors (M = 200, 600 trials), Figure 4. 2 0.9 Normalized reconstruction error ||x-x|| /||x|| 1 r 2 For each noise source, the reconstruction error of the brute force approach can be achieved with 60% fewer trials by CS method for the above parameters (Figure 4). For the same number of trials, RC method performs worse. Naturally, the reconstruction error decreases with the number of trials. The reconstruction error is most sensitive to stimulation noise and least sensitive to synaptic noise. 1 (a) 1 (b) 0.9 0.8 (c) 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.4 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0 0 0.7 RC: M=200 RC: M=600 Brute force method: M=500 CS: M=200 CS: M=600 0.6 0.5 0 0.05 0.1 0.15 Synaptic noise level 0.2 0.25 0 0.05 0.1 0.15 Stimulation noise level 0.2 0.25 0 0 0.05 0.1 0.15 0.2 0.25 Synaptic failure probability Figure 4: Impact of noise on the reconstruction quality for N = 500, K = 30, F = 50. a) Recovery error due to trial-to-trial variation in synaptic weight. The response y is calculated using the synaptic connectivity x perturbed by an additive Gaussian noise. The noise level is given by the coefficient of variation of synaptic weight. b) Recovery error due to stimulation noise. The matrix A used for recovery is obtained from the binary matrix used to calculate the measurement vector y by shifting, in each row, a fraction of ones specified by the noise level to random positions. c) Recovery error due to synaptic failures. The detrimental effect of the stimulation noise on the reconstruction can be eliminated by monitoring spiking activity of potentially pre-synaptic neurons. By using calcium imaging [18-22], voltage-sensitive dyes [23] or multi-electrode arrays [17, 26] one could record the actual stimulation matrix. Because most random matrices satisfy the reconstruction requirements [31, 34, 35], the actual stimulation matrix can be used for a successful recovery instead of the intended one. If neuronal activity can be monitored reliably, experiments can be done in a different mode altogether. Instead of stimulating designated neurons with high fidelity by using highly localized and intense light, one could stimulate all neurons with low probability. Random firing events can be detected and used in the recovery process. The light intensity can be tuned to stimulate the optimal number of neurons per trial. Next, we explore the sensitivity of the proposed reconstruction method to the violation of simplifying assumptions. First, whereas our simulation assumes that the actual number of connections, K, is known, in reality, connectivity sparseness is known a priori only approximately. Will this affect reconstruction results? In principle, CS does not require prior knowledge of K for reconstruction [31, 34, 35]. For the CoSaMP algorithm, however, it is important to provide value K larger than the actual value (Figure 5a). Then, the algorithm will find all the actual synaptic weights plus some extra non-zero weights, negligibly small when compared to actual ones. Thus, one can provide the algorithm with the value of K safely larger than the actual one and then threshold the reconstruction result according to the synaptic noise level. Second, whereas we assumed a linear summation of inputs [53], synaptic integration may be 2 non-linear [54]. We model non-linearity by setting y = yl + α yl , where yl represents linearly summed synaptic inputs. Results of simulations (Figure 5b) show that although nonlinearity can significantly degrade CS reconstruction quality, it still performs better than RC. 5 (b) 0.45 0.4 Actual K = 30 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 10 20 30 40 50 60 Normalized reconstruction error ||x-x||2/||x|| r 2 Normalized reconstrcution error ||x-x||2/||x|| r 2 (a) 0.9 0.8 0.7 CS RC 0.6 0.5 0.4 0.3 0.2 0.1 0 -0.15-0.12-0.09-0.06-0.03 0 0.03 0.06 0.09 0.12 0.15 Relative strength of the non-linear term α × mean(y ) K fed to CoSaMP l Figure 5: Sensitivity of reconstruction error to the violation of simplifying assumptions for N = 500, K = 30, M = 200, F = 50. a) The quality of the reconstruction is not affected if the CoSaMP algorithm is fed with the value of K larger than actual. b) Reconstruction error computed in 100 realizations for each value of the quadratic term relative to the linear term. 5 Ma pp i ng sy na pti c inp ut s o nto a n e uro na l po pu la tio n Until now, we considered reconstruction of synaptic inputs onto one neuron using subthreshold measurements of its membrane potential. In this section, we apply CS to reconstructing synaptic connections onto a population of potentially post-synaptic neurons. Because in CS the choice of stimulated neurons is non-adaptive, by recording from all potentially post-synaptic neurons in response to one sequence of trials one can reconstruct a complete feedforward network (Figure 6). (a) x p(y=1) A Normalized reconstruction error ||x/||x||2−xr/||xr||2||2 y (b) (c) 100 (d) 500 700 900 1100 Number of spikes 1 STA CS 0.8 0.6 0.4 0.2 0 1000 0 300 3000 5000 7000 9000 Number of trials (M) Ax Figure 6: Mapping of a complete feedforward network. a) Each post-synaptic neuron (red) receives synapses from a sparse subset of potentially pre-synaptic neurons (blue). b) Linear algebra representation of the experimental protocol. c) Probability of firing as a function of synaptic current. d) Comparison of CS and STA reconstruction error using spike trains for N = 500, K = 30 and F = 50. Although attractive, such parallelization raises several issues. First, patching a large number of neurons is unrealistic and, therefore, monitoring membrane potential requires using different methods, such as calcium imaging [18-22], voltage sensitive dyes [23-25] or multielectrode arrays [17, 26]. As these methods can report reliably only spiking activity, the measurement is not analog but discrete. Depending on the strength of summed synaptic inputs compared to the firing threshold, the postsynaptic neuron may be silent, fire once or multiple times. As a result, the measured response y is quantized by the integer number of spikes. Such quantized measurements are less informative than analog measurements of the sub-threshold membrane potential. In the extreme case of only two quantization levels, spike or no spike, each measurement contains only 1 bit of information. Therefore, to achieve reasonable reconstruction quality using quantized measurements, a larger number of trials M>>N is required. We simulate circuit reconstruction from spike recordings in silico as follows. First, we draw synaptic weights from an experimentally motivated distribution. Second, we generate a 6 random stimulation matrix and calculate the product Ax. Third, we linear half-wave rectify this product and use the result as the instantaneous firing rate for the Poisson spike generator (Figure 6c). We used a rectifying threshold that results in 10% of spiking trials as typically observed in experiments. Fourth, we reconstruct synaptic weights using STA and CS and compare the results with the generated weights. We calculated mean error over 100 realizations of the simulation protocol (Figure 6d). Due to the non-linear spike generating procedure, x can be recovered only up to a scaling factor. We propose to calibrate x with a few brute-force measurements of synaptic weights. Thus, in calculating the reconstruction error using l2 norm, we normalize both the generated and recovered synaptic weights. Such definition is equivalent to the angular error, which is often used to evaluate the performance of STA in mapping receptive field [37, 55]. Why is CS superior to STA for a given number of trials (Figure 6d)? Note that spikeless trials, which typically constitute a majority, also carry information about connectivity. While STA discards these trials, CS takes them into account. In particular, CoSaMP starts with the STA solution as zeroth iteration and improves on it by using the results of all trials and the sparseness prior. 6 D i s c uss ion We have demonstrated that sparse feedforward networks can be reconstructed by stimulating multiple potentially pre-synaptic neurons simultaneously and monitoring either subthreshold or spiking response of potentially post-synaptic neurons. When sub-threshold voltage is recorded, significantly fewer measurements are required than in the brute force approach. Although our method is sensitive to noise (with stimulation noise worse than synapse noise), it is no less robust than the brute force approach or RC. The proposed reconstruction method can also recover inputs onto a neuron from spike counts, albeit with more trials than from sub-threshold potential measurements. This is particularly useful when intra-cellular recordings are not feasible and only spiking can be detected reliably, for example, when mapping synaptic inputs onto multiple neurons in parallel. For a given number of trials, our method yields smaller error than STA. The proposed reconstruction method assumes linear summation of synaptic inputs (both excitatory and inhibitory) and is sensitive to non-linearity of synaptic integration. Therefore, it is most useful for studying connections onto neurons, in which synaptic integration is close to linear. On the other hand, multi-neuron stimulation is closer than single-neuron stimulation to the intrinsic activity in the live brain and can be used to study synaptic integration under realistic conditions. In contrast to circuit reconstruction using intrinsic neuronal activity [56, 57], our method relies on extrinsic stimulation of neurons. Can our method use intrinsic neuronal activity instead? We see two major drawbacks of such approach. First, activity of non-monitored presynaptic neurons may significantly distort reconstruction results. Thus, successful reconstruction would require monitoring all active pre-synaptic neurons, which is rather challenging. Second, reliable reconstruction is possible only when the activity of presynaptic neurons is uncorrelated. Yet, their activity may be correlated, for example, due to common input. We thank Ashok Veeraraghavan for introducing us to CS, Anthony Leonardo for making a retina dataset available for the analysis, Lou Scheffer and Hong Young Noh for commenting on the manuscript and anonymous reviewers for helpful suggestions. References [1] Luo, L., Callaway, E.M. & Svoboda, K. (2008) Genetic dissection of neural circuits. Neuron 57(5):634-660. [2] Helmstaedter, M., Briggman, K.L. & Denk, W. (2008) 3D structural imaging of the brain with photons and electrons. Current opinion in neurobiology 18(6):633-641. [3] Holmgren, C., Harkany, T., Svennenfors, B. & Zilberter, Y. (2003) Pyramidal cell communication within local networks in layer 2/3 of rat neocortex. Journal of Physiology 551:139-153. [4] Markram, H. (1997) A network of tufted layer 5 pyramidal neurons. Cerebral Cortex 7(6):523-533. 7 [5] Markram, H., Lubke, J., Frotscher, M., Roth, A. & Sakmann, B. (1997) Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. Journal of Physiology 500(2):409-440. [6] Thomson, A.M. & Bannister, A.P. (2003) Interlaminar connections in the neocortex. Cerebral Cortex 13(1):5-14. [7] Thomson, A.M., West, D.C., Wang, Y. & Bannister, A.P. (2002) Synaptic connections and small circuits involving excitatory and inhibitory neurons in layers 2-5 of adult rat and cat neocortex: triple intracellular recordings and biocytin labelling in vitro. Cerebral Cortex 12(9):936-953. [8] Song, S., Sjostrom, P.J., Reigl, M., Nelson, S. & Chklovskii, D.B. (2005) Highly nonrandom features of synaptic connectivity in local cortical circuits. Plos Biology 3(3):e68. [9] Callaway, E.M. & Katz, L.C. (1993) Photostimulation using caged glutamate reveals functional circuitry in living brain slices. Proceedings of the National Academy of Sciences of the United States of America 90(16):7661-7665. [10] Dantzker, J.L. & Callaway, E.M. (2000) Laminar sources of synaptic input to cortical inhibitory interneurons and pyramidal neurons. Nature Neuroscience 3(7):701-707. [11] Shepherd, G.M. & Svoboda, K. (2005) Laminar and columnar organization of ascending excitatory projections to layer 2/3 pyramidal neurons in rat barrel cortex. Journal of Neuroscience 25(24):5670-5679. [12] Nikolenko, V., Poskanzer, K.E. & Yuste, R. (2007) Two-photon photostimulation and imaging of neural circuits. Nature Methods 4(11):943-950. [13] Shoham, S., O'connor, D.H., Sarkisov, D.V. & Wang, S.S. (2005) Rapid neurotransmitter uncaging in spatially defined patterns. Nature Methods 2(11):837-843. [14] Gradinaru, V., Thompson, K.R., Zhang, F., Mogri, M., Kay, K., Schneider, M.B. & Deisseroth, K. (2007) Targeting and readout strategies for fast optical neural control in vitro and in vivo. Journal of Neuroscience 27(52):14231-14238. [15] Petreanu, L., Huber, D., Sobczyk, A. & Svoboda, K. (2007) Channelrhodopsin-2-assisted circuit mapping of long-range callosal projections. Nature Neuroscience 10(5):663-668. [16] Na, L., Watson, B.O., Maclean, J.N., Yuste, R. & Shepard, K.L. (2008) A 256×256 CMOS Microelectrode Array for Extracellular Neural Stimulation of Acute Brain Slices. Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International. [17] Fujisawa, S., Amarasingham, A., Harrison, M.T. & Buzsaki, G. (2008) Behavior-dependent shortterm assembly dynamics in the medial prefrontal cortex. Nature Neuroscience 11(7):823-833. [18] Ikegaya, Y., Aaron, G., Cossart, R., Aronov, D., Lampl, I., Ferster, D. & Yuste, R. (2004) Synfire chains and cortical songs: temporal modules of cortical activity. Science 304(5670):559-564. [19] Ohki, K., Chung, S., Ch'ng, Y.H., Kara, P. & Reid, R.C. (2005) Functional imaging with cellular resolution reveals precise micro-architecture in visual cortex. Nature 433(7026):597-603. [20] Stosiek, C., Garaschuk, O., Holthoff, K. & Konnerth, A. (2003) In vivo two-photon calcium imaging of neuronal networks. Proceedings of the National Academy of Sciences of the United States of America 100(12):7319-7324. [21] Svoboda, K., Denk, W., Kleinfeld, D. & Tank, D.W. (1997) In vivo dendritic calcium dynamics in neocortical pyramidal neurons. Nature 385(6612):161-165. [22] Sasaki, T., Minamisawa, G., Takahashi, N., Matsuki, N. & Ikegaya, Y. (2009) Reverse optical trawling for synaptic connections in situ. Journal of Neurophysiology 102(1):636-643. [23] Zecevic, D., Djurisic, M., Cohen, L.B., Antic, S., Wachowiak, M., Falk, C.X. & Zochowski, M.R. (2003) Imaging nervous system activity with voltage-sensitive dyes. Current Protocols in Neuroscience Chapter 6:Unit 6.17. [24] Cacciatore, T.W., Brodfuehrer, P.D., Gonzalez, J.E., Jiang, T., Adams, S.R., Tsien, R.Y., Kristan, W.B., Jr. & Kleinfeld, D. (1999) Identification of neural circuits by imaging coherent electrical activity with FRET-based dyes. Neuron 23(3):449-459. [25] Taylor, A.L., Cottrell, G.W., Kleinfeld, D. & Kristan, W.B., Jr. (2003) Imaging reveals synaptic targets of a swim-terminating neuron in the leech CNS. Journal of Neuroscience 23(36):11402-11410. [26] Hutzler, M., Lambacher, A., Eversmann, B., Jenkner, M., Thewes, R. & Fromherz, P. (2006) High-resolution multitransistor array recording of electrical field potentials in cultured brain slices. Journal of Neurophysiology 96(3):1638-1645. [27] Egger, V., Feldmeyer, D. & Sakmann, B. (1999) Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neuroscience 2(12):1098-1105. [28] Feldmeyer, D., Egger, V., Lubke, J. & Sakmann, B. (1999) Reliable synaptic connections between pairs of excitatory layer 4 neurones within a single 'barrel' of developing rat somatosensory cortex. Journal of Physiology 521:169-190. [29] Peterlin, Z.A., Kozloski, J., Mao, B.Q., Tsiola, A. & Yuste, R. (2000) Optical probing of neuronal circuits with calcium indicators. Proceedings of the National Academy of Sciences of the United States of America 97(7):3619-3624. 8 [30] Thomson, A.M., Deuchars, J. & West, D.C. (1993) Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex display paired pulse and frequency-dependent depression, mediated presynaptically and self-facilitation, mediated postsynaptically. Journal of Neurophysiology 70(6):2354-2369. [31] Baraniuk, R.G. (2007) Compressive sensing. Ieee Signal Processing Magazine 24(4):118-120. [32] Candes, E.J. (2008) Compressed Sensing. Twenty-Second Annual Conference on Neural Information Processing Systems, Tutorials. [33] Candes, E.J., Romberg, J.K. & Tao, T. (2006) Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59(8):1207-1223. [34] Candes, E.J. & Tao, T. (2006) Near-optimal signal recovery from random projections: Universal encoding strategies? Ieee Transactions on Information Theory 52(12):5406-5425. [35] Donoho, D.L. (2006) Compressed sensing. Ieee Transactions on Information Theory 52(4):12891306. [36] Ringach, D. & Shapley, R. (2004) Reverse Correlation in Neurophysiology. Cognitive Science 28:147-166. [37] Schwartz, O., Pillow, J.W., Rust, N.C. & Simoncelli, E.P. (2006) Spike-triggered neural characterization. Journal of Vision 6(4):484-507. [38] Candes, E.J. & Tao, T. (2005) Decoding by linear programming. Ieee Transactions on Information Theory 51(12):4203-4215. [39] Needell, D. & Vershynin, R. (2009) Uniform Uncertainty Principle and Signal Recovery via Regularized Orthogonal Matching Pursuit. Foundations of Computational Mathematics 9(3):317-334. [40] Tropp, J.A. & Gilbert, A.C. (2007) Signal recovery from random measurements via orthogonal matching pursuit. Ieee Transactions on Information Theory 53(12):4655-4666. [41] Dai, W. & Milenkovic, O. (2009) Subspace Pursuit for Compressive Sensing Signal Reconstruction. Ieee Transactions on Information Theory 55(5):2230-2249. [42] Needell, D. & Tropp, J.A. (2009) CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26(3):301-321. [43] Varshney, L.R., Sjostrom, P.J. & Chklovskii, D.B. (2006) Optimal information storage in noisy synapses under resource constraints. Neuron 52(3):409-423. [44] Brunel, N., Hakim, V., Isope, P., Nadal, J.P. & Barbour, B. (2004) Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron 43(5):745-757. [45] Napoletani, D. & Sauer, T.D. (2008) Reconstructing the topology of sparsely connected dynamical networks. Physical Review E 77(2):026103. [46] Allen, C. & Stevens, C.F. (1994) An evaluation of causes for unreliability of synaptic transmission. Proceedings of the National Academy of Sciences of the United States of America 91(22):10380-10383. [47] Hessler, N.A., Shirke, A.M. & Malinow, R. (1993) The probability of transmitter release at a mammalian central synapse. Nature 366(6455):569-572. [48] Isope, P. & Barbour, B. (2002) Properties of unitary granule cell-->Purkinje cell synapses in adult rat cerebellar slices. Journal of Neuroscience 22(22):9668-9678. [49] Mason, A., Nicoll, A. & Stratford, K. (1991) Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. Journal of Neuroscience 11(1):72-84. [50] Raastad, M., Storm, J.F. & Andersen, P. (1992) Putative Single Quantum and Single Fibre Excitatory Postsynaptic Currents Show Similar Amplitude Range and Variability in Rat Hippocampal Slices. European Journal of Neuroscience 4(1):113-117. [51] Rosenmund, C., Clements, J.D. & Westbrook, G.L. (1993) Nonuniform probability of glutamate release at a hippocampal synapse. Science 262(5134):754-757. [52] Sayer, R.J., Friedlander, M.J. & Redman, S.J. (1990) The time course and amplitude of EPSPs evoked at synapses between pairs of CA3/CA1 neurons in the hippocampal slice. Journal of Neuroscience 10(3):826-836. [53] Cash, S. & Yuste, R. (1999) Linear summation of excitatory inputs by CA1 pyramidal neurons. Neuron 22(2):383-394. [54] Polsky, A., Mel, B.W. & Schiller, J. (2004) Computational subunits in thin dendrites of pyramidal cells. Nature Neuroscience 7(6):621-627. [55] Paninski, L. (2003) Convergence properties of three spike-triggered analysis techniques. Network: Computation in Neural Systems 14(3):437-464. [56] Okatan, M., Wilson, M.A. & Brown, E.N. (2005) Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation 17(9):1927-1961. [57] Timme, M. (2007) Revealing network connectivity from response dynamics. Physical Review Letters 98(22):224101. 9

2 0.89598393 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation

Author: Mikkel Schmidt

Abstract: We present a general Bayesian approach to probabilistic matrix factorization subject to linear constraints. The approach is based on a Gaussian observation model and Gaussian priors with bilinear equality and inequality constraints. We present an efﬁcient Markov chain Monte Carlo inference procedure based on Gibbs sampling. Special cases of the proposed model are Bayesian formulations of nonnegative matrix factorization and factor analysis. The method is evaluated on a blind source separation problem. We demonstrate that our algorithm can be used to extract meaningful and interpretable features that are remarkably different from features extracted using existing related matrix factorization techniques.

3 0.85229039 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

Author: Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, Gideon S. Mann

Abstract: Training conditional maximum entropy models on massive data sets requires signiﬁcant computational resources. We examine three common distributed training methods for conditional maxent: a distributed gradient computation method, a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods and present a theoretical analysis of conditional maxent models, including a study of the convergence of the mixture weight method, the most resource-efﬁcient technique. We also report the results of large-scale experiments comparing these three methods which demonstrate the beneﬁts of the mixture weight method: this method consumes less resources, while achieving a performance comparable to that of standard approaches.

same-paper 4 0.82992721 141 nips-2009-Local Rules for Global MAP: When Do They Work ?

Author: Kyomin Jung, Pushmeet Kohli, Devavrat Shah

5 0.70609754 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

Author: Steven Chase, Andrew Schwartz, Wolfgang Maass, Robert A. Legenstein

Abstract: The control of neuroprosthetic devices from the activity of motor cortex neurons beneﬁts from learning effects where the function of these neurons is adapted to the control task. It was recently shown that tuning properties of neurons in monkey motor cortex are adapted selectively in order to compensate for an erroneous interpretation of their activity. In particular, it was shown that the tuning curves of those neurons whose preferred directions had been misinterpreted changed more than those of other neurons. In this article, we show that the experimentally observed self-tuning properties of the system can be explained on the basis of a simple learning rule. This learning rule utilizes neuronal noise for exploration and performs Hebbian weight updates that are modulated by a global reward signal. In contrast to most previously proposed reward-modulated Hebbian learning rules, this rule does not require extraneous knowledge about what is noise and what is signal. The learning rule is able to optimize the performance of the model system within biologically realistic periods of time and under high noise levels. When the neuronal noise is ﬁtted to experimental data, the model produces learning effects similar to those found in monkey experiments.

6 0.6701203 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

7 0.66946852 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs

8 0.65011638 121 nips-2009-Know Thy Neighbour: A Normative Theory of Synaptic Depression

9 0.6378746 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

10 0.6301145 13 nips-2009-A Neural Implementation of the Kalman Filter

11 0.62690938 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

12 0.612463 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

13 0.59581655 70 nips-2009-Discriminative Network Models of Schizophrenia

14 0.5913496 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

15 0.58255094 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

16 0.58094901 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

17 0.57466614 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

18 0.57372403 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

19 0.57199687 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

20 0.56691265 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity