nips nips2006 nips2006-163 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We study the problem of online prediction of a noisy labeling of a graph with the perceptron. [sent-6, score-0.477]
2 To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. [sent-9, score-0.714]
3 We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. [sent-13, score-1.376]
4 The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. [sent-14, score-0.549]
5 Nature presents a vertex vi1 ; the learner predicts the label of the vertex y1 ∈ {−1, 1}; nature presents a label y1 ; nature presents a vertex vi2 ; the learner predicts y2 ; ˆ ˆ and so forth. [sent-17, score-0.752]
6 The learner’s goal is minimize the total number of mistakes (|{t : yt = yt }|). [sent-18, score-0.396]
7 In this paper, we consider the cut size as a measure of the complexity of a graph’s labeling, where the size of the cut is the number of edges between disagreeing labels. [sent-21, score-0.338]
8 We will give bounds which depend on the cut size and the diameter of the graph and thus do not directly depend on the size of the graph. [sent-22, score-0.711]
9 The problem of learning a labeling of a graph is a natural problem in the online learning setting, as well as a foundational technique for a variety of semi-supervised learning methods [2, 3, 4, 5, 6]. [sent-23, score-0.424]
10 The web pages may be identified with the vertices of a graph and the edges as links between pages. [sent-25, score-0.48]
11 The online prediction problem is then that, at a given time t the system may receive a request to serve an advertisement on a particular web page. [sent-26, score-0.276]
12 , do Predict: receive vit + + yt = sign(eit wt ) ˆ Update: receive yt Figure 2: Barbell if yt = yt then ˆ + wt+1 = wt else + + wt+1 = wt + yt vit MA = MA ∪ {t} + + end Figure 1: Perceptron on set VM . [sent-37, score-0.958]
13 Kernels on graph labelings were introduced in [3, 5]. [sent-39, score-0.269]
14 There it was shown that, given a fixed labeling of a graph, the number of mistakes made by an algorithm similar to the kernel perceptron [9] with a kernel that was the pseudoinverse of the graph Laplacian, could be bounded by the quantity [8, Theorems 3. [sent-41, score-0.943]
15 This bound is interesting in that the mistakes of the algorithm could be bounded in terms of simple properties of a labeled graph. [sent-46, score-0.353]
16 Now suppose that, say, the first clique contains one vertex which is labeled as the second clique (see Figure 3). [sent-56, score-0.258]
17 For this purpose, we apply the well-known kernel perceptron [9] to the problem of online learning on the graph. [sent-61, score-0.374]
18 We discuss the background material for this problem in section 2, where we also show that the bounds of [1] can be specialized to relative mistake bounds when applied to, for example, prediction on the graph. [sent-62, score-0.283]
19 A second important aim of this paper is to interpret the mistake bounds by an explanation in terms of high level graph properties. [sent-63, score-0.438]
20 Hence, in section 3, we refine a diameter based bound of [8, Theorem 4. [sent-64, score-0.36]
21 In section 4, we introduce a kernel which is a simple augmentation of the pseudoinverse of the graph Laplacian and then prove a theorem on the performance of the perceptron with this kernel which solves the three problems above. [sent-66, score-0.721]
22 We conclude in section 5, with a discussion comparing the mistake bounds for prediction on the graph with the halving algorithm [11] and the k-nearest neighbors algorithm. [sent-67, score-0.714]
23 2 Preliminaries In this section, we describe our setup for Hilbert spaces on finite sets and its specification to the graph case. [sent-68, score-0.269]
24 We then recall a result of Gentile [1] on prediction with the perceptron and discuss a special case in which relative 0–1 loss (mistake) bounds are obtainable. [sent-69, score-0.372]
25 When V indexes the vertices of an undirected graph G, a natural norm to use is that induced by the graph Laplacian. [sent-98, score-0.712]
26 Let A be the n × n symmetric weight matrix of the graph such that Aij ≥ 0, and define the edge set E(G) := {(i, j) : 0 < Aij , i < j}. [sent-100, score-0.306]
27 The graph Laplacian G is the n × n matrix ij defined as G := D − A where D = diag(d1 , . [sent-102, score-0.305]
28 := w Gw = (4) (i,j)∈E(G) When the graph is connected, it follows from equation (4) that the null space of G is spanned by the constant vector 1 only. [sent-107, score-0.316]
29 In this paper, we always assume that the graph G is connected. [sent-108, score-0.269]
30 Where it is not ambiguous, we use the notation G to denote both the graph G and the graph Laplacian. [sent-109, score-0.538]
31 2 Online prediction of functions on a finite set with the perceptron Gentile [1] bounded the performance of the perceptron algorithm on nonseparable data with the linear hinge loss. [sent-111, score-0.666]
32 Here, we apply his result to study the problem of prediction on a finite set with the perceptron (see Figure 1). [sent-112, score-0.311]
33 This assumption, along with the fact that the inputs are coordinates, enables us to upper bound the hinge loss and hence we may give a relative mistake bound in terms of the complete set of base classifiers {−1, 1}n . [sent-115, score-0.452]
34 Thus, the hinge loss Lu,t := max(0, 1 − yt u, vit M ) of any classifier u ∈ {−1, 1}n with any example (vit , yt ) is either 0 or 2, since | u, vit M | = 1 by equation (3). [sent-124, score-0.593]
35 Moreover, we note that, for deterministic prediction the constant 2 in the first term of the right hand side of equation (5) is optimal for an online algorithm as a mistake may be forced on every trial. [sent-127, score-0.287]
36 3 Interpretation of the space H(G) The bound for prediction on a finite set in equation (5) involves two quantities, namely the squared norm of a classifier u ∈ {−1, 1}n and the maximum of the squared norms of the coordinates v ∈ VM . [sent-128, score-0.282]
37 ΦG (u) := The norm v − w G is a metric distance for v, w ∈ span(VG ) however, surprisingly, the square of 2 the norm vp − vq G when restricted to graph coordinates vp , vq ∈ VG is also a metric known as the resistance distance [10], 2 rG (p, q) := (ep − eq ) G+ (ep − eq ) = vp − vq G . [sent-132, score-1.961]
38 (7) It is interesting to note that the resistance distance between vertex p and vertex q is the effective resistance between vertices p and q, where the graph is the circuit and edge (i, j) is a resistor with the resistance ∆ij = A−1 . [sent-133, score-1.551]
39 ij 2 2 As we shall see, our bounds in section 4 depend on vp G = vp − 0 G . [sent-134, score-0.759]
40 Formally, this is not an effective resistance between vertex p and another vertex “0”. [sent-135, score-0.618]
41 The vector 0, informally however, is P v v∈VG the center of the graph as 0 = , since 1 is in the null space of H(G). [sent-136, score-0.269]
42 First, we observe qualitatively that the more interconnected the graph 2 vp G (Corollary 3. [sent-138, score-0.6]
43 2 we quantitatively upper bound further characterize the smaller the term 2 vp G by the average (over q) of the effective resistance between vertex p and each vertex q in the graph (including q = p), which in turn may be upper bounded by the eccentricity of p. [sent-141, score-1.575]
44 If M and M are symmetric positive semidefinite matrices with span(VM ) = 2 2 span(VM ) and, for every w ∈ span(VM ), w M ≤ w M then n n 2 2 ≥ ai vi M i=1 ai vi i=1 , M where vi ∈ VM , vi ∈ VM and a ∈ IRn . [sent-149, score-0.325]
45 As a corollary to the above theorem we have the following when M is a graph Laplacian. [sent-154, score-0.375]
46 Given connected graphs G and G with distance matrices ∆ and ∆ such that 2 2 ∆ij ≤ ∆ij then for all p, q ∈ V , we have that vp G ≤ vp G and rG (p, q) ≤ rG (p, q). [sent-157, score-0.805]
47 2 The first inequality in the above corollary demonstrates that vp G is nonincreasing in a graph that is strictly more connected. [sent-158, score-0.68]
48 The second inequality is the well-known Rayleigh’s monotonicity law which states that if any resistance in a circuit is decreased then the effective resistance between any two points cannot increase. [sent-159, score-0.515]
49 We define the geodesic distance between vertices p, q ∈ V to be dG (p, q) := min |P (p, q)| where the minimum is taken with respect to all paths P(p, q) from p to q, with the path length defined as |P(p, q)| := (i,j)∈E(P(p,q)) ∆ij . [sent-160, score-0.322]
50 The eccentricity dG (p) of a vertex p ∈ V is the geodesic distance on the graph between p and the furthest vertex on the graph to p, that is, dG (p) = maxq∈V dG (p, q) ≤ DG , and DG is the (geodesic) diameter of the graph, DG := maxp∈V dG (p). [sent-161, score-1.387]
51 A tree is an n-vertex connected graph with n − 1 edges. [sent-163, score-0.347]
52 [10]), establishes that the resistance distance can be be equated with the geodesic distance when the graph is a tree. [sent-166, score-0.735]
53 If the graph T is a tree with graph Laplacian T then rT (p, q) = dT (p, q). [sent-169, score-0.573]
54 2 The next theorem provides a quantitative relationship between vp G and two measures of the connectivity of vertex p, namely its eccentricity and the mean of the effective resistances between vertex p and each vertex on the graph. [sent-170, score-1.006]
55 If G is a connected graph then n vp 2 G ≤ 1 rG (p, q) ≤ dG (p). [sent-173, score-0.643]
56 Recall that rG (p, q) = vp − vq 2 (see equation (7)) and use q=1 vq = 0 to obtain that G n n 1 1 vp − vq 2 = vp Gvp + n q=1 vq Gvq which implies the left inequality in (8). [sent-175, score-1.431]
57 We identify the resistance diameter of a graph G as RG := maxp,q∈V rG (p, q); thus, from the previous theorem, we may also conclude that max vp p∈V 2 G ≤ RG ≤ DG . [sent-183, score-1.111]
58 Specifically, we consider the “flower graph” (see Figure 4) obtained by connecting the first vertex of a chain with p − 1 vertices to the root vertex of an m-ary tree of depth one. [sent-185, score-0.556]
59 We index the vertices of this graph so that vertices 1 to p correspond to “stem vertices” and vertices p + 1 to p + m to “petals”. [sent-186, score-0.692]
60 Clearly, this graph has diameter equal to p, hence our upper 2 bound above establishes that v1 G ≤ p. [sent-187, score-0.705]
61 Therefore, as m → ∞ the upper bound on v1 2 (equation (8)) G 4m computation gives v1 2 that v1 G ≥ for the flower graph is matched by a lower bound with a gap of 1. [sent-196, score-0.569]
62 4 Prediction on the graph We define the following symmetric positive definite graph kernel, Kb := G+ + b11 + cI, (0 < b, 0 ≤ c), (10) c b b −1 b where Gc = (Kc ) is the matrix of the associated Hilbert space H(Gc ). [sent-197, score-0.575]
63 Given a vertex p with associated coordinates vp ∈ VG and vp ∈ VGb we have that c vp 2 Gb c = vp 2 G + b + c. [sent-212, score-1.551]
64 To prove equation (11) we recall equation (3) and note that vp vp , vp Gb c + vp , b1+cep Gb c = 2 vp G 2 = Gb c (12) vp , vp +b1+cep Gb c = + b + c. [sent-215, score-2.441]
65 0 We can now state our relative mistake bound for online prediction on the graph. [sent-226, score-0.352]
66 The upper bound of the theorem is more resilient to label imbalance, concept noise, and label noise than the bound in [8, Theorems 3. [sent-236, score-0.518]
67 For example, given the noisy barbell graph in Figure 3 but with k n noisy vertices the bound (1) is O(kn) while the bound (15) with b = 1, c = 1, and |Mu | = 0 is O(k). [sent-240, score-0.719]
68 In the bound above, for easy interpretability, one may upper bound the resistance diameter RG by the geodesic diameter DG . [sent-242, score-1.168]
69 However, the resistance diameter makes for a sharper bound in a number of natural situations. [sent-243, score-0.686]
70 We observe between any two vertices there are at least edge-disjoint paths of length no more than five, therefore the resistance diameter is at most 5/ by the “resistors-in-parallel” rule while the geodesic diameter is 3. [sent-245, score-1.009]
71 Thus, for “thick barbells” if we use the geodesic diameter we have a mistake bound of 16 (substituting βu = 0, and RG ≤ 3 into (16)) while surprisingly with the resistance diameter the bound (substituting 1 b = 4n , c = 0, |Mu | = 0, βu = 0, and RG ≤ 5/ into (15)) is independent of and is 20. [sent-246, score-1.2]
72 In the following, we compare the perceptron with two other approaches. [sent-248, score-0.258]
73 First, we compare the perceptron with the graph kernel K1 to the conceptually simpler k-nearest neighbors algorithm with either 0 the graph geodesic distance or the resistance distance. [sent-249, score-1.308]
74 In particular, we prove the impossibility of bounding performance of k-nearest neighbors only in terms of the diameter and the cut size. [sent-250, score-0.496]
75 Specifically, we give a parameterized family of graphs for which the number of mistakes of the perceptron is upper bounded by a fixed constant independent of the graph size while k-nearest neighbors provably incurs mistakes linearly in the graph size. [sent-251, score-1.41]
76 Second, we compare the perceptron with the graph kernel K1 with a simple application of the classical halving algorithm [11]. [sent-252, score-0.706]
77 Here, we conclude that 0 the upper bound for the perceptron is better for graphs with a small diameter while the halving algorithm’s upper bound is better for graphs with a large diameter. [sent-253, score-1.155]
78 In the following, for simplicity we limit our discussion to binary-weighted graphs, noise-free data (see equation (16)) and upper bound the resistance diameter RG with the geodesic diameter DG (see equation (9)). [sent-254, score-1.15]
79 1 K-nearest neighbors on the graph We consider the k-nearest neighbors algorithms on the graph with both the resistance distance (see equation (7)) and the graph geodesic distance. [sent-256, score-1.385]
80 The geodesic distance between two vertices is the length of the shortest path between the two vertices (recall the discussion in section 3). [sent-257, score-0.463]
81 An octopus graph (see Figure 5) consists of a “head” which is an -clique (C ( ) ) with vertices denoted by c1 , . [sent-260, score-0.469]
82 , c , and a set of m “tentacles” ({Ti }m ), where i=1 each tentacle is a line graph of length p. [sent-263, score-0.32]
83 Thus, this graph has diameter D = max(p + 2, 2p) and there are + mp + 1 vertices in total; an octopus Om,p is balanced if = mp + 1. [sent-269, score-0.717]
84 Note that the distance of every vertex in the head to every other vertex in the graph is no more than p + 2, and every tentacle “tip” ti,p is distance 2p to other tips tj,p : j = i. [sent-270, score-0.794]
85 We now contrast this result with the performance of the perceptron with the graph kernel K1 (see equation (10)). [sent-277, score-0.611]
86 By equation (16), 0 the number of mistakes will be upper bounded by 10p + 5 because there is a cut of size 1 and the diameter is 2p. [sent-278, score-0.745]
87 Thus, for balanced octopi Om,p with p ≥ 3, as m grows the number of mistakes of the kernel perceptron will be bounded by a fixed constant. [sent-279, score-0.536]
88 Whereas distance k-nearest neighbors will incur mistakes linearly in m. [sent-280, score-0.291]
89 Hence, the number of mistakes of the halving algorithm is upper bounded by the logarithm of the cardinality of the concept class. [sent-284, score-0.541]
90 Let k KG = {u ∈ {−1, 1}n : ΦG (u) = k} be the set all of all classifiers with a cut size equal to k on a k k fixed graph G. [sent-285, score-0.402]
91 The cardinality of KG is upper bounded by n(n−1) since any classifier (cut) in KG k can be uniquely identified by a choice of k edges and 1 bit which determines the sign of the vertices in the same of partition (however we overcount as not every set of edges determines a classifier). [sent-286, score-0.385]
92 The number of mistakes of the halving algorithm is upper bounded by O(k log n ). [sent-287, score-0.459]
93 For example, k on a line graph with a cut size of 1 the halving algorithm has an upper bound of log n while the upper bound for the number of mistakes of the perceptron as given in equation (16) is 5n + 5. [sent-288, score-1.413]
94 Although the halving algorithm has a sharper bound on such large diameter graphs as the line graph, it unfortunately has a logarithmic dependence on n. [sent-289, score-0.643]
95 This contrasts to the bound of the perceptron which is essentially independent of n. [sent-290, score-0.37]
96 Thus, the bound for the halving algorithm is roughly sharper on graphs with a diameter ω(log n ), while the perceptron bound is roughly sharper on graphs with k a diameter o(log n ). [sent-291, score-1.402]
97 We emphasize that this analysis of upper bounds is quite rough and sharper k bounds for both algorithms could be obtained for example, by including a term representing the minimal possible cut, that is, the minimum number of edges necessary to disconnect a graph. [sent-292, score-0.329]
98 For k the halving algorithm this would enable a better bound on the cardinality of KG (see [13]). [sent-293, score-0.283]
99 While, for the perceptron the larger the connectivity of the graph, the weaker the diameter upper bound in Theorem 3. [sent-294, score-0.694]
100 Learning from labeled and unlabeled data using graph mincuts. [sent-305, score-0.269]
wordName wordTfidf (topN-words)
[('vp', 0.331), ('graph', 0.269), ('perceptron', 0.258), ('rg', 0.256), ('dg', 0.251), ('diameter', 0.248), ('resistance', 0.238), ('vertex', 0.19), ('mistakes', 0.188), ('gb', 0.15), ('vit', 0.147), ('halving', 0.142), ('vertices', 0.141), ('vm', 0.14), ('mu', 0.135), ('geodesic', 0.134), ('cut', 0.133), ('bound', 0.112), ('mistake', 0.108), ('yt', 0.104), ('irn', 0.102), ('ui', 0.101), ('sharper', 0.088), ('vq', 0.088), ('barbell', 0.085), ('online', 0.079), ('upper', 0.076), ('labeling', 0.076), ('vg', 0.067), ('ma', 0.067), ('wi', 0.067), ('theorem', 0.065), ('bounds', 0.061), ('vi', 0.06), ('advertisement', 0.059), ('bal', 0.059), ('octopus', 0.059), ('neighbors', 0.056), ('aij', 0.054), ('mx', 0.054), ('lemma', 0.054), ('concept', 0.053), ('prediction', 0.053), ('bounded', 0.053), ('graphs', 0.053), ('gentile', 0.051), ('tentacle', 0.051), ('label', 0.05), ('kg', 0.05), ('equation', 0.047), ('distance', 0.047), ('mw', 0.044), ('span', 0.044), ('hinge', 0.044), ('edges', 0.043), ('connected', 0.043), ('laplacian', 0.043), ('semide', 0.042), ('learner', 0.041), ('corollary', 0.041), ('eccentricity', 0.04), ('inequality', 0.039), ('kernel', 0.037), ('coordinates', 0.037), ('symmetric', 0.037), ('ij', 0.036), ('tree', 0.035), ('barbells', 0.034), ('cep', 0.034), ('gc', 0.034), ('herbster', 0.034), ('request', 0.034), ('uit', 0.034), ('vgb', 0.034), ('clique', 0.034), ('mc', 0.034), ('norm', 0.033), ('nite', 0.032), ('thick', 0.032), ('wt', 0.032), ('dt', 0.03), ('prove', 0.03), ('hilbert', 0.03), ('impossibility', 0.029), ('imbalance', 0.029), ('disagreeing', 0.029), ('ower', 0.029), ('wp', 0.029), ('maxt', 0.029), ('cardinality', 0.029), ('web', 0.027), ('mint', 0.027), ('conclude', 0.025), ('minw', 0.025), ('pseudoinverse', 0.025), ('pontil', 0.025), ('ai', 0.024), ('ei', 0.024), ('receive', 0.024), ('blum', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
2 0.20895167 14 nips-2006-A Small World Threshold for Economic Network Formation
Author: Eyal Even-dar, Michael Kearns
Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1
3 0.14204134 77 nips-2006-Fast Computation of Graph Kernels
Author: Karsten M. Borgwardt, Nicol N. Schraudolph, S.v.n. Vishwanathan
Abstract: Using extensions of linear algebra concepts to Reproducing Kernel Hilbert Spaces (RKHS), we define a unifying framework for random walk kernels on graphs. Reduction to a Sylvester equation allows us to compute many of these kernels in O(n3 ) worst-case time. This includes kernels whose previous worst-case time complexity was O(n6 ), such as the geometric kernels of G¨ rtner et al. [1] and a the marginal graph kernels of Kashima et al. [2]. Our algebra in RKHS allow us to exploit sparsity in directed and undirected graphs more effectively than previous methods, yielding sub-cubic computational complexity when combined with conjugate gradient solvers or fixed-point iterations. Experiments on graphs from bioinformatics and other application domains show that our algorithms are often more than 1000 times faster than existing approaches. 1
4 0.13594262 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
5 0.13007782 35 nips-2006-Approximate inference using planar graph decomposition
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . , xn } defined as 1 P p(x) = e ij∈E fij (xi ,xj ) (1) Z where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The P partition function is defined as Z = x e ij∈E fij (xi ,xj ) . Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by 1 θij (1 + xi xj ) = θij I(xi = xj ) (2) 2 so that fij (xi , xj ) is zero if xi = xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. f (xi , xj ) = A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in 2 corresponding to the edges in the graph. The regions of 2 \ E are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of O(n3 ) constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by 1 TRI(n) = {τij : τij + τjk − τik ≤ 1, τij + τjk + τik ≥ 1, ∀i, j, k ∈ {1, . . . , n}} (3) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise 1 The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope. constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We define an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a first step in calculating a partition function over GT , we introduce the following definition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is defined as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Define the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Define Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Define the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define G(r) as a set of spanning planar subgraphs of G, i.e., each graph G(r) is planar and contains all the vertices of G and some its edges. Denote by m the number of such graphs. Introduce the following definitions: (r) • θ (r) is a set of parameters on the edges of G(r) , and θij is an element in this set. Z(θ (r) ) is the partition function of the MRF on G(r) with parameters θ (r) . ˆ (r) ˆ(r) • θ is a set of parameters on the edges of G such that if edge (ij) is in G(r) then θij = (r) ˆ(r) θ , and otherwise θ = 0. ij ij Given a distribution ρ(r) on the graphs G(r) (i.e., ρ(r) ≥ 0 for r = 1, . . . , m and assume that the parameters for G(r) are such that ˆ ρ(r)θ θ= (r) r ρ(r) = 1), (5) r Then, by the convexity of the log partition function, as a function of the model parameters, we have ρ(r) log Z(θ (r) ) ≡ f (θ, ρ, θ (r) ) log Z(θ) ≤ (6) r Since by assumption the graphs G(r) are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters θ (r) which satisfies the condition in Equation 5 and for any distribution ρ(r), we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of ρ(r) by g(ρ, θ) (optimization is w.r.t. θ (r) ) g(ρ, θ) = f (θ, ρ, θ (r) ) min θ (r) : P ˆ ρ(r)θ (r) =θ (7) Also, define the optimum of the above w.r.t. ρ by h(θ). h(θ) = min g(θ, ρ) ρ(r) ≥ 0, ρ(r) = 1 (8) Thus, h(θ) is the optimal upper bound for the given parameter vector θ. In the following section we argue that we can in fact find the global optimum of the above problem. 4 Globally Optimal Bound Optimization First consider calculating g(ρ, θ) from Equation 7. Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , find θ (r) which solves the minimization in Equation 7. – Calculate the gradient of g(ρ, θ) at ρt using the expression in Equation 11 – Update ρt+1 = ρt + αv where v is a feasible search direction calculated from the gradient of g(ρ, θ) and the simplex constraints on ρ. The step size α is calculated via an Armijo line search. – Halt when the change in g(ρ, θ) is smaller than some threshold. Note that the minimization w.r.t. θ (r) is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in ρ is not too big. The above algorithm is guaranteed to converge to a global optimum of ρ [2], and thus we obtain the tightest possible upper bound on Z(θ) given our planar graph decomposition. The procedure described here is asymmetric w.r.t. ρ and θ (r) . In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric ˆ (r) formulation can be obtained by decoupling ρ and θ (r) in the bi-linear constraint ρ(r)θ = θ. Field Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th ,5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar. ˜ ˜ Specifically, we introduce θ (r) = θ (r) ρ(r) and perform the optimization w.r.t. ρ and θ (r) . It can be ˜(r) ) with the relevant (de-coupled) constraint is equivalent shown that a stationary point of f (θ, ρ, θ to the procedure described above. The advantage of this approach is that the exact minimization w.r.t θ (r) is not required before modifying ρ. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation. 5 Estimating Marginals The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of ρ. Deriving the Lagrangian of Equation 7 w.r.t. θ (r) we obtain the (r) following characterization of θmin (ρ): Marginal Optimality Criterion: For any two graphs G(r) , G(s) such that the edge (ij) is in both (r) (s) graphs, the optimal parameter vector satisfies τij (θmin (ρ)) = τij (θmin (ρ)). Thus, the optimal set of parameters for the graphs G(r) is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals. A different method for estimating marginals uses the partition function bound directly. We first P calculate partition function bounds on the sums: αi (1) = x:xi =1 e ij∈E fij (xi ,xj ) and αi (−1) = P αi (1) e ij∈E fij (xi ,xj ) and then normalize αi (1)+αi (−1) to obtain an estimate for p(xi = 1). This method has the advantage of being more numerically stable (since it does not depend on derivatives of log Z). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables. x:xi =−1 6 Experimental Evaluation We study the application of our Planar Decomposition (PDC) P method to a binary MRF on a square P lattice with an external field. The MRF is given by p(x) ∝ e ij∈E θij xi xj + i∈V θi xi where V are the lattice vertices, and θi and θij are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable xn+1 that is connected to all the original variables.P Denote the correspondP ij∈E θij xi xj + i∈V θi,n+1 xi xn+1 , where ing graph by Gf . Consider the distribution p(x, xn+1 ) ∝ e θi,n+1 = θi . It is easy to see that any property of p(x) (e.g., partition function, marginals) may be calculated from the corresponding property of p(x, xn+1 ). The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph Gf . 0.25 0.15 0.1 0.05 0.5 1 1.5 Interaction Strength 0.03 Singleton Marginal Error Z Bound Error Pairwise Marginals Error 0.08 PDC TRW 0.2 0.07 0.06 0.05 0.04 0.03 0.02 2 0.5 1 1.5 Interaction Strength 0.025 0.02 0.015 0.01 0.005 2 0.5 1 1.5 Interaction Strength 2 !3 x 10 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 Singleton Marginal Error Pairwise Marginals Error Z Bound Error 0.03 0.03 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 9 8 7 6 5 4 3 0.5 1 Field Strength 1.5 2 Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U [−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials. There are clearly many ways to choose spanning planar subgraphs of Gf . Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, √ we preferred to use a number of subgraphs that is linear in n. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect xn+1 only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is √ illustrated in Figure 2. To this set of 2 n graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes. We compared the performance of the PDC and TRW methods 3 4 on a 7 × 7 lattice . Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters θi were drawn uniformly from U[−0.05, 0.05], and the interaction θij from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field θi was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and θij from U[−0.5, 0.5]. For each scenario, we calculated the following measures: 1) Normalized log partition error 1 1 alg − log Z true ). 2) Error in pairwise marginals |E| ij∈E |palg (xi = 1, xj = 1) − 49 (log Z ptrue (xi = 1, xj = 1)|. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 5 . The same method was used for TRW. The reported error measure is |palg (xi = 1) − ptrue (xi = 1)|. Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Error for the PDC pairwise 3 TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ. In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation. 5 Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision. 4 marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field. 7 Discussion We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k−ary variables is. However, the planar decomposition method can then be applied to this non-planar graph. The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone. The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10]. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code. References [1] F. Barahona. On the computational complexity of ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springe-Verlag, 1997. [4] R. Diestel. Graph Theory. Springer-Verlag, 1997. [5] M.E. Fisher. On the dimer solution of planar ising models. J. Math. Phys., 7:1776–1781, 1966. [6] M.I. Jordan, editor. Learning in graphical models. MIT press, Cambridge, MA, 1998. [7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963. [8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of discrete mathematics. NorthHolland, New-York, 1986. [9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005. [12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003. [13] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
6 0.12830035 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
7 0.11169393 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
8 0.10033215 203 nips-2006-implicit Online Learning with Kernels
9 0.099894889 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
10 0.098912008 156 nips-2006-Ordinal Regression by Extended Binary Classification
11 0.097213008 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
12 0.096414149 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
13 0.095253527 128 nips-2006-Manifold Denoising
14 0.093162969 39 nips-2006-Balanced Graph Matching
15 0.087148555 80 nips-2006-Fundamental Limitations of Spectral Clustering
16 0.084193729 117 nips-2006-Learning on Graph with Laplacian Regularization
17 0.08130949 138 nips-2006-Multi-Task Feature Learning
18 0.081105113 65 nips-2006-Denoising and Dimension Reduction in Feature Space
19 0.075892419 61 nips-2006-Convex Repeated Games and Fenchel Duality
20 0.075716548 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
topicId topicWeight
[(0, -0.213), (1, 0.1), (2, -0.172), (3, 0.115), (4, 0.064), (5, 0.023), (6, -0.054), (7, 0.145), (8, -0.099), (9, -0.163), (10, 0.129), (11, -0.194), (12, 0.034), (13, -0.046), (14, -0.051), (15, -0.06), (16, -0.065), (17, 0.007), (18, -0.03), (19, -0.121), (20, -0.116), (21, -0.06), (22, 0.021), (23, -0.007), (24, 0.103), (25, -0.006), (26, -0.096), (27, 0.075), (28, 0.086), (29, 0.102), (30, -0.004), (31, 0.074), (32, 0.003), (33, 0.056), (34, -0.063), (35, -0.105), (36, -0.04), (37, 0.058), (38, 0.025), (39, -0.088), (40, -0.121), (41, 0.02), (42, -0.086), (43, -0.021), (44, -0.077), (45, -0.064), (46, -0.147), (47, -0.149), (48, 0.026), (49, 0.168)]
simIndex simValue paperId paperTitle
same-paper 1 0.96306551 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
2 0.74572092 77 nips-2006-Fast Computation of Graph Kernels
Author: Karsten M. Borgwardt, Nicol N. Schraudolph, S.v.n. Vishwanathan
Abstract: Using extensions of linear algebra concepts to Reproducing Kernel Hilbert Spaces (RKHS), we define a unifying framework for random walk kernels on graphs. Reduction to a Sylvester equation allows us to compute many of these kernels in O(n3 ) worst-case time. This includes kernels whose previous worst-case time complexity was O(n6 ), such as the geometric kernels of G¨ rtner et al. [1] and a the marginal graph kernels of Kashima et al. [2]. Our algebra in RKHS allow us to exploit sparsity in directed and undirected graphs more effectively than previous methods, yielding sub-cubic computational complexity when combined with conjugate gradient solvers or fixed-point iterations. Experiments on graphs from bioinformatics and other application domains show that our algorithms are often more than 1000 times faster than existing approaches. 1
3 0.67673296 14 nips-2006-A Small World Threshold for Economic Network Formation
Author: Eyal Even-dar, Michael Kearns
Abstract: We introduce a game-theoretic model for network formation inspired by earlier stochastic models that mix localized and long-distance connectivity. In this model, players may purchase edges at distance d at a cost of dα , and wish to minimize the sum of their edge purchases and their average distance to other players. In this model, we show there is a striking “small world” threshold phenomenon: in two dimensions, if α < 2 then every Nash equilibrium results in a network of constant diameter (independent of network size), and if α > 2 then every Nash equilibrium results in a network whose diameter grows as a root of the network size, and thus is unbounded. We contrast our results with those of Kleinberg [8] in a stochastic model, and empirically investigate the “navigability” of equilibrium networks. Our theoretical results all generalize to higher dimensions. 1
4 0.66813374 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
Author: Dengyong Zhou, Jiayuan Huang, Bernhard Schölkopf
Abstract: We usually endow the investigated objects with pairwise relationships, which can be illustrated as graphs. In many real-world problems, however, relationships among the objects of our interest are more complex than pairwise. Naively squeezing the complex relationships into pairwise ones will inevitably lead to loss of information which can be expected valuable for our learning tasks however. Therefore we consider using hypergraphs instead to completely represent complex relationships among the objects of our interest, and thus the problem of learning with hypergraphs arises. Our main contribution in this paper is to generalize the powerful methodology of spectral clustering which originally operates on undirected graphs to hypergraphs, and further develop algorithms for hypergraph embedding and transductive classification on the basis of the spectral hypergraph clustering approach. Our experiments on a number of benchmarks showed the advantages of hypergraphs over usual graphs. 1
5 0.59450376 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this n n−1 paper is a density bound of n ≤d−1 / ( ≤d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
6 0.49076223 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
7 0.48007107 35 nips-2006-Approximate inference using planar graph decomposition
8 0.41848093 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
9 0.40627566 39 nips-2006-Balanced Graph Matching
10 0.4006131 92 nips-2006-High-Dimensional Graphical Model Selection Using $\ell 1$-Regularized Logistic Regression
11 0.39057696 142 nips-2006-Mutagenetic tree Fisher kernel improves prediction of HIV drug resistance from viral genotype
12 0.35972068 156 nips-2006-Ordinal Regression by Extended Binary Classification
13 0.35628188 109 nips-2006-Learnability and the doubling dimension
14 0.33614191 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
15 0.33438015 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.33312863 128 nips-2006-Manifold Denoising
17 0.31410211 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
18 0.30096951 60 nips-2006-Convergence of Laplacian Eigenmaps
19 0.29093432 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
20 0.28915977 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
topicId topicWeight
[(1, 0.104), (3, 0.017), (7, 0.101), (9, 0.065), (12, 0.019), (22, 0.067), (37, 0.25), (44, 0.066), (57, 0.04), (64, 0.032), (65, 0.07), (69, 0.023), (83, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.81532902 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
2 0.78053522 91 nips-2006-Hierarchical Dirichlet Processes with Random Effects
Author: Seyoung Kim, Padhraic Smyth
Abstract: Data sets involving multiple groups with shared characteristics frequently arise in practice. In this paper we extend hierarchical Dirichlet processes to model such data. Each group is assumed to be generated from a template mixture model with group level variability in both the mixing proportions and the component parameters. Variabilities in mixing proportions across groups are handled using hierarchical Dirichlet processes, also allowing for automatic determination of the number of components. In addition, each group is allowed to have its own component parameters coming from a prior described by a template mixture model. This group-level variability in the component parameters is handled using a random effects model. We present a Markov Chain Monte Carlo (MCMC) sampling algorithm to estimate model parameters and demonstrate the method by applying it to the problem of modeling spatial brain activation patterns across multiple images collected via functional magnetic resonance imaging (fMRI). 1
3 0.60468107 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
Author: Hamed Valizadegan, Rong Jin
Abstract: Maximum margin clustering was proposed lately and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machine to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that question its efficiency for real-world applications. First, it is computationally expensive and difficult to scale to large-scale datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary will pass through the origins, which makes it unsuitable for clustering unbalanced dataset. Third, it is sensitive to the choice of kernel functions, and requires external procedure to determine the appropriate values for the parameters of kernel functions. In this paper, we propose “generalized maximum margin clustering” framework that addresses the above three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing any clustering boundaries including those not passing through the origins. It significantly improves the computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine the appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm using both synthetic datasets and real datasets from the UCI repository. 1
4 0.59962761 35 nips-2006-Approximate inference using planar graph decomposition
Author: Amir Globerson, Tommi S. Jaakkola
Abstract: A number of exact and approximate methods are available for inference calculations in graphical models. Many recent approximate methods for graphs with cycles are based on tractable algorithms for tree structured graphs. Here we base the approximation on a different tractable model, planar graphs with binary variables and pure interaction potentials (no external field). The partition function for such models can be calculated exactly using an algorithm introduced by Fisher and Kasteleyn in the 1960s. We show how such tractable planar models can be used in a decomposition to derive upper bounds on the partition function of non-planar models. The resulting algorithm also allows for the estimation of marginals. We compare our planar decomposition to the tree decomposition method of Wainwright et. al., showing that it results in a much tighter bound on the partition function, improved pairwise marginals, and comparable singleton marginals. Graphical models are a powerful tool for modeling multivariate distributions, and have been successfully applied in various fields such as coding theory and image processing. Applications of graphical models typically involve calculating two types of quantities, namely marginal distributions, and MAP assignments. The evaluation of the model partition function is closely related to calculating marginals [12]. These three problems can rarely be solved exactly in polynomial time, and are provably computationally hard in the general case [1]. When the model conforms to a tree structure, however, all these problems can be solved in polynomial time. This has prompted extensive research into tree based methods. For example, the junction tree method [6] converts a graphical model into a tree by clustering nodes into cliques, such that the graph over cliques is a tree. The resulting maximal clique size (cf. tree width) may nevertheless be prohibitively large. Wainwright et. al. [9, 11] proposed an approximate method based on trees known as tree reweighting (TRW). The TRW approach decomposes the potential vector of a graphical model into a mixture over spanning trees of the model, and then uses convexity arguments to bound various quantities, such as the partition function. One key advantage of this approach is that it provides bounds on partition function value, a property which is not shared by approximations based on Bethe free energies [13]. In this paper we focus on a different class of tractable models: planar graphs. A graph is called planar if it can be drawn in the plane without crossing edges. Works in the 1960s by physicists Fisher [5] and Kasteleyn [7], among others, have shown that the partition function for planar graphs may be calculated in polynomial time. This, however, is true under two key restrictions. One is that the variables xi are binary. The other is that the interaction potential depends only on xi xj (where xi ∈ {±1}), and not on their individual values (i.e., the zero external field case). Here we show how the above method can be used to obtain upper bounds on the partition function for non-planar graphs. As in TRW, we decompose the potential of a non-planar graph into a sum over spanning planar models, and then use a convexity argument to obtain an upper bound on the log partition function. The bound optimization is a convex problem, and can be solved in polynomial time. We compare our method with TRW on a planar graph with an external field, and show that it performs favorably with respect to both pairwise marginals and the bound on the partition function, and the two methods give similar results for singleton marginals. 1 Definitions and Notations Given a graph G with n vertices and a set of edges E, we are interested in pairwise Markov Random Fields (MRF) over the graph G. A pairwise MRF [13] is a multivariate distribution over variables x = {x1 , . . . , xn } defined as 1 P p(x) = e ij∈E fij (xi ,xj ) (1) Z where fij are a set of |E| functions, or interaction potentials, defined over pairs of variables. The P partition function is defined as Z = x e ij∈E fij (xi ,xj ) . Here we will focus on the case where xi ∈ {±1}. Furthermore, we will be interested in interaction potentials which only depend on agreement or disagreement between the signs of their variables. We define those by 1 θij (1 + xi xj ) = θij I(xi = xj ) (2) 2 so that fij (xi , xj ) is zero if xi = xj and θij if xi = xj . The model is then defined via the set of parameters θij . We use θ to denote the vector of parameters θij , and denote the partition function by Z(θ) to highlight its dependence on these parameters. f (xi , xj ) = A graph G is defined as planar if it can be drawn in the plane without any intersection of edges [4]. With some abuse of notation, we define E as the set of line segments in 2 corresponding to the edges in the graph. The regions of 2 \ E are defined as the faces of the graph. The face which corresponds to an unbounded region is called the external face. Given a planar graph G, its dual graph G∗ is defined in the following way: the vertices of G∗ correspond to faces of G, and there is an edge between two vertices in G∗ iff the two corresponding faces in G share an edge. If the graph G is weighted, the weight on an edge in G∗ is the weight on the edge shared by the corresponding faces in G. A plane triangulation of a planar graph G is obtained from G by adding edges such that all the faces of the resulting graph have exactly three vertices. Thus a plane triangulated graph has a dual where all vertices have degree three. It can be shown that every plane graph can be plane triangulated [4]. We shall also need the notion of a perfect matching on a graph. A perfect matching on a graph G is defined as a set of edges H ⊆ E such that every vertex in G has exactly one edge in H incident on it. If the graph is weighted, the weight of the matching is defined as the product of the weights of the edges in the matching. Finally, we recall the definition of a marginal polytope of a graph [12]. Consider an MRF over a graph G where fij are given by Equation 2. Denote the probability of the event I(xi = xj ) under p(x) by τij . The marginal polytope of G, denoted by M(G), is defined as the set of values τij that can be obtained under some assignment to the parameters θij . For a general graph G the polytope M(G) cannot be described using a polynomial number of inequalities. However, for planar graphs, it turns out that a set of O(n3 ) constraints, commonly referred to as triangle inequalities, suffice to describe M(G) (see [3] page 434). The triangle inequalities are defined by 1 TRI(n) = {τij : τij + τjk − τik ≤ 1, τij + τjk + τik ≥ 1, ∀i, j, k ∈ {1, . . . , n}} (3) Note that the above inequalities actually contain variables τij which do not correspond to edges in the original graph G. Thus the equality M(G) = TRI(n) should be understood as referring only to the values of τij that correspond to edges in the graph. Importantly, the values of τij for edges not in the graph need not be valid marginals for any MRF. In other words M(G) is a projection of TRI(n) on the set of edges of G. It is well known that the marginal polytope for trees is described via pairwise constraints. It is thus interesting that for planar graphs, it is triplets, rather than pairwise 1 The definition here is slightly different from that in [3], since here we refer to agreement probabilities, whereas [3] refers to disagreement probabilities. This polytope is also referred to as the cut polytope. constraints, that characterize the polytope. In this sense, planar graphs and trees may be viewed as a hierarchy of polytope complexity classes. It remains an interesting problem to characterize other structures in this hierarchy and their related inference algorithms. 2 Exact calculation of partition function using perfect matching The seminal works of Kasteleyn [7] and Fisher [5] have shown how one can calculate the partition function for a binary MRF over a planar graph with pure interaction potentials. We briefly review Fisher’s construction, which we will use in what follows. Our interpretation of the method differs somewhat from that of Fisher, but we believe it is more straightforward. The key idea in calculating the partition function is to convert the summation over values of x to the problem of calculating the sum of weights of all perfect matchings in a graph constructed from G, as shown below. In this section, we consider weighted graphs (graphs with numbers assigned to their edges). For the graph G associated with the pairwise MRF, we assign weights wij = e2θij to the edges. The first step in the construction is to plane triangulate the graph G. Let us call the resulting graph GT . We define an MRF on GT by assigning a parameter θij = 0 to the edges that have been added to G, and the corresponding weight wij = 1. Thus GT essentially describes the same distribution as G, and therefore has the same partition function. We can thus restrict our attention to calculating the partition function for the MRF on GT . As a first step in calculating a partition function over GT , we introduce the following definition: a ˆ set of edges E in GT is an agreement edge set (or AES) if for every triangle face F in GT one of the ˆ ˆ following holds: The edges in F are all in E, or exactly one of the edges in F is in E. The weight ˆ is defined as the product of the weights of the edges in E. ˆ of a set E It can be shown that there exists a bijection between pairs of assignments {x, −x} and agreement edge sets. The mapping from x to an edge set is simply the set of edges such that xi = xj . It is easy to see that this is an agreement edge set. The reverse mapping is obtained by finding an assignment x such that xi = xj iff the corresponding edge is in the agreement edge set. The existence of this mapping can be shown by induction on the number of (triangle) faces. P The contribution of a given assignment x to the partition function is e ˆ sponds to an AES denoted by E it is easy to see that P e ij∈E θij I(xi =xj ) = e− P ij∈E θij P e ˆ ij∈E 2θij = ce P ˆ ij∈E ij∈E 2θij θij I(xi =xj ) =c wij . If x corre(4) ˆ ij∈E P where c = e− ij∈E θij . Define the superset Λ as the set of agreement edge sets. The above then implies that Z(θ) = 2c E∈Λ ij∈E wij , and is thus proportional to the sum of AES weights. ˆ ˆ To sum over agreement edge sets, we use the following elegant trick introduced by Fisher [5]. Construct a new graph GPM from the dual of GT by introducing new vertices and edges according to the following rule: Replace each original vertex with three vertices that are connected to each other, and assign a weight of one to the new edges. Next, consider the three neighbors of the original vertex 2 . Connect each of the three new vertices to one of these three neighbors, keeping the original weights on these edges. The transformation is illustrated in Figure 1. The new graph GPM has O(3n) vertices, and is also planar. It can be seen that there is a one to one correspondence between perfect matchings in GPM and agreement edge sets in GT . Define Ω to be the set of perfect matchings in GPM . Then Z(θ) = 2c M ∈Ω ij∈M wij where we have used the fact that all the new weights have a value of one. Thus, the partition function is a sum over the weights of perfect matchings in GPM . Finally, we need a way of summing over the weights of the set of perfect matchings in a graph. Kasteleyn [7] proved that for a planar graph GPM , this sum may be obtained using the following sequence of steps: • Direct the edges of the graph GPM such that for every face (except possibly the external face), the number of edges on its perimeter oriented in a clockwise manner is odd. Kasteleyn showed that such a so called Pfaffian orientation may be constructed in polynomial time for a planar graph (see also [8] page 322). 2 Note that in the dual of GT all vertices have degree three, since GT is plane triangulated. 1.2 0.7 0.6 1 1 1 0.8 0.6 0.8 1.5 1.4 1.5 1 1 1.2 1 1 1 1 0.7 1.4 1 1 1 Figure 1: Illustration of the graph transformations in Section 2 for a complete graph with four vertices. Left panel shows the original weighted graph (dotted edges and grey vertices) and its dual (solid edges and black vertices). Right panel shows the dual graph with each vertex replaced by a triangle (the graph GPM in the text). Weights for dual graph edges correspond to the weights on the original graph. • Define the matrix P (GPM ) to be a skew symmetric matrix such that Pij = 0 if ij is not an edge, Pij = wij if the arrow on edge ij runs from i to j and Pij = −wij otherwise. • The sum over weighted matchings can then be shown to equal |P (GPM )|. The partition function is thus given by Z(θ) = 2c |P (GPM )|. To conclude this section we reiterate the following two key points: the partition function of a binary MRF over a planar graph with interaction potentials as in Equation 2 may be calculated in polynomial time by calculating the determinant of a matrix of size O(3n). An important outcome of this result is that the functional relation between Z(θ) and the parameters θij is known, a fact we shall use in what follows. 3 Partition function bounds via planar decomposition Given a non-planar graph G over binary variables with a vector of interaction potentials θ, we wish to use the exact planar computation to obtain a bound on the partition function of the MRF on G. We assume for simplicity that the potentials on the MRF for G are given in the form of Equation 2. Thus, G violates the assumptions of the previous section only in its non-planarity. Define G(r) as a set of spanning planar subgraphs of G, i.e., each graph G(r) is planar and contains all the vertices of G and some its edges. Denote by m the number of such graphs. Introduce the following definitions: (r) • θ (r) is a set of parameters on the edges of G(r) , and θij is an element in this set. Z(θ (r) ) is the partition function of the MRF on G(r) with parameters θ (r) . ˆ (r) ˆ(r) • θ is a set of parameters on the edges of G such that if edge (ij) is in G(r) then θij = (r) ˆ(r) θ , and otherwise θ = 0. ij ij Given a distribution ρ(r) on the graphs G(r) (i.e., ρ(r) ≥ 0 for r = 1, . . . , m and assume that the parameters for G(r) are such that ˆ ρ(r)θ θ= (r) r ρ(r) = 1), (5) r Then, by the convexity of the log partition function, as a function of the model parameters, we have ρ(r) log Z(θ (r) ) ≡ f (θ, ρ, θ (r) ) log Z(θ) ≤ (6) r Since by assumption the graphs G(r) are planar, this bound can be calculated in polynomial time. Since this bound is true for any set of parameters θ (r) which satisfies the condition in Equation 5 and for any distribution ρ(r), we may optimize over these two variables to obtain the tightest bound possible. Define the optimal bound for a fixed value of ρ(r) by g(ρ, θ) (optimization is w.r.t. θ (r) ) g(ρ, θ) = f (θ, ρ, θ (r) ) min θ (r) : P ˆ ρ(r)θ (r) =θ (7) Also, define the optimum of the above w.r.t. ρ by h(θ). h(θ) = min g(θ, ρ) ρ(r) ≥ 0, ρ(r) = 1 (8) Thus, h(θ) is the optimal upper bound for the given parameter vector θ. In the following section we argue that we can in fact find the global optimum of the above problem. 4 Globally Optimal Bound Optimization First consider calculating g(ρ, θ) from Equation 7. Note that since log Z(θ (r) ) is a convex function of θ (r) , and the constraints are linear, the overall optimization is convex and can be solved efficiently. In the current implementation, we use a projected gradient algorithm [2]. The gradient of f (θ, ρ, θ (r) ) w.r.t. θ (r) is given by ∂f (θ, ρ, θ (r) ) (r) ∂θij (r) = ρ(r) 1 + eθij (r) P −1 (GPM ) (r) k(i,j) Sign(Pk(i,j) (GPM )) (9) where k(i, j) returns the row and column indices of the element in the upper triangular matrix of (r) (r) P (GPM ), which contains the element e2θij . Since the optimization in Equation 7 is convex, it has an equivalent convex dual. Although we do not use this dual for optimization (because of the difficulty of expressing the entropy of planar models solely in terms of triplet marginals), it nevertheless allows some insight into the structure of the problem. The dual in this case is closely linked to the notion of the marginal polytope defined in Section 1. Using a derivation similar to [11], we arrive at the following characterization of the dual g(ρ, θ) = max τ ∈TRI(n) ρ(r)H(θ (r) (τ )) θ·τ + (10) r where θ (r) (τ ) denotes the parameters of an MRF on G(r) such that its marginals are given by the restriction of τ to the edges of G(r) , and H(θ (r) (τ )) denotes the entropy of the MRF over G(r) with parameters θ (r) (τ ). The maximized function in Equation 10 is linear in ρ and thus g(ρ, θ) is a pointwise maximum over (linear) convex functions in ρ and is thus convex in ρ. It therefore has no (r) local minima. Denote by θmin (ρ) the set of parameters that minimizes Equation 7 for a given value of ρ. Using a derivation similar to that in [11], the gradient of g(ρ, θ) can be shown to be ∂g(ρ, θ) (r) = H(θmin (ρ)) ∂ρ(r) (11) Since the partition function for G(r) can be calculated efficiently, so can the entropy. We can now summarize the algorithm for calculating h(θ) • Initialize ρ0 . Iterate: – For ρt , find θ (r) which solves the minimization in Equation 7. – Calculate the gradient of g(ρ, θ) at ρt using the expression in Equation 11 – Update ρt+1 = ρt + αv where v is a feasible search direction calculated from the gradient of g(ρ, θ) and the simplex constraints on ρ. The step size α is calculated via an Armijo line search. – Halt when the change in g(ρ, θ) is smaller than some threshold. Note that the minimization w.r.t. θ (r) is not very time consuming since we can initialize it with the minimum from the previous step, and thus only a few iterations are needed to find the new optimum, provided the change in ρ is not too big. The above algorithm is guaranteed to converge to a global optimum of ρ [2], and thus we obtain the tightest possible upper bound on Z(θ) given our planar graph decomposition. The procedure described here is asymmetric w.r.t. ρ and θ (r) . In a symmetric formulation the minimizing gradient steps could be carried out jointly or in an alternating sequence. The symmetric ˆ (r) formulation can be obtained by decoupling ρ and θ (r) in the bi-linear constraint ρ(r)θ = θ. Field Figure 2: Illustration of planar subgraph construction for a rectangular lattice with external field. Original graph is shown on the left. The field vertex is connected to all vertices (edges not shown). The graph on the right results from isolating the 4th ,5th columns of the original graph (shown in grey), and connecting the field vertex to the external vertices of the three disconnected components. Note that the resulting graph is planar. ˜ ˜ Specifically, we introduce θ (r) = θ (r) ρ(r) and perform the optimization w.r.t. ρ and θ (r) . It can be ˜(r) ) with the relevant (de-coupled) constraint is equivalent shown that a stationary point of f (θ, ρ, θ to the procedure described above. The advantage of this approach is that the exact minimization w.r.t θ (r) is not required before modifying ρ. Our experiments have shown, however, that the methods take comparable times to converge, although this may be a property of the implementation. 5 Estimating Marginals The optimization problem as defined above minimizes an upper bound on the partition function. However, it may also be of interest to obtain estimates of the marginals of the MRF over G. To obtain marginal estimates, we follow the approach in [11]. We first characterize the optimum of Equation 7 for a fixed value of ρ. Deriving the Lagrangian of Equation 7 w.r.t. θ (r) we obtain the (r) following characterization of θmin (ρ): Marginal Optimality Criterion: For any two graphs G(r) , G(s) such that the edge (ij) is in both (r) (s) graphs, the optimal parameter vector satisfies τij (θmin (ρ)) = τij (θmin (ρ)). Thus, the optimal set of parameters for the graphs G(r) is such that every two graphs agree on the marginals of all the edges they share. This implies that at the optimum, there is a well defined set of marginals over all the edges. We use this set as an approximation to the true marginals. A different method for estimating marginals uses the partition function bound directly. We first P calculate partition function bounds on the sums: αi (1) = x:xi =1 e ij∈E fij (xi ,xj ) and αi (−1) = P αi (1) e ij∈E fij (xi ,xj ) and then normalize αi (1)+αi (−1) to obtain an estimate for p(xi = 1). This method has the advantage of being more numerically stable (since it does not depend on derivatives of log Z). However, it needs to be calculated separately for each variable, so that it may be time consuming if one is interested in marginals for a large set of variables. x:xi =−1 6 Experimental Evaluation We study the application of our Planar Decomposition (PDC) P method to a binary MRF on a square P lattice with an external field. The MRF is given by p(x) ∝ e ij∈E θij xi xj + i∈V θi xi where V are the lattice vertices, and θi and θij are parameters. Note that this interaction does not satisfy the conditions for exact calculation of the partition function, even though the graph is planar. This problem is in fact NP hard [1]. However, it is possible to obtain the desired interaction form by introducing an additional variable xn+1 that is connected to all the original variables.P Denote the correspondP ij∈E θij xi xj + i∈V θi,n+1 xi xn+1 , where ing graph by Gf . Consider the distribution p(x, xn+1 ) ∝ e θi,n+1 = θi . It is easy to see that any property of p(x) (e.g., partition function, marginals) may be calculated from the corresponding property of p(x, xn+1 ). The advantage of the latter distribution is that it has the desired interaction form. We can thus apply PDC by choosing planar subgraphs of the non-planar graph Gf . 0.25 0.15 0.1 0.05 0.5 1 1.5 Interaction Strength 0.03 Singleton Marginal Error Z Bound Error Pairwise Marginals Error 0.08 PDC TRW 0.2 0.07 0.06 0.05 0.04 0.03 0.02 2 0.5 1 1.5 Interaction Strength 0.025 0.02 0.015 0.01 0.005 2 0.5 1 1.5 Interaction Strength 2 !3 x 10 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 Singleton Marginal Error Pairwise Marginals Error Z Bound Error 0.03 0.03 0.025 0.02 0.015 0.5 1 Field Strength 1.5 2 9 8 7 6 5 4 3 0.5 1 Field Strength 1.5 2 Figure 3: Comparison of the TRW and Planar Decomposition (PDC) algorithms on a 7×7 square lattice. TRW results shown in red squares, and PDC in blue circles. Left column shows the error in the log partition bound. Middle column is the mean error for pairwise marginals, and right column is the error for the singleton marginal of the variable at the lattice center. Results in upper row are for field parameters drawn from U[−0.05, 0.05] and various interaction parameters. Results in the lower row are for interaction parameters drawn from U [−0.5, 0.5] and various field parameters. Error bars are standard errors calculated from 40 random trials. There are clearly many ways to choose spanning planar subgraphs of Gf . Spanning subtrees are one option, and were used in [11]. Since our optimization is polynomial in the number of subgraphs, √ we preferred to use a number of subgraphs that is linear in n. The key idea in generating these planar subgraphs is to generate disconnected components of the lattice and connect xn+1 only to the external vertices of these components. Here we generate three disconnected components by isolating two neighboring columns (or rows) from the rest of the graph, resulting in three components. This is √ illustrated in Figure 2. To this set of 2 n graphs, we add the independent variables graph consisting only of edges from the field node to all the other nodes. We compared the performance of the PDC and TRW methods 3 4 on a 7 × 7 lattice . Since the exact partition function and marginals can be calculated for this case, we could compare both algorithms to the true values. The MRF parameters were set according to the two following scenarios: 1) Varying Interaction - The field parameters θi were drawn uniformly from U[−0.05, 0.05], and the interaction θij from U[−α, α] where α ∈ {0.2, 0.4, . . . , 2}. This is the setting tested in [11]. 2) Varying Field θi was drawn uniformly from U[−α, α], where α ∈ {0.2, 0.4, . . . , 2} and θij from U[−0.5, 0.5]. For each scenario, we calculated the following measures: 1) Normalized log partition error 1 1 alg − log Z true ). 2) Error in pairwise marginals |E| ij∈E |palg (xi = 1, xj = 1) − 49 (log Z ptrue (xi = 1, xj = 1)|. Pairwise marginals were calculated jointly using the marginal optimality criterion of Section 5. 3) Error in singleton marginals. We calculated the singleton marginals for the innermost node in the lattice (i.e., coordinate [3, 3]), which intuitively should be the most difficult for the planar based algorithm. This marginal was calculated using two partition functions, as explained in Section 5 5 . The same method was used for TRW. The reported error measure is |palg (xi = 1) − ptrue (xi = 1)|. Results were averaged over 40 random trials. Results for the two scenarios and different evaluation measures are given in Figure 3. It can be seen that the partition function bound for PDC is significantly better than TRW for almost all parameter settings, although the difference becomes smaller for large field values. Error for the PDC pairwise 3 TRW and PDC bounds were optimized over both the subgraph parameters and the mixture parameters ρ. In terms of running time, PDC optimization for a fixed value of ρ took about 30 seconds, which is still slower than the TRW message passing implementation. 5 Results using the marginal optimality criterion were worse for PDC, possibly due to its reduced numerical precision. 4 marginals are smaller than those of TRW for all parameter settings. For the singleton parameters, TRW slightly outperforms PDC. This is not surprising since the field is modeled by every spanning tree in the TRW decomposition, whereas in PDC not all the structures model a given field. 7 Discussion We have presented a method for using planar graphs as the basis for approximating non-planar graphs such as planar graphs with external fields. While the restriction to binary variables limits the applicability of our approach, it remains relevant in many important applications, such as coding theory and combinatorial optimization. Moreover, it is always possible to convert a non-binary graphical model to a binary one by introducing additional variables. The resulting graph will typically not be planar, even when the original graph over k−ary variables is. However, the planar decomposition method can then be applied to this non-planar graph. The optimization of the decomposition is carried out explicitly over the planar subgraphs, thus limiting the number of subgraphs that can be used in the approximation. In the TRW method this problem is circumvented since it is possible to implicitly optimize over all spanning trees. The reason this can be done for trees is that the entropy of an MRF over a tree may be written as a function of its marginal variables. We do not know of an equivalent result for planar graphs, and it remains a challenge to find one. It is however possible to combine the planar and tree decompositions into one single bound, which is guaranteed to outperform the tree or planar approximations alone. The planar decomposition idea may in principle be applied to bounding the value of the MAP assignment. However, as in TRW, it can be shown that the solution is not dependent on the decomposition (as long as each edge appears in some structure), and the problem is equivalent to maximizing a linear function over the marginal polytope (which can be done in polynomial time for planar graphs). However, such a decomposition may suggest new message passing algorithms, as in [10]. Acknowledgments The authors acknowledge support from the Defense Advanced Research Projects Agency (Transfer Learning program). Amir Globerson is also supported by the Rothschild Yad-Hanadiv fellowship. The authors also wish to thank Martin Wainwright for providing his TRW code. References [1] F. Barahona. On the computational complexity of ising spin glass models. J. Phys. A., 15(10):3241–3253, 1982. [2] D. P. Bertsekas, editor. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [3] M.M. Deza and M. Laurent. Geometry of Cuts and Metrics. Springe-Verlag, 1997. [4] R. Diestel. Graph Theory. Springer-Verlag, 1997. [5] M.E. Fisher. On the dimer solution of planar ising models. J. Math. Phys., 7:1776–1781, 1966. [6] M.I. Jordan, editor. Learning in graphical models. MIT press, Cambridge, MA, 1998. [7] P.W. Kasteleyn. Dimer statistics and phase transitions. Journal of Math. Physics, 4:287–293, 1963. [8] L. Lovasz and M.D. Plummer. Matching Theory, volume 29 of Annals of discrete mathematics. NorthHolland, New-York, 1986. [9] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 49(5):1120–1146, 2003. [10] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Map estimation via agreement on trees: messagepassing and linear programming. IEEE Trans. on Information Theory, 51(11):1120–1146, 2005. [11] M. J. Wainwright, T. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005. [12] M.J. Wainwright and M.I. Jordan. Graphical models, exponential families, and variational inference. Technical report, UC Berkeley Dept. of Statistics, 2003. [13] J.S. Yedidia, W.T. W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.
5 0.59940481 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
6 0.59865421 65 nips-2006-Denoising and Dimension Reduction in Feature Space
7 0.59674269 150 nips-2006-On Transductive Regression
8 0.5928635 117 nips-2006-Learning on Graph with Laplacian Regularization
9 0.59263605 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
10 0.59053361 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
11 0.58931148 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
12 0.58873999 20 nips-2006-Active learning for misspecified generalized linear models
13 0.5880785 77 nips-2006-Fast Computation of Graph Kernels
14 0.58378828 109 nips-2006-Learnability and the doubling dimension
15 0.58198059 128 nips-2006-Manifold Denoising
16 0.58142585 33 nips-2006-Analysis of Representations for Domain Adaptation
17 0.57967424 61 nips-2006-Convex Repeated Games and Fenchel Duality
18 0.57854515 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
19 0.57772017 21 nips-2006-AdaBoost is Consistent
20 0.57704002 5 nips-2006-A Kernel Method for the Two-Sample-Problem