jmlr jmlr2013 jmlr2013-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
Reference: text
sentIndex sentText sentNum sentScore
1 Dipartimento di Matematica, Università degli Studi di Milano, via Saldini 50, 20133 Milano, Italy. Editor: Shie Mannor Abstract We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. [sent-9, score-0.357]
2 We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. [sent-10, score-1.211]
3 The cutsize is induced by the unknown adversarial labeling of the graph nodes. [sent-11, score-0.826]
4 In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. [sent-12, score-0.291]
5 Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. [sent-13, score-1.074]
6 Keywords: online learning, learning on graphs, graph prediction, random spanning trees 1. [sent-15, score-0.599]
7 Introduction A widespread approach to the solution of classification problems is to represent data sets through a weighted graph where nodes are the data items and edge weights quantify the similarity between pairs of data items. [sent-16, score-0.553]
8 In the sequential version of this problem, nodes are presented in an arbitrary (possibly adversarial) order, and the learner must predict the binary label of each node before observing its true value. [sent-21, score-0.269]
9 An interesting special case of the online problem is the so-called transductive setting, where the entire graph structure (including edge weights) is known in advance. [sent-25, score-0.358]
10 The transductive setting is interesting in that the learner has the chance of reconfiguring the graph before learning starts, so as to make the problem look easier. [sent-26, score-0.292]
11 This data preprocessing can be viewed as a kind of regularization in the context of graph prediction. [sent-27, score-0.2]
12 However, while the number of mistakes is obviously bounded by the number of nodes, the cutsize scales with the number of edges. [sent-32, score-0.586]
13 This naturally led to the idea of solving the prediction problem on a spanning tree of the graph (Cesa-Bianchi et al. [sent-33, score-0.763]
14 Now, since the cutsize of the spanning tree is smaller than that of the original graph, the number of mistakes in predicting the nodes is more tightly controlled. [sent-36, score-1.213]
15 In light of the previous discussion, we can also view the spanning tree as a “maximally regularized” version of the original graph. [sent-37, score-0.506]
16 Since a graph has up to exponentially many spanning trees, which one should be used to maximize the predictive performance? [sent-38, score-0.512]
17 This question can be answered by recalling the adversarial nature of the online setting, where the presentation of nodes and the assignment of labels to them are both arbitrary. [sent-39, score-0.307]
18 This suggests picking a tree at random among all spanning trees of the graph so as to prevent the adversary from concentrating the cutsize on the chosen tree (Cesa-Bianchi et al. [sent-40, score-1.481]
19 Kirchhoff’s equivalence between the effective resistance of an edge and its probability of being included in a random spanning tree allows us to express the expected cutsize of a random spanning tree in a simple form. [sent-42, score-1.666]
20 Namely, as the sum of resistances over all edges in the cut of G induced by the adversarial label assignment. [sent-43, score-0.228]
21 Although the results of Cesa-Bianchi et al. (2009) yield a mistake bound for arbitrary unweighted graphs in terms of the cutsize of a random spanning tree, no general lower bounds are known for online unweighted graph prediction. [sent-45, score-1.435]
22 In this paper we fill this gap, and show that the expected cutsize of a random spanning tree of the graph delivers a convenient parametrization that captures the hardness of the graph learning problem in the general weighted case. [sent-47, score-1.454]
23 Given any weighted graph, we prove that any online prediction algorithm must err on a number of nodes which is at least as big as the expected cutsize of the graph’s random spanning tree (which is defined in terms of the graph weights). [sent-48, score-1.478]
24 Moreover, we exhibit a simple randomized algorithm achieving in expectation the optimal mistake bound to within logarithmic factors. [sent-49, score-0.208]
25 This bound applies to any sufficiently connected weighted graph whose weighted cutsize is not an overwhelming fraction of the total weight. [sent-52, score-0.879]
26 As in Cesa-Bianchi et al. (2009), our algorithm first extracts a random spanning tree of the original graph. [sent-54, score-0.506]
27 Then, it predicts all nodes of this tree using a generalization of the method proposed by Herbster et al. [sent-55, score-0.315]
28 Our tree prediction procedure is extremely efficient: it only requires constant amortized time per prediction and space linear in the number of nodes. [sent-57, score-0.361]
29 As in Herbster et al. (2009a), our algorithm first linearizes the tree, and then operates on the resulting line graph via a nearest neighbor rule. [sent-61, score-0.2]
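For concreteness, here is a minimal Python sketch of this two-step pipeline. It is only an illustration of the idea, not the WTA algorithm of this paper: the depth-first linearization below ignores edge weights, and the nearest-neighbor rule measures distance simply by position on the line.

from collections import defaultdict

def linearize(tree_edges, root=0):
    """Depth-first visit of an undirected tree; returns the nodes in visit order."""
    adj = defaultdict(list)
    for u, v in tree_edges:
        adj[u].append(v)
        adj[v].append(u)
    order, seen, stack = [], {root}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return order

def nn_predict(position, revealed):
    """Predict the label at `position` on the line by copying the label of the
    closest position whose label has already been revealed."""
    if not revealed:
        return 1                                  # arbitrary default before any feedback
    closest = min(revealed, key=lambda p: abs(p - position))
    return revealed[closest]

# toy run on a path-shaped tree 0-1-2-3-4
order = linearize([(0, 1), (1, 2), (2, 3), (3, 4)])
pos = {node: i for i, node in enumerate(order)}   # node -> position on the line
true_labels = {0: 1, 1: 1, 2: -1, 3: -1, 4: -1}
revealed, mistakes = {}, 0
for node in [2, 0, 4, 1, 3]:                      # adversarial presentation order
    y_hat = nn_predict(pos[node], revealed)
    mistakes += int(y_hat != true_labels[node])
    revealed[pos[node]] = true_labels[node]       # true label is revealed after predicting
print("mistakes:", mistakes)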
30 In order to provide convincing empirical evidence, we also present an experimental evaluation of our method compared to other algorithms recently proposed in the literature on graph prediction. [sent-65, score-0.2]
31 The two tree-based algorithms (ours and the Perceptron algorithm) have been tested using spanning trees generated in various ways, including committees of spanning trees aggregated by majority votes. [sent-71, score-0.71]
32 In Section 3 we prove the general lower bound relating the mistakes of any prediction algorithm to the expected cutsize of a random spanning tree of the weighted graph. [sent-76, score-1.27]
33 In the subsequent section, we present our prediction algorithm WTA (Weighted Tree Algorithm), along with a detailed mistake bound analysis restricted to weighted trees. [sent-77, score-0.286]
34 This analysis is extended to weighted graphs in Section 5, where we provide an upper bound matching the lower bound up to log factors on any sufficiently connected graph. [sent-78, score-0.237]
35 1 Preliminaries and Basic Notation Let G = (V, E, W) be an undirected, connected, and weighted graph with n nodes and positive edge weights w_{i,j} > 0 for (i, j) ∈ E. [sent-84, score-0.532]
36 The game is parameterized by the graph G = (V, E,W ). [sent-91, score-0.2]
37 For t = 1, . . . , n the adversary chooses the next node i_t in the permutation of V, and presents it to the learner for the prediction of the associated label y_{i_t}. [sent-97, score-0.35]
38 Note that while the adversarial choice of the permutation can depend on the algorithm’s randomization, the choice of the labeling is oblivious to it. [sent-100, score-0.21]
39 In other words, the learner uses randomization to fend off the adversarial choice of labels, whereas it is fully deterministic against the adversarial choice of the permutation. [sent-101, score-0.237]
40 The requirement that the adversary is fully oblivious when choosing labels is then dictated by the fact that the randomized learners considered in this paper make all their random choices at the beginning of the prediction process (i.e., before the first node is presented). [sent-102, score-0.286]
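The protocol just described amounts to the following simulation loop (a sketch for illustration; the predict and update callbacks and the toy majority-vote learner are placeholders, not the algorithms studied in this paper):

import random

def play(adversary_order, adversary_labels, predict, update, seed=0):
    """Online node-prediction game with an oblivious adversary: the labeling is
    fixed before the game starts, and all of the learner's random bits are drawn
    up front, before the first node is presented."""
    rng = random.Random(seed)                 # the learner's only source of randomness
    state = {"rng": rng, "seen": {}}
    mistakes = 0
    for node in adversary_order:              # node permutation chosen by the adversary
        y_hat = predict(state, node)          # predict before observing the true label
        y_true = adversary_labels[node]       # the true label is then revealed
        mistakes += int(y_hat != y_true)
        update(state, node, y_true)           # feedback for later rounds
    return mistakes

# toy learner (a placeholder): predict the majority of the labels seen so far
predict = lambda s, v: 1 if sum(s["seen"].values()) >= 0 else -1
update = lambda s, v, y: s["seen"].__setitem__(v, y)
print(play([2, 0, 1, 3], {0: 1, 1: 1, 2: -1, 3: -1}, predict, update))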
41 For this reason, our analysis of graph prediction algorithms bounds from above the number of prediction mistakes in terms of appropriate notions of graph label regularity. [sent-106, score-0.707]
42 A standard notion of label regularity is the cutsize of a labeled graph, defined as follows. [sent-107, score-0.483]
43 A φ-edge of a labeled graph (G, y) is any edge (i, j) such that y_i ≠ y_j. [sent-108, score-0.281]
44 The quantity Φ_G(y) = |E^φ| is the cutsize of (G, y), that is, the number of φ-edges in E^φ (independent of the edge weights). [sent-111, score-0.519]
45 The weighted cutsize of (G, y) is defined by Φ^W_G(y) = ∑_{(i,j)∈E^φ} w_{i,j}. [sent-112, score-0.527]
46 For a fixed (G, y), we denote by r^W_{i,j} the effective resistance between nodes i and j of G. [sent-113, score-0.256]
47 In the interpretation of the graph as an electric network, where the weights w_{i,j} are the edge conductances, the effective resistance r^W_{i,j} is the voltage between i and j when a unit current flow is maintained through them. [sent-114, score-0.457]
48 For (i, j) ∈ E, let also p_{i,j} = w_{i,j} r^W_{i,j} be the probability that (i, j) belongs to a random spanning tree T —see, for example, the monograph of Lyons and Peres (2009). [sent-115, score-0.528]
49 Then we have E Φ_T(y) = ∑_{(i,j)∈E^φ} p_{i,j} = ∑_{(i,j)∈E^φ} w_{i,j} r^W_{i,j}  (1), where the expectation E is over the random choice of spanning tree T. [sent-116, score-0.506]
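For small graphs these quantities can be computed directly. The sketch below (an illustration, not code from the paper) obtains the effective resistances from the pseudoinverse of the graph Laplacian, using r^W_{i,j} = L^+_{ii} + L^+_{jj} − 2 L^+_{ij}, and then evaluates Φ_G(y), Φ^W_G(y), and E Φ_T(y) via equation (1):

import numpy as np

def cutsizes(n, weighted_edges, y):
    """Return (cutsize, weighted cutsize, expected cutsize of a random spanning tree).

    weighted_edges: list of (i, j, w) with w > 0; y: array of +/-1 node labels."""
    L = np.zeros((n, n))
    for i, j, w in weighted_edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    Lp = np.linalg.pinv(L)                        # Moore-Penrose pseudoinverse of the Laplacian
    phi = phi_w = phi_rst = 0.0
    for i, j, w in weighted_edges:
        if y[i] != y[j]:                          # phi-edge: its endpoints disagree
            r = Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]
            phi += 1                              # Phi_G(y)
            phi_w += w                            # Phi^W_G(y)
            phi_rst += w * r                      # p_ij = w_ij * r_ij, summed as in (1)
    return phi, phi_w, phi_rst

# toy example: a weighted triangle with one disagreeing node
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 4.0)]
print(cutsizes(3, edges, np.array([1, 1, -1])))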
50 Hence the ratio E Φ_T(y)/(n−1) ∈ [0, 1] provides a density-independent measure of the cutsize in G, and even allows one to compare labelings on different graphs. [sent-120, score-0.438]
51 Now contrast E Φ_T(y) to the more standard weighted cutsize measure Φ^W_G(y). [sent-121, score-0.527]
52 Therefore, the effect of the large weight w_{i,j} may often be compensated by the small probability of including (i, j) in the random spanning tree. [sent-131, score-0.338]
53 A different way of taking into account graph connectivity is provided by the covering ball approach taken by Herbster (2008) and Herbster and Lever (2009)—see the next section. [sent-133, score-0.2]
54 If the two labels y_1 and y_2 are such that y_1 ≠ y_2, then the contribution of the edges on the left clique C_1 to the cutsizes Φ_G(y) and Φ^W_G(y) must be large. [sent-136, score-0.201]
55 However, since the probability of including each edge of C_1 in a random spanning tree T is O(1/|V|), C_1's contribution to E Φ_T(y) is |V| times smaller than Φ_{C_1}(y) = Φ^W_{C_1}(y). [sent-137, score-0.624]
56 Hence, the contribution of (5, 6) to E Φ_T(y) is small because the large weight of (5, 6) is offset by the fact that nodes 5 and 6 are strongly connected (i.e., the effective resistance r^W_{5,6} is small). [sent-143, score-0.215]
57 The resulting mistake bound is of the form Φ^W_G(y) D^W_G, where D^W_G = max_{i,j} r^W_{i,j} is the resistance diameter of G. [sent-156, score-0.379]
58 The idea of using a spanning tree to reduce the cutsize of G has been investigated by Herbster et al. [sent-160, score-0.944]
59 (2009b), where the graph Perceptron algorithm is applied to a spanning tree T of G. [sent-161, score-0.706]
60 The resulting mistake bound is of the form Φ^W_T(y) D^W_T, that is, the graph Perceptron bound applied to tree T. [sent-162, score-0.566]
61 Since Φ^W_T(y) ≤ Φ^W_G(y), this bound has a smaller cutsize than the previous one. [sent-163, score-0.47]
62 Herbster et al. (2009b) suggest applying the graph Perceptron algorithm to the spanning tree T with smallest geodesic diameter. [sent-167, score-0.736]
63 The geodesic diameter of a weighted graph G is defined by ∆^W_G = max_{i,j} min_{Π_{i,j}} ∑_{(r,s)∈Π_{i,j}} 1/w_{r,s}, where the minimum is over all paths Π_{i,j} between i and j. [sent-168, score-0.443]
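For small graphs the geodesic diameter can be computed with an all-pairs shortest-path routine on edge lengths 1/w_{r,s}; the following sketch (an illustration, assuming SciPy is available and the graph is connected) does exactly that:

import numpy as np
from scipy.sparse.csgraph import shortest_path

def geodesic_diameter(n, weighted_edges):
    """Geodesic diameter of a connected weighted graph: the maximum over node
    pairs of the minimum, over connecting paths, of the summed edge lengths 1/w."""
    length = np.zeros((n, n))
    for r, s, w in weighted_edges:
        length[r, s] = length[s, r] = 1.0 / w     # zero entries are treated as non-edges
    dist = shortest_path(length, directed=False)  # all-pairs shortest paths
    return dist.max()

# toy example: a weighted triangle
print(geodesic_diameter(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 4.0)]))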
64 The reason behind this choice of T is that, for the spanning tree T with smallest geodesic diameter, it holds that D^W_T ≤ 2∆^W_G. [sent-169, score-0.536]
65 However, on the one hand D^W_G ≤ ∆^W_G, so there is no guarantee that D^W_T = O(D^W_G), and on the other hand the adversary may still concentrate all φ-edges on the chosen tree T, so there is no guarantee that Φ^W_T(y) remains small either. [sent-170, score-0.294]
66 After reducing the graph to a spanning tree T, the tree is linearized via a depth-first visit. [sent-173, score-0.9]
67 Another natural parametrization for the labels of a weighted graph that takes the graph structure into account is clusterability, that is, the extent to which the graph nodes can be covered by a few balls of small resistance diameter. [sent-178, score-1.049]
68 The number of mistakes has a bound of the form min_{ρ>0} ( N(G, ρ) + Φ^W_G(y) ρ )  (2), where N(G, ρ) is the smallest number of balls of resistance diameter ρ it takes to cover the nodes of G. [sent-180, score-0.566]
69 Note that the graph Perceptron bound is recovered when ρ = D^W_G. [sent-181, score-0.232]
70 Moreover, observe that, unlike graph Perceptron's, bound (2) is never vacuous, as it holds uniformly for all covers of G (even the one made up of singletons, corresponding to ρ → 0). [sent-182, score-0.232]
71 This “support tree” helps in keeping the diameter of G as small as possible, for example, logarithmic in the number of nodes n. [sent-185, score-0.262]
72 The resulting prediction algorithm is again a combination of a Perceptron-like algorithm and NN, and the corresponding number of mistakes is the minimum over two earlier bounds: an NN-based bound of the form Φ_G(y)(log n)^2 and an unweighted version of bound (2). [sent-186, score-0.393]
73 Generally speaking, clusterability and resistance-weighted cutsize E ΦT (y) exploit the graph structure in different ways. [sent-187, score-0.673]
74 Consider, for instance, a barbell graph made up of two m-cliques joined by k unweighted φ-edges with no endpoints in common (hence k ≤ m). [sent-188, score-0.407]
75 On the other hand, if k gets close to m the resistance diameter of the graph decreases, and (2) becomes a constant. [sent-192, score-0.439]
76 In particular, the probability that a φ-edge is included in the random spanning tree T is upper bounded by (3m−1)/(m(m+1)), that is, E Φ_T(y) → 3 when m grows large. [sent-194, score-0.587]
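This behaviour is easy to check numerically. The sketch below (an illustration; it recomputes effective resistances from the Laplacian pseudoinverse as in the earlier snippet) builds the unweighted barbell graph, labels the two cliques with opposite signs so that the k joining edges are the φ-edges, and evaluates E Φ_T(y):

import itertools
import numpy as np

def expected_rst_cutsize(n, edges, y):
    """E Phi_T(y) for unit edge weights: sum of effective resistances of phi-edges."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1
        L[j, j] += 1
        L[i, j] -= 1
        L[j, i] -= 1
    Lp = np.linalg.pinv(L)
    return sum(Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]
               for i, j in edges if y[i] != y[j])

def barbell(m, k):
    """Two m-cliques joined by k edges with no endpoints in common (k <= m)."""
    left, right = range(m), range(m, 2 * m)
    edges = list(itertools.combinations(left, 2)) + list(itertools.combinations(right, 2))
    edges += [(i, m + i) for i in range(k)]       # the k joining edges form the cut
    return 2 * m, edges

m = k = 30
n, edges = barbell(m, k)
y = np.array([1] * m + [-1] * m)                  # opposite labels on the two cliques
print(expected_rst_cutsize(n, edges, y))          # stays bounded (about 3) even though k = m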
77 When the graph at hand has a large diameter, for example, an m-line graph connected to an m-clique (this is sometimes called a “lollipop” graph) the gap between the covering-based bound (2) and E ΦT (y) is magnified. [sent-196, score-0.463]
78 In order to trade off the contribution of the cutsize Φ^W_G and the resistance diameter D^W_G, the authors develop a notion of p-norm resistance. [sent-201, score-0.714]
79 Roughly speaking, one can select the dual norm parameter of the algorithm to obtain a logarithmic contribution from the resistance diameter at the cost of squaring the contribution due to the cutsize. [sent-203, score-0.35]
80 For instance, in the unweighted barbell graph mentioned above, selecting the norm appropriately leads to a bound which is constant even when k ≪ m. [sent-205, score-0.409]
81 Our graph prediction algorithm is based on a random spanning tree of the original graph. [sent-213, score-0.763]
82 The problem of drawing a random spanning tree of an arbitrary graph has a long history—see, for example, the monograph by Lyons and Peres (2009). [sent-214, score-0.728]
83 In the unweighted case, a random spanning tree can be sampled with a random walk in expected time O (n ln n) for “most” graphs, as shown by Broder (1989). [sent-215, score-0.65]
84 In the weighted case, the above methods can take longer due to the hardness of reaching, via a random walk, portions of the graph which are connected only via light-weighted edges. [sent-219, score-0.341]
85 To sidestep this issue, in our experiments we tested a viable fast approximation where weights are disregarded when building the spanning tree, and only used at prediction time. [sent-220, score-0.41]
86 Finally, the space complexity for generating a random spanning tree is always linear in the graph size. [sent-221, score-0.706]
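A minimal sketch of the random-walk approach is the Aldous-Broder sampler below, run on the unweighted skeleton of the graph; this matches the approximation mentioned above, in which edge weights are ignored while building the tree, and it uses space linear in the graph size. It is an illustration, not the sampler used in the experiments.

import random
from collections import defaultdict

def random_spanning_tree(n, edges, seed=None):
    """Aldous-Broder sampler: run a simple random walk until every node has been
    visited; the edge used to enter each node for the first time forms a spanning
    tree drawn uniformly at random."""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    current = rng.randrange(n)
    visited = {current}
    tree = []
    while len(visited) < n:
        nxt = rng.choice(adj[current])
        if nxt not in visited:
            visited.add(nxt)
            tree.append((current, nxt))           # first-entrance edge
        current = nxt
    return tree

# toy usage: a 4-cycle with a chord, plus a pendant node
print(random_spanning_tree(5, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (3, 4)], seed=0))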
87 To conclude this section, it is worth mentioning that, although we exploit random spanning trees to reduce the cutsize, similar approaches can also be used to approximate the cutsize of a weighted graph by sparsification—see, for example, the work of Spielman and Srivastava (2008). [sent-222, score-1.082]
88 Numbers on edges are the probabilities p_{i,j} of those edges being included in a random spanning tree for the weighted graph under consideration. [sent-224, score-0.937]
89 The adversary assigns a random label to each node in S thus forcing |S|/2 mistakes in expectation. [sent-227, score-0.337]
90 Then, it labels all nodes in V \ S with a unique label, chosen in such a way as to minimize the cutsize consistent with the labels previously assigned to the nodes of S. [sent-228, score-0.786]
91 Because the resulting graphs are not as sparse as spanning trees, we do not currently see how to use those results. [sent-229, score-0.365]
92 Theorem 1 Let G = (V, E, W) be a weighted undirected graph with n nodes and weights w_{i,j} > 0 for (i, j) ∈ E. [sent-233, score-0.451]
93 Then for all K ≤ n there exists a randomized labeling y of G such that for all (deterministic or randomized) algorithms A, the expected number of prediction mistakes made by A is at least K/2, while E ΦT (y) < K. [sent-234, score-0.312]
94 By (1), p_{i,j} is the probability that edge (i, j) belongs to a random spanning tree T of G. [sent-236, score-0.587]
95 This guarantees that, no matter what, the algorithm A will make on average K/2 mistakes on the nodes in S. [sent-241, score-0.269]
96 We now show that the weighted cutsize Φ^P_G(y) of this labeling y is less than K, independent of the labels of the nodes in S. [sent-244, score-0.777]
97 Since the nodes in V \ S all have the same label, the φ-edges induced by this labeling can only connect either two nodes in S or one node in S and one node in V \ S. [sent-245, score-0.429]
98 Hence Φ^P_G(y) can be written as Φ^P_G(y) = Φ^{P,int}_G(y) + Φ^{P,ext}_G(y), where Φ^{P,int}_G(y) is the cutsize contribution within S, and Φ^{P,ext}_G(y) is the one from edges between S and V \ S. [sent-246, score-0.546]
99 Moreover, from the way the labels of nodes in V \ S are selected, it follows that Φ^{P,ext}_G(y) ≤ P^{ext}_S / 2. [sent-250, score-0.228]
100 Finally, ∑_{i∈S} P_i = 2 P^{int}_S + P^{ext}_S holds, since each edge connecting nodes in S is counted twice in the sum ∑_{i∈S} P_i. [sent-251, score-0.202]
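To see how these pieces yield the claim E Φ_T(y) = Φ^P_G(y) < K, here is a sketch of the remaining arithmetic under assumptions that are consistent with the identities above but not all visible in this extract: P_i denotes the resistance-weighted degree ∑_{j:(i,j)∈E} p_{i,j}, P^{int}_S and P^{ext}_S denote the sums of p_{i,j} over edges with both endpoints in S and with exactly one endpoint in S respectively, and S is taken to be the K nodes with the smallest P_i. Then

Φ^P_G(y) = Φ^{P,int}_G(y) + Φ^{P,ext}_G(y) ≤ P^{int}_S + P^{ext}_S/2 = (1/2)(2 P^{int}_S + P^{ext}_S) = (1/2) ∑_{i∈S} P_i ≤ (1/2)(K/n) ∑_{i∈V} P_i = (1/2)(K/n) · 2(n−1) < K ,

where the first inequality combines Φ^{P,int}_G(y) ≤ P^{int}_S (the φ-edges inside S are a subset of all edges inside S) with the bound Φ^{P,ext}_G(y) ≤ P^{ext}_S/2 above, the second inequality uses the choice of S as the K nodes with smallest P_i, and the final step uses ∑_{i∈V} P_i = 2 ∑_{(i,j)∈E} p_{i,j} = 2(n−1), since the inclusion probabilities p_{i,j} sum to the n−1 edges of a spanning tree.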
wordName wordTfidf (topN-words)
[('herbster', 0.49), ('cutsize', 0.438), ('spanning', 0.312), ('graph', 0.2), ('tree', 0.194), ('rwj', 0.158), ('mistakes', 0.148), ('perceptron', 0.135), ('dw', 0.135), ('resistance', 0.135), ('unweighted', 0.124), ('nodes', 0.121), ('mistake', 0.108), ('pi', 0.107), ('diameter', 0.104), ('adversary', 0.1), ('lever', 0.09), ('milano', 0.09), ('weighted', 0.089), ('adversarial', 0.089), ('edge', 0.081), ('labeling', 0.076), ('edges', 0.071), ('ps', 0.07), ('appella', 0.07), ('entile', 0.07), ('ianchi', 0.07), ('itale', 0.07), ('panning', 0.07), ('wi', 0.068), ('eighted', 0.06), ('learner', 0.059), ('prediction', 0.057), ('ext', 0.054), ('rees', 0.054), ('graphs', 0.053), ('labels', 0.053), ('amortized', 0.053), ('barbell', 0.053), ('fabio', 0.053), ('peres', 0.053), ('unimi', 0.053), ('vitale', 0.053), ('esa', 0.05), ('andom', 0.05), ('rediction', 0.05), ('raphs', 0.047), ('pontil', 0.046), ('label', 0.045), ('claudio', 0.045), ('dipartimento', 0.045), ('lyons', 0.045), ('online', 0.044), ('node', 0.044), ('italy', 0.044), ('int', 0.044), ('trees', 0.043), ('weights', 0.041), ('clique', 0.04), ('degli', 0.04), ('studi', 0.04), ('logarithmic', 0.037), ('gentile', 0.037), ('giovanni', 0.037), ('contribution', 0.037), ('clusterability', 0.035), ('comelico', 0.035), ('zappella', 0.035), ('transductive', 0.033), ('bound', 0.032), ('randomized', 0.031), ('connected', 0.031), ('endpoints', 0.03), ('lg', 0.03), ('informatica', 0.03), ('rw', 0.03), ('geodesic', 0.03), ('nn', 0.03), ('universit', 0.028), ('di', 0.027), ('balls', 0.026), ('weight', 0.026), ('yit', 0.025), ('parametrization', 0.025), ('propagation', 0.025), ('oblivious', 0.025), ('vacuous', 0.023), ('err', 0.023), ('nicol', 0.023), ('induced', 0.023), ('baselines', 0.022), ('monograph', 0.022), ('hardness', 0.021), ('quantify', 0.021), ('learners', 0.02), ('walk', 0.02), ('preliminaries', 0.02), ('permutation', 0.02), ('zhu', 0.02), ('paths', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
2 0.097870983 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
3 0.091043726 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
Author: Nicholas Ruozzi, Sekhar Tatikonda
Abstract: Gaussian belief propagation (GaBP) is an iterative algorithm for computing the mean (and variances) of a multivariate Gaussian distribution, or equivalently, the minimum of a multivariate positive definite quadratic function. Sufficient conditions, such as walk-summability, that guarantee the convergence and correctness of GaBP are known, but GaBP may fail to converge to the correct solution given an arbitrary positive definite covariance matrix. As was observed by Malioutov et al. (2006), the GaBP algorithm fails to converge if the computation trees produced by the algorithm are not positive definite. In this work, we will show that the failure modes of the GaBP algorithm can be understood via graph covers, and we prove that a parameterized generalization of the min-sum algorithm can be used to ensure that the computation trees remain positive definite whenever the input matrix is positive definite. We demonstrate that the resulting algorithm is closely related to other iterative schemes for quadratic minimization such as the Gauss-Seidel and Jacobi algorithms. Finally, we observe, empirically, that there always exists a choice of parameters such that the above generalization of the GaBP algorithm converges. Keywords: belief propagation, Gaussian graphical models, graph covers
4 0.074855 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi
Abstract: In this paper we establish that the Lovász ϑ function on a graph can be restated as a kernel learning problem. We introduce the notion of SVM−ϑ graphs, on which the Lovász ϑ function can be approximated well by a Support Vector Machine (SVM). We show that Erdős-Rényi random G(n, p) graphs are SVM−ϑ graphs for log^4 n / n ≤ p < 1. Even if we embed a large clique of size Θ(√(np/(1−p))) in a G(n, p) graph the resultant graph still remains an SVM−ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings, which extend the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel Learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and the common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph
5 0.071172342 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
Author: Matthew J. Urry, Peter Sollich
Abstract: We consider learning on graphs, guided by kernels that encode similarity between vertices. Our focus is on random walk kernels, the analogues of squared exponential kernels in Euclidean spaces. We show that on large, locally treelike graphs these have some counter-intuitive properties, specifically in the limit of large kernel lengthscales. We consider using these kernels as covariance functions of Gaussian processes. In this situation one typically scales the prior globally to normalise the average of the prior variance across vertices. We demonstrate that, in contrast to the Euclidean case, this generically leads to significant variation in the prior variance across vertices, which is undesirable from a probabilistic modelling point of view. We suggest the random walk kernel should be normalised locally, so that each vertex has the same prior variance, and analyse the consequences of this by studying learning curves for Gaussian process regression. Numerical calculations as well as novel theoretical predictions for the learning curves using belief propagation show that one obtains distinctly different probabilistic models depending on the choice of normalisation. Our method for predicting the learning curves using belief propagation is significantly more accurate than previous approximations and should become exact in the limit of large random graphs. Keywords: Gaussian process, generalisation error, learning curve, cavity method, belief propagation, graph, random walk kernel
6 0.069926068 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
7 0.05976351 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
8 0.056474876 120 jmlr-2013-Variational Algorithms for Marginal MAP
9 0.051025957 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
10 0.050478958 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
12 0.040734332 8 jmlr-2013-A Theory of Multiclass Boosting
13 0.040727854 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
14 0.040290002 82 jmlr-2013-Optimally Fuzzy Temporal Memory
15 0.039732464 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
16 0.037779249 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
17 0.037308134 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
18 0.034871012 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
19 0.034168649 41 jmlr-2013-Experiment Selection for Causal Discovery
20 0.031431504 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
topicId topicWeight
[(0, -0.152), (1, 0.053), (2, -0.008), (3, 0.024), (4, 0.106), (5, 0.239), (6, 0.034), (7, 0.002), (8, -0.165), (9, -0.029), (10, 0.005), (11, 0.063), (12, -0.016), (13, 0.12), (14, 0.108), (15, 0.215), (16, -0.008), (17, 0.139), (18, 0.126), (19, -0.111), (20, 0.063), (21, 0.023), (22, 0.088), (23, -0.203), (24, 0.076), (25, 0.001), (26, -0.065), (27, -0.012), (28, 0.022), (29, -0.004), (30, -0.008), (31, -0.149), (32, -0.067), (33, 0.026), (34, -0.035), (35, -0.132), (36, 0.051), (37, 0.02), (38, 0.051), (39, -0.021), (40, 0.022), (41, -0.08), (42, 0.006), (43, 0.019), (44, 0.056), (45, 0.057), (46, 0.044), (47, 0.013), (48, 0.006), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.96631277 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
2 0.66734672 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
Author: Nicholas Ruozzi, Sekhar Tatikonda
Abstract: Gaussian belief propagation (GaBP) is an iterative algorithm for computing the mean (and variances) of a multivariate Gaussian distribution, or equivalently, the minimum of a multivariate positive definite quadratic function. Sufficient conditions, such as walk-summability, that guarantee the convergence and correctness of GaBP are known, but GaBP may fail to converge to the correct solution given an arbitrary positive definite covariance matrix. As was observed by Malioutov et al. (2006), the GaBP algorithm fails to converge if the computation trees produced by the algorithm are not positive definite. In this work, we will show that the failure modes of the GaBP algorithm can be understood via graph covers, and we prove that a parameterized generalization of the min-sum algorithm can be used to ensure that the computation trees remain positive definite whenever the input matrix is positive definite. We demonstrate that the resulting algorithm is closely related to other iterative schemes for quadratic minimization such as the Gauss-Seidel and Jacobi algorithms. Finally, we observe, empirically, that there always exists a choice of parameters such that the above generalization of the GaBP algorithm converges. Keywords: belief propagation, Gaussian graphical models, graph covers
3 0.57715607 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
4 0.52252299 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
Author: Matthew J. Urry, Peter Sollich
Abstract: We consider learning on graphs, guided by kernels that encode similarity between vertices. Our focus is on random walk kernels, the analogues of squared exponential kernels in Euclidean spaces. We show that on large, locally treelike graphs these have some counter-intuitive properties, specifically in the limit of large kernel lengthscales. We consider using these kernels as covariance functions of Gaussian processes. In this situation one typically scales the prior globally to normalise the average of the prior variance across vertices. We demonstrate that, in contrast to the Euclidean case, this generically leads to significant variation in the prior variance across vertices, which is undesirable from a probabilistic modelling point of view. We suggest the random walk kernel should be normalised locally, so that each vertex has the same prior variance, and analyse the consequences of this by studying learning curves for Gaussian process regression. Numerical calculations as well as novel theoretical predictions for the learning curves using belief propagation show that one obtains distinctly different probabilistic models depending on the choice of normalisation. Our method for predicting the learning curves using belief propagation is significantly more accurate than previous approximations and should become exact in the limit of large random graphs. Keywords: Gaussian process, generalisation error, learning curve, cavity method, belief propagation, graph, random walk kernel
5 0.51446295 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
Author: Vinay Jethava, Anders Martinsson, Chiranjib Bhattacharyya, Devdatt Dubhashi
Abstract: In this paper we establish that the Lovász ϑ function on a graph can be restated as a kernel learning problem. We introduce the notion of SVM−ϑ graphs, on which the Lovász ϑ function can be approximated well by a Support Vector Machine (SVM). We show that Erdős-Rényi random G(n, p) graphs are SVM−ϑ graphs for log^4 n / n ≤ p < 1. Even if we embed a large clique of size Θ(√(np/(1−p))) in a G(n, p) graph the resultant graph still remains an SVM−ϑ graph. This immediately suggests an SVM based algorithm for recovering a large planted clique in random graphs. Associated with the ϑ function is the notion of orthogonal labellings. We introduce common orthogonal labellings, which extend the idea of orthogonal labellings to multiple graphs. This allows us to propose a Multiple Kernel Learning (MKL) based solution which is capable of identifying a large common dense subgraph in multiple graphs. Both in the planted clique case and the common subgraph detection problem the proposed solutions beat the state of the art by an order of magnitude. Keywords: orthogonal labellings of graphs, planted cliques, random graphs, common dense subgraph
6 0.49272597 91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
7 0.4449563 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
8 0.36825901 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
9 0.35854852 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
10 0.32730255 94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
11 0.3008841 120 jmlr-2013-Variational Algorithms for Marginal MAP
13 0.28777388 82 jmlr-2013-Optimally Fuzzy Temporal Memory
14 0.24255702 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
15 0.22328953 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
16 0.21401088 8 jmlr-2013-A Theory of Multiclass Boosting
17 0.1952786 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
18 0.19433323 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.18311711 41 jmlr-2013-Experiment Selection for Causal Discovery
20 0.16803385 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
topicId topicWeight
[(0, 0.016), (5, 0.138), (6, 0.077), (10, 0.041), (11, 0.429), (20, 0.011), (23, 0.033), (68, 0.034), (70, 0.038), (75, 0.022), (85, 0.015), (87, 0.013), (89, 0.016), (90, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.68580788 92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella
Abstract: We investigate the problem of sequentially predicting the binary labels on the nodes of an arbitrary weighted graph. We show that, under a suitable parametrization of the problem, the optimal number of prediction mistakes can be characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving in expectation the optimal mistake bound on any polynomially connected weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant expected amortized time and linear space. Experiments on real-world data sets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being generally faster in practice. Keywords: online learning, learning on graphs, graph prediction, random spanning trees
2 0.36118308 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
3 0.35895488 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.35421196 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
5 0.3532429 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
Author: Jun Wang, Tony Jebara, Shih-Fu Chang
Abstract: Graph-based semi-supervised learning (SSL) methods play an increasingly important role in practical machine learning systems, particularly in agnostic settings when no parametric information or other prior knowledge is available about the data distribution. Given the constructed graph represented by a weight matrix, transductive inference is used to propagate known labels to predict the values of all unlabeled vertices. Designing a robust label diffusion algorithm for such graphs is a widely studied problem and various methods have recently been suggested. Many of these can be formalized as regularized function estimation through the minimization of a quadratic cost. However, most existing label diffusion methods minimize a univariate cost with the classification function as the only variable of interest. Since the observed labels seed the diffusion process, such univariate frameworks are extremely sensitive to the initial label choice and any label noise. To alleviate the dependency on the initial observed labels, this article proposes a bivariate formulation for graph-based SSL, where both the binary label information and a continuous classification function are arguments of the optimization. This bivariate formulation is shown to be equivalent to a linearly constrained Max-Cut problem. Finally an efficient solution via greedy gradient Max-Cut (GGMC) is derived which gradually assigns unlabeled vertices to each class with minimum connectivity. Once convergence guarantees are established, this greedy Max-Cut based SSL is applied on both artificial and standard benchmark data sets where it obtains superior classification accuracy compared to existing state-of-the-art SSL methods. Moreover, GGMC shows robustness with respect to the graph construction method and maintains high accuracy over extensive experiments with various edge linking and weighting schemes. Keywords: graph transduction, semi-supervised learning, bivariate formulation, mixed integer programming, greedy
7 0.35056496 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.34824654 106 jmlr-2013-Stationary-Sparse Causality Network Learning
9 0.34673131 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
10 0.34602037 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.34585053 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
12 0.34439519 114 jmlr-2013-The Rate of Convergence of AdaBoost
13 0.34419197 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.34399459 15 jmlr-2013-Bayesian Canonical Correlation Analysis
16 0.34254619 120 jmlr-2013-Variational Algorithms for Marginal MAP
17 0.34151471 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
18 0.34039158 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
19 0.34015849 41 jmlr-2013-Experiment Selection for Causal Discovery
20 0.34003472 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning