nips nips2008 nips2008-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. [sent-7, score-0.799]
2 We approximate the graph with a spanning tree and then we predict with the kernel perceptron. [sent-9, score-0.749]
3 We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. [sent-10, score-0.94]
4 The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. [sent-11, score-0.252]
5 We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. [sent-12, score-0.438]
6 The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. [sent-13, score-0.309]
7 1 Introduction Classification methods which rely upon the graph Laplacian (see [3, 20, 13] and references therein) have proven to be useful for semi-supervised learning. [sent-14, score-0.248]
8 These methods reduce to the problem of labeling a graph whose vertices are associated with the data points and whose edges encode the similarity between pairs of data points. [sent-16, score-0.547]
9 The labeling of the graph can be achieved either in a batch [3, 20] or in an online manner [13]. [sent-17, score-0.284]
10 If an n-vertex tree is given, our method requires an O(n) initialization step and after that any m × m block of the pseudoinverse of the Laplacian may be computed in O(m2 + mS) time, where S is the structural diameter of the tree. [sent-21, score-0.671]
11 The pseudoinverse of the Laplacian may then be used as a kernel for a variety of label prediction methods. [sent-22, score-0.242]
12 If a generic graph is given, we first approximate it with a tree and then run our method on the tree. [sent-23, score-0.531]
13 The use of a minimum spanning tree and shortest path tree is discussed. [sent-24, score-0.951]
14 It is important to note that prediction is also possible using the graph Laplacian directly, without computing its pseudoinverse. [sent-25, score-0.289]
15 However, computation via the graph kernel allows multiple prediction problems on the same graph to be solved more efficiently. [sent-27, score-0.618]
16 To illustrate the advantage of our approach consider the case in which we are provided with a small subset of labeled vertices of a large graph and we wish to predict the label of a different subset of p vertices. [sent-29, score-0.482]
17 If the graph is a tree the total time required to predict with the kernel perceptron using our method will be O(n + m2 + mS). [sent-33, score-0.645]
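To make the prediction step concrete, the following is a minimal illustrative sketch of a kernel perceptron run over a precomputed Gram matrix such as the m × m pseudoinverse block; the function names and the plain-numpy formulation are ours, not the paper's exact training protocol.

```python
import numpy as np

def kernel_perceptron(K, y, epochs=10):
    """Train a kernel perceptron over a precomputed Gram matrix K (m x m)
    and labels y in {-1, +1}; returns the dual coefficients alpha."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for t in range(m):
            margin = (alpha * y) @ K[:, t]   # sum_s alpha_s y_s K(s, t)
            if y[t] * margin <= 0:           # mistake: update the dual weight
                alpha[t] += 1.0
    return alpha

def predict(K_cross, alpha, y):
    """Predict labels for test vertices; K_cross[s, t] is the kernel value
    between training vertex s and test vertex t."""
    return np.sign((alpha * np.asarray(y, dtype=float)) @ K_cross)
```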
18 The promise of our technique is that, if m + S ≪ n and a tree is given, it requires O(n) time versus O(n^3) for standard methods. [sent-34, score-0.284]
19 To the best of our knowledge this is the first paper which addresses the problem of fast prediction in semi-supervised learning using tree graphs. [sent-35, score-0.323]
20 In the case of unbalanced bipartite graphs [15] presents a method which significantly improves the computation time of the pseudoinverse to Θ(k^2(n − k)), where k is the size of a minority partition. [sent-39, score-0.301]
21 Thus, in the case of a binary tree the computation is still Θ(n^3) time. [sent-40, score-0.293]
22 In Section 2 we review the notions which are needed in order to present our technique in Section 3, concerning the fast computation of a tree graph kernel. [sent-42, score-0.564]
23 In Section 4 we address the issue of tree selection, commenting in particular on a potential advantage of shortest path trees. [sent-43, score-0.468]
24 2 Background In this paper any graph G is assumed to be connected, to have n vertices, and to have edge weights. [sent-45, score-0.293]
25 We say that G is a tree if it is connected and has n − 1 edges. [sent-51, score-0.296]
26 The graph Laplacian is the n × n matrix defined as G = D − A, where D is the diagonal matrix with i-th diagonal element D_ii = Σ_{j=1}^n A_ij, the weighted degree of vertex i. [sent-52, score-0.468]
27 Where it is not ambiguous, we will use the notation G to denote both the graph G and the graph Laplacian and the notation T to denote both a Laplacian of a tree and the tree itself. [sent-53, score-1.014]
28 As the graph is connected, it follows from the definition of the semi-norm that the null space of G is spanned by the constant vector 1 only. [sent-58, score-0.248]
29 That is, the weighted graph may be seen as a network of resistors where edge (i, j) is a resistor with resistance π_ij = A_ij^{-1}. [sent-60, score-0.6]
30 Then the effective resistance r_G(i, j) may be defined as the resistance measured between vertices i and j in this network and may be calculated using Kirchhoff's circuit laws or directly from G+ using the formula [16] r_G(i, j) = G^+_{ii} + G^+_{jj} − 2G^+_{ij}. (2.1) [sent-61, score-0.88]
31 The effective resistance is a metric distance on the graph [16], as are the geodesic and structural distances. [sent-63, score-0.885]
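As a point of reference, equation (2.1) can be checked with a small dense computation; the sketch below (plain numpy, illustrative names) is the brute-force O(n^3) route that the method of Section 3 is designed to avoid.

```python
import numpy as np

def effective_resistances(A):
    """Effective-resistance matrix of a weighted graph, computed from the
    pseudoinverse of its Laplacian via r_G(i, j) = G+_ii + G+_jj - 2 G+_ij."""
    D = np.diag(A.sum(axis=1))      # weighted degrees on the diagonal
    G = D - A                       # graph Laplacian G = D - A
    Gp = np.linalg.pinv(G)          # dense pseudoinverse G+ (O(n^3))
    d = np.diag(Gp)
    return d[:, None] + d[None, :] - 2.0 * Gp

# A 3-vertex path with unit weights: r(0, 2) = 2 (two unit resistors in series).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
print(effective_resistances(A)[0, 2])   # approximately 2.0
```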
32 The structural distance between vertices i, j ∈ V is defined as sG (i, j) := min {|P (i, j)| : P (i, j) ∈ P} where P is the set of all paths in G and P (i, j) is the set of edges in a particular path from i to j. [sent-64, score-0.484]
33 The diameter is the maximum distance between any two points on the graph, hence the resistance, structural, and geodesic diameters are defined as R_G = max_{i,j∈V} r_G(i, j), S_G = max_{i,j∈V} s_G(i, j), and D_G = max_{i,j∈V} d_G(i, j), respectively. [sent-66, score-0.396]
34 1 Inverse Connectivity Let us begin by noting that the effective resistance is a better measure of connectivity than the geodesic distance: for example, if there are k edge-disjoint paths of geodesic distance d between two vertices, then the effective resistance is no more than d/k. [sent-70, score-1.101]
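A one-line sketch of why the d/k bound holds: each of the k edge-disjoint paths has resistance at most d (the geodesic distance is the sum of the resistances along a path), and deleting every other edge can only increase the effective resistance by Rayleigh's monotonicity law, so

```latex
\[
  r_G(i,j) \;\le\; \Bigl(\sum_{\ell=1}^{k} \tfrac{1}{d}\Bigr)^{-1} \;=\; \frac{d}{k}.
\]
```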
35 The first quantity is the total resistance R_tot = Σ_{i>j} r_G(i, j), which is a measure of the inverse connectivity of the graph: the smaller R_tot the more connected the graph. [sent-73, score-0.421]
36 The second quantity is R(i) = Σ_{j=1}^n r_G(i, j), which is used as a measure of inverse centrality of vertex i [6]. [sent-74, score-0.38]
37 3 Method Throughout this section we assume that G is a tree with corresponding Laplacian matrix T. [sent-85, score-0.259]
38 The principle of the method to compute T+ is that on a tree there is a unique path between any two vertices and so the effective resistance is simply the sum of the resistances along that path; see e.g. [sent-86, score-1.001]
39 [16, 13] (for the same reason, on a tree the geodesic distance is the same as the resistance distance). [sent-88, score-0.72]
40 The parent and the children of vertex i are denoted by ↑(i) and ↓(i), respectively. [sent-90, score-0.22]
41 Initialization We split the computation of the inverse centrality R(i) into two terms, namely R(i) = T(i) + S(i), where T(i) and S(i) are the sums of the resistances from vertex i to each of its descendants and non-descendants, respectively. [sent-105, score-0.503]
42 We next descend the tree, caching each calculated S(i) with the root-to-leaves recursion S(i) := S(↑(i)) + T(↑(i)) − T(i) + (n − 2κ(i))π_{i,↑(i)} for non-root vertices, with S(i) := 0 at the root. It is clear that the time complexity of the above recursions is O(n). [sent-111, score-0.387]
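A minimal Python rendering of this root-to-leaves pass (illustrative names, not the paper's pseudocode); it assumes that a preceding leaves-to-root pass has already filled in T(i) and the subtree sizes κ(i), which the extract above does not reproduce.

```python
def descend_S(children, pi, T, kappa, root, n):
    """Root-to-leaves recursion for S(i), the sum of resistances from vertex i
    to its non-descendants:
        S(root) = 0
        S(i)    = S(p) + T(p) - T(i) + (n - 2*kappa[i]) * pi[(i, p)],  p = parent of i.
    pi[(i, p)] is the resistance of the tree edge between i and its parent."""
    S = {root: 0.0}
    stack = [root]
    while stack:                     # iterative descent from the root
        p = stack.pop()
        for i in children[p]:
            S[i] = S[p] + T[p] - T[i] + (n - 2 * kappa[i]) * pi[(i, p)]
            stack.append(i)
    return S                         # then R(i) = T(i) + S(i) for every vertex
```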
43 Computing an m × m block of the Laplacian pseudoinverse Our algorithm (see Figure 1) computes the effective resistance matrix of an m × m block which effectively gives the kernel (via equation (2.1)). [sent-133, score-0.606]
44 The motivating idea is that a single effective resistance r_T(i, j) is simply the sum of resistances along the path from i to j. [sent-135, score-0.528]
45 It may be computed by separately ascending the path from i to the root and from j to the root in O(S_T) time and summing the resistances along each edge that is in either the i-to-root or the j-to-root path but not in both. [sent-136, score-0.441]
46 This is realized by additionally caching the cumulative sums of resistances along the path to the root during each ascent from a vertex. [sent-138, score-0.409]
47 We outline the algorithm in further detail as follows: for each vertex v_i in the set V_m = {v_1, . . . , v_m}, we ascend the tree from v_i to the root. [sent-139, score-0.315]
48 As we ascend, we cache each cumulative resistance (from the starting vertex vi to the current vertex c) along the path on the way to the root (line 11). [sent-143, score-0.994]
49 If, while ascending from v_i, we enter a vertex c which has previously been visited during the ascent from another vertex w (line 6), then we compute r_T(v_i, w) as r_T(v_i, c) + r_T(c, w). [sent-144, score-0.698]
50 For example, during the ascent from vertex v_k ∈ V_m to the root we will compute {r_T(v_1, v_k), . . . , r_T(v_{k−1}, v_k)}. [sent-145, score-0.453]
51 The computational complexity is obtained by noting that every ascent to the root requires O(S_T) steps and along each ascent we must compute up to max(m, S_T) resistances. [sent-149, score-0.239]
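A sketch of this ascent-and-cache procedure follows (a Python rendering with illustrative names, not the paper's Figure 1); `parent` and `pi` describe the rooted tree and `Vm` is the list of m query vertices. The conversion of the resulting resistances into kernel entries via equation (2.1) and the initialization quantities is not shown.

```python
def resistance_block(Vm, parent, pi, root):
    """m x m matrix of effective resistances r_T(v_a, v_b) between the vertices
    in Vm on a tree given by parent pointers and edge resistances pi[(i, parent[i])].

    Each v in Vm is ascended to the root; at every vertex c on the way we cache
    the cumulative resistance from v to c.  The first time the ascent from v
    meets a vertex already visited by the ascent from w, that vertex is their
    lowest common ancestor, so r_T(v, w) = r_T(v, c) + r_T(c, w)."""
    m = len(Vm)
    R = [[0.0] * m for _ in range(m)]
    cache = {}                                  # vertex c -> {index b: r_T(v_b, c)}
    for a, v in enumerate(Vm):
        resolved = set()
        c, dist = v, 0.0
        while True:
            for b, r_wc in cache.get(c, {}).items():
                if b not in resolved:           # first common vertex: the LCA
                    R[a][b] = R[b][a] = dist + r_wc
                    resolved.add(b)
            cache.setdefault(c, {})[a] = dist   # cache r_T(v, c) for later ascents
            if c == root:
                break
            dist += pi[(c, parent[c])]          # add resistance of edge to parent
            c = parent[c]
    return R
```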
52 Let us return to the practical scenario described in the introduction, in which we wish to predict p vertices of the tree based on labeled vertices. [sent-156, score-0.493]
53 4 Tree Construction In the previous discussion we have assumed that a tree is already given. [sent-160, score-0.259]
54 In the following, we assume that a graph G or a similarity function is given and the aim is to construct an approximating tree. [sent-161, score-0.248]
55 We will consider both the minimum spanning tree (MST), as a "best-in-norm" approximation, and the shortest path tree (SPT), as an approximation which maintains a mistake-bound guarantee [13]. [sent-162, score-0.988]
56 Given a graph with a “cost” on each edge, an MST is a connected n-vertex subgraph with n − 1 edges such that the total cost is minimized. [sent-163, score-0.358]
57 In our set-up the cost of edge (i, j) is the resistance π_ij = 1/A_ij; therefore, a minimum spanning tree of G solves the problem min{ Σ_{(i,j)∈E(T)} π_ij : T ∈ T(G) }, (4.1) [sent-164, score-0.801]
58 where T(G) denotes the set of spanning trees of G. [sent-165, score-0.281]
59 An MST is also a tree whose Laplacian best approximates the Laplacian of the given graph according to the trace norm, that is, it solves the problem min{tr(G − T) : T ∈ T(G)}. (4.2) [sent-166, score-0.507]
60 That problems (4.1) and (4.2) have the same solution follows by noting that the edges in a minimum spanning tree are invariant with respect to any strictly increasing function of the "costs" on the edges in the original graph [8], and the function −π^{−1} is increasing in π. [sent-171, score-0.877]
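A compact sketch of building such an MST with Prim's algorithm on the resistance costs π_ij = 1/A_ij (standard-library Python; the helper names are illustrative). Since the MST is invariant under the strictly increasing map π ↦ −π^{−1} = −A, the same tree is the maximum-weight spanning tree of A, which is the trace-norm argument above.

```python
import heapq

def mst_prim(A, start=0):
    """Prim's algorithm with edge costs pi_ij = 1/A_ij.

    A is a dict-of-dicts weighted adjacency structure, A[i][j] = A_ij > 0.
    Returns the MST as a list of (i, j) tree edges."""
    visited = {start}
    tree_edges = []
    heap = [(1.0 / w, start, j) for j, w in A[start].items()]
    heapq.heapify(heap)
    while heap:
        cost, i, j = heapq.heappop(heap)
        if j in visited:
            continue                 # stale entry: j was already reached more cheaply
        visited.add(j)
        tree_edges.append((i, j))
        for k, w in A[j].items():
            if k not in visited:
                heapq.heappush(heap, (1.0 / w, j, k))
    return tree_edges
```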
61 As noted in [10], the total resistance is a convex function of the graph Laplacian. [sent-179, score-0.521]
62 However, we do not know how to minimize Rtot (T) over the set of spanning trees of G. [sent-180, score-0.281]
63 We choose a vertex i and look for a spanning tree which minimizes the inverse centrality R(i) of vertex i, that is, we solve the problem min{R(i) : T ∈ T(G)}. (4.3) [sent-182, score-1.027]
64 Note that R(i) is the contribution of vertex i to the total resistance of T (see the equations of Section 3). [sent-184, score-0.493]
65 The above problem can then be interpreted as minimizing a tradeoff between inverse centrality of vertex i and inverse connectivity of the tree. [sent-187, score-0.437]
66 A solution of problem (4.3) is a shortest path tree (SPT) centered at vertex i, namely a spanning tree for which the geodesic distance in "costs" from i to every other vertex in the graph is minimized. [sent-191, score-1.55]
67 This is because the geodesic distance is equivalent to the resistance distance on a tree and any SPT of G is formed from a set of shortest paths connecting the root to any other vertex in G [8]. [sent-192, score-1.219]
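A matching sketch of an SPT centered at a chosen vertex, via Dijkstra's algorithm on the same costs (again with illustrative names):

```python
import heapq

def spt_dijkstra(A, center):
    """Shortest path tree of the graph A (dict-of-dicts adjacency, A[i][j] = A_ij)
    centered at `center`, with edge costs pi_ij = 1/A_ij.
    Returns parent pointers; the SPT edges are {(j, parent[j]) : j != center}."""
    dist = {center: 0.0}
    parent = {center: None}
    heap = [(0.0, center)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue                             # stale heap entry
        for j, w in A[i].items():
            nd = d + 1.0 / w                     # cost pi_ij = 1/A_ij
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                parent[j] = i
                heapq.heappush(heap, (nd, j))
    return parent
```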
68 Let us observe a fundamental difference between MST and SPT, which provides a justification for approximating the given graph with an SPT. [sent-195, score-0.248]
69 To explain our argument, first we note that when we approximate the graph with a tree T the norm term ‖u‖^2_G is always decreasing, while the term R_G is always increasing by Rayleigh's monotonicity law (see for example [13]). [sent-198, score-0.507]
70 Now, note that the resistance diameter R_T of an SPT of a graph G is bounded by twice the geodesic diameter of the original graph, R_T ≤ 2D_G. (4.4) [sent-200, score-0.881]
71 Indeed, as an SPT is formed from a set of shortest paths between the root and any other vertex in G, for any pair of vertices p, q in the graph there is in the SPT a path from p to the root and then to q which can be no longer than 2D_G. [sent-202, score-1.087]
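In symbols, with 0 denoting the SPT root, the argument behind (4.4) is simply:

```latex
\[
  r_T(p,q) \;=\; d_T(p,q) \;\le\; d_T(p,0) + d_T(0,q)
           \;=\; d_G(p,0) + d_G(0,q) \;\le\; 2 D_G ,
\]
```

where the first equality uses that on a tree the resistance distance equals the geodesic distance, and d_T(p, 0) = d_G(p, 0) because an SPT preserves shortest-path distances to its root.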
72 This, together with equation (4.4), implies that a tree built with an SPT would still have a non-vacuous mistake bound. [sent-205, score-0.296]
73 For example, consider a bicycle wheel graph whose edge set is the union of the n spoke edges {(0, i) : i = 1, . . . , n} and the rim edges forming a cycle on the vertices {1, . . . , n}, [sent-208, score-0.395]
74 with costs on the spoke edges of 2 and on the rim edges of 1. [sent-214, score-0.23]
75 In the general case of a non-sparse graph or similarity function the time complexity is Θ(n^2); however, as both Prim's and Dijkstra's algorithms are "greedy", their space complexity is O(n), which may be a dominant consideration on a large graph. [sent-219, score-0.325]
76 The motivation for our methodology is that on graphs with as few as 10,000 vertices it is already computationally challenging to use standard graph labeling methods such as [3, 20, 13], as they require the computation of the full graph Laplacian kernel. [sent-221, score-0.82]
77 On the other hand, in the practical scenario described in the introduction the computational time of our method scales linearly in the number of vertices in the graph, so it can be run comfortably on large graphs (see Figure 2 below), and at worst quadratically if the full graph needs to be labeled. [sent-223, score-0.799]
78 The initial results are promising in that the performance of the predictor with a single SPT or MST is competitive with that of the existing methods, some of which use the full graph information. [sent-225, score-0.286]
79 The first one is formed by 9,072 vertices, which represent computer hosts, and 464,959 edges – we call this the host-graph. [sent-230, score-0.267]
80 In this graph, one host is “connected” to another host if there is at least one link from a web-page in the first host to a web-page in the other host. [sent-231, score-0.24]
81 The second graph consists of 400,000 vertices (corresponding to web-pages) and 10,455,545 edges – we call this the web-graph. [sent-232, score-0.511]
82 Each vertex is either labeled as spam or as non-spam. [sent-236, score-0.322]
83 In both graphs about 80% of the vertices are non-spam and 20% are spam. [sent-237, score-0.312]
84 Additional tf-idf feature vectors (determined by the web-pages’ html content) are provided for each vertex in the graph, but we have discarded this information for simplicity. [sent-238, score-0.251]
85 Following the web-spam protocol, for both graphs we used 10% of labeled vertices for training and 90% for testing. [sent-239, score-0.298]
86 The method of Witschel and Biemann [4] consisted of iteratively selecting vertices and classifying them with the predominant class in their neighborhood (hence this is very similar to the label propagation method of [20]). [sent-245, score-0.238]
87 For the unweighted graphs every spanning tree is an MST, so we simply used trees generated by a randomized depth-first traversal of the graph; SPTs may be generated using the breadth-first-search algorithm, all in O(|E|) time. [sent-446, score-0.643]
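A short sketch of both unweighted constructions (illustrative function names), each running in O(|E|) time:

```python
import random
from collections import deque

def random_spanning_tree(adj, root=0):
    """Spanning tree from a randomized traversal of an unweighted graph
    (every spanning tree is an MST here).  adj[i] is a list of neighbours."""
    parent, seen, stack = {root: None}, {root}, [root]
    while stack:
        i = stack.pop()
        nbrs = list(adj[i])
        random.shuffle(nbrs)              # randomize the traversal order
        for j in nbrs:
            if j not in seen:
                seen.add(j)
                parent[j] = i
                stack.append(j)
    return parent

def bfs_spt(adj, root=0):
    """Shortest path tree of an unweighted graph via breadth-first search."""
    parent, seen, queue = {root: None}, {root}, deque([root])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                parent[j] = i
                queue.append(j)
    return parent
```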
88 The "agg." tag stands for aggregate and the "bidir" tag indicates that the original graph was modified by setting w = 2 for bidirectional edges. [sent-448, score-0.367]
89 In the case of the larger web-graph, we used 21 trees and the modified graph with bidirectional weights. [sent-449, score-0.38]
90 It is interesting to note that some of the previous methods [1, 4] take the full graph information into account. [sent-453, score-0.248]
91 Thus, the above results indicate that our method is statistically competitive (in fact better than most of the other methods) even though the full graph structure is discarded. [sent-454, score-0.31]
92 Remarkably, in the case of the large web-graph, using just a single tree gives a very good accuracy, particularly in the case of SPT. [sent-455, score-0.259]
93 On this graph SPT is also more stable in terms of variance than MST. [sent-456, score-0.248]
94 In the case of the smaller host-graph, just using one tree leads to a decrease in performance. [sent-457, score-0.259]
95 We then fixed p = 1000 vertices to be predicted and let the number of labeled vertices vary over the set {20, 40, 60, 80, 100, 200, 400}. [sent-466, score-0.424]
96 The method is simple to implement and, in the practical regime of small labeled and testing sets and diameters, scales linearly in the number of vertices in the tree. [sent-471, score-0.258]
97 When we are presented with a generic undirected weighted graph, we first extract a spanning tree from it and then run the method. [sent-472, score-0.454]
98 We have studied minimum spanning trees and shortest path trees, both of which can be computed efficiently with standard algorithms. [sent-473, score-0.519]
99 We have tested the method on a web-spam classification problem involving a graph of 400,000 vertices. [sent-474, score-0.272]
100 Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. [sent-617, score-0.521]
wordName wordTfidf (topN-words)
[('spt', 0.385), ('mst', 0.309), ('resistance', 0.273), ('tree', 0.259), ('graph', 0.248), ('vertex', 0.22), ('spanning', 0.195), ('vertices', 0.19), ('laplacian', 0.156), ('pseudoinverse', 0.154), ('rtot', 0.154), ('geodesic', 0.152), ('rg', 0.144), ('rt', 0.132), ('resistances', 0.116), ('shortest', 0.108), ('diameter', 0.104), ('path', 0.101), ('bidir', 0.096), ('vi', 0.095), ('centrality', 0.093), ('trees', 0.086), ('root', 0.085), ('aij', 0.084), ('auc', 0.081), ('hosts', 0.077), ('prim', 0.077), ('edges', 0.073), ('host', 0.072), ('visited', 0.07), ('dg', 0.065), ('graphs', 0.064), ('ascent', 0.064), ('sg', 0.062), ('herbster', 0.062), ('spam', 0.058), ('bencz', 0.058), ('dijkstra', 0.058), ('filoche', 0.058), ('vm', 0.055), ('ij', 0.051), ('paths', 0.05), ('initialization', 0.049), ('kernel', 0.047), ('block', 0.047), ('bidirectional', 0.046), ('edge', 0.045), ('connectivity', 0.044), ('labeled', 0.044), ('aggregate', 0.044), ('caching', 0.043), ('tang', 0.043), ('vk', 0.042), ('perceptron', 0.042), ('laplacians', 0.041), ('st', 0.041), ('prediction', 0.041), ('inverse', 0.04), ('abernathy', 0.039), ('comment', 0.039), ('fpt', 0.039), ('witschel', 0.039), ('effective', 0.038), ('competitive', 0.038), ('mistake', 0.037), ('connected', 0.037), ('labeling', 0.036), ('distance', 0.036), ('structural', 0.034), ('computation', 0.034), ('kirchoff', 0.034), ('recursions', 0.034), ('resistors', 0.034), ('et', 0.033), ('gram', 0.032), ('argyriou', 0.031), ('html', 0.031), ('rim', 0.031), ('minimum', 0.029), ('tag', 0.029), ('spoke', 0.029), ('ascending', 0.029), ('ii', 0.029), ('tr', 0.027), ('quantity', 0.027), ('content', 0.026), ('nystr', 0.026), ('complexity', 0.026), ('time', 0.025), ('ms', 0.025), ('unweighted', 0.025), ('laws', 0.025), ('method', 0.024), ('costs', 0.024), ('jj', 0.024), ('pontil', 0.024), ('link', 0.024), ('summing', 0.024), ('accuracy', 0.023), ('fast', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 84 nips-2008-Fast Prediction on a Tree
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
2 0.369068 171 nips-2008-Online Prediction on Large Diameter Graphs
Author: Mark Herbster, Guy Lever, Massimiliano Pontil
Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1
3 0.1554099 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
4 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
5 0.12666923 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
6 0.10796262 107 nips-2008-Influence of graph construction on graph-based clustering measures
7 0.10552513 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
8 0.09869545 69 nips-2008-Efficient Exact Inference in Planar Ising Models
9 0.091093428 4 nips-2008-A Scalable Hierarchical Distributed Language Model
10 0.083727196 78 nips-2008-Exact Convex Confidence-Weighted Learning
11 0.080106676 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
12 0.078507639 212 nips-2008-Skill Characterization Based on Betweenness
13 0.078351982 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
14 0.078094356 225 nips-2008-Supervised Bipartite Graph Inference
15 0.077540986 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
16 0.071826443 40 nips-2008-Bounds on marginal probability distributions
17 0.066981703 193 nips-2008-Regularized Co-Clustering with Dual Supervision
18 0.064103864 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
19 0.064010493 151 nips-2008-Non-parametric Regression Between Manifolds
20 0.060298666 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
topicId topicWeight
[(0, -0.192), (1, -0.018), (2, -0.055), (3, 0.048), (4, 0.087), (5, -0.148), (6, 0.136), (7, -0.036), (8, 0.054), (9, -0.333), (10, -0.105), (11, 0.051), (12, -0.148), (13, -0.083), (14, -0.028), (15, -0.169), (16, -0.013), (17, 0.088), (18, -0.161), (19, -0.014), (20, 0.042), (21, -0.091), (22, 0.085), (23, -0.101), (24, -0.171), (25, 0.069), (26, 0.019), (27, 0.006), (28, 0.018), (29, -0.1), (30, 0.136), (31, 0.039), (32, -0.048), (33, 0.035), (34, -0.038), (35, -0.007), (36, -0.062), (37, 0.003), (38, -0.024), (39, 0.087), (40, 0.069), (41, -0.001), (42, 0.123), (43, 0.131), (44, -0.063), (45, 0.067), (46, 0.017), (47, -0.039), (48, -0.061), (49, -0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.96605241 84 nips-2008-Fast Prediction on a Tree
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
2 0.92322123 171 nips-2008-Online Prediction on Large Diameter Graphs
Author: Mark Herbster, Guy Lever, Massimiliano Pontil
Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1
3 0.63995194 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while also allowing for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial in the size of the graph and the treewidth bound. At the heart of our method is a triangulated graph that we dynamically update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life datasets. Importantly, we also show that by using global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. 1
4 0.62250364 69 nips-2008-Efficient Exact Inference in Planar Ising Models
Author: Nicol N. Schraudolph, Dmitry Kamenetsky
Abstract: We give polynomial-time algorithms for the exact computation of lowest-energy states, worst margin violators, partition functions, and marginals in certain binary undirected graphical models. Our approach provides an interesting alternative to the well-known graph cut paradigm in that it does not impose any submodularity constraints; instead we require planarity to establish a correspondence with perfect matchings in an expanded dual graph. Maximum-margin parameter estimation for a boundary detection task shows our approach to be efficient and effective. A C++ implementation is available from http://nic.schraudolph.org/isinf/. 1
5 0.60367942 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
6 0.59494799 107 nips-2008-Influence of graph construction on graph-based clustering measures
7 0.51541036 212 nips-2008-Skill Characterization Based on Betweenness
8 0.50507581 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
9 0.49237624 225 nips-2008-Supervised Bipartite Graph Inference
10 0.48107624 104 nips-2008-Improved Moves for Truncated Convex Models
11 0.4696168 194 nips-2008-Regularized Learning with Networks of Features
12 0.43083149 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
13 0.34714419 4 nips-2008-A Scalable Hierarchical Distributed Language Model
14 0.31108814 70 nips-2008-Efficient Inference in Phylogenetic InDel Trees
15 0.29643038 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
16 0.29443952 193 nips-2008-Regularized Co-Clustering with Dual Supervision
17 0.2911911 236 nips-2008-The Mondrian Process
18 0.29107225 29 nips-2008-Automatic online tuning for fast Gaussian summation
19 0.28192991 224 nips-2008-Structured ranking learning using cumulative distribution networks
20 0.2683399 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
topicId topicWeight
[(3, 0.01), (6, 0.048), (7, 0.055), (12, 0.045), (28, 0.191), (57, 0.048), (59, 0.097), (63, 0.049), (71, 0.024), (75, 0.247), (77, 0.032), (78, 0.014), (83, 0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.79241192 84 nips-2008-Fast Prediction on a Tree
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
2 0.78672415 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
Author: Maxim Raginsky, Svetlana Lazebnik, Rebecca Willett, Jorge Silva
Abstract: This paper describes a recursive estimation procedure for multivariate binary densities using orthogonal expansions. For d covariates, there are 2d basis coefficients to estimate, which renders conventional approaches computationally prohibitive when d is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity.
3 0.75689542 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
Author: Indraneel Mukherjee, David M. Blei
Abstract: Hierarchical probabilistic modeling of discrete data has emerged as a powerful tool for text analysis. Posterior inference in such models is intractable, and practitioners rely on approximate posterior inference methods such as variational inference or Gibbs sampling. There has been much research in designing better approximations, but there is yet little theoretical understanding of which of the available techniques are appropriate, and in which data analysis settings. In this paper we provide the beginnings of such understanding. We analyze the improvement that the recently proposed collapsed variational inference (CVB) provides over mean field variational inference (VB) in latent Dirichlet allocation. We prove that the difference in the tightness of the bound on the likelihood of a document decreases as O(k − 1) + log m/m, where k is the number of topics in the model and m is the number of words in a document. As a consequence, the advantage of CVB over VB is lost for long documents but increases with the number of topics. We demonstrate empirically that the theory holds, using simulated text data and two text corpora. We provide practical guidelines for choosing an approximation. 1
4 0.69291455 171 nips-2008-Online Prediction on Large Diameter Graphs
Author: Mark Herbster, Guy Lever, Massimiliano Pontil
Abstract: We continue our study of online prediction of the labelling of a graph. We show a fundamental limitation of Laplacian-based algorithms: if the graph has a large diameter then the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this drawback by means of an efficient algorithm which achieves a logarithmic mistake bound. It is based on the notion of a spine, a path graph which provides a linear embedding of the original graph. In practice, graphs may exhibit cluster structure; thus in the last part, we present a modified algorithm which achieves the “best of both worlds”: it performs well locally in the presence of cluster structure, and globally on large diameter graphs. 1
5 0.68555689 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir
Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1
6 0.68069887 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
7 0.67159575 198 nips-2008-Resolution Limits of Sparse Coding in High Dimensions
8 0.66144818 179 nips-2008-Phase transitions for high-dimensional joint support recovery
9 0.65696526 23 nips-2008-An ideal observer model of infant object perception
10 0.65602684 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
11 0.65279824 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
12 0.65212911 196 nips-2008-Relative Margin Machines
13 0.6494239 48 nips-2008-Clustering via LP-based Stabilities
14 0.64906871 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
15 0.64841187 231 nips-2008-Temporal Dynamics of Cognitive Control
16 0.6481784 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
17 0.64595693 10 nips-2008-A rational model of preference learning and choice prediction by children
18 0.64546186 104 nips-2008-Improved Moves for Truncated Convex Models
19 0.64521074 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
20 0.64460218 195 nips-2008-Regularized Policy Iteration