jmlr jmlr2010 jmlr2010-44 knowledge-graph by maker-knowledge-mining

44 jmlr-2010-Graph Kernels


Source: pdf

Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt

Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 38, 72076 Tübingen, Germany. Editor: John Lafferty. Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al. [sent-16, score-0.583]

2 Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). [sent-21, score-0.449]

3 , 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. [sent-27, score-0.664]

4 Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. [sent-28, score-0.684]

5 A very successful recent method is to model the protein as a graph (see Figure 1), and assign similar functions to similar graphs (Borgwardt et al. [sent-54, score-0.454]

6 2 we compute graph kernels to measure the similarity between proteins and enzymes represented in this fashion. [sent-57, score-0.645]

7 Comparing nodes in a graph involves constructing a kernel between nodes, while comparing graphs involves constructing a kernel between graphs. [sent-71, score-0.775]

8 Another algebraic approach to graph kernels has appeared recently (Kondor and Borgwardt, 2008). [sent-84, score-0.491]

9 A seemingly independent line of research investigates the so-called rational kernels, which are kernels between finite state automata based on the algebra of abstract semirings (Cortes et al. [sent-85, score-0.563]

10 The aim of this paper is twofold: on the one hand we present theoretical results showing that all the above graph kernels are in fact closely related; on the other hand we present new algorithms for efficiently computing such kernels. [sent-87, score-0.491]

11 (2007) to present a unifying framework for graph kernels encompassing many known kernels as special cases, and to discuss connections to yet others. [sent-91, score-0.767]

12 kernels, and discuss the random walk and marginalized graph kernels as special cases. [sent-95, score-0.823]

13 In Section 4 we present four efficient ways to compute random walk graph kernels, namely: 1. [sent-97, score-0.494]

14 , and show that specializing them to graphs yields random walk graph kernels. [sent-111, score-0.695]

15 In Section 7 we discuss the relation between R-convolution kernels (Haussler, 1999) and various graph kernels, all of which can in fact be shown to be instances of R-convolution kernels. [sent-112, score-0.491]

16 The Kronecker product and vec operator are linked by the well-known property (e. [sent-152, score-0.501]
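
The well-known property referred to here is vec(AXB) = (B^⊤ ⊗ A) vec(X), where vec(·) stacks the columns of a matrix. Below is a minimal NumPy check of this identity; the random test matrices are ours and purely illustrative.

```python
import numpy as np

# Numerical sanity check of vec(A X B) = (B^T kron A) vec(X),
# where vec(.) stacks matrix columns (column-major order).
rng = np.random.default_rng(0)
A, X, B = (rng.standard_normal((4, 4)) for _ in range(3))

vec = lambda M: M.flatten(order="F")   # column-stacking vec operator
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```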

17 The t-th power of A thus describes t-length walks, that is, (A^t)_ij is the probability of a transition from vertex v_j to vertex v_i via a walk of length t. [sent-195, score-0.462]

18 A random walk need not continue indefinitely; to model this, we associate every node v_ik in the graph with a stopping probability q_ik. [sent-198, score-0.569]

19 Our generalized random walk graph kernels then use the overall probability of stopping after t steps, given by q^⊤ p_t. [sent-199, score-0.77]
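
As a toy illustration of the three preceding sentences (the 3-vertex graph and the uniform starting and stopping vectors are made up for the example, and stopping is not folded into the transition matrix here): with a column-normalized adjacency matrix, matrix powers give the t-step transition probabilities, and q^⊤ p_t is the quantity used above.

```python
import numpy as np

# Toy 3-vertex graph; columns of A are normalized so that (A^t)[i, j]
# is the probability of walking from vertex v_j to vertex v_i in t steps.
adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]], dtype=float)
A = adj / adj.sum(axis=0, keepdims=True)    # column-stochastic transition matrix

p = np.full(3, 1 / 3)                       # uniform starting probabilities
q = np.full(3, 1 / 3)                       # uniform stopping probabilities
t = 4
p_t = np.linalg.matrix_power(A, t) @ p      # distribution after t steps
stop_prob = q @ p_t                         # the quantity q^T p_t
```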

20 Every edge-labeled graph G is associated with a label matrix X ∈ X^{n×n} in which X_ij is the label of the edge (v_j, v_i) and X_ij = ζ if (v_j, v_i) ∉ E. [sent-204, score-0.46]

21 Random Walk Graph Kernels. Our generalized random walk graph kernels are based on a simple idea: given a pair of graphs, perform random walks on both, and count the number of matching walks. [sent-214, score-0.887]

22 We show that this simple concept underlies both random walk and marginalized graph kernels. [sent-215, score-0.547]

23 Direct Product Graphs. Given two graphs G(V, E) and G′(V′, E′), their direct product G× is a graph with vertex set V× = {(v_i, v′_r) : v_i ∈ V, v′_r ∈ V′} (4) and edge set E× = {((v_i, v′_r), (v_j, v′_s)) : (v_i, v_j) ∈ E ∧ (v′_r, v′_s) ∈ E′}. (5) [sent-218, score-0.595]

24 Each node of the direct product graph is labeled with a pair of nodes (4); an edge exists in the direct product if and only if the corresponding nodes are adjacent in both original graphs (5). [sent-220, score-0.839]
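
For unlabeled graphs the adjacency matrix of the direct product graph defined in (4)-(5) is simply the Kronecker product of the two adjacency matrices; a minimal sketch with two arbitrary example graphs:

```python
import numpy as np

# Adjacency matrices of two small unlabeled example graphs G and G'.
A_G = np.array([[0, 1],
                [1, 0]], dtype=float)
A_Gp = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]], dtype=float)

# Direct product graph: vertex (v_i, v'_r) maps to row/column index i * 3 + r,
# and an edge exists iff the corresponding edges exist in both factor graphs.
A_x = np.kron(A_G, A_Gp)
assert A_x.shape == (2 * 3, 2 * 3)
```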

25 Performing a random walk on the direct product graph is equivalent to performing a simultaneous random walk on G and G′ (Imrich and Klavžar, 2000). [sent-225, score-0.875]

26 Likewise, if q and q′ are stopping probabilities (that is, the probability that a random walk ends at a given vertex), then the stopping probability on the direct product graph is q× := q ⊗ q′ . [sent-227, score-0.596]

27 Thus the weight matrix (6) has a non-zero entry iff an edge exists in the direct product graph and the corresponding edges in G and G′ have the same label. [sent-239, score-0.518]

28 Some simple algebra (omitted for the sake of brevity) shows that the weight matrix of the direct product graph can then be written as W× = ∑_{l=1}^{d} (^lA) ⊗ (^lA′). (7) [sent-241, score-0.432]

29 In Section 4 we will develop efficient methods to compute kernels defined using the weight matrix of the direct product graph. [sent-242, score-0.448]

30 The applicability and time complexity of a particular method will depend on whether the graphs are unlabeled though possibly edge-weighted (W× = A× ), have discrete edge labels (7), or—in the most general case—employ an arbitrary edge kernel (6); see Table 1 for a summary. [sent-243, score-0.6]
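
A hedged sketch of the discrete-label case (7): given one label-filtered adjacency matrix per edge label for each graph, the product-graph weight matrix is a sum of Kronecker products. The helper name and input format below are ours, not the paper's.

```python
import numpy as np

def product_weight_matrix(filtered_A, filtered_Ap):
    """W_x = sum over labels l of (lA kron lA'), cf. Eq. (7).

    filtered_A, filtered_Ap: lists of d label-filtered adjacency matrices,
    one per discrete edge label, for the two graphs respectively.
    """
    return sum(np.kron(lA, lAp) for lA, lAp in zip(filtered_A, filtered_Ap))

# With a single "label" (d = 1) this reduces to the unlabeled case W_x = A kron A'.
```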

31 Kernel Definition. As stated above, performing a random walk on the direct product graph G× is equivalent to performing a simultaneous random walk on the graphs G and G′ (Imrich and Klavžar, 2000). [sent-245, score-1.045]

32 this knowledge can be incorporated into the kernel; and finally, appropriate kernels or similarity measures between edges can be incorporated via the weight matrix W×. [sent-254, score-0.441]

33 (2004) define a kernel between labeled graphs via walks and their label sequences. [sent-284, score-0.488]

34 To recover the marginalized graph kernels let λ = 1, and define Φ̂(X_ij) = P_ij Φ(X_ij), in which case W× = T×, thus recovering (15). [sent-310, score-0.544]

35 (2003) also extend their kernels to graphs with labels from a finite set by replacing A× in (17) with a sum W× of label-filtered adjacency matrices. [sent-321, score-0.608]

36 (2003) differ from our definition (8) in that they do not explicitly model starting or stopping probabilities, and employ unnormalized adjacency matrices instead of our more general weight matrix (6) which allows for normalization and arbitrary kernels on edges. [sent-340, score-0.476]

37 Efficient Computation. Computing a geometric random walk graph kernel with µ(k) = λ^k amounts to inverting (I − λW×), an n^2 × n^2 matrix if G and G′ have n vertices each. [sent-342, score-0.753]
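
For reference, the direct approach mentioned here is a single dense solve on the n^2 × n^2 system, which is exactly what the methods of Section 4 avoid; a minimal sketch (function and argument names are ours):

```python
import numpy as np

def geometric_kernel_direct(W_x, p_x, q_x, lam):
    """Naive evaluation of k(G, G') = q_x^T (I - lam * W_x)^{-1} p_x.

    Solving this dense n^2-by-n^2 system costs O(n^6) for graphs with
    n vertices each; it serves only as a correctness reference.
    """
    n2 = W_x.shape[0]
    return q_x @ np.linalg.solve(np.eye(n2) - lam * W_x, p_x)
```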

38 The cost of the iterative algorithms increases by another factor of d for graphs with edge labels from a finite set of d symbols or an edge kernel with d-dimensional feature map; for an arbitrary edge kernel (whose feature map may be infinite-dimensional) this factor becomes n. [sent-357, score-0.787]

39 A nearest Kronecker product approximation can be used, however, to approximate the direct product of labeled graphs with a weight matrix that can be handled by any of our methods for unlabeled graphs. [sent-359, score-0.531]

40 We now show that for graphs with discrete edge labels, whose weight matrix W× can be written as (7), the problem of computing the graph kernel (16) can be reduced to solving the following generalized Sylvester equation: M = ∑_{i=1}^{d} λ (^iA′) M (^iA)^⊤ + M0, (22) where vec(M0) = p×. [sent-377, score-0.703]

41 Since solving that only takes O(n^3) time, computing the random walk graph kernel in this fashion is much faster than the O(n^6) time required by the direct approach. [sent-384, score-0.698]
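
A hedged sketch of the underlying idea for the unlabeled (d = 1) case: the Sylvester-type equation M = λ A′ M A^⊤ + M0 is, via the vec/Kronecker identity above, exactly the linear system (I − λ A ⊗ A′) vec(M) = p×, so q×^⊤ vec(M) reproduces the geometric kernel. The snippet solves the matrix equation by plain fixed-point iteration (O(n^3) per sweep) and cross-checks it against the naive solve; a production implementation would use a dlyap-style Sylvester solver instead, as the paper does.

```python
import numpy as np

def solve_stein_fixed_point(A, Ap, M0, lam, iters=300):
    """Iterate M <- lam * A' M A^T + M0; each sweep costs O(n^3)."""
    M = M0.copy()
    for _ in range(iters):
        M = lam * Ap @ M @ A.T + M0
    return M

rng = np.random.default_rng(1)
n = 4
A, Ap = rng.random((n, n)), rng.random((n, n))
lam = 0.9 / (np.linalg.norm(A, 2) * np.linalg.norm(Ap, 2))  # keep the iteration contractive
p_x = np.full(n * n, 1.0 / (n * n))         # starting probabilities, p_x = vec(M0)
M0 = p_x.reshape(n, n, order="F")           # un-vec so that vec(M0) = p_x

M = solve_stein_fixed_point(A, Ap, M0, lam)
# Cross-check against the naive solve of (I - lam * A kron A') vec(M) = p_x.
vec_M = np.linalg.solve(np.eye(n * n) - lam * np.kron(A, Ap), p_x)
assert np.allclose(M.flatten(order="F"), vec_M, atol=1e-7)
```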

42 A solver for the simple Sylvester equation (20) can still be used to efficiently compute kernels between labeled graphs, though, by employing the nearest Kronecker product approximation (Section 4. [sent-387, score-0.604]

43 We now turn to a method based on spectral decompositions that can compute the general random walk kernel (8) for any convergent choice of µ(k), but is only efficient for unlabeled graphs. [sent-418, score-0.631]

44 The random walk graph kernel (8) can then be written as k(G, G′) := ∑_{k=0}^{∞} µ(k) q×^⊤ (P× D× P×^{-1})^k p× = q×^⊤ P× ( ∑_{k=0}^{∞} µ(k) D×^k ) P×^{-1} p×. [sent-421, score-0.662]

45 Thus by diagonalizing the nonlinearity central to a random walk graph kernel, spectral decomposition can greatly expedite its computation. [sent-430, score-0.609]
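
For unlabeled graphs with symmetric adjacency matrices and geometric weights µ(k) = λ^k, the Kronecker structure means only the two small n × n eigendecompositions are ever needed. A hedged sketch, assuming W× = A ⊗ A′, p× = p ⊗ p′ and q× = q ⊗ q′ (the function name is ours):

```python
import numpy as np

def spectral_geometric_kernel(A, Ap, p, q, pp, qp, lam):
    """Geometric random walk kernel via two small eigendecompositions.

    With A = V diag(w) V^T and A' = V' diag(w') V'^T, the kernel equals
    sum_{i,j} (q.v_i)(p.v_i)(q'.v'_j)(p'.v'_j) / (1 - lam * w_i * w'_j),
    assuming |lam * w_i * w'_j| < 1 so the geometric series converges.
    """
    w, V = np.linalg.eigh(A)       # assumes symmetric adjacency matrices
    wp, Vp = np.linalg.eigh(Ap)
    a = (q @ V) * (p @ V)          # per-eigenvector factors for G
    b = (qp @ Vp) * (pp @ Vp)      # per-eigenvector factors for G'
    return float(np.sum(np.outer(a, b) / (1.0 - lam * np.outer(w, wp))))
```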

46 The entire m × m kernel matrix can thus be obtained in O(m^2 n^2 p) time, plus the O(mn^3) time it takes to precompute spectral decompositions of the m individual adjacency matrices. [sent-452, score-0.432]

47 Experiments. Numerous studies have applied random walk graph kernels to problems such as protein function prediction (Borgwardt et al. [sent-491, score-0.839]

48 Figure 3: Time to compute a 10×10 kernel matrix on random graphs with n nodes and 3n edges as a function of the graph size n. [sent-499, score-0.699]

49 For the real-world data sets, the value of λ was chosen to ensure that the random walk graph kernel converges. [sent-521, score-0.662]

50 The Data Sets. We now briefly describe each data set, and discuss how graph kernels are applicable. [sent-573, score-0.645]

51 We employed graph kernels to measure similarity between molecules from the MUTAG and PTC data sets (Toivonen et al. [sent-576, score-0.568]

52 Comparing these graphs via a modified random walk graph kernel and classifying them with a Support Vector Machine (SVM) led to function prediction accuracies competitive with state-of-the-art approaches (Borgwardt et al. [sent-589, score-0.832]

53 Table 2: Time to compute kernel matrix for unlabeled graphs from various data sets. [sent-624, score-0.44]

54 On the two protein data sets we employed a linear kernel to measure similarity between edge weights representing distances (in Ångströms) between secondary structure elements; since d = 1 we can use all our methods for unlabeled graphs here. [sent-633, score-0.592]

55 On the two chemical data sets we used a delta kernel to compare edge labels reflecting types of bonds in molecules; for the Sylvester equation and spectral decomposition approaches we then employed the nearest Kronecker product approximation. [sent-634, score-0.54]

56 Protein-Protein Interaction Networks. In our third experiment, we used random walk graph kernels to tackle a large-scale problem in bioinformatics involving the comparison of fairly large protein-protein interaction (PPI) networks. [sent-647, score-0.77]

57 For this purpose, a random walk graph kernel is the [sent-665, score-0.83]

58 natural choice, as a random walk in this graph represents a group of proteins in which consecutive proteins along the walk are co-expressed and interact. [sent-668, score-0.915]

59 Since existing graph kernels cannot take this into account, we propose to modify them appropriately. [sent-675, score-0.491]

60 is a graph over the same vertices but with complementary edges Ē. In other words, the complement graph consists of exactly those edges not present in the original graph. [sent-678, score-0.582]
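
A hedged sketch of the complement construction and of the composite kernel used below, assuming the composite kernel (38) simply adds the kernel on the complement graphs to the kernel on the original graphs; `base_kernel` is a placeholder for any of the random walk kernel routines sketched earlier.

```python
import numpy as np

def complement_adjacency(A):
    """Adjacency matrix of the complement graph: exactly those edges
    missing from the original graph (self-loops excluded)."""
    comp = (A == 0).astype(float)
    np.fill_diagonal(comp, 0.0)
    return comp

def composite_kernel(A, Ap, base_kernel, **kwargs):
    """k_comp(G, G') = k(G, G') + k(complement(G), complement(G'))."""
    return (base_kernel(A, Ap, **kwargs)
            + base_kernel(complement_adjacency(A),
                          complement_adjacency(Ap), **kwargs))
```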

61 Results. In Table 4 we contrast the CPU runtimes of our conjugate gradient and fixed-point approaches to graph kernel computation on the cancer patients modeled as labeled graphs with that of the direct sparse method. [sent-701, score-0.658]

62 Using either the “vanilla” graph kernel (16) or our composite graph kernel (38), we predict the survivors by means of a support vector machine (SVM) in 10-fold cross-validation. [sent-703, score-0.766]

63 The vanilla random walk graph kernel offers slightly higher prediction accuracy than the baseline classifier on one task (Leukemia: 59. [sent-704, score-0.692]

64 As discussed in Section 3, random walk graph kernels have a very different basis: They compute the similarity between two random graphs by matching random walks. [sent-721, score-0.984]

65 Despite these differences we find rational kernels and random walk graph kernels to be closely related. [sent-724, score-1.138]

66 To understand the connection recall that every random walk on a labeled graph produces a sequence of edge labels encountered during the walk. [sent-725, score-0.648]

67 Viewing the set of all label sequences generated by random walks on a graph as a language, one can design a weighted transducer which accepts this language, with the weight assigned to each label sequence being the probability of a random walk generating this sequence. [sent-726, score-0.766]

68 (This transducer can be represented by a graph whose adjacency matrix is the normalized weight matrix of the original graph. [sent-727, score-0.512]

69 ) In this section we formalize this observation and thus establish connections between rational kernels on transducers (Cortes et al. [sent-728, score-0.454]

70 In particular, we show that composition of transducers is analogous to computing product graphs, and that rational kernels on weighted transducers may be viewed as generalizations of random walk graph kernels to weighted automata. [sent-730, score-1.486]

71 Not all semirings have such morphisms: For instance, the logarithmic semiring has a morphism—namely, the exponential function—but the tropical semiring does not have one. [sent-751, score-0.441]

72 Every random walk on a labeled graph results in a sequence of edge labels encountered during the walk. [sent-797, score-0.648]

73 Let the starting and stopping probabilities p and q on the graph equal those of the weighted automaton, and complete the construction by identifying for each l ∈ Σ the label-filtered adjacency matrix ^lA of the graph with H_l, the transition tensor of the weighted automaton for that symbol. [sent-803, score-0.798]

74 Under the above mapping (44) has a natural interpretation: The weight assigned by the automaton to a string of symbols is the probability of encountering the corresponding labels while performing a random walk on the corresponding labeled graph. [sent-804, score-0.459]

75 An unlabeled graph corresponds to a weighted automaton whose input-output alphabet contains exactly one symbol, and which therefore only accepts strings of the form a^k = aa⋯a. [sent-806, score-0.43]

76 rational kernels is to replace T in (46) by T ◦T −1 , and let ψ be a semiring morphism. [sent-828, score-0.53]

77 Recovering Random Walk Graph Kernels. In order to recover random walk graph kernels we use the standard (R, +, ·, 0, 1) ring as our semiring, and hence set ψ to be the identity function. [sent-875, score-0.77]

78 Letting p× = p ⊗ p′ and q× = q ⊗ q′ on the direct product of G and G′, (55) reduces to k(G, G′) = ∑_k µ(k) q×^⊤ H×^k p×, (56) which recovers the random walk graph kernel (8) with W× = H×. [sent-908, score-0.764]

79 There is one important difference between graph kernels and rational kernels. [sent-925, score-0.583]

80 Graph kernels can handle arbitrary edge kernels, including continuous edge labels via the weight matrix W× . [sent-926, score-0.547]

81 R-convolution Kernels. Haussler’s (1999) R-convolution kernels provide a generic way to construct kernels for discrete compound objects. [sent-931, score-0.552]

82 Graph Kernels as R-Convolutions. To apply R-convolution kernels to graphs, one decomposes the graph into smaller substructures, and builds the kernel based on similarities between those components. [sent-955, score-0.659]

83 Most graph kernels are—knowingly or not—based on R-convolutions; they mainly differ in the way they decompose the graph for comparison and the similarity measure they use to compare the components. [sent-956, score-0.75]

84 Random walks provide a straightforward graph decomposition that—as we have seen in Section 4—leads to kernels that can be computed efficiently. [sent-967, score-0.644]

85 Finally, note that by definition of our edge kernel κ, only pairs of sequences that are both actual walks on their respective graphs will make a non-zero contribution to (60). [sent-977, score-0.535]

86 Random walk graph kernels as proposed by Gärtner et al. [sent-978, score-0.859]

87 Their kernel is plagued by computational issues; in fact they show that computing the cyclic pattern kernel of a general graph is NP-hard. [sent-989, score-0.551]

88 They represent a graph G = (V, E) by a complete graph S = (V, Ē) over the same vertices, wherein the weight of each edge in Ē equals the length of the shortest path between the corresponding nodes in G. [sent-1000, score-0.593]
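
A hedged sketch of this shortest-path transformation using SciPy's all-pairs shortest paths (the function name is ours; zeros in the input adjacency matrix are treated as missing edges):

```python
from scipy.sparse.csgraph import shortest_path

def shortest_path_graph(A):
    """Complete graph over the same vertices whose edge weights are the
    shortest-path distances in the original graph (Floyd-Warshall)."""
    return shortest_path(A, method="FW", directed=False)
```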

89 The optimal assignment graph kernel of Fröhlich et al. [sent-1011, score-0.455]

90 Discussion and Outlook. As evidenced by the large number of recent papers, random walk and marginalized graph kernels have received considerable research attention. [sent-1052, score-0.823]

91 Our aim in presenting a unified framework for random walk and marginalized graph kernels that combines the best features of previous formulations is to highlight the similarities as well as the differences between these approaches. [sent-1055, score-0.823]

92 Now that the computation of random walk graph kernels is viable for practical problem sizes, it will open the doors for their application in hitherto unexplored domains. [sent-1063, score-0.77]

93 A major deficiency of geometric random walk graph kernels is that the admissible range of values of the decay parameter λ in (16) depends on the spectrum of the weight matrix W× . [sent-1064, score-0.84]

94 In fact in many applications a naive kernel which simply computes the average kernel between all pairs of edges in the two graphs has performance comparable to the random walk graph kernel. [sent-1066, score-1.051]
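
A hedged sketch of that naive baseline, here with a linear kernel on scalar edge weights (any edge kernel could be substituted; the function name is ours):

```python
import numpy as np

def naive_mean_edge_kernel(edge_weights_G, edge_weights_Gp):
    """Average a linear edge kernel over all pairs of edges of G and G'."""
    return float(np.mean(np.outer(edge_weights_G, edge_weights_Gp)))
```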

95 A natural question to ask is the following: Since diffusion can be viewed as a continuous time limit of random walks, can the ideas behind the random walk kernel be extended to diffusion? [sent-1081, score-0.447]

96 Although rational kernels have always been viewed as distinct from graph kernels, we have shown that in fact these two research areas are closely related. [sent-1083, score-0.583]

97 It is fair to say that R-convolution kernels are the mother of all kernels on structured data. [sent-1086, score-0.552]

98 It is enlightening to view various graph kernels as instances of R-convolution kernels since this brings into focus the relevant decomposition used to define a given kernel, and the similarities and differences between various kernels. [sent-1087, score-0.803]

99 The algorithmic challenge of the future is to integrate higher-order structures such as spanning trees in graph comparisons, and to compute such kernels efficiently. [sent-1096, score-0.491]

100 Cyclic pattern kernels for predictive graph mining. [sent-1316, score-0.491]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vec', 0.435), ('walk', 0.279), ('kernels', 0.276), ('graph', 0.215), ('sylvester', 0.191), ('kondor', 0.184), ('graphs', 0.17), ('kernel', 0.168), ('orgwardt', 0.162), ('semiring', 0.162), ('kronecker', 0.155), ('chraudolph', 0.138), ('ishwanathan', 0.125), ('walks', 0.117), ('borgwardt', 0.104), ('ernels', 0.102), ('raph', 0.101), ('adjacency', 0.1), ('rkhs', 0.096), ('morphism', 0.092), ('rational', 0.092), ('ha', 0.089), ('rtner', 0.089), ('semirings', 0.086), ('transducer', 0.086), ('transducers', 0.086), ('ia', 0.083), ('edge', 0.08), ('spectral', 0.079), ('automaton', 0.077), ('proteins', 0.071), ('kashima', 0.071), ('protein', 0.069), ('product', 0.066), ('bkl', 0.066), ('automata', 0.064), ('vi', 0.062), ('unlabeled', 0.061), ('karsten', 0.059), ('enzyme', 0.059), ('nodes', 0.054), ('cg', 0.054), ('cortes', 0.054), ('marginalized', 0.053), ('ppi', 0.053), ('edges', 0.051), ('vertices', 0.05), ('ib', 0.047), ('vertex', 0.046), ('bks', 0.046), ('mutag', 0.046), ('risi', 0.046), ('rual', 0.046), ('vik', 0.046), ('algebra', 0.045), ('similarity', 0.044), ('decompositions', 0.044), ('labels', 0.041), ('tensor', 0.041), ('chemical', 0.041), ('matrix', 0.041), ('weighted', 0.04), ('enzymes', 0.039), ('hlich', 0.039), ('mah', 0.039), ('ptc', 0.039), ('lemma', 0.038), ('strings', 0.037), ('ai', 0.037), ('conjugate', 0.036), ('decomposition', 0.036), ('direct', 0.036), ('dsl', 0.036), ('assignment', 0.033), ('vishwanathan', 0.033), ('hal', 0.033), ('molecules', 0.033), ('air', 0.033), ('labeled', 0.033), ('pi', 0.031), ('dlyap', 0.031), ('hak', 0.031), ('koji', 0.031), ('mik', 0.031), ('specializing', 0.031), ('tottering', 0.031), ('tropical', 0.031), ('cr', 0.031), ('solver', 0.03), ('matrices', 0.03), ('iteration', 0.03), ('vanilla', 0.03), ('mehryar', 0.03), ('composition', 0.03), ('node', 0.029), ('nearest', 0.029), ('transition', 0.029), ('dkl', 0.029), ('weight', 0.029), ('gene', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 44 jmlr-2010-Graph Kernels

Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt

Abstract: We present a unified framework to study graph kernels, special cases of which include the random walk (Gärtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mahé et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time complexity of kernel computation between unlabeled graphs with n vertices from O(n^6) to O(n^3). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn^3) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n^4) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n^2) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment kernel of Fröhlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks

2 0.21422091 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks

Author: Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani

Abstract: How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KRONFIT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KRONFIT finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synt

3 0.1751657 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi

Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k

4 0.14217377 15 jmlr-2010-Approximate Tree Kernels

Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller

Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels

5 0.12044659 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und

6 0.10447041 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

7 0.0723157 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

8 0.070746101 82 jmlr-2010-On Learning with Integral Operators

9 0.062590882 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

10 0.061964456 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

11 0.058461756 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

12 0.057687324 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

13 0.054091331 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

14 0.050461952 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

15 0.050305616 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

16 0.048168618 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

17 0.048122589 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

18 0.047467966 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

19 0.046626203 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

20 0.044435773 69 jmlr-2010-Lp-Nested Symmetric Distributions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.245), (1, -0.024), (2, -0.052), (3, 0.062), (4, -0.027), (5, 0.032), (6, -0.409), (7, -0.204), (8, -0.105), (9, 0.065), (10, -0.097), (11, -0.025), (12, 0.193), (13, -0.138), (14, 0.063), (15, 0.04), (16, 0.034), (17, 0.197), (18, 0.08), (19, 0.094), (20, -0.076), (21, -0.233), (22, -0.004), (23, -0.016), (24, -0.166), (25, -0.088), (26, 0.128), (27, 0.014), (28, 0.028), (29, 0.128), (30, -0.009), (31, 0.051), (32, 0.003), (33, 0.046), (34, -0.036), (35, 0.069), (36, 0.049), (37, -0.033), (38, 0.008), (39, -0.044), (40, -0.084), (41, 0.04), (42, -0.014), (43, 0.029), (44, 0.025), (45, 0.056), (46, 0.02), (47, 0.042), (48, -0.023), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96982628 44 jmlr-2010-Graph Kernels

Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt

Abstract: We present a unified framework to study graph kernels, special cases of which include the random a walk (G¨ rtner et al., 2003; Borgwardt et al., 2005) and marginalized (Kashima et al., 2003, 2004; Mah´ et al., 2004) graph kernels. Through reduction to a Sylvester equation we improve the time e complexity of kernel computation between unlabeled graphs with n vertices from O(n6 ) to O(n3 ). We find a spectral decomposition approach even more efficient when computing entire kernel matrices. For labeled graphs we develop conjugate gradient and fixed-point methods that take O(dn3 ) time per iteration, where d is the size of the label set. By extending the necessary linear algebra to Reproducing Kernel Hilbert Spaces (RKHS) we obtain the same result for d-dimensional edge kernels, and O(n4 ) in the infinite-dimensional case; on sparse graphs these algorithms only take O(n2 ) time per iteration in all cases. Experiments on graphs from bioinformatics and other application domains show that these techniques can speed up computation of the kernel by an order of magnitude or more. We also show that certain rational kernels (Cortes et al., 2002, 2003, 2004) when specialized to graphs reduce to our random walk graph kernel. Finally, we relate our framework to R-convolution kernels (Haussler, 1999) and provide a kernel that is close to the optimal assignment o kernel of Fr¨ hlich et al. (2006) yet provably positive semi-definite. Keywords: linear algebra in RKHS, Sylvester equations, spectral decomposition, bioinformatics, rational kernels, transducers, semirings, random walks

2 0.77957636 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks

Author: Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani

Abstract: How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present K RON F IT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, K RON F IT takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that K RON F IT finds accurate parameters that very well mimic the properties of target networks. In fact, using just c 2010 Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos and Zoubin Ghahramani. L ESKOVEC , C HAKRABARTI , K LEINBERG , FALOUTSOS AND G HAHRAMANI four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synt

3 0.54628038 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi

Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k

4 0.48557219 15 jmlr-2010-Approximate Tree Kernels

Author: Konrad Rieck, Tammo Krueger, Ulf Brefeld, Klaus-Robert Müller

Abstract: Convolution kernels for trees provide simple means for learning with tree-structured data. The computation time of tree kernels is quadratic in the size of the trees, since all pairs of nodes need to be compared. Thus, large parse trees, obtained from HTML documents or structured network data, render convolution kernels inapplicable. In this article, we propose an effective approximation technique for parse tree kernels. The approximate tree kernels (ATKs) limit kernel computation to a sparse subset of relevant subtrees and discard redundant structures, such that training and testing of kernel-based learning methods are significantly accelerated. We devise linear programming approaches for identifying such subsets for supervised and unsupervised learning tasks, respectively. Empirically, the approximate tree kernels attain run-time improvements up to three orders of magnitude while preserving the predictive accuracy of regular tree kernels. For unsupervised tasks, the approximate tree kernels even lead to more accurate predictions by identifying relevant dimensions in feature space. Keywords: tree kernels, approximation, kernel methods, convolution kernels

5 0.40412298 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und

6 0.38388228 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

7 0.37195405 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

8 0.28502142 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

9 0.27562574 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

10 0.27432668 82 jmlr-2010-On Learning with Integral Operators

11 0.25080904 64 jmlr-2010-Learning Non-Stationary Dynamic Bayesian Networks

12 0.22784001 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

13 0.22752142 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

14 0.2258199 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines

15 0.20351879 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

16 0.20114215 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

17 0.19844797 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

18 0.19355091 84 jmlr-2010-On Spectral Learning

19 0.19138624 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

20 0.18115045 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.021), (4, 0.019), (8, 0.02), (21, 0.019), (24, 0.418), (32, 0.042), (36, 0.043), (37, 0.049), (75, 0.175), (81, 0.012), (85, 0.056), (96, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86012107 8 jmlr-2010-A Surrogate Modeling and Adaptive Sampling Toolbox for Computer Based Design

Author: Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene, Karel Crombecq

Abstract: An exceedingly large number of scientific and engineering fields are confronted with the need for computer simulations to study complex, real world phenomena or solve challenging design problems. However, due to the computational cost of these high fidelity simulations, the use of neural networks, kernel methods, and other surrogate modeling techniques have become indispensable. Surrogate models are compact and cheap to evaluate, and have proven very useful for tasks such as optimization, design space exploration, prototyping, and sensitivity analysis. Consequently, in many fields there is great interest in tools and techniques that facilitate the construction of such regression models, while minimizing the computational cost and maximizing model accuracy. This paper presents a mature, flexible, and adaptive machine learning toolkit for regression modeling and active learning to tackle these issues. The toolkit brings together algorithms for data fitting, model selection, sample selection (active learning), hyperparameter optimization, and distributed computing in order to empower a domain expert to efficiently generate an accurate model for the problem or data at hand. Keywords: surrogate modeling, metamodeling, function approximation, model selection, adaptive sampling, active learning, distributed computing 1. Background and Motivation In many science and engineering problems researchers make heavy use of computer simulation codes in order to replace expensive physical experiments and improve the quality and performance of engineered products and devices. Such simulation activities are collectively referred to as computational science/engineering. Unfortunately, while allowing scientists more flexibility to study phenomena under controlled conditions, computer simulations require a substantial investment of c 2010 Dirk Gorissen, Ivo Couckuyt, Piet Demeester, Tom Dhaene and Karel Crombecq. G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ computation time. One simulation may take many minutes, hours, days or even weeks, quickly rendering parameter studies impractical (Forrester et al., 2008; Simpson et al., 2008). Of the different ways to deal with this problem, this paper is concerned with the construction of simpler approximation models to predict the system performance and develop a relationship between the system inputs and outputs. When properly constructed, these approximation models mimic the behavior of the simulation accurately while being computationally cheap(er) to evaluate. Different approximation methods exist, each with their relative merits. This work concentrates on the use of data-driven, global approximations using compact surrogate models (also known as metamodels, replacement models, or response surface models). Examples include: rational functions, Kriging models, Artificial Neural Networks (ANN), splines, and Support Vector Machines (SVM). Once such a global approximation is available it is of great use for gaining insight into the behavior of the underlying system. The surrogate may be easily queried, optimized, visualized, and seamlessly integrated into CAD/CAE software packages. The challenge is thus how to generate an approximation model that is as accurate as possible over the complete domain of interest while minimizing the simulation cost. 
Solving this challenge involves multiple sub-problems that must be addressed: how to interface with the simulation code, how to run simulations (locally, or on a cluster or cloud), which model type to approximate the data with and how to set the model complexity (e.g., topology of a neural network), how to estimate the model quality and ensure the domain expert trusts the model, how to decide which simulations to run (data collection), etc. The data collection aspect is worth emphasizing. Since data is computationally expensive to obtain and the optimal data distribution is not known up front, data points should be selected iteratively, there where the information gain will be the greatest. A sampling function is needed that minimizes the number of sample points selected in each iteration, yet maximizes the information gain of each iteration step. This process is called adaptive sampling but is also known as active learning, or sequential design. There is a complex dependency web between these different options and dealing with these dependencies is non-trivial, particularly for a domain expert for whom the surrogate model is just an intermediate step towards solving a larger, more important problem. Few domain experts will be experts in the intricacies of efficient sampling and modeling strategies. Their primary concern is obtaining an accurate replacement metamodel for their problem as fast as possible and with minimal overhead (Gorissen et al., 2009d). As a result these choices are often made in a pragmatic, sometimes even ad-hoc, manner. This paper discusses an advanced, and integrated software framework that provides a flexible and rigorous means to tackle such problems. This work lies at the intersection of Machine Learning/AI, Modeling and Simulation, and Distributed Computing. The methods developed are applicable to any domain where a cheap, accurate, approximation is needed to replace some expensive reference model. Our experience has been that the availability of such a framework can facilitate the transfer of knowledge from surrogate modeling researchers and lower the barrier of entry for domain experts. 2. SUMO Toolbox The platform in question is the Matlab SUrrogate MOdeling (SUMO) Toolbox, illustrated in Figure 1. Given a simulation engine (Fluent, Cadence, Abaqus, HFSS, etc.) or other data source (data 2052 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN Figure 1: The SUMO Toolbox is a flexible framework for accurate global surrogate modeling and adaptive sampling (active learning). It features a rich set of plugins, is applicable to a wide range of domains, and can be applied in an autonomous, black-box fashion, or under full manual control. Written in Matlab and Java it is fully cross platform and comes with a large (60+) number of example problems. set, Matlab script, Java class, etc.), the toolbox drives the data source to produce a surrogate model within the time and accuracy constraints set by the user. 
The SUMO Toolbox adopts a microkernel design philosophy with many different plugins available for each of the different sub-problems:1 model types (rational functions, Kriging, splines, SVM, ANN, etc.), hyperparameter optimization algorithms (Particle Swarm Optimization, Efficient Global Optimization, simulated annealing, Genetic Algorithm, etc.), model selection algorithms (cross validation, AIC, Leave-out set, etc.), sample selection (random, error based, density based, hybrid, etc.), Design of Experiments (Latin hypercube, Box-Bhenken, etc.), and sample evaluation methods (local, on a cluster or grid). The behavior of each software component is configurable through a central XML file and components can easily be added, removed or replaced by custom implementations. In addition the toolbox provides ‘meta’ plugins. For example to automatically select the best model type for a given problem (Gorissen et al., 2009d) or to use multiple model selection or sample selection criteria in concert (Gorissen et al., 2010). Furthermore, there is built-in support for high performance computing. On the modeling side, the model generation process can take full advantage of multi-core CPUs and even of a complete cluster or grid. This can result in significant speedups for model types where the fitting process can be expensive (e.g., neural networks). Likewise, sample evaluation (simulation) can occur locally (with the option to take advantage of multi-core architectures) or on a separate compute cluster or grid (possibly accessed through a remote head-node). All interfacing with the grid middleware 1. The full list of plugins and features can be found at http://www.sumowiki.intec.ugent.be. 2053 G ORISSEN , C OUCKUYT, D EMEESTER , D HAENE AND C ROMBECQ (submission, job monitoring, rescheduling of failed/lost simulation points, etc.) is handled transparently and automatically (see Gorissen et al., 2009c for more details). Also, the sample evaluation component runs in parallel with the other components (non-blocking) and not sequentially. This allows for an optimal use of computational resources. In addition the SUMO Toolbox contains extensive logging and profiling capabilities so that the modeling process can easily be tracked and the modeling decisions understood. Once a final model has been generated, a GUI tool is available to visually explore the model (including derivatives and prediction uncertainty), assess its quality, and export it for use in other software tools. 3. Applications The SUMO Toolbox has already been applied successfully to a very wide range of applications, including RF circuit block modeling (Gorissen et al., 2009b), hydrological modeling (Couckuyt et al., 2009), Electronic Packaging (Zhu and Franzon, 2009), aerodynamic modeling (Gorissen et al., 2009a), process engineering (Stephens et al., 2009), and automotive data modeling (Gorissen et al., 2010). Besides global modeling capabilities, the SUMO Toolbox also includes a powerful optimization framework based on the Efficient Global Optimization framework developed by Jones et al. (1998). As of version 6.1, the toolbox also contains an example of how the framework can also be applied to solve classification problems. In sum, the goal of the toolbox is to fill the void in machine learning software when it comes to the challenging, costly, real-valued, problems faced in computational engineering. 
The toolbox is in use successfully at various institutions and we are continuously refining and extending the set of available plugins as the number of applications increase. Usage instructions, design documentation, and stable releases for all major platforms can be found at http://www.sumo.intec.ugent.be. References I. Couckuyt, D. Gorissen, H. Rouhani, E. Laermans, and T. Dhaene. Evolutionary regression modeling with active learning: An application to rainfall runoff modeling. In International Conference on Adaptive and Natural Computing Algorithms, volume LNCS 5495, pages 548–558, Sep. 2009. A. Forrester, A. Sobester, and A. Keane. Engineering Design Via Surrogate Modelling: A Practical Guide. Wiley, 2008. D. Gorissen, K. Crombecq, I. Couckuyt, and T. Dhaene. Foundations of Computational Intelligence, Volume 1: Learning and Approximation: Theoretical Foundations and Applications, volume 201, chapter Automatic approximation of expensive functions with active learning, pages 35–62. Springer Verlag, Series Studies in Computational Intelligence, 2009a. D. Gorissen, L. De Tommasi, K. Crombecq, and T. Dhaene. Sequential modeling of a low noise amplifier with neural networks and active learning. Neural Computing and Applications, 18(5): 485–494, Jun. 2009b. D. Gorissen, T. Dhaene, P. Demeester, and J. Broeckhove. Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications, chapter Grid enabled surrogate modeling, pages 249–258. IGI Global, May 2009c. 2054 A S URROGATE M ODELING AND A DAPTIVE S AMPLING TOOLBOX FOR C OMPUTER BASED D ESIGN D. Gorissen, T. Dhaene, and F. DeTurck. Evolutionary model type selection for global surrogate modeling. Journal of Machine Learning Research, 10:2039–2078, 2009d. D. Gorissen, I. Couckuyt, E. Laermans, and T. Dhaene. Multiobjective global surrogate modeling,dealing with the 5-percent problem. Engineering with Computers, 26(1):81–89, Jan. 2010. D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, Nov. 1998. ISSN 0925-5001. T. W. Simpson, V. Toropov, V. Balabanov, and F. A. C. Viana. Design and analysis of computer experiments in multidisciplinary design optimization: a review of how far we have come or not. In Proceedings of the 12th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, 2008 MAO, Victoria, Canada, 2008. D.W. Stephens, D. Gorissen, and T. Dhaene. Surrogate based sensitivity analysis of process equipment. In Proc. of 7th International Conference on CFD in the Minerals and Process Industries, CSIRO, Melbourne, Australia, Dec. 2009. T. Zhu and P. D. Franzon. Application of surrogate modeling to generate compact and PVT-sensitive IBIS models. In Proceedings of the 18th Conference on Electrical Performance of Electronic Packaging and Systems (EPEPS), Oct. 2009. 2055

same-paper 2 0.75754857 44 jmlr-2010-Graph Kernels

Author: S.V.N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, Karsten M. Borgwardt

3 0.45908165 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks

Author: Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani

Abstract: How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in- and out-degree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a non-standard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KronFit, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take super-exponential time. In contrast, KronFit takes linear time, by exploiting the structure of Kronecker matrix multiplication and by using statistical simulation techniques. Experiments on a wide range of large real and synthetic networks show that KronFit finds accurate parameters that very well mimic the properties of target networks. In fact, using just four parameters we can accurately model several aspects of global network structure. Once fitted, the model parameters can be used to gain insights about the network structure, and the resulting synt
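As a concrete illustration of the construction described in this abstract, below is a minimal Python sketch (an assumption for illustration, not code from the paper) that forms the k-fold Kronecker power of a small initiator matrix of edge probabilities and samples each edge independently. The initiator values and function names are hypothetical, and the naive O(N^2) sampling ignores the paper's fast fitting and generation procedures.

    # Minimal sketch of stochastic Kronecker graph generation: Kronecker power of
    # a small initiator probability matrix, then independent edge sampling.
    # Illustrative assumptions only; not the paper's KronFit or fast sampler.
    import numpy as np

    def kronecker_edge_probabilities(initiator, k):
        """Return the k-fold Kronecker power of the initiator probability matrix."""
        P = initiator
        for _ in range(k - 1):
            P = np.kron(P, initiator)
        return P

    def sample_kronecker_graph(initiator, k, rng=None):
        """Sample an adjacency matrix with each edge present independently."""
        if rng is None:
            rng = np.random.default_rng(0)
        P = kronecker_edge_probabilities(initiator, k)
        return (rng.random(P.shape) < P).astype(int)

    initiator = np.array([[0.9, 0.5],
                          [0.5, 0.3]])          # 2x2 initiator, illustrative values
    A = sample_kronecker_graph(initiator, k=8)  # 2^8 = 256 nodes
    print(A.shape, "edges:", A.sum())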

4 0.44842502 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

5 0.4457112 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

Author: Giovanni Cavallanti, Nicolò Cesa-Bianchi, Claudio Gentile

Abstract: We introduce new Perceptron-based algorithms for the online multitask binary classification problem. Under suitable regularity conditions, our algorithms are shown to improve on their baselines by a factor proportional to the number of tasks. We achieve these improvements using various types of regularization that bias our algorithms towards specific notions of task relatedness. More specifically, similarity among tasks is either measured in terms of the geometric closeness of the task reference vectors or as a function of the dimension of their spanned subspace. In addition to adapting to the online setting a mix of known techniques, such as the multitask kernels of Evgeniou et al., our analysis also introduces a matrix-based multitask extension of the p-norm Perceptron, which is used to implement spectral co-regularization. Experiments on real-world data sets complement and support our theoretical findings. Keywords: mistake bounds, perceptron algorithm, multitask learning, spectral regularization
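To give a flavor of the joint updates this abstract refers to, below is a minimal Python sketch of a Perceptron that shares updates across tasks through a simple interaction matrix. The particular matrix, the synthetic data, and all names are illustrative assumptions and do not reproduce the paper's exact algorithms or their regularity conditions.

    # Minimal sketch: a Perceptron whose mistake-driven updates are shared across
    # tasks via an interaction matrix. Illustrative only; not the paper's algorithm.
    import numpy as np

    def multitask_perceptron(stream, n_tasks, dim, relatedness=0.5):
        """stream yields (task_id, x, y) triples with y in {-1, +1}."""
        # Larger weight for the observed task, a smaller shared weight for the others.
        interaction = relatedness / n_tasks * np.ones((n_tasks, n_tasks))
        interaction += (1.0 - relatedness) * np.eye(n_tasks)
        W = np.zeros((n_tasks, dim))
        mistakes = 0
        for task, x, y in stream:
            if y * (W[task] @ x) <= 0:                      # prediction mistake
                mistakes += 1
                W += np.outer(interaction[:, task], y * x)  # joint update of all tasks
        return W, mistakes

    # Tiny synthetic experiment: related tasks share a common component.
    rng = np.random.default_rng(0)
    dim, n_tasks, T = 20, 4, 2000
    common = rng.normal(size=dim)
    targets = [common + 0.3 * rng.normal(size=dim) for _ in range(n_tasks)]

    def stream():
        for _ in range(T):
            task = rng.integers(n_tasks)
            x = rng.normal(size=dim)
            yield task, x, np.sign(targets[task] @ x) or 1.0

    W, mistakes = multitask_perceptron(stream(), n_tasks, dim)
    print("online mistakes:", mistakes)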

6 0.44543344 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

7 0.44476897 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression

8 0.44047302 82 jmlr-2010-On Learning with Integral Operators

9 0.43523127 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

10 0.43189469 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

11 0.43038079 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

12 0.42809942 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

13 0.42656571 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure

14 0.42552143 48 jmlr-2010-How to Explain Individual Classification Decisions

15 0.42518523 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

16 0.42383194 63 jmlr-2010-Learning Instance-Specific Predictive Models

17 0.42360696 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

18 0.42327556 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

19 0.42320877 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

20 0.42255679 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes