jmlr jmlr2011 jmlr2011-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm. © 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt.
Reference: text
sentIndex sentText sentNum sentScore
1 Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. [sent-23, score-0.611]
2 In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. [sent-24, score-0.718]
3 Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. [sent-25, score-0.567]
4 There exist many graph similarity measures based on graph isomorphism or related concepts such as subgraph isomorphism or the largest common subgraph. [sent-37, score-0.571]
5 Subgraph isomorphism checking is the analogue of graph isomorphism checking in a setting in which the two graphs have different sizes. [sent-42, score-0.537]
6 Besides being computationally expensive or even intractable, similarity measures based on graph isomorphism and its variants are too restrictive in the sense that graphs have to be exactly identical or contain large identical subgraphs in order to be deemed similar by these measures. [sent-48, score-0.604]
7 However, this all subgraphs kernel is at least as hard to compute as deciding if graphs are isomorphic (Gärtner et al. [sent-67, score-0.735]
8 The height of a subtree is the maximum distance between the root and any other node in the subtree. [sent-80, score-0.532]
9 Just as the notion of walk extends the notion of path by allowing nodes to be equal, the notion of subtrees can be extended to subtree patterns (also called tree-walks, Bach, 2008), which can have nodes that are equal (see Figure 1). [sent-81, score-0.8]
10 Note that all subtree kernels compare subtree patterns in two graphs, not (strict) subtrees. [sent-83, score-0.877]
11 Several different graph kernels have been defined in machine learning which can be categorized into three classes: graph kernels based on walks (Kashima et al. [sent-84, score-0.77]
12 , 2009), and graph kernels based on subtree patterns (Ramon and Gärtner, 2003; Mahé and Vert, 2009). [sent-88, score-0.709]
13 The standard formulation of the random walk kernel, based on the direct product graph of two graphs, is computable in O(n^6) for a pair of graphs (Gärtner et al. [sent-91, score-0.709]
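To make the direct product construction concrete, here is a minimal Python sketch, not taken from the paper: product nodes are label-matching node pairs, and walks in the product graph correspond to common walks in the two input graphs. The (labels, edges) encoding of a graph is an assumption made for illustration.

```python
# Sketch of the direct product graph underlying the random walk kernel.
# labels: dict node -> label; edges: set of undirected pairs (u, v).
from itertools import product

def direct_product_graph(labels1, edges1, labels2, edges2):
    # Product nodes: pairs (u, v) whose labels match.
    nodes = [(u, v) for u, v in product(labels1, labels2)
             if labels1[u] == labels2[v]]
    index = {p: i for i, p in enumerate(nodes)}
    prod_edges = set()
    for (u, up) in edges1:
        for (v, vp) in edges2:
            # Undirected edges give two possible alignments of endpoints.
            for a, b in (((u, v), (up, vp)), ((u, vp), (up, v))):
                if a in index and b in index:
                    prod_edges.add((index[a], index[b]))
    return nodes, prod_edges
```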
14 Figure 1: A subtree pattern of height 2 rooted at the node 1. [sent-97, score-0.569]
15 The shortest path kernel by Borgwardt and Kriegel (2005) counts pairs of shortest paths having the same source and sink labels and the same length in two graphs. [sent-106, score-0.998]
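The counting behind the shortest path kernel can be sketched as follows, assuming an all-pairs distance map `dist` (node to node to length) has been precomputed, for example with Floyd-Warshall (a sketch of that step appears further below). The dictionary-based encoding and function names are illustrative, not the authors' code.

```python
from collections import Counter

def sp_features(labels, dist):
    # One feature per triplet (smaller endpoint label, larger endpoint
    # label, shortest path length); dist is a precomputed distance map.
    feats = Counter()
    nodes = list(labels)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if dist[u][v] != float("inf"):
                a, b = sorted((labels[u], labels[v]))
                feats[(a, b, dist[u][v])] += 1
    return feats

def sp_kernel(labels1, dist1, labels2, dist2):
    # Count matching (label, label, length) pairs across the two graphs.
    f1, f2 = sp_features(labels1, dist1), sp_features(labels2, dist2)
    return sum(f1[t] * f2[t] for t in f1 if t in f2)
```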
16 The second class, graph kernels based on limited-size subgraphs, includes kernels based on so-called graphlets, which represent graphs as counts of all types of subgraphs of size k ∈ {3, 4, 5}. [sent-108, score-0.878]
17 Computing this kernel for a general graph is unfortunately NP-hard; however, there exist special cases where the kernel can be efficiently computed. [sent-113, score-0.563]
18 The first kernel from the third class, subtree kernels, was defined by Ramon and Gärtner (2003). [sent-115, score-0.763]
19 Intuitively, to compare graphs G and G′, this kernel iteratively compares all matchings between neighbours of two nodes v from G and v′ from G′. [sent-116, score-0.594]
20 In other words, for all pairs of nodes v from G and v′ from G′ , it counts all pairs of matching substructures in subtree patterns rooted at v and v′ . [sent-117, score-0.572]
21 The runtime complexity of the subtree kernel for a data set of N graphs is O(N^2 n^2 h 4^d). [sent-118, score-0.931]
22 The subtree kernels by Mahé and Vert (2009) and Bach (2008) refine the Ramon-Gärtner kernel for applications in chemoinformatics and hand-written digit recognition. [sent-122, score-1.001]
23 It is a general limitation of all the aforementioned graph kernels that they scale poorly to large, labeled graphs with more than 100 nodes: In the worst case, none of them scales better than O(n^3). [sent-125, score-0.594]
24 We present a general definition of graph kernels that encompasses many previously known graph kernels, and whose instances are efficient to compute for both unlabeled and discretely labeled graphs with thousands of nodes. [sent-127, score-0.852]
25 In Section 3, we describe what we call the Weisfeiler-Lehman graphs and our proposed general graph kernels based on them, followed by some examples. [sent-131, score-0.551]
26 We report results on kernel computation runtime and classification accuracy on graph benchmark data sets. [sent-133, score-0.571]
27 2 Link with Subtree Patterns: Note that the compressed labels l_i(v) correspond to subtree patterns of height i rooted at v (see Figure 1 for an illustration of subtree patterns). [sent-186, score-0.977]
28 We then present three instances of this kernel, the Weisfeiler-Lehman subtree kernel (Section 3. [sent-191, score-0.525]
29 Definition 1 Define the Weisfeiler-Lehman graph at height i of the graph G = (V, E, ℓ) = (V, E, l_0) as the graph G_i = (V, E, l_i). [sent-200, score-0.59]
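A minimal sketch of the relabeling that generates this sequence, assuming graphs are given as adjacency dicts. Here `compress` plays the role of the injective label compression function; the hash-map implementation is an illustrative stand-in for the sorting-based compression described in the paper.

```python
# Weisfeiler-Lehman relabeling: each iteration maps every node to a
# compressed label for (its label, the sorted multiset of its
# neighbours' labels). adj: node -> list of neighbours.
def wl_sequence(adj, labels, h, compress):
    def code(sig):
        if sig not in compress:
            compress[sig] = len(compress)  # fresh label for a new signature
        return compress[sig]

    # l_0: inject the original labels into the compressed alphabet.
    cur = {v: code(("orig", labels[v])) for v in adj}
    seqs = [cur]
    for _ in range(h):
        prev = cur
        cur = {v: code((prev[v], tuple(sorted(prev[u] for u in adj[v]))))
               for v in adj}
        seqs.append(cur)
    return seqs  # labelings l_0, ..., l_h of G_0, ..., G_h
```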
30 This definition provides a framework for applying all graph kernels that take into account discrete node labels to different levels of the node-labeling of graphs, from the original labeling to more and more fine-grained labelings for growing h. [sent-227, score-0.55]
31 One example of such a base kernel is the shortest path kernel: As shortest paths in a graph G are the same as shortest paths in the corresponding Weisfeiler-Lehman graphs G_i, we can precompute them. [sent-231, score-1.545]
32 2 The Weisfeiler-Lehman Subtree Kernel In this section we present the Weisfeiler-Lehman subtree kernel (Shervashidze and Borgwardt, 2009), which is a natural instance of Definition 2. [sent-249, score-0.525]
33 The Weisfeiler-Lehman subtree kernel on two graphs G and G′ with h iterations is defined as: k_WLsubtree^(h)(G, G′) = ⟨φ_WLsubtree^(h)(G), φ_WLsubtree^(h)(G′)⟩, (2) where φ_WLsubtree^(h)(G) = (c_0(G, σ_01), . [sent-259, score-0.717]
34 That is, the Weisfeiler-Lehman subtree kernel counts common original and compressed labels in two graphs. [sent-278, score-0.707]
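A sketch of this count-and-dot-product view, reusing `wl_sequence` from the sketch above. Note that the compression dictionary must be shared between the two graphs so that identical subtree patterns receive identical compressed labels.

```python
from collections import Counter

def wl_subtree_kernel(adj1, labels1, adj2, labels2, h):
    compress = {}  # shared across both graphs
    phis = []
    for adj, lab in ((adj1, labels1), (adj2, labels2)):
        phi = Counter()
        for labeling in wl_sequence(adj, lab, h, compress):
            phi.update(labeling.values())  # count occurrences of each label
        phis.append(phi)
    # Dot product of the two (sparse) label-count vectors.
    return sum(phis[0][c] * phis[1][c] for c in phis[0] if c in phis[1])
```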
35 Theorem 5 The Weisfeiler-Lehman subtree kernel on a pair of graphs G and G′ can be computed in time O(hm). [sent-280, score-0.717]
36 Proof This follows directly from the definition of the Weisfeiler-Lehman subtree kernel and the runtime complexity of the Weisfeiler-Lehman test, as described in Section 2. [sent-281, score-0.739]
37 1 Computing the Weisfeiler-Lehman Subtree Kernel on Many Graphs: To compute the Weisfeiler-Lehman subtree kernel on N graphs, we propose Algorithm 2, which improves over the naive, N^2-fold application of the kernel from Definition 4. [sent-295, score-0.731]
38 Note that compressed labels denote subtree patterns: For instance, if a node has label 8, this means that there is a subtree pattern of height 1 rooted at this node, where the root has label 2 and its neighbours have labels 3 and 5. [sent-305, score-1.134]
39 Algorithm 2 One iteration of the Weisfeiler-Lehman subtree kernel computation on N graphs 1: Multiset-label determination • Assign a multiset-label M_i(v) to each node v in G which consists of the multiset {l_{i−1}(u) | u ∈ N(v)}. [sent-309, score-0.894]
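One global iteration in the spirit of Algorithm 2 can be sketched as follows; the paper's counting-sort-based label compression is replaced by a shared hash map here for brevity, which preserves the key point that identical multiset-labels across all N graphs receive identical new labels.

```python
# graphs: list of adjacency dicts; labelings: parallel list of
# node -> label dicts; compress: shared signature -> label dictionary.
def global_wl_iteration(graphs, labelings, compress):
    new_labelings = []
    for adj, lab in zip(graphs, labelings):
        new = {}
        for v in adj:
            sig = (lab[v], tuple(sorted(lab[u] for u in adj[v])))
            if sig not in compress:
                compress[sig] = len(compress)
            new[v] = compress[sig]
        new_labelings.append(new)
    return new_labelings
```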
40 Theorem 7 For N graphs, the Weisfeiler-Lehman subtree kernel with h iterations on all pairs of these graphs can be computed in O(Nhm + N^2 hn). [sent-314, score-0.717]
41 Proof Naive application of the kernel from Definition 4 for computing an N × N kernel matrix would require a runtime of O(N^2 hm). [sent-315, score-0.626]
42 To get all pairwise kernel values, we have to multiply all feature vectors, which requires a runtime of O(N^2 hn), as each graph G has at most hn non-zero entries in φ_WLsubtree^(h)(G). [sent-320, score-0.613]
43 While our Weisfeiler-Lehman subtree kernel matches neighbourhoods of nodes in a graph exactly, one could also think of other strategies of comparing node neighbourhoods, and still retain the favourable runtime of our graph kernel. [sent-323, score-1.35]
44 In research that was published in parallel to ours, Hido and Kashima (2009) present such an alternative kernel based on node neighbourhoods which uses hash functions and logical operations on bit-representations of node labels and which also scales linearly in the number of edges. [sent-324, score-0.599]
45 The first subtree kernel on graphs was defined by Ramon and Gärtner (2003). [sent-330, score-0.955]
46 Each element R of M(v, v′) is a set of pairs of nodes from the neighbourhoods of v ∈ V and v′ ∈ V′ such that nodes in each pair have identical labels and no node is contained in more than one pair. [sent-333, score-0.527]
47 The runtime complexity of the subtree kernel for a pair of graphs is O(n^2 h 4^d), including a comparison of all pairs of nodes (n^2), and a pairwise comparison of all matchings in their neighbourhoods in O(4^d), which is repeated in h iterations. [sent-337, score-1.232]
48 3 Link to the Weisfeiler-Lehman Subtree Kernel: The Weisfeiler-Lehman subtree kernel can be defined in a recursive fashion which elucidates its relation to the Ramon-Gärtner kernel. [sent-342, score-0.763]
49 Theorem 8 highlights the following differences between the Weisfeiler-Lehman and the Ramon-Gärtner subtree kernels: In Equation (4), Weisfeiler-Lehman considers all subtrees up to height h, whereas the Ramon-Gärtner kernel looks at subtrees of exactly height h. [sent-352, score-1.209]
50 In Equations (5) and (6), the Weisfeiler-Lehman subtree kernel checks whether the neighbourhoods of v and v′ match exactly, while the Ramon-Gärtner kernel considers all pairs of matching subsets of the neighbourhoods of v and v′ in Equation (3). [sent-353, score-1.155]
51 In the case of graphs with unweighted edges, we consider the base kernel that counts matching pairs of edges with identically labeled endpoints (incident nodes) in two graphs. [sent-358, score-0.573]
52 If the edges are weighted by a function w, the base kernel k_E can be defined as ∑_{e∈E} ∑_{e′∈E′} δ(a, a′) δ(b, b′) k_w(w(e), w(e′)), where k_w is a kernel comparing edge weights. [sent-361, score-0.698]
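A sketch of this weighted-edge base kernel; the Gaussian choice for k_w is an illustrative assumption, as the text leaves k_w open.

```python
import math

def edge_base_kernel(edges1, lab1, edges2, lab2, sigma=1.0):
    # edges: iterable of (u, v, weight); lab: node -> label.
    total = 0.0
    for (u, v, w) in edges1:
        a, b = sorted((lab1[u], lab1[v]))  # order the endpoint labels
        for (u2, v2, w2) in edges2:
            a2, b2 = sorted((lab2[u2], lab2[v2]))
            if (a, b) == (a2, b2):  # the two delta factors on labels
                # k_w: here a Gaussian on the edge-weight difference.
                total += math.exp(-(w - w2) ** 2 / (2 * sigma ** 2))
    return total
```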
53 4 The Weisfeiler-Lehman Shortest Path Kernel: Another example of the general Weisfeiler-Lehman kernels that we consider is the Weisfeiler-Lehman shortest path kernel. [sent-380, score-0.566]
54 Here we use a node-labeled shortest path kernel (Borgwardt and Kriegel, 2005) as the base kernel. [sent-381, score-0.594]
55 The base kernel counts matching triplets (a, b, p) in G and G′, where a, b ∈ Σ are ordered endpoint labels of a shortest path and p ∈ N_0 is the shortest path length. [sent-384, score-0.798]
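A sketch of the resulting Weisfeiler-Lehman shortest path kernel, reusing `wl_sequence` and `sp_features` from the sketches above: distances are computed once, since they do not change under relabeling, and only the endpoint labels vary across iterations.

```python
def wl_sp_kernel(adj1, labels1, dist1, adj2, labels2, dist2, h):
    # dist1/dist2: precomputed all-pairs distance maps (fixed across WL
    # iterations); the compression dictionary is shared between graphs.
    compress = {}
    seq1 = wl_sequence(adj1, labels1, h, compress)
    seq2 = wl_sequence(adj2, labels2, h, compress)
    total = 0
    for l1, l2 in zip(seq1, seq2):  # one base kernel per WL labeling
        f1, f2 = sp_features(l1, dist1), sp_features(l2, dist2)
        total += sum(f1[t] * f2[t] for t in f1 if t in f2)
    return total
```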
56 1 Note on Computational Complexity: Computing shortest paths between all pairs of nodes in a graph can be done in O(n^3) using the Floyd-Warshall algorithm. [sent-391, score-0.562]
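For completeness, a standard Floyd-Warshall sketch for this all-pairs step, assuming unweighted graphs (unit edge lengths).

```python
def floyd_warshall(adj):
    # adj: node -> list of neighbours; returns node -> node -> distance.
    nodes, INF = list(adj), float("inf")
    dist = {u: {v: 0 if u == v else (1 if v in adj[u] else INF)
                for v in nodes} for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist
```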
57 Denote the number of distinct shortest path lengths occurring in the data set of graphs as P. [sent-395, score-0.578]
58 Let us first consider the Dirac (δ) kernel on the shortest path lengths, which means that the similarity of two paths in two graphs equals 1 if they have exactly the same length and identically labeled endpoints and 0 otherwise. [sent-396, score-0.888]
59 Then, in iteration i of the Weisfeiler-Lehman relabeling, we can bound the number of features, triplets (a, b, p) where a, b ∈ Σ_i are ordered start and end node labels and p ∈ N_0 is the shortest path length, by |Σ_i|(|Σ_i|+1)/2 · P. [sent-397, score-0.549]
60 For each i ∈ {0, ..., h − 1}, if we compute the shortest path kernel by first explicitly computing φ_SP(G) for each G in the data set, the computation will get increasingly expensive in each iteration, as in the case of edge kernels (Section 3. [sent-401, score-0.835]
61 In this case, we can compute the kernel by comparing shortest path lengths pairwise in two graphs. [sent-404, score-0.634]
62 It will scale as O(n^4) for each pair of graphs as we have to compare all pairs of the O(n^2) shortest path lengths, and O(N^2 n^4) for the whole data set. [sent-406, score-0.55]
63 5 Other Weisfeiler-Lehman Kernels In a similar fashion, we can plug other base graph kernels into our Weisfeiler-Lehman graph kernel framework. [sent-408, score-0.746]
64 As node labels are the only aspect that differentiates Weisfeiler-Lehman graphs at different resolutions (determined by the number of iterations), a clear requirement for the Weisfeiler-Lehman kernel to make sense is that the base kernel exploit the labels on nodes. [sent-409, score-0.907]
65 A non-exhaustive list of possible base kernels not mentioned in previous sections includes the labeled version of the graphlet kernel (Shervashidze et al. [sent-410, score-0.636]
66 , 2010), and the subtree kernel by Ramon and Gärtner (2003). [sent-413, score-0.763]
67 Experiments In this section, we first empirically study the runtime behaviour of the Weisfeiler-Lehman subtree kernel on synthetic graphs (Section 4. [sent-415, score-0.931]
68 Next, we compare the Weisfeiler-Lehman subtree kernel, the Weisfeiler-Lehman edge kernel, and the Weisfeiler-Lehman shortest path kernel to state-of-the-art graph kernels in terms of kernel computation runtime and classification accuracy on graph benchmark data sets (Section 4. [sent-417, score-1.876]
69 1 Runtime Behaviour of Weisfeiler-Lehman Subtree Kernel Here we experimentally examine the runtime performance of the Weisfeiler-Lehman subtree kernel. [sent-420, score-0.533]
70 1 Methods: We empirically compared the runtime behaviour of our two variants of the Weisfeiler-Lehman subtree (WL) kernel. [sent-423, score-0.533]
71 2 Experimental Setup: We assessed the behaviour on randomly generated graphs with respect to four parameters: data set size N, graph size n, subtree height h and graph density c. [sent-440, score-0.917]
72 When varying the number of nodes n per graph, we observe that the runtime of both WL kernels scales quadratically with n, and the global WL is much faster than the pairwise WL for large graphs. [sent-465, score-0.571]
73 Varying the graph density c, both methods show again a linearly increasing runtime, although the runtime of the global WL kernel is much lower than the runtime of the pairwise WL. [sent-470, score-0.827]
74 Hence the global WL kernel is the variant of our Weisfeiler-Lehman subtree kernel that we use on the following graph classification tasks. [sent-475, score-0.882]
75 2 Graph Classification We compared the performance of the WL subtree kernel, the WL edge kernel and the WL shortest path kernel to several other state-of-the-art graph kernels in terms of runtime and classification accuracy on graph benchmark data sets. [sent-477, score-1.876]
76 2 Experimental Setup: On these data sets, we compared our Weisfeiler-Lehman subtree, Weisfeiler-Lehman edge, and Weisfeiler-Lehman shortest path kernels to the Ramon-Gärtner kernel (λ = 1), as well as to several state-of-the-art graph kernels for large graphs. [sent-504, score-1.369]
77 Due to the large number of graph kernels in the literature, we could not compare to every single graph kernel, but to representative instances of the major families of graph kernels. [sent-505, score-0.661]
78 From the family of kernels based on walks, we compared our new kernels to the fast geometric random walk kernel by Vishwanathan et al. [sent-540, score-0.75]
79 (2010) that counts common labeled walks, and to the p-random walk kernel that compares random walks up to length p in two graphs (a special case of random walk kernels; Kashima et al. [sent-541, score-1.005]
80 From the family of kernels based on limited-size subgraphs, we chose an extension of the graphlet kernel by Shervashidze et al. [sent-544, score-0.563]
81 From the family of kernels based on paths, we compared to the shortest path kernel by Borgwardt and Kriegel (2005) that counts pairs of labeled nodes with identical shortest path length. [sent-546, score-1.357]
82 Moreover, we used the Dirac kernel on shortest path distances. [sent-553, score-0.564]
83 This allowed us to explicitly compute the feature map corresponding to the shortest path kernel for each graph in all data sets. [sent-554, score-0.715]
84 Proceeding in the same fashion as in the case of the Weisfeiler-Lehman subtree kernel, we computed the Ramon-Gärtner subtree and Weisfeiler-Lehman shortest path kernels for h ∈ {0, 1, 2} and the p-random walk kernel for p ∈ {1, . [sent-572, score-1.776]
85 3 Results: In terms of runtime, the Weisfeiler-Lehman subtree kernel could easily scale up even to graphs with thousands of nodes. [sent-583, score-0.717]
86 The fact that the random walk kernel was competitive on the smallest of our data sets, MUTAG, is not surprising, as on this data set one could also afford using kernels with exponential runtime, such as the all paths kernel (Gärtner et al. [sent-588, score-1.072]
87 The graphlet kernel was faster than our WL subtree kernel on MUTAG and [Table 1 residue: classification accuracies of the WL subtree, WL edge, WL shortest path, Ramon & Gärtner, p-random walk, random walk, graphlet count, and shortest path kernels on each data set; the row values were lost in extraction]. [sent-590, score-1.089]
88 [Table 2 residue: data set statistics (maximum # nodes, average # nodes, # labels, number of graphs) and per-method rows for the same eight kernels; only fragments survive, e.g., MUTAG with a maximum of 28 nodes]. [sent-672, score-1.82]
89 Table 2: CPU runtime for kernel computation on graph classification benchmark data sets. On NCI1, NCI109, ENZYMES and D&D, the kernels from the Weisfeiler-Lehman framework reached the highest accuracy. [sent-677, score-1.104]
90 While on NCI1, NCI109, and D&D the results of all three WL kernels were competitive with each other, on ENZYMES the WL shortest path kernel dramatically improved over the other two WL kernels. [sent-678, score-0.804]
91 On D&D the shortest path and graphlet kernels yielded similarly good results, while on NCI1 and NCI109 the Weisfeiler-Lehman subtree kernel improved by more than 8% over the best accuracy attained by other methods. [sent-679, score-1.24]
92 The random walk and the p-random walk kernels, as well as the Ramon-Gärtner kernel, were less competitive than the kernels that performed best on data sets other than MUTAG. [sent-682, score-0.734]
93 To summarize, the WL subtree kernel turned out to be competitive in terms of runtime on all smaller data sets, fastest on the large protein data set, and its accuracy levels were competitive on all data sets. [sent-684, score-0.844]
94 The WL edge kernel performed slightly better than the WL subtree kernel on three out of five data sets in terms of accuracy. [sent-685, score-0.794]
95 The WL shortest path kernel achieved the highest accuracy level on two out of five data sets, and was competitive on the remaining data sets. [sent-686, score-0.596]
96 Conclusions We have defined a general framework for constructing graph kernels on graphs with unlabeled or discretely labeled nodes. [sent-688, score-0.594]
97 Instances of our framework include a fast subtree kernel that combines scalability with the ability to deal with node labels. [sent-689, score-0.634]
98 Moreover, in terms of runtime on large graphs, instances of our kernel outperform other kernels, even the efficient computation schemes for random walk kernels (Vishwanathan et al. [sent-691, score-0.756]
99 Our new kernels open the door to applications of graph kernels on large graphs in bioinformatics, for instance, protein function prediction via detailed graph models of protein structure on the amino acid level, or on gene networks for phenotype prediction. [sent-694, score-0.992]
100 An exciting algorithmic question for further studies will be to consider kernels on graphs with continuous or high-dimensional node labels and their efficient computation. [sent-695, score-0.591]
wordName wordTfidf (topN-words)
[('subtree', 0.319), ('wl', 0.279), ('shortest', 0.25), ('rtner', 0.238), ('runtime', 0.214), ('kernels', 0.208), ('kernel', 0.206), ('graphs', 0.192), ('lsubtree', 0.178), ('graph', 0.151), ('graphlet', 0.149), ('kw', 0.139), ('ehman', 0.139), ('eisfeiler', 0.139), ('walk', 0.128), ('mutag', 0.119), ('shervashidze', 0.119), ('enzymes', 0.11), ('node', 0.109), ('chweitzer', 0.109), ('eeuwen', 0.109), ('ehlhorn', 0.109), ('hervashidze', 0.109), ('orgwardt', 0.109), ('path', 0.108), ('nodes', 0.107), ('height', 0.104), ('isomorphism', 0.097), ('neighbourhoods', 0.093), ('ernels', 0.084), ('labels', 0.082), ('borgwardt', 0.077), ('multisets', 0.076), ('subgraphs', 0.071), ('raph', 0.071), ('mah', 0.069), ('multiset', 0.068), ('ramon', 0.067), ('days', 0.065), ('edge', 0.063), ('gh', 0.059), ('krec', 0.059), ('ksp', 0.059), ('matchings', 0.059), ('si', 0.055), ('paths', 0.054), ('edges', 0.054), ('vishwanathan', 0.053), ('compressed', 0.052), ('walks', 0.052), ('strings', 0.052), ('sort', 0.051), ('relabeling', 0.051), ('mehlhorn', 0.05), ('counts', 0.048), ('labeled', 0.043), ('kondor', 0.042), ('pairwise', 0.042), ('alphabet', 0.042), ('string', 0.042), ('vert', 0.042), ('protein', 0.041), ('subgraph', 0.04), ('nhm', 0.04), ('weisfeilerlehman', 0.04), ('wlsubtree', 0.04), ('ke', 0.039), ('chemical', 0.038), ('kashima', 0.038), ('gi', 0.037), ('van', 0.037), ('sorting', 0.037), ('rooted', 0.037), ('ch', 0.037), ('similarity', 0.035), ('kriegel', 0.034), ('li', 0.033), ('hm', 0.032), ('competitive', 0.032), ('patterns', 0.031), ('sp', 0.03), ('neighbours', 0.03), ('mpg', 0.03), ('enzyme', 0.03), ('base', 0.03), ('bunke', 0.03), ('chemoinformatics', 0.03), ('graphlets', 0.03), ('horv', 0.03), ('karsten', 0.03), ('leeuwen', 0.03), ('nino', 0.03), ('ote', 0.03), ('substructures', 0.03), ('ubtree', 0.03), ('identical', 0.029), ('ascending', 0.028), ('garey', 0.028), ('isomorphic', 0.028), ('lengths', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999917 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm. © 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt.
2 0.090524085 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
3 0.084254041 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞-norm MKL in various scenarios. Importantly, empirical applications of ℓp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity. ∗. Also at Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany. †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 Tübingen, Germany. ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. © 2011 Marius Kloft, Ulf Brefeld, Sören Sonnenburg and Alexander Zien.
4 0.084159605 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
5 0.077298492 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning. © 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky.
6 0.071035884 35 jmlr-2011-Forest Density Estimation
7 0.070719987 66 jmlr-2011-Multiple Kernel Learning Algorithms
8 0.05902731 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
9 0.055166766 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
10 0.052883878 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
11 0.048634503 101 jmlr-2011-Variable Sparsity Kernel Learning
12 0.048565146 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.045624994 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
14 0.044815172 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
15 0.043984327 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
16 0.04345401 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
17 0.041106988 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
18 0.03879365 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
19 0.038355343 104 jmlr-2011-X-Armed Bandits
topicId topicWeight
[(0, 0.192), (1, -0.077), (2, 0.089), (3, -0.198), (4, 0.06), (5, 0.03), (6, -0.052), (7, 0.006), (8, 0.02), (9, 0.057), (10, -0.124), (11, 0.154), (12, -0.048), (13, -0.079), (14, 0.061), (15, -0.229), (16, 0.04), (17, 0.077), (18, 0.017), (19, -0.074), (20, -0.062), (21, 0.013), (22, 0.019), (23, 0.18), (24, -0.039), (25, -0.011), (26, 0.003), (27, -0.084), (28, -0.118), (29, 0.024), (30, -0.05), (31, -0.133), (32, -0.124), (33, -0.152), (34, -0.018), (35, 0.032), (36, -0.083), (37, 0.006), (38, 0.013), (39, 0.009), (40, 0.026), (41, -0.025), (42, 0.046), (43, -0.063), (44, 0.009), (45, -0.088), (46, -0.001), (47, 0.002), (48, -0.071), (49, 0.217)]
simIndex simValue paperId paperTitle
same-paper 1 0.96853864 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm. © 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt.
2 0.54906762 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
3 0.52367973 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
4 0.4156315 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
Author: Jianxin Wu, Wei-Chian Tan, James M. Rehg
Abstract: Common visual codebook generation methods used in a bag of visual words model, for example, k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that with histogram features, the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks. In this paper, we demonstrate that HIK can be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as the standard k-means. The HIK codebooks have consistently higher recognition accuracy over k-means codebooks by 2–4% in several benchmark object and scene recognition data sets. The algorithm is also generalized to arbitrary additive kernels. Its speed is thousands of times faster than a naive implementation of the kernel k-means algorithm. In addition, we propose a one-class SVM formulation to create more effective visual code words. Finally, we show that the standard kmedian clustering method can be used for visual codebook generation and can act as a compromise between the HIK / additive kernel and the k-means approaches. Keywords: visual codebook, additive kernel, histogram intersection kernel
5 0.4123947 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction. © 2011 Özgür Sümer, Umut A. Acar, Alexander T. Ihler and Ramgopal R. Mettu.
6 0.40909493 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
7 0.40797862 66 jmlr-2011-Multiple Kernel Learning Algorithms
8 0.38474098 35 jmlr-2011-Forest Density Estimation
9 0.37474886 105 jmlr-2011-lp-Norm Multiple Kernel Learning
10 0.36761996 54 jmlr-2011-Learning Latent Tree Graphical Models
11 0.36691144 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.36062187 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
13 0.3476128 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
14 0.33614478 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
15 0.29825965 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
16 0.28400141 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
17 0.26481956 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
18 0.25327235 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
19 0.25124809 104 jmlr-2011-X-Armed Bandits
topicId topicWeight
[(4, 0.033), (9, 0.017), (10, 0.028), (24, 0.033), (31, 0.06), (32, 0.016), (60, 0.014), (71, 0.011), (73, 0.038), (78, 0.045), (90, 0.599)]
simIndex simValue paperId paperTitle
same-paper 1 0.89377177 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm. © 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt.
2 0.87616545 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion. © 2011 Sandra Zilles, Steffen Lange, Robert Holte and Martin Zinkevich.
3 0.85930413 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
Author: Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan
Abstract: We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art machine translation system, MOSES, it achieves improvement in translation results. Keywords: statistical machine translation (SMT), phrase reordering, lexicalized reordering (LR), maximum entropy (ME), support vector machine (SVM), maximum margin regression (MMR) , max-margin structure learning (MMS)
4 0.85166955 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
5 0.31692332 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
6 0.31353781 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
7 0.29512593 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
8 0.28632563 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
9 0.26627842 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.25881284 16 jmlr-2011-Clustering Algorithms for Chains
11 0.25874108 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
12 0.25394195 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
13 0.25320053 61 jmlr-2011-Logistic Stick-Breaking Process
14 0.25216335 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
15 0.25161946 48 jmlr-2011-Kernel Analysis of Deep Networks
16 0.25068796 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
17 0.24952948 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
18 0.24769884 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
19 0.24344617 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data