jmlr jmlr2011 jmlr2011-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
Reference: text
sentIndex sentText sentNum sentScore
1 For radial kernels on Rd , all these notions are shown to be equivalent. [sent-14, score-0.558]
2 Formally, given the set of all Borel probability measures defined on the topological space X, a measurable and bounded kernel k is said to be characteristic if the embedding (2), P ↦ ∫_X k(·, x) dP(x), is injective, that is, each P is mapped to a unique element ∫_X k(·, x) dP(x) of H. [sent-46, score-0.453]
3 Gretton et al. (2007) related characteristic and universal kernels by showing that if k is c-universal—see Section 2 for the definition—then it is characteristic. [sent-49, score-0.653]
4 This is done by first reviewing all the existing characterizations for universal and characteristic kernels, which is then used to clarify not only the relation between them but also their relation to other notions of pd kernels (see Section 3). [sent-54, score-1.534]
5 Since the existing characterizations do not explain the complete relationship between all these various notions of pd kernels, we raise open questions in Section 3 about the relationships to be clarified, which are then addressed in Section 4 by deriving new results. [sent-55, score-0.793]
6 A summary of the relations between all these notions of pd kernels is given in Figure 1, which also depicts the equivalence between these notions for radial kernels on Rd . [sent-57, score-1.595]
7 Positive definite (pd), strictly pd, conditionally strictly pd and integrally strictly pd: A symmetric function k : X × X → R is called positive definite (pd) (resp. [sent-81, score-1.36]
8 Furthermore, k is said to be strictly pd (resp. [sent-97, score-0.711]
9 A measurable, symmetric and bounded kernel k is said to be integrally strictly pd if ∫∫_X k(x, y) dµ(x) dµ(y) > 0, ∀ µ ∈ Mb (X)\{0}. [sent-102, score-1.0]
10 This definition is a generalization of integrally strictly positive definite functions on Rd (Stewart, 1976, Section 6): ∫∫_{Rd} k(x, y) f (x) f (y) dx dy > 0 for all f ∈ L2 (Rd ), which is the strict positive definiteness of the integral operator induced by the kernel. [sent-103, score-0.598]
11 c-, cc-, c0 - and L p -universal kernels: A continuous pd kernel k on a compact Hausdorff space X is called c-universal if the RKHS, H, induced by k is dense in C(X) w.r.t. the uniform norm. [sent-104, score-0.75]
12 A pd kernel k is said to be a c0 -kernel if it is bounded with k(·, x) ∈ C0 (X), ∀ x ∈ X, where X is a locally compact Hausdorff (LCH) space. [sent-111, score-0.635]
13 Translation invariant and radial kernels on Rd : A pd kernel k : Rd × Rd → R is said to be translation invariant if k(x, y) = ψ(x − y), where ψ is a pd function. [sent-144, score-1.586]
14 A continuous pd kernel is said to be translation invariant on Td := [0, 2π)d if k(x, y) = ψ((x − y) mod 2π), where ψ ∈ C(Td ) is such that ψ(x) = ∑_{n∈Zd} Aψ (n) e^{√−1 x^T n}, x ∈ Td , (7) with Aψ : Zd → R+ , Aψ (−n) = Aψ (n) and ∑_{n∈Zd} Aψ (n) < ∞. [sent-150, score-0.838]
15 2, we discuss and summarize the relation between characteristic and universal kernels based on their existing characterizations. [sent-156, score-0.714]
16 The relation of universal and characteristic kernels to strictly pd, conditionally strictly pd and integrally strictly pd kernels is summarized in Section 3. [sent-157, score-2.853]
17 Since the existing characterizations do not explain the complete relationship between all these various notions of pd kernels, we raise questions at the end of each subsection that need to be addressed to obtain a complete understanding of the relationships between all these notions. [sent-159, score-0.773]
18 A summary of the relationships between various notions of pd kernels based on the existing characterizations is shown in Figure 1. [sent-160, score-1.033]
19 Before proceeding further, we would like to highlight a possible confusion that can arise while comparing these various notions of pd kernels. [sent-161, score-0.687]
20 When X = T and k is continuous and translation invariant on T—see (7)—then k being characteristic implies it is strictly pd, which is shown as ♣. [sent-170, score-0.64]
21 The implications shown hold for bounded continuous translation invariant kernels on Rd —see (5). [sent-171, score-0.521]
22 If ψ ∈ Cb (Rd ) ∩ L1 (Rd ), then the implication shown as (♠) holds, that is, strictly pd kernels are cc-universal. [sent-172, score-0.947]
23 In extending this reasoning for the non-trivial comparison of any two notions of pd kernels, it is important to assume that k satisfies the strongest possible condition. [sent-178, score-0.67]
24 Therefore, in order to present a concise summary of the relationships between these various notions, in Figure 1, we assume k to be a c0 -kernel—this is the strongest condition to be satisfied in order to compare all these notions of pd kernels. [sent-179, score-0.702]
25 In the following, we review the existing characterizations for all these notions of universal kernels and summarize the relation between them. [sent-183, score-0.684]
26 4) showed that a bounded continuous pd kernel k is cc-universal if and only if the following embedding is injective for all µ ∈ Mbc (X) and some p ∈ [1, ∞): f ↦ ∫_X k(·, x) f (x) dµ(x), f ∈ L p (X, µ). [sent-202, score-0.803]
27 Micchelli et al. (2006, Proposition 14 and Theorem 17) showed that a translation invariant kernel on Rd is cc-universal if supp(Λ) is a uniqueness subset of Cd , while a radial kernel on Rd is cc-universal if and only if supp(ν) ≠ {0}—see (5) and (6) for the definitions of Λ and ν. [sent-220, score-0.472]
28 However, this notion of universality does not enjoy as nice a characterization as c0 -universality—see (12) and (13) for the characterization of c0 -universality—and therefore, we did not include it in our study of relationships between various notions of pd kernels. [sent-236, score-0.898]
29 (C) While cc-universality is characterized for radial kernels on Rd , the characterization of c0 -universality for radial kernels is not known. [sent-267, score-0.954]
30 3, we provide a characterization of c0 -universality for radial kernels on Rd and then establish the relation between c0 -universality and cc-universality for such kernels. [sent-269, score-0.496]
31 2 Relation Between Characteristic and Universal Kernels. In this section, we comprehensively clarify the relation between various notions of universality and characteristic kernels, based on already existing characterizations for characteristic kernels and the results summarized in Section 3. [sent-271, score-1.219]
32 Gretton et al. (2007) related universal and characteristic kernels by showing that if k is c-universal, then it is characteristic. [sent-275, score-0.653]
33 4), we showed that the converse is not true: as an example, a translation invariant kernel k on Td × Td is characteristic if and only if Aψ (0) ≥ 0, Aψ (n) > 0, ∀ n ∈ Zd \{0}, while it is universal if and only if Aψ (n) > 0, ∀ n ∈ Zd . [sent-278, score-0.659]
34 Characteristic kernels: cc-universal kernels on a non-compact Hausdorff space need not be characteristic: for example, a bounded continuous translation invariant kernel on Rd is cc-universal if (supp(Λ))◦ ≠ ∅ (see the summary of Section 3. [sent-280, score-0.647]
35 The following example shows that continuous kernels that are characteristic on a non-compact Hausdorff space X also need not be cc-universal. [sent-284, score-0.561]
36 In Section 4, we provide an alternate proof for this relation between c0 -universal and characteristic kernels by answering (A). [sent-300, score-0.605]
37 However, for bounded continuous translation invariant kernels on Rd , the converse is true, that is, a translation invariant c0 -kernel that is characteristic is also c0 -universal. [sent-302, score-0.783]
38 This is because a translation invariant kernel on Rd is characteristic if and only if supp(Λ) = Rd (Sriperumbudur et al. [sent-303, score-0.512]
39 Most of the well-known characteristic kernels satisfy the condition of ψ ∈ L1 (Rd ) and therefore are c0 -kernels. [sent-318, score-0.526]
40 This means, for all practical purposes, we can assume bounded continuous translation invariant kernels to be c0 -kernels. [sent-319, score-0.521]
41 However, for translation invariant kernels on Rd , c0 -universal ⇔ characteristic. [sent-322, score-0.459]
42 • For translation invariant kernels on Rd , characteristic ⇒ cc-universal but not vice-versa. [sent-324, score-0.711]
43 However, on general non-compact Hausdorff spaces, continuous kernels that are characteristic need not be cc-universal. [sent-325, score-0.561]
44 3 Relation of Universal and Characteristic Kernels to Strictly PD, Integrally Strictly PD and Conditionally Strictly PD Kernels. In this section, we relate characteristic kernels and various notions of universal kernels to strictly pd, integrally strictly pd and conditionally strictly pd kernels. [sent-331, score-2.957]
45 Before that, we summarize the relation between strictly pd, integrally strictly pd and conditionally strictly pd kernels. [sent-332, score-1.926]
46 4), we showed that integrally strictly pd kernels are strictly pd. [sent-335, score-1.395]
47 However, if X is a finite set, then k being strictly pd also implies it is integrally strictly pd. [sent-339, score-1.103]
48 From the definitions of strictly pd and conditionally strictly pd kernels, it is clear that a strictly pd kernel is conditionally strictly pd but not vice-versa. [sent-340, score-2.945]
49 3) showed that cc-universal kernels are strictly pd, which means c0 -universal kernels are also strictly pd (as c0 -universal ⇒ cc-universal from Section 3. [sent-344, score-1.534]
50 This means that, when X is compact Hausdorff, c-universal kernels are strictly pd, which matches the result in Steinwart and Christmann (2008, Definition 4. [sent-346, score-0.507]
51 Conversely, a strictly pd c0 -kernel on an LCH space need not be c0 -universal. [sent-350, score-0.692]
52 62 in Steinwart and Christmann (2008) which shows that there exists a bounded strictly pd kernel, k on X := N∪{0} with k(·, x) ∈ C0 (X), ∀ x ∈ X such that k is not L p -universal (which from the summary of Section 3. [sent-352, score-0.732]
53 Similarly, when X is compact, the converse is not true, that is, continuous strictly pd kernels need not be c-universal, which follows from the results due to Dahmen and Micchelli (1987) and Pinkus (2004) for Taylor kernels (Steinwart and Christmann, 2008, Lemma 4. [sent-354, score-1.333]
54 Therefore, it is evident that a continuous strictly pd kernel is in general not cc-universal on a Hausdorff space. [sent-359, score-0.783]
55 However, for translation invariant kernels that are continuous, bounded and integrable on Rd , that is, k(x, y) = ψ(x − y), x, y ∈ Rd , where ψ ∈ [sent-360, score-0.486]
56 Another example of continuous strictly pd kernels that are not c-universal is as follows. [sent-361, score-0.982]
57 Therefore, by Theorem 8 (see Appendix B), a strictly pd kernel on T need not be c-universal. [sent-364, score-0.748]
58 Cb (Rd ) ∩ L1 (Rd ), strictly pd implies cc-universality. [sent-365, score-0.673]
59 Similarly, when the kernel is radial on Rd , strictly pd kernels are cc-universal. [sent-370, score-1.141]
60 14 of Wendland (2005), which shows that a radial kernel on Rd is strictly pd if and only if supp(ν) ≠ {0}, and therefore cc-universal (from the summary of Section 3. [sent-372, score-0.899]
61 On the other hand, when X is finite, all these notions of universal and strictly pd kernels are equivalent, which follows from the result due to Carmeli et al. [sent-374, score-1.239]
62 3) that cc-universal and strictly pd kernels are the same when X is finite. [sent-376, score-0.947]
63 Strictly pd kernels: Since characteristic kernels that are c0 - and translation invariant on Rd are equivalent to c0 -universal kernels (see the summary of Section 3. [sent-378, score-1.522]
64 However, the converse is not true: for example, the sinc-squared kernel, k(x, y) = sin²(σ(x − y))/(x − y)² on R, which has supp(Λ) = [−σ, σ] ⊊ R, is strictly pd (Wendland, 2005, Theorem 6. [sent-380, score-0.75]
65 Based on Example 1, it can be shown that in general, characteristic kernels on a non-compact space (not necessarily Rd ) need not be strictly pd: in Example 1, k is characteristic but is not strictly pd because for (a1 , . [sent-382, score-1.638]
66 Therefore, when X is compact Hausdorff, a characteristic kernel need not be strictly pd. [sent-396, score-0.56]
67 However, for translation invariant kernels on T, a characteristic kernel is also strictly pd, while the converse is not true: Fukumizu et al. [sent-397, score-1.031]
68 (2010b, Theorem 14) have shown that k on T × T is characteristic if and only if Aψ (0) ≥ 0, Aψ (n) > 0, ∀ n ∈ Z\{0}, which by Theorem 8 (see Appendix B) is strictly pd, while the converse is clearly not true. [sent-399, score-0.497]
69 (2010b, Theorem 7), we have shown that integrally strictly pd kernels are characteristic, while the converse in general is not true. [sent-403, score-1.286]
70 Summary: The following statements summarize the relation of universal and characteristic kernels to strictly pd, integrally strictly pd and conditionally strictly pd kernels, which are depicted in Figure 1. [sent-406, score-2.579]
71 • c-, cc- and c0 -universal kernels are strictly pd and are therefore conditionally strictly pd, while the converse in general is not true. [sent-407, score-1.281]
72 When X is finite, then c-, cc- and c0 -universal kernels are equivalent to strictly pd kernels. [sent-408, score-0.947]
73 • Bounded, continuous, integrable, strictly pd translation invariant kernels on Rd are cc-universal. [sent-409, score-1.132]
74 Radial kernels on Rd are strictly pd if and only if they are cc-universal. [sent-410, score-0.947]
75 • For a general non-compact Hausdorff space, characteristic kernels need not be strictly pd and vice-versa. [sent-411, score-1.199]
76 However, bounded continuous translation invariant kernels on Rd or T that are characteristic are strictly pd but the converse is not true. [sent-412, score-1.523]
77 Therefore k is not integrally strictly pd but is characteristic. [sent-415, score-0.935]
78 • Integrally strictly pd kernels are characteristic. [sent-416, score-0.947]
79 (E) While the relation of universal kernels to strictly pd and conditionally strictly pd kernels is clear from the above summary, the relation between universal and integrally strictly pd kernels is not known, which we establish in Section 4. [sent-419, score-3.568]
80 (F) When X is a finite set, it is easy to see that characteristic and conditionally strictly pd kernels are equivalent (see Section 4. [sent-421, score-1.288]
81 (G) As summarized above, radial kernels on Rd are strictly pd if and only if they are cc-universal. [sent-425, score-1.066]
82 However, the relation between all the other notions of pd kernels—c0 -universal, characteristic, strictly pd and integrally strictly pd—is not known, which is addressed in Section 4. [sent-426, score-1.834]
83 (c) Comparing (14) and (2), it is clear that c0 -universal kernels are characteristic while the converse is not true, which matches the result in Section 3. [sent-456, score-0.603]
84 2 Relation Between Universal Kernels and Integrally Strictly PD Kernels. In this section, we address the open question (E) through the following result, which shows that c0 -kernels are integrally strictly pd if and only if they are c0 -universal. [sent-459, score-0.955]
85 Proposition 4 (c0 -universal and integrally strictly pd kernels) Suppose the assumptions in Proposition 2 hold. [sent-460, score-0.935]
86 3 Radial Kernels on Rd. In this section, we address the open questions (B), (C), (D) and (G) by showing that all the notions of universality and characteristic kernels are equivalent to strictly pd kernels. [sent-470, score-1.539]
87 Proposition 5 (All notions are equivalent for radial kernels on Rd ) Suppose k is radial on Rd . [sent-471, score-0.677]
88 4 Relation Between Characteristic and Conditionally Strictly PD Kernels. In this section, we address the open question (F), which concerns the relation of characteristic kernels to conditionally strictly pd kernels. [sent-495, score-1.369]
89 However, the following result establishes the relation between characteristic and conditionally strictly pd kernels. [sent-498, score-1.075]
90 3 that strictly pd kernels are conditionally strictly pd but need not be characteristic, and so conditionally strictly pd kernels need not be characteristic. [sent-510, score-2.997]
91 62 in Steinwart and Christmann (2008), which shows that c0 -kernels that are strictly pd need not be c0 -universal. [sent-512, score-0.673]
92 When X is finite, the converse to Proposition 6 holds, that is, conditionally strictly pd kernels are characteristic, which is shown as follows. [sent-547, score-1.113]
93 ), the notions of characteristic and universal kernels are equivalent. [sent-567, score-0.818]
94 In addition, we also explored their relation to various other notions of positive definite (pd) kernels: strictly pd, integrally strictly pd and conditionally strictly pd. [sent-568, score-1.586]
95 As an example, we showed all these notions to be equivalent (except for conditionally strictly pd) in the case of radial kernels on Rd . [sent-569, score-0.833]
96 This unified study shows that certain families of kernels, for example, bounded continuous translation invariant kernels on Rd and radial kernels on Rd , are interesting for practical use, since the disparate notions of universal and characteristic kernels seem to coincide for these families. [sent-571, score-1.732]
97 The following result in Theorem 8 characterizes strictly pd kernels on T, which we quote from Menegatto (1995). [sent-609, score-0.947]
98 Theorem 8 (Menegatto 1995) Let ψ be a pd function on T of the form in (7). [sent-621, score-0.505]
99 Then ψ is strictly pd if N has a subset of the form ∪_{l=0}^∞ (b_l + c_l N0 ), in which {b_l } ∪ {c_l } ⊂ N and {c_l } is a prime sequence. [sent-623, score-0.71]
100 On the relation between universality, characteristic kernels and RKHS embedding of measures. [sent-859, score-0.689]
wordName wordTfidf (topN-words)
[('pd', 0.505), ('kernels', 0.274), ('integrally', 0.262), ('characteristic', 0.252), ('rd', 0.203), ('sriperumbudur', 0.178), ('strictly', 0.168), ('notions', 0.165), ('lch', 0.148), ('hausdorff', 0.141), ('rkhs', 0.137), ('carmeli', 0.131), ('universal', 0.127), ('universality', 0.126), ('mb', 0.123), ('radial', 0.119), ('injective', 0.116), ('cb', 0.106), ('translation', 0.106), ('steinwart', 0.104), ('embedding', 0.102), ('fukumizu', 0.098), ('riperumbudur', 0.096), ('ukumizu', 0.096), ('borel', 0.091), ('conditionally', 0.089), ('easures', 0.087), ('haracteristic', 0.087), ('niversality', 0.087), ('supp', 0.086), ('anckriet', 0.082), ('invariant', 0.079), ('radon', 0.079), ('converse', 0.077), ('kernel', 0.075), ('christmann', 0.073), ('mbc', 0.07), ('mbedding', 0.067), ('compact', 0.065), ('gretton', 0.064), ('ernels', 0.061), ('mrba', 0.061), ('relation', 0.061), ('signed', 0.059), ('characterizations', 0.057), ('topological', 0.056), ('micchelli', 0.055), ('dense', 0.051), ('wendland', 0.045), ('folland', 0.044), ('polish', 0.044), ('characterization', 0.042), ('proposition', 0.04), ('theorem', 0.039), ('said', 0.038), ('prime', 0.037), ('measurable', 0.035), ('continuous', 0.035), ('clarify', 0.032), ('summary', 0.032), ('topology', 0.031), ('riesz', 0.03), ('caponnetto', 0.03), ('questions', 0.029), ('sch', 0.029), ('zd', 0.029), ('bounded', 0.027), ('xl', 0.027), ('td', 0.027), ('suquet', 0.026), ('measures', 0.026), ('vanish', 0.025), ('wherein', 0.025), ('embeddings', 0.025), ('bharath', 0.022), ('target', 0.022), ('clari', 0.021), ('nn', 0.021), ('hilbert', 0.02), ('al', 0.02), ('verbatim', 0.02), ('bl', 0.02), ('lkopf', 0.02), ('open', 0.02), ('endowed', 0.019), ('space', 0.019), ('answering', 0.018), ('gert', 0.018), ('notion', 0.018), ('showed', 0.018), ('dunford', 0.017), ('footnote', 0.017), ('homogeneity', 0.017), ('injectivity', 0.017), ('ism', 0.017), ('menegatto', 0.017), ('multiquadrics', 0.017), ('rocky', 0.017), ('raise', 0.017), ('fourier', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
2 0.11869048 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
3 0.088931836 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
4 0.08576782 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
5 0.084159605 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm
6 0.074273847 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
7 0.057020463 105 jmlr-2011-lp-Norm Multiple Kernel Learning
8 0.056734324 101 jmlr-2011-Variable Sparsity Kernel Learning
9 0.051233109 66 jmlr-2011-Multiple Kernel Learning Algorithms
10 0.046782993 35 jmlr-2011-Forest Density Estimation
11 0.042920489 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
12 0.040891789 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
13 0.037385218 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
14 0.036915813 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
15 0.034572534 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
16 0.033945348 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
17 0.033481888 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
18 0.033477601 48 jmlr-2011-Kernel Analysis of Deep Networks
19 0.033016782 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
20 0.032059569 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
topicId topicWeight
[(0, 0.155), (1, -0.026), (2, 0.076), (3, -0.168), (4, 0.092), (5, 0.018), (6, -0.071), (7, -0.02), (8, -0.206), (9, -0.023), (10, -0.042), (11, 0.011), (12, -0.131), (13, 0.063), (14, 0.096), (15, -0.135), (16, 0.162), (17, -0.141), (18, 0.127), (19, -0.061), (20, 0.061), (21, -0.075), (22, -0.211), (23, 0.196), (24, 0.032), (25, -0.109), (26, 0.076), (27, -0.107), (28, -0.022), (29, -0.169), (30, -0.133), (31, -0.184), (32, -0.11), (33, -0.143), (34, -0.199), (35, -0.062), (36, -0.176), (37, 0.053), (38, 0.071), (39, -0.004), (40, -0.032), (41, -0.001), (42, -0.001), (43, -0.058), (44, 0.012), (45, 0.003), (46, 0.038), (47, -0.112), (48, -0.093), (49, -0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.9811852 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
2 0.51337743 55 jmlr-2011-Learning Multi-modal Similarity
Author: Brian McFee, Gert Lanckriet
Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity
3 0.50391895 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm
4 0.4678548 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
Author: Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, Jufu Feng
Abstract: Much attention has been paid to the theoretical explanation of the empirical success of AdaBoost. The most influential work is the margin theory, which is essentially an upper bound for the generalization error of any voting classifier in terms of the margin distribution over the training data. However, important questions were raised about the margin explanation. Breiman (1999) proved a bound in terms of the minimum margin, which is sharper than the margin distribution bound. He argued that the minimum margin would be better in predicting the generalization error. Grove and Schuurmans (1998) developed an algorithm called LP-AdaBoost which maximizes the minimum margin while keeping all other factors the same as AdaBoost. In experiments however, LP-AdaBoost usually performs worse than AdaBoost, putting the margin explanation into serious doubt. In this paper, we make a refined analysis of the margin theory. We prove a bound in terms of a new margin measure called the Equilibrium margin (Emargin). The Emargin bound is uniformly sharper than Breiman’s minimum margin bound. Thus our result suggests that the minimum margin may be not crucial for the generalization error. We also show that a large Emargin and a small empirical error at Emargin imply a smaller bound of the generalization error. Experimental results on benchmark data sets demonstrate that AdaBoost usually has a larger Emargin and a smaller test error than LP-AdaBoost, which agrees well with our theory. Keywords: boosting, margin bounds, voting classifier
5 0.33845475 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
6 0.29718894 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
7 0.28209153 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
8 0.25286004 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
9 0.25069618 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
10 0.22876395 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
11 0.21773228 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
12 0.21200381 66 jmlr-2011-Multiple Kernel Learning Algorithms
13 0.20932661 105 jmlr-2011-lp-Norm Multiple Kernel Learning
14 0.20651674 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.20324855 48 jmlr-2011-Kernel Analysis of Deep Networks
16 0.19976968 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
17 0.19624203 35 jmlr-2011-Forest Density Estimation
18 0.18942176 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
19 0.18634386 101 jmlr-2011-Variable Sparsity Kernel Learning
20 0.18380949 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
topicId topicWeight
[(1, 0.497), (4, 0.028), (9, 0.023), (10, 0.014), (24, 0.023), (31, 0.088), (32, 0.027), (41, 0.024), (60, 0.013), (65, 0.02), (73, 0.055), (78, 0.053), (90, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.8287648 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures
Author: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R.G. Lanckriet
Abstract: Over the last few years, two different notions of positive definite (pd) kernels—universal and characteristic—have been developing in parallel in machine learning: universal kernels are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms while characteristic kernels are introduced in the context of distinguishing probability measures by embedding them into a reproducing kernel Hilbert space (RKHS). However, the relation between these two notions is not well understood. The main contribution of this paper is to clarify the relation between universal and characteristic kernels by presenting a unifying study relating them to RKHS embedding of measures, in addition to clarifying their relation to other common notions of strictly pd, conditionally strictly pd and integrally strictly pd kernels. For radial kernels on Rd , all these notions are shown to be equivalent. Keywords: kernel methods, characteristic kernels, Hilbert space embeddings, universal kernels, strictly positive definite kernels, integrally strictly positive definite kernels, conditionally strictly positive definite kernels, translation invariant kernels, radial kernels, binary classification, homogeneity testing
2 0.71564811 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction
3 0.23408891 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi
Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints
4 0.23389754 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
5 0.23270939 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis
Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture
6 0.22952288 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
7 0.22837298 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.22811921 12 jmlr-2011-Bayesian Co-Training
9 0.22672699 48 jmlr-2011-Kernel Analysis of Deep Networks
10 0.22653188 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
11 0.22438613 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
12 0.2243367 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
13 0.2232309 101 jmlr-2011-Variable Sparsity Kernel Learning
14 0.22293143 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
15 0.22275931 66 jmlr-2011-Multiple Kernel Learning Algorithms
16 0.22251038 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
17 0.22250979 95 jmlr-2011-Training SVMs Without Offset
18 0.2217696 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.22170968 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
20 0.22114885 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling