jmlr jmlr2005 jmlr2005-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
Reference: text
sentIndex sentText sentNum sentScore
1 FR Ecole des Mines de Paris 35 rue Saint Honoré 77305 Fontainebleau, France Editor: John Lafferty Abstract We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. [sent-7, score-0.403]
2 These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. [sent-8, score-0.582]
3 Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. [sent-9, score-0.368]
4 Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. [sent-11, score-0.314]
5 Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space 1. [sent-12, score-0.793]
6 (p.d.) kernel defined between pairs of objects of interest, to be used alongside kernel methods such as support vector machines (Boser et al. [sent-16, score-0.324]
7 One possible approach to kernel design for such complex objects consists in representing them by sets of basic components easier to manipulate, and designing kernels on such sets. [sent-21, score-0.335]
8 When the set of basic components is finite, this representation amounts to encoding a complex object as a finite-dimensional vector of counters, and any kernel for vectors can then be translated to a kernel for complex objects through this feature representation (Joachims, 2002, Leslie et al. [sent-29, score-0.346]
9 For more general situations, several authors have proposed to handle such weighted lists of points by first fitting a probability distribution to each list, and defining a kernel between the resulting distributions (Lafferty and Lebanon, 2002, Jebara et al. [sent-31, score-0.266]
10 Alternatively, Cuturi and Vert (2005) use a parametric family of densities and a Bayesian framework to define a kernel for strings based on the mutual information between their sets of variable-length blocks, using the concept of mutual information kernels (Seeger, 2002). [sent-33, score-0.269]
11 We explore in this contribution a different direction to kernel design for weighted lists of basic components. [sent-36, score-0.266]
12 More precisely, we explore the set of kernels between measures that can be expressed as a function of their sum, that is: k(µ, µ′) = ϕ(µ + µ′). [sent-38, score-0.274]
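To make this construction concrete, here is a minimal sketch (ours, not from the paper) of a kernel of the form k(µ, µ′) = ϕ(µ + µ′), with discrete measures represented as non-negative weight vectors and ϕ = e^{−h}, where h is the entropy, one of the dispersion functions discussed later.

```python
import numpy as np

def entropy(nu):
    # h(nu) = -sum_i nu_i log nu_i, with 0 log 0 := 0 (nu need not sum to 1)
    nz = nu[nu > 0]
    return -np.sum(nz * np.log(nz))

def semigroup_kernel(mu, mu_prime):
    # k(mu, mu') = phi(mu + mu'): the kernel depends only on the sum of the two measures
    return np.exp(-entropy(mu + mu_prime))

mu = np.array([0.5, 0.3, 0.2, 0.0])
mu_prime = np.array([0.1, 0.1, 0.4, 0.4])
print(semigroup_kernel(mu, mu_prime))
```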
13 The set of measures endowed with the addition is an Abelian semigroup, and the kernel (1) is exactly what Berg et al. [sent-42, score-0.379]
14 (p.d.) semigroup kernels for molecular measures or densities. [sent-46, score-0.987]
15 As expected, we prove that several functions ϕ that quantify the dispersion of measures through their entropy or through their variance matrix result in valid p.d. kernels. [sent-47, score-0.282]
16 We introduce entropy in this paper slightly differently, noting that it is a semigroup negative definite function defined on measures. [sent-51, score-0.568]
17 On the other hand, the use of generalized variance to derive a positive definite kernel between measures as proposed here is new to our knowledge. [sent-52, score-0.33]
18 We further show how such kernels can be applied to molecular measures through regularization operations. [sent-53, score-0.472]
19 In the case of the kernel based on the spectrum of the variance matrix, we show how it can be applied implicitly for molecular measures mapped to a reproducing kernel Hilbert space when a p.d. kernel on the space of components is available. [sent-54, score-0.692]
20 Using the general theory of semigroup kernels we state an integral representation of such kernels and study the semicharacters involved in this representation. [sent-60, score-0.994]
21 We first introduce elements of measure representations of weighted lists and define the semigroup formalism and the notion of semigroup p.d. function. [sent-65, score-1.2]
22 Two such p.d. kernels are then presented, which are however usually not defined for molecular measures: the entropy kernel and the inverse generalized variance (IGV) kernel. [sent-70, score-0.421]
23 Through regularization procedures, practical applications of such kernels on molecular measures are proposed in Section 4, and the approach is further extended by kernelizing the IGV through an a priori kernel defined itself on the space of components in Section 5. [sent-71, score-0.627]
24 Section 6 contains the general integral representation of semigroup kernels and Section 7 makes the link between p. [sent-72, score-0.716]
25 Notations and Framework: Semigroup Kernels on Measures In this section we set up the framework and notations of this paper, in particular the idea of semigroup kernel on the semigroup of measures. [sent-77, score-1.185]
26 We introduce the subset of M+(X) of molecular (or atomic) measures Mol+(X), namely measures whose support supp(µ) = {x ∈ X | µ(U) > 0 for every open subset U containing x} is finite. [sent-85, score-0.687]
27 For a molecular measure µ, an admissible base of µ is a finite list γ of weighted points of X, namely γ = (xi, ai)_{i=1..d}, where xi ∈ X and ai > 0 for 1 ≤ i ≤ d, such that µ = ∑_{i=1}^d ai δ_{xi}. [sent-88, score-0.571]
28 An Abelian semigroup (S , +) is a nonempty set S endowed with an associative and commutative composition + and a neutral element 0. [sent-95, score-0.605]
29 A real-valued function ϕ on S is called a positive definite (p.d.) function on the semigroup (S, +) if (s, t) → ϕ(s + t) is a p.d. kernel on S × S. [sent-101, score-0.515]
30 Hence semigroup kernels are real-valued functions ϕ defined on the set of interest S, the similarity between two elements s, t of S being just the value taken by that function on their composition, namely ϕ(s + t). [sent-115, score-0.67]
31 Such a semigroup might be used as a feature representation for complex objects by mapping an object to the set of its components, forgetting about the weights. [sent-117, score-0.612]
32 A semigroup kernel k on P(X) measuring the similarity of two sets of points A, B ∈ P(X) would use the value taken by a given p.d. function on their union A ∪ B. [sent-119, score-0.644]
33 This representation is richer than the one suggested in the previous paragraph, which used the semigroup (P(X), ∪) to consider the merger of two lists. [sent-134, score-0.546]
34 First it performs the union of the supports; second the sum of such molecular measures also adds the weights of the points common to both measures, with a possible renormalization on those weights. [sent-135, score-0.402]
35 We study p.d. functions on the semigroup (M+(X), +), in particular on molecular measures, in order to define kernels on weighted lists of simple components. [sent-140, score-0.964]
36 Five measures of interest are represented: the image measures δz and δz′ of those weighted finite lists, the smoothed density estimates θ(δz) and θ(δz′) of the two lists of points, and the smoothed density estimate θ(δz + δz′) of the union of both lists. [sent-145, score-0.559]
37 Before turning to such p.d. functions, it should however be pointed out that several interesting semigroup p.d. [sent-148, score-0.515]
38 kernels on measures are not directly applicable to molecular measures. [sent-150, score-0.472]
39 A p.d. function ϕ on (M+(X), +) can be used to define k(z, z′) in at least three ways: k(z, z′) = ϕ(δz + δz′), using ϕ directly on the molecular measures; k(z, z′) = ϕ(θ(δz) + θ(δz′)), using ϕ on smoothed versions of the molecular measures; or k(z, z′) = ϕ(θ(δz + δz′)), evaluating ϕ on a smoothed version of the sum. [sent-160, score-0.498]
40 We then show how those quantities can be computed in the case of molecular measures in Section 4. [sent-166, score-0.358]
41 In this section we present two p.d. semigroup kernels for measures, motivated by a common intuition: the kernel between two measures should increase when the sum of the measures gets more “concentrated”. [sent-170, score-1.078]
42 They are therefore limited to a subset of measures, namely the subset of measures with finite entropy and the subset of sub-probability measures with non-degenerated variance, but are extended to a broader class of measures, including molecular measures, in Section 4. [sent-172, score-0.612]
43 1 Entropy Kernel We consider the subset of M+^b(X) of absolutely continuous measures with respect to the dominant measure ν, and identify in this section a measure with its corresponding density with respect to ν. [sent-174, score-0.286]
44 The expression of Equation (2) fits our framework of devising semigroup kernels, unlike the direct use of the KL divergence (Moreno et al. [sent-179, score-0.541]
45 The latter work shares with this paper another similarity, which lies in the “kernelization” of such quantities defined on measures through a prior kernel on the space of components, as will be reviewed in Section 5. [sent-188, score-0.289]
46 However, of all the Hilbertian metrics introduced in Hein and Bousquet (2005), the Jensen-Divergence is the only one that can be related to the semigroup framework used throughout this paper. [sent-189, score-0.544]
47 Proposition 1 h is a negative definite function on the semigroup M+^h(X). [sent-192, score-0.515]
48 As a consequence e^{−h} is a positive definite function on M+^h(X), and its normalized counterpart kh := e^{−J} is an infinitely divisible positive definite kernel on M+^h(X) × M+^h(X). [sent-193, score-0.331]
49 on R+ as a semigroup endowed with addition (Berg et al. [sent-196, score-0.605]
50 Note that only e−h is a semigroup kernel strictly speaking, since e−J involves a normalized sum (through the division by 2) which is not associative. [sent-218, score-0.644]
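As a rough numerical illustration (ours; finite discrete densities stand in for densities with respect to ν), the Jensen divergence J and the kernel kh = e^{−J} can be evaluated as follows.

```python
import numpy as np

def entropy(p):
    # Shannon entropy h(p) = -sum_i p_i log p_i, with 0 log 0 := 0
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def jensen_divergence(p, q):
    # J(p, q) = h((p + q)/2) - (h(p) + h(q))/2
    return entropy((p + q) / 2.0) - 0.5 * (entropy(p) + entropy(q))

def entropy_kernel(p, q):
    # k_h(p, q) = exp(-J(p, q)), the normalized (infinitely divisible) variant
    return np.exp(-jensen_divergence(p, q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(entropy_kernel(p, q))  # closer to 1 when p and q are similar
```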
51 The subset of absolutely continuous probability measures on (X, ν) with finite entropies, namely f ∈ M+^h(X) s.t. [sent-222, score-0.261]
52 |f| = 1, is not a semigroup since it is not closed under addition, but we can nonetheless define the restriction of J, and hence kh, on it to obtain a p.d. kernel. [sent-224, score-0.556]
53 Following the results obtained in the previous section, we propose under these restrictions a second semigroup p.d. kernel. [sent-229, score-0.515]
54 Besides being easy to compute in the case of molecular measures, this quantity is also linked to entropy if we consider that for normal laws N(m, Σ) the following relation holds: 1/√(det Σ) ∝ e^{−h(N(m,Σ))}. [sent-234, score-0.251]
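A quick sanity check of this proportionality (ours): the differential entropy of a Gaussian is h(N(m, Σ)) = ½ log((2πe)^n det Σ), so e^{−h} = (2πe)^{−n/2} (det Σ)^{−1/2} ∝ 1/√(det Σ).

```python
import numpy as np

n = 3
Sigma = np.diag([0.5, 1.0, 2.0])
h = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
lhs = np.exp(-h)
rhs = (2 * np.pi * np.e) ** (-n / 2) / np.sqrt(np.linalg.det(Sigma))
print(lhs, rhs)  # identical up to floating point error
```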
55 Let us define the variance operator on measures µ of M+^b(X) with finite first and second moments as Σ(µ) := µ[xx⊤] − µ[x]µ[x]⊤. [sent-237, score-0.329]
56 We call det Σ(µ) the generalized variance of a measure µ, and say a measure µ is non-degenerated if det Σ(µ) is non-zero, meaning that Σ(µ) is of full rank. [sent-239, score-0.299]
57 However we will work in this case on the mean of two measures, in the same way we used their standard addition in the semigroup framework of M+^b(X). [sent-243, score-0.675]
58 Proposition 3 The real-valued kernel kv defined on elements µ, µ′ of M+^v(X) as kv(µ, µ′) = 1 / det Σ((µ + µ′)/2) is positive definite. [sent-244, score-0.487]
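A minimal sketch (ours) of this IGV kernel for two normalized weighted point lists, estimating the variance operator Σ(µ) = µ[xx⊤] − µ[x]µ[x]⊤ directly from an admissible base of the mean measure; function and variable names are illustrative.

```python
import numpy as np

def variance_operator(points, weights):
    # Sigma(mu) = mu[x x^T] - mu[x] mu[x]^T for mu = sum_i a_i delta_{x_i}, total mass 1
    mean = weights @ points
    second = points.T @ (points * weights[:, None])
    return second - np.outer(mean, mean)

def igv_kernel(pts1, w1, pts2, w2):
    # k_v(mu, mu') = 1 / det Sigma((mu + mu')/2), each measure first renormalized to mass 1
    pts = np.vstack([pts1, pts2])
    w = np.concatenate([w1 / w1.sum(), w2 / w2.sum()]) / 2.0
    return 1.0 / np.linalg.det(variance_operator(pts, w))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(20, 2)), rng.normal(size=(25, 2)) + 0.5
print(igv_kernel(A, np.ones(20), B, np.ones(25)))
```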
59 Since we are most likely to use them on molecular measures or smooth measures (as discussed in Section 2. [sent-258, score-0.518]
60 In the case of the entropy kernel, molecular measures are generally not absolutely continuous with respect to ν (except on finite spaces), and they have therefore no entropy; we solve this problem by mapping them into M+^h(X) through a smoothing kernel. [sent-264, score-0.471]
61 This regularization is particularly important to pave the way to the kernelized version of the IGV kernel presented in the next section, when X is not Euclidian but simply endowed with a prior kernel κ. [sent-266, score-0.389]
62 The application of both the entropy kernel and the IGV kernel to molecular measures requires a previous renormalization to set the total mass of the measures to 1. [sent-267, score-0.873]
63 All molecular measures in this section, and equivalently all admissible bases, will hence be supposed to be normalized such that their total weight is 1, and Mol1 (X ) denotes the subset of Mol+ (X ) of such measures. [sent-269, score-0.454]
64 1 Entropy Kernel on Smoothed Estimates We first define the Parzen smoothing procedure, which makes it possible to map molecular measures onto measures with finite entropy: Definition 4 Let κ be a probability kernel on X with finite entropy, i. [sent-271, score-0.647]
65 Once this mapping is defined, we use the entropy kernel to propose the following kernel on two molecular measures µ and µ′: kh^κ(µ, µ′) = e^{−J(θκ(µ), θκ(µ′))}. [sent-276, score-0.71]
66 As an example, let X be a Euclidian space of dimension n endowed with Lebesgue's measure, and κ the isotropic Gaussian RBF kernel on that space, namely κ(x, y) = (2πσ)^{−n/2} e^{−‖x−y‖²/(2σ²)}. [sent-277, score-0.26]
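A rough sketch (ours) of the smoothed entropy kernel kh^κ(µ, µ′) = e^{−J(θκ(µ), θκ(µ′))} in one dimension: the molecular measures are smoothed with a Gaussian window and discretized on a grid so the entropies can be evaluated numerically; the grid, bandwidth and variable names are our own illustrative choices.

```python
import numpy as np

def parzen_density(points, weights, grid, sigma=0.2):
    # theta_kappa(mu): Gaussian-window estimate of mu, evaluated on a 1-D grid
    w = weights / weights.sum()
    dens = np.zeros_like(grid)
    for x, a in zip(points, w):
        dens += a * np.exp(-(grid - x) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return dens

def entropy(density, dx):
    nz = density[density > 0]
    return -np.sum(nz * np.log(nz)) * dx

def smoothed_entropy_kernel(z1, w1, z2, w2, grid, sigma=0.2):
    dx = grid[1] - grid[0]
    f, g = parzen_density(z1, w1, grid, sigma), parzen_density(z2, w2, grid, sigma)
    J = entropy((f + g) / 2, dx) - 0.5 * (entropy(f, dx) + entropy(g, dx))
    return np.exp(-J)

grid = np.linspace(-3, 3, 601)
print(smoothed_entropy_kernel(np.array([-0.5, 0.2]), np.ones(2),
                              np.array([0.1, 0.6, 0.9]), np.ones(3), grid))
```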
67 We thus introduce the regularized kernel kv^η defined on measures (µ, µ′) ∈ M+^b(X) with finite second moment as kv^η(µ, µ′) := 1 / det( (1/η) Σ((µ+µ′)/2) + In ). [sent-294, score-0.775]
68 It is straightforward to prove that the regularized function kv^η is a positive definite kernel on the measures of M+^b(X) with finite second-order moments, using the same proof as in Proposition 3. [sent-295, score-0.26]
69 Thus, regardless of the difference between n and d, we have det((1/η) K̃∆ + Id) = det((1/η) ∆^{1/2} X̃⊤X̃ ∆^{1/2} + Id) = det((1/η) X̃∆X̃⊤ + In) = det((1/η) Σ + In), where the addition of identity matrices only introduces an offset of 1 for all eigenvalues. [sent-304, score-0.384]
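The determinant identity above can be checked numerically; the sketch below (ours, on random data) compares the d×d Gram-side determinant det((1/η)K̃∆ + Id) with the n×n covariance-side determinant det((1/η)Σ + In).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 3, 10, 0.5
X = rng.normal(size=(n, d))        # the d points of the admissible base, as columns in R^n
w = rng.random(d); w /= w.sum()    # weights with total mass 1
Xc = X - (X @ w)[:, None]          # centre around the weighted mean
Delta = np.diag(w)
K = Xc.T @ Xc                      # d x d centred Gram matrix K~
Sigma = Xc @ Delta @ Xc.T          # n x n variance operator Sigma(mu)

lhs = np.linalg.det(K @ Delta / eta + np.eye(d))
rhs = np.linalg.det(Sigma / eta + np.eye(n))
print(lhs, rhs)                    # the two determinants coincide
```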
70 The kernel kv^η defined on two measures µ, µ′ of Mol1+(X) as kv^η(µ, µ′) = 1 / det( (1/η) K̃γ ∆γ + I_{l(γ)} ), where γ is any admissible base of (µ + µ′)/2, is p.d. [sent-307, score-0.791]
71 Given two objects z, z′ and their respective molecular measures δz and δz′, the computation of the IGV for two such objects requires in practice an admissible base of (δz + δz′)/2, as seen in Theorem 6. [sent-310, score-0.634]
72 The kernel kκ^η(µ, µ′) = 1 / det( (1/η) K̃γ ∆γ + I_{l(γ)} ),   (3) defined on two elements µ, µ′ in Mol1+(X), is positive definite, where γ is any admissible base of (µ + µ′)/2. [sent-330, score-0.369]
73 We consider the reproducing kernel Hilbert space Ξ with reproducing kernel κ indexed on X. [sent-336, score-0.328]
74 We define Y := supp(∑_{i=1}^N µi) ⊂ X, the finite set which numbers all elements in the support of the N considered measures, and ϒ := span φ(Y) ⊂ Ξ, the linear span of the elements in the image of Y through φ. [sent-338, score-0.344]
75 Given a molecular measure µ ∈ Mol1+(Y), let φ(µ) denote the image measure of µ in ϒ, namely φ(µ) = ∑_{x∈Y} µ(x) δ_{φ(x)}. [sent-341, score-0.305]
76 One can easily check that any admissible base γ = (xi, ai)_{i=1..d} of µ can be used to provide an admissible base φ(γ) := (φ(xi), ai)_{i=1..d} of φ(µ). [sent-342, score-0.488]
77 As a result, we have that kκ^η(µi, µj) = kv^η(φ(µi), φ(µj)), where kv^η is defined on Mol1+(ϒ), ensuring the non-negativity ∑_{i,j=1}^N ci cj kκ^η(µi, µj) = ∑_{i,j=1}^N ci cj kv^η(φ(µi), φ(µj)) ≥ 0 and hence the positive definiteness of kκ^η. [sent-344, score-0.473]
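A sketch (ours) of the kernelized IGV of Equation (3) for two weighted point lists, with a prior Gaussian kernel κ on the components; the weighted centring of the Gram matrix and the choices of κ, σ and η are our own illustrative assumptions.

```python
import numpy as np

def gaussian_gram(pts, sigma=1.0):
    sq = np.sum(pts ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * pts @ pts.T
    return np.exp(-D / (2 * sigma ** 2))

def kernelized_igv(pts1, w1, pts2, w2, eta=0.1, sigma=1.0):
    # k_kappa^eta(mu, mu') = 1 / det((1/eta) K~_gamma Delta_gamma + I),
    # gamma being an admissible base of (mu + mu')/2
    pts = np.vstack([pts1, pts2])
    w = np.concatenate([w1 / w1.sum(), w2 / w2.sum()]) / 2.0
    K = gaussian_gram(pts, sigma)
    C = np.eye(len(w)) - np.outer(np.ones(len(w)), w)   # centre features at the weighted mean
    Kc = C @ K @ C.T
    return 1.0 / np.linalg.det(Kc @ np.diag(w) / eta + np.eye(len(w)))

rng = np.random.default_rng(2)
A, B = rng.normal(size=(15, 2)), rng.normal(size=(20, 2)) + 0.3
print(kernelized_igv(A, np.ones(15), B, np.ones(20)))
```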
78 Before observing these practical improvements, we provide a general study of the family of semigroup kernels on M+^b(X) by casting the theory of integral representations of positive definite functions on a semigroup (Berg et al. [sent-346, score-1.2]
79 We consider p.d. functions on the whole semigroup (M+^b(X), +), thus including measures which are not normalized. [sent-351, score-0.675]
80 This characterization is based on a general integral representation theorem valid for any semigroup kernel, and is similar in spirit to the representation of p. [sent-352, score-0.633]
81 Definition 8 A real-valued function ρ on an Abelian semigroup (S, +) is called a semicharacter if it satisfies the following properties: (i) ρ(0) = 1 (ii) ∀s,t ∈ S, ρ(s + t) = ρ(s)ρ(t). [sent-358, score-0.581]
82 These theorems being valid on any semigroup, they hold in particular on the semigroup (M+^b(X), +). [sent-377, score-0.515]
83 Proposition 11 A semicharacter ρ : M+^b(X) → R is continuous on (M+^b(X), +) endowed with the weak topology if and only if there exists f ∈ Cb(X) such that ρ = ρ_f. [sent-395, score-0.264]
84 It is further equal to ρ on M+^b(X) through the denseness of molecular measures in M+^b(X), both in the weak and the pointwise topology (Berg et al. [sent-407, score-0.468]
85 Then the following ϕ is a p.d. function on the semigroup (Mω, +): ϕ(µ) = ∫_{C(X)} ρ_f(µ) dω(f).   (4) [sent-419, score-0.515]
86 If supp ω ⊂ Cb(X) then ϕ is continuous on Mω endowed with the topology of weak convergence. [sent-420, score-0.286]
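A toy Monte-Carlo sketch (ours) of the representation (4): semicharacters of the form ρ_f(µ) = e^{µ[f]} are averaged over random bounded functions f drawn from an illustrative prior ω, yielding an approximate p.d. semigroup kernel on one-dimensional molecular measures.

```python
import numpy as np

rng = np.random.default_rng(3)

def mu_of_f(points, weights, f):
    # mu[f] = sum_i a_i f(x_i) for a molecular measure mu = sum_i a_i delta_{x_i}
    return np.sum(weights * f(points))

def integral_kernel(pts1, w1, pts2, w2, n_draws=2000):
    # phi(mu + mu') ~ (1/N) sum_j exp((mu + mu')[f_j]),  f_j ~ omega
    total = 0.0
    for _ in range(n_draws):
        a, b = rng.normal(), rng.uniform(0, 2 * np.pi)
        f = lambda x, a=a, b=b: np.cos(a * x + b) - 1.0   # bounded continuous f <= 0
        total += np.exp(mu_of_f(pts1, w1, f) + mu_of_f(pts2, w2, f))
    return total / n_draws

x1, w1 = np.array([0.1, 0.4]), np.array([0.5, 0.5])
x2, w2 = np.array([0.2, 0.8]), np.array([0.3, 0.7])
print(integral_kernel(x1, w1, x2, w2))
```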
87 kernel for measures is obtained; an example involving mixtures over exponential families is provided in Section 7. [sent-430, score-0.289]
88 Although any p.d. function can necessarily be represented as an integral of semicharacters by Theorem 10, the semicharacters involved in the representation are not necessarily continuous as in (4). [sent-439, score-0.443]
89 Given functions f1, . . . , fs, define ψ(θ) = log ν[e^{∑_{i=1}^s θi fi}], such that for each θ ∈ Θ, pθ := exp(∑_{i=1}^s θi fi − ψ(θ)) ν is a probability density, which defines an exponential family of densities on X as θ varies in Θ. [sent-461, score-0.368]
90 This identity can be used to propose a renormalized kernel for two measures as k(µ, µ′) := ϕ̃(µ + µ′) / √(ϕ̃(2µ) ϕ̃(2µ′)) = ( e^{−|µ+µ′| h(p_{µ+µ′})} / e^{−|µ| h(p_µ) − |µ′| h(p_{µ′})} ) · ( 2√(|µ||µ′|) / (|µ| + |µ′|) )^{s/2}. [sent-491, score-0.417]
91 We believe however that in more favourable cases, notably when the considered measures are multinomials, the entropy kernel and its structural variants (Hein and Bousquet, 2005) may provide good results. [sent-507, score-0.342]
92 and the value of the kernel is computed through kv(µ, µ′) = √(Σ1,1(µ) Σ2,2(µ)) √(Σ1,1(µ′) Σ2,2(µ′)) / ( Σ1,1((µ+µ′)/2) Σ2,2((µ+µ′)/2) ). [sent-521, score-0.26]
93 When both weighted sets of points are united, the variance of the mean of both measures has an intermediary behaviour in that respect, and this suffices to discriminate numerically both images. [sent-535, score-0.248]
94 To illustrate this approach we show in Figure 4 the first four eigenfunctions of three measures µ1, µ0 and (µ1 + µ0)/2, built from the images of a handwritten “1” and “0”, with their corresponding eigenvalues, as well as for images of “2” and “0” in Figure 5. [sent-545, score-0.277]
95 Since the RBF kernel is based on the exact overlap between two images, we expect it to perform poorly with few pixels and significantly better when d grows, while we expect the k-IGV to capture the structure of the images more quickly, with fewer pixels, through the kernel κ. [sent-640, score-0.558]
96 The results for the k-IGV kernel presented in Table 1 show a consistent improvement over all other kernels for this benchmark of 1000 images, under all sampling schemes. [sent-647, score-0.243]
97 All semigroup kernels presented in this paper are grounded on statistical estimation, which makes their values stable under variable sizes of samples through renormalization, a property shared with the work of Kondor and Jebara (2003). [sent-652, score-0.673]
98 Once such a function is properly defined, the kernel computation goes through the evaluation of the function on the two measures to be compared and on their mixture. [sent-710, score-0.289]
99 The choice of alternative priors on semicharacters to propose other meaningful kernels, with convenient properties on molecular measures for instance, is also a subject of future research. [sent-718, score-0.522]
100 As for practical applications, these kernels can be naturally applied to complex objects seen as molecular measures. [sent-719, score-0.378]
wordName wordTfidf (topN-words)
[('semigroup', 0.515), ('igv', 0.351), ('molecular', 0.198), ('ert', 0.164), ('semicharacters', 0.164), ('ukumizu', 0.164), ('uturi', 0.164), ('measures', 0.16), ('easures', 0.153), ('emigroup', 0.153), ('kv', 0.131), ('berg', 0.13), ('kernel', 0.129), ('def', 0.128), ('kernels', 0.114), ('cuturi', 0.11), ('euclidian', 0.11), ('kondor', 0.106), ('jebara', 0.098), ('det', 0.096), ('ernels', 0.096), ('admissible', 0.096), ('lists', 0.09), ('endowed', 0.09), ('supp', 0.088), ('radon', 0.077), ('vert', 0.077), ('hein', 0.073), ('images', 0.072), ('objects', 0.066), ('abelian', 0.066), ('semicharacter', 0.066), ('pixels', 0.056), ('id', 0.056), ('integral', 0.056), ('topology', 0.054), ('entropy', 0.053), ('smoothed', 0.051), ('niteness', 0.048), ('gram', 0.048), ('base', 0.048), ('weighted', 0.047), ('cb', 0.046), ('wolf', 0.046), ('fukumizu', 0.046), ('mol', 0.046), ('marco', 0.046), ('handwritten', 0.045), ('proposition', 0.044), ('fuglede', 0.044), ('grounded', 0.044), ('renormalization', 0.044), ('tops', 0.044), ('continuity', 0.042), ('variance', 0.041), ('kernelized', 0.041), ('namely', 0.041), ('mnist', 0.041), ('kh', 0.041), ('borel', 0.041), ('ci', 0.04), ('hilbert', 0.04), ('rbf', 0.039), ('digits', 0.038), ('fi', 0.038), ('semigroups', 0.037), ('hausdorff', 0.037), ('ai', 0.036), ('fs', 0.036), ('displayed', 0.036), ('reproducing', 0.035), ('leslie', 0.034), ('measure', 0.033), ('ln', 0.033), ('divisible', 0.033), ('endres', 0.033), ('kenji', 0.033), ('nagaoka', 0.033), ('shashua', 0.033), ('weakly', 0.032), ('absolutely', 0.032), ('bach', 0.032), ('representation', 0.031), ('bousquet', 0.031), ('pointwise', 0.03), ('constructive', 0.03), ('xx', 0.03), ('metrics', 0.029), ('amari', 0.029), ('continuous', 0.028), ('quantify', 0.028), ('risi', 0.028), ('hilbertian', 0.028), ('francis', 0.028), ('notations', 0.026), ('determinant', 0.026), ('weak', 0.026), ('components', 0.026), ('divergence', 0.026), ('strings', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
2 0.068516739 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
3 0.067853935 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.057949904 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.057408981 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
6 0.057325218 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.053656034 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
8 0.053143844 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
9 0.048594717 41 jmlr-2005-Kernel Methods for Measuring Independence
10 0.046510667 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
11 0.041864626 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
12 0.040100213 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
13 0.03905924 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
14 0.038870241 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.037329223 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
16 0.037061878 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
17 0.035775606 47 jmlr-2005-Learning from Examples as an Inverse Problem
18 0.03576421 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
19 0.034249853 3 jmlr-2005-A Classification Framework for Anomaly Detection
20 0.033886436 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
topicId topicWeight
[(0, 0.214), (1, 0.109), (2, 0.025), (3, 0.148), (4, 0.1), (5, 0.01), (6, -0.044), (7, 0.043), (8, -0.02), (9, 0.049), (10, 0.006), (11, -0.178), (12, 0.13), (13, 0.09), (14, -0.165), (15, -0.068), (16, -0.078), (17, 0.149), (18, -0.139), (19, 0.081), (20, -0.06), (21, -0.126), (22, -0.085), (23, -0.059), (24, 0.024), (25, 0.138), (26, -0.01), (27, 0.059), (28, 0.1), (29, -0.032), (30, 0.099), (31, 0.12), (32, 0.039), (33, -0.019), (34, 0.12), (35, -0.007), (36, 0.051), (37, -0.372), (38, 0.202), (39, -0.138), (40, 0.269), (41, -0.229), (42, 0.299), (43, 0.312), (44, 0.219), (45, 0.032), (46, 0.136), (47, 0.044), (48, -0.091), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.93868291 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
2 0.22618997 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
3 0.21902847 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
4 0.21888337 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
5 0.21384771 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
6 0.19837841 41 jmlr-2005-Kernel Methods for Measuring Independence
7 0.1872351 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
8 0.16828357 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.16495281 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
10 0.16266719 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.15835519 3 jmlr-2005-A Classification Framework for Anomaly Detection
12 0.15486096 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.15214872 48 jmlr-2005-Learning the Kernel Function via Regularization
14 0.15189263 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
15 0.13913988 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
16 0.13322641 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.1303819 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
18 0.12981752 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
19 0.12917545 47 jmlr-2005-Learning from Examples as an Inverse Problem
20 0.1244137 20 jmlr-2005-Clustering with Bregman Divergences
topicId topicWeight
[(11, 0.022), (13, 0.054), (17, 0.024), (19, 0.033), (36, 0.051), (37, 0.027), (42, 0.014), (43, 0.048), (47, 0.015), (52, 0.083), (59, 0.028), (70, 0.029), (88, 0.076), (90, 0.021), (93, 0.037), (94, 0.34)]
simIndex simValue paperId paperTitle
1 0.89521086 29 jmlr-2005-Efficient Margin Maximizing with Boosting
Author: Gunnar Rätsch, Manfred K. Warmuth
Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.
2 0.87027156 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
3 0.86667728 5 jmlr-2005-A Generalization Error for Q-Learning
Author: Susan A. Murphy
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data
same-paper 4 0.76317024 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.52908546 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
6 0.48179865 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.47850358 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
8 0.47426337 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.45474446 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
10 0.45152643 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
11 0.44993466 11 jmlr-2005-Algorithmic Stability and Meta-Learning
12 0.44846386 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
13 0.44428426 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
14 0.44003803 54 jmlr-2005-Managing Diversity in Regression Ensembles
15 0.43054548 48 jmlr-2005-Learning the Kernel Function via Regularization
16 0.43001378 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
17 0.42879698 67 jmlr-2005-Stability of Randomized Learning Algorithms
18 0.42666978 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.42301384 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.42237937 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching