jmlr jmlr2005 jmlr2005-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
Reference: text
sentIndex sentText sentNum sentScore
1 FR Ecole des Mines de Paris 35 rue Saint Honoré 77305 Fontainebleau, France Editor: John Lafferty Abstract We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. [sent-7, score-0.403]
2 These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. [sent-8, score-0.582]
3 Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. [sent-9, score-0.368]
4 Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. [sent-11, score-0.314]
5 Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space 1. [sent-12, score-0.793]
6 (p.d.) kernel defined between pairs of objects of interest, to be used alongside kernel methods such as support vector machines (Boser et al. [sent-16, score-0.324]
7 One possible approach to kernel design for such complex objects consists in representing them by sets of basic components easier to manipulate, and designing kernels on such sets. [sent-21, score-0.335]
8 When the set of basic components is finite, this representation amounts to encoding a complex object as a finite-dimensional vector of counters, and any kernel for vectors can then be translated to a kernel for complex objects through this feature representation (Joachims, 2002, Leslie et al. [sent-29, score-0.346]
9 For more general situations, several authors have proposed to handle such weighted lists of points by first fitting a probability distribution to each list, and defining a kernel between the resulting distributions (Lafferty and Lebanon, 2002, Jebara et al. [sent-31, score-0.266]
10 Alternatively, Cuturi and Vert (2005) use a parametric family of densities and a Bayesian framework to define a kernel for strings based on the mutual information between their sets of variable-length blocks, using the concept of mutual information kernels (Seeger, 2002). [sent-33, score-0.269]
11 We explore in this contribution a different direction to kernel design for weighted lists of basic components. [sent-36, score-0.266]
12 More precisely, we explore the set of kernels between measures that can be expressed as a function of their sum, that is: k(µ, µ′) = ϕ(µ + µ′). [sent-38, score-0.274]
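To make this construction concrete, here is a minimal sketch (ours, not from the paper) of a kernel of the form k(µ, µ′) = ϕ(µ + µ′), with discrete measures represented as non-negative weight vectors and ϕ = e^{−h}, where h is the entropy, one of the dispersion functions discussed later.

```python
import numpy as np

def entropy(nu):
    # h(nu) = -sum_i nu_i log nu_i, with 0 log 0 := 0 (nu need not sum to 1)
    nz = nu[nu > 0]
    return -np.sum(nz * np.log(nz))

def semigroup_kernel(mu, mu_prime):
    # k(mu, mu') = phi(mu + mu'): the kernel depends only on the sum of the two measures
    return np.exp(-entropy(mu + mu_prime))

mu = np.array([0.5, 0.3, 0.2, 0.0])
mu_prime = np.array([0.1, 0.1, 0.4, 0.4])
print(semigroup_kernel(mu, mu_prime))
```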
13 The set of measures endowed with the addition is an Abelian semigroup, and the kernel (1) is exactly what Berg et al. [sent-42, score-0.379]
14 (p.d.) semigroup kernels for molecular measures or densities. [sent-46, score-0.987]
15 As expected, we prove that several functions ϕ that quantify the dispersion of measures through their entropy or through their variance matrix result in valid p.d. kernels. [sent-47, score-0.282]
16 We introduce entropy in this paper slightly differently, noting that it is a semigroup negative definite function defined on measures. [sent-51, score-0.568]
17 On the other hand, the use of generalized variance to derive a positive definite kernel between measures as proposed here is new to our knowledge. [sent-52, score-0.33]
18 We further show how such kernels can be applied to molecular measures through regularization operations. [sent-53, score-0.472]
19 In the case of the kernel based on the spectrum of the variance matrix, we show how it can be applied implicitly for molecular measures mapped to a reproducing kernel Hilbert space when a p.d. kernel on the space of components is available. [sent-54, score-0.692]
20 Using the general theory of semigroup kernels we state an integral representation of such kernels and study the semicharacters involved in this representation. [sent-60, score-0.994]
21 We first introduce elements of measure representations of weighted lists and define the semigroup formalism and the notion of semigroup p.d. function. [sent-65, score-1.2]
22 Two such p.d. kernels are then presented, which are however usually not defined for molecular measures: the entropy kernel and the inverse generalized variance (IGV) kernel. [sent-70, score-0.421]
23 Through regularization procedures, practical applications of such kernels on molecular measures are proposed in Section 4, and the approach is further extended by kernelizing the IGV through an a priori kernel defined itself on the space of components in Section 5. [sent-71, score-0.627]
24 Section 6 contains the general integral representation of semigroup kernels and Section 7 makes the link between p. [sent-72, score-0.716]
25 Notations and Framework: Semigroup Kernels on Measures In this section we set up the framework and notations of this paper, in particular the idea of semigroup kernel on the semigroup of measures. [sent-77, score-1.185]
26 We introduce the subset of M+(X) of molecular (or atomic) measures Mol+(X), namely measures whose support supp(µ) = {x ∈ X | µ(U) > 0 for every open subset U containing x} is finite. [sent-85, score-0.687]
27 For a molecular measure µ, an admissible base of µ is a finite list γ of weighted points of X, namely γ = (xi, ai)_{i=1..d}, where xi ∈ X and ai > 0 for 1 ≤ i ≤ d, such that µ = ∑_{i=1}^d ai δ_{xi}. [sent-88, score-0.571]
28 An Abelian semigroup (S , +) is a nonempty set S endowed with an associative and commutative composition + and a neutral element 0. [sent-95, score-0.605]
29 A real-valued function ϕ on S is called a positive definite (p.d.) function on the semigroup (S, +) if (s, t) → ϕ(s + t) is a p.d. kernel on S × S. [sent-101, score-0.515]
30 Hence semigroup kernels are real-valued functions ϕ defined on the set of interest S, the similarity between two elements s, t of S being just the value taken by that function on their composition, namely ϕ(s + t). [sent-115, score-0.67]
31 Such a semigroup might be used as a feature representation for complex objects by mapping an object to the set of its components, forgetting about the weights. [sent-117, score-0.612]
32 A semigroup kernel k on P(X) measuring the similarity of two sets of points A, B ∈ P(X) would use the value taken by a given p.d. function on their union A ∪ B. [sent-119, score-0.644]
33 This representation is richer than the one suggested in the previous paragraph, which used the semigroup (P(X), ∪) to consider the merger of two lists. [sent-134, score-0.546]
34 First it performs the union of the supports; second the sum of such molecular measures also adds the weights of the points common to both measures, with a possible renormalization on those weights. [sent-135, score-0.402]
35 We study p.d. functions on the semigroup (M+(X), +), in particular on molecular measures, in order to define kernels on weighted lists of simple components. [sent-140, score-0.964]
36 Five measures of interest are represented: the image measures δz and δz′ of those weighted finite lists, the smoothed density estimates θ(δz) and θ(δz′) of the two lists of points, and the smoothed density estimate θ(δz + δz′) of the union of both lists. [sent-145, score-0.559]
37 Before turning to such p.d. functions, it should however be pointed out that several interesting semigroup p.d. [sent-148, score-0.515]
38 kernels on measures are not directly applicable to molecular measures. [sent-150, score-0.472]
39 A p.d. function ϕ on (M+(X), +) can be used to define k(z, z′) in at least three ways: k(z, z′) = ϕ(δz + δz′), using ϕ directly on the molecular measures; k(z, z′) = ϕ(θ(δz) + θ(δz′)), using ϕ on smoothed versions of the molecular measures; or k(z, z′) = ϕ(θ(δz + δz′)), evaluating ϕ on a smoothed version of the sum. [sent-160, score-0.498]
40 We then show how those quantities can be computed in the case of molecular measures in Section 4. [sent-166, score-0.358]
41 In this section we present two p.d. semigroup kernels for measures, motivated by a common intuition: the kernel between two measures should increase when the sum of the measures gets more “concentrated”. [sent-170, score-1.078]
42 They are therefore limited to a subset of measures, namely the subset of measures with finite entropy and the subset of sub-probability measures with non-degenerated variance, but are extended to a broader class of measures, including molecular measures, in Section 4. [sent-172, score-0.612]
43 1 Entropy Kernel We consider the subset of M+^b(X) of absolutely continuous measures with respect to the dominant measure ν, and identify in this section a measure with its corresponding density with respect to ν. [sent-174, score-0.286]
44 The expression of Equation (2) fits our framework of devising semigroup kernels, unlike the direct use of the KL divergence (Moreno et al. [sent-179, score-0.541]
45 The latter work shares with this paper another similarity, which lies in the “kernelization” of such quantities defined on measures through a prior kernel on the space of components, as will be reviewed in Section 5. [sent-188, score-0.289]
46 However, of all the Hilbertian metrics introduced in Hein and Bousquet (2005), the Jensen-Divergence is the only one that can be related to the semigroup framework used throughout this paper. [sent-189, score-0.544]
47 Proposition 1 h is a negative definite function on the semigroup M+^h(X). [sent-192, score-0.515]
48 As a consequence e^{−h} is a positive definite function on M+^h(X), and its normalized counterpart kh := e^{−J} is an infinitely divisible positive definite kernel on M+^h(X) × M+^h(X). [sent-193, score-0.331]
49 on R+ as a semigroup endowed with addition (Berg et al. [sent-196, score-0.605]
50 Note that only e−h is a semigroup kernel strictly speaking, since e−J involves a normalized sum (through the division by 2) which is not associative. [sent-218, score-0.644]
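As a rough numerical illustration (ours; finite discrete densities stand in for densities with respect to ν), the Jensen divergence J and the kernel kh = e^{−J} can be evaluated as follows.

```python
import numpy as np

def entropy(p):
    # Shannon entropy h(p) = -sum_i p_i log p_i, with 0 log 0 := 0
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

def jensen_divergence(p, q):
    # J(p, q) = h((p + q)/2) - (h(p) + h(q))/2
    return entropy((p + q) / 2.0) - 0.5 * (entropy(p) + entropy(q))

def entropy_kernel(p, q):
    # k_h(p, q) = exp(-J(p, q)), the normalized (infinitely divisible) variant
    return np.exp(-jensen_divergence(p, q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
print(entropy_kernel(p, q))  # closer to 1 when p and q are similar
```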
51 The subset of absolutely continuous probability measures on (X, ν) with finite entropies, namely f ∈ M+^h(X) s.t. [sent-222, score-0.261]
52 |f| = 1, is not a semigroup since it is not closed under addition, but we can nonetheless define the restriction of J, and hence kh, on it to obtain a p.d. kernel. [sent-224, score-0.556]
53 Following the results obtained in the previous section, we propose under these restrictions a second semigroup p.d. kernel. [sent-229, score-0.515]
54 Besides being easy to compute in the case of molecular measures, this quantity is also linked to entropy if we consider that for normal laws N(m, Σ) the following relation holds: 1/√(det Σ) ∝ e^{−h(N(m,Σ))}. [sent-234, score-0.251]
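A quick sanity check of this proportionality (ours): the differential entropy of a Gaussian is h(N(m, Σ)) = ½ log((2πe)^n det Σ), so e^{−h} = (2πe)^{−n/2} (det Σ)^{−1/2} ∝ 1/√(det Σ).

```python
import numpy as np

n = 3
Sigma = np.diag([0.5, 1.0, 2.0])
h = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(Sigma))
lhs = np.exp(-h)
rhs = (2 * np.pi * np.e) ** (-n / 2) / np.sqrt(np.linalg.det(Sigma))
print(lhs, rhs)  # identical up to floating point error
```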
55 Let us define the variance operator on measures µ of M+^b(X) with finite first and second moments as Σ(µ) := µ[xx⊤] − µ[x]µ[x]⊤. [sent-237, score-0.329]
56 We call det Σ(µ) the generalized variance of a measure µ, and say a measure µ is non-degenerated if det Σ(µ) is non-zero, meaning that Σ(µ) is of full rank. [sent-239, score-0.299]
57 However we will work in this case on the mean of two measures, in the same way we used their standard addition in the semigroup framework of M+^b(X). [sent-243, score-0.675]
58 Proposition 3 The real-valued kernel kv defined on elements µ, µ′ of M+^v(X) as kv(µ, µ′) = 1 / det Σ((µ + µ′)/2) is positive definite. [sent-244, score-0.487]
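A minimal sketch (ours) of this IGV kernel for two normalized weighted point lists, estimating the variance operator Σ(µ) = µ[xx⊤] − µ[x]µ[x]⊤ directly from an admissible base of the mean measure; function and variable names are illustrative.

```python
import numpy as np

def variance_operator(points, weights):
    # Sigma(mu) = mu[x x^T] - mu[x] mu[x]^T for mu = sum_i a_i delta_{x_i}, total mass 1
    mean = weights @ points
    second = points.T @ (points * weights[:, None])
    return second - np.outer(mean, mean)

def igv_kernel(pts1, w1, pts2, w2):
    # k_v(mu, mu') = 1 / det Sigma((mu + mu')/2), each measure first renormalized to mass 1
    pts = np.vstack([pts1, pts2])
    w = np.concatenate([w1 / w1.sum(), w2 / w2.sum()]) / 2.0
    return 1.0 / np.linalg.det(variance_operator(pts, w))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(20, 2)), rng.normal(size=(25, 2)) + 0.5
print(igv_kernel(A, np.ones(20), B, np.ones(25)))
```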
59 Since we are most likely to use them on molecular measures or smooth measures (as discussed in Section 2. [sent-258, score-0.518]
60 In the case of the entropy kernel, molecular measures are generally not absolutely continuous with respect to ν (except on finite spaces), and they have therefore no entropy; we solve this problem by mapping them into M+^h(X) through a smoothing kernel. [sent-264, score-0.471]
61 This regularization is particularly important to pave the way to the kernelized version of the IGV kernel presented in the next section, when X is not Euclidian but simply endowed with a prior kernel κ. [sent-266, score-0.389]
62 The application of both the entropy kernel and the IGV kernel to molecular measures requires a previous renormalization to set the total mass of the measures to 1. [sent-267, score-0.873]
63 All molecular measures in this section, and equivalently all admissible bases, will hence be supposed to be normalized such that their total weight is 1, and Mol1 (X ) denotes the subset of Mol+ (X ) of such measures. [sent-269, score-0.454]
64 1 Entropy Kernel on Smoothed Estimates We first define the Parzen smoothing procedure, which makes it possible to map molecular measures onto measures with finite entropy: Definition 4 Let κ be a probability kernel on X with finite entropy, i. [sent-271, score-0.647]
65 Once this mapping is defined, we use the entropy kernel to propose the following kernel on two molecular measures µ and µ′: kh^κ(µ, µ′) = e^{−J(θκ(µ), θκ(µ′))}. [sent-276, score-0.71]
66 As an example, let X be a Euclidian space of dimension n endowed with Lebesgue's measure, and κ the isotropic Gaussian RBF kernel on that space, namely κ(x, y) = (2πσ)^{−n/2} e^{−‖x−y‖²/(2σ²)}. [sent-277, score-0.26]
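A rough sketch (ours) of the smoothed entropy kernel kh^κ(µ, µ′) = e^{−J(θκ(µ), θκ(µ′))} in one dimension: the molecular measures are smoothed with a Gaussian window and discretized on a grid so the entropies can be evaluated numerically; the grid, bandwidth and variable names are our own illustrative choices.

```python
import numpy as np

def parzen_density(points, weights, grid, sigma=0.2):
    # theta_kappa(mu): Gaussian-window estimate of mu, evaluated on a 1-D grid
    w = weights / weights.sum()
    dens = np.zeros_like(grid)
    for x, a in zip(points, w):
        dens += a * np.exp(-(grid - x) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return dens

def entropy(density, dx):
    nz = density[density > 0]
    return -np.sum(nz * np.log(nz)) * dx

def smoothed_entropy_kernel(z1, w1, z2, w2, grid, sigma=0.2):
    dx = grid[1] - grid[0]
    f, g = parzen_density(z1, w1, grid, sigma), parzen_density(z2, w2, grid, sigma)
    J = entropy((f + g) / 2, dx) - 0.5 * (entropy(f, dx) + entropy(g, dx))
    return np.exp(-J)

grid = np.linspace(-3, 3, 601)
print(smoothed_entropy_kernel(np.array([-0.5, 0.2]), np.ones(2),
                              np.array([0.1, 0.6, 0.9]), np.ones(3), grid))
```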
67 We thus introduce the regularized kernel kv^η defined on measures (µ, µ′) ∈ M+^b(X) with finite second moment as kv^η(µ, µ′) := 1 / det( (1/η) Σ((µ+µ′)/2) + In ). [sent-294, score-0.775]
68 It is straightforward to prove that the regularized function kv^η is a positive definite kernel on the measures of M+^b(X) with finite second-order moments, using the same proof as in Proposition 3. [sent-295, score-0.26]
69 Thus, regardless of the difference between n and d, we have det((1/η) K̃∆ + Id) = det((1/η) ∆^{1/2} X̃⊤X̃ ∆^{1/2} + Id) = det((1/η) X̃∆X̃⊤ + In) = det((1/η) Σ + In), where the addition of identity matrices only introduces an offset of 1 for all eigenvalues. [sent-304, score-0.384]
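The determinant identity above can be checked numerically; the sketch below (ours, on random data) compares the d×d Gram-side determinant det((1/η)K̃∆ + Id) with the n×n covariance-side determinant det((1/η)Σ + In).

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 3, 10, 0.5
X = rng.normal(size=(n, d))        # the d points of the admissible base, as columns in R^n
w = rng.random(d); w /= w.sum()    # weights with total mass 1
Xc = X - (X @ w)[:, None]          # centre around the weighted mean
Delta = np.diag(w)
K = Xc.T @ Xc                      # d x d centred Gram matrix K~
Sigma = Xc @ Delta @ Xc.T          # n x n variance operator Sigma(mu)

lhs = np.linalg.det(K @ Delta / eta + np.eye(d))
rhs = np.linalg.det(Sigma / eta + np.eye(n))
print(lhs, rhs)                    # the two determinants coincide
```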
70 The kernel kv^η defined on two measures µ, µ′ of Mol1+(X) as kv^η(µ, µ′) = 1 / det( (1/η) K̃γ ∆γ + I_{l(γ)} ), where γ is any admissible base of (µ + µ′)/2, is p.d. [sent-307, score-0.791]
71 Given two objects z, z′ and their respective molecular measures δz and δz′, the computation of the IGV for two such objects requires in practice an admissible base of (δz + δz′)/2, as seen in Theorem 6. [sent-310, score-0.634]
72 The kernel kκ^η(µ, µ′) = 1 / det( (1/η) K̃γ ∆γ + I_{l(γ)} ),   (3) defined on two elements µ, µ′ in Mol1+(X), is positive definite, where γ is any admissible base of (µ + µ′)/2. [sent-330, score-0.369]
73 We consider the reproducing kernel Hilbert space Ξ with reproducing kernel κ indexed on X. [sent-336, score-0.328]
74 We define Y := supp(∑_{i=1}^N µi) ⊂ X, the finite set which numbers all elements in the support of the N considered measures, and ϒ := span φ(Y) ⊂ Ξ, the linear span of the elements in the image of Y through φ. [sent-338, score-0.344]
75 Given a molecular measure µ ∈ Mol1+(Y), let φ(µ) denote the image measure of µ in ϒ, namely φ(µ) = ∑_{x∈Y} µ(x) δ_{φ(x)}. [sent-341, score-0.305]
76 One can easily check that any admissible base γ = (xi, ai)_{i=1..d} of µ can be used to provide an admissible base φ(γ) := (φ(xi), ai)_{i=1..d} of φ(µ). [sent-342, score-0.488]
77 As a result, we have that kκ^η(µi, µj) = kv^η(φ(µi), φ(µj)), where kv^η is defined on Mol1+(ϒ), ensuring the non-negativity ∑_{i,j=1}^N ci cj kκ^η(µi, µj) = ∑_{i,j=1}^N ci cj kv^η(φ(µi), φ(µj)) ≥ 0 and hence the positive definiteness of kκ^η. [sent-344, score-0.473]
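A sketch (ours) of the kernelized IGV of Equation (3) for two weighted point lists, with a prior Gaussian kernel κ on the components; the weighted centring of the Gram matrix and the choices of κ, σ and η are our own illustrative assumptions.

```python
import numpy as np

def gaussian_gram(pts, sigma=1.0):
    sq = np.sum(pts ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * pts @ pts.T
    return np.exp(-D / (2 * sigma ** 2))

def kernelized_igv(pts1, w1, pts2, w2, eta=0.1, sigma=1.0):
    # k_kappa^eta(mu, mu') = 1 / det((1/eta) K~_gamma Delta_gamma + I),
    # gamma being an admissible base of (mu + mu')/2
    pts = np.vstack([pts1, pts2])
    w = np.concatenate([w1 / w1.sum(), w2 / w2.sum()]) / 2.0
    K = gaussian_gram(pts, sigma)
    C = np.eye(len(w)) - np.outer(np.ones(len(w)), w)   # centre features at the weighted mean
    Kc = C @ K @ C.T
    return 1.0 / np.linalg.det(Kc @ np.diag(w) / eta + np.eye(len(w)))

rng = np.random.default_rng(2)
A, B = rng.normal(size=(15, 2)), rng.normal(size=(20, 2)) + 0.3
print(kernelized_igv(A, np.ones(15), B, np.ones(20)))
```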
78 Before observing these practical improvements, we provide a general study of the family of semigroup kernels on M+^b(X) by casting the theory of integral representations of positive definite functions on a semigroup (Berg et al. [sent-346, score-1.2]
79 We consider p.d. functions on the whole semigroup (M+^b(X), +), thus including measures which are not normalized. [sent-351, score-0.675]
80 This characterization is based on a general integral representation theorem valid for any semigroup kernel, and is similar in spirit to the representation of p. [sent-352, score-0.633]
81 Definition 8 A real-valued function ρ on an Abelian semigroup (S, +) is called a semicharacter if it satisfies the following properties: (i) ρ(0) = 1 (ii) ∀s,t ∈ S, ρ(s + t) = ρ(s)ρ(t). [sent-358, score-0.581]
82 These theorems being valid on any semigroup, they hold in particular on the semigroup (M+^b(X), +). [sent-377, score-0.515]
83 Proposition 11 A semicharacter ρ : M+^b(X) → R is continuous on (M+^b(X), +) endowed with the weak topology if and only if there exists f ∈ Cb(X) such that ρ = ρ_f. [sent-395, score-0.264]
84 It is further equal to ρ on M+^b(X) through the denseness of molecular measures in M+^b(X), both in the weak and the pointwise topology (Berg et al. [sent-407, score-0.468]
85 Then the following ϕ is a p.d. function on the semigroup (Mω, +): ϕ(µ) = ∫_{C(X)} ρ_f(µ) dω(f).   (4) [sent-419, score-0.515]
86 If supp ω ⊂ Cb(X) then ϕ is continuous on Mω endowed with the topology of weak convergence. [sent-420, score-0.286]
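A toy Monte-Carlo sketch (ours) of the representation (4): semicharacters of the form ρ_f(µ) = e^{µ[f]} are averaged over random bounded functions f drawn from an illustrative prior ω, yielding an approximate p.d. semigroup kernel on one-dimensional molecular measures.

```python
import numpy as np

rng = np.random.default_rng(3)

def mu_of_f(points, weights, f):
    # mu[f] = sum_i a_i f(x_i) for a molecular measure mu = sum_i a_i delta_{x_i}
    return np.sum(weights * f(points))

def integral_kernel(pts1, w1, pts2, w2, n_draws=2000):
    # phi(mu + mu') ~ (1/N) sum_j exp((mu + mu')[f_j]),  f_j ~ omega
    total = 0.0
    for _ in range(n_draws):
        a, b = rng.normal(), rng.uniform(0, 2 * np.pi)
        f = lambda x, a=a, b=b: np.cos(a * x + b) - 1.0   # bounded continuous f <= 0
        total += np.exp(mu_of_f(pts1, w1, f) + mu_of_f(pts2, w2, f))
    return total / n_draws

x1, w1 = np.array([0.1, 0.4]), np.array([0.5, 0.5])
x2, w2 = np.array([0.2, 0.8]), np.array([0.3, 0.7])
print(integral_kernel(x1, w1, x2, w2))
```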
87 kernel for measures is obtained; an example involving mixtures over exponential families is provided in Section 7. [sent-430, score-0.289]
88 Although any p.d. function can necessarily be represented as an integral of semicharacters by Theorem 10, the semicharacters involved in the representation are not necessarily continuous as in (4). [sent-439, score-0.443]
89 Given functions f1, . . . , fs, define ψ(θ) = log ν[e^{∑_{i=1}^s θi fi}], such that for each θ ∈ Θ, pθ := exp(∑_{i=1}^s θi fi − ψ(θ)) ν is a probability density, which defines an exponential family of densities on X as θ varies in Θ. [sent-461, score-0.368]
90 This identity can be used to propose a renormalized kernel for two measures as k(µ, µ′) := ϕ̃(µ + µ′) / √(ϕ̃(2µ) ϕ̃(2µ′)) = ( e^{−|µ+µ′| h(p_{µ+µ′})} / e^{−|µ| h(p_µ) − |µ′| h(p_{µ′})} ) · ( 2√(|µ||µ′|) / (|µ| + |µ′|) )^{s/2}. [sent-491, score-0.417]
91 We believe however that in more favourable cases, notably when the considered measures are multinomials, the entropy kernel and its structural variants (Hein and Bousquet, 2005) may provide good results. [sent-507, score-0.342]
92 and the value of the kernel is computed through kv(µ, µ′) = √(Σ1,1(µ) Σ2,2(µ)) √(Σ1,1(µ′) Σ2,2(µ′)) / ( Σ1,1((µ+µ′)/2) Σ2,2((µ+µ′)/2) ). [sent-521, score-0.26]
93 When both weighted sets of points are united, the variance of the mean of both measures has an intermediary behaviour in that respect, and this suffices to discriminate numerically both images. [sent-535, score-0.248]
94 To illustrate this approach we show in Figure 4 the first four eigenfunctions of three measures µ1, µ0 and (µ1 + µ0)/2, built from the images of a handwritten “1” and “0”, with their corresponding eigenvalues, as well as for images of “2” and “0” in Figure 5. [sent-545, score-0.277]
95 Since the RBF kernel is based on the exact overlap between two images, we expect it to perform poorly with few pixels and significantly better when d grows, while we expect the k-IGV to capture the structure of the images more quickly, with fewer pixels, through the kernel κ. [sent-640, score-0.558]
96 The results for the k-IGV kernel presented in Table 1 show a consistent improvement over all other kernels for this benchmark of 1000 images, under all sampling schemes. [sent-647, score-0.243]
97 All semigroup kernels presented in this paper are grounded on statistical estimation, which makes their values stable under variable sizes of samples through renormalization, a property shared with the work of Kondor and Jebara (2003). [sent-652, score-0.673]
98 Once such a function is properly defined, the kernel computation goes through the evaluation of the function on the two measures to be compared and on their mixture. [sent-710, score-0.289]
99 The choice of alternative priors on semicharacters to propose other meaningful kernels, with convenient properties on molecular measures for instance, is also a subject of future research. [sent-718, score-0.522]
100 As for practical applications, these kernels can be naturally applied to complex objects seen as molecular measures. [sent-719, score-0.378]
wordName wordTfidf (topN-words)
[('semigroup', 0.515), ('igv', 0.351), ('molecular', 0.198), ('ert', 0.164), ('semicharacters', 0.164), ('ukumizu', 0.164), ('uturi', 0.164), ('measures', 0.16), ('easures', 0.153), ('emigroup', 0.153), ('kv', 0.131), ('berg', 0.13), ('kernel', 0.129), ('def', 0.128), ('kernels', 0.114), ('cuturi', 0.11), ('euclidian', 0.11), ('kondor', 0.106), ('jebara', 0.098), ('det', 0.096), ('ernels', 0.096), ('admissible', 0.096), ('lists', 0.09), ('endowed', 0.09), ('supp', 0.088), ('radon', 0.077), ('vert', 0.077), ('hein', 0.073), ('images', 0.072), ('objects', 0.066), ('abelian', 0.066), ('semicharacter', 0.066), ('pixels', 0.056), ('id', 0.056), ('integral', 0.056), ('topology', 0.054), ('entropy', 0.053), ('smoothed', 0.051), ('niteness', 0.048), ('gram', 0.048), ('base', 0.048), ('weighted', 0.047), ('cb', 0.046), ('wolf', 0.046), ('fukumizu', 0.046), ('mol', 0.046), ('marco', 0.046), ('handwritten', 0.045), ('proposition', 0.044), ('fuglede', 0.044), ('grounded', 0.044), ('renormalization', 0.044), ('tops', 0.044), ('continuity', 0.042), ('variance', 0.041), ('kernelized', 0.041), ('namely', 0.041), ('mnist', 0.041), ('kh', 0.041), ('borel', 0.041), ('ci', 0.04), ('hilbert', 0.04), ('rbf', 0.039), ('digits', 0.038), ('fi', 0.038), ('semigroups', 0.037), ('hausdorff', 0.037), ('ai', 0.036), ('fs', 0.036), ('displayed', 0.036), ('reproducing', 0.035), ('leslie', 0.034), ('measure', 0.033), ('ln', 0.033), ('divisible', 0.033), ('endres', 0.033), ('kenji', 0.033), ('nagaoka', 0.033), ('shashua', 0.033), ('weakly', 0.032), ('absolutely', 0.032), ('bach', 0.032), ('representation', 0.031), ('bousquet', 0.031), ('pointwise', 0.03), ('constructive', 0.03), ('xx', 0.03), ('metrics', 0.029), ('amari', 0.029), ('continuous', 0.028), ('quantify', 0.028), ('risi', 0.028), ('hilbertian', 0.028), ('francis', 0.028), ('notations', 0.026), ('determinant', 0.026), ('weak', 0.026), ('components', 0.026), ('divergence', 0.026), ('strings', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
2 0.068516739 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
3 0.067853935 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.057949904 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
5 0.057408981 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
6 0.057325218 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.053656034 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
8 0.053143844 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
9 0.048594717 41 jmlr-2005-Kernel Methods for Measuring Independence
10 0.046510667 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
11 0.041864626 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
12 0.040100213 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
13 0.03905924 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
14 0.038870241 36 jmlr-2005-Gaussian Processes for Ordinal Regression
15 0.037329223 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
16 0.037061878 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
17 0.035775606 47 jmlr-2005-Learning from Examples as an Inverse Problem
18 0.03576421 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
19 0.034249853 3 jmlr-2005-A Classification Framework for Anomaly Detection
20 0.033886436 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
topicId topicWeight
[(0, 0.214), (1, 0.109), (2, 0.025), (3, 0.148), (4, 0.1), (5, 0.01), (6, -0.044), (7, 0.043), (8, -0.02), (9, 0.049), (10, 0.006), (11, -0.178), (12, 0.13), (13, 0.09), (14, -0.165), (15, -0.068), (16, -0.078), (17, 0.149), (18, -0.139), (19, 0.081), (20, -0.06), (21, -0.126), (22, -0.085), (23, -0.059), (24, 0.024), (25, 0.138), (26, -0.01), (27, 0.059), (28, 0.1), (29, -0.032), (30, 0.099), (31, 0.12), (32, 0.039), (33, -0.019), (34, 0.12), (35, -0.007), (36, 0.051), (37, -0.372), (38, 0.202), (39, -0.138), (40, 0.269), (41, -0.229), (42, 0.299), (43, 0.312), (44, 0.219), (45, 0.032), (46, 0.136), (47, 0.044), (48, -0.091), (49, -0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.93868291 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
2 0.22618997 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
3 0.21902847 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
4 0.21888337 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
5 0.21384771 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
6 0.19837841 41 jmlr-2005-Kernel Methods for Measuring Independence
7 0.1872351 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
8 0.16828357 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.16495281 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
10 0.16266719 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
11 0.15835519 3 jmlr-2005-A Classification Framework for Anomaly Detection
12 0.15486096 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
13 0.15214872 48 jmlr-2005-Learning the Kernel Function via Regularization
14 0.15189263 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
15 0.13913988 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
16 0.13322641 36 jmlr-2005-Gaussian Processes for Ordinal Regression
17 0.1303819 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
18 0.12981752 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
19 0.12917545 47 jmlr-2005-Learning from Examples as an Inverse Problem
20 0.1244137 20 jmlr-2005-Clustering with Bregman Divergences
topicId topicWeight
[(11, 0.022), (13, 0.054), (17, 0.024), (19, 0.033), (36, 0.051), (37, 0.027), (42, 0.014), (43, 0.048), (47, 0.015), (52, 0.083), (59, 0.028), (70, 0.029), (88, 0.076), (90, 0.021), (93, 0.037), (94, 0.34)]
simIndex simValue paperId paperTitle
1 0.89521086 29 jmlr-2005-Efficient Margin Maximizing with Boosting
Author: Gunnar Rätsch, Manfred K. Warmuth
Abstract: AdaBoost produces a linear combination of base hypotheses and predicts with the sign of this linear combination. The linear combination may be viewed as a hyperplane in feature space where the base hypotheses form the features. It has been observed that the generalization error of the algorithm continues to improve even after all examples are on the correct side of the current hyperplane. The improvement is attributed to the experimental observation that the distances (margins) of the examples to the separating hyperplane are increasing even after all examples are on the correct side. We introduce a new version of AdaBoost, called AdaBoost∗ , that explicitly maximizes the ν minimum margin of the examples up to a given precision. The algorithm incorporates a current estimate of the achievable margin into its calculation of the linear coefficients of the base hypotheses. The bound on the number of iterations needed by the new algorithms is the same as the number needed by a known version of AdaBoost that must have an explicit estimate of the achievable margin as a parameter. We also illustrate experimentally that our algorithm requires considerably fewer iterations than other algorithms that aim to maximize the margin.
2 0.87027156 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
3 0.86667728 5 jmlr-2005-A Generalization Error for Q-Learning
Author: Susan A. Murphy
Abstract: Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function. Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data
same-paper 4 0.76317024 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.52908546 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
Author: Günther Eibl, Karl-Peter Pfeiffer
Abstract: AdaBoost.M2 is a boosting algorithm designed for multiclass problems with weak base classifiers. The algorithm is designed to minimize a very loose bound on the training error. We propose two alternative boosting algorithms which also minimize bounds on performance measures. These performance measures are not as strongly connected to the expected error as the training error, but the derived bounds are tighter than the bound on the training error of AdaBoost.M2. In experiments the methods have roughly the same performance in minimizing the training and test error rates. The new algorithms have the advantage that the base classifier should minimize the confidence-rated error, whereas for AdaBoost.M2 the base classifier should minimize the pseudo-loss. This makes them more easily applicable to already existing base classifiers. The new algorithms also tend to converge faster than AdaBoost.M2. Keywords: boosting, multiclass, ensemble, classification, decision stumps
6 0.48179865 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
7 0.47850358 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
8 0.47426337 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.45474446 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
10 0.45152643 24 jmlr-2005-Core Vector Machines: Fast SVM Training on Very Large Data Sets
11 0.44993466 11 jmlr-2005-Algorithmic Stability and Meta-Learning
12 0.44846386 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
13 0.44428426 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
14 0.44003803 54 jmlr-2005-Managing Diversity in Regression Ensembles
15 0.43054548 48 jmlr-2005-Learning the Kernel Function via Regularization
16 0.43001378 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
17 0.42879698 67 jmlr-2005-Stability of Randomized Learning Algorithms
18 0.42666978 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
19 0.42301384 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.42237937 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching