jmlr jmlr2005 jmlr2005-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
Reference: text
sentIndex sentText sentNum sentScore
1 The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. [sent-6, score-0.953]
2 As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. [sent-7, score-0.469]
3 Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. [sent-9, score-0.451]
4 Keywords: kernels, heat equation, diffusion, information geometry, text classification 1. [sent-10, score-0.479]
5 Motivated by this need for kernel methods that can be applied to discrete, categorical data, Kondor and Lafferty (2002) propose the use of discrete diffusion kernels and tools from spectral graph theory for data represented by graphs. [sent-21, score-0.559]
6 In this paper, we propose a related construction of kernels based on the heat equation. [sent-22, score-0.531]
7 We then exploit the geometry of the statistical family; specifically, we consider the heat equation with respect to the Riemannian structure given by the Fisher metric, leading to a Mercer kernel defined on the appropriate function spaces. [sent-24, score-0.697]
8 Since the kernels are intimately based on the geometry of the Fisher information metric and the heat or diffusion equation on the associated Riemannian manifold, we refer to them here as information diffusion kernels. [sent-26, score-1.586]
9 One apparent limitation of the discrete diffusion kernels of Kondor and Lafferty (2002) is the difficulty of analyzing the associated learning algorithms in the discrete setting. [sent-27, score-0.485]
10 The statistical manifold of the n-dimensional multinomial family comes from an embedding of the multinomial simplex into the n-dimensional sphere which is isometric under the Fisher information metric. [sent-36, score-0.754]
11 Thus, the multinomial family can be viewed as a manifold of constant positive curvature. [sent-37, score-0.452]
12 While the heat kernel for this manifold does not have a closed form, we can approximate the kernel in a closed form using the leading term in the parametrix expansion, a small time asymptotic expansion for the heat kernel that is of great use in differential geometry. [sent-39, score-1.585]
13 Our experimental results indicate that the multinomial information diffusion kernel performs very well empirically. [sent-42, score-0.617]
14 In Section 2 we review the relevant concepts that are required from Riemannian geometry, including the heat kernel for a general Riemannian manifold and its parametrix expansion. [sent-46, score-0.919]
15 Section 4 derives bounds on covering numbers and Rademacher averages for various learning algorithms that use the new kernels, borrowing results from differential geometry on bounds for the geometric Laplacian. [sent-48, score-0.433]
16 Section 5 describes the results of applying the multinomial diffusion kernels to text classification, and we conclude with a discussion of our results in Section 6. [sent-49, score-0.687]
17 The Heat Kernel. In this section we review the basic properties of the heat kernel on a Riemannian manifold, together with its asymptotic expansion, the parametrix. [sent-51, score-0.507]
18 The heat kernel and its parametrix expansion contain a wealth of geometric information, and indeed much of modern differential geometry, notably index theory, is based upon the use of the heat kernel and its generalizations. [sent-52, score-1.299]
19 The fundamental nature of the heat kernel makes it a natural candidate to consider for statistical learning applications. [sent-53, score-0.507]
20 1 The Heat Kernel. The Laplacian is used to model how heat will diffuse throughout a geometric manifold; the flow is governed by the following second order differential equation with initial conditions: ∂f/∂t − ∆f = 0, f(x, 0) = f0(x). [sent-57, score-0.532]
21 The value f (x,t) describes the heat at location x at time t, beginning from an initial distribution of heat given by f0 (x) at time zero. [sent-58, score-0.866]
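To make the initial value problem above concrete, here is a small illustrative sketch (not code from the paper; the grid, time step, and initial condition are arbitrary choices) that integrates the one-dimensional heat equation ∂f/∂t = ∆f on a periodic grid with an explicit finite-difference step.

# Hypothetical illustration: explicit finite differences for the 1-D heat equation
# df/dt = d^2f/dx^2 on a periodic grid; the heat spreads out toward a uniform value.
import numpy as np

n = 200
dx = 2 * np.pi / n                      # periodic grid on [0, 2*pi)
dt = 0.4 * dx ** 2                      # explicit Euler is stable for dt <= dx^2 / 2
x = np.arange(n) * dx
f = np.where((x > 2.0) & (x < 4.0), 1.0, 0.0)   # initial heat distribution f0(x)

for _ in range(5000):
    laplacian = (np.roll(f, -1) - 2 * f + np.roll(f, 1)) / dx ** 2
    f = f + dt * laplacian              # one explicit Euler step of the heat equation

print(round(f.mean(), 4), round(f.max() - f.min(), 4))   # mean is conserved, spread shrinks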
22 The heat or diffusion kernel Kt (x, y) is the solution to the heat equation f (x,t) with initial condition given by Dirac’s delta function δy . [sent-59, score-1.327]
23 As a consequence of the linearity of the heat equation, the heat kernel can be used to generate the solution to the heat equation with arbitrary initial conditions, according to f(x, t) = ∫_M Kt(x, y) f0(y) dy. [sent-60, score-1.409]
24 As a simple special case, consider heat flow on the circle, or one-dimensional sphere M = S1, with the metric inherited from the Euclidean metric on R2. [sent-61, score-0.639]
25 As the time parameter t gets large, the solution converges to f(θ, t) −→ a0, which is the average value of f; thus, the heat diffuses until the manifold is at a uniform temperature. [sent-63, score-0.69]
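The same behavior can be seen spectrally; the sketch below (an illustration only, with an arbitrary initial function and truncation order) expands f0 in a Fourier series on the circle and damps each mode by e^(−j²t), so that f(θ, t) approaches the mean value a0 as t grows.

# Hypothetical sketch: heat flow on the circle S^1 in the Fourier basis.
# Mode j decays as exp(-j^2 t); only the mean a0 survives as t -> infinity.
import numpy as np

theta = np.linspace(0.0, 2 * np.pi, 512, endpoint=False)
f0 = np.exp(np.cos(theta))                         # arbitrary smooth initial condition

def heat_flow_circle(f0, t, n_modes=50):
    m = len(f0)
    f_t = np.full_like(f0, f0.mean())              # the a0 term
    for j in range(1, n_modes + 1):
        aj = 2.0 / m * np.sum(f0 * np.cos(j * theta))
        bj = 2.0 / m * np.sum(f0 * np.sin(j * theta))
        f_t += np.exp(-j ** 2 * t) * (aj * np.cos(j * theta) + bj * np.sin(j * theta))
    return f_t

for t in (0.01, 0.1, 1.0, 10.0):
    print(t, np.max(np.abs(heat_flow_circle(f0, t) - f0.mean())))   # tends to 0 as t grows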
26 This simple example shows several properties of the general solution of the heat equation on a (compact) Riemannian manifold; in particular, note that the eigenvalues of the kernel scale as λ_j ∼ e^(−j^(2/d)), where the dimension in this case is d = 1. [sent-65, score-0.538]
27 When M = R, the heat kernel is the familiar Gaussian kernel, so that the solution to the heat equation is expressed as f(x, t) = (1/√(4πt)) ∫_R e^(−(x−y)²/(4t)) f0(y) dy, and it is seen that as t −→ ∞, the heat diffuses out “to infinity” so that f(x, t) −→ 0. [sent-66, score-1.397]
28 The following theorem summarizes the basic properties for the kernel of the heat equation on M; we refer to Schoen and Yau (1994) for a proof. [sent-70, score-0.507]
29 Then there exists a function K ∈ C∞(R+ × M × M), called the heat kernel, which satisfies the following properties for all x, y ∈ M, with Kt(·, ·) = K(t, ·, ·). [sent-72, score-0.433]
30 Properties 2 and 3 imply that Kt(x, y) solves the heat equation in x, starting from a point heat source at y. [sent-78, score-0.866]
31 It follows that e^(t∆) f0(x) = f(x, t) = ∫_M Kt(x, y) f0(y) dy solves the heat equation with initial conditions f(x, 0) = f0(x), since ∂f(x, t)/∂t = ∫_M ∂Kt(x, y)/∂t f0(y) dy = ∫_M ∆Kt(x, y) f0(y) dy = ∆ ∫_M Kt(x, y) f0(y) dy = ∆f(x, t), and lim_{t→0} f(x, t) = ∫_M lim_{t→0} Kt(x, y) f0(y) dy = f0(x). [sent-79, score-0.613]
32 Property 4 implies that e^(t∆) e^(s∆) = e^((t+s)∆), which has the physically intuitive interpretation that heat diffusion for time t is the composition of heat diffusion up to time s with heat diffusion for an additional time t − s. [sent-80, score-2.46]
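The semigroup identity can be checked numerically in the Euclidean case; in the sketch below (an illustration only, with arbitrary grid and time values) the convolution of two Gaussian heat kernels on R recovers the kernel at the summed time.

# Hypothetical numerical check of the semigroup property on M = R:
# integrating K_t(x, z) K_s(z, y) over z approximates K_{t+s}(x, y).
import numpy as np

def K(t, x, y):
    return (4 * np.pi * t) ** -0.5 * np.exp(-(x - y) ** 2 / (4 * t))

t, s, x, y = 0.3, 0.5, -0.7, 1.2
z = np.linspace(-30.0, 30.0, 200001)
dz = z[1] - z[0]
composed = np.sum(K(t, x, z) * K(s, z, y)) * dz    # (K_t composed with K_s)(x, y)
print(composed, K(t + s, x, y))                    # the two values agree closely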
33 The heat kernel Kt(x, y) is a natural candidate for measuring the similarity between points x, y ∈ M, while respecting the geometry encoded in the metric g. [sent-85, score-0.788]
34 1 The Parametrix Expansion. For most geometries, there is no closed form solution for the heat kernel. [sent-90, score-0.433]
35 In fact, the existence of the heat kernel, as asserted in the above theorem, is most directly proven by first showing the existence of the parametrix expansion. [sent-92, score-0.588]
36 Recall that the heat kernel on flat n-dimensional Euclidean space is given by Kt^Euclid(x, y) = (4πt)^(−n/2) exp(−‖x − y‖²/(4t)), where ‖x − y‖² = ∑_{i=1}^{n} |xi − yi|² is the squared Euclidean distance between x and y. [sent-94, score-0.53]
37 The parametrix expansion approximates the heat kernel locally as a correction to this Euclidean heat kernel. [sent-95, score-1.126]
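Since this Euclidean heat kernel is, up to the choice t = σ²/2 and a constant factor, the Gaussian kernel used as a baseline later in the paper, a small sketch may help fix notation; it is an illustration rather than the authors' code, and the random points are placeholders.

# Hypothetical sketch: Gram matrix of the flat Euclidean heat kernel
# K_t(x, y) = (4 pi t)^(-n/2) exp(-||x - y||^2 / (4 t)).
import numpy as np

def euclidean_heat_kernel(X, t):
    n = X.shape[1]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-sq_dists / (4 * t))

X = np.random.default_rng(0).normal(size=(5, 3))    # five placeholder points in R^3
G = euclidean_heat_kernel(X, t=0.5)
print(np.allclose(G, G.T), np.all(np.linalg.eigvalsh(G) > 0))   # symmetric and positive definite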
38 The idea is to obtain ψk recursively by solving the heat equation approximately to order t^m, for small diffusion time t. [sent-97, score-0.82]
39 As suggested by equation (2), Kt^(m) is an approximate solution to the heat equation, and satisfies Kt^(m)(x, y) = Kt(x, y) + O(t^m) for x and y sufficiently close; in particular, the parametrix is not unique. [sent-103, score-0.588]
40 This fact will be used when we employ the parametrix approximation to the heat kernel for statistical learning. [sent-109, score-0.662]
41 Diffusion Kernels on Statistical Manifolds. We now proceed to the main contribution of the paper, which is the application of the heat kernel constructions reviewed in the previous section to the geometry of statistical families, in order to obtain kernels for statistical learning. [sent-111, score-0.795]
42 The following two basic examples illustrate the geometry of the Fisher information metric and the associated diffusion kernel it induces on a statistical manifold. [sent-117, score-0.742]
43 The spherical normal family corresponds to a manifold of constant negative curvature, and the multinomial corresponds to a manifold of constant positive curvature. [sent-118, score-0.749]
44 By a statistical manifold we mean simply a manifold of densities together with the metric induced by the Fisher information matrix, rather than the more general notion of a Riemannian manifold together with a (possibly nonmetric) connection, as defined by Lauritzen (1987). [sent-131, score-0.862]
45 Figure 1: Example decision boundaries for a kernel-based classifier using information diffusion kernels for spherical normal geometry with d = 2 (right), which has constant negative curvature, compared with the standard Gaussian kernel for flat Euclidean space (left). [sent-168, score-0.789]
46 The curved decision boundary for the diffusion kernel can be interpreted statistically by noting that as the variance decreases the mean is known with increasing certainty. [sent-170, score-0.554]
47 The heat kernel on the hyperbolic space Hn has the following explicit form (Grigor’yan and Noguchi, 1998). [sent-171, score-0.559]
48 In Figure 1 the diffusion kernel for hyperbolic space H2 is compared with the Euclidean space Gaussian kernel. [sent-181, score-0.513]
49 The curved decision boundary for the diffusion kernel makes intuitive sense, since as the variance decreases the mean is known with increasing certainty. [sent-182, score-0.554]
50 In this case, the densities on the boundary become singular, as point masses at the mean; the boundary is simply given by ∂M ≅ R^(n−1), which is a manifold without boundary, as required. [sent-184, score-0.443]
51 Figure 3: Example decision boundaries using support vector machines with information diffusion kernels for trinomial geometry on the 2-simplex (top right) compared with the standard Gaussian kernel (left). [sent-226, score-0.749]
52 While the spherical geometry has been derived as the information geometry for a finite multinomial, the same geometry can be used non-parametrically for an arbitrary subset of probability measures, leading to spherical geometry in a Hilbert space (Dawid, 1977). [sent-232, score-0.84]
53 1 The Multinomial Diffusion Kernel. Unlike the explicit expression for the Gaussian geometry discussed above, there is not an explicit form for the heat kernel on the sphere, nor on the positive orthant of the sphere. [sent-235, score-0.697]
54 We will therefore resort to the parametrix expansion to derive an approximate heat kernel for the multinomial. [sent-236, score-0.693]
55 Thus, the leading order parametrix for the multinomial diffusion kernel is Pt^(0)(θ, θ′) = (4πt)^(−n/2) exp(−d²(θ, θ′)/(4t)) (sin d(θ, θ′) / d(θ, θ′))^(−(n−1)/2). [sent-244, score-0.772]
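A short sketch of this leading order approximation follows; it assumes the geodesic distance on the multinomial under the Fisher metric, d(θ, θ′) = 2 arccos(∑_i √(θ_i θ′_i)), and is an illustration of the expression above rather than the authors' implementation (the example points and the small-distance guard are arbitrary).

# Hypothetical sketch: leading order parametrix P_t^(0) for the multinomial,
# with geodesic distance d(theta, theta') = 2 * arccos(sum_i sqrt(theta_i * theta'_i)).
import numpy as np

def geodesic_distance(theta, theta_p):
    cos_half = np.clip(np.sum(np.sqrt(theta * theta_p)), 0.0, 1.0)
    return 2.0 * np.arccos(cos_half)

def parametrix_p0(theta, theta_p, t, n):
    d = geodesic_distance(theta, theta_p)
    ratio = np.sin(d) / d if d > 1e-12 else 1.0      # sin(d)/d -> 1 as d -> 0
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-d ** 2 / (4 * t)) * ratio ** (-(n - 1) / 2)

theta = np.array([0.2, 0.3, 0.5])        # placeholder points on the 2-simplex (n = 2)
theta_p = np.array([0.1, 0.6, 0.3])
print(parametrix_p0(theta, theta_p, t=0.1, n=2))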
56 2 Rounding the Simplex. The case of multinomial geometry poses some technical complications for the analysis of diffusion kernels, due to the fact that the open simplex is not complete, and moreover, its closure is not a differentiable manifold with boundary. [sent-251, score-1.122]
57 However, it does not form a compact manifold with boundary since the boundary has edges and corners. [sent-255, score-0.475]
58 The set ∆δ forms a compact manifold with boundary, and its image under the isometry F : (Pn, g) → (Sn+, h) is a compact submanifold with boundary of the n-sphere. [sent-267, score-0.466]
59 Spectral Bounds on Covering Numbers and Rademacher Averages. We now turn to establishing bounds on the generalization performance of kernel machines that use information diffusion kernels. [sent-272, score-0.49]
60 The primary conclusion that is drawn from these analyses is that from the point of view of generalization error bounds, diffusion kernels behave essentially the same as the standard Gaussian kernel. [sent-277, score-0.485]
61 Using these results we can establish the following bounds on covering numbers for information diffusion kernels. [sent-296, score-0.459]
62 Then the covering numbers for the Dirichlet heat kernel Kt on M satisfy log N(ε, FR(x)) = O((V/t^(d/2)) (log(1/ε))^((d+2)/2)). (8) [sent-300, score-0.589]
63 Proof. By the lower bound in Theorem 3, the Dirichlet eigenvalues of the heat kernel Kt(x, y), which are given by λ_j = e^(−tµ_j), satisfy log λ_j ≤ −tc1(d) (j/V)^(2/d), so that log(1/(λ_1 · · · λ_j)) ≥ tc1 ∑_{i=1}^{j} (i/V)^(2/d) ≥ tc2 j (j/V)^(2/d). [sent-301, score-0.616]
64 Thus, for fixed t the covering numbers scale as log N(ε, F) = O((log(1/ε))^((d+2)/2)), and for fixed ε they scale as log N(ε, F) = O(t^(−d/2)) in the diffusion time t. [sent-309, score-0.547]
65 Then the Rademacher term ψ for the Dirichlet heat kernel Kt on M satisfies ψ(r) ≤ C √((r/t^(d/2)) (log(1/r))^(d/2)) for some constant C depending on the geometry of M. [sent-337, score-0.736]
66 Proof. We have that ψ²(r) = ∑_{j=1}^{∞} min{r, λ_j} = j_r* r + ∑_{j=j_r*}^{∞} e^(−tµ_j) ≤ j_r* r + ∑_{j=j_r*}^{∞} e^(−tc1 j^(2/d)) ≤ j_r* r + C e^(−tc1 (j_r*)^(2/d)) for some constant C, where the first inequality follows from the lower bound in Theorem 3. [sent-338, score-0.564]
67 While the bounds for diffusion kernels were derived for the case of positive curvature, which apply to the special case of the multinomial, similar bounds for general manifolds with curvature bounded below by a negative constant should also be attainable. [sent-344, score-0.636]
68 Multinomial Diffusion Kernels and Text Classification. In this section we present the application of multinomial diffusion kernels to the problem of text classification. [sent-346, score-0.687]
69 However for text, the use of multinomial geometry is natural and well motivated; our experimental results offer some insight into how useful this geometry may be for classification. [sent-348, score-0.536]
70 The multinomial diffusion kernel is given by Kt^Mult(θ, θ′) = (4πt)^(−n/2) exp(−(1/t) arccos²(√θ · √θ′)), as derived in Section 3. [sent-377, score-0.617]
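A small sketch of this kernel for documents mapped to the multinomial simplex by normalized term frequencies is given below; the toy count matrix and the value of t are placeholders, and the code illustrates the formula above rather than reproducing the authors' implementation.

# Hypothetical sketch: multinomial diffusion kernel on tf-normalized documents,
# K_t(theta, theta') = (4 pi t)^(-n/2) exp(-(1/t) arccos(sqrt(theta) . sqrt(theta'))^2).
import numpy as np

def multinomial_diffusion_gram(counts, t):
    theta = counts / counts.sum(axis=1, keepdims=True)       # map documents to the simplex
    sqrt_theta = np.sqrt(theta)
    cos = np.clip(sqrt_theta @ sqrt_theta.T, 0.0, 1.0)        # sqrt(theta) . sqrt(theta')
    n = counts.shape[1] - 1                                   # dimension of the simplex
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-np.arccos(cos) ** 2 / t)

counts = np.array([[3, 0, 1, 2],          # toy term-frequency counts (placeholder data)
                   [0, 4, 1, 0],
                   [2, 1, 0, 3]])
print(multinomial_diffusion_gram(counts, t=0.05))

Such a Gram matrix can be passed to any SVM implementation that accepts precomputed kernels; the (4πt)^(−n/2) factor rescales the whole matrix by a constant, which for an SVM is equivalent to rescaling the regularization parameter.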
71 Figure 5: Experimental results on the WebKB corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel for the multinomial (solid). [sent-404, score-0.687]
72 Figure 6: Results on the WebKB corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). [sent-427, score-0.461]
73 Figure 7: Experimental results on the Reuters corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). [sent-452, score-0.461]
74 Table 1: Experimental results on the WebKB corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-639, score-0.543]
75 As mentioned above, in the case of the diffusion kernel we ... [sent-655, score-0.461]
76 Table 2: Experimental results on the WebKB corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-766, score-0.543]
77 For both the Gaussian and diffusion kernels, we test scale parameters (√2 σ for the Gaussian kernel and 2t^(1/2) for the diffusion kernel) in the set {0. [sent-775, score-0.848]
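A minimal version of this kind of scale sweep is sketched below; it is a hypothetical illustration that uses a simple held-out split and toy synthetic counts rather than the corpora and cross-validation protocol of the paper, with the candidate grid, the fixed C value, and the data all being placeholder choices (scikit-learn's SVC with kernel="precomputed" accepts the Gram matrices directly).

# Hypothetical parameter sweep: choose the diffusion time t on a held-out split
# using a precomputed multinomial diffusion Gram matrix with an SVM.
import numpy as np
from sklearn.svm import SVC

def diffusion_gram(counts, t):
    theta = counts / counts.sum(axis=1, keepdims=True)
    cos = np.clip(np.sqrt(theta) @ np.sqrt(theta).T, 0.0, 1.0)
    return np.exp(-np.arccos(cos) ** 2 / t)          # constant (4 pi t)^(-n/2) omitted

rng = np.random.default_rng(0)
counts = np.vstack([rng.multinomial(50, [0.5, 0.3, 0.1, 0.1], size=20),   # toy class 0
                    rng.multinomial(50, [0.1, 0.1, 0.3, 0.5], size=20)])  # toy class 1
y = np.repeat([0, 1], 20)
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)    # simple even/odd split

for scale in (0.1, 0.3, 1.0, 3.0):                 # placeholder grid of 2 * t^(1/2) values
    t = (scale / 2.0) ** 2
    G = diffusion_gram(counts, t)
    clf = SVC(kernel="precomputed", C=10.0)        # C fixed to a placeholder value
    clf.fit(G[np.ix_(train, train)], y[train])
    err = np.mean(clf.predict(G[np.ix_(test, train)]) != y[test])
    print(scale, err)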
78 Table 3: Experimental results on the Reuters corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-939, score-0.622]
79 The last column compares the presented results with the published results of Zhang and Oles (2001), with a + indicating the diffusion kernel F1 measure is greater than the result published in Zhang and Oles (2001) for this task. [sent-945, score-0.461]
80 However, the multinomial diffusion kernel significantly outperforms the linear and Gaussian kernels for the tf representation, achieving a markedly lower error rate. [sent-948, score-0.921]
81 For the tfidf representation, the diffusion kernel consistently outperforms the other kernels for the WebKB data and usually outperforms the linear and Gaussian kernels for the Reuters data. [sent-949, score-0.657]
82 Table 4: Experimental results on the Reuters corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-1074, score-0.622]
83 It is notable, however, that the multinomial information diffusion kernel achieves at least as high an accuracy without the use of any heuristic term weighting scheme. [sent-1081, score-0.617]
84 Discussion and Conclusion. This paper has introduced a family of kernels that is intimately based on the geometry of the Riemannian manifold associated with a statistical family through the Fisher information metric. [sent-1084, score-0.623]
85 The last column compares the presented results with the published results of Zhang and Oles (2001), with a + indicating the diffusion kernel F1 measure is greater than the result published in Zhang and Oles (2001) for this task. [sent-1149, score-0.461]
86 The main application of these ideas has been to develop the multinomial diffusion kernel. [sent-1152, score-0.543]
87 The primary degree of freedom in the use of information diffusion kernels lies in the specification of the mapping of data to model parameters. [sent-1157, score-0.485]
88 In contrast, information diffusion kernels are based on the full geometry of the statistical family, and yet are also invariant under reparameterization of the family. [sent-1166, score-0.675]
89 While information diffusion kernels are very general, they will be difficult to compute in many cases—explicit formulas such as equations (5–6) for hyperbolic space are rare. [sent-1169, score-0.537]
90 To approximate an information diffusion kernel it may be attractive to use the parametrices and geodesic distance between points, as we have done for the multinomial. [sent-1170, score-0.559]
91 (2004) for multimedia applications, which have the form K(θ, θ′) ∝ exp(−αD(θ, θ′)) ≈ exp(−2αd²(θ, θ′)), and so can be viewed in terms of the leading order approximation to the heat kernel. [sent-1173, score-0.433]
92 (2004) are suggestive that diffusion kernels may be attractive not only for multinomial geometry, but also for much more complex statistical families. [sent-1175, score-0.641]
93 The Geometric Laplacian. In this appendix we briefly review some of the elementary concepts from Riemannian geometry that are used in the construction of information diffusion kernels, since these concepts are not widely used in machine learning. [sent-1180, score-0.577]
94 Since IntM is open it is an n-dimensional manifold without boundary, and ∂M is an (n − 1)-dimensional manifold without boundary. [sent-1196, score-0.539]
95 If f : M → N is a diffeomorphism of the manifold M onto the manifold N, then f induces a push-forward mapping f∗ of the associated tangent spaces. [sent-1197, score-0.604]
96 A Riemannian manifold (M, g) is a differentiable manifold M with a family of smoothly varying positive-definite inner products g = g p on Tp M for each p ∈ M. [sent-1207, score-0.58]
97 For example, every manifold can be embedded in Rm for some m ≥ n (the Whitney embedding theorem), and the Euclidean metric induces a metric on the manifold under the embedding. [sent-1211, score-0.738]
98 This geodesic distance d turns the Riemannian manifold into a metric space, satisfying the usual properties of positivity, symmetry and the triangle inequality. [sent-1219, score-0.446]
99 An orientation of a manifold is a smooth choice of orientation for the tangent spaces, meaning that for local charts ϕi and ϕj, the differential D(ϕj ◦ ϕi^(−1))(x) : Rn −→ Rn is orientation preserving, so the sign of the determinant is constant. [sent-1234, score-0.493]
100 As we comment in the text, this relationship may be of use in approximating information diffusion kernels for complex models. [sent-1268, score-0.485]
wordName wordTfidf (topN-words)
[('heat', 0.433), ('diffusion', 0.387), ('manifold', 0.257), ('tf', 0.206), ('geometry', 0.19), ('riemannian', 0.186), ('kt', 0.179), ('multinomial', 0.156), ('iffusion', 0.155), ('parametrix', 0.155), ('afferty', 0.146), ('anifolds', 0.146), ('ebanon', 0.146), ('fisher', 0.129), ('tatistical', 0.123), ('df', 0.113), ('webkb', 0.112), ('kernels', 0.098), ('jr', 0.094), ('boundary', 0.093), ('metric', 0.091), ('ernels', 0.091), ('reuters', 0.081), ('jn', 0.081), ('simplex', 0.08), ('earn', 0.079), ('pn', 0.076), ('geodesic', 0.075), ('kernel', 0.074), ('rn', 0.074), ('tp', 0.072), ('laplacian', 0.065), ('euclidean', 0.065), ('manifolds', 0.064), ('tangent', 0.061), ('document', 0.059), ('rademacher', 0.055), ('differential', 0.054), ('hyperbolic', 0.052), ('submanifold', 0.052), ('acq', 0.05), ('grain', 0.05), ('gi', 0.05), ('text', 0.046), ('geometric', 0.045), ('averages', 0.043), ('guo', 0.043), ('charts', 0.043), ('moneyfx', 0.043), ('rounding', 0.043), ('covering', 0.043), ('ui', 0.043), ('student', 0.042), ('embedding', 0.042), ('documents', 0.041), ('spherical', 0.04), ('gaussian', 0.04), ('family', 0.039), ('log', 0.039), ('mercer', 0.038), ('corpus', 0.038), ('lafferty', 0.038), ('dy', 0.036), ('det', 0.036), ('grad', 0.034), ('schoen', 0.034), ('yau', 0.034), ('ei', 0.033), ('coordinates', 0.033), ('compact', 0.032), ('kass', 0.032), ('expansion', 0.031), ('eigenvalues', 0.031), ('representation', 0.03), ('crude', 0.029), ('curvature', 0.029), ('faculty', 0.029), ('dh', 0.029), ('bounds', 0.029), ('frequency', 0.029), ('oles', 0.029), ('diffeomorphism', 0.029), ('curves', 0.029), ('dirichlet', 0.028), ('differentiable', 0.027), ('dx', 0.026), ('vi', 0.026), ('geodesically', 0.026), ('milnor', 0.026), ('orientation', 0.026), ('vl', 0.026), ('mendelson', 0.026), ('families', 0.025), ('open', 0.025), ('familiar', 0.024), ('sphere', 0.024), ('embeddings', 0.023), ('velocity', 0.023), ('pt', 0.023), ('distance', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
2 0.20932806 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
3 0.068516739 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.050897971 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
5 0.048018098 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
6 0.047140889 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
7 0.044638496 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
8 0.044634659 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
9 0.043860517 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
10 0.043055609 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
12 0.041970395 36 jmlr-2005-Gaussian Processes for Ordinal Regression
13 0.04018968 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
14 0.038316336 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
15 0.035520844 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.034014989 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
17 0.033344757 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
18 0.033206273 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
19 0.031509366 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.031025507 41 jmlr-2005-Kernel Methods for Measuring Independence
topicId topicWeight
[(0, 0.212), (1, 0.142), (2, -0.026), (3, 0.129), (4, 0.088), (5, 0.044), (6, -0.092), (7, 0.187), (8, 0.098), (9, 0.014), (10, -0.019), (11, -0.553), (12, 0.209), (13, -0.032), (14, 0.025), (15, 0.141), (16, 0.016), (17, 0.144), (18, 0.224), (19, -0.065), (20, -0.117), (21, 0.105), (22, 0.011), (23, -0.045), (24, -0.021), (25, 0.017), (26, -0.066), (27, 0.003), (28, 0.012), (29, -0.027), (30, 0.001), (31, 0.109), (32, 0.001), (33, 0.008), (34, -0.02), (35, -0.011), (36, -0.003), (37, 0.111), (38, -0.018), (39, 0.016), (40, -0.034), (41, -0.0), (42, -0.148), (43, -0.093), (44, 0.014), (45, 0.077), (46, -0.041), (47, 0.013), (48, -0.013), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.9482556 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
2 0.78146356 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
3 0.19711927 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.19164854 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.18979008 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
7 0.15833889 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
8 0.15698659 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.1541201 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
10 0.15043917 41 jmlr-2005-Kernel Methods for Measuring Independence
11 0.14843841 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
12 0.14554563 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
13 0.14313582 36 jmlr-2005-Gaussian Processes for Ordinal Regression
14 0.13954403 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
15 0.13878824 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
16 0.13624036 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
17 0.12865725 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
19 0.11988391 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
20 0.11910103 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
topicId topicWeight
[(11, 0.02), (13, 0.497), (17, 0.024), (19, 0.022), (36, 0.039), (37, 0.027), (42, 0.024), (43, 0.032), (47, 0.017), (52, 0.066), (59, 0.024), (70, 0.026), (88, 0.077), (90, 0.015), (94, 0.018)]
simIndex simValue paperId paperTitle
1 0.87573308 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a flexible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
same-paper 2 0.84990519 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
3 0.46441725 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
4 0.36800948 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.33212286 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
6 0.32998976 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.3293314 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
8 0.32881984 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.31700906 48 jmlr-2005-Learning the Kernel Function via Regularization
10 0.31198692 71 jmlr-2005-Variational Message Passing
11 0.30298036 20 jmlr-2005-Clustering with Bregman Divergences
12 0.30230287 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
13 0.30071512 39 jmlr-2005-Information Bottleneck for Gaussian Variables
14 0.29899457 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
15 0.29634994 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
16 0.29458955 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.29279512 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
18 0.2918092 32 jmlr-2005-Expectation Consistent Approximate Inference
19 0.29100624 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
20 0.28802919 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures