jmlr jmlr2005 jmlr2005-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
Reference: text
sentIndex sentText sentNum sentScore
1 The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. [sent-6, score-0.953]
2 As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. [sent-7, score-0.469]
3 Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. [sent-9, score-0.451]
4 Keywords: kernels, heat equation, diffusion, information geometry, text classification 1. [sent-10, score-0.479]
5 Motivated by this need for kernel methods that can be applied to discrete, categorical data, Kondor and Lafferty (2002) propose the use of discrete diffusion kernels and tools from spectral graph theory for data represented by graphs. [sent-21, score-0.559]
6 In this paper, we propose a related construction of kernels based on the heat equation. [sent-22, score-0.531]
7 We then exploit the geometry of the statistical family; specifically, we consider the heat equation with respect to the Riemannian structure given by the Fisher metric, leading to a Mercer kernel defined on the appropriate function spaces. [sent-24, score-0.697]
8 Since the kernels are intimately based on the geometry of the Fisher information metric and the heat or diffusion equation on the associated Riemannian manifold, we refer to them here as information diffusion kernels. [sent-26, score-1.586]
9 One apparent limitation of the discrete diffusion kernels of Kondor and Lafferty (2002) is the difficulty of analyzing the associated learning algorithms in the discrete setting. [sent-27, score-0.485]
10 The statistical manifold of the n-dimensional multinomial family comes from an embedding of the multinomial simplex into the n-dimensional sphere which is isometric under the Fisher information metric. [sent-36, score-0.754]
11 Thus, the multinomial family can be viewed as a manifold of constant positive curvature. [sent-37, score-0.452]
12 While the heat kernel for this manifold does not have a closed form, we can approximate the kernel in a closed form using the leading term in the parametrix expansion, a small time asymptotic expansion for the heat kernel that is of great use in differential geometry. [sent-39, score-1.585]
13 Our experimental results indicate that the multinomial information diffusion kernel performs very well empirically. [sent-42, score-0.617]
14 In Section 2 we review the relevant concepts that are required from Riemannian geometry, including the heat kernel for a general Riemannian manifold and its parametrix expansion. [sent-46, score-0.919]
15 Section 4 derives bounds on covering numbers and Rademacher averages for various learning algorithms that use the new kernels, borrowing results from differential geometry on bounds for the geometric Laplacian. [sent-48, score-0.433]
16 Section 5 describes the results of applying the multinomial diffusion kernels to text classification, and we conclude with a discussion of our results in Section 6. [sent-49, score-0.687]
17 The Heat Kernel. In this section we review the basic properties of the heat kernel on a Riemannian manifold, together with its asymptotic expansion, the parametrix. [sent-51, score-0.507]
18 The heat kernel and its parametrix expansion contain a wealth of geometric information, and indeed much of modern differential geometry, notably index theory, is based upon the use of the heat kernel and its generalizations. [sent-52, score-1.299]
19 The fundamental nature of the heat kernel makes it a natural candidate to consider for statistical learning applications. [sent-53, score-0.507]
20 1 The Heat Kernel. The Laplacian is used to model how heat will diffuse throughout a geometric manifold; the flow is governed by the following second order differential equation with initial conditions: ∂f/∂t − ∆f = 0, f(x, 0) = f0(x). [sent-57, score-0.532]
21 The value f (x,t) describes the heat at location x at time t, beginning from an initial distribution of heat given by f0 (x) at time zero. [sent-58, score-0.866]
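To make the initial value problem above concrete, here is a small illustrative sketch (not code from the paper; the grid, time step, and initial condition are arbitrary choices) that integrates the one-dimensional heat equation ∂f/∂t = ∆f on a periodic grid with an explicit finite-difference step.

# Hypothetical illustration: explicit finite differences for the 1-D heat equation
# df/dt = d^2f/dx^2 on a periodic grid; the heat spreads out toward a uniform value.
import numpy as np

n = 200
dx = 2 * np.pi / n                      # periodic grid on [0, 2*pi)
dt = 0.4 * dx ** 2                      # explicit Euler is stable for dt <= dx^2 / 2
x = np.arange(n) * dx
f = np.where((x > 2.0) & (x < 4.0), 1.0, 0.0)   # initial heat distribution f0(x)

for _ in range(5000):
    laplacian = (np.roll(f, -1) - 2 * f + np.roll(f, 1)) / dx ** 2
    f = f + dt * laplacian              # one explicit Euler step of the heat equation

print(round(f.mean(), 4), round(f.max() - f.min(), 4))   # mean is conserved, spread shrinks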
22 The heat or diffusion kernel Kt (x, y) is the solution to the heat equation f (x,t) with initial condition given by Dirac’s delta function δy . [sent-59, score-1.327]
23 As a consequence of the linearity of the heat equation, the heat kernel can be used to generate the solution to the heat equation with arbitrary initial conditions, according to f(x, t) = ∫_M Kt(x, y) f0(y) dy. [sent-60, score-1.409]
24 As a simple special case, consider heat flow on the circle, or one-dimensional sphere M = S1, with the metric inherited from the Euclidean metric on R2. [sent-61, score-0.639]
25 As the time parameter t gets large, the solution converges to f(θ, t) −→ a0, which is the average value of f; thus, the heat diffuses until the manifold is at a uniform temperature. [sent-63, score-0.69]
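The same behavior can be seen spectrally; the sketch below (an illustration only, with an arbitrary initial function and truncation order) expands f0 in a Fourier series on the circle and damps each mode by e^(−j²t), so that f(θ, t) approaches the mean value a0 as t grows.

# Hypothetical sketch: heat flow on the circle S^1 in the Fourier basis.
# Mode j decays as exp(-j^2 t); only the mean a0 survives as t -> infinity.
import numpy as np

theta = np.linspace(0.0, 2 * np.pi, 512, endpoint=False)
f0 = np.exp(np.cos(theta))                         # arbitrary smooth initial condition

def heat_flow_circle(f0, t, n_modes=50):
    m = len(f0)
    f_t = np.full_like(f0, f0.mean())              # the a0 term
    for j in range(1, n_modes + 1):
        aj = 2.0 / m * np.sum(f0 * np.cos(j * theta))
        bj = 2.0 / m * np.sum(f0 * np.sin(j * theta))
        f_t += np.exp(-j ** 2 * t) * (aj * np.cos(j * theta) + bj * np.sin(j * theta))
    return f_t

for t in (0.01, 0.1, 1.0, 10.0):
    print(t, np.max(np.abs(heat_flow_circle(f0, t) - f0.mean())))   # tends to 0 as t grows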
26 This simple example shows several properties of the general solution of the heat equation on a (compact) Riemannian manifold; in particular, note that the eigenvalues of the kernel scale as λ_j ∼ e^(−j^(2/d)), where the dimension in this case is d = 1. [sent-65, score-0.538]
27 When M = R, the heat kernel is the familiar Gaussian kernel, so that the solution to the heat equation is expressed as f(x, t) = (1/√(4πt)) ∫_R e^(−(x−y)²/(4t)) f0(y) dy, and it is seen that as t −→ ∞, the heat diffuses out “to infinity” so that f(x, t) −→ 0. [sent-66, score-1.397]
28 The following theorem summarizes the basic properties for the kernel of the heat equation on M; we refer to Schoen and Yau (1994) for a proof. [sent-70, score-0.507]
29 Then there exists a function K ∈ C∞(R+ × M × M), called the heat kernel, which satisfies the following properties for all x, y ∈ M, with Kt(·, ·) = K(t, ·, ·). [sent-72, score-0.433]
30 Properties 2 and 3 imply that Kt(x, y) solves the heat equation in x, starting from a point heat source at y. [sent-78, score-0.866]
31 It follows that e^(t∆) f0(x) = f(x, t) = ∫_M Kt(x, y) f0(y) dy solves the heat equation with initial conditions f(x, 0) = f0(x), since ∂f(x, t)/∂t = ∫_M ∂Kt(x, y)/∂t f0(y) dy = ∫_M ∆Kt(x, y) f0(y) dy = ∆ ∫_M Kt(x, y) f0(y) dy = ∆f(x, t), and lim_{t→0} f(x, t) = ∫_M lim_{t→0} Kt(x, y) f0(y) dy = f0(x). [sent-79, score-0.613]
32 Property 4 implies that e^(t∆) e^(s∆) = e^((t+s)∆), which has the physically intuitive interpretation that heat diffusion for time t is the composition of heat diffusion up to time s with heat diffusion for an additional time t − s. [sent-80, score-2.46]
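The semigroup identity can be checked numerically in the Euclidean case; in the sketch below (an illustration only, with arbitrary grid and time values) the convolution of two Gaussian heat kernels on R recovers the kernel at the summed time.

# Hypothetical numerical check of the semigroup property on M = R:
# integrating K_t(x, z) K_s(z, y) over z approximates K_{t+s}(x, y).
import numpy as np

def K(t, x, y):
    return (4 * np.pi * t) ** -0.5 * np.exp(-(x - y) ** 2 / (4 * t))

t, s, x, y = 0.3, 0.5, -0.7, 1.2
z = np.linspace(-30.0, 30.0, 200001)
dz = z[1] - z[0]
composed = np.sum(K(t, x, z) * K(s, z, y)) * dz    # (K_t composed with K_s)(x, y)
print(composed, K(t + s, x, y))                    # the two values agree closely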
33 The heat kernel Kt(x, y) is a natural candidate for measuring the similarity between points x, y ∈ M, while respecting the geometry encoded in the metric g. [sent-85, score-0.788]
34 1 The Parametrix Expansion. For most geometries, there is no closed form solution for the heat kernel. [sent-90, score-0.433]
35 In fact, the existence of the heat kernel, as asserted in the above theorem, is most directly proven by first showing the existence of the parametrix expansion. [sent-92, score-0.588]
36 Recall that the heat kernel on flat n-dimensional Euclidean space is given by Kt^Euclid(x, y) = (4πt)^(−n/2) exp(−‖x − y‖²/(4t)), where ‖x − y‖² = ∑_{i=1}^{n} |xi − yi|² is the squared Euclidean distance between x and y. [sent-94, score-0.53]
37 The parametrix expansion approximates the heat kernel locally as a correction to this Euclidean heat kernel. [sent-95, score-1.126]
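Since this Euclidean heat kernel is, up to the choice t = σ²/2 and a constant factor, the Gaussian kernel used as a baseline later in the paper, a small sketch may help fix notation; it is an illustration rather than the authors' code, and the random points are placeholders.

# Hypothetical sketch: Gram matrix of the flat Euclidean heat kernel
# K_t(x, y) = (4 pi t)^(-n/2) exp(-||x - y||^2 / (4 t)).
import numpy as np

def euclidean_heat_kernel(X, t):
    n = X.shape[1]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-sq_dists / (4 * t))

X = np.random.default_rng(0).normal(size=(5, 3))    # five placeholder points in R^3
G = euclidean_heat_kernel(X, t=0.5)
print(np.allclose(G, G.T), np.all(np.linalg.eigvalsh(G) > 0))   # symmetric and positive definite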
38 The idea is to obtain ψk recursively by solving the heat equation approximately to order t^m, for small diffusion time t. [sent-97, score-0.82]
39 As suggested by equation (2), Kt^(m) is an approximate solution to the heat equation, and satisfies Kt^(m)(x, y) = Kt(x, y) + O(t^m) for x and y sufficiently close; in particular, the parametrix is not unique. [sent-103, score-0.588]
40 This fact will be used when we employ the parametrix approximation to the heat kernel for statistical learning. [sent-109, score-0.662]
41 Diffusion Kernels on Statistical Manifolds. We now proceed to the main contribution of the paper, which is the application of the heat kernel constructions reviewed in the previous section to the geometry of statistical families, in order to obtain kernels for statistical learning. [sent-111, score-0.795]
42 The following two basic examples illustrate the geometry of the Fisher information metric and the associated diffusion kernel it induces on a statistical manifold. [sent-117, score-0.742]
43 The spherical normal family corresponds to a manifold of constant negative curvature, and the multinomial corresponds to a manifold of constant positive curvature. [sent-118, score-0.749]
44 By a statistical manifold we mean simply a manifold of densities together with the metric induced by the Fisher information matrix, rather than the more general notion of a Riemannian manifold together with a (possibly nonmetric) connection, as defined by Lauritzen (1987). [sent-131, score-0.862]
45 Figure 1: Example decision boundaries for a kernel-based classifier using information diffusion kernels for spherical normal geometry with d = 2 (right), which has constant negative curvature, compared with the standard Gaussian kernel for flat Euclidean space (left). [sent-168, score-0.789]
46 The curved decision boundary for the diffusion kernel can be interpreted statistically by noting that as the variance decreases the mean is known with increasing certainty. [sent-170, score-0.554]
47 The heat kernel on the hyperbolic space Hn has the following explicit form (Grigor’yan and Noguchi, 1998). [sent-171, score-0.559]
48 In Figure 1 the diffusion kernel for hyperbolic space H2 is compared with the Euclidean space Gaussian kernel. [sent-181, score-0.513]
49 The curved decision boundary for the diffusion kernel makes intuitive sense, since as the variance decreases the mean is known with increasing certainty. [sent-182, score-0.554]
50 In this case, the densities on the boundary become singular, as point masses at the mean; the boundary is simply given by ∂M ≅ R^(n−1), which is a manifold without boundary, as required. [sent-184, score-0.443]
51 Figure 3: Example decision boundaries using support vector machines with information diffusion kernels for trinomial geometry on the 2-simplex (top right) compared with the standard Gaussian kernel (left). [sent-226, score-0.749]
52 While the spherical geometry has been derived as the information geometry for a finite multinomial, the same geometry can be used non-parametrically for an arbitrary subset of probability measures, leading to spherical geometry in a Hilbert space (Dawid, 1977). [sent-232, score-0.84]
53 1 The Multinomial Diffusion Kernel. Unlike the explicit expression for the Gaussian geometry discussed above, there is not an explicit form for the heat kernel on the sphere, nor on the positive orthant of the sphere. [sent-235, score-0.697]
54 We will therefore resort to the parametrix expansion to derive an approximate heat kernel for the multinomial. [sent-236, score-0.693]
55 Thus, the leading order parametrix for the multinomial diffusion kernel is Pt^(0)(θ, θ′) = (4πt)^(−n/2) exp(−d²(θ, θ′)/(4t)) (sin d(θ, θ′) / d(θ, θ′))^(−(n−1)/2). [sent-244, score-0.772]
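A short sketch of this leading order approximation follows; it assumes the geodesic distance on the multinomial under the Fisher metric, d(θ, θ′) = 2 arccos(∑_i √(θ_i θ′_i)), and is an illustration of the expression above rather than the authors' implementation (the example points and the small-distance guard are arbitrary).

# Hypothetical sketch: leading order parametrix P_t^(0) for the multinomial,
# with geodesic distance d(theta, theta') = 2 * arccos(sum_i sqrt(theta_i * theta'_i)).
import numpy as np

def geodesic_distance(theta, theta_p):
    cos_half = np.clip(np.sum(np.sqrt(theta * theta_p)), 0.0, 1.0)
    return 2.0 * np.arccos(cos_half)

def parametrix_p0(theta, theta_p, t, n):
    d = geodesic_distance(theta, theta_p)
    ratio = np.sin(d) / d if d > 1e-12 else 1.0      # sin(d)/d -> 1 as d -> 0
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-d ** 2 / (4 * t)) * ratio ** (-(n - 1) / 2)

theta = np.array([0.2, 0.3, 0.5])        # placeholder points on the 2-simplex (n = 2)
theta_p = np.array([0.1, 0.6, 0.3])
print(parametrix_p0(theta, theta_p, t=0.1, n=2))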
56 2 Rounding the Simplex. The case of multinomial geometry poses some technical complications for the analysis of diffusion kernels, due to the fact that the open simplex is not complete, and moreover, its closure is not a differentiable manifold with boundary. [sent-251, score-1.122]
57 However, it does not form a compact manifold with boundary since the boundary has edges and corners. [sent-255, score-0.475]
58 The set ∆δ forms a compact manifold with boundary, and its image under the isometry F : (Pn, g) → (Sn+, h) is a compact submanifold with boundary of the n-sphere. [sent-267, score-0.466]
59 Spectral Bounds on Covering Numbers and Rademacher Averages. We now turn to establishing bounds on the generalization performance of kernel machines that use information diffusion kernels. [sent-272, score-0.49]
60 The primary conclusion that is drawn from these analyses is that from the point of view of generalization error bounds, diffusion kernels behave essentially the same as the standard Gaussian kernel. [sent-277, score-0.485]
61 Using these results we can establish the following bounds on covering numbers for information diffusion kernels. [sent-296, score-0.459]
62 Then the covering numbers for the Dirichlet heat kernel Kt on M satisfy log N(ε, FR(x)) = O((V/t^(d/2)) (log(1/ε))^((d+2)/2)). (8) [sent-300, score-0.589]
63 Proof. By the lower bound in Theorem 3, the Dirichlet eigenvalues of the heat kernel Kt(x, y), which are given by λ_j = e^(−tµ_j), satisfy log λ_j ≤ −tc1(d) (j/V)^(2/d), so that log(1/(λ_1 · · · λ_j)) ≥ tc1 ∑_{i=1}^{j} (i/V)^(2/d) ≥ tc2 j (j/V)^(2/d). [sent-301, score-0.616]
64 Thus, for fixed t the covering numbers scale as log N(ε, F) = O((log(1/ε))^((d+2)/2)), and for fixed ε they scale as log N(ε, F) = O(t^(−d/2)) in the diffusion time t. [sent-309, score-0.547]
65 Then the Rademacher term ψ for the Dirichlet heat kernel Kt on M satisfies ψ(r) ≤ C √((r/t^(d/2)) (log(1/r))^(d/2)) for some constant C depending on the geometry of M. [sent-337, score-0.736]
66 Proof. We have that ψ²(r) = ∑_{j=1}^{∞} min{r, λ_j} = j_r* r + ∑_{j=j_r*}^{∞} e^(−tµ_j) ≤ j_r* r + ∑_{j=j_r*}^{∞} e^(−tc1 j^(2/d)) ≤ j_r* r + C e^(−tc1 (j_r*)^(2/d)) for some constant C, where the first inequality follows from the lower bound in Theorem 3. [sent-338, score-0.564]
67 While the bounds for diffusion kernels were derived for the case of positive curvature, which apply to the special case of the multinomial, similar bounds for general manifolds with curvature bounded below by a negative constant should also be attainable. [sent-344, score-0.636]
68 Multinomial Diffusion Kernels and Text Classification. In this section we present the application of multinomial diffusion kernels to the problem of text classification. [sent-346, score-0.687]
69 However for text, the use of multinomial geometry is natural and well motivated; our experimental results offer some insight into how useful this geometry may be for classification. [sent-348, score-0.536]
70 The multinomial diffusion kernel is given by Kt^Mult(θ, θ′) = (4πt)^(−n/2) exp(−(1/t) arccos²(√θ · √θ′)), as derived in Section 3. [sent-377, score-0.617]
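A small sketch of this kernel for documents mapped to the multinomial simplex by normalized term frequencies is given below; the toy count matrix and the value of t are placeholders, and the code illustrates the formula above rather than reproducing the authors' implementation.

# Hypothetical sketch: multinomial diffusion kernel on tf-normalized documents,
# K_t(theta, theta') = (4 pi t)^(-n/2) exp(-(1/t) arccos(sqrt(theta) . sqrt(theta'))^2).
import numpy as np

def multinomial_diffusion_gram(counts, t):
    theta = counts / counts.sum(axis=1, keepdims=True)       # map documents to the simplex
    sqrt_theta = np.sqrt(theta)
    cos = np.clip(sqrt_theta @ sqrt_theta.T, 0.0, 1.0)        # sqrt(theta) . sqrt(theta')
    n = counts.shape[1] - 1                                   # dimension of the simplex
    return (4 * np.pi * t) ** (-n / 2) * np.exp(-np.arccos(cos) ** 2 / t)

counts = np.array([[3, 0, 1, 2],          # toy term-frequency counts (placeholder data)
                   [0, 4, 1, 0],
                   [2, 1, 0, 3]])
print(multinomial_diffusion_gram(counts, t=0.05))

Such a Gram matrix can be passed to any SVM implementation that accepts precomputed kernels; the (4πt)^(−n/2) factor rescales the whole matrix by a constant, which for an SVM is equivalent to rescaling the regularization parameter.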
71 Figure 5: Experimental results on the WebKB corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel for the multinomial (solid). [sent-404, score-0.687]
72 Figure 6: Results on the WebKB corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). [sent-427, score-0.461]
73 Figure 7: Experimental results on the Reuters corpus, using SVMs for linear (dotted) and Gaussian (dash-dotted) kernels, compared with the diffusion kernel (solid). [sent-452, score-0.461]
74 Table 1: Experimental results on the WebKB corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-639, score-0.543]
75 As mentioned above, in the case of the diffusion kernel we ... [sent-655, score-0.461]
76 Table 2: Experimental results on the WebKB corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-766, score-0.543]
77 For both the Gaussian and diffusion kernels, we test scale parameters (√2 σ for the Gaussian kernel and 2t^(1/2) for the diffusion kernel) in the set {0. [sent-775, score-0.848]
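A minimal version of this kind of scale sweep is sketched below; it is a hypothetical illustration that uses a simple held-out split and toy synthetic counts rather than the corpora and cross-validation protocol of the paper, with the candidate grid, the fixed C value, and the data all being placeholder choices (scikit-learn's SVC with kernel="precomputed" accepts the Gram matrices directly).

# Hypothetical parameter sweep: choose the diffusion time t on a held-out split
# using a precomputed multinomial diffusion Gram matrix with an SVM.
import numpy as np
from sklearn.svm import SVC

def diffusion_gram(counts, t):
    theta = counts / counts.sum(axis=1, keepdims=True)
    cos = np.clip(np.sqrt(theta) @ np.sqrt(theta).T, 0.0, 1.0)
    return np.exp(-np.arccos(cos) ** 2 / t)          # constant (4 pi t)^(-n/2) omitted

rng = np.random.default_rng(0)
counts = np.vstack([rng.multinomial(50, [0.5, 0.3, 0.1, 0.1], size=20),   # toy class 0
                    rng.multinomial(50, [0.1, 0.1, 0.3, 0.5], size=20)])  # toy class 1
y = np.repeat([0, 1], 20)
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)    # simple even/odd split

for scale in (0.1, 0.3, 1.0, 3.0):                 # placeholder grid of 2 * t^(1/2) values
    t = (scale / 2.0) ** 2
    G = diffusion_gram(counts, t)
    clf = SVC(kernel="precomputed", C=10.0)        # C fixed to a placeholder value
    clf.fit(G[np.ix_(train, train)], y[train])
    err = np.mean(clf.predict(G[np.ix_(test, train)]) != y[test])
    print(scale, err)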
78 Table 3: Experimental results on the Reuters corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-939, score-0.622]
79 The last column compares the presented results with the published results of Zhang and Oles (2001), with a + indicating the diffusion kernel F1 measure is greater than the result published in Zhang and Oles (2001) for this task. [sent-945, score-0.461]
80 However, the multinomial diffusion kernel significantly outperforms the linear and Gaussian kernels for the tf representation, achieving a markedly lower error rate. [sent-948, score-0.921]
81 For the tfidf representation, the diffusion kernel consistently outperforms the other kernels for the WebKB data and usually outperforms the linear and Gaussian kernels for the Reuters data. [sent-949, score-0.657]
82 Table 4: Experimental results on the Reuters corpus, using SVMs for linear, Gaussian, and multinomial diffusion kernels. [sent-1074, score-0.622]
83 It is notable, however, that the multinomial information diffusion kernel achieves at least as high an accuracy without the use of any heuristic term weighting scheme. [sent-1081, score-0.617]
84 Discussion and Conclusion. This paper has introduced a family of kernels that is intimately based on the geometry of the Riemannian manifold associated with a statistical family through the Fisher information metric. [sent-1084, score-0.623]
85 The last column compares the presented results with the published results of Zhang and Oles (2001), with a + indicating the diffusion kernel F1 measure is greater than the result published in Zhang and Oles (2001) for this task. [sent-1149, score-0.461]
86 The main application of these ideas has been to develop the multinomial diffusion kernel. [sent-1152, score-0.543]
87 The primary degree of freedom in the use of information diffusion kernels lies in the specification of the mapping of data to model parameters. [sent-1157, score-0.485]
88 In contrast, information diffusion kernels are based on the full geometry of the statistical family, and yet are also invariant under reparameterization of the family. [sent-1166, score-0.675]
89 While information diffusion kernels are very general, they will be difficult to compute in many cases—explicit formulas such as equations (5–6) for hyperbolic space are rare. [sent-1169, score-0.537]
90 To approximate an information diffusion kernel it may be attractive to use the parametrices and geodesic distance between points, as we have done for the multinomial. [sent-1170, score-0.559]
91 (2004) for multimedia applications, which have the form K(θ, θ′) ∝ exp(−αD(θ, θ′)) ≈ exp(−2αd²(θ, θ′)), and so can be viewed in terms of the leading order approximation to the heat kernel. [sent-1173, score-0.433]
92 (2004) are suggestive that diffusion kernels may be attractive not only for multinomial geometry, but also for much more complex statistical families. [sent-1175, score-0.641]
93 The Geometric Laplacian. In this appendix we briefly review some of the elementary concepts from Riemannian geometry that are used in the construction of information diffusion kernels, since these concepts are not widely used in machine learning. [sent-1180, score-0.577]
94 Since IntM is open it is an n-dimensional manifold without boundary, and ∂M is an (n − 1)-dimensional manifold without boundary. [sent-1196, score-0.539]
95 If f : M → N is a diffeomorphism of the manifold M onto the manifold N, then f induces a push-forward mapping f∗ of the associated tangent spaces. [sent-1197, score-0.604]
96 A Riemannian manifold (M, g) is a differentiable manifold M with a family of smoothly varying positive-definite inner products g = g p on Tp M for each p ∈ M. [sent-1207, score-0.58]
97 For example, every manifold can be embedded in Rm for some m ≥ n (the Whitney embedding theorem), and the Euclidean metric induces a metric on the manifold under the embedding. [sent-1211, score-0.738]
98 This geodesic distance d turns the Riemannian manifold into a metric space, satisfying the usual properties of positivity, symmetry and the triangle inequality. [sent-1219, score-0.446]
99 An orientation of a manifold is a smooth choice of orientation for the tangent spaces, meaning that for local charts ϕi and ϕj, the differential D(ϕj ◦ ϕi^(−1))(x) : Rn −→ Rn is orientation preserving, so the sign of the determinant is constant. [sent-1234, score-0.493]
100 As we comment in the text, this relationship may be of use in approximating information diffusion kernels for complex models. [sent-1268, score-0.485]
wordName wordTfidf (topN-words)
[('heat', 0.433), ('diffusion', 0.387), ('manifold', 0.257), ('tf', 0.206), ('geometry', 0.19), ('riemannian', 0.186), ('kt', 0.179), ('multinomial', 0.156), ('iffusion', 0.155), ('parametrix', 0.155), ('afferty', 0.146), ('anifolds', 0.146), ('ebanon', 0.146), ('fisher', 0.129), ('tatistical', 0.123), ('df', 0.113), ('webkb', 0.112), ('kernels', 0.098), ('jr', 0.094), ('boundary', 0.093), ('metric', 0.091), ('ernels', 0.091), ('reuters', 0.081), ('jn', 0.081), ('simplex', 0.08), ('earn', 0.079), ('pn', 0.076), ('geodesic', 0.075), ('kernel', 0.074), ('rn', 0.074), ('tp', 0.072), ('laplacian', 0.065), ('euclidean', 0.065), ('manifolds', 0.064), ('tangent', 0.061), ('document', 0.059), ('rademacher', 0.055), ('differential', 0.054), ('hyperbolic', 0.052), ('submanifold', 0.052), ('acq', 0.05), ('grain', 0.05), ('gi', 0.05), ('text', 0.046), ('geometric', 0.045), ('averages', 0.043), ('guo', 0.043), ('charts', 0.043), ('moneyfx', 0.043), ('rounding', 0.043), ('covering', 0.043), ('ui', 0.043), ('student', 0.042), ('embedding', 0.042), ('documents', 0.041), ('spherical', 0.04), ('gaussian', 0.04), ('family', 0.039), ('log', 0.039), ('mercer', 0.038), ('corpus', 0.038), ('lafferty', 0.038), ('dy', 0.036), ('det', 0.036), ('grad', 0.034), ('schoen', 0.034), ('yau', 0.034), ('ei', 0.033), ('coordinates', 0.033), ('compact', 0.032), ('kass', 0.032), ('expansion', 0.031), ('eigenvalues', 0.031), ('representation', 0.03), ('crude', 0.029), ('curvature', 0.029), ('faculty', 0.029), ('dh', 0.029), ('bounds', 0.029), ('frequency', 0.029), ('oles', 0.029), ('diffeomorphism', 0.029), ('curves', 0.029), ('dirichlet', 0.028), ('differentiable', 0.027), ('dx', 0.026), ('vi', 0.026), ('geodesically', 0.026), ('milnor', 0.026), ('orientation', 0.026), ('vl', 0.026), ('mendelson', 0.026), ('families', 0.025), ('open', 0.025), ('familiar', 0.024), ('sphere', 0.024), ('embeddings', 0.023), ('velocity', 0.023), ('pt', 0.023), ('distance', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
2 0.20932806 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
3 0.068516739 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.050897971 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
5 0.048018098 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
Author: Lior Wolf, Amnon Shashua
Abstract: The problem of selecting a subset of relevant features in a potentially overwhelming quantity of data is classic and found in many branches of science. Examples in computer vision, text processing and more recently bio-informatics are abundant. In text classification tasks, for example, it is not uncommon to have 104 to 107 features of the size of the vocabulary containing word frequency counts, with the expectation that only a small fraction of them are relevant. Typical examples include the automatic sorting of URLs into a web directory and the detection of spam email. In this work we present a definition of “relevancy” based on spectral properties of the Laplacian of the features’ measurement matrix. The feature selection process is then based on a continuous ranking of the features defined by a least-squares optimization process. A remarkable property of the feature relevance function is that sparse solutions for the ranking values naturally emerge as a result of a “biased non-negativity” of a key matrix in the process. As a result, a simple leastsquares optimization process converges onto a sparse solution, i.e., a selection of a subset of features which form a local maximum over the relevance function. The feature selection algorithm can be embedded in both unsupervised and supervised inference problems and empirical evidence show that the feature selections typically achieve high accuracy even when only a small fraction of the features are relevant.
6 0.047140889 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
7 0.044638496 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
8 0.044634659 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
9 0.043860517 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
10 0.043055609 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
12 0.041970395 36 jmlr-2005-Gaussian Processes for Ordinal Regression
13 0.04018968 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
14 0.038316336 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
15 0.035520844 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.034014989 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
17 0.033344757 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
18 0.033206273 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
19 0.031509366 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
20 0.031025507 41 jmlr-2005-Kernel Methods for Measuring Independence
topicId topicWeight
[(0, 0.212), (1, 0.142), (2, -0.026), (3, 0.129), (4, 0.088), (5, 0.044), (6, -0.092), (7, 0.187), (8, 0.098), (9, 0.014), (10, -0.019), (11, -0.553), (12, 0.209), (13, -0.032), (14, 0.025), (15, 0.141), (16, 0.016), (17, 0.144), (18, 0.224), (19, -0.065), (20, -0.117), (21, 0.105), (22, 0.011), (23, -0.045), (24, -0.021), (25, 0.017), (26, -0.066), (27, 0.003), (28, 0.012), (29, -0.027), (30, 0.001), (31, 0.109), (32, 0.001), (33, 0.008), (34, -0.02), (35, -0.011), (36, -0.003), (37, 0.111), (38, -0.018), (39, 0.016), (40, -0.034), (41, -0.0), (42, -0.148), (43, -0.093), (44, 0.014), (45, 0.077), (46, -0.041), (47, 0.013), (48, -0.013), (49, 0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.9482556 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
2 0.78146356 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
3 0.19711927 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
4 0.19164854 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.18979008 27 jmlr-2005-Dimension Reduction in Text Classification with Support Vector Machines
Author: Hyunsoo Kim, Peg Howland, Haesun Park
Abstract: Support vector machines (SVMs) have been recognized as one of the most successful classification methods for many applications including text classification. Even though the learning ability and computational complexity of training in support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue to efficiently handle a large number of terms in practical applications of text classification. In this paper, we adopt novel dimension reduction methods to reduce the dimension of the document vectors dramatically. We also introduce decision functions for the centroid-based classification algorithm and support vector classifiers to handle the classification problem where a document may belong to multiple classes. Our substantial experimental results show that with several dimension reduction methods that are designed particularly for clustered data, higher efficiency for both training and testing can be achieved without sacrificing prediction accuracy of text classification even when the dimension of the input space is significantly reduced. Keywords: dimension reduction, support vector machines, text classification, linear discriminant analysis, centroids
7 0.15833889 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
8 0.15698659 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
9 0.1541201 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
10 0.15043917 41 jmlr-2005-Kernel Methods for Measuring Independence
11 0.14843841 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
12 0.14554563 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
13 0.14313582 36 jmlr-2005-Gaussian Processes for Ordinal Regression
14 0.13954403 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
15 0.13878824 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
16 0.13624036 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
17 0.12865725 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
19 0.11988391 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
20 0.11910103 28 jmlr-2005-Efficient Computation of Gapped Substring Kernels on Large Alphabets
topicId topicWeight
[(11, 0.02), (13, 0.497), (17, 0.024), (19, 0.022), (36, 0.039), (37, 0.027), (42, 0.024), (43, 0.032), (47, 0.017), (52, 0.066), (59, 0.024), (70, 0.026), (88, 0.077), (90, 0.015), (94, 0.018)]
simIndex simValue paperId paperTitle
1 0.87573308 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a flexible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
same-paper 2 0.84990519 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
Author: John Lafferty, Guy Lebanon
Abstract: A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the standard use of Gaussian or linear kernels, which have been the standard for text classification. Keywords: kernels, heat equation, diffusion, information geometry, text classification
3 0.46441725 9 jmlr-2005-Active Learning to Recognize Multiple Types of Plankton
Author: Tong Luo, Kurt Kramer, Dmitry B. Goldgof, Lawrence O. Hall, Scott Samson, Andrew Remsen, Thomas Hopkins
Abstract: This paper presents an active learning method which reduces the labeling effort of domain experts in multi-class classification problems. Active learning is applied in conjunction with support vector machines to recognize underwater zooplankton from higher-resolution, new generation SIPPER II images. Most previous work on active learning with support vector machines only deals with two class problems. In this paper, we propose an active learning approach “breaking ties” for multiclass support vector machines using the one-vs-one approach with a probability approximation. Experimental results indicate that our approach often requires significantly less labeled images to reach a given accuracy than the approach of labeling the least certain test example and random sampling. It can also be applied in batch mode resulting in an accuracy comparable to labeling one image at a time and retraining. Keywords: active learning, support vector machine, plankton recognition, probabilistic output, multi-class support vector machine
4 0.36800948 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
5 0.33212286 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
Author: Alexander T. Ihler, John W. Fisher III, Alan S. Willsky
Abstract: Belief propagation (BP) is an increasingly popular method of performing approximate inference on arbitrary graphical models. At times, even further approximations are required, whether due to quantization of the messages or model parameters, from other simplified message or model representations, or from stochastic approximation methods. The introduction of such errors into the BP message computations has the potential to affect the solution obtained adversely. We analyze the effect resulting from message approximation under two particular measures of error, and show bounds on the accumulation of errors in the system. This analysis leads to convergence conditions for traditional BP message passing, and both strict bounds and estimates of the resulting error in systems of approximate BP message passing. Keywords: belief propagation, sum-product, convergence, approximate inference, quantization
6 0.32998976 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.3293314 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
8 0.32881984 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.31700906 48 jmlr-2005-Learning the Kernel Function via Regularization
10 0.31198692 71 jmlr-2005-Variational Message Passing
11 0.30298036 20 jmlr-2005-Clustering with Bregman Divergences
12 0.30230287 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
13 0.30071512 39 jmlr-2005-Information Bottleneck for Gaussian Variables
14 0.29899457 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
15 0.29634994 18 jmlr-2005-Characterization of a Family of Algorithms for Generalized Discriminant Analysis on Undersampled Problems
16 0.29458955 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.29279512 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
18 0.2918092 32 jmlr-2005-Expectation Consistent Approximate Inference
19 0.29100624 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
20 0.28802919 23 jmlr-2005-Convergence Theorems for Generalized Alternating Minimization Procedures