nips nips2002 nips2002-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. [sent-5, score-0.362]
2 Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. [sent-6, score-1.56]
3 As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. [sent-7, score-0.292]
4 Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification. [sent-8, score-0.58]
5 1 Introduction The use of kernels is of increasing importance in machine learning. [sent-9, score-0.252]
6 Research in this area continues to progress rapidly, with most of the activity focused on the underlying learning algorithms rather than on the kernels themselves. [sent-11, score-0.276]
7 Kernel methods have largely been a tool for data represented as points in Euclidean space, with the collection of kernels employed limited to a few simple families such as polynomial or Gaussian RBF kernels. [sent-12, score-0.326]
8 However, recent work by Kondor and Lafferty [7], motivated by the need for kernel methods that can be applied to discrete data such as graphs, has proposed the use of diffusion kernels based on the tools of spectral graph theory. [sent-13, score-1.101]
9 One limitation of this approach is the difficulty of analyzing the associated learning algorithms in the discrete setting. [sent-14, score-0.083]
10 For example, there is no obvious way to bound covering numbers and generalization error for this class of diffusion kernels, since the natural function spaces are over discrete sets. [sent-15, score-0.833]
11 In this paper, we propose a related construction of kernels based on the heat equation. [sent-16, score-0.532]
12 The key idea in our approach is to begin with a statistical model of the data being analyzed, and to consider the heat equation on the Riemannian manifold defined by the Fisher information metric of the model. [sent-17, score-0.604]
13 The result is a family of kernels that naturally generalizes the familiar Gaussian kernel for Euclidean space, and that includes new kernels for discrete data by beginning with statistical families such as the multinomial. [sent-18, score-0.851]
14 Since the kernels are intimately based on the geometry of the Fisher information metric and the heat or diffusion equation on the associated Riemannian manifold, we refer to them as information diffusion kernels. [sent-19, score-2.061]
15 Unlike the diffusion kernels of [7], the kernels we investigate here are over continuous parameter spaces even in the case where the underlying data is discrete. [sent-20, score-1.073]
16 As a consequence, some of the machinery that has been developed for analyzing the generalization performance of kernel machines can be applied in our setting. [sent-21, score-0.255]
17 In particular, the spectral approach of Guo et al. [sent-22, score-0.042]
18 [3] is applicable to information diffusion kernels, and in applying this approach it is possible to draw on the considerable body of research in differential geometry that studies the eigenvalues of the geometric Laplacian. [sent-23, score-0.869]
19 Section 3 derives bounds on the covering numbers for support vector machines using the new kernels, adopting the approach of [3]. [sent-25, score-0.339]
20 Section 4 describes experiments on text classification, and Section 5 discusses the results of the paper. [sent-26, score-0.04]
21 Let $\mathcal{F} = \{\, p(\cdot \mid \theta) \,\}_{\theta \in \Theta \subset \mathbb{R}^n}$ be an $n$-dimensional statistical model on a set $\mathcal{X}$, and assume the mapping $\theta \mapsto p(\cdot \mid \theta)$ is an embedding. [sent-29, score-0.073]
22 In coordinates $\theta = (\theta^1, \ldots, \theta^n)$, the Fisher information matrix of $\mathcal{F}$ at $\theta$ is
$$g_{ij}(\theta) = E_\theta\!\left[\partial_i \log p(x \mid \theta)\, \partial_j \log p(x \mid \theta)\right] \quad (1)$$
or equivalently
$$g_{ij}(\theta) = 4 \int_{\mathcal{X}} \partial_i \sqrt{p(x \mid \theta)}\; \partial_j \sqrt{p(x \mid \theta)}\; dx \quad (2)$$
and defines a Riemannian metric on $\Theta$, giving $\mathcal{F}$ the structure of an $n$-dimensional Riemannian manifold. [sent-32, score-0.164]
23 One of the motivating properties of the Fisher information metric is that, unlike the Euclidean distance, it is invariant under reparameterization. [sent-33, score-0.143]
24 For detailed treatments of information geometry we refer to [1, 6]. [sent-34, score-0.224]
25 For many statistical models there is a natural way to associate to each data point a parameter vector in the statistical model. [sent-35, score-0.08]
26 For example, in the case of text, under the multinomial model a document is naturally associated with the relative frequencies of its word counts. [sent-36, score-0.24]
27 This amounts to the mapping $x \mapsto \hat{\theta}(x)$ which sends a document to its maximum likelihood model. [sent-37, score-0.071]
28 Given such a mapping, we propose to apply a kernel on parameter space, $K(\hat{\theta}(x), \hat{\theta}(x'))$. [sent-38, score-0.202]
29 Given a kernel on parameter space, we then average over the posteriors to obtain a kernel on data:
$$K(x, x') = \int\!\!\int K(\theta, \theta')\, p(\theta \mid x)\, p(\theta' \mid x')\; d\theta\, d\theta' \quad (3)$$
It remains to define the kernel on parameter space. [sent-41, score-0.606]
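Before specializing the parameter-space kernel, here is a minimal sketch of how the averaging in equation (3) might be carried out numerically, assuming posterior samples for each data point are available from some inference routine (the sampling step is outside the scope of the paper); `k_theta` is any kernel on parameter space. In practice the paper replaces the posterior by a point mass at the maximum likelihood estimate, in which case the double average collapses to a single evaluation $K(\hat{\theta}(x), \hat{\theta}(x'))$.

```python
import numpy as np

def data_kernel(theta_samples_x, theta_samples_xp, k_theta):
    """Monte Carlo approximation of equation (3):
    K(x, x') = E[k_theta(theta, theta')] with theta ~ p(theta | x) and
    theta' ~ p(theta' | x').  Both sample arrays have shape (num_samples, dim)."""
    total = 0.0
    for theta in theta_samples_x:
        for theta_p in theta_samples_xp:
            total += k_theta(theta, theta_p)
    return total / (len(theta_samples_x) * len(theta_samples_xp))
```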
30 There is a fundamental choice: the kernel associated with heat diffusion on the parameter manifold under the Fisher information metric. [sent-42, score-1.199]
31 For a manifold $M$ with metric $g$, the Laplacian $\Delta$ is given in local coordinates by
$$\Delta f = \frac{1}{\sqrt{\det g}} \sum_{i,j} \partial_i\!\left(\sqrt{\det g}\; g^{ij}\, \partial_j f\right) \quad (4)$$
generalizing the classical operator $\Delta = \operatorname{div} \nabla$. [sent-43, score-0.342]
32 When $M$ is compact the Laplacian has discrete eigenvalues $\mu_0 \le \mu_1 \le \mu_2 \le \cdots$ with corresponding eigenfunctions $\phi_i$ satisfying $\Delta \phi_i = -\mu_i \phi_i$. [sent-44, score-0.139]
33 When the manifold has a boundary, appropriate boundary conditions must be imposed in order that $\Delta$ is self-adjoint. [sent-45, score-0.229]
34 Dirichlet boundary conditions set $f|_{\partial M} = 0$ and Neumann boundary conditions require $\left.\frac{\partial f}{\partial \nu}\right|_{\partial M} = 0$, where $\nu$ is the outer normal direction. [sent-46, score-0.254]
35 The following theorem summarizes the basic properties for the kernel of the heat equation on $M$. [sent-47, score-0.574]
36 Then the heat kernel $K_t(x, y)$ exists and satisfies (1) $K_t(x, y) = K_t(y, x)$, (2) $\lim_{t \to 0} K_t(x, y) = \delta_x(y)$, (3) $\left(\Delta - \frac{\partial}{\partial t}\right) K_t(x, y) = 0$, (4) $K_t(x, y) = \int_M K_{t-s}(x, z)\, K_s(z, y)\, dz$ for any $0 < s < t$, and (5) $K_t(x, y) > 0$. [sent-54, score-0.28]
37 Properties 2 and 3 imply that $f(x, t) = \int_M K_t(x, y)\, f_0(y)\, dy$ solves the heat equation in $x$ and $t$, starting from the initial condition $f_0$. [sent-59, score-0.31]
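On a compact manifold the heat kernel has the spectral expansion $K_t(x, y) = \sum_i e^{-\mu_i t}\, \phi_i(x)\, \phi_i(y)$, a standard fact recalled here for orientation rather than taken from the extracted text. The sketch below illustrates this spectral construction for a discretized Laplacian; the path-graph matrix `L` is only a stand-in for $-\Delta$ on a crude one-dimensional mesh.

```python
import numpy as np

def heat_kernel_from_laplacian(L, t):
    """Spectral construction of exp(-t L) for a symmetric positive
    semi-definite matrix L approximating -Delta on a finite mesh."""
    mu, phi = np.linalg.eigh(L)              # eigenvalues mu_i, eigenvectors phi[:, i]
    return (phi * np.exp(-t * mu)) @ phi.T   # sum_i exp(-t mu_i) phi_i phi_i^T

# Toy example: second-difference (path-graph) Laplacian on 50 grid points.
n = 50
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
K = heat_kernel_from_laplacian(L, t=0.5)     # symmetric, positive-definite Gram matrix
```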
38 Note that when using such a kernel for classification, the discriminant function $\sum_i \alpha_i y_i K_t(x_i, x)$ can be interpreted as the solution to the heat equation with initial temperature $\alpha_i y_i$ on labeled data point $x_i$, and initial temperature zero on unlabeled points. [sent-63, score-0.512]
39 The Multinomial. The multinomial is an important example of how information diffusion kernels can be applied naturally to discrete data; a parameter vector $\theta$, with $\theta_i \ge 0$ and $\sum_i \theta_i = 1$, is an element of the simplex. [sent-66, score-1.021]
40 The representation of the Fisher information metric given in equation (2) suggests the geometry underlying the multinomial. [sent-69, score-0.392]
41 In particular, the information metric is given by $g_\theta(u, v) = \sum_i u_i v_i / \theta_i$, so that under the transformation $\theta \mapsto 2\sqrt{\theta}$ the Fisher information corresponds to the inner product of tangent vectors to the sphere, and information geometry for the multinomial is the geometry of the positive orthant of the sphere. [sent-70, score-0.746]
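The identification with the sphere can be checked numerically: under the map $\theta \mapsto 2\sqrt{\theta}$, the Euclidean inner product of pushed-forward tangent vectors reproduces $g_\theta(u, v) = \sum_i u_i v_i / \theta_i$. The finite-difference sanity check below is purely illustrative and is not taken from the paper.

```python
import numpy as np

def fisher_inner(theta, u, v):
    """Fisher information inner product on the multinomial simplex."""
    return np.sum(u * v / theta)

def sphere_pushforward(theta, u, eps=1e-6):
    """Central-difference pushforward of tangent vector u under theta -> 2*sqrt(theta)."""
    return (2 * np.sqrt(theta + eps * u) - 2 * np.sqrt(theta - eps * u)) / (2 * eps)

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.full(5, 5.0))       # interior point of the simplex
u = rng.normal(size=5); u -= u.mean()        # tangent vectors sum to zero
v = rng.normal(size=5); v -= v.mean()
lhs = fisher_inner(theta, u, v)
rhs = sphere_pushforward(theta, u) @ sphere_pushforward(theta, v)
assert np.isclose(lhs, rhs, rtol=1e-3, atol=1e-4)   # agreement up to discretization error
```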
42 In general for the heat kernel on a Riemannian manifold, there is an asymptotic expansion in terms of the parametrices; see for example [9]. [sent-72, score-0.502]
43 This expands the kernel as
$$K_t(x, y) \approx (4\pi t)^{-n/2} \exp\!\left(-\frac{d^2(x, y)}{4t}\right) \sum_{k \ge 0} \psi_k(x, y)\, t^k \quad (6)$$
where $d(x, y)$ is the geodesic distance; using the first-order approximation and the explicit expression $d(\theta, \theta') = 2 \arccos\!\left(\sum_i \sqrt{\theta_i\, \theta'_i}\right)$ for the geodesic distance on the multinomial gives the kernel used in the experiments. [sent-73, score-0.344]
Figure 1: Example decision boundaries using support vector machines with information diffusion kernels for trinomial geometry on the 2-simplex (top right) and spherical normal geometry (bottom right), compared with the standard Gaussian kernel (left). [sent-109, score-1.395]
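A sketch of the approximation used in what follows: plug the multinomial geodesic distance into the leading Gaussian factor of the parametrix expansion (6), dropping the $\psi_k$ correction terms and the constant in front (which only rescales the kernel). This is an illustration of the construction, not the authors' exact implementation.

```python
import numpy as np

def multinomial_geodesic(theta, theta_p):
    """Geodesic distance between multinomial parameter vectors under the
    Fisher metric: d = 2 * arccos( sum_i sqrt(theta_i * theta'_i) )."""
    inner = np.clip(np.sum(np.sqrt(theta * theta_p)), -1.0, 1.0)
    return 2.0 * np.arccos(inner)

def information_diffusion_kernel(theta, theta_p, t):
    """First-order approximation exp(-d^2 / (4 t)), up to a constant factor."""
    d = multinomial_geodesic(theta, theta_p)
    return np.exp(-d ** 2 / (4.0 * t))

# Two documents' relative word frequencies over a four-word vocabulary.
p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.1, 0.1])
print(information_diffusion_kernel(p, q, t=0.1))
```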
45 Now consider the spherical normal family $p(x \mid \mu, \sigma) = \mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma$ is the scale of the variance. [sent-114, score-0.08]
46 Thus, the Fisher information metric gives the structure of the upper half plane in hyperbolic space. [sent-116, score-0.243]
47 The heat kernel on hyperbolic space $\mathbb{H}^n$ depends on the two points only through the geodesic distance $d$ between them, in analogy with the Gaussian kernel on Euclidean space. [sent-126, score-0.542]
48 Closed-form expressions for the heat kernel on hyperbolic space are given by equations (8) and (9). [sent-127, score-0.805]
49 If only the mean is unspecified, then the associated kernel is the standard Gaussian RBF kernel. [sent-128, score-0.224]
50 In Figure 1 the kernel for hyperbolic space is compared with the Euclidean space Gaussian kernel for the case of a 1-dimensional normal model with unknown mean and variance, corresponding to the hyperbolic plane $\mathbb{H}^2$. [sent-129, score-0.584]
51 Note that the curved decision boundary for the diffusion kernel makes intuitive sense, since as the variance decreases the mean is known with increasing certainty. [sent-130, score-0.832]
52 3 Spectral Bounds on Covering Numbers. In this section we prove bounds on the entropy and covering numbers for support vector machines that use information diffusion kernels; these bounds in turn yield bounds on the expected risk of the learning algorithms. [sent-131, score-1.119]
53 We adopt the approach of Guo et al. [3], and make use of bounds on the spectrum of the Laplacian on a Riemannian manifold, rather than VC dimension techniques. [sent-133, score-0.124]
54 Our calculations give an indication of how the underlying geometry influences the entropy numbers, which are inverse to the covering numbers. [sent-134, score-0.351]
55 Let $D$ be a compact subset of $d$-dimensional Euclidean space, and suppose that $K : D \times D \to \mathbb{R}$ is a Mercer kernel. [sent-136, score-0.041]
56 Given points $x_1, \ldots, x_m \in D$, the SVM hypothesis class for $K$ with weight vector bounded by $R$ is defined as the collection of functions [sent-146, score-0.049]
57 $$\mathcal{F}_R = \left\{\, x \mapsto \langle w, \Phi(x) \rangle \;:\; \|w\| \le R \,\right\} \quad (10)$$
where $\Phi$ is the mapping from $D$ to feature space defined by the Mercer kernel, and $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ denote the corresponding Hilbert space inner product and norm. [sent-147, score-0.085]
58 It is of interest to obtain uniform bounds on the covering numbers $\mathcal{N}(\epsilon, \mathcal{F}_R)$, defined as the size of the smallest $\epsilon$-cover of $\mathcal{F}_R$ in the metric induced by the sup norm over the sample points. [sent-148, score-0.421]
59 To apply this result, we will obtain bounds on the indices using spectral theory in Riemannian geometry. [sent-153, score-0.145]
60 The following bounds on the eigenvalues of the Laplacian are due to Li and Yau [8]. [sent-154, score-0.161]
61 Let $M$ be a compact Riemannian manifold of dimension $d$ with non-negative Ricci curvature, and assume that the boundary of $M$ is convex. [sent-156, score-0.247]
62 Let $0 < \mu_1 \le \mu_2 \le \cdots$ denote the eigenvalues of the Laplacian with Dirichlet boundary conditions. [sent-157, score-0.143]
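For orientation, the Li–Yau eigenvalue bounds referred to as Theorem 3 take the following standard form; this restatement comes from the general literature (the constants $c_1(d), c_2(d)$ depend only on the dimension) rather than being recovered from the extracted text.

```latex
% Li-Yau bounds for the Dirichlet eigenvalues of the Laplacian on a compact
% d-dimensional manifold M with non-negative Ricci curvature, convex boundary,
% and volume V = vol(M):
\[
  \frac{c_1(d)\, j^{2/d}}{V^{2/d}} \;\le\; \mu_j \;\le\; \frac{c_2(d)\, (j+1)^{2/d}}{V^{2/d}} ,
  \qquad j = 1, 2, \ldots
\]
```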
63 Note that the manifold of the multinomial model satisfies the conditions of this theorem. [sent-159, score-0.299]
64 Using these results we can establish the following bounds on covering numbers for information diffusion kernels. [sent-160, score-0.881]
65 We assume Dirichlet boundary conditions; a similar result can be proven for Neumann boundary conditions. [sent-161, score-0.17]
66 We include the constant $V = \operatorname{vol}(M)$ and the diffusion coefficient $t$ in order to indicate how the bounds depend on the geometry. [sent-162, score-0.648]
67 Let $M$ be a compact Riemannian manifold of dimension $d$, with volume $V$, satisfying the conditions of Theorem 3. [sent-164, score-0.094]
68 Then the covering numbers for the Dirichlet heat kernel $K_t$ on $M$ satisfy
$$\log \mathcal{N}(\epsilon, \mathcal{F}_R) = O\!\left(\frac{V}{t^{d/2}} \left(\log \frac{1}{\epsilon}\right)^{\frac{d+2}{2}}\right) \quad (12)$$
Proof. [sent-165, score-0.686]
69 By the lower bound in Theorem 3, the Dirichlet eigenvalues of the heat kernel $K_t$, which are given by $e^{-t \mu_j}$, satisfy $e^{-t \mu_j} \le \exp\!\left(-t\, c_1(d)\, j^{2/d} / V^{2/d}\right)$. [sent-166, score-0.562]
70 Plugging this bound into the entropy number estimates of [3] and optimizing over the index yields the stated inequality. [sent-170, score-0.069]
71 The second inequality comes from the upper bound of Theorem 3; we note that Theorem 4 of [3] can be used to show that this bound does not, in fact, depend on the sample size or on $R$. [sent-171, score-0.096]
72 Thus, for fixed $t$ the covering numbers scale as $\log \mathcal{N}(\epsilon) = O\!\left(\left(\log \frac{1}{\epsilon}\right)^{\frac{d+2}{2}}\right)$, and for fixed $\epsilon$ they scale as $O\!\left(t^{-d/2}\right)$ in the diffusion time. [sent-172, score-0.749]
73 4 Experiments. We compared the information diffusion kernel to linear and Gaussian kernels in the context of text classification using the WebKB dataset. [sent-173, score-1.068]
74 A “bag of words” representation was used for all three kernels, using only the word frequencies. [sent-175, score-0.035]
75 The information diffusion kernel is based on the multinomial model, which is the correct model under the (incorrect) assumption that the word occurrences are independent. [sent-177, score-0.931]
Figure 2: Experimental results on the WebKB corpus, using SVMs for linear (dot-dashed) and Gaussian (dotted) kernels, compared with the information diffusion kernel for the multinomial (solid). [sent-191, score-0.931]
78 Results for two classification tasks are shown, faculty vs. [sent-192, score-0.091]
80 The maximum likelihood mapping was used to map a document to a multinomial model, simply normalizing the counts to sum to one. [sent-197, score-0.247]
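A sketch of this mapping for a bag-of-words count vector; the optional smoothing constant is an addition of this sketch (the paper simply normalizes the raw counts) and is only there to keep the estimate in the interior of the simplex if that is needed.

```python
import numpy as np

def mle_multinomial(counts, smoothing=0.0):
    """Maximum likelihood multinomial parameters for a document:
    normalize the word counts to sum to one.  Set smoothing > 0 to stay
    strictly inside the simplex (not done in the paper)."""
    counts = np.asarray(counts, dtype=float) + smoothing
    return counts / counts.sum()

theta_hat = mle_multinomial([3, 0, 1, 6])   # -> array([0.3, 0. , 0.1, 0.6])
```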
81 Figure 2 shows test set error rates obtained using support vector machines for linear, Gaussian, and information diffusion kernels for two binary classification tasks: faculty vs. [sent-198, score-0.971]
82 The curves shown are the mean error rates over 20-fold cross validation and the error bars represent twice the standard deviation. [sent-201, score-0.044]
83 For the Gaussian and information diffusion kernels we tested values of the kernels' free parameter ($\sigma$ or $t$) in a fixed set of candidate values. [sent-202, score-0.826]
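One way to reproduce this setup with off-the-shelf tools is to precompute the diffusion-kernel Gram matrix and hand it to an SVM, sweeping the diffusion time $t$ over a small grid. scikit-learn and the grid values below are assumptions of this sketch (the paper does not specify an implementation), and `information_diffusion_kernel` / `mle_multinomial` refer to the earlier sketches.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def gram_matrix(thetas, t):
    """Square Gram matrix of the information diffusion kernel at diffusion time t."""
    return np.array([[information_diffusion_kernel(a, b, t) for b in thetas] for a in thetas])

def select_diffusion_time(count_vectors, labels, t_grid=(0.01, 0.05, 0.1, 0.5)):
    thetas = [mle_multinomial(c) for c in count_vectors]
    scores = {}
    for t in t_grid:
        K = gram_matrix(thetas, t)
        clf = SVC(kernel="precomputed", C=1.0)
        scores[t] = cross_val_score(clf, K, labels, cv=5).mean()
    return max(scores, key=scores.get)   # diffusion time with best cross-validated accuracy
```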
84 Our results are consistent with previous experiments on this dataset [5], which have observed that the linear and Gaussian kernels result in very similar performance. [sent-204, score-0.252]
85 However the information diffusion kernel significantly outperforms both of them, almost always obtaining lower error rate than the average error rate of the other kernels. [sent-205, score-0.82]
86 This result is striking because the kernels use identical representations of the documents, vectors of word counts (in contrast to, for example, string kernels). [sent-208, score-0.308]
87 We attribute this improvement to the fact that the information metric places more emphasis on points near the boundary of the simplex. [sent-209, score-0.31]
88 Yet statistical models offer many advantages, and thus it is attractive to explore methods that combine data models and purely discriminative methods for classification and regression. [sent-211, score-0.072]
89 Our approach brings a new perspective to combining parametric statistical modeling with non-parametric discriminative learning. [sent-212, score-0.072]
90 However, the kernels we investigate here differ significantly from the Fisher kernel proposed in [4]. [sent-214, score-0.454]
91 In particular, the latter is based on the Fisher score at a single point in parameter space, and in the case of an exponential family model it is given by a covariance of the sufficient statistics. [sent-215, score-0.05]
92 In contrast, information diffusion kernels are based on the full geometry of the statistical family, and yet are also invariant under reparameterization of the family. [sent-216, score-1.042]
93 Bounds on the covering numbers for information diffusion kernels were derived for the case of positive curvature, which apply to the special case of the multinomial. [sent-217, score-1.03]
94 Similar bounds for general manifolds with curvature bounded below by a negative constant should also be attainable. [sent-219, score-0.157]
95 While information diffusion kernels are very general, they may be difficult to compute in particular cases; explicit formulas such as equations (8–9) for hyperbolic space are rare. [sent-220, score-0.947]
96 To approximate an information diffusion kernel it may be attractive to use the parametrices and geodesic distance between points, as we have done for the multinomial. [sent-221, score-0.822]
97 In cases where the geodesic distance itself is difficult to compute exactly, a compromise may be to approximate the distance between nearby points in terms of the Kullback-Leibler divergence, using the relation $d(\theta, \theta') \approx \sqrt{2\, D(\theta \,\|\, \theta')}$. [sent-222, score-0.2]
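A sketch of this compromise, using the standard second-order relation $D(\theta \,\|\, \theta') \approx \tfrac{1}{2} d^2(\theta, \theta')$ for nearby points; symmetrizing the divergence, as done below, is a choice of this sketch rather than something the paper prescribes.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between multinomial parameter vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def approx_geodesic(p, q):
    """Approximate Fisher geodesic distance for nearby points:
    d^2 ~ 2 * KL, symmetrized here as d ~ sqrt(KL(p,q) + KL(q,p))."""
    return np.sqrt(kl(p, q) + kl(q, p))

p = np.array([0.50, 0.30, 0.20])
q = np.array([0.45, 0.33, 0.22])
exact = 2.0 * np.arccos(np.sum(np.sqrt(p * q)))
print(approx_geodesic(p, q), exact)   # the two values agree closely for nearby points
```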
98 The primary "degree of freedom" in the use of information diffusion kernels lies in the specification of the mapping of data to model parameters, $x \mapsto \theta(x)$. [sent-223, score-0.869]
99 For the multinomial, we have used the maximum likelihood mapping $x \mapsto \hat{\theta}(x)$, which is simple and well motivated. [sent-224, score-0.043]
100 Diffusion kernels on graphs and other discrete input spaces. [sent-272, score-0.292]
wordName wordTfidf (topN-words)
[('diffusion', 0.545), ('frph', 0.299), ('heat', 0.28), ('kernels', 0.252), ('riemannian', 0.237), ('kernel', 0.202), ('geometry', 0.195), ('multinomial', 0.155), ('sqph', 0.14), ('covering', 0.132), ('fisher', 0.123), ('manifold', 0.121), ('xwb', 0.12), ('metric', 0.114), ('bounds', 0.103), ('hyperbolic', 0.1), ('guo', 0.092), ('faculty', 0.091), ('laplacian', 0.091), ('boundary', 0.085), ('geodesic', 0.08), ('euclidean', 0.076), ('wb', 0.073), ('numbers', 0.072), ('dirichlet', 0.07), ('theorem', 0.062), ('webkb', 0.06), ('eigenvalues', 0.058), ('spherical', 0.056), ('curvature', 0.054), ('family', 0.05), ('cc', 0.048), ('lafferty', 0.048), ('parametrices', 0.046), ('sfrph', 0.046), ('trinomial', 0.046), ('rbf', 0.045), ('mapping', 0.043), ('spectral', 0.042), ('differential', 0.042), ('discriminative', 0.042), ('compact', 0.041), ('discrete', 0.04), ('gaussian', 0.04), ('neumann', 0.04), ('text', 0.04), ('normal', 0.038), ('mercer', 0.037), ('student', 0.036), ('lebanon', 0.036), ('kondor', 0.036), ('word', 0.035), ('classi', 0.035), ('hypertext', 0.034), ('bf', 0.032), ('machines', 0.032), ('distance', 0.031), ('places', 0.03), ('cq', 0.03), ('equation', 0.03), ('statistical', 0.03), ('volume', 0.03), ('det', 0.029), ('vb', 0.029), ('information', 0.029), ('operator', 0.029), ('document', 0.028), ('sa', 0.028), ('points', 0.027), ('inequality', 0.026), ('course', 0.025), ('families', 0.025), ('emphasis', 0.025), ('underlying', 0.024), ('pittsburgh', 0.023), ('conditions', 0.023), ('associated', 0.022), ('let', 0.022), ('collection', 0.022), ('error', 0.022), ('bound', 0.022), ('cation', 0.022), ('equivalently', 0.021), ('jaakkola', 0.021), ('counts', 0.021), ('icml', 0.021), ('analyzing', 0.021), ('spectrum', 0.021), ('space', 0.021), ('associate', 0.02), ('tools', 0.02), ('asymptotic', 0.02), ('xsa', 0.02), ('guy', 0.02), ('mation', 0.02), ('gef', 0.02), ('tc', 0.02), ('div', 0.02), ('joachims', 0.02), ('intimately', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
2 0.31200531 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
3 0.18234482 156 nips-2002-On the Complexity of Learning the Kernel Matrix
Author: Olivier Bousquet, Daniel Herrmann
Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.
4 0.16444935 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
Author: Jean-philippe Vert, Minoru Kanehisa
Abstract: We present an algorithm to extract features from high-dimensional gene expression profiles, based on the knowledge of a graph which links together genes known to participate to successive reactions in metabolic pathways. Motivated by the intuition that biologically relevant features are likely to exhibit smoothness with respect to the graph topology, the algorithm involves encoding the graph and the set of expression profiles into kernel functions, and performing a generalized form of canonical correlation analysis in the corresponding reproducible kernel Hilbert spaces. Function prediction experiments for the genes of the yeast S. Cerevisiae validate this approach by showing a consistent increase in performance when a state-of-the-art classifier uses the vector of features instead of the original expression profile to predict the functional class of a gene.
5 0.16099592 120 nips-2002-Kernel Design Using Boosting
Author: Koby Crammer, Joseph Keshet, Yoram Singer
Abstract: The focus of the paper is the problem of learning kernel operators from empirical data. We cast the kernel design problem as the construction of an accurate kernel from simple (and less accurate) base kernels. We use the boosting paradigm to perform the kernel construction process. To do so, we modify the booster so as to accommodate kernel operators. We also devise an efficient weak-learner for simple kernels that is based on generalized eigen vector decomposition. We demonstrate the effectiveness of our approach on synthetic data and on the USPS dataset. On the USPS dataset, the performance of the Perceptron algorithm with learned kernels is systematically better than a fixed RBF kernel.
6 0.15444878 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
7 0.14339896 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
8 0.13355307 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
9 0.13175656 106 nips-2002-Hyperkernels
10 0.11586753 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
11 0.1082198 119 nips-2002-Kernel Dependency Estimation
12 0.10626049 53 nips-2002-Clustering with the Fisher Score
13 0.10582793 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
14 0.0941993 124 nips-2002-Learning Graphical Models with Mercer Kernels
15 0.092974566 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
16 0.092407599 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
17 0.089562021 167 nips-2002-Rational Kernels
18 0.088985309 49 nips-2002-Charting a Manifold
19 0.082166538 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
20 0.078234367 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
topicId topicWeight
[(0, -0.218), (1, -0.165), (2, 0.144), (3, -0.152), (4, -0.217), (5, -0.053), (6, 0.14), (7, 0.092), (8, 0.012), (9, 0.058), (10, 0.0), (11, 0.047), (12, 0.057), (13, -0.025), (14, 0.093), (15, 0.059), (16, -0.052), (17, -0.009), (18, -0.052), (19, -0.055), (20, -0.044), (21, -0.011), (22, -0.07), (23, -0.0), (24, -0.015), (25, 0.078), (26, -0.033), (27, 0.098), (28, -0.08), (29, -0.02), (30, -0.022), (31, -0.131), (32, -0.018), (33, -0.064), (34, 0.091), (35, -0.118), (36, 0.065), (37, -0.035), (38, -0.003), (39, -0.003), (40, -0.027), (41, -0.056), (42, -0.087), (43, 0.103), (44, -0.147), (45, -0.078), (46, -0.076), (47, 0.097), (48, -0.108), (49, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.94537354 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
2 0.73837435 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
3 0.6385054 156 nips-2002-On the Complexity of Learning the Kernel Matrix
Author: Olivier Bousquet, Daniel Herrmann
Abstract: We investigate data based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow to vary the spectrum and keep the eigenvectors fix. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them to an alignment-based approach.
4 0.63250828 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be reconstructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline.
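The fixed-length n-gram (spectrum) kernel that the abstract relates to the Fisher-kernel view can be written in a few lines; the Markov-process parametrisation and the corpus-dependent weighting themselves are not reproduced here.

from collections import Counter

def ngram_counts(s, n=3):
    # Count the contiguous length-n substrings of s.
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(s, t, n=3):
    # Fixed-length n-gram kernel: inner product of the two count vectors.
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return sum(c * ct[g] for g, c in cs.items())

print(ngram_kernel("information diffusion", "diffusion kernels", n=3))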
5 0.60688806 106 nips-2002-Hyperkernels
Author: Cheng S. Ong, Robert C. Williamson, Alex J. Smola
Abstract: We consider the problem of choosing a kernel suitable for estimation using a Gaussian Process estimator or a Support Vector Machine. A novel solution is presented which involves defining a Reproducing Kernel Hilbert Space on the space of kernels itself. By utilizing an analog of the classical representer theorem, the problem of choosing a kernel from a parameterized family of kernels (e.g. of varying width) is reduced to a statistical estimation problem akin to the problem of minimizing a regularized risk functional. Various classical settings for model or kernel selection are special cases of our framework.
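As a naive stand-in for the kernel-selection problem this abstract describes, the sketch below picks a convex combination of two Gaussian kernels of different widths by minimizing a regularized least-squares objective. It only illustrates kernel choice as regularized estimation; it does not implement the hyperkernel RKHS or the representer-theorem machinery, and all names and parameter values are assumptions.

import numpy as np

def gaussian_kernels(X, widths):
    # Gram matrices for a small family of Gaussian kernels of varying width.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return [np.exp(-sq / (2.0 * w ** 2)) for w in widths]

def select_kernel_weight(Ks, y, lam=0.1, grid=11):
    # Grid-search a convex weight a over two candidate kernels; for each
    # candidate fit kernel ridge regression and keep the combination with
    # the lowest regularized least-squares objective.
    best_a, best_obj, n = None, np.inf, len(y)
    for a in np.linspace(0.0, 1.0, grid):
        K = a * Ks[0] + (1.0 - a) * Ks[1]
        alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
        obj = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
        if obj < best_obj:
            best_a, best_obj = a, obj
    return best_a

X = np.random.default_rng(1).normal(size=(30, 2))
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(2).normal(size=30)
print(select_kernel_weight(gaussian_kernels(X, widths=[0.5, 2.0]), y))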
6 0.58233041 120 nips-2002-Kernel Design Using Boosting
7 0.57739878 167 nips-2002-Rational Kernels
8 0.55560476 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA
9 0.51528281 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
10 0.49439889 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
11 0.46083298 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
12 0.45033166 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
13 0.44469219 119 nips-2002-Kernel Dependency Estimation
14 0.44149932 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
15 0.42833203 124 nips-2002-Learning Graphical Models with Mercer Kernels
16 0.37164152 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
17 0.36817801 53 nips-2002-Clustering with the Fisher Score
18 0.36562765 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction
19 0.36544791 49 nips-2002-Charting a Manifold
20 0.36127681 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
topicId topicWeight
[(11, 0.015), (23, 0.016), (42, 0.078), (54, 0.218), (55, 0.021), (57, 0.015), (68, 0.03), (69, 0.249), (74, 0.106), (87, 0.015), (92, 0.062), (98, 0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.84140617 113 nips-2002-Information Diffusion Kernels
Author: Guy Lebanon, John D. Lafferty
Abstract: A new family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. Based on the heat equation on the Riemannian manifold defined by the Fisher information metric, information diffusion kernels generalize the Gaussian kernel of Euclidean space, and provide a natural way of combining generative statistical modeling with non-parametric discriminative learning. As a special case, the kernels give a new approach to applying kernel-based learning algorithms to discrete data. Bounds on covering numbers for the new kernels are proved using spectral theory in differential geometry, and experimental results are presented for text classification.
2 0.73297411 46 nips-2002-Boosting Density Estimation
Author: Saharon Rosset, Eran Segal
Abstract: Several authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.
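A minimal sketch of the gradient-boosting view of density estimation: each round fits a weak density (here a single Gaussian) to points weighted by the reciprocal of the current model's density, which is the functional gradient of the log-likelihood, and mixes it in with a line-searched weight. The weak learner, the weighting, and the stopping rule are illustrative assumptions rather than the authors' Bayesian-network experiments.

import numpy as np

def fit_weighted_gaussian(x, w):
    # Weak learner: one Gaussian fit to weighted 1-D data.
    w = w / w.sum()
    mu = np.sum(w * x)
    var = np.sum(w * (x - mu) ** 2) + 1e-6
    return lambda z: np.exp(-(z - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def boost_density(x, rounds=5):
    model = fit_weighted_gaussian(x, np.ones_like(x))
    for _ in range(rounds):
        dens = model(x)
        # Emphasize poorly covered points (gradient of the log-likelihood).
        q = fit_weighted_gaussian(x, 1.0 / (dens + 1e-12))
        best_a, best_ll = 0.0, np.sum(np.log(dens + 1e-12))
        for a in np.linspace(0.05, 0.95, 19):        # line search on mixing weight
            ll = np.sum(np.log((1 - a) * dens + a * q(x) + 1e-12))
            if ll > best_ll:
                best_a, best_ll = a, ll
        if best_a > 0.0:
            prev = model
            model = lambda z, prev=prev, q=q, a=best_a: (1 - a) * prev(z) + a * q(z)
    return model

data = np.concatenate([np.random.default_rng(0).normal(-2, 0.5, 200),
                       np.random.default_rng(1).normal(3, 1.0, 200)])
p_hat = boost_density(data)
print(p_hat(np.array([-2.0, 0.0, 3.0])))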
3 0.72032768 190 nips-2002-Stochastic Neighbor Embedding
Author: Geoffrey E. Hinton, Sam T. Roweis
Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.
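A compact sketch of the embedding described above: Gaussian neighborhood distributions P in the data space, matching distributions Q in the low-dimensional map, and gradient descent on the sum of Kullback-Leibler divergences. The shared variance, learning rate, and iteration count are simplifying assumptions (the paper uses per-point variances and additional optimization refinements).

import numpy as np

def sne_embed(X, dims=2, sigma2=1.0, lr=0.5, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-D / (2 * sigma2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)            # p_{j|i}
    Y = 1e-2 * rng.normal(size=(n, dims))
    for _ in range(iters):
        d = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        Q = np.exp(-d)
        np.fill_diagonal(Q, 0.0)
        Q /= Q.sum(axis=1, keepdims=True)        # q_{j|i}
        # grad_i = 2 * sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
        M = P - Q + P.T - Q.T
        grad = 2.0 * (np.diag(M.sum(axis=1)) - M) @ Y
        Y -= lr * grad
    return Y

X = np.vstack([np.random.default_rng(0).normal(0, 1, (20, 5)),
               np.random.default_rng(1).normal(4, 1, (20, 5))])
print(sne_embed(X)[:3])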
4 0.71612453 119 nips-2002-Kernel Dependency Estimation
Author: Jason Weston, Olivier Chapelle, Vladimir Vapnik, André Elisseeff, Bernhard Schölkopf
Abstract: We consider the learning problem of finding a dependency between a general class of objects and another, possibly different, general class of objects. The objects can be, for example, vectors, images, strings, trees, or graphs. Such a task is made possible by employing similarity measures in both input and output spaces using kernel functions, thus embedding the objects into vector spaces. We experimentally validate our approach on several tasks: mapping strings to strings, pattern recognition, and reconstruction from partial images.
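A rough two-stage sketch in the spirit of this abstract: kernel PCA on the output Gram matrix, kernel ridge regression from the inputs to the resulting output coordinates, and prediction by returning the training output whose coordinates are closest to the predicted ones. The RBF kernels, the candidate-set pre-image step, and all parameter values are simplifying assumptions, not the paper's exact procedure.

import numpy as np

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def kde_fit(X, Y, n_components=3, lam=1e-2, gamma=0.5):
    n = len(X)
    L = rbf(Y, Y, gamma)                                  # output-space Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n                   # centering
    Lc = H @ L @ H
    vals, vecs = np.linalg.eigh(Lc)
    idx = np.argsort(vals)[::-1][:n_components]           # leading components
    V = vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12))
    Z = Lc @ V                                            # output principal coordinates
    K = rbf(X, X, gamma)
    A = np.linalg.solve(K + lam * n * np.eye(n), Z)       # ridge fit, one column per coordinate
    return A, V, Z, gamma

def kde_predict(x_new, X, Y, model):
    # Pre-image by candidate search over the training outputs.
    A, V, Z, gamma = model
    z_pred = rbf(x_new[None, :], X, gamma) @ A
    return Y[np.argmin(((Z - z_pred) ** 2).sum(axis=1))]

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = X @ rng.normal(size=(4, 3)) + 0.05 * rng.normal(size=(60, 3))
model = kde_fit(X, Y)
print(kde_predict(X[0], X, Y, model), Y[0])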
5 0.71484143 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
Author: Kenji Fukumizu, Shotaro Akaho, Shun-ichi Amari
Abstract: We show the existence of critical points as lines for the likelihood function of mixture-type models. They are given by embedding of a critical point for models with fewer components. A sufficient condition under which the critical line gives local maxima or saddle points is also derived. Based on this fact, a component-split method is proposed for a mixture of Gaussian components, and its effectiveness is verified through experiments.
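A small sketch of the component-split idea in practice: replace one Gaussian component by two children displaced along its leading principal axis, with correspondingly shrunk covariance, and continue EM from there. The displacement scale and the covariance adjustment are heuristic choices here; the paper's criterion based on critical lines is not implemented.

import numpy as np

def split_component(mu, cov, scale=0.5):
    # Displace two children along the parent's leading principal axis and
    # remove the corresponding variance so the pair roughly covers the parent.
    vals, vecs = np.linalg.eigh(cov)
    direction = vecs[:, -1] * np.sqrt(vals[-1])
    child_cov = cov - (scale ** 2) * np.outer(direction, direction)
    return (mu + scale * direction, child_cov), (mu - scale * direction, child_cov)

mu = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
print(split_component(mu, cov))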
6 0.71410477 53 nips-2002-Clustering with the Fisher Score
7 0.71355486 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement
8 0.70969248 124 nips-2002-Learning Graphical Models with Mercer Kernels
9 0.7091608 106 nips-2002-Hyperkernels
10 0.70804161 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
11 0.70762533 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
12 0.7075609 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
13 0.70742393 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
14 0.70718056 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
15 0.70424449 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
16 0.70384842 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling
17 0.70378959 27 nips-2002-An Impossibility Theorem for Clustering
18 0.70353973 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
19 0.70339614 156 nips-2002-On the Complexity of Learning the Kernel Matrix
20 0.7021389 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation