nips nips2004 nips2004-168 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Semigroup Kernels on Finite Sets Marco Cuturi Computational Biology Group Ecole des Mines de Paris 35 rue Saint Honor´ e 77300 Fontainebleau marco. [sent-1, score-0.117]
2 fr Jean-Philippe Vert Computational Biology Group Ecole des Mines de Paris 35 rue Saint Honor´ e 77300 Fontainebleau jean-philippe. [sent-3, score-0.117]
3 fr Abstract Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. [sent-5, score-0.269]
4 ) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. [sent-8, score-0.351]
5 We prove a general integral representation of such kernels and present two particular examples. [sent-9, score-0.176]
6 One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. [sent-10, score-0.633]
7 We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. [sent-11, score-0.105]
8 g non-vectorial) objects from a set Z on which we wish to apply existing kernel methods [1] to perform tasks such as classification or regression. [sent-13, score-0.385]
9 Assume furthermore that the latter objects can be meaningfully described by small components contained in a set X . [sent-14, score-0.174]
10 Namely, we suppose that we can define an a priori mapping τ which maps any z ∈ Z into a finite unordered list of elements of X , τ (z) = [x1 , x2 , . [sent-15, score-0.121]
11 , xn ], through a sampling process which may be exhaustive, heuristic or random both in the quantity of sampled components n and in the way those components are extracted. [sent-18, score-0.133]
12 Comparing two such complex objects through the direct comparison of their respective lists of components has attracted much attention recently, namely through the definition of p. [sent-19, score-0.426]
13 Most recent approaches to compare two τ -lists involve the estimation of two distributions pz and pz′ on X within a parametric class of models that fit (e. [sent-22, score-0.167]
14 Such a kernel is then defined between pz and pz′ , as seen for example in [2] with the Information Diffusion kernel, in [3] with the family of Mutual Information Kernels or in [4] with the use of the Battacharyya affinity between pz and pz′ . [sent-25, score-0.71]
15 An alternative and non-parametric approach to τ -lists comparison that studies the subspaces generated by points of τ (z) and τ (z ′ ) in a feature space was also proposed in [5], recalling elements presented in Kernel-Canonical Correlation Analysis [6]. [sent-26, score-0.173]
16 We explore in this contribution a different direction to kernel design for lists by studying the class of kernels whose value computed on two lists is only defined through its value on their concatenation. [sent-27, score-0.81]
17 This approach was already used in [7], where a particular kernel for strings that only compares two strings through their concatenation is presented. [sent-28, score-0.428]
18 in terms of the distribution of the components they describe, then their concatenation will be more “concentrated” than if they are very different, in which case it might look more like a reunion of two disjoint sets of points. [sent-31, score-0.186]
19 As a result, one can expect to get a relevant measure of similarity, and hence a kernel, by studying properties of the concatenation of two lists such as its concentration. [sent-32, score-0.323]
20 After an example of a valid kernel for lists seen as measures on the space of components (Section 2), we provide a complete characterization for this class of kernels (Section 3) by casting them in the context of semigroup kernels. [sent-33, score-1.208]
21 This leads to the definition of a second kernel based on exponential densities on X , which boils down after a numerical approximation to the computation of the entropy of the maximum likelihood density (taken in the considered exponential family) of the points numbered by lists of components. [sent-34, score-0.945]
22 This kernel is extended in Section 4 to points taken in a reproducing kernel Hilbert space defined by a kernel κ on X , and is then tested on a problem of image classification, where images are seen as bags of pixels and a non-linear kernel between pixels is used (Section 5). [sent-35, score-1.809]
23 2 The entropy kernel As a warm-up, let us assume that the set X is measurable, e. [sent-36, score-0.479]
24 the Borel uniform measure), with finite entropy def h(µ) = − X µ ln µ. [sent-40, score-0.341]
25 kernel k between unordered lists τ and τ ′ that only depends on their concatenation τ (z) · τ (z ′ ) is equivalent to a p. [sent-48, score-0.678]
26 kernel between densities µ and µ′ that only depends on µ+µ′ . [sent-50, score-0.431]
27 kernel on the set P of probability densities of finite entropy of the form κ(µ, µ′ ) = ϕ(µ + µ′ ). [sent-53, score-0.58]
28 An example of such a kernel is provided in the following proposition. [sent-54, score-0.33]
29 ) kernel on a set X is a symmetric function g : X 2 → R that satisfies n n n i,j=1 ci cj g(xi , xj ) ≤ 0 for any n ∈ N, (x1 , . [sent-57, score-0.455]
30 The function g : µ, µ′ → h( µ+µ ) is negative definite on P, making 2 def µ+µ′ kh (µ, µ′ ) = e−th( 2 between two measures. [sent-77, score-0.199]
31 We call kh the entropy kernel The entropy kernel is already a satisfactory answer to our initial motivation to look at merger of points. [sent-81, score-1.12]
32 Observe that if µx is a probability density around x, then µτ can often be thought of as an estimate of the distribution of the points in τ , and (µτ + µτ ′ )/2 is an estimate of the distribution of the points enumerated in τ · τ ′ . [sent-82, score-0.21]
33 If the latter estimate has a small entropy we can guess that the points in τ and τ ′ are likely to have similar distributions which is exactly the similarity that is quantified by the entropy kernel. [sent-83, score-0.449]
34 on R+ as a semigroup endowed with addition [8, Example 6. [sent-87, score-0.48]
35 kernel k and any real-valued function g, we have trivially that ′ (y, y ′ ) → k(y, y ′ ) + g(y) + g(y ′ ) remains negative definite, hence h( f +f ) is n. [sent-95, score-0.33]
36 through 2 ′ 1 h( f +f ) = 2 h(f + f ′ ) + ln 2 (|f | + |f ′ |), yielding positive definiteness of kh . [sent-97, score-0.231]
37 kernels on finite Radon measures In order to generalize the example presented in the previous section, let us briefly recall the concept of p. [sent-100, score-0.256]
38 A nonempty set S is called an Abelian (autoinvolutive) semigroup if it is equipped with an Abelian associative composition ◦ admitting a neutral element in S. [sent-103, score-0.351]
39 negative definite) function on the semigroup (S, ◦) if (s, t) → ϕ(s ◦ t) is a p. [sent-105, score-0.351]
40 The entropy kernel defined in Proposition 1 is therefore a p. [sent-111, score-0.479]
41 kernel on the semigroup of measures with finite entropy endowed with usual addition. [sent-113, score-1.095]
42 This can be generalized by assuming that X is a Hausdorff space, which suffices to consider the set of finite Radon b b measures M+ (X ) [8]. [sent-114, score-0.136]
43 The reason to consider this semigroup is that there is a natural semigroup homomorphism between finite lists of n b points and elements of M+ (X ) given by τ = [x1 , . [sent-118, score-1.01]
44 We discussed in section 2 the case where µx has a density, but more general measures are allowed, such as µx = δx , the Dirac measure. [sent-122, score-0.136]
45 Observe that when we talk about lists, it should be understood that some objects might appear with some multiplicity which should be taken into account n (specially when X is finite), making us consider weighted measures µ = i=1 ci µxi in the general case. [sent-123, score-0.316]
46 if and only if it has an integral representation: e−µ[f ] dν(f ), ϕ(µ) = C + (X ) where ν is a uniquely determined positive radon measure on C + (X ), the space of nonnegative-valued continuous functions of RX endowed with the topology of pointwise convergence. [sent-129, score-0.453]
47 The general result of integral representation of bounded p. [sent-134, score-0.1]
48 It can be shown that bounded semicharacters b on M+ (X ) are exactly the functions of the form µ → exp(−µ[f ]) where f ∈ C + (X ) by using the characterization of semicharacters on (R+ , +) [8, Theorem 6. [sent-139, score-0.184]
49 8] and the fact b that atomic measures is a dense subset of M+ (X ) [8, Theorem 2. [sent-141, score-0.234]
50 As a constructive application to this general representation theorem, let us consider the case µx = δx and consider, as a subspace of C + (X ), the linear span of N non-constant, continuous, real-valued and linearly independent functions f1 , . [sent-144, score-0.115]
51 As we will see below, this is equivalent to considering a set of densities defined by an exponential model, N namely of the form pθ (x) = exp( j=1 θj fj (x)−ψ(θ)) where θ = (θ j )j=1. [sent-148, score-0.232]
52 N ∈ Θ ⊂ RN is variable and ψ is a real-valued function defined on Θ to ensure normalization of the densities pθ . [sent-150, score-0.101]
53 Considering a prior ω on the parameter space Θ is equivalent to defining a Radon measure taking positive values on the subset of C + (X ) spanned by functions f1 , . [sent-151, score-0.099]
54 kernel on the semigroup of measures, where d(p||q) = Kullback-Leibler divergence between p and q. [sent-158, score-0.681]
55 supp(q) p ln p is the q Although an exact calculation of the latter equation is feasible in certain cases (see [10, 7]), an approximation can be computed using Laplace’s approximation. [sent-159, score-0.159]
56 If for example the prior on the densities is taken to be Jeffrey’s prior [9, p. [sent-160, score-0.101]
57 This inconvenient scaling behavior leads in practice to bad SVM classification results due to diagonal dominance of the Gram matrices produced by such kernels (see [11] for instance). [sent-165, score-0.211]
58 β should hence be big enough in practical applications to ensure the consistency of Laplace’s approximation and thus positive definiteness, while small enough to avoid diagonal dominance. [sent-168, score-0.145]
59 We will now always suppose that our atomic measures are n normalized, meaning that their total weight i=1 ci always sums up to 1. [sent-169, score-0.359]
60 Let us now review a practical case when X is Rk , and that some kind of gaussianity among points makes sense. [sent-170, score-0.138]
61 The ML parameters of a measure n n µ are in that case : µ = i=1 ci xi and Σµ = i=1 ci (xi − µ)(xi − µ)⊤ . [sent-174, score-0.338]
62 Supposing ¯ ¯ ¯ k that the span of the n vectors xi covers R yields non-degenerated covariance matrices. [sent-175, score-0.224]
63 This ensures the existence of the entropy of the ML estimates through the formula [12]: h(pm,Σ ) = 1 ln ((2πe)n |Σ|). [sent-176, score-0.234]
64 The value of the normalized kernel in (2) is then: 2 ′ kβ (µ, µ ) = |Σµ ||Σµ′ | |Σµ′′ | 2β . [sent-177, score-0.33]
65 An approach designed to bypass this double restriction is presented in the next section, taking advantage of a prior knowledge on the components space through the use of a kernel κ. [sent-179, score-0.48]
66 4 A kernel defined through regularized covariance operators Endowing X (now also considered 2-separable) with a p. [sent-180, score-0.448]
67 kernel κ bounded on the diagonal, we make use in this section of its corresponding reproducing kernel Hilbert space (RKHS, see [13] for a complete survey). [sent-182, score-0.76]
68 Ξ is infinite dimensional in the general case, preventing any systematical use of exponential densities on that feature space. [sent-184, score-0.101]
69 We bypass this issue through a generalization of the previous section by still assuming some “gaussianity” among the elements numbered by atomic measures µ, µ′ and µ′′ which, once mapped in the feature space, are now functions. [sent-185, score-0.451]
70 In this section we use such finite samples to estimate, diagonalize and regularize three covariance operators Sµ , Sµ′ , Sµ′′ associated with each measure on Ξ, and compare them by measuring their respective dispersion in a similar way. [sent-187, score-0.278]
71 Let (ei )i∈N be a complete orthonormal base of Ξ (i. [sent-191, score-0.119]
72 Given a family of positive real i numbers (ti )i∈N , we note St,e the bilinear symmetric operator which maps ξ, ζ → ξ ∗ St,e ζ where St,e = i∈N ti ei e∗ . [sent-194, score-0.227]
73 t to µ) variance of the normalized dot-product def v ∗ ξ hv (ξ) = ||v|| here defined for any v of Ξ. [sent-197, score-0.168]
74 Such functions can be obtained through the following recursive maximizations: 1 vj = argmax varµ (hv (ξ)) = argmax ||vj ||2 v∈Ξ,v⊥{v1 ,. [sent-198, score-0.221]
75 i=1 As in the framework of Kernel PCA [1] (from which this calculus only differs by considering weighted points in the feature space) we have by the representer theorem [1] that all the ˜ solutions of these successive maximizations lie in span({ξi }i=1. [sent-205, score-0.25]
76 Our latter formulation is however ˜ ill-defined, since any αj is determined up to the addition of any element of ker Kµ . [sent-209, score-0.135]
77 We def ˜⊥ thus restrict our parameters α to lie in E = ker Kµ ⊂ Rn to consider functions of positive squared norm, having now: αj = ˜ ˜ α⊤ Kµ ∆c Kµ α ˜ α ⊤ Kµ α ˜ α∈E:∀k < j ≤ dim(E), and with corresponding positive eigenvalues d ˜ in decreasing order λ1 , . [sent-210, score-0.276]
78 Through vj = i=1 αj,i ξi and writing r = dim(E), ˜ this also yields an orthogonal basis (vj )i≤r of span{(ξi )i≤n }, which can be completed to span Ξ through a Gram-Schmidt orthonormalization process using the original basis (ei )i∈N . [sent-214, score-0.25]
79 The orthonormal base corresponding to Sµ is thus (vi )i∈N , where the r first vectors are the original eigenvectors obtained through the previous maximization. [sent-215, score-0.119]
80 This bilinear form is however degenerated on Ξ and facing the same problem encountered in [4, 6] we also propose to solve this issue through a regularization by adding a component η on every vector of the base, i. [sent-223, score-0.099]
81 ) with η > 0, to propose a regularization of Sµ as: r ∗ (λi + η)vi vi + Sλη ,v = i=1 ∗ η v i vi . [sent-231, score-0.225]
82 i>r The entropy of a covariance operator St,e not being defined, we bypass this issue by considering the entropy of its marginal distribution on its first d eigenfunctions, namely introducd ing the quantity |St,e |d = d ln(2πe) + 1 i=1 ln ti . [sent-232, score-0.731]
83 Let us sum up ideas now and consider 2 2 ′ three normalized measures µ, µ′ and µ′′ = µ+µ , which yield three different orthonormal 2 ′ ′′ bases vi , vi and vi of Ξ and three different families of weights λη = (λi≤r +η, η, . [sent-233, score-0.512]
84 Though the latter hint does not establish a valid theoretical proof of the positive definiteness of this kernel, we use this final formula for the following classification experiments. [sent-244, score-0.128]
85 5 Experiments Following the previous work of [4], we have conducted experiments on an extraction of 500 images (28 × 28 pixels) taken in the MNIST database of handwritten digits, with 50 images for each digit. [sent-245, score-0.169]
86 To each image z we randomly associate a set τ (z) of 25 to 30 pixels among black points (intensity superior to 191 on a 0 to 255 scale ) in the image, 1 where X is {1, . [sent-246, score-0.168]
87 To define our RKHS Ξ we used both the linear kernel, κa ((x1 , y1 ), (x2 , y2 )) = (x1 x2 + y1 y2 )/272 and the 2 2 1 Gaussian kernel of width σ, namely κb ((x1 , y1 ), (x2 , y2 )) = e− (x1 −x2 ) 2+(y2 −y2 ) . [sent-252, score-0.47]
88 The 27 ·2σ linear case boils down to the simple application presented in the end of section 3 where we fit Gaussian bivariate-laws on our three measures and define similarity through variance analysis. [sent-253, score-0.183]
89 The resulting diagonal variances (Σ1,1 , Σ2,2 ),(Σ′ , Σ′ ) and (Σ′′ , Σ′′ ) mea1,1 2,2 1,1 2,2 sure the dispersion of our data for each of the three measures, yielding a kernel value of √ Σ1,1 Σ2,2 Σ′ Σ′ 1,1 2,2 equal to 0. [sent-254, score-0.473]
90 The linear kernel manΣ′′ Σ′′ 1,1 2,2 ages a good discrimination between clearly defined digits such as 1 and 0 but fails at doing so when considering numbers whose pixels’ distribution cannot be properly characterized by ellipsoid-like shapes. [sent-256, score-0.43]
91 Using instead the Gaussian kernel brings forward a non-linear perspective to the previous approach since it maps now all pixels into Gaussian bells, providing thus a much richer function class for Ξ. [sent-257, score-0.421]
92 184 1 Figure 1: First Eigenfunction of three empirical measures µ1 , µ0 and µ1 +µ0 using the linear 2 (a) and the Gaussian (b, with η = 0. [sent-267, score-0.136]
93 require explicit tuning: σ (the width of κ) controls the range of the typical eigenvalues found in the spectrum of our regularized operators whereas η acts as a scaling parameter for the latter values as can be seen in equation (3). [sent-271, score-0.229]
94 For each kernel computed on the base of a (σ, η) couple, we used a balanced training fold of our dataset to train 10 binary SVM classifiers, namely one for each digit versus all other 9 digits. [sent-283, score-0.534]
95 The class of the remaining images of the test fold was then predicted to be the one with highest SVM score among the the 10 previously trained binary SVMs. [sent-284, score-0.118]
96 Splitting our data into test and training sets was led through a 3-fold cross validation (roughly 332 training images and 168 for testing), averaging the test error on 5 random fold splits of the original data. [sent-285, score-0.161]
97 Those results were obtained using the spider toolbox1 and graphically displayed in figure (2). [sent-286, score-0.125]
98 To illustrate the sensibility of our method to the number of sampled points in τ we show in the same figure the decrease of this error when the number of sampled points ranges from 10 to 30 with independently chosen random points for each computation. [sent-292, score-0.231]
99 As in [4], we also compared our results to the standard RBF kernel on images seen as vectors of {0, 1}27·27 , using a fixed number of 30 sampled points and the formula ||z−z ′ || k(z, z ′ ) = e− 30·2σ2 . [sent-293, score-0.511]
100 12 fixed) depending now on the size of the sets of randomly selected black pixels for each image, this size varying between 10 and 30. [sent-321, score-0.134]
wordName wordTfidf (topN-words)
[('semigroup', 0.351), ('kernel', 0.33), ('lists', 0.18), ('pz', 0.167), ('entropy', 0.149), ('abelian', 0.14), ('measures', 0.136), ('vj', 0.135), ('endowed', 0.129), ('ci', 0.125), ('radon', 0.122), ('kernels', 0.12), ('span', 0.115), ('nite', 0.115), ('def', 0.107), ('bypass', 0.105), ('densities', 0.101), ('concatenation', 0.098), ('atomic', 0.098), ('kh', 0.092), ('diagonal', 0.091), ('pixels', 0.091), ('vi', 0.091), ('ln', 0.085), ('ml', 0.084), ('vert', 0.084), ('namely', 0.083), ('dim', 0.078), ('points', 0.077), ('latter', 0.074), ('ei', 0.071), ('bags', 0.07), ('borel', 0.07), ('cuturi', 0.07), ('hausdorff', 0.07), ('honor', 0.07), ('maximizations', 0.07), ('merger', 0.07), ('renormalization', 0.07), ('saint', 0.07), ('semicharacters', 0.07), ('semigroups', 0.07), ('spider', 0.07), ('unordered', 0.07), ('niteness', 0.07), ('mnist', 0.067), ('covariance', 0.066), ('respective', 0.063), ('laplace', 0.062), ('proposition', 0.062), ('base', 0.061), ('ker', 0.061), ('numbered', 0.061), ('gaussianity', 0.061), ('fontainebleau', 0.061), ('hv', 0.061), ('rue', 0.061), ('fold', 0.06), ('orthonormal', 0.058), ('images', 0.058), ('width', 0.057), ('reproducing', 0.056), ('rkhs', 0.056), ('enumerated', 0.056), ('bilinear', 0.056), ('eigenfunction', 0.056), ('des', 0.056), ('integral', 0.056), ('objects', 0.055), ('theorem', 0.055), ('displayed', 0.055), ('gram', 0.055), ('positive', 0.054), ('handwritten', 0.053), ('dispersion', 0.052), ('mines', 0.052), ('ecole', 0.052), ('operators', 0.052), ('digits', 0.052), ('elements', 0.051), ('considering', 0.048), ('boils', 0.047), ('pm', 0.047), ('laws', 0.047), ('paris', 0.047), ('pointwise', 0.047), ('ti', 0.046), ('seen', 0.046), ('measure', 0.045), ('components', 0.045), ('bach', 0.045), ('subspaces', 0.045), ('bases', 0.045), ('bounded', 0.044), ('xi', 0.043), ('argmax', 0.043), ('eigenfunctions', 0.043), ('sets', 0.043), ('xn', 0.043), ('regularization', 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000013 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
2 0.2046068 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
3 0.14584652 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1
4 0.13924693 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
Author: Laurent Zwald, Gilles Blanchard, Pascal Massart, Régis Vert
Abstract: This paper investigates the effect of Kernel Principal Component Analysis (KPCA) within the classification framework, essentially the regularization properties of this dimensionality reduction method. KPCA has been previously used as a pre-processing step before applying an SVM but we point out that this method is somewhat redundant from a regularization point of view and we propose a new algorithm called Kernel Projection Machine to avoid this redundancy, based on an analogy with the statistical framework of regression for a Gaussian white noise model. Preliminary experimental results show that this algorithm reaches the same performances as an SVM. 1
5 0.13683605 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
Author: Volker Roth
Abstract: The problem of detecting “atypical objects” or “outliers” is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classification to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify “atypical objects” by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is “rich enough” in the sense that it asymptotically provides an unbiased estimator for the true density. In order to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied. 1
6 0.13566111 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
7 0.13445888 92 nips-2004-Kernel Methods for Implicit Surface Modeling
8 0.13398644 42 nips-2004-Computing regularization paths for learning multiple kernels
9 0.12663789 177 nips-2004-Supervised Graph Inference
10 0.12122815 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
11 0.11869188 30 nips-2004-Binet-Cauchy Kernels
12 0.11316778 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
13 0.11151151 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
14 0.11144838 103 nips-2004-Limits of Spectral Clustering
15 0.1105189 94 nips-2004-Kernels for Multi--task Learning
16 0.1091456 148 nips-2004-Probabilistic Computation in Spiking Populations
17 0.10750879 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
18 0.10661503 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
19 0.10283428 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
20 0.10254646 69 nips-2004-Fast Rates to Bayes for Kernel Machines
topicId topicWeight
[(0, -0.314), (1, 0.117), (2, -0.096), (3, 0.122), (4, -0.109), (5, -0.067), (6, -0.107), (7, -0.187), (8, 0.041), (9, -0.058), (10, -0.109), (11, 0.0), (12, 0.08), (13, 0.005), (14, 0.042), (15, -0.012), (16, 0.082), (17, -0.024), (18, 0.163), (19, -0.039), (20, -0.066), (21, 0.006), (22, -0.021), (23, -0.017), (24, -0.103), (25, -0.089), (26, 0.103), (27, -0.005), (28, -0.083), (29, 0.049), (30, -0.068), (31, -0.065), (32, -0.031), (33, -0.06), (34, -0.023), (35, -0.042), (36, 0.031), (37, 0.112), (38, -0.045), (39, -0.009), (40, 0.046), (41, -0.032), (42, -0.013), (43, 0.079), (44, -0.025), (45, -0.006), (46, -0.002), (47, 0.048), (48, -0.069), (49, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.97301644 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
2 0.87964827 94 nips-2004-Kernels for Multi--task Learning
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1
3 0.81229389 30 nips-2004-Binet-Cauchy Kernels
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: We propose a family of kernels based on the Binet-Cauchy theorem and its extension to Fredholm operators. This includes as special cases all currently known kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. Many of these kernels can be seen as the extrema of a new continuum of kernel functions, which leads to numerous new special cases. As an application, we apply the new class of kernels to the problem of clustering of video sequences with encouraging results. 1
4 0.79998541 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty
Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1
5 0.70064253 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. Note that in theory one should use a grid over the whole of RD but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects. 2 To understand this scaling of σgrid consider the case where ρgrid > ρ which means that the effective variance at each of the ρgrid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there 2 are ρgrid /ρ independent Gaussian observations with variance σgrid at a single x-point; this would 2 be equivalent to one Gaussian observation with variance σ . In effect the ρ observations per unit x-space have been smoothed out uniformly. 1 0.16 0.35 0.35 Numerical Proposition 1 Proposition 2 0.3 0.25 0.14 0.2 0.2 0.15 0.12 0.15 0.1 0.1 0.05 0.1 0.05 0 0 −0.05 0.08 Numerical Proposition 1 Proposition 2 0.3 0.25 −0.05 −0.1 0 5 10 −0.1 0 15 5 10 15 0.06 0.04 0.02 0 −0.02 Numerical Proposition 1 Sample −0.04 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 104 (right). Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0.004 and σ 2 = 0.1, with ρgrid /ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 104 (a = 9.67, right). As expected, the addition of the 1/a2 -correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/ log(ρ). The is larger than the na¨ve estimate (1/a2 )2 ∼ 1/(log(ρ))4 based on the first correction term from Proposition ı 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down. 2.2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via σ 2 /ρ = S(sc ), so that ˜ h(s) = [1 + S(sc )/S(s)]−1 . The EK will then have the limiting form given in Proposi˜ tion 1 if h(s) approaches a step function Θ(sc − s), i.e. if it becomes infinitely “steep” around the point s = sc for sc → ∞. A quantitative criterion for this is that the slope ˜ |h (sc )| should become much larger than 1/sc , the inverse of the range of the step func˜ tion. Since h (s) = S (s)S(sc )S −2 (s)[1 + S(sc )/S(s)]−2 , this is equivalent to requiring that −sc S (sc )/4S(sc ) ∝ −d log S(sc )/d log sc must diverge for sc → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2 ; the asymptotic behaviour of sc is then governed by the smallest . A less obvious case is the “rational quadratic” k(r) = [1 + (r/l)2 ]−(D+1)/2 which has an exponentially decaying power spectrum S(s) ∝ exp(−2π s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ ).) Proposition 1 then applies, with the width of the EK now scaling as 1/sc ∝ 1/ log(ρ). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales , k(r) = exp(−r 2 /2 2 )p( ) d . This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, §2.10]. Such a kernel has power spectrum ∞ S(s) = (2π)D/2 D exp(−2π 2 2 s2 )p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2π 0 s), is obtained for p( ) ∝ −D−2 exp(− 2 /2 2 ). More generally, because the exponential 0 1/s D factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) ∼ 0 p( ) d for large s. This will decay more strongly than any power of 1/s for s → ∞ if p( ) itself decreases more strongly than any power of for → 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK (x) = h(x − x )y(x )dx in the continuum limit of “smoothed out” input locations. We assume as before 2 that y = target + noise, i.e. y(x) = η(x) + ν(x) where E[ν(x)ν(x )] = (σ∗ /ρ)δ(x − x ). 2 Here σ∗ denotes the true noise variance, as opposed to the noise variance assumed in the 2 EK; the scaling of σ∗ with ρ is explained in footnote 1. For a fixed target η, the MSE is = ( dx)−1 [η(x) − fEK (x)]2 dx. Averaging over the noise process ν and target function η gives in Fourier space 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 ds [1 + σ 2 /(ρS(s))]2 (7) where Sη (s) is the power spectrum of the prior over target functions. In the case S(s) = 2 Sη (s) and σ 2 = σ∗ where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (σ 2 /ρ) [1 + σ 2 /(ρS(s))]−1 ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] = σ2 2 ˜ ˜ Sη (s)[1 − h(s)]2 + (σ∗ /ρ)h2 (s) ds = ρ α=2 0.5 0.03 0.025 ε 0.02 0.015 0.01 α=4 0.1 0.005 0 −0.005 1 0.05 1 0.5 0.5 0 0 −0.5 −0.5 −1 25 −1 50 100 ρ 250 500 Figure 2: Left: plot of the asymptotic form of the EK (sc /r)J1 (2πsc r) for D = 2 and ρ = 1225. Right: log-log plot of against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc , which is justified for large ρ, one obtains for an SE target and learner ≈ σ 2 sc /ρ ∝ (log(ρ))D/2 /ρ. To get analogous predictions for the mismatched case, one can write equation 7 as = 2 σ∗ ρ [1 + σ 2 /(ρS(s))] − σ 2 /(ρS(s)) ds + [1 + σ 2 /(ρS(s))]2 Sη (s) ds. [S(s)ρ/σ 2 + 1]2 2 The first integral is smaller than (σ∗ /σ 2 ) B and can be neglected as long as B . In the second integral we can again make the cutoff approximation—though now with s having ∞ to be above sc – to get the scaling ∝ sc sD−1 Sη (s) ds. For target functions with a power-law decay Sη (s) ∝ s−α of the power spectrum at large s this predicts ∝ sD−α ∝ c (log(ρ))(D−α)/2 . So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ∼ (log(ρ)) −1/2 , and for the Matern-class covariance function k(r) = (1 + r/ ) exp(−r/ ) (which has power spectrum ∝ (3/ 2 + 4π 2 s2 )−2 , so α = 4) we get ∼ (log(ρ))−3/2 . These predictions were tested experimentally using a GP learner with SE covariance function ( = 0.1 and assumed noise level σ 2 = 0.1) against targets from the OU and Matern-class priors (with 2 = 0.05) and with noise level σ∗ = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of on log(ρ), in Figure 2(right) we make a log-log plot of against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ. 3.1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ−1 i h(x∗ − xi )yi at x∗ . Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting · denote averaging over noise, training input points and the test point and setting fη (x∗ ) = h(x, x∗ )η(x)dx, the average MSE of the EK predictor is pred = [η(x) − (1/ρ) i h(x, xi )yi ]2 = [η(x) − fη (x)]2 + = σ2 ρ 2 σ∗ ρ h2 (x, x )dx + 1 ρ h2 (x, x )η 2 (x )dx − 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 η2 ds + 2 /(ρS(s))]2 [1 + σ ρ 1 ρ 2 fη (x) ds [1 + σ 2 /(ρS(s))]2 Here we have set η 2 = ( dx)−1 η 2 (x) dx = Sη (s) ds for the spatial average of the 2 squared target amplitude. Taking the matched case, (Sη (s) = S(s) and σ∗ = σ 2 ) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ 2 sD /ρ, while the second one is ∼ η 2 sD /ρ. Thus c c both terms scale in the same way, but the ratio of the second term to the first is the signal2 2 to-noise ratio η /σ , which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x∗ ) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets. In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner’s covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves. Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views. References [1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984. [2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998. [3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990. [4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995. [5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1991. Third Edition. [6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996. [7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985. [8] B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, 2002. o [9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999. [10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999. [11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002. [12] P. Sollich. In NIPS 14, pages 519–526, 2002.
6 0.66275871 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
7 0.65758651 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
8 0.65191448 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
9 0.64686757 177 nips-2004-Supervised Graph Inference
10 0.59796691 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
11 0.59746611 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
12 0.58186692 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
13 0.58159077 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
14 0.53415489 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
15 0.53373438 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
16 0.51879174 68 nips-2004-Face Detection --- Efficient and Rank Deficient
17 0.51801324 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
18 0.50065875 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
19 0.50005114 92 nips-2004-Kernel Methods for Implicit Surface Modeling
20 0.49817598 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
topicId topicWeight
[(13, 0.084), (15, 0.174), (26, 0.056), (31, 0.035), (33, 0.145), (35, 0.026), (39, 0.031), (50, 0.07), (52, 0.288), (76, 0.017)]
simIndex simValue paperId paperTitle
1 0.91533452 157 nips-2004-Saliency-Driven Image Acuity Modulation on a Reconfigurable Array of Spiking Silicon Neurons
Author: R. J. Vogelstein, Udayan Mallik, Eugenio Culurciello, Gert Cauwenberghs, Ralph Etienne-Cummings
Abstract: We have constructed a system that uses an array of 9,600 spiking silicon neurons, a fast microcontroller, and digital memory, to implement a reconfigurable network of integrate-and-fire neurons. The system is designed for rapid prototyping of spiking neural networks that require high-throughput communication with external address-event hardware. Arbitrary network topologies can be implemented by selectively routing address-events to specific internal or external targets according to a memory-based projective field mapping. The utility and versatility of the system is demonstrated by configuring it as a three-stage network that accepts input from an address-event imager, detects salient regions of the image, and performs spatial acuity modulation around a high-resolution fovea that is centered on the location of highest salience. 1
2 0.89684606 193 nips-2004-Theories of Access Consciousness
Author: Michael D. Colagrosso, Michael C. Mozer
Abstract: Theories of access consciousness address how it is that some mental states but not others are available for evaluation, choice behavior, and verbal report. Farah, O’Reilly, and Vecera (1994) argue that quality of representation is critical; Dehaene, Sergent, and Changeux (2003) argue that the ability to communicate representations is critical. We present a probabilistic information transmission or PIT model that suggests both of these conditions are essential for access consciousness. Having successfully modeled data from the repetition priming literature in the past, we use the PIT model to account for data from two experiments on subliminal priming, showing that the model produces priming even in the absence of accessibility and reportability of internal states. The model provides a mechanistic basis for understanding the dissociation of priming and awareness. Philosophy has made many attempts to identify distinct aspects of consciousness. Perhaps the most famous effort is Block’s (1995) delineation of phenomenal and access consciousness. Phenomenal consciousness has to do with “what it is like” to experience chocolate or a pin prick. Access consciousness refers to internal states whose content is “(1) inferentially promiscuous, i.e., poised to be used as a premise in reasoning, (2) poised for control of action, and (3) poised for rational control of speech.” (p. 230) The scientific study of consciousness has exploded in the past six years, and an important catalyst for this explosion has been the decision to focus on the problem of access consciousness: how is it that some mental states but not others become available for evaluation, choice behavior, verbal report, and storage in working memory. Another reason for the recent explosion of consciousness research is the availability of functional imaging techniques to explore differences in brain activation between conscious and unconscious states, as well as the development of clever psychological experiments that show that a stimulus that is not consciously perceived can nonetheless influence cognition, which we describe shortly. 1 Subliminal Priming The phenomena we address utilize an experimental paradigm known as repetition priming. Priming refers to an improvement in efficiency in processing a stimulus item as a result of previous exposure to the item. Efficiency is defined in terms of shorter response times, lower error rates, or both. A typical long-term perceptual priming experiment consists of a study phase during which participants are asked to read aloud a list of words, and a test phase during which participants must name or categorize a series of words, presented one at a time. Reaction time is lower and/or accuracy is higher for test words that were also on the study list. Repetition priming occurs without strategic effort on the part of participants, and therefore appears to be a low level mechanism of learning, which likely serves as the mechanism underlying the refinement of cognitive skills with practice. In traditional studies, priming is supraliminal—the prime is consciously perceived. In the studies we model here, primes are subliminal. Subliminal priming addresses fundamental issues concerning conscious access: How is it that a word or image that cannot be identified, detected, or even discriminated in forced choice can nonetheless influence the processing of a subsequent stimulus word? Answering this question in a computational framework would be a significant advance toward understanding the nature of access consciousness. 2 Models of Conscious and Unconscious Processing In contrast to the wealth of experimental data, and the large number of speculative and philosophical papers on consciousness, concrete computational models are rare. The domain of consciousness is particularly ripe for theoretical perspectives, because it is a significant contribution to simply provide an existence proof of a mechanism that can explain specific experimental data. Ordinarily, a theorist faces skepticism when presenting a model; it often seems that hundreds of alternative, equally plausible accounts must exist. However, when addressing data deemed central to issues of consciousness, simply providing a concrete handle on the phenomena serves to demystify consciousness and bring it into the realm of scientific understanding. We are familiar with only three computational models that address specific experimental data in the domain of consciousness. We summarize these models, and then present a novel model and describe its relationship to the previous efforts. Farah, O’Reilly, and Vecera (1994) were the first to model specific phenomena pertaining to consciousness in a computational framework. The phenomena involve prosopagnosia, a deficit of overt face recognition following brain damage. Nonetheless, prosopagnosia patients exhibit residual covert recognition by a variety of tests. For example, when patients are asked to categorize names as famous or nonfamous, their response times are faster to a famous name when the name is primed by a picture of a semantically related face (e.g., the name “Bill Clinton” when preceded by a photograph of Hillary), despite the fact that they could not identify the related face. Farah et al. model face recognition in a neural network, and show that when the network is damaged, it loses the ability to perform tasks requiring high fidelity representations (e.g., identification) but not tasks requiring only coarse information (e.g., semantic priming). They argue that conscious perception is associated with a certain minimal quality of representation. Dehaene and Naccache (2001) outline a framework based on Baars’ (1989) notion of conscious states as residing in a global workspace. They describe the workspace as a “distributed neural system...with long-distance connectivity that can potentially interconnect multiple specialized brain areas in a coordinated, though variable manner.” (p. 13) Dehaene, Sergent, and Changeaux (2003) implement this framework in a complicated architecture of integrate-and-fire neurons and show that the model can qualitatively account for the attentional blink phenomenon. The attentional blink is observed in experiments where participants are shown a rapid series of stimuli, which includes two targets (T1 and T2). If T2 appears shortly after T1, the ability to report T2 drops, as if attention is distracted. Dehane et al. explain this phenomenon as follows. When T1 is presented, its activation propagates to frontal cortical areas (the global workspace). Feedback connections lead to a resonance between frontal and posterior areas, which strengthen T1 but block T2 from entering the workspace. If the T1-T2 lag is sufficiently great, habituation of T1 sufficiently weakens the representation such that T2 can enter the workspace and suppress T1. In this account, conscious access is achieved via resonance between posterior and frontal areas. Although the Farah et al. and Dehaene et al. models might not seem to have much in common, they both make claims concerning what is required to achieve functional connectivity between perceptual and response systems. Farah et al. focus on aspects of the representation; Dehaene et al. focus on a pathway through which representations can be communicated. These two aspects are not incompatible, and in fact, a third model incorporates both. Mathis and Mozer (1996) describe an architecture with processing modules for perceptual and response processes, implemented as attractor neural nets. They argue that in order for a representation in some perceptual module to be assured of influencing a response module, (a) it must have certain characteristics–temporal persistence and well-formedness– which is quite similar to Farah et al.’s notion of quality, and (b) the two modules must be interconnected—which is the purpose of Dehaene et al.’s global workspace. The model has two limitations that restrict its value as a contemporary account of conscious access. First, it addressed classical subliminal priming data, but more reliable data has recently been reported. Second, like the other two models, Mathis and Mozer used a complex neural network architecture with arbitrary assumptions built in, and the sensitivity of the model’s behavior to these assumptions is far from clear. In this paper, we present a model that embodies the same assumptions as Mathis and Mozer, but overcomes its two limitations, and explains subliminal-priming data that has yet to be interpreted via a computational model. 3 The Probabilistic Information Transmission (PIT) Framework Our model is based on the probabilistic information transmission or PIT framework of Mozer, Colagrosso, and Huber (2002, 2003). The framework characterizes the transmission of information from perceptual to response systems, and how the time course of information transmission changes with experience (i.e., priming). Mozer et al. used this framework to account for a variety of facilitation effects from supraliminal repetition priming. The framework describes cognition in terms of a collection of information-processing pathways, and supposes that any act of cognition involves coordination among multiple pathways. For example, to model a letter-naming task where a letter printed in upper or lower case is presented visually and the letter must be named, the framework would assume a perceptual pathway that maps the visual input to an identity representation, and a response pathway that maps a identity representation to a naming response. The framework is formalized as a probabilistic model: the pathway input and output are random variables and microinference in a pathway is carried out by Bayesian belief revision. The framework captures the time course of information processing for a single experimental trial. To elaborate, consider a pathway whose input at time t is a discrete random variable, denoted X(t), which can assume values x1 , x2 , x3 , . . . , xnx corresponding to alternative input states. Similarly, the output of the pathway at time t is a discrete random variable, denoted Y (t), which can assume values y1 , y2 , y3 , . . . , yny . For example, in the letter-naming task, the input to the perceptual pathway would be one of nx = 52 visual patterns corresponding to the upper- and lower-case letters of the alphabet, and the output is one of ny = 26 letter identities. To present a particular input alternative, say xi , to the model for T time steps, we specify X(t) = xi for t = 1 . . . T , and allow the model to compute P(Y (t) | X(1) . . . X(t)). A pathway is modeled as a dynamic Bayes network; the minimal version of the model used in the present simulations is simply a hidden Markov model, where the X(t) are observations and the Y (t) are inferred state (see Figure 1a). In typical usage, an HMM is presented with a sequence of distinct inputs, whereas we maintain the same input for many successive time steps; and an HMM transitions through a sequence of distinct hidden states, whereas we attempt to converge with increasing confidence on a single state. Figure 1b illustrates the time course of inference in a single pathway with 52 input and 26 output alternatives and two-to-one associations. The solid line in the Figure shows, as a function of time t, P(Y (t) = yi | X(1) = x2i . . . X(t) = x2i ), i.e., the probability that input i (say, the visual pattern of an upper case O) will produce its target output (the letter identity). Evidence for the target output accumulates gradually over time, yielding a speed-accuracy curve that relates the number of iterations to the accuracy of identification. Y0 Y1 X1 Y2 X2 P(Output) 1 Y3 X3 (a) (b) 0.8 0.6 O 0.4 0.2 0 Q Time Figure 1: (a) basic pathway architecture—a hidden Markov model; (b) time course of inference in a pathway when the letter O is presented, causing activation of both O and the visually similar Q. The exact shape of the speed-accuracy curve—the pathway dynamics—are determined by three probability distributions, which embody the knowledge and past experience of the model. First, P(Y (0)) is the prior distribution over outputs in the absence of any information about the input. Second, P(Y (t) | Y (t − 1)) characterizes how the pathway output evolves over time. We assume the transition probability matrix serves as a memory with diffusion, i.e., P(Y (t) = yi |Y (t − 1) = yj ) = (1 − β)δij + βP(Y (0) = yi ), where β is the diffusion constant and δij is the Kronecker delta. Third, P(X(t) | Y (t)) characterizes the strength of association between inputs and outputs. The greater the association strength, the more rapidly that information about X will be communicated to Y . We parameterize this distribution as P(X(t) = xi |Y (t) = yj ) ∼ 1 + k γik αkj , where αij indicates the frequency of experience with the association between states xi and yj , and γik specifies the similarity between states xi and xk . (Although the representation of states is localist, the γ terms allow us to design in the similarity structure inherent in a distributed representation.) These association strengths are highly constrained by the task structure and the similarity structure and familiarity of the inputs. Fundamental to the framework is the assumption that with each experience, a pathway becomes more efficient at processing an input. Efficiency is reflected by a shift in the speedaccuracy curve to the left. In Mozer, Colagrosso, and Huber (2002, 2003), we propose two distinct mechanisms to model phenomena of supraliminal priming. First, the association frequencies, αij , are increased following a trial in which xi leads to activation of yj , resulting in more efficient transmission of information, corresponding to an increased slope of the solid line in Figure 1b. The increase is Hebbian, based on the maximum activation achieved by xi and yj : ∆αij = η maxt P(X(t) = xi )P(Y (t) = yj ), where η is a step size. Second, the priors, which serve as a model of the environment, are increased to indicate a greater likelihood of the same output occurring again in the future. In modeling data from supraliminal priming, we found that the increases to association frequencies are long lasting, but the increases to the priors decay over the course of a few minutes or a few trials. As a result, the prior updating does not play into the simulation we report here; we refer the reader to Mozer, Colagrosso, and Huber (2003) for details. 4 Access Consciousness and PIT We have described the operation of a single pathway, but to model any cognitive task, we require a series of pathways in cascade. For a simple choice task, we use a percpetual pathway cascaded to a response pathway. The interconnection between the pathways is achieved by copying the output of the perceptual pathway, Y p (t), to the input of the response pathway, X r (t), at each time t. This multiple-pathway architecture allows us to characterize the notion of access consciousness. Considering the output of the perceptual pathway, access is achieved when: (1) the output representation is sufficient to trigger the correct behavior in the response pathway, and (2) the perceptual and response pathways are functionally interconnected. In more general terms, access for a perceptual pathway output requires that these two condi- tions be met not just for a specific response pathway, but for arbitrary response pathways (e.g., pathways for naming, choice, evaluation, working memory, etc.). In Mozer and Colagrosso (in preparation) we characterize the sufficiency requirements of condition 1; they involve a representation of low entropy that stays active for long enough that the representation can propagate to the next pathway. As we will show, a briefly presented stimulus fails to achieve a representation that supports choice and naming responses. Nonetheless, the stimulus evokes activity in the perceptual pathway. Because perceptual priming depends on the magnitude of the activation in the perceptual pathway, not on the activation being communicated to response pathways, the framework is consistent with the notion of priming occurring in the absence of awareness. 4.1 Simulation of Bar and Biederman (1998) Bar and Biederman (1998) presented a sequence of masked line drawings of objects and asked participants to name the objects, even if they had to guess. If the guess was incorrect, participants were required to choose the object name from a set of four alternatives. Unbeknownst to the participant, some of the drawings in the series were repeated, and Bar and Biederman were interested in whether participants would benefit from the first presentation even if it could not be identified. The repeated objects could be the same or a different exemplar of the object, and it could appear in either the same or a different display position. Participants were able to name 13.5% of drawings on presentation 1, but accuracy jumped to 34.5% on presentation 2. Accuracy did improve, though not as much, if the same shape was presented in a different position, but not if a different drawing of the same object was presented, suggesting a locus of priming early in the visual stream. The improvement in accuracy is not due to practice in general, because accuracy rose only 4.0% for novel control objects over the course of the experiment. The priming is firmly subliminal, because participants were not only unable to name objects on the first presentation, but their fouralternative forced choice (4AFC) performance was not much above chance (28.5%). To model these phenomena, we created a response pathway with fifty states representing names of objects that are used in the experiment, e.g., chair and lamp. We also created a perceptual pathway with states representing visual patterns that correspond to the names in the response pathway. Following the experimental design, every object identity was instantiated in two distinct shapes, and every shape could be in one of nine different visualfield positions, leading to 900 distinct states in the perceptual pathway to model the possible visual stimuli. The following parameters were fit to the data. If two perceptual states, xi and xk are the same shape in different positions, they are assigned a similarity coefficient γik = 0.95; all other similarity coefficients are zero. The association frequency, α, for valid associations in the perceptual pathway was 22, and the response pathway 18. Other parameters were β p = .05, β r = .01, and η = 1.0. The PIT model achieves a good fit to the human experimental data (Figure 2). Specifically, priming is greatest for the same shape in the same position, some priming occurs for the same shape in a different position, and no substantial priming occurs for the different shape. Figure 3a shows the time course of activation of a stimulus representation in the perceptual pathway when the stimulus is presented for 50 iterations, on both the first and third presentations. The third presentation was chosen instead of the second to make the effect of priming clearer. Even though a shape cannot be named on the first presentation, partial information about the shape may nonetheless be available for report. The 4AFC test of Bar and Biederman provides a more sensitive measure of residual stimulus information. In past work, we modeled forced-choice tasks using a response pathway with only the alternatives under consideration. However, in this experiment, forced-choice performance must be estimated conditional on incorrect naming. In PIT framework, we achieve this using naming and 40 40 First Block Second Block 30 25 20 15 10 5 First Block 35 Percent Correct Naming Percent Correct Naming 35 Second Block 30 25 20 15 10 5 0 0 Control Objects Prime SHAPE: Same Objects POSITION: Same Same Different Different Different Second Same Different Control Control Objects Prime SHAPE: Same Objects POSITION: Same Same Different Different Different Second Same Different Control Figure 2: (left panel) Data from Bar and Biederman (1998) (right panel) Simulation of PIT. White bar: accuracy on first presentation of a prime object. Black bars: the accuracy when the object is repeated, either with the same or different shape, and in the same or different position. Grey bars: accuracy for control objects at the beginning and the end of the experiment. forced-choice output pathways having output distributions N (t) and F (t), which are linked via the perceptual state, Y p (t). F (t) must be reestimated with the evidence that N (t) is not the target state. This inference problem is intractable. We therefore used a shortcut in which a single response pathway is used, augmented with a simple three-node belief net (Figure 3b) to capture the dependence between naming and forced choice. The belief net has a response pathway node Y r (t) connected to F (t) and N (t), with conditional distribution P (N (t) = ni |Y r (t) = yj ) = θδij + (1 − θ)/|Y r |, and an analogous distribution for P (F (t) = fi |Y r (t) = yj ). The free parameter θ determines how veridically naming and forced-choice actions reflect response-pathway output. Over a range of θ, θ < 1, the model obtains forced-choice performance near chance on the first presentation when the naming response is incorrect. For example, with θ = 0.72, the model produces a forced-choice accuracy on presentation 1 of 26.1%. (Interestingly, the model also produces below chance performance on presentation 2 if the object is not named correctly—23.5%—which is also found in the human data—20.0%.) Thus, by the stringent criterion of 4AFC, the model shows no access consciousness, and therefore illustrates a dissociation between priming and access consciousness. In our simulation, we followed the procedure of Bar and Biederman by including distractor alternatives with visual and semantic similarity to the target. These distractors are critical: with unrelated distractors, the model’s 4AFC performance is significantly above chance, illustrating that a perceptual representation can be adequate to support some responses but not others, as Farah et al. (1994) also argued. 4.2 Simulation of Abrams and Greenwald (2000) During an initial phase of the experiment, participants categorized 24 clearly visible target words as pleasant (e.g., HUMOR) or unpleasant (e.g., SMUT). They became quite familiar with the task by categorizing each word a total of eight times. In a second phase, participants were asked to classify the same targets and were given a response deadline to induce errors. The targets were preceded by masked primes that could not be identified. Of interest is the effective valence (or EV) of the target for different prime types, defined as the error rate difference between unpleasant and pleasant targets. A positive (negative) EV indicates that responses are biased toward a pleasant (unpleasant) interpretation by the prime. As one would expect, pleasant primes resulted in a positive EV, unpleasant primes in a negative EV. Of critical interest is the finding that a nonword prime formed by recombining two pleasant targets (e.g., HULIP from HUMOR and TULIP) or unpleasant targets (e.g., BIUT from BILE and SMUT ) also served to bias the targets. More surprising, a positive EV resulted from unpleasant prime words formed by recombining two pleasant targets (TUMOR from TULIP and HUMOR ), indicating that subliminal priming arises from word fragments, not words as unitary entities, and providing further evidence for an early locus of subliminal priming. Note that the results depend critically on the first phase of the experiment, which gave participants extensive practice on a relatively small set of words that were then used as and recombined to form primes. Words not studied in the first phase (orphans) provided Probability 0.6 0.5 0.4 0.3 0.2 0.1 0 object, first presentation object, third presentation different object N(t) F(t) Yr(t) 1 50 1000 (a) (b) Time (msec) Figure 3: (a) Activation of the perceptual representation in PIT as a function of processing iterations Effective Valence on the first (thin solid line) and third (thick solid line) presentations of target. (b) Bayes net for performing 4AFC conditional on incorrect naming response. 0.4 Experiment Model 0.3 0.2 0.1 0 targets hulip-type tumor-type orphans Figure 4: Effective valence of primes in the Abrams and Greenwald (2000) experiment for human subjects (black bars) and PIT model (grey bars). HULIP-type primes are almost as strong as target repetitions, and TUMOR-type primes have a positive valence, contrary to the meaning of the word. no significant EV effect when used as primes. In this simulation, we used a three pathway model: a perceptual pathway that maps visual patterns to orthography with 200 input states corresponding both to words, nonwords, and nonword recombinations of words; a semantic pathway that maps to 100 distinct lexical/semantic states; and a judgement pathway that maps to two responses, pleasant and unpleasant. In the perceptual pathway, similarity structure was based on letter overlap, so that HULIP was similar to both TULIP and HUMOR, with γ = 0.837. No similarity was assumed in the semantic state representation; consistent with the previous simulation, β p = .05, β s = .01, β j = .01, and η = .01. At the outset of the simulation, α frequencies for correct associations were 15, 19, and 25 in the perceptual, semantic, and judgement pathways. The initial phase of the experiment was simulated by repeated supraliminal presentation of words, which increased the association frequencies in all three pathways through the ∆αij learning rule. Long-term supraliminal priming is essential in establishing the association strengths, as we’ll explain. Short-term subliminal priming also plays a key role in the experiment. During the second phase of the experiment, residual activity from the prime—primarily in the judgement pathway—biases the response to the target. Residual activation of the prime is present even if the representation of the prime does not reach sufficient strength that it could be named or otherwise reported. The outcome of the simulation is consistent with the human data (Figure 4). When a HULIP -type prime is presented, HUMOR and TULIP become active in the semantic pathway because of their visual similarity to HULIP. Partial activation of these two practiced words pushes the judgement pathway toward a pleasant response, resulting in a positive EV. When a TUMOR-type prime is presented, three different words become active in the semantic pathway: HUMOR, TULIP, and TUMOR itself. Although TUMOR is more active, it was not one of the words studied during the initial phase of the experiment, and as a result, it has a relatively weak association to the unpleasant judgement, in contrast to the other two words which have strong associations to the pleasant judgement. Orphan primes have little effect because they were not studied during the initial phase of the experiment, and consequently their association to pleasant and unpleasant judgements is also weak. In summary, activation of the prime along a critical, well-practiced pathway may not be sufficient to support an overt naming response, yet it may be sufficient to bias the processing of the immediately following target. 5 Discussion An important contribution of this work has been to demonstrate that specific experimental results relating to access consciousness and subliminal priming can be interpreted in a concrete computational framework. By necessity, the PIT framework, which we previously used to model supraliminal priming data, predicts the existence of subliminal priming, because the mechanisms giving rise to priming depend on degree of activation of a representation, whereas the processes giving rise to access consciousness also depend on the temporal persistence of a representation. Another contribution of this work has been to argue that two previous computational models each tell only part of the story. Farah et al. argue that quality of representation is critical; Dehaene et al. argue that pathways to communicate representations is critical. The PIT framework argues that both of these features are necessary for access consciousness. Although the PIT framework is not completely developed, it nonetheless makes a clear prediction: that subliminal priming is can never be stronger than supraliminal priming, because the maximal activation of subliminal primes is never greater than that of supraliminal primes. One might argue that many theoretical frameworks might predict the same, but no other computational model is sufficiently well developed—in terms of addressing both priming and access consciousness—to make this prediction. In its current stage of development, a weakness of the PIT framework is that it is silent as to how perceptual and response pathways become flexibly interconnected based on task demands. However, the PIT framework is not alone in failing to address this critical issue: The Dehaene et al. model suggests that once a representation enters the global workspace, all response modules can access it, but the model does not specify how the appropriate perceptual module wins the competition to enter the global workspace, or how the appropriate response module is activated. Clearly, flexible cognitive control structures that perform these functions are intricately related to mechanisms of consciousness. Acknowledgments This research was supported by NIH/IFOPAL R01 MH61549–01A1. References Abrams, R. L., & Greenwald, A. G. (2000). Parts outweigh the whole (word) in unconscious analysis of meaning. Psychological Science, 11(2), 118–124. Baars, B. (1989). A cognitive theory of consciousness. Cambridge: Cambridge University Press. Bar, M., & Biederman, I. (1998). Subliminal visual priming. Psychological Science, 9(6), 464–468. Block, N. (1995). On a confusion about a function of consciousness. Brain and Behavioral Sciences, 18(2), 227–247. Dehaene, S., & Naccache, L. (2001). Towards a cognitive neuroscience of consciousness: basic evidence and a workspace framework. Cognition, 79, 1–37. Dehaene, S., Sergent, C., & Changeux, J.-P. (2003). A neuronal network model linking subjective reports and objective physiological data during conscious perception. Proceedings of the National Academy of Sciences, 100, 8520–8525. Farah, M. J., O’Reilly, R. C., & Vecera, S. P. (1994). Dissociated overt and covert recognition as an emergent property of a lesioned neural network. Psychological Review, 100, 571–588. Mathis, D. W., & Mozer, M. C. (1996). Conscious and unconscious perception: a computational theory. In G. Cottrell (Ed.), Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society (pp. 324–328). Hillsdale, NJ: Erlbaum & Associates. Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2002). A rational analysis of cognitive control in a speeded discrimination task. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press. Mozer, M. C., Colagrosso, M. D., & Huber, D. E. (2003). Mechanisms of long-term repetition priming and skill refinement: A probabilistic pathway model. In Proceedings of the TwentyFifth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum Associates.
same-paper 3 0.80436885 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
4 0.65009421 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.64758688 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. Note that in theory one should use a grid over the whole of RD but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects. 2 To understand this scaling of σgrid consider the case where ρgrid > ρ which means that the effective variance at each of the ρgrid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there 2 are ρgrid /ρ independent Gaussian observations with variance σgrid at a single x-point; this would 2 be equivalent to one Gaussian observation with variance σ . In effect the ρ observations per unit x-space have been smoothed out uniformly. 1 0.16 0.35 0.35 Numerical Proposition 1 Proposition 2 0.3 0.25 0.14 0.2 0.2 0.15 0.12 0.15 0.1 0.1 0.05 0.1 0.05 0 0 −0.05 0.08 Numerical Proposition 1 Proposition 2 0.3 0.25 −0.05 −0.1 0 5 10 −0.1 0 15 5 10 15 0.06 0.04 0.02 0 −0.02 Numerical Proposition 1 Sample −0.04 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 104 (right). Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0.004 and σ 2 = 0.1, with ρgrid /ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 104 (a = 9.67, right). As expected, the addition of the 1/a2 -correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/ log(ρ). The is larger than the na¨ve estimate (1/a2 )2 ∼ 1/(log(ρ))4 based on the first correction term from Proposition ı 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down. 2.2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via σ 2 /ρ = S(sc ), so that ˜ h(s) = [1 + S(sc )/S(s)]−1 . The EK will then have the limiting form given in Proposi˜ tion 1 if h(s) approaches a step function Θ(sc − s), i.e. if it becomes infinitely “steep” around the point s = sc for sc → ∞. A quantitative criterion for this is that the slope ˜ |h (sc )| should become much larger than 1/sc , the inverse of the range of the step func˜ tion. Since h (s) = S (s)S(sc )S −2 (s)[1 + S(sc )/S(s)]−2 , this is equivalent to requiring that −sc S (sc )/4S(sc ) ∝ −d log S(sc )/d log sc must diverge for sc → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2 ; the asymptotic behaviour of sc is then governed by the smallest . A less obvious case is the “rational quadratic” k(r) = [1 + (r/l)2 ]−(D+1)/2 which has an exponentially decaying power spectrum S(s) ∝ exp(−2π s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ ).) Proposition 1 then applies, with the width of the EK now scaling as 1/sc ∝ 1/ log(ρ). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales , k(r) = exp(−r 2 /2 2 )p( ) d . This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, §2.10]. Such a kernel has power spectrum ∞ S(s) = (2π)D/2 D exp(−2π 2 2 s2 )p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2π 0 s), is obtained for p( ) ∝ −D−2 exp(− 2 /2 2 ). More generally, because the exponential 0 1/s D factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) ∼ 0 p( ) d for large s. This will decay more strongly than any power of 1/s for s → ∞ if p( ) itself decreases more strongly than any power of for → 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK (x) = h(x − x )y(x )dx in the continuum limit of “smoothed out” input locations. We assume as before 2 that y = target + noise, i.e. y(x) = η(x) + ν(x) where E[ν(x)ν(x )] = (σ∗ /ρ)δ(x − x ). 2 Here σ∗ denotes the true noise variance, as opposed to the noise variance assumed in the 2 EK; the scaling of σ∗ with ρ is explained in footnote 1. For a fixed target η, the MSE is = ( dx)−1 [η(x) − fEK (x)]2 dx. Averaging over the noise process ν and target function η gives in Fourier space 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 ds [1 + σ 2 /(ρS(s))]2 (7) where Sη (s) is the power spectrum of the prior over target functions. In the case S(s) = 2 Sη (s) and σ 2 = σ∗ where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (σ 2 /ρ) [1 + σ 2 /(ρS(s))]−1 ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] = σ2 2 ˜ ˜ Sη (s)[1 − h(s)]2 + (σ∗ /ρ)h2 (s) ds = ρ α=2 0.5 0.03 0.025 ε 0.02 0.015 0.01 α=4 0.1 0.005 0 −0.005 1 0.05 1 0.5 0.5 0 0 −0.5 −0.5 −1 25 −1 50 100 ρ 250 500 Figure 2: Left: plot of the asymptotic form of the EK (sc /r)J1 (2πsc r) for D = 2 and ρ = 1225. Right: log-log plot of against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc , which is justified for large ρ, one obtains for an SE target and learner ≈ σ 2 sc /ρ ∝ (log(ρ))D/2 /ρ. To get analogous predictions for the mismatched case, one can write equation 7 as = 2 σ∗ ρ [1 + σ 2 /(ρS(s))] − σ 2 /(ρS(s)) ds + [1 + σ 2 /(ρS(s))]2 Sη (s) ds. [S(s)ρ/σ 2 + 1]2 2 The first integral is smaller than (σ∗ /σ 2 ) B and can be neglected as long as B . In the second integral we can again make the cutoff approximation—though now with s having ∞ to be above sc – to get the scaling ∝ sc sD−1 Sη (s) ds. For target functions with a power-law decay Sη (s) ∝ s−α of the power spectrum at large s this predicts ∝ sD−α ∝ c (log(ρ))(D−α)/2 . So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ∼ (log(ρ)) −1/2 , and for the Matern-class covariance function k(r) = (1 + r/ ) exp(−r/ ) (which has power spectrum ∝ (3/ 2 + 4π 2 s2 )−2 , so α = 4) we get ∼ (log(ρ))−3/2 . These predictions were tested experimentally using a GP learner with SE covariance function ( = 0.1 and assumed noise level σ 2 = 0.1) against targets from the OU and Matern-class priors (with 2 = 0.05) and with noise level σ∗ = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of on log(ρ), in Figure 2(right) we make a log-log plot of against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ. 3.1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ−1 i h(x∗ − xi )yi at x∗ . Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting · denote averaging over noise, training input points and the test point and setting fη (x∗ ) = h(x, x∗ )η(x)dx, the average MSE of the EK predictor is pred = [η(x) − (1/ρ) i h(x, xi )yi ]2 = [η(x) − fη (x)]2 + = σ2 ρ 2 σ∗ ρ h2 (x, x )dx + 1 ρ h2 (x, x )η 2 (x )dx − 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 η2 ds + 2 /(ρS(s))]2 [1 + σ ρ 1 ρ 2 fη (x) ds [1 + σ 2 /(ρS(s))]2 Here we have set η 2 = ( dx)−1 η 2 (x) dx = Sη (s) ds for the spatial average of the 2 squared target amplitude. Taking the matched case, (Sη (s) = S(s) and σ∗ = σ 2 ) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ 2 sD /ρ, while the second one is ∼ η 2 sD /ρ. Thus c c both terms scale in the same way, but the ratio of the second term to the first is the signal2 2 to-noise ratio η /σ , which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x∗ ) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets. In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner’s covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves. Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views. References [1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984. [2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998. [3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990. [4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995. [5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1991. Third Edition. [6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996. [7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985. [8] B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, 2002. o [9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999. [10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999. [11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002. [12] P. Sollich. In NIPS 14, pages 519–526, 2002.
6 0.64613795 178 nips-2004-Support Vector Classification with Input Data Uncertainty
7 0.64553982 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
8 0.64500654 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
9 0.64496011 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
10 0.64446908 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
11 0.64420563 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
12 0.64159679 131 nips-2004-Non-Local Manifold Tangent Learning
13 0.63834339 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
14 0.63816124 148 nips-2004-Probabilistic Computation in Spiking Populations
15 0.63774961 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
16 0.63740999 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
17 0.63626236 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
18 0.63477606 69 nips-2004-Fast Rates to Bayes for Kernel Machines
19 0.63142139 68 nips-2004-Face Detection --- Efficient and Rank Deficient
20 0.6313138 187 nips-2004-The Entire Regularization Path for the Support Vector Machine