nips nips2004 nips2004-94 knowledge-graph by maker-knowledge-mining

94 nips-2004-Kernels for Multi--task Learning


Source: pdf

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this setting, the kernel is a matrix–valued function. [sent-3, score-0.124]

2 In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. [sent-5, score-0.684]

3 We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. [sent-6, score-0.184]

4 Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. [sent-7, score-0.288]

5 We focus on linear spaces of such functions that admit a reproducing kernel, see [7]. [sent-9, score-0.232]

6 Our main motivation is the practical problem of multi–task learning where we wish to learn many related regression or classification functions simultaneously, see eg [3, 5, 6]. [sent-11, score-0.104]

7 Moreover, the spaces of vector–valued functions described in this paper may be useful for learning continuous transformations. [sent-16, score-0.09]

8 For example, in face animation X represents pose and expression of a face and Y a space of functions IR2 → IR, although in practice one considers discrete images in which case f (x) is a finite dimensional vector whose components are associated to the image pixels. [sent-18, score-0.085]

9 A common approach to learning a vector–valued function f = (f1 , . . . , fn ) consists in separately representing each component of f by a linear space of smooth functions and then learning these components independently, for example by minimizing some regularized error functional. [sent-23, score-0.098]
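
For contrast with what follows, the component-wise baseline can be written in a few lines (a minimal Python/numpy sketch; plain ridge regression per output component and all names here are illustrative choices, not taken from the paper):

```python
import numpy as np

def independent_ridge(X, Y, gamma=0.1):
    """Fit each output component of Y separately by ridge regression.

    X: m x d inputs, Y: m x n outputs. Returns a d x n weight matrix whose
    columns are learned independently, ignoring any relation between components.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 4))
Y = rng.standard_normal((20, 3))
print(independent_ridge(X, Y).shape)   # (4, 3): one weight vector per component
```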

10 This approach does not capture relations between components of f (which are associated to tasks or pixels in the examples above) and should not be the method of choice when these relations occur. [sent-24, score-0.071]

11 In this paper we investigate how kernels can be used for representing vector–valued functions. [sent-25, score-0.097]

12 We proposed to do this by using a matrix–valued kernel K : X × X → IRn×n that reflects the interaction amongst the components of f . [sent-26, score-0.124]

13 For example, in the case of support vector machines (SVM’s) [10], appropriate choices of the matrix–valued kernel implement a trade–off between large margin of each per–task SVM and large margin of combinations of these SVM’s, eg their average. [sent-28, score-0.209]
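
To make this trade-off concrete, here is a hedged numpy sketch of one simple matrix-valued kernel that interpolates between "all tasks share one function" and "tasks are independent"; the Gaussian base kernel and the mixing weight omega are my own illustrative choices, not the paper's construction:

```python
import numpy as np

def matrix_valued_kernel(x, t, n_tasks, omega=0.5):
    """Illustrative matrix-valued kernel K(x, t) in IR^{n x n}.

    A scalar base kernel k(x, t) multiplies a fixed PSD matrix that mixes a
    rank-one averaging term (couples tasks through their mean) with the
    identity (leaves tasks uncoupled).
    """
    k = np.exp(-0.5 * np.dot(x - t, x - t))            # scalar Gaussian base kernel
    averaging = np.ones((n_tasks, n_tasks)) / n_tasks  # 'average of the tasks' term
    independent = np.eye(n_tasks)                      # per-task term
    return k * (omega * averaging + (1.0 - omega) * independent)

x, t = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(matrix_valued_kernel(x, t, n_tasks=3, omega=0.8))
```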

14 In section 2 we formalize the above observations and show that reproducing kernel Hilbert spaces (RKHS) of vector–valued functions admit a kernel with values which are bounded linear operators on the output space Y, and we characterize the form of some of these operators in section 3. [sent-30, score-0.472]

15 Finally, in section 4 we provide a novel proof for the representer theorem which is based on the notion of minimal norm interpolation and present linear multi–task learning algorithms. [sent-31, score-0.244]

16 2 RKHS of vector–valued functions Let Y be a real Hilbert space with inner product (·, ·), X a set, and H a linear space of functions on X with values in Y. [sent-32, score-0.146]

17 1 Matrix–valued kernels based on Aronszajn The first approach extends the scalar case, Y = IR, in [2]. [sent-36, score-0.176]

18 Definition 1 We say that H is a reproducing kernel Hilbert space (RKHS) of functions f : X → Y, when for any y ∈ Y and x ∈ X the linear functional which maps f ∈ H to (y, f (x)) is continuous on H. [sent-37, score-0.371]

19 It follows (see, e.g., [1]) that, for every x ∈ X and y ∈ Y, there is a linear operator Kx : Y → H such that (y, f (x)) = ⟨Kx y, f ⟩. [sent-40, score-0.098]

20 For every x, t ∈ X we also introduce the linear operator K(x, t) : Y → Y defined, for every y ∈ Y, by K(x, t)y := (Kt y)(x). [sent-42, score-0.136]

21 In the proposition below we state the main properties of the function K. [sent-44, score-0.056]

22 To this end, we let L(Y) be the set of all bounded linear operators from Y into itself and, for every A ∈ L(Y), we denote by A∗ its adjoint. [sent-45, score-0.121]

23 We also use L+ (Y) to denote the cone of positive semidefinite bounded linear operators, i.e. [sent-46, score-0.064]

24 A ∈ L+ (Y) provided that, for every y ∈ Y, (y, Ay) ≥ 0. [sent-48, score-0.064]

25 Finally, we say that H is normal provided there does not exist (x, y) ∈ X × (Y\{0}) such that the linear functional (y, f (x)) = 0 for all f ∈ H. [sent-51, score-0.152]

26 Proposition 1 If K(x, t) is defined, for every x, t ∈ X , by the equation K(x, t)y := (Kt y)(x) above, [sent-52, score-0.088]

27 then the kernel K satisfies, for every x, t ∈ X , the following properties: (a) For every y, z ∈ Y, we have that (y, K(x, t)z) = ⟨Kt z, Kx y⟩. [sent-54, score-0.2]

28 We prove (a) by merely choosing f = Kt z in the reproducing identity (y, f (x)) = ⟨Kx y, f ⟩. [sent-59, score-0.101]

29 Consequently, from this equation, we conclude that K(x, t) admits an algebraic adjoint K(t, x) defined everywhere on Y and so, by the uniform boundedness principle, see, eg, [1, p. [sent-63, score-0.116]

30 As for (c), we again use property (a) to obtain that Σj,ℓ∈INm (yj , K(xj , xℓ )yℓ ) = Σj,ℓ∈INm ⟨Kxj yj , Kxℓ yℓ ⟩ = ||Σj∈INm Kxj yj ||^2 ≥ 0. [sent-69, score-0.061]
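
Property (c) is easy to probe numerically for any candidate matrix-valued kernel: assemble the nm × nm block Gram matrix and check that its smallest eigenvalue is (numerically) nonnegative. A small sketch, reusing an illustrative coupled Gaussian kernel rather than any specific example from the paper:

```python
import numpy as np

def block_gram(kernel, xs, n_tasks):
    """Assemble the (m*n) x (m*n) block Gram matrix [K(x_j, x_l)]."""
    m = len(xs)
    G = np.zeros((m * n_tasks, m * n_tasks))
    for j in range(m):
        for l in range(m):
            G[j*n_tasks:(j+1)*n_tasks, l*n_tasks:(l+1)*n_tasks] = kernel(xs[j], xs[l])
    return G

rng = np.random.default_rng(0)
xs = [rng.standard_normal(2) for _ in range(5)]
kernel = lambda x, t: np.exp(-0.5 * np.sum((x - t) ** 2)) * (0.3 * np.ones((3, 3)) / 3 + 0.7 * np.eye(3))
G = block_gram(kernel, xs, n_tasks=3)
print(np.min(np.linalg.eigvalsh(G)) >= -1e-10)   # property (c): block Gram matrix is PSD
```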

31 For simplicity, we say that K : X × X → L(Y) is a matrix–valued kernel (or simply a kernel if no confusion will arise) if it satisfies properties (b) and (c). [sent-71, score-0.272]

32 In the spirit of the Moore–Aronszajn theorem for RKHS of scalar functions [2], it can be shown that if K : X × X → L(Y) is a kernel then there exists a unique (up to an isometry) RKHS of functions from X to Y which admits K as the reproducing kernel. [sent-73, score-0.456]

33 Thus, H1 consists of functions which are linear in their second variable. [sent-78, score-0.081]

34 It then follows that H1 is a RKHS with reproducing scalar–valued kernel defined, for all (x, y), (t, z) ∈ X × Y, by the formula K1 ((x, y), (t, z)) := (y, K(x, t)z). [sent-80, score-0.202]

35 Feature map The second approach uses the notion of a feature map, see e.g. [sent-83, score-0.092]

36 A feature map is a function Φ : X × Y → W where W is a Hilbert space. [sent-86, score-0.059]

37 A feature map representation of a kernel K has the property that, for every x, t ∈ X and y, z ∈ Y there holds the equation (Φ(x, y), Φ(t, z)) = (y, K(x, t)z). [sent-87, score-0.291]

38 From the above identities we conclude that every kernel admits a feature map representation (a Mercer type theorem) with W = H. [sent-89, score-0.357]
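
The feature map identity (Φ(x, y), Φ(t, z)) = (y, K(x, t)z) can be checked directly in a finite-dimensional toy case (a sketch; the particular choices K(x, t) = (φ(x), φ(t)) I and Φ(x, y) = y ⊗ φ(x) are mine, used only to illustrate the identity):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = lambda x: np.array([x[0], x[1], x[0] * x[1]])   # illustrative scalar feature map
n = 3                                                  # output dimension (number of tasks)

def K(x, t):
    return np.dot(phi(x), phi(t)) * np.eye(n)          # a simple matrix-valued kernel

def Phi(x, y):
    return np.kron(y, phi(x))                          # feature map on X x Y

x, t = rng.standard_normal(2), rng.standard_normal(2)
y, z = rng.standard_normal(n), rng.standard_normal(n)
lhs = np.dot(Phi(x, y), Phi(t, z))
rhs = np.dot(y, K(x, t) @ z)
print(np.isclose(lhs, rhs))                            # (Phi(x,y), Phi(t,z)) == (y, K(x,t) z)
```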

39 Much more importantly, we can begin with a feature map Φ(x, λ) = ((Φℓ (x), λ) : ℓ ∈ IN) where λ ∈ W, this being the space of square summable sequences on IN. [sent-92, score-0.059]

40 We choose fw = Σr∈IN wr Φr for each w ∈ W and conclude that the space of all such functions is a Hilbert space of functions from X to Y whose reproducing kernel is the one induced by the feature map. [sent-94, score-0.219]

41 These remarks connect feature maps to kernels and vice versa. [sent-96, score-0.147]

42 Note that a kernel may have many feature maps which represent it, and a feature map representation for a kernel may not be the appropriate way to write it for numerical computations. [sent-97, score-0.327]

43 3 Kernel construction In this section we characterize a wide variety of kernels which are potentially useful for applications. [sent-98, score-0.14]

44 1 Linear kernels A first natural question concerning RKHS of vector–valued functions is: if X is IRd what is the form of linear kernels? [sent-100, score-0.178]

45 In the scalar case a linear kernel is a quadratic form, namely K(x, t) = (x, Qt), where Q is a d × d positive semidefinite matrix. [sent-101, score-0.267]

46 We claim that for Y = IRn any linear matrix–valued kernel K = (Kℓq : ℓ, q ∈ INn ) has the form Kℓq (x, t) = (Bℓ x, Bq t), x, t ∈ IRd (3.1) [sent-102, score-0.162]

47 To see that such a K is a kernel, simply note that K is already in the Mercer (feature map) form described in section 2. [sent-104, score-0.124]

48 On the other hand, since any linear kernel has a Mercer representation with linear features, we conclude that all linear kernels have the form (3.1). [sent-106, score-0.407]
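
A hedged sketch of such a linear matrix-valued kernel, with randomly chosen p × d matrices Bℓ standing in for whatever task-specific maps an application would use:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 4, 6, 5                                     # tasks, input dim, shared feature dim
B = [rng.standard_normal((p, d)) for _ in range(n)]   # one p x d matrix B_l per task

def K_linear(x, t):
    """K_{lq}(x, t) = (B_l x, B_q t): an n x n matrix for each pair of inputs."""
    V = np.stack([B_l @ x for B_l in B])              # row l: B_l x
    W = np.stack([B_q @ t for B_q in B])              # row q: B_q t
    return V @ W.T

x, t = rng.standard_normal(d), rng.standard_normal(d)
print(K_linear(x, t).shape)                           # (4, 4)
print(np.allclose(K_linear(x, t), K_linear(t, x).T))  # symmetry property (b)
```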

49 A special case is provided by choosing p = d and Bℓ to be diagonal matrices. [sent-108, score-0.074]

50 This situation is important in multi–task learning, see eg [5]. [sent-110, score-0.061]

51 We are interested in vector–valued functions f : X → IRn whose coordinates are given by fℓ (x) = gℓ (Pℓ x), where x = (xℓ : xℓ ∈ Xℓ , ℓ ∈ INn ) and Pℓ : X → Xℓ is the projection operator defined, for every x ∈ X , by Pℓ (x) = xℓ , ℓ ∈ INn . [sent-112, score-0.103]

52 For ℓ, q ∈ INn , we suppose kernel functions Cℓq : Xℓ × Xq → IR are given such that the matrix–valued kernel whose elements are defined as Kℓq (x, t) := Cℓq (Pℓ x, Pq t), ℓ, q ∈ INn , satisfies properties (b) and (c) of Proposition 1. [sent-113, score-0.857]

53 An example of this construction is provided again by linear functions. [sent-114, score-0.064]

54 Specifically, we choose Xℓ = IRd , where d ∈ IN, and Cℓq (xℓ , tq ) = (Qℓ xℓ , Qq tq ), xℓ ∈ Xℓ , tq ∈ Xq , where Qℓ are p × d matrices. [sent-115, score-0.147]

55 In this case, the matrix–valued kernel K = (Kℓq : ℓ, q ∈ INn ) is given by Kℓq (x, t) = (Qℓ Pℓ x, Qq Pq t). [sent-116, score-0.124]

56 2 Combinations of kernels The results in this section are based on a lemma by Schur which states that the elementwise product of two positive semidefinite matrices is also positive semidefinite, see [2, p. [sent-120, score-0.294]

57 This result implies that, when Y is finite dimensional, the elementwise product of two matrix–valued kernels is also a matrix–valued kernel. [sent-122, score-0.167]

58 Lemma 1 If Y = IRn and K1 and K2 are matrix–valued kernels then their elementwise product is a matrix–valued kernel. [sent-125, score-0.167]
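
Lemma 1 can be illustrated numerically: take two matrix-valued kernels, form their elementwise (Schur) product pointwise, and confirm that the block Gram matrix it generates is still positive semidefinite (the kernels, sizes and coupling matrices below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, d = 3, 6, 4
xs = rng.standard_normal((m, d))

def gram(kernel):
    G = np.zeros((m * n, m * n))
    for j in range(m):
        for l in range(m):
            G[j*n:(j+1)*n, l*n:(l+1)*n] = kernel(xs[j], xs[l])
    return G

A1 = rng.standard_normal((n, n)); A1 = A1 @ A1.T       # two fixed PSD coupling matrices
A2 = rng.standard_normal((n, n)); A2 = A2 @ A2.T
K1 = lambda x, t: np.dot(x, t) * A1                     # linear matrix-valued kernel
K2 = lambda x, t: np.exp(-0.5 * np.sum((x - t)**2)) * A2

K12 = lambda x, t: K1(x, t) * K2(x, t)                  # elementwise (Schur) product
print(np.min(np.linalg.eigvalsh(gram(K12))) >= -1e-10)  # still a matrix-valued kernel
```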

59 This result allows us, for example, to enhance the linear kernel (3.1). [sent-126, score-0.186]

60 In particular, if r is a positive integer, we define, for every ℓ, q ∈ INn , Kℓq (x, t) := (Bℓ x, Bq t)^r and conclude that K = (Kℓq : ℓ, q ∈ INn ) is a kernel. [sent-128, score-0.165]

61 Lemma 2 If G : IRd × IRd → IR is a kernel and zℓ : X → IRd a vector–valued function, for ℓ ∈ INn , then the matrix–valued function K : X × X → IRn×n whose elements are defined, for every x, t ∈ X , by Kℓq (x, t) = G(zℓ (x), zq (t)) is a matrix–valued kernel. [sent-129, score-0.162]

62 This lemma confirms, as a special case, that if zℓ (x) = Bℓ x with Bℓ a p × d matrix, ℓ ∈ INn , and G is a scalar–valued kernel, then the matrix–valued function with elements Kℓq (x, t) = G(Bℓ x, Bq t) is a kernel. [sent-130, score-0.074]

63 In the scalar case it is well–known that a nonnegative combination of kernels is a kernel. [sent-133, score-0.176]

64 Proposition 2 If Kj , j ∈ INs , s ∈ IN, are scalar–valued kernels and Aj ∈ L+ (Y) then the function K = Σj∈INs Aj Kj (3.10) is a matrix–valued kernel. [sent-135, score-0.097]

65 Kernels of the form (3.10) can be used to generate a wide variety of matrix–valued kernels which have the flexibility needed for learning. [sent-141, score-0.114]

66 For example, we obtain polynomial matrix–valued kernels by setting X = IRd and Kj (x, t) = (x, t)^j , where x, t ∈ IRd . [sent-142, score-0.097]

67 We remark that, generally, the kernel in equation (3. [sent-143, score-0.174]

68 An interesting case of Proposition 2 is provided by low rank kernels which may be useful in situations where the components of f are linearly related, that is, for every f ∈ H and x ∈ X , f (x) lies in a linear subspace M ⊆ Y. [sent-145, score-0.199]

69 In this case, it is desirable to use a kernel which has the same property that f (x) ∈ M, x ∈ X for all f ∈ H. [sent-146, score-0.124]
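
A quick numerical illustration of this low-rank case (the rank-one matrices Aj and the scalar kernels below are my own choices, not the paper's): when every Aj projects onto a common subspace M, any function f(x) = Σi K(x, xi) ci built from kernel sections stays in M.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 4, 3, 5
U = np.linalg.qr(rng.standard_normal((n, 2)))[0]        # orthonormal basis of a 2-dim subspace M of Y = IR^4
A1 = np.outer(U[:, 0], U[:, 0])                         # low-rank A_j in L_+(Y)
A2 = np.outer(U[:, 1], U[:, 1])

k1 = lambda x, t: np.dot(x, t)                          # scalar-valued kernels K_j
k2 = lambda x, t: (1.0 + np.dot(x, t)) ** 2
K = lambda x, t: A1 * k1(x, t) + A2 * k2(x, t)          # Proposition 2: K = sum_j A_j K_j

xs = rng.standard_normal((m, d))
cs = rng.standard_normal((m, n))                        # arbitrary coefficients c_i in Y
x_new = rng.standard_normal(d)
f_x = sum(K(x_new, xs[i]) @ cs[i] for i in range(m))    # f(x) = sum_i K(x, x_i) c_i
print(np.allclose(U @ (U.T @ f_x), f_x))                # projecting onto M changes nothing: f(x) in M
```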

70 Specifically, K(x, t) = Σj∈INs Aj e^(−σj ||x−t||^2) is a kernel on X × X for any {Aj : j ∈ INs } ⊆ L+ (IRn ). [sent-150, score-0.124]

71 4 Regularization and minimal norm interpolation Let V : Y^m × IR+ → IR be a prescribed function and consider the problem of minimizing the functional E(f ) := V ((f (xj ) : j ∈ INm ), ||f ||^2). [sent-151, score-0.185]

72 A special case is covered by the functional of the form E(f ) := Σj∈INm Q(yj , f (xj )) + γ ||f ||^2 (4.12) [sent-153, score-0.083]

73 where γ is a positive parameter and Q : Y × Y → IR+ is some prescribed loss function, eg the square loss. [sent-154, score-0.114]

74 Within this general setting we provide a “representer theorem” for any function which minimizes the functional E above. [sent-155, score-0.114]

75 Our proof technique uses the idea of minimal norm interpolation, a central notion in function estimation and interpolation. [sent-158, score-0.106]

76 Lemma 3 If y ∈ {(f (xj ) : j ∈ INm ) : f ∈ H} then the minimum of the problem min{ ||f ||^2 : f (xj ) = yj , j ∈ INm } is unique and admits the form f̂ = Σj∈INm Kxj cj for some {cj : j ∈ INm } ⊆ Y. [sent-159, score-0.16]

77 Theorem 1 If for every y ∈ Y^m the function h : IR+ → IR+ defined for t ∈ IR+ by h(t) := V (y, t) is strictly increasing and f0 ∈ H is a local minimum of E then f0 = Σj∈INm Kxj cj for some {cj : j ∈ INm } ⊆ Y. [sent-166, score-0.073]
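
For the square loss, the representer theorem turns the minimization of (4.12) into a finite linear system for the coefficients cj (a hedged sketch: the stacking convention, the kernel and the value of γ are illustrative; a practical solver would exploit structure and condition the system more carefully):

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, d, gamma = 3, 8, 5, 0.5
xs = rng.standard_normal((m, d))
ys = rng.standard_normal((m, n))                        # one y_j in Y = IR^n per example

A = np.ones((n, n)) / n + np.eye(n)                     # fixed PSD task-coupling matrix
K = lambda x, t: np.exp(-0.5 * np.sum((x - t) ** 2)) * A

# block Gram matrix G[j, l] = K(x_j, x_l), assembled into an (m*n) x (m*n) array
G = np.block([[K(xs[j], xs[l]) for l in range(m)] for j in range(m)])
c = np.linalg.solve(G + gamma * np.eye(m * n), ys.reshape(-1)).reshape(m, n)

# the minimizer is f0(x) = sum_j K(x, x_j) c_j, as the representer theorem states
f0 = lambda x: sum(K(x, xs[j]) @ c[j] for j in range(m))
print(f0(rng.standard_normal(d)))
```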

78 1 Linear regularization We comment on regularization for linear multi–task learning and therefore consider minimizing the functional R0 (w) := Σj∈INm Σℓ∈INn Q(yjℓ , (w, Bℓ xj )) + γ ||w||^2 . [sent-171, score-0.221]

79 We set uℓ = Bℓ∗ w, u = (uℓ : ℓ ∈ INn ), and observe that the above functional is related to the functional R1 (u) := Σj∈INm Σℓ∈INn Q(yjℓ , (uℓ , xj )) + γJ(u) (4.15) [sent-173, score-0.186]

80 A function f0 ∈ H is a local minimum for E provided that there is a positive number ε such that whenever f ∈ H satisfies ||f0 − f || ≤ ε then E(f0 ) ≤ E(f ). [sent-174, score-0.069]

81 where we have defined the minimum norm functional J(u) := min{ ||w||^2 : w ∈ IRp , Bℓ∗ w = uℓ , ℓ ∈ INn }. [sent-175, score-0.107]

82 The minimum in (4.16) is given by ŵ = Σℓ∈INn Bℓ ĉℓ , where the vectors {ĉℓ : ℓ ∈ INn } ⊆ IRd satisfy the linear equations Σk∈INn Bℓ∗ Bk ĉk = uℓ , ℓ ∈ INn , and J(u) = Σℓ,q∈INn (uℓ , (B̃^(−1))ℓq uq ), provided the block matrix B̃ = (Bℓ∗ Bq : ℓ, q ∈ INn ), with d × d blocks, is nonsingular. [sent-179, score-0.217]
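
This can be checked numerically: the minimum-norm value J(u) computed directly from the constraints agrees with the quadratic form in B̃^(−1) (a sketch with arbitrary random matrices Bℓ; p is taken large enough that B̃ is nonsingular):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, p = 3, 2, 10                                     # need p >= n*d for B~ to be nonsingular
B = [rng.standard_normal((p, d)) for _ in range(n)]
u = [rng.standard_normal(d) for _ in range(n)]

# direct route: minimum-norm w subject to B_l^T w = u_l for all l
B_all = np.hstack(B)                                   # p x (n*d); constraints are B_all^T w = u_stacked
u_stacked = np.concatenate(u)
w_hat, *_ = np.linalg.lstsq(B_all.T, u_stacked, rcond=None)   # consistent system: min-norm solution
J_direct = np.dot(w_hat, w_hat)

# formula route: J(u) = u^T B~^{-1} u with B~ the block matrix (B_l^T B_q)
B_tilde = B_all.T @ B_all
J_formula = u_stacked @ np.linalg.solve(B_tilde, u_stacked)
print(np.isclose(J_direct, J_formula))
```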

83 In this case we simply replace xj in equation (4.15) by xj,ℓ ∈ IRd and the matrix Bℓ by Qℓ Pℓ , see section 3. [sent-182, score-0.065]

84 As a special example we choose Bℓ to be the (n + 1)d × d matrix whose d × d blocks are all zero except for the 1st and (ℓ + 1)th block, which are equal to c^(−1) Id and Id respectively, where c > 0 and Id is the d−dimensional identity matrix. [sent-184, score-0.108]

85 In this case the kernel (3.1) and the functional (4.16) are given by Kℓq (x, t) = (x, t)(c^(−2) + δℓq ) and J(u) = (c^2/(n + c^2)) Σℓ∈INn ||uℓ ||^2 + (n/(n + c^2)) Σℓ∈INn ||uℓ − (1/n) Σq∈INn uq ||^2 . [sent-189, score-0.064]

86 If c is small the task parameters are related (close to their average) whereas a large value of c means the tasks are learned independently. [sent-198, score-0.07]
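
A small sketch of the role of c, using the closed form Kℓq(x, t) = (x, t)(c^(−2) + δℓq) obtained above by plugging the block matrices Bℓ into (3.1) (treat it as a derivation from the document's own definitions, not a verbatim quote):

```python
import numpy as np

def K_average(x, t, n_tasks, c):
    """K_{lq}(x, t) = (x, t) * (c**-2 + delta_{lq}), the averaging multi-task kernel."""
    coupling = (c ** -2) * np.ones((n_tasks, n_tasks)) + np.eye(n_tasks)
    return np.dot(x, t) * coupling

x = np.array([1.0, -2.0]); t = np.array([0.5, 1.0])
for c in (0.1, 1.0, 10.0):
    K = K_average(x, t, n_tasks=3, c=c)
    off = K[0, 1] / K[0, 0]                       # relative strength of cross-task coupling
    print(f"c={c:5.1f}  cross/within coupling ratio = {off:.3f}")
```

Small c makes the cross-task entries dominate (tasks pulled toward their average), while large c leaves an essentially diagonal, independent-tasks kernel, matching the interpretation above.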

87 This can be done by either introducing the matrices Bℓ as in the previous example, or by modifying the functional on the right hand side of equation (4.15). [sent-204, score-0.134]

88 For example, we choose an n × n symmetric matrix A all of whose entries are in the unit interval, and consider the regularizer J(u) := (1/2) Σℓ,q∈INn ||uℓ − uq ||^2 Aℓq = Σℓ,q∈INn (uℓ , uq ) Lℓq (4.19) [sent-206, score-0.224]

89 The matrix A could be the weight matrix of a graph with n vertices and L the graph Laplacian, see eg [4]. [sent-208, score-0.191]

90 The equation Aℓq = 0 means that tasks ℓ and q are not related, whereas Aℓq = 1 means a strong relation. [sent-209, score-0.077]

91 In order to derive the matrix–valued kernel we note that (4.19) [sent-210, score-0.124]

92 can be written as (u, L̃ u), where L̃ is the n × n block matrix whose ℓ, q block is the d × d matrix Id Lℓq . [sent-211, score-0.178]

93 Thus, we define w = L̃^(1/2) u so that we have uℓ = Pℓ L̃^(−1/2) w (here L̃^(−1) is the pseudoinverse), where Pℓ is the projection matrix from IRdn to IRd . [sent-212, score-0.065]
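
The Laplacian construction can also be exercised numerically: the pairwise form and the Laplacian form of the regularizer in (4.19) coincide, and a matrix-valued kernel coupling tasks through the pseudoinverse of L can then be written down (the kernel formula at the end is an illustrative guess consistent with the construction above, not a quoted result):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 4, 3
A = np.array([[0, 1, .5, 0],
              [1, 0, 0, .2],
              [.5, 0, 0, 1],
              [0, .2, 1, 0]])                       # task-relatedness graph (symmetric, entries in [0, 1])
L = np.diag(A.sum(axis=1)) - A                      # graph Laplacian

u = rng.standard_normal((n, d))                     # one weight vector u_l per task
J_pairwise = 0.5 * sum(A[l, q] * np.sum((u[l] - u[q]) ** 2) for l in range(n) for q in range(n))
J_laplacian = sum(L[l, q] * np.dot(u[l], u[q]) for l in range(n) for q in range(n))
print(np.isclose(J_pairwise, J_laplacian))          # the identity behind (4.19)

# an illustrative matrix-valued kernel that couples tasks through the pseudoinverse of L
L_pinv = np.linalg.pinv(L)
K = lambda x, t: np.dot(x, t) * L_pinv              # K_{lq}(x, t) = (x, t) (L^+)_{lq}
print(K(np.ones(d), np.ones(d)).shape)
```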

94 Using the results of section 3.2 one can form polynomials or non-linear functions of the above linear kernels. [sent-216, score-0.081]

95 The minimizer of (4.12) is still a linear combination of the kernel evaluated at the given data examples. [sent-218, score-0.162]

96 5 Conclusions and future directions We have described reproducing kernel Hilbert spaces of vector–valued functions and discussed their use in multi–task learning. [sent-219, score-0.292]

97 We have provided a wide class of matrix–valued kernels which should prove useful in applications. [sent-220, score-0.14]

98 This problem seems more challenging than its scalar counterpart due to the possibly large dimension of the output space. [sent-222, score-0.079]

99 Finally, it would be interesting to link the choice of matrix–valued kernels to the notion of relatedness between tasks discussed in [5]. [sent-225, score-0.194]

100 Theory of linear operators in Hilbert spaces, volume I. [sent-235, score-0.083]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('inn', 0.551), ('valued', 0.501), ('inm', 0.331), ('ird', 0.177), ('multi', 0.175), ('ir', 0.162), ('ins', 0.129), ('kernel', 0.124), ('kx', 0.112), ('irn', 0.109), ('hilbert', 0.103), ('kernels', 0.097), ('rkhs', 0.088), ('scalar', 0.079), ('reproducing', 0.078), ('irp', 0.073), ('matrix', 0.065), ('functional', 0.064), ('admits', 0.064), ('bq', 0.064), ('uq', 0.064), ('kj', 0.061), ('eg', 0.061), ('yj', 0.061), ('aj', 0.059), ('xj', 0.058), ('proposition', 0.056), ('lemma', 0.055), ('kt', 0.054), ('conclude', 0.052), ('equation', 0.05), ('tq', 0.049), ('id', 0.049), ('representer', 0.049), ('micchelli', 0.048), ('elementwise', 0.048), ('spaces', 0.047), ('svm', 0.045), ('operators', 0.045), ('task', 0.043), ('functions', 0.043), ('linear', 0.038), ('semide', 0.038), ('every', 0.038), ('albany', 0.037), ('kxj', 0.037), ('relatedness', 0.037), ('notion', 0.033), ('map', 0.032), ('schur', 0.032), ('mercer', 0.031), ('regularizer', 0.031), ('choosing', 0.029), ('qq', 0.029), ('colt', 0.028), ('tg', 0.027), ('trade', 0.027), ('prescribed', 0.027), ('tasks', 0.027), ('feature', 0.027), ('characterize', 0.026), ('norm', 0.026), ('positive', 0.026), ('provided', 0.026), ('interpolation', 0.026), ('xq', 0.026), ('admit', 0.026), ('theorem', 0.025), ('minimal', 0.025), ('block', 0.024), ('say', 0.024), ('enhance', 0.024), ('pq', 0.024), ('vector', 0.024), ('satis', 0.023), ('bj', 0.023), ('remarks', 0.023), ('closeness', 0.022), ('regularization', 0.022), ('operator', 0.022), ('product', 0.022), ('relations', 0.022), ('proof', 0.022), ('merely', 0.022), ('valuable', 0.022), ('minimizer', 0.022), ('annual', 0.021), ('th', 0.02), ('matrices', 0.02), ('nite', 0.02), ('representation', 0.02), ('special', 0.019), ('min', 0.018), ('dimensional', 0.018), ('cj', 0.018), ('laplacian', 0.018), ('consequently', 0.018), ('minimizing', 0.017), ('wide', 0.017), ('minimum', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 94 nips-2004-Kernels for Multi--task Learning

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1

2 0.1105189 168 nips-2004-Semigroup Kernels on Finite Sets

Author: Marco Cuturi, Jean-philippe Vert

Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1

3 0.099857256 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty

Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1

4 0.097061165 42 nips-2004-Computing regularization paths for learning multiple kernels

Author: Francis R. Bach, Romain Thibaux, Michael I. Jordan

Abstract: The problem of learning a sparse conic combination of kernel functions or kernel matrices for classification or regression can be achieved via the regularization by a block 1-norm [1]. In this paper, we present an algorithm that computes the entire regularization path for these problems. The path is obtained by using numerical continuation techniques, and involves a running time complexity that is a constant times the complexity of solving the problem for one value of the regularization parameter. Working in the setting of kernel linear regression and kernel logistic regression, we show empirically that the effect of the block 1-norm regularization differs notably from the (non-block) 1-norm regularization commonly used for variable selection, and that the regularization path is of particular value in the block case. 1

5 0.084728651 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie

Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1

6 0.075497329 92 nips-2004-Kernel Methods for Implicit Surface Modeling

7 0.068332016 30 nips-2004-Binet-Cauchy Kernels

8 0.065785192 69 nips-2004-Fast Rates to Bayes for Kernel Machines

9 0.065379262 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

10 0.06417188 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

11 0.060441278 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

12 0.056471065 54 nips-2004-Distributed Information Regularization on Graphs

13 0.055307552 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

14 0.054452013 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

15 0.051850684 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

16 0.051350053 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

17 0.05130722 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

18 0.049000435 115 nips-2004-Maximum Margin Clustering

19 0.048145503 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

20 0.047863614 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.151), (1, 0.074), (2, -0.027), (3, 0.096), (4, -0.066), (5, -0.069), (6, -0.054), (7, -0.15), (8, 0.044), (9, -0.042), (10, -0.051), (11, -0.035), (12, 0.023), (13, 0.058), (14, 0.016), (15, -0.01), (16, 0.065), (17, -0.017), (18, 0.044), (19, 0.003), (20, 0.01), (21, -0.071), (22, 0.018), (23, 0.012), (24, -0.056), (25, -0.038), (26, 0.051), (27, -0.016), (28, -0.08), (29, 0.094), (30, -0.034), (31, -0.038), (32, -0.027), (33, 0.002), (34, 0.015), (35, -0.043), (36, 0.027), (37, 0.088), (38, -0.053), (39, -0.074), (40, 0.016), (41, 0.036), (42, 0.089), (43, -0.046), (44, -0.007), (45, -0.012), (46, -0.016), (47, 0.034), (48, -0.093), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93456918 94 nips-2004-Kernels for Multi--task Learning

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1

2 0.78033721 30 nips-2004-Binet-Cauchy Kernels

Author: Alex J. Smola, S.v.n. Vishwanathan

Abstract: We propose a family of kernels based on the Binet-Cauchy theorem and its extension to Fredholm operators. This includes as special cases all currently known kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. Many of these kernels can be seen as the extrema of a new continuum of kernel functions, which leads to numerous new special cases. As an application, we apply the new class of kernels to the problem of clustering of video sequences with encouraging results. 1

3 0.77838594 168 nips-2004-Semigroup Kernels on Finite Sets

Author: Marco Cuturi, Jean-philippe Vert

Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1

4 0.656304 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations

Author: Amnon Shashua, Tamir Hazan

Abstract: This paper presents a general family of algebraic positive definite similarity functions over spaces of matrices with varying column rank. The columns can represent local regions in an image (whereby images have varying number of local parts), images of an image sequence, motion trajectories in a multibody motion, and so forth. The family of set kernels we derive is based on a group invariant tensor product lifting with parameters that can be naturally tuned to provide a cook-book of sorts covering the possible ”wish lists” from similarity measures over sets of varying cardinality. We highlight the strengths of our approach by demonstrating the set kernels for visual recognition of pedestrians using local parts representations. 1

5 0.6548239 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

Author: Xiaojin Zhu, Jaz Kandola, Zoubin Ghahramani, John D. Lafferty

Abstract: We present an algorithm based on convex optimization for constructing kernels for semi-supervised learning. The kernel matrices are derived from the spectral decomposition of graph Laplacians, and combine labeled and unlabeled data in a systematic fashion. Unlike previous work using diffusion kernels and Gaussian random field kernels, a nonparametric kernel approach is presented that incorporates order constraints during optimization. This results in flexible kernels and avoids the need to choose among different parametric forms. Our approach relies on a quadratically constrained quadratic program (QCQP), and is computationally feasible for large datasets. We evaluate the kernels on real datasets using support vector machines, with encouraging results. 1

6 0.60127616 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

7 0.58785611 42 nips-2004-Computing regularization paths for learning multiple kernels

8 0.58685374 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition

9 0.55608487 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

10 0.49007267 59 nips-2004-Efficient Kernel Discriminant Analysis via QR Decomposition

11 0.4777123 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

12 0.45472154 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters

13 0.45374 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

14 0.45328572 177 nips-2004-Supervised Graph Inference

15 0.43208337 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

16 0.43090859 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

17 0.42302608 92 nips-2004-Kernel Methods for Implicit Surface Modeling

18 0.41742942 68 nips-2004-Face Detection --- Efficient and Rank Deficient

19 0.41606084 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

20 0.38934007 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.072), (15, 0.163), (26, 0.063), (31, 0.011), (32, 0.015), (33, 0.144), (35, 0.016), (39, 0.033), (50, 0.037), (87, 0.012), (92, 0.318), (94, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79463601 94 nips-2004-Kernels for Multi--task Learning

Author: Charles A. Micchelli, Massimiliano Pontil

Abstract: This paper provides a foundation for multi–task learning using reproducing kernel Hilbert spaces of vector–valued functions. In this setting, the kernel is a matrix–valued function. Some explicit examples will be described which go beyond our earlier results in [7]. In particular, we characterize classes of matrix– valued kernels which are linear and are of the dot product or the translation invariant type. We discuss how these kernels can be used to model relations between the tasks and present linear multi–task learning algorithms. Finally, we present a novel proof of the representer theorem for a minimizer of a regularization functional which is based on the notion of minimal norm interpolation. 1

2 0.5928756 178 nips-2004-Support Vector Classification with Input Data Uncertainty

Author: Jinbo Bi, Tong Zhang

Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1

3 0.59165639 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods

Author: Chakra Chennubhotla, Allan D. Jepson

Abstract: We show how to build hierarchical, reduced-rank representation for large stochastic matrices and use this representation to design an efficient algorithm for computing the largest eigenvalues, and the corresponding eigenvectors. In particular, the eigen problem is first solved at the coarsest level of the representation. The approximate eigen solution is then interpolated over successive levels of the hierarchy. A small number of power iterations are employed at each stage to correct the eigen solution. The typical speedups obtained by a Matlab implementation of our fast eigensolver over a standard sparse matrix eigensolver [13] are at least a factor of ten for large image sizes. The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. 1
The Nystrom method has also been recently applied in the kernel learning methods for fast Gaussian process classification and regression [25]. Other sampling-based approaches include the work reported in [1, 2, 11]. Our starting point is the transition matrix generated from affinity weights and we show how building a representational hierarchy follows naturally from considering the stochastic matrix. A closely related work is the paper by Lin on reduced rank approximations of transition matrices [14]. We differ in how we approximate the transition matrices, in particular our objective function is computationally less expensive to solve. In particular, one of our goals in reducing transition matrices is to develop a fast, specialized eigen solver for spectral clustering. Fast eigensolving is also the goal in ACE [12], where successive levels in the hierarchy can potentially have negative affinities. A graph coarsening process for clustering was also pursued in [21, 3]. 3 Markov Chain Terminology We first provide a brief overview of the Markov chain terminology here (for more details see [17, 15, 6]). We consider an undirected graph G = (V, E) with vertices vi , for i = {1, . . . , n}, and edges ei,j with non-negative weights ai,j . Here the weight ai,j represents the affinity between vertices vi and vj . The affinities are represented by a non-negative, symmetric n × n matrix A having weights ai,j as elements. The degree of a node j is n n defined to be: dj = i=1 ai,j = j=1 aj,i , where we define D = diag(d1 , . . . , dn ). A Markov chain is defined using these affinities by setting a transition probability matrix M = AD −1 , where the columns of M each sum to 1. The transition probability matrix defines the random walk of a particle on the graph G. The random walk need never be explicitly carried out; instead, it can be analytically expressed using the leading order eigenvectors, and eigenvalues, of the Markov transition matrix. Because the stochastic matrices need not be symmetric in general, a direct eigen decomposition step is not preferred for reasons of instability. This problem is easily circumvented by considering a normalized affinity matrix: L = D −1/2 AD−1/2 , which is related to the stochastic matrix by a similarity transformation: L = D −1/2 M D1/2 . Because L is symmetric, it can be diagonalized: L = U ΛU T , where U = [u1 , u2 , · · · , un ] is an orthogonal set of eigenvectors and Λ is a diagonal matrix of eigenvalues [λ1 , λ2 , · · · , λn ] sorted in decreasing order. The eigenvectors have unit length uk = 1 and from the form of A and D it can be shown that the eigenvalues λi ∈ (−1, 1], with at least one eigenvalue equal to one. Without loss of generality, we take λ1 = 1. Because L and M are similar we can perform an eigen decomposition of the Markov transition matrix as: M = D1/2 LD−1/2 = D1/2 U Λ U T D−1/2 . Thus an eigenvector u of L corresponds to an eigenvector D 1/2 u of M with the same eigenvalue λ. The Markovian relaxation process after β iterations, namely M β , can be represented as: M β = D1/2 U Λβ U T D−1/2 . Therefore, a particle undertaking a random walk with an initial distribution p 0 acquires after β steps a distribution p β given by: p β = M β p 0 . Assuming the graph is connected, as β → ∞, the Markov chain approaches a unique n stationary distribution given by π = diag(D)/ i=1 di , and thus, M ∞ = π1T , where 1 is a n-dim column vector of all ones. 
Observe that π is an eigenvector of M as it is easy to show that M π = π and the corresponding eigenvalue is 1. Next, we show how to generate hierarchical, successively low-ranked approximations for the transition matrix M . 4 Building a Hierarchy of Transition Matrices The goal is to generate a very fast approximation, while simultaneously achieving sufficient accuracy. For notational ease, we think of M as a fine-scale representation and M as some coarse-scale approximation to be derived here. By coarsening M further, we can generate successive levels of the representation hierarchy. We use the stationary distribution π to construct a corresponding coarse-scale stationary distribution δ. As we just discussed a critical property of the fine scale Markov matrix M is that it is similar to the symmetric matrix L and we wish to preserve this property at every level of the representation hierarchy. 4.1 Deriving Coarse-Scale Stationary Distribution We begin by expressing the stationary distribution π as a probabilistic mixture of latent distributions. In matrix notation, we have (1) π = K δ, where δ is an unknown mixture coefficient vector of length m, K is an n × m non-negative n kernel matrix whose columns are latent distributions that each sum to 1: i=1 Ki,j = 1 and m n. It is easy to derive a maximum likelihood approximation of δ using an EM type algorithm [16]. The main step is to find a stationary point δ, λ for the Lagrangian: m n i=1 m Ki,j δj + λ πi ln E≡− j=1 δj − 1 . (2) j=1 An implicit step in this EM procedure is to compute the the ownership probability r i,j of the j th kernel (or node) at the coarse scale for the ith node on the fine scale and is given by ri,j = δj Ki,j . m k=1 δk Ki,k (3) The EM procedure allows for an update of both δ and the latent distributions in the kernel matrix K (see §8.3.1 in [6]). For initialization, δ is taken to be uniform over the coarse-scale states. But in choosing kernels K, we provide a good initialization for the EM procedure. Specifically, the Markov matrix M is diffused using a small number of iterations to get M β . The diffusion causes random walks from neighboring nodes to be less distinguishable. This in turn helps us select a small number of columns of M β in a fast and greedy way to be the kernel matrix K. We defer the exact details on kernel selection to a later section (§4.3). 4.2 Deriving the Coarse-Scale Transition Matrix In order to define M , the coarse-scale transition matrix, we break it down into three steps. First, the Markov chain propagation at the coarse scale can be defined as: q k+1 = M q k , (4) where q is the coarse scale probability distribution after k steps of the random walk. Second, we expand q k into the fine scale using the kernels K resulting in a fine scale probability distribution p k : p k = Kq k . (5) k Finally, we lift p k back into the coarse scale by using the ownership probability of the j th kernel for the ith node on the fine grid: n qjk+1 = ri,j pik i=1 (6) Substituting for Eqs.(3) and (5) in Eq. 6 gives n m qjk+1 = i=1 n Ki,t qtk = ri,j t=1 i=1 δj Ki,j m k=1 δk Ki,k m Ki,t qtk . (7) t=1 We can write the preceding equation in a matrix form: q k+1 = diag( δ ) K T diag K δ −1 Kq k . (8) Comparing this with Eq. 4, we can derive the transition matrix M as: M = diag( δ ) K T diag K δ −1 (9) K. It is easy to see that δ = M δ, so δ is the stationary distribution for M . 
Following the definition of M , and its stationary distribution δ, we can generate a symmetric coarse scale affinity matrix A given by A = M diag(δ) = diag( δ ) K T diag K δ −1 Kdiag(δ) , (10) where we substitute for the expression M from Eq. 9. The coarse-scale affinity matrix A is then normalized to get: L = D−1/2 AD−1/2 ; D = diag(d1 , d2 , · · · , dm ), (11) where dj is the degree of node j in the coarse-scale graph represented by the matrix A (see §3 for degree definition). Thus, the coarse scale Markov matrix M is precisely similar to a symmetric matrix L. 4.3 Selecting Kernels For demonstration purpose, we present the kernel selection details on the image of an eye shown below. To begin with, a random walk is defined where each pixel in the test image is associated with a vertex of the graph G. The edges in G are defined by the standard 8-neighbourhood of each pixel. For the demonstrations in this paper, the edge weight ai,j between neighbouring pixels xi and xj is given by a function of the difference in the 2 corresponding intensities I(xi ) and I(xj ): ai,j = exp(−(I(xi ) − I(xj ))2 /2σa ), where σa is set according to the median absolute difference |I(xi ) − I(xj )| between neighbours measured over the entire image. The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The kernel selection process we use is fast and greedy. First, the fine scale Markov matrix M is diffused to M β using β = 4. The Markov matrix M is sparse as we make the affinity matrix A sparse. Every column in the diffused matrix M β is a potential kernel. To facilitate the selection process, the second step is to rank order the columns of M β based on a probability value in the stationary distribution π. Third, the kernels (i.e. columns of M β ) are picked in such a way that for a kernel Ki all of the neighbours of pixel i which are within the half-height of the the maximum value in the kernel Ki are suppressed from the selection process. Finally, the kernel selection is continued until every pixel in the image is within a half-height of the peak value of at least one kernel. If M is a full matrix, to avoid the expense of computing M β explicitly, random kernel centers can be selected, and only the corresponding columns of M β need be computed. We show results from a three-scale hierarchy on the eye image (below). The image has 25 × 20 pixels but is shown here enlarged for clarity. At the first coarse scale 83 kernels are picked. The kernels each correspond to a different column in the fine scale transition matrix and the pixels giving rise to these kernels are shown numbered on the image. Using these kernels as an initialization, the EM procedure derives a coarse-scale stationary distribution δ 21 14 26 4 (Eq. 2), while simultaneously updating the kernel ma12 27 2 19 trix. Using the newly updated kernel matrix K and the 5 8 13 23 30 18 6 9 derived stationary distribution δ a transition matrix M 28 20 15 32 10 22 is generated (Eq. 9). The coarse scale Markov matrix 24 17 7 is then diffused to M β , again using β = 4. The kernel Coarse Scale 1 Coarse Scale 2 selection algorithm is reapplied, this time picking 32 kernels for the second coarse scale. Larger values of β cause the coarser level to have fewer elements. But the exact number of elements depends on the form of the kernels themselves. 
For the random experiments that we describe later in §6 we found β = 2 in the first iteration and 4 thereafter causes the number of kernels to be reduced by a factor of roughly 1/3 to 1/4 at each level. 72 28 35 44 51 64 82 4 12 31 56 19 77 36 45 52 65 13 57 23 37 5 40 53 63 73 14 29 6 66 38 74 47 24 7 30 41 54 71 78 58 15 8 20 39 48 59 67 25 68 79 21 16 2 11 26 42 49 55 60 75 32 83 43 9 76 50 17 27 61 33 69 80 3 46 18 70 81 34 10 62 22 1 25 11 1 3 16 31 29 At coarser levels of the hierarchy, we expect the kernels to get less sparse and so will the affinity and the transition matrices. In order to promote sparsity at successive levels of the hierarchy we sparsify A by zeroing out elements associated with “small” transition probabilities in M . However, in the experiments described later in §6, we observe this sparsification step to be not critical. To summarize, we use the stationary distribution π at the fine-scale to derive a transition matrix M , and its stationary distribution δ, at the coarse-scale. The coarse scale transition in turn helps to derive an affinity matrix A and its normalized version L. It is obvious that this procedure can be repeated recursively. We describe next how to use this representation hierarchy for building a fast eigensolver. 5 Fast EigenSolver Our goal in generating a hierarchical representation of a transition matrix is to develop a fast, specialized eigen solver for spectral clustering. To this end, we perform a full eigen decomposition of the normalized affinity matrix only at the coarsest level. As discussed in the previous section, the affinity matrix at the coarsest level is not likely to be sparse, hence it will need a full (as opposed to a sparse) version of an eigen solver. However it is typically the case that e ≤ m n (even in the case of the three-scale hierarchy that we just considered) and hence we expect this step to be the least expensive computationally. The resulting eigenvectors are interpolated to the next lower level of the hierarchy by a process which will be described next. Because the eigen interpolation process between every adjacent pair of scales in the hierarchy is similar, we will assume we have access to the leading eigenvectors U (size: m × e) for the normalized affinity matrix L (size: m × m) and describe how to generate the leading eigenvectors U (size: n × e), and the leading eigenvalues S (size: e × 1), for the fine-scale normalized affinity matrix L (size: n × n). There are several steps to the eigen interpolation process and in the discussion that follows we refer to the lines in the pseudo-code presented below. First, the coarse-scale eigenvectors U can be interpolated using the kernel matrix K to generate U = K U , an approximation for the fine-scale eigenvectors (line 9). Second, interpolation alone is unlikely to set the directions of U exactly aligned with U L , the vectors one would obtain by a direct eigen decomposition of the fine-scale normalized affinity matrix L. We therefore update the directions in U by applying a small number of power iterations with L, as given in lines 13-15. e e function (U, S) = CoarseToFine(L, K, U , S) 1: INPUT 2: L, K ⇐ {L is n × n and K is n × m where m n} e e e e 3: U /S ⇐ {leading coarse-scale eigenvectors/eigenvalues of L. U is of size m × e, e ≤ m} 4: OUTPUT 5: U, S ⇐ {leading fine-scale eigenvectors/eigenvalues of L. 
U is n × e and S is e × 1.} x 10 0.4 3 0.96 0.94 0.92 0.9 0.35 2.5 Relative Error Absolute Relative Error 0.98 Eigen Value |δλ|λ−1 −3 Eigen Spectrum 1 2 1.5 1 5 10 15 20 Eigen Index (a) 25 30 0.2 0.15 0.1 0.5 0.88 0.3 0.25 0.05 5 10 15 20 Eigen Index (b) 25 30 5 10 15 20 Eigen Index 25 30 (c) Figure 1: Hierarchical eigensolver results. (a) comparing ground truth eigenvalues S L (red circles) with multi-scale eigensolver spectrum S (blue line) (b) Relative absolute error between eigenvalues: |S−SL | (c) Eigenvector mismatch: 1 − diag |U T UL | , between SL eigenvectors U derived by the multi-scale eigensolver and the ground truth U L . Observe the slight mismatch in the last few eigenvectors, but excellent agreement in the leading eigenvectors (see text). 6: CONSTANTS: TOL = 1e-4; POWER ITERS = 50 7: “ ” e 8: TPI = min POWER ITERS, log(e × eps/TOL)/ log(min(S)) {eps: machine accuracy} e 9: U = K U {interpolation from coarse to fine} 10: while not converged do 11: Uold = U {n × e matrix, e n} 12: for i = 1 to TPI do 13: U ⇐ LU 14: end for 15: U ⇐ Gram-Schmidt(U ) {orthogonalize U } 16: Le = U T LU {L may be sparse, but Le need not be.} 17: Ue Se UeT = svd(Le ) {eigenanalysis of Le , which is of size e × e.} 18: U ⇐ U Ue {update the leading eigenvectors of L} 19: S = diag(Se ) {grab the leading eigenvalues of L} T 20: innerProd = 1 − diag( Uold U ) {1 is a e × 1 vector of all ones} 21: converged = max[abs(innerProd)] < TOL 22: end while The number of power iterations TPI can be bounded as discussed next. Suppose v = U c where U is a matrix of true eigenvectors and c is a coefficient vector for an arbitrary vector v. After TPI power iterations v becomes v = U diag(S TPI )c, where S has the exact eigenvalues. In order for the component of a vector v in the direction Ue (the eth column of U ) not to be swamped by other components, we can limit it’s decay after TPI iterations as TPI follows: (S(e)/S(1)) >= e×eps/TOL, where S(e) is the exact eth eigenvalue, S(1) = 1, eps is the machine precision, TOL is requested accuracy. Because we do not have access to the exact value S(e) at the beginning of the interpolation procedure, we estimate it from the coarse eigenvalues S. This leads to a bound on the power iterations TPI, as derived on the line 9 above. Third, the interpolation process and the power iterations need not preserve orthogonality in the eigenvectors in U . We fix this by Gram-Schmidt orthogonalization procedure (line 16). Finally, there is a still a problem with power iterations that needs to be resolved, in that it is very hard to separate nearby eigenvalues. In particular, for the convergence of the power iterations the ratio that matters is between the (e + 1) st and eth eigenvalues. So the idea we pursue is to use the power iterations only to separate the reduced space of eigenvectors (of dimension e) from the orthogonal subspace (of dimension n − e). We then use a full SVD on the reduced space to update the leading eigenvectors U , and eigenvalues S, for the fine-scale (lines 17-20). This idea is similar to computing the Ritz values and Ritz vectors in a Rayleigh-Ritz method. 6 Interpolation Results Our multi-scale decomposition code is in Matlab. For the direct eigen decomposition, we have used the Matlab program svds.m which invokes the compiled ARPACKC routine [13], with a default convergence tolerance of 1e-10. In Fig. 
1a we compare the spectrum S obtained from a three-scale decomposition on the eye image (blue line) with the ground truth, which is the spectrum SL resulting from direct eigen decomposition of the fine-scale normalized affinity matrices L (red circles). There is an excellent agreement in the leading eigenvalues. To illustrate this, we show absolute relative error between the spectra: |S−SL | in Fig. 1b. The spectra agree mostly, except for SL the last few eigenvalues. For a quantitative comparison between the eigenvectors, we plot in Fig. 1c the following measure: 1 − diag(|U T UL |), where U is the matrix of eigenvectors obtained by the multi-scale approximation, UL is the ground-truth resulting from a direct eigen decomposition of the fine-scale affinity matrix L and 1 is a vector of all ones. The relative error plot demonstrates a close match, within the tolerance threshold of 1e-4 that we chose for the multi-scale method, in the leading eigenvector directions between the two methods. The relative error is high with the last few eigen vectors, which suggests that the power iterations have not clearly separated them from other directions. So, the strategy we suggest is to pad the required number of leading eigen basis by about 20% before invoking the multi-scale procedure. Obviously, the number of hierarchical stages for the multi-scale procedure must be chosen such that the transition matrix at the coarsest scale can accommodate the slight increase in the subspace dimensions. For lack of space we are omitting extra results (see Ch.8 in [6]). Next we measure the time the hierarchical eigensolver takes to compute the leading eigenbasis for various input sizes, in comparison with the svds.m procedure [13]. We form images of different input sizes by Gaussian smoothing of i.i.d noise. The Gaussian function has a standard deviation of 3 pixels. The edges in graph G are defined by the standard 8-neighbourhood of each pixel. The edge weights between neighbouring pixels are simply given by a function of the difference in the corresponding intensities (see §4.3). The affinity matrix A with the edge weights is then used to generate a Markov transition matrix M . The fast eigensolver is run on ten different instances of the input image of a given size and the average of these times is reported here. For a fair comparison between the two procedures, we set the convergence tolerance value for the svds.m procedure to be 1e-4, the same as the one used for the fast eigensolver. We found the hierarchical representation derived from this tolerance threshold to be sufficiently accurate for a novel min-cut based segmentation results that we reported in [8]. Also, the subspace dimensionality is fixed to be 51 where we expect (and indeed observe) the leading 40 eigenpairs derived from the multi-scale procedure to be accurate. Hence, while invoking svds.m we compute only the leading 41 eigenpairs. In the table shown below, the first column corresponds to the number of nodes in the graph, while the second and third columns report the time taken in seconds by the svds.m procedure and the Matlab implementation of the multi-scale eigensolver respectively. The fourth column reports the speedups of the multi-scale eigensolver over svds.m procedure on a standard desktop (Intel P4, 2.5GHz, 1GB RAM). Lowering the tolerance threshold for svds.m made it faster by about 20 − 30%. Despite this, the multi-scale algorithm clearly outperforms the svds.m procedure. 
The most expensive step in the multi-scale algorithm is the power iteration required in the last stage, that is, interpolating eigenvectors from the first coarse scale to the required fine scale. The complexity is of the order of n × e, where e is the subspace dimensionality and n is the size of the graph. Indeed, from the table we can see that the multi-scale procedure takes time roughly proportional to n. Deviations from the linear trend are observed at specific values of n, which we believe are due to variations in the difficulty of the specific eigenvalue problem (e.g., nearly multiple eigenvalues); see the table below.

n | svds.m (s) | Multi-Scale (s) | Speedup
32² | 1.6 | 1.5 | 1.1
63² | 10.8 | 4.9 | 2.2
64² | 20.5 | 5.5 | 3.7
65² | 12.6 | 5.1 | 2.5
100² | 44.2 | 13.1 | 3.4
127² | 91.1 | 20.4 | 4.5
128² | 230.9 | 35.2 | 6.6
129² | 96.9 | 20.9 | 4.6
160² | 179.3 | 34.4 | 5.2
255² | 819.2 | 90.3 | 9.1
256² | 2170.8 | 188.7 | 11.5
257² | 871.7 | 93.3 | 9.3
511² | 7977.2 | 458.8 | 17.4
512² | 20269 | 739.3 | 27.4
513² | 7887.2 | 461.9 | 17.1
600² | 10841.4 | 644.2 | 16.8
700² | 15048.8 | 1162.4 | 12.9
800² | – | 1936.6 | –

The hierarchical representation has proven to be effective in a min-cut based segmentation algorithm that we proposed recently [8]. There we explored the use of random walks and associated spectral embedding techniques for the automatic generation of suitable proposal (source and sink) regions for a min-cut based algorithm. The multi-scale algorithm was used to generate the 40 leading eigenvectors of large transition matrices (e.g., of size 20K × 20K). In terms of future work, it will be useful to compare our work with other approximate methods for SVD such as [23].

Acknowledgements: We thank S. Roweis, F. Estrada and M. Sakr for valuable comments.

References
[1] D. Achlioptas and F. McSherry. Fast Computation of Low-Rank Approximations. STOC, 2001.
[2] D. Achlioptas et al. Sampling Techniques for Kernel Methods. NIPS, 2001.
[3] S. Barnard and H. Simon. Fast Multilevel Implementation of Recursive Spectral Bisection for Partitioning Unstructured Problems. PPSC, 627–632.
[4] M. Belkin et al. Laplacian Eigenmaps and Spectral Techniques for Embedding. NIPS, 2001.
[5] M. Brand et al. A unifying theorem for spectral embedding and clustering. AI & STATS, 2002.
[6] C. Chennubhotla. Spectral Methods for Multi-scale Feature Extraction and Spectral Clustering. Ph.D. Thesis, Department of Computer Science, University of Toronto, Canada, 2004. http://www.cs.toronto.edu/˜chakra/thesis.pdf
[7] C. Chennubhotla and A. Jepson. Half-Lives of EigenFlows for Spectral Clustering. NIPS, 2002.
[8] F. Estrada, A. Jepson and C. Chennubhotla. Spectral Embedding and Min-Cut for Image Segmentation. Manuscript under review, 2004.
[9] C. Fowlkes et al. Efficient spatiotemporal grouping using the Nystrom method. CVPR, 2001.
[10] S. Belongie et al. Spectral Partitioning with Indefinite Kernels using Nystrom app. ECCV, 2002.
[11] A. Frieze et al. Fast Monte-Carlo Algorithms for finding low-rank approximations. FOCS, 1998.
[12] Y. Koren et al. ACE: A Fast Multiscale Eigenvectors Computation for Drawing Huge Graphs. IEEE Symp. on InfoVis, 2002, pp. 137–144.
[13] R. B. Lehoucq, D. C. Sorensen and C. Yang. ARPACK Users' Guide: Solution of Large Scale Eigenvalue Problems by Implicitly Restarted Arnoldi Methods. SIAM, 1998.
[14] J. J. Lin. Reduced Rank Approximations of Transition Matrices. AI & STATS, 2002.
[15] L. Lovász. Random Walks on Graphs: A Survey. Combinatorics, 1996, 353–398.
[16] G. J. McLachlan et al. Mixture Models: Inference and Applications to Clustering. 1988.
[17] M. Meila and J. Shi. A random walks view of spectral segmentation. AI & STATS, 2001.
[18] A. Ng, M. Jordan and Y. Weiss. On Spectral Clustering: analysis and an algorithm. NIPS, 2001.
[19] A. Pothen. Graph partitioning algorithms with applications to scientific computing. Parallel Numerical Algorithms, D. E. Keyes et al. (eds.), Kluwer Academic Press, 1996.
[20] G. L. Scott et al. Feature grouping by relocalization of eigenvectors of the proximity matrix. BMVC, pp. 103–108, 1990.
[21] E. Sharon et al. Fast Multiscale Image Segmentation. CVPR, I:70–77, 2000.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. PAMI, August 2000.
[23] H. Simon et al. Low-Rank Matrix Approximation Using the Lanczos Bidiagonalization Process with Applications. SIAM J. of Sci. Comp., 21(6):2257–2274, 2000.
[24] N. Tishby et al. Data clustering by Markovian Relaxation. NIPS, 2001.
[25] C. Williams et al. Using the Nystrom method to speed up the kernel machines. NIPS, 2001.

4 0.59133786 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.
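The abstract does not spell out the estimator, so purely as an illustration of what a moment-based estimate of a label-sampling mechanism can look like (our own toy construction, not necessarily the authors' method), consider a binary response y where each example is labeled with probability q0 if y = 0 and q1 if y = 1, independently of x given y. Matching empirical moments of a feature g(x) over the labeled set and the full sample gives a 2 × 2 linear system for (1/q0, 1/q1):

import numpy as np

def estimate_label_probs(x_all, labeled_mask, y_labeled, g=lambda t: t):
    """x_all: (n,) features; labeled_mask: (n,) bool; y_labeled: labels for
    x_all[labeled_mask], in that order. Returns estimated (q0, q1)."""
    n = len(x_all)
    gx = g(np.asarray(x_all, dtype=float))
    x_lab, y_lab = gx[labeled_mask], np.asarray(y_labeled)
    a = np.array([np.sum(y_lab == 0), np.sum(y_lab == 1)]) / n        # ~ q_j P(y=j)
    b = np.array([x_lab[y_lab == 0].sum(), x_lab[y_lab == 1].sum()]) / n  # ~ q_j E[g(x) 1{y=j}]
    c = gx.mean()                                                     # ~ E[g(x)]
    # Unknowns u_j = 1/q_j satisfy: a0*u0 + a1*u1 = 1 and b0*u0 + b1*u1 = c.
    u = np.linalg.solve(np.array([[a[0], a[1]], [b[0], b[1]]]), np.array([1.0, c]))
    q = np.clip(1.0 / u, 1e-6, 1.0)
    return float(q[0]), float(q[1])

# Quick check on synthetic data where the mechanism is known.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=200000)
x = rng.normal(loc=y, scale=1.0)              # g(x) = x has a different mean per class
q_true = np.where(y == 1, 0.1, 0.5)           # labeling favours y = 0
labeled = rng.random(len(y)) < q_true
print(estimate_label_probs(x, labeled, y[labeled]))   # should be close to (0.5, 0.1)

This only requires that E[g(x) | y = 0] differs from E[g(x) | y = 1], which is the sense in which the unlabeled data carry information about the sampling mechanism.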

5 0.59048152 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

Author: Peter Sollich, Christopher Williams

Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xᵢ, yᵢ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x∗ is given by

f̄(x∗) = kᵀ(x∗) (K + σ²I)⁻¹ y,   (1)

where K denotes the n × n matrix of covariances between the training points with entries k(xᵢ, xⱼ), k(x∗) is the vector of covariances k(xᵢ, x∗), σ² is the noise variance on the observations and y is an n × 1 vector holding the training targets. See e.g. [2] for further details. We can define a vector of functions h(x∗) = (K + σ²I)⁻¹ k(x∗). Thus we have f̄(x∗) = hᵀ(x∗) y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother; see [3, section 2.8] for further details. For a fixed test point x∗, h(x∗) gives the vector of weights applied to the targets y. Silverman [1] called hᵀ(x∗) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ²I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation, one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing, Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: in section 1 we describe how to derive the equivalent kernel in Fourier space; section 2 derives approximations for the EK for the squared exponential and other kernels; in section 3 we show how to use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK.

1 Gaussian Process Regression and the Equivalent Kernel
It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional

J[f] = ½ ‖f‖²_H + (1/2σ_n²) Σ_{i=1}^n (yᵢ − f(xᵢ))²,   (2)

where ‖f‖_H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f(x))²] = E[(y − η(x))²] + (η(x) − f(x))². The fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2,

Jρ[f] = (ρ/2σ²) ∫ (η(x) − f(x))² dx + ½ ‖f‖²_H,

where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc., as appropriate). If we consider kernels that are stationary, k(x, x′) = k(x − x′), the natural basis in which to analyse equation 1 is the Fourier basis of complex sinusoids, so that f(x) is represented as ∫ f̃(s) e^{2πi s·x} ds and similarly for η(x).
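Before turning to the Fourier-space derivation that follows, a small numerical illustration of the weight function h(x∗) defined above (a sketch with toy targets; ℓ² = 0.004, σ² = 0.1 and ρ = 100 follow the parameter values used later for Figure 1, the rest is our own choice):

import numpy as np

def se_kernel(a, b, ell2=0.004):
    """Squared exponential kernel in 1-D, k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell2)

rng = np.random.default_rng(0)
n, sigma2 = 100, 0.1                       # rho = 100 points per unit length
x = rng.uniform(-0.5, 0.5, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=np.sqrt(sigma2), size=n)   # arbitrary targets

K = se_kernel(x, x)
x_star = np.array([0.0])
k_star = se_kernel(x, x_star)[:, 0]

# Weight function h(x*) = (K + sigma^2 I)^{-1} k(x*): solve rather than form the inverse.
h = np.linalg.solve(K + sigma2 * np.eye(n), k_star)
f_bar = h @ y                              # predictive mean as a linear combination of targets
print(f_bar)                               # equals k(x*)^T (K + sigma^2 I)^{-1} y, as in eq. (1)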
Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. 
If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. 
Note that in theory one should use a grid over the whole of RD but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects. 2 To understand this scaling of σgrid consider the case where ρgrid > ρ which means that the effective variance at each of the ρgrid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there 2 are ρgrid /ρ independent Gaussian observations with variance σgrid at a single x-point; this would 2 be equivalent to one Gaussian observation with variance σ . In effect the ρ observations per unit x-space have been smoothed out uniformly. 1 0.16 0.35 0.35 Numerical Proposition 1 Proposition 2 0.3 0.25 0.14 0.2 0.2 0.15 0.12 0.15 0.1 0.1 0.05 0.1 0.05 0 0 −0.05 0.08 Numerical Proposition 1 Proposition 2 0.3 0.25 −0.05 −0.1 0 5 10 −0.1 0 15 5 10 15 0.06 0.04 0.02 0 −0.02 Numerical Proposition 1 Sample −0.04 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 104 (right). Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0.004 and σ 2 = 0.1, with ρgrid /ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 104 (a = 9.67, right). As expected, the addition of the 1/a2 -correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/ log(ρ). The is larger than the na¨ve estimate (1/a2 )2 ∼ 1/(log(ρ))4 based on the first correction term from Proposition ı 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down. 2.2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via σ 2 /ρ = S(sc ), so that ˜ h(s) = [1 + S(sc )/S(s)]−1 . The EK will then have the limiting form given in Proposi˜ tion 1 if h(s) approaches a step function Θ(sc − s), i.e. if it becomes infinitely “steep” around the point s = sc for sc → ∞. 
A quantitative criterion for this is that the slope ˜ |h (sc )| should become much larger than 1/sc , the inverse of the range of the step func˜ tion. Since h (s) = S (s)S(sc )S −2 (s)[1 + S(sc )/S(s)]−2 , this is equivalent to requiring that −sc S (sc )/4S(sc ) ∝ −d log S(sc )/d log sc must diverge for sc → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2 ; the asymptotic behaviour of sc is then governed by the smallest . A less obvious case is the “rational quadratic” k(r) = [1 + (r/l)2 ]−(D+1)/2 which has an exponentially decaying power spectrum S(s) ∝ exp(−2π s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ ).) Proposition 1 then applies, with the width of the EK now scaling as 1/sc ∝ 1/ log(ρ). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales , k(r) = exp(−r 2 /2 2 )p( ) d . This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, §2.10]. Such a kernel has power spectrum ∞ S(s) = (2π)D/2 D exp(−2π 2 2 s2 )p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2π 0 s), is obtained for p( ) ∝ −D−2 exp(− 2 /2 2 ). More generally, because the exponential 0 1/s D factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) ∼ 0 p( ) d for large s. This will decay more strongly than any power of 1/s for s → ∞ if p( ) itself decreases more strongly than any power of for → 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK (x) = h(x − x )y(x )dx in the continuum limit of “smoothed out” input locations. We assume as before 2 that y = target + noise, i.e. y(x) = η(x) + ν(x) where E[ν(x)ν(x )] = (σ∗ /ρ)δ(x − x ). 2 Here σ∗ denotes the true noise variance, as opposed to the noise variance assumed in the 2 EK; the scaling of σ∗ with ρ is explained in footnote 1. For a fixed target η, the MSE is = ( dx)−1 [η(x) − fEK (x)]2 dx. Averaging over the noise process ν and target function η gives in Fourier space 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 ds [1 + σ 2 /(ρS(s))]2 (7) where Sη (s) is the power spectrum of the prior over target functions. In the case S(s) = 2 Sη (s) and σ 2 = σ∗ where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (σ 2 /ρ) [1 + σ 2 /(ρS(s))]−1 ds (see also [5, eq. 14-16]). 
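Before the paper moves on to learning curves in section 3, here is a sketch of the grid-based EK computation described in section 2.1 together with the D = 1 sinc form of Proposition 1. The parameter values follow Figure 1 of the paper (ℓ² = 0.004, σ² = 0.1, ρ = 100, ρ_grid/ρ = 5/3); the code itself is our own and only approximates the procedure described above.

import numpy as np

ell2, sigma2, rho = 0.004, 0.1, 100.0
rho_grid = rho * 5.0 / 3.0
grid = np.arange(-1.5, 1.5, 1.0 / rho_grid)           # interval [-3/2, 3/2] to reduce edge effects
n = len(grid)

K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / ell2)
sigma2_grid = sigma2 * rho_grid / rho                  # effective noise variance on the grid
smoother = K @ np.linalg.inv(K + sigma2_grid * np.eye(n))

i0 = np.argmin(np.abs(grid))                           # row of the smoother at x = 0
ek_numeric = rho_grid * smoother[i0]                   # rescale the row by the grid density

# Proposition 1 in D = 1: h(x) ~ 2 sc sinc(2 pi sc x), with the cutoff sc defined by
# b exp(2 pi^2 ell^2 sc^2) = 1 and b = sigma^2 / (rho (2 pi ell^2)^{1/2}).
b = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))
sc = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell2))
ek_sinc = 2 * sc * np.sinc(2 * sc * grid)              # np.sinc(t) = sin(pi t) / (pi t)

# The two curves agree closely near x = 0; the sidelobes differ somewhat (cf. Fig. 1).
print(np.max(np.abs(ek_numeric - ek_sinc)))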
Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] = σ2 2 ˜ ˜ Sη (s)[1 − h(s)]2 + (σ∗ /ρ)h2 (s) ds = ρ α=2 0.5 0.03 0.025 ε 0.02 0.015 0.01 α=4 0.1 0.005 0 −0.005 1 0.05 1 0.5 0.5 0 0 −0.5 −0.5 −1 25 −1 50 100 ρ 250 500 Figure 2: Left: plot of the asymptotic form of the EK (sc /r)J1 (2πsc r) for D = 2 and ρ = 1225. Right: log-log plot of against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc , which is justified for large ρ, one obtains for an SE target and learner ≈ σ 2 sc /ρ ∝ (log(ρ))D/2 /ρ. To get analogous predictions for the mismatched case, one can write equation 7 as = 2 σ∗ ρ [1 + σ 2 /(ρS(s))] − σ 2 /(ρS(s)) ds + [1 + σ 2 /(ρS(s))]2 Sη (s) ds. [S(s)ρ/σ 2 + 1]2 2 The first integral is smaller than (σ∗ /σ 2 ) B and can be neglected as long as B . In the second integral we can again make the cutoff approximation—though now with s having ∞ to be above sc – to get the scaling ∝ sc sD−1 Sη (s) ds. For target functions with a power-law decay Sη (s) ∝ s−α of the power spectrum at large s this predicts ∝ sD−α ∝ c (log(ρ))(D−α)/2 . So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ∼ (log(ρ)) −1/2 , and for the Matern-class covariance function k(r) = (1 + r/ ) exp(−r/ ) (which has power spectrum ∝ (3/ 2 + 4π 2 s2 )−2 , so α = 4) we get ∼ (log(ρ))−3/2 . These predictions were tested experimentally using a GP learner with SE covariance function ( = 0.1 and assumed noise level σ 2 = 0.1) against targets from the OU and Matern-class priors (with 2 = 0.05) and with noise level σ∗ = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of on log(ρ), in Figure 2(right) we make a log-log plot of against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ. 3.1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ−1 i h(x∗ − xi )yi at x∗ . Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting · denote averaging over noise, training input points and the test point and setting fη (x∗ ) = h(x, x∗ )η(x)dx, the average MSE of the EK predictor is pred = [η(x) − (1/ρ) i h(x, xi )yi ]2 = [η(x) − fη (x)]2 + = σ2 ρ 2 σ∗ ρ h2 (x, x )dx + 1 ρ h2 (x, x )η 2 (x )dx − 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 η2 ds + 2 /(ρS(s))]2 [1 + σ ρ 1 ρ 2 fη (x) ds [1 + σ 2 /(ρS(s))]2 Here we have set η 2 = ( dx)−1 η 2 (x) dx = Sη (s) ds for the spatial average of the 2 squared target amplitude. 
Taking the matched case (S_η(s) = S(s) and σ∗² = σ²) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ² s_c^D/ρ, while the second one is ∼ η² s_c^D/ρ. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal-to-noise ratio η²/σ², which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x∗) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets.

In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner's covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression, as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves.

Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References
[1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984.
[2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998.
[3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.
[4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995.
[5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1991. Third edition.
[6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996.
[7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985.
[8] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999.
[10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999.
[11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002.
[12] P. Sollich. In NIPS 14, pages 519–526, 2002.
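To make the comparison of section 3.1 above concrete, a small sketch (our own toy setup, not the paper's experiments) of kernel regression with the analytic sinc EK, ρ⁻¹ Σᵢ h(x∗ − xᵢ) yᵢ, against full GP prediction:

import numpy as np

ell2, sigma2, rho, n = 0.004, 0.1, 200, 200          # density rho over a unit interval
rng = np.random.default_rng(1)
x = rng.uniform(-0.5, 0.5, size=n)
eta = lambda t: np.sin(2 * np.pi * t)                # arbitrary smooth target
y = eta(x) + rng.normal(scale=np.sqrt(sigma2), size=n)
x_test = np.linspace(-0.3, 0.3, 61)                  # stay away from the edges

k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell2)
f_gp = k(x_test, x) @ np.linalg.solve(k(x, x) + sigma2 * np.eye(n), y)

bconst = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))  # cutoff s_c as in Proposition 1 (D = 1)
sc = np.sqrt(np.log(1.0 / bconst) / (2 * np.pi ** 2 * ell2))
h = lambda r: 2 * sc * np.sinc(2 * sc * r)           # sinc EK; np.sinc(t) = sin(pi t)/(pi t)
f_ek = (h(x_test[:, None] - x[None, :]) @ y) / rho   # EK predictor

mse = lambda f: np.mean((f - eta(x_test)) ** 2)
print(mse(f_gp), mse(f_ek))                          # EK predictor is typically noticeably worse

This matches the qualitative conclusion drawn above: the EK predictor is a cheap linear smoother, but it gives up a roughly constant factor in MSE relative to the full GP prediction.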

6 0.59041905 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

7 0.58987898 19 nips-2004-An Application of Boosting to Graph Classification

8 0.58946818 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

9 0.5888589 68 nips-2004-Face Detection --- Efficient and Rank Deficient

10 0.58733374 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

11 0.58666426 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.58626604 148 nips-2004-Probabilistic Computation in Spiking Populations

13 0.5861963 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

14 0.58510649 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

15 0.58474004 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform

16 0.58417237 131 nips-2004-Non-Local Manifold Tangent Learning

17 0.5833444 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

18 0.58163637 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

19 0.58019435 197 nips-2004-Two-Dimensional Linear Discriminant Analysis

20 0.57965958 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach