nips nips2004 nips2004-60 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract The computation and memory required for kernel machines with N training samples is at least O(N 2 ). [sent-3, score-0.195]
2 Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. [sent-4, score-0.088]
3 We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). [sent-5, score-0.43]
4 We also give an error bound for the approximation, and provide experimental results on the UCI datasets. [sent-6, score-0.101]
5 1 Introduction Kernel based methods, including support vector machines [16], regularization networks [5] and Gaussian processes [18], have attracted much attention in machine learning. [sent-7, score-0.084]
6 The solid theoretical foundations and good practical performance of kernel methods make them very popular. [sent-8, score-0.141]
7 However one major drawback of the kernel methods is their scalability. [sent-9, score-0.141]
8 Kernel methods require O(N 2 ) storage and O(N 3 ) operations for direct methods, or O(N 2 ) operations per iteration for iterative methods, which is impractical for large datasets. [sent-10, score-0.254]
9 To deal with this scalability problem, many approaches have been proposed, including the Nystr¨ m method [19], sparse greedy approximation [13, 12], low rank kernel approximao tion [3] and reduced support vector machines [9]. [sent-11, score-0.379]
10 All these try to find a reduced subset of the original dataset using either random selection or greedy approximation. [sent-12, score-0.081]
11 In these methods there is no guarantee on the approximation of the kernel matrix in a deterministic sense. [sent-13, score-0.21]
12 An assumption made in these methods is that most eigenvalues of the kernel matrix are zero. [sent-14, score-0.17]
13 We explore a deterministic method to speed up kernel machines using the improved fast Gauss transform (IFGT) [20, 21]. [sent-16, score-0.585]
14 The kernel machine is solved iteratively using the conjugate gradient method, where the dominant computation is the matrix-vector product which we accelerate using the IFGT. [sent-17, score-0.189]
15 Rather than approximating the kernel matrix by a low-rank representation, we approximate the matrix-vector product by the improved fast Gauss transform to any desired precision. [sent-18, score-0.56]
16 The total computational and storage costs are of linear order in the size of the dataset. [sent-19, score-0.102]
17 We present the application of the IFGT to kernel methods in the context of the Regularized Least-Squares Classification (RLSC) [11, 10], though the approach is general and can be extended to other kernel methods. [sent-20, score-0.282]
18 Based on the Representer Theorem [17], the solution has a representation as N fλ (x) = ci K(x, xi ). [sent-22, score-0.124]
19 There are many choices for the kernel function K. [sent-35, score-0.141]
20 The Gaussian is a good kernel for classification and is used in many applications. [sent-36, score-0.141]
21 If a Gaussian kernel is applied, as shown in [10], the classification problem can be solved by the solution of a linear system, i. [sent-37, score-0.141]
22 A direct solution of the linear system will require O(N 3 ) computation and O(N 2 ) storage, which is impractical even for problems of moderate size. [sent-40, score-0.142]
23 Find the solution as f (x) = i=1 ci K(x, xi ), where c satisfies the linear system (3). [sent-45, score-0.124]
24 Since the matrix K is symmetric, we consider the well-known conjugate gradient method. [sent-49, score-0.077]
25 So the computational complexity is O(N 2 ) for low-rank K and the storage requirement is O(N 2 ). [sent-54, score-0.142]
26 While this represents an improvement for most problems, the rank of the matrix may not be small, and moreover the quadratic storage and computational complexity are still too high for large datasets. [sent-55, score-0.204]
27 In the following sections, we present an algorithm to reduce the computational and storage complexity to linear order. [sent-56, score-0.142]
28 Direct evaluation of the Gauss j=1 transform at M target points due to N sources requires O(M N ) operations. [sent-59, score-0.337]
29 The Fast Gauss Transform (FGT) was invented by Greengard and Strain [8] for efficient evaluation of the Gauss transform in O(M + N ) operations. [sent-60, score-0.186]
30 The expansion of the univariate Gaussian is p−1 e− yj −xi 2 /σ 2 = 1 n! [sent-63, score-0.315]
31 n=0 xi − x ∗ σ n hn yj − x ∗ σ + (p), (5) n 2 d where hn (x) are the Hermite functions defined by hn (x) = (−1)n dxn e−x , and x∗ is the expansion center. [sent-64, score-0.485]
32 The sum (4) is then equal to the Hermite expansion about center x∗ : G(yj ) = C α hα α≥0 yj − x ∗ h , Cα = 1 α! [sent-81, score-0.321]
33 If we truncate each of the Hermite series (6) after p terms (or equivalently order p − 1), then each of the coefficients Cα is a d-dimensional matrix with pd terms. [sent-84, score-0.163]
34 The total computational complexity for a single Hermite expansion is O((M + N )pd ). [sent-85, score-0.122]
35 In practice a single expansion about one center is not always valid or accurate over the entire domain. [sent-88, score-0.122]
36 The original FGT subdivides space into uniform boxes, which is simple, but highly inefficient in higher dimensions. [sent-90, score-0.075]
37 The number of boxes grows exponentially with dimensionality, which makes it inefficient for storage and for searching nonempty neighbor boxes. [sent-91, score-0.176]
38 We introduced the improved FGT [20, 21] to address these deficiencies, and it is summarized below. [sent-98, score-0.094]
39 1 Multivariate Taylor Expansions Instead of expanding the Gaussian into Hermite functions, we factorize it as e− yj −xi 2 /σ 2 = e− 2 ∆yj /σ 2 e− ∆xi 2 /σ 2 2 e2∆yj ·∆xi /σ , (7) where x∗ is the center of the sources, ∆yj = yj − x∗ , ∆xi = xi − x∗ . [sent-101, score-0.534]
40 The first two exponential terms can be evaluated individually at the source points or target points. [sent-102, score-0.131]
41 Here we break the entanglement by expanding it into a multivariate Taylor series ∞ 2 e2∆yj ·∆xi /σ = ∆xi ∆yj · σ σ 2n n=0 n = |α|≥0 2|α| α! [sent-104, score-0.12]
42 (8) If we truncate the series after total order p − 1, then the number of terms is rp−1,d = p+d−1 which is much less than pd in higher dimensions. [sent-106, score-0.134]
43 For d = 12 and p = 10, the d original FGT needs 1012 terms, while the multivariate Taylor expansion needs only 293930. [sent-107, score-0.142]
44 (7) and (8), the weighted sum of Gaussians (4) can be expressed as a multivariate Taylor expansions about center x∗ : C α e− G(yj ) = yj −x∗ 2 /σ 2 |α|≥0 yj − x ∗ σ α xi − x ∗ σ α , (9) . [sent-110, score-0.648]
45 N c i e− xi −x∗ 2 /σ 2 i=1 The coefficients Cα can be efficiently evaluate with rnd storage and rnd − 1 multiplications using the multivariate Horner’s rule [20]. [sent-112, score-0.324]
46 Based on these consideration, we model the space subdivision task as a k-center problem [1]: given a set of N points and a predefined number of clusters k, find a partition of the points into clusters S1 , . [sent-117, score-0.22]
47 , ck , that minimizes the maximum radius of any cluster: max max v − ci . [sent-123, score-0.137]
48 Initially, pick an arbitrary point v 0 as the center of the first cluster and add it to the center set C. [sent-126, score-0.115]
49 The direct implementation of farthest-point clustering has running time O(N k). [sent-140, score-0.139]
50 In practice, we used circular lists to index the points and achieve the complexity O(N log k) empirically. [sent-142, score-0.093]
51 3 The Algorithm and Error Bound The improved fast Gauss transform consists of the following steps: Algorithm 2 Improved Fast Gauss Transform 1. [sent-145, score-0.39]
52 Assign N sources into k clusters using the farthest-point clustering algorithm such that the radius is less than σρx . [sent-146, score-0.178]
53 Choose p sufficiently large such that the error estimate (11) is less than the desired precision . [sent-148, score-0.082]
54 For each cluster Sk with center ck , compute the coefficients given by (10). [sent-150, score-0.105]
55 Repeat for each target yj , find its neighbor clusters whose centers lie within the range σρy . [sent-152, score-0.397]
56 The amount of work required in step 3 is of O(N rpd ). [sent-155, score-0.073]
57 The work required in step 4 is O(M n rpd ), where n ≤ k is the maximum number of neighbor clusters for each target. [sent-156, score-0.149]
58 So, the improved fast Gauss transform achieves linear running time. [sent-157, score-0.43]
59 The algorithm needs to store the k coefficients of size rpd , so the storage complexity is reduced to O(Krpd ). [sent-158, score-0.245]
60 To verify the linear order of our algorithm, we generate N source points and N target points in 4, 6, 8, 10 dimensional unit hypercubes using a uniform distribution. [sent-159, score-0.253]
61 The weights on the source points are generated from a uniform distribution in the interval [0, 1] and σ = 1. [sent-160, score-0.118]
62 The results of the IFGT and the direct evaluation are displayed in Figure 1(a), (b), and confirm the linear order of the IFGT. [sent-161, score-0.1]
63 The error of the improved fast Gauss transform (2) is bounded by N |E(G(yj ))| ≤ |ci | i=1 2 2p p p ρ ρ + e−(ρy −ρx ) p! [sent-162, score-0.444]
64 The comparison between the maximum absolute errors in the simulation and the estimated error bound (11) is displayed in Figure 1(c) and (d). [sent-165, score-0.128]
65 It shows that the error bound is very conservative compared with the real errors. [sent-166, score-0.101]
66 The general Fast Multipole Methods (FMM) seek to analytically approximate the possibly full-rank matrix as a sum of low rank approximations with a tight error bound [14] (The FGT is a variant of the FMM with Gaussian kernel). [sent-170, score-0.163]
67 The problems to which kernel methods are usually applied are in higher dimensions, though the intrinsic dimensionality of the data is expected to be much smaller. [sent-172, score-0.185]
68 The improved FGT uses new data structures and a modified expansion to reduce this to polynomial order. [sent-175, score-0.176]
69 Despite this improvement, at first glance, even with the use of the IFGT, it is not clear if the reduction in complexity will be competitive with the other approaches proposed. [sent-176, score-0.087]
70 8 (d) Figure 1: (a) Running time and (b) maximum absolute error w. [sent-183, score-0.081]
71 The comparison between the real maximum absolute errors and the estimated error bound (11) w. [sent-187, score-0.128]
72 (c) the order of the Taylor series p, and (d) the radius of the farthest-point clustering algorithm r x = σρx . [sent-190, score-0.112]
73 The uniformly distributed sources and target points are in 4-dimension. [sent-191, score-0.151]
74 for hope is provided by the fact that in high dimensions we expect that the IFGT with very low order expansions will converge rapidly (because of the sharply vanishing exponential terms multiplying the expansion in factorization (7). [sent-192, score-0.166]
75 Thus we expect that combined with a dimensionality reduction technique, we can achieve very competitive solutions. [sent-193, score-0.091]
76 While dimensionality reduction would be desirable, here we do not perform such a reduction for fair comparison. [sent-195, score-0.165]
77 We use small order expansions (p = 1 and p = 2) in the IFGT and run the iterative solver. [sent-196, score-0.084]
78 We generate N source points and N target points in 100 dimensional unit hypercubes using a uniform distribution. [sent-200, score-0.253]
79 The weights on the source points are generated using a uniform distribution in the interval [0, 1]. [sent-201, score-0.118]
80 Then we estimate it using the improved fast Gauss transform and Nystrom method. [sent-204, score-0.39]
81 To compare the results, we use the maximum relative error to measure the precision of the approximations. [sent-205, score-0.082]
82 5%, we use the error bound (11) to find the para¨ meters of the IFGT, and use a trial and error method to find the parameter of the Nystr om method. [sent-207, score-0.197]
83 We also fix the number of points to N = 1000 and vary the size of centers (or random subset) k from 10 to 1000 and plot the results in Figure 2 (b). [sent-210, score-0.135]
84 The results show that the errors of the IFGT are not sensitive to the number of the centers, which means we can use very a ¨ small number of centers to achieve a good approximation. [sent-211, score-0.082]
85 The accuracy of the Nystr om method catches up at large k, where the direct evaluation may be even faster. [sent-212, score-0.179]
86 The intuition is that the use of expansions improves the accuracy of the approximation and relaxes the requirement of the centers. [sent-213, score-0.161]
87 Table 1: Ten-fold training and testing accuracy in percentage and training time in seconds using the 2 four classifiers on the five UCI datasets. [sent-223, score-0.069]
88 A rectangular kernel matrix with random subset size of 20% of N was used in PSVM on Galaxy Dim and Mushroom datasets. [sent-226, score-0.17]
89 1126 In the second experiment, five datasets from the UCI repository are used to compare the performance of four different methods for classification: RLSC with the IFGT, RLSC with ¨ full kernel evaluation, RLSC with the Nystrom method and the Proximal Support Vector Machines (PSVM) [4]. [sent-284, score-0.141]
90 The Gaussian kernel is used for all these methods. [sent-285, score-0.141]
91 The ten-fold cross validation accuracy on training and testing and the training time are listed in Table 1. [sent-288, score-0.069]
92 The RLSC with the IFGT is fastest among the four classifiers on all five datasets, while the training and testing accuracy is close to the accuracy of the RLSC with full kernel evaluation. [sent-289, score-0.247]
93 The RLSC with the Nystr¨ m approximation is nearly as fast, but the accuracy is lower than the other o methods. [sent-290, score-0.077]
94 5 Conclusions and Discussion We presented an improved fast Gauss transform to speed up kernel machines with Gaussian kernel to linear order. [sent-293, score-0.726]
95 The simulations and the classification experiments show that the algorithm is in general faster and more accurate than other matrix approximation methods. [sent-294, score-0.069]
96 At present, we do not consider the reduction from the support vector set or dimensionality reduction. [sent-295, score-0.121]
97 The combination of the improved fast Gauss transform with these techniques should bring even more reduction in computation. [sent-296, score-0.437]
98 A possible solution could be running a series of testing problems and tuning the parameters accordingly. [sent-298, score-0.102]
99 Using the Nystrom method to speed up kernel machines. [sent-422, score-0.141]
100 Improved fast Gauss transform and efficient kernel density estimation. [sent-430, score-0.437]
wordName wordTfidf (topN-words)
[('ifgt', 0.435), ('gauss', 0.384), ('fgt', 0.339), ('rlsc', 0.295), ('yj', 0.199), ('hermite', 0.189), ('transform', 0.152), ('fast', 0.144), ('kernel', 0.141), ('nystrom', 0.107), ('storage', 0.102), ('psvm', 0.097), ('improved', 0.094), ('nystr', 0.092), ('expansions', 0.084), ('centers', 0.082), ('expansion', 0.082), ('duraiswami', 0.073), ('multipole', 0.073), ('mushroom', 0.073), ('rpd', 0.073), ('pd', 0.068), ('direct', 0.066), ('xi', 0.066), ('feder', 0.063), ('greengard', 0.063), ('multivariate', 0.06), ('ci', 0.058), ('sources', 0.058), ('taylor', 0.057), ('machines', 0.054), ('abs', 0.054), ('error', 0.054), ('points', 0.053), ('greedy', 0.051), ('regularized', 0.049), ('radius', 0.049), ('fmm', 0.048), ('gonzalez', 0.048), ('greene', 0.048), ('gumerov', 0.048), ('maryland', 0.048), ('rnd', 0.048), ('subdivides', 0.048), ('moderate', 0.048), ('classi', 0.048), ('conjugate', 0.048), ('bound', 0.047), ('reduction', 0.047), ('siam', 0.046), ('hn', 0.046), ('cients', 0.045), ('dimensionality', 0.044), ('gaussian', 0.044), ('accelerated', 0.042), ('hypercubes', 0.042), ('om', 0.042), ('complexity', 0.04), ('yang', 0.04), ('inef', 0.04), ('approximation', 0.04), ('center', 0.04), ('coef', 0.04), ('target', 0.04), ('running', 0.04), ('clusters', 0.038), ('subdivision', 0.038), ('galaxy', 0.038), ('proximal', 0.038), ('source', 0.038), ('neighbor', 0.038), ('gaussians', 0.037), ('accuracy', 0.037), ('truncate', 0.036), ('boxes', 0.036), ('dim', 0.036), ('cluster', 0.035), ('evaluation', 0.034), ('univariate', 0.034), ('uci', 0.033), ('rank', 0.033), ('clustering', 0.033), ('testing', 0.032), ('chicago', 0.031), ('series', 0.03), ('reduced', 0.03), ('support', 0.03), ('ck', 0.03), ('expanding', 0.03), ('sphere', 0.03), ('operations', 0.029), ('matrix', 0.029), ('park', 0.028), ('impractical', 0.028), ('precision', 0.028), ('bandwidth', 0.027), ('uniform', 0.027), ('absolute', 0.027), ('vi', 0.027), ('fair', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
2 0.14287841 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
3 0.11357851 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
4 0.10283428 168 nips-2004-Semigroup Kernels on Finite Sets
Author: Marco Cuturi, Jean-philippe Vert
Abstract: Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p.d.) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representation of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach. 1
5 0.094571859 178 nips-2004-Support Vector Classification with Input Data Uncertainty
Author: Jinbo Bi, Tong Zhang
Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1
6 0.087733418 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
7 0.084026232 70 nips-2004-Following Curved Regularized Optimization Solution Paths
8 0.081486866 92 nips-2004-Kernel Methods for Implicit Surface Modeling
9 0.07471589 82 nips-2004-Incremental Algorithms for Hierarchical Classification
10 0.074431457 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
11 0.07341186 68 nips-2004-Face Detection --- Efficient and Rank Deficient
12 0.07173536 42 nips-2004-Computing regularization paths for learning multiple kernels
13 0.071217299 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
14 0.069310248 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
15 0.069124401 115 nips-2004-Maximum Margin Clustering
16 0.06757132 34 nips-2004-Breaking SVM Complexity with Cross-Training
17 0.067439735 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
18 0.06309735 54 nips-2004-Distributed Information Regularization on Graphs
19 0.062002849 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
20 0.061317913 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
topicId topicWeight
[(0, -0.214), (1, 0.076), (2, -0.055), (3, 0.077), (4, -0.083), (5, -0.018), (6, 0.008), (7, -0.049), (8, -0.014), (9, -0.019), (10, 0.045), (11, 0.009), (12, -0.029), (13, 0.015), (14, -0.026), (15, -0.006), (16, -0.025), (17, 0.023), (18, 0.044), (19, -0.019), (20, -0.057), (21, -0.094), (22, -0.014), (23, 0.023), (24, -0.009), (25, 0.011), (26, -0.016), (27, -0.028), (28, 0.117), (29, 0.048), (30, -0.024), (31, -0.068), (32, -0.091), (33, 0.013), (34, 0.053), (35, -0.085), (36, -0.156), (37, 0.154), (38, 0.025), (39, -0.128), (40, -0.112), (41, 0.045), (42, 0.037), (43, -0.05), (44, -0.119), (45, -0.093), (46, -0.048), (47, 0.01), (48, 0.056), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.90152776 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
2 0.66982907 198 nips-2004-Unsupervised Variational Bayesian Learning of Nonlinear Models
Author: Antti Honkela, Harri Valpola
Abstract: In this paper we present a framework for using multi-layer perceptron (MLP) networks in nonlinear generative models trained by variational Bayesian learning. The nonlinearity is handled by linearizing it using a Gauss–Hermite quadrature at the hidden neurons. This yields an accurate approximation for cases of large posterior variance. The method can be used to derive nonlinear counterparts for linear algorithms such as factor analysis, independent component/factor analysis and state-space models. This is demonstrated with a nonlinear factor analysis experiment in which even 20 sources can be estimated from a real world speech data set. 1
3 0.61706561 82 nips-2004-Incremental Algorithms for Hierarchical Classification
Author: Nicolò Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni
Abstract: We study the problem of hierarchical classification when labels corresponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, implementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approximations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremental approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters. 1 Introduction and basic definitions We study the problem of classifying data in a given taxonomy of labels, where the taxonomy is specified as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that multilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf). Learning algorithms for hierarchical classification have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss function, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classification tasks, and we derive the corresponding Bayes-optimal classifier under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based ∗ This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors’ views. algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an approximation to the Bayes-optimal classifier; 3) a simplified variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classifier based on the true model parameters. Our incremental algorithms are based on training linear-threshold classifiers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classifications as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classification system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a flat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x ∈ Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1 , . . . , yN ) ∈ {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are defined over the set of labels. A multilabel y ∈ {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some fixed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate φ over a set Ω, we will use {φ} to denote both the subset of Ω where φ is true and the indicator function of this subset. 2 The H-loss Though several hierarchical losses have been proposed in the literature (e.g., in [11, 20]), no one has emerged as a standard yet. Since hierarchical losses are defined over multilabels, we start by considering two very simple functions measuring the discrepancy between multilabels y = (y1 , ..., yN ) and y = (y1 , ..., yN ): the 0/1-loss 0/1 (y, y) = {∃i : yi = yi } and the symmetric difference loss ∆ (y, y) = {y1 = y1 } + . . . + {yN = yN }. There are several ways of making these losses depend on a given taxonomy G. In this work, we follow the intuition “if a mistake is made at node i, then further mistakes made in the subtree rooted at i are unimportant”. That is, we do not require the algorithm be able to make fine-grained distinctions on tasks when it is unable to make coarse-grained ones. For example, if an algorithm failed to label a document with the class SPORTS, then the algorithm should not be charged more loss because it also failed to label the same document with the subclass SOCCER and the sub-subclass CHAMPIONS LEAGUE. A function implementing this intuition is defined by N H (y, y) = i=1 ci {yi = yi ∧ yj = yj , j ∈ anc(i)}, where c1 , . . . , cN > 0 are fixed cost coefficients. This loss, which we call H-loss, can also be described as follows: all paths in G from a root down to a leaf are examined and, whenever we encounter a node i such that yi = yi , we add ci to the loss, whereas all the loss contributions in the subtree rooted at i are discarded. Note that if c1 = . . . = cN = 1 then 0/1 ≤ H ≤ ∆ . Choices of ci depending on the structure of G are proposed in Section 4. Given a multilabel y ∈ {0, 1}N define its G-truncation as the multilabel y = (y1 , ..., yN ) ∈ {0, 1}N where, for each i = 1, . . . , N , yi = 1 iff yi = 1 and yj = 1 for all j ∈ anc(i). Note that the G-truncation of any multilabel always respects G. A graphical (a) (b) (c) (d) Figure 1: A one-tree forest (repeated four times). Each node corresponds to a class in the taxonomy G, hence in this case N = 12. Gray nodes are included in the multilabel under consideration, white nodes are not. (a) A generic multilabel which does not respect G; (b) its G-truncation. (c) A second multilabel that respects G. (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). representation of the notions introduced so far is given in Figure 1. In the next lemma we show that whenever y respects G, then H (y, y) cannot be smaller than H (y , y). In other words, when the multilabel y to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. Lemma 1 Let G be a taxonomy, y, y ∈ {0, 1}N be two multilabels such that y respects G, and y be the G-truncation of y. Then H (y , y) ≤ H (y, y) . Proof. For each i = 1, . . . , N we show that yi = yi and yj = yj for all j ∈ anc(i) implies yi = yi and yj = yj for all j ∈ anc(i). Pick some i and suppose yi = yi and yj = yj for all j ∈ anc(i). Now suppose yj = 0 (and thus yj = 0) for some j ∈ anc(i). Then yi = 0 since y respects G. But this implies yi = 1, contradicting the fact that the G-truncation y respects G. Therefore, it must be the case that yj = yj = 1 for all j ∈ anc(i). Hence the G-truncation of y left each node j ∈ anc(i) unchanged, implying yj = yj for all j ∈ anc(i). But, since the G-truncation of y does not change the value of a node i whose ancestors j are such that yj = 1, this also implies yi = yi . Therefore yi = yi and the proof is concluded. 3 A probabilistic data model Our learning algorithms are based on the following statistical model for the data, originally introduced in [5]. The model defines a probability distribution fG over the set of multilabels respecting a given taxonomy G by associating with each node i of G a Bernoulli random variable Yi and defining fG (y | x) = N i=1 P Yi = yi | Ypar(i) = ypar(i) , X = x . To guarantee that fG (y | x) = 0 whenever y ∈ {0, 1}N does not respect G, we set P Yi = 1 | Ypar(i) = 0, X = x = 0. Notice that this definition of fG makes the (rather simplistic) assumption that all Yk with the same parent node i (i.e., the children of i) are independent when conditioned on Yi and x. Through fG we specify an i.i.d. process {(X 1 , Y 1 ), (X 2 , Y 2 ), . . .}, where, for t = 1, 2, . . ., the multilabel Y t is distributed according to fG (· | X t ) and X t is distributed according to a fixed and unknown distribution D. Each example (xt , y t ) is thus a realization of the corresponding pair (X t , Y t ) of random variables. Our parametric model for fG is described as follows. First, we assume that the support of D is the surface of the d-dimensional unit sphere (i.e., instances x ∈ R d are such that ||x|| = 1). With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . Then, we define the conditional probabilities for a nonroot node i with parent j by P (Yi = 1 | Yj = 1, X = x) = (1 + ui x)/2. If i is a root node, the previous equation simplifies to P (Yi = 1 | X = x) = (1 + ui x)/2. 3.1 The Bayes-optimal classifier for the H-loss We now describe a classifier, called H - BAYES, that is the Bayes-optimal classifier for the H-loss. In other words, H - BAYES classifies any instance x with the multilabel y = argminy∈{0,1} E[ H (¯ , Y ) | x ]. Define pi (x) = P Yi = 1 | Ypar(i) = 1, X = x . y ¯ When no ambiguity arises, we write pi instead of pi (x). Now, fix any unit-length instance x and let y be a multilabel that respects G. For each node i in G, recursively define H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ) + k∈child(i) H k,x (y) . The classifier H - BAYES operates as follows. It starts by putting all nodes of G in a set S; nodes are then removed from S one by one. A node i can be removed only if i is a leaf or if all nodes j in the subtree rooted at i have been already removed. When i is removed, its value yi is set to 1 if and only if pi 2 − k∈child(i) H k,x (y)/ci ≥ 1 . (1) (Note that if i is a leaf then (1) is equivalent to yi = {pi ≥ 1/2}.) If yi is set to zero, then all nodes in the subtree rooted at i are set to zero. Theorem 2 For any taxonomy G and all unit-length x ∈ Rd , the multilabel generated by H - BAYES is the Bayes-optimal classification of x for the H-loss. Proof sketch. Let y be the multilabel assigned by H - BAYES and y ∗ be any multilabel minimizing the expected H-loss. Introducing the short-hand Ex [·] = E[· | x], we can write Ex H (y, Y )= N i=1 ci (pi (1 − yi ) + (1 − pi )yi ) j∈anc(i) pj {yj = 1} . Note that we can recursively decompose the expected H-loss as Ex H (y, Y )= i∈root(G) where Ex Hi (y, Y ) = ci (pi (1 − yi ) + (1 − pi )yi ) Ex Hi (y, Y ), pj {yj = 1} + j∈anc(i) Ex Hk (y, Y ) . (2) k∈child(i) ∗ Pick a node i. If i is a leaf, then the sum in the RHS of (2) disappears and yi = {pi ≥ 1/2}, ∗ which is also the minimizer of H i,x (y) = ci (pi (1 − yi ) + (1 − pi )yi ), implying yi = yi . ∗ Now let i be an internal node and inductively assume yj = yj for all j ∈ sub(i). Notice ∗ that the factors j∈anc(i) pj {yj = 1} occur in both terms in the RHS of (2). Hence yi does not depend on these factors and we can equivalently minimize ci (pi (1 − yi ) + (1 − pi )yi ) + pi {yi = 1} k∈child(i) H k,x (y), (3) where we noted that, for each k ∈ child(i), Ex Hk (y, Y ) = j∈anc(i) pj {yj = 1} pi {yi = 1}H k,x (y) . ∗ Now observe that yi minimizing (3) is equivalent to the assignment produced by H - BAYES. ∗ ∗ To conclude the proof, note that whenever yi = 0, Lemma 1 requires that yj = 0 for all nodes j ∈ sub(i), which is exactly what H - BAYES does. 4 The algorithms We consider three incremental algorithms. Each one of these algorithms learns a hierarchical classifier by training a decision function gi : Rd → {0, 1} at each node i = 1, . . . , N . For a given set g1 , . . . , gN of decision functions, the hierarchical classifier generated by these algorithms classifies an instance x through a multilabel y = (y1 , ..., yN ) defined as follows: yi = gi (x) 0 if i ∈ root(G) or yj = 1 for all j ∈ anc(i) otherwise. (4) Note that y computed this way respects G. The classifiers (4) are trained incrementally. Let gi,t be the decision function at node i after training on the first t − 1 examples. When the next training example (xt , y t ) is available, the algorithms compute the multilabel y t using classifier (4) based on g1,t (xt ), . . . , gN,t (xt ). Then, the algorithms consider for an update only those decision functions sitting at nodes i satisfying either i ∈ root(G) or ypar(i),t = 1. We call such nodes eligible at time t. The decision functions of all other nodes are left unchanged. The first algorithm we consider is a simple hierarchical version of the Perceptron algorithm [16], which we call H - PERC. The decision functions at time t are defined by gi,t (xt ) = {wi,t xt ≥ 0}. In the update phase, the Perceptron rule wi,t+1 = wi,t + yi,t xt is applied to every node i eligible at time t and such that yi,t = yi,t . The second algorithm, called APPROX - H - BAYES, approximates the H - BAYES classifier of Section 3.1 by replacing the unknown quantities pi (xt ) with estimates (1+w i,t xt )/2. The weights w i,t are regularized least-squares estimates defined by (i) wi,t = (I + Si,t−1 Si,t−1 + xt xt )−1 Si,t−1 y t−1 . (5) The columns of the matrix Si,t−1 are all past instances xs that have been stored at node i; (i) the s-th component of vector y t−1 is the i-th component yi,s of the multilabel y s associated with instance xs . In the update phase, an instance xt is stored at node i, causing an update of wi,t , whenever i is eligible at time t and |w i,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. The corresponding decision functions gi,t are of the form gi,t (xt ) = {w i,t xt ≥ τi,t }, where the threshold τi,t ≥ 0 at node i depends on the margin values w j,t xt achieved by nodes j ∈ sub(i) — recall (1). Note that gi,t is not a linear-threshold function, as xt appears in the definition of w i,t . The margin threshold (5 ln t)/Ni,t , controlling the update of node i at time t, reduces the space requirements of the classifier by keeping matrices Si,t suitably small. This threshold is motivated by the work [4] on selective sampling. The third algorithm, which we call H - RLS (Hierarchical Regularized Least Squares), is a simplified variant of APPROX - H - BAYES in which the thresholds τi,t are set to zero. That is, we have gi,t (xt ) = {w i,t xt ≥ 0} where the weights w i,t are defined as in (5) and updated as in the APPROX - H - BAYES algorithm. Details on how to run APPROX - H - BAYES 2 and H - RLS in dual variables and perform an update at node i in time O(Ni,t ) are found in [3] (where a mistake-driven version of H - RLS is analyzed). 5 Experimental results The empirical evaluation of the algorithms was carried out on two well-known datasets of free-text documents. The first dataset consists of the first (in chronological order) 100,000 newswire stories from the Reuters Corpus Volume 1, RCV1 [2]. The associated taxonomy of labels, which are the topics of the documents, has 101 nodes organized in a forest of 4 trees. The forest is shallow: the longest path has length 3 and the the distribution of nodes, sorted by increasing path length, is {0.04, 0.53, 0.42, 0.01}. For this dataset, we used the bag-of-words vectorization performed by Xerox Research Center Europe within the EC project KerMIT (see [4] for details on preprocessing). The 100,000 documents were divided into 5 equally sized groups of chronologically consecutive documents. We then used each adjacent pair of groups as training and test set in an experiment (here the fifth and first group are considered adjacent), and then averaged the test set performance over the 5 experiments. The second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in “Quality of Health Care” (MeSH code N05.715). After removing overlapping classes (OHSUMED is not quite a tree but a DAG), we ended up with 94 Table 1: Experimental results on two hierarchical text classification tasks under various loss functions. We report average test errors along with standard deviations (in parenthesis). In bold are the best performance figures among the incremental algorithms. RCV1 PERC H - PERC H - RLS AH - BAY SVM H - SVM OHSU. PERC H - PERC H - RLS AH - BAY SVM H - SVM 0/1-loss 0.702(±0.045) 0.655(±0.040) 0.456(±0.010) 0.550(±0.010) 0.482(±0.009) 0.440(±0.008) unif. H-loss 1.196(±0.127) 1.224(±0.114) 0.743(±0.026) 0.815(±0.028) 0.790(±0.023) 0.712(±0.021) norm. H-loss 0.100(±0.029) 0.099(±0.028) 0.057(±0.001) 0.090(±0.001) 0.057(±0.001) 0.055(±0.001) ∆-loss 1.695(±0.182) 1.861(±0.172) 1.086(±0.036) 1.465(±0.040) 1.173(±0.051) 1.050(±0.027) 0/1-loss 0.899(±0.024) 0.846(±0.024) 0.769(±0.004) 0.819(±0.004) 0.784(±0.003) 0.759(±0.002) unif. H-loss 1.938(±0.219) 1.560(±0.155) 1.200(±0.007) 1.197(±0.006) 1.206(±0.003) 1.170(±0.005) norm. H-loss 0.058(±0.005) 0.057(±0.005) 0.045(±0.000) 0.047(±0.000) 0.044(±0.000) 0.044(±0.000) ∆-loss 2.639(±0.226) 2.528(±0.251) 1.957(±0.011) 2.029(±0.009) 1.872(±0.005) 1.910(±0.007) classes and 55,503 documents. We made this choice based only on the structure of the subtree: the longest path has length 4, the distribution of nodes sorted by increasing path length is {0.26, 0.37, 0.22, 0.12, 0.03}, and there are a significant number of partial and multiple path multilabels. The vectorization of the subtree was carried out as follows: after tokenization, we removed all stopwords and also those words that did not occur at least 3 times in the corpus. Then, we vectorized the documents using the Bow library [13] with a log(1 + TF) log(IDF) encoding. We ran 5 experiments by randomly splitting the corpus in a training set of 40,000 documents and a test set of 15,503 documents. Test set performances are averages over these 5 experiments. In the training set we kept more documents than in the RCV1 splits since the OHSUMED corpus turned out to be a harder classification problem than RCV1. In both datasets instances have been normalized to unit length. We tested the hierarchical Perceptron algorithm (H - PERC), the hierarchical regularized leastsquares algorithm (H - RLS), and the approximated Bayes-optimal algorithm (APPROX - H BAYES ), all described in Section 4. The results are summarized in Table 1. APPROX - H BAYES ( AH - BAY in Table 1) was trained using cost coefficients c i chosen as follows: if i ∈ root(G) then ci = |root(G)|−1 . Otherwise, ci = cj /|child(j)|, where j is the parent of i. Note that this choice of coefficients amounts to splitting a unit cost equally among the roots and then splitting recursively each node’s cost equally among its children. Since, in this case, 0 ≤ H ≤ 1, we call the resulting loss normalized H-loss. We also tested a hierarchical version of SVM (denoted by H - SVM in Table 1) in which each node is an SVM classifier trained using a batch version of our hierarchical learning protocol. More precisely, each node i was trained only on those examples (xt , y t ) such that ypar(i),t = 1 (note that, as no conditions are imposed on yi,t , node i is actually trained on both positive and negative examples). The resulting set of linear-threshold functions was then evaluated on the test set using the hierachical classification scheme (4). We tried both the C and ν parametrizations [18] for SVM and found the setting C = 1 to work best for our data. 1 We finally tested the “flat” variants of Perceptron and SVM, denoted by PERC and SVM. In these variants, each node is trained and evaluated independently of the others, disregarding all taxonomical information. All SVM experiments were carried out using the libSVM implementation [6]. All the tested algorithms used a linear kernel. 1 It should be emphasized that this tuning of C was actually chosen in hindsight, with no crossvalidation. As far as loss functions are concerned, we considered the 0/1-loss, the H-loss with cost coefficients set to 1 (denoted by uniform H-loss), the normalized H-loss, and the symmetric difference loss (denoted by ∆-loss). Note that H - SVM performs best, but our incremental algorithms were trained for a single epoch on the training set. The good performance of SVM (the flat variant of H - SVM ) is surprising. However, with a single epoch of training H - RLS does not perform worse than SVM (except on OHSUMED under the normalized H-loss) and comes reasonably close to H - SVM. On the other hand, the performance of APPROX - H - BAYES is disappointing: on OHSUMED it is the best algorithm only for the uniform H-loss, though it was trained using the normalized H-loss; on RCV1 it never outperforms H - RLS, though it always does better than PERC and H - PERC. A possible explanation for this behavior is that APPROX - H - BAYES is very sensitive to errors in the estimates of pi (x) (recall Section 3.1). Indeed, the least-squares estimates (5), which we used to approximate H - BAYES, seem to work better in practice on simpler (and possibly more robust) algorithms, such as H - RLS. The lower values of normalized H-loss on OHSUMED (a harder corpus than RCV1) can be explained because a quarter of the 94 nodes in the OHSUMED taxonomy are roots, and thus each top-level mistake is only charged about 4/94. As a final remark, we observe that the normalized H-loss gave too small a range of values to afford fine comparisons among the best performing algorithms. 6 Regret bounds for the H-loss In this section we prove a theoretical bound on the H-loss of a slight variant of the algorithm H - RLS tested in Section 5. More precisely, we assume data are generated according to the probabilistic model introduced in Section 3 with unknown instance distribution D and unknown coefficients u1 , . . . , uN . We define the regret of a classifier assigning label y to instance X as E H (y, Y t ) − E H (y, Y ), where the expected value is with respect the random draw of (X, Y ) and y is the multilabel assigned by classifier (4) when the decision functions gi are zero-threshold functions of the form gi (x) = {ui x ≥ 0}. The theorem below shows that the regret of the classifier learned by a variant of H - RLS after t training examples, with t large enough, is exponentially small in t. In other words, H - RLS learns to classify as well as the algorithm that is given the true parameters u1 , . . . , uN of the underlying data-generating process. We have been able to prove the theorem only for the variant of H - RLS storing all instances at each node. That is, every eligible node at time t is updated, irrespective of whether |w i,t xt | ≤ (5 ln t)/Ni,t . Given the i.i.d. data-generating process (X 1 , Y 1 ), (X 2 , Y 2 ), . . ., for each node k we define the derived process X k1 , X k2 , . . . including all and only the instances X s of the original process that satisfy Ypar(k),s = 1. We call this derived process the process at node k. Note that, for each k, the process at node k is an i.i.d. process. However, its distribution might depend on k. The spectrum of the process at node k is the set of eigenvalues of the correlation matrix with entries E[Xk1 ,i Xk1 ,j ] for i, j = 1, . . . , d. We have the following theorem, whose proof is omitted due to space limitations. Theorem 3 Let G be a taxonomy with N nodes and let fG be a joint density for G parametrized by N unit-norm vectors u1 , . . . , uN ∈ Rd . Assume the instance distribution is such that there exist γ1 , . . . , γN > 0 satisfying P |ui X t | ≥ γi = 1 for i = 1, . . . , N . Then, for all t > max maxi=1,...,N E H (y t , Y t ) −E 16 µ i λ i γi , maxi=1,...,N 192d µi λ 2 i the regret H (y t , Y t ) of the modified H - RLS algorithm is at most N 2 2 µi t e−κ1 γi λi t µi + t2 e−κ2 λi t µi cj , i=1 j∈sub(i) where κ1 , κ2 are constants, µi = E j∈anc(i) (1 + uj X)/2 eigenvalue in the spectrum of the process at node i. and λi is the smallest 7 Conclusions and open problems In this work we have studied the problem of hierarchical classification of data instances in the presence of partial and multiple path labellings. We have introduced a new hierarchical loss function, the H-loss, derived the corresponding Bayes-optimal classifier, and empirically compared an incremental approximation to this classifier with some other incremental and nonincremental algorithms. Finally, we have derived a theoretical guarantee on the H-loss of a simplified variant of the approximated Bayes-optimal algorithm. Our investigation leaves several open issues. The current approximation to the Bayesoptimal classifier is not satisfying, and this could be due to a bad choice of the model, of the estimators, of the datasets, or of a combination of them. Also, the normalized H-loss is not fully satisfying, since the resulting values are often too small. From the theoretical viewpoint, we would like to analyze the regret of our algorithms with respect to the Bayesoptimal classifier, rather than with respect to a classifier that makes a suboptimal use of the true model parameters. References [1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/. [2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/. [3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. In Proc. 15th COLT, pages 121–137. Springer, 2002. [4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pages 373–386. Springer, 2003. [5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Regret bounds for hierarchical classification with linear-threshold functions. In Proc. 17th COLT. Springer, 2004. To appear. [6] C.-C. Chang and C.-J. Lin. Libsvm — a library for support vector machines. URL: www.csie.ntu.edu.tw/∼cjlin/libsvm/. [7] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2001. [8] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proc. 21st ICML. Omnipress, 2004. [9] S.T. Dumais and H. Chen. Hierarchical classification of web content. In Proc. 23rd ACM Int. Conf. RDIR, pages 256–263. ACM Press, 2000. [10] M. Granitzer. Hierarchical Text Classification using Methods from Machine Learning. PhD thesis, Graz University of Technology, 2003. [11] T. Hofmann, L. Cai, and M. Ciaramita. Learning with taxonomies: Classifying documents and words. In NIPS Workshop on Syntax, Semantics, and Statistics, 2003. [12] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. 14th ICML, Morgan Kaufmann, 1997. [13] A. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. URL: www-2.cs.cmu.edu/∼mccallum/bow/. [14] A.K. McCallum, R. Rosenfeld, T.M. Mitchell, and A.Y. Ng. Improving text classification by shrinkage in a hierarchy of classes. In Proc. 15th ICML. Morgan Kaufmann, 1998. [15] D. Mladenic. Turning yahoo into an automatic web-page classifier. In Proceedings of the 13th European Conference on Artificial Intelligence, pages 473–474, 1998. [16] F. Rosenblatt. The Perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Review, 65:386–408, 1958. [17] M.E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87–118, 2002. [18] B. Sch¨ lkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. o Neural Computation, 12:1207–1245, 2000. [19] B. Sch¨ lkopf and A. Smola. Learning with kernels. MIT Press, 2002. o [20] A. Sun and E.-P. Lim. Hierarchical text classification and evaluation. In Proc. 2001 Int. Conf. Data Mining, pages 521–528. IEEE Press, 2001. [21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.
4 0.55876333 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
Author: Peter Sollich, Christopher Williams
Abstract: The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes. Consider the supervised regression problem for a dataset D with entries (xi , yi ) for i = 1, . . . , n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x ∗ is given by ¯ f (x∗ ) = k (x∗ )(K + σ 2 I)−1 y, (1) where K denotes the n × n matrix of covariances between the training points with entries k(xi , xj ), k(x∗ ) is the vector of covariances k(xi , x∗ ), σ 2 is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details. ¯ We can define a vector of functions h(x∗ ) = (K + σ 2 I)−1 k(x∗ ) . Thus we have f (x∗ ) = h (x∗ )y, making it clear that the mean prediction at a point x∗ is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x∗ , h(x∗ ) gives the vector of weights applied to targets y. Silverman [1] called h (x∗ ) the weight function. Understanding the form of the weight function is made complicated by the matrix inversion of K + σ 2 I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be “smeared out” in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK). The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK. 1 Gaussian Process Regression and the Equivalent Kernel It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional J[f ] = 1 f 2 2 H + 1 2 2σn n (yi − f (xi ))2 , (2) i=1 where f H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.) Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f (x))2 ] = E[(y − η(x))2 ] + (η(x) − f (x))2 . Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2, Jρ [f ] = ρ 2σ 2 (η(x) − f (x))2 dx + 1 f 2 2 H, where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x ) = k(x − x ), the natural basis in which to analyse equation 1 is the Fourier ˜ basis of complex sinusoids so that f (x) is represented as f (s)e2πis·x ds and similarly for η(x). Thus we obtain Jρ [f ] = 1 2 ˜ ρ ˜ |f (s)|2 |f (s) − η (s)|2 + ˜ 2 σ S(s) ds, ˜ as f 2 = |f (s)|2 /S(s)ds where S(s) is the power spectrum of the kernel k, H −2πis·x S(s) = k(x)e dx. Jρ [f ] can be minimized using calculus of variations to ob˜ tain f (s) = S(s)η(s)/(σ 2 /ρ + S(s)) which is recognized as the convolution f (x∗ ) = h(x∗ − x)η(x)dx. Here the Fourier transform of the equivalent kernel h(x) is ˜ h(s) = 1 S(s) = . S(s) + σ 2 /ρ 1 + σ 2 /(ρS(s)) (3) ˜ The term σ 2 /ρ in the first expression for h(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel. 2 The EK for the Squared Exponential and Related Kernels For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e−α|x| (see [5, p. 326]), splines in D = 1 corresponding to the regularizer P f 2 = (f (m) )2 dx [1, 6], and the regularizer P f 2 = ( 2 f )2 dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7]. We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r 2 /2 2 ), where r 2 = ||x−x ||2 . (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2π 2 )D/2 exp(−2π 2 2 |s|2 ), where D denotes the dimensionality of x (and s) space. From equation 3 we obtain ˜ hSE (s) = 1 , 1 + b exp(2π 2 2 |s|2 ) where b = σ 2 /ρ(2π 2 )D/2 . We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small ˜ s = |s| we have that hSE 1, but for large s it is approximately 0. The change takes place around the point sc where b exp(2π 2 2 s2 ) = 1, i.e. s2 = log(1/b)/2π 2 2 . As c c ˜ exp(2π 2 2 s2 ) grows quickly with s, the transition of hSE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function. Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D-dimensions is given by sc r hSE (r) = D/2 JD/2 (2πsc r). Proof: hSE (s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give ∞ hSE (r) = 2πr 0 sc 2πr 0 s r s r ν+1 ν+1 ˜ Jν (2πrs)hSE (s) ds Jν (2πrs) ds = sc r (4) D/2 JD/2 (2πsc r). where ν = D/2 − 1, Jν (z) is a Bessel function of the first kind and we have used the identity z ν+1 Jν (z) = (d/dz)[z ν+1 Jν+1 (z)]. Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain hSE (x) = 2sc sinc(2πsc x) where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J1/2 (z) = (2/πz)1/2 sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below. Notice that sc scales as (log(ρ))1/2 so that the width of the EK (which is proportional to 1/sc ) will decay very slowly as ρ increases. In contrast for a spline of order m (with power spectrum ∝ |s|−2m ) the width of the EK scales as ρ−1/2m [1]. If instead of RD we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction kp (x, x ) = n∈Z k(x − x + 2nπ). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel). We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s c suggests writing hSE (r) = (2πsc )D g(2πsc r). Then from equation 4 and using the definition of sc z sc (2πsc )D ∞ u =z 2πz 0 ∞ g(z) = 0 ν+1 2πsc s z ν+1 Jν (zs/sc ) ds 1 + exp[2π 2 2 (s2 − s2 )] c Jν (zu) du 1 + exp[2π 2 2 s2 (u2 − 1)] c (5) where we have rescaled s = sc u in the second step. The value of sc , and hence ρ, now enters only in the exponential via a = 2π 2 2 s2 . For a → ∞, the exponential tends to zero c for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(. . .)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g∞ (z) ≡ lima→∞ g(z) = (2πz)−D/2 JD/2 (z). To calculate corrections to this, one uses that for large but finite a the difference ∆(u) = {1 + exp[a(u2 − 1)]}−1 − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give ∞ g(z) = g∞ (z) + z k=0 I k dk k! duk u 2πz ν+1 ∞ Jν (zu) , ∆(u)(u − 1)k du Ik = 0 u=1 The problem is thus reduced to calculating the integrals Ik . Setting u = 1 + v/a one has 0 ak+1 Ik = −a a = 0 1 − 1 v k dv + 1 + exp(v 2 /a + 2v) (−1)k+1 v k dv + 1 + exp(−v 2 /a + 2v) ∞ 0 ∞ 0 vk dv 1 + exp(v 2 /a + 2v) vk dv 1 + exp(v 2 /a + 2v) In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I0 = c0 /a2 , I1 = c1 /a2 while all Ik with k ≥ 2 are smaller by at least 1/a2 . The numerical constants are −c0 = c1 = π 2 /24. This gives, using that (d/dz)[z ν+1 Jν (z)] = z ν Jν (z) + z ν+1 Jν−1 (z) = (2ν + 1)z ν Jν (z) − z ν+1 Jν+1 (z): Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by hSE (r) = (2πsc )D g(2πsc r) with g(z) = 1 (2πz) D 2 JD/2 (z) + z (c0 + c1 (D − 1))JD/2−1 (z) − c1 zJD/2 (z) a2 +O( 1 ) a4 For e.g. D = 1 this becomes g(z) = π −1 {sin(z)/z − π 2 /(24a2 )[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a2 correction with the leading order term, one estimates that the correction is of relative size z 2 /a2 . It will therefore provide a useful improvement as long as z = 2πsc r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order. 2.1 Accuracy of the approximation To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in RD with a sampling density ρgrid . For making 2 predictions at the grid points we obtain the smoother matrix K(K + σgrid I)−1 , where1 2 2 σgrid = σ ρgrid /ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. Note that in theory one should use a grid over the whole of RD but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects. 2 To understand this scaling of σgrid consider the case where ρgrid > ρ which means that the effective variance at each of the ρgrid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there 2 are ρgrid /ρ independent Gaussian observations with variance σgrid at a single x-point; this would 2 be equivalent to one Gaussian observation with variance σ . In effect the ρ observations per unit x-space have been smoothed out uniformly. 1 0.16 0.35 0.35 Numerical Proposition 1 Proposition 2 0.3 0.25 0.14 0.2 0.2 0.15 0.12 0.15 0.1 0.1 0.05 0.1 0.05 0 0 −0.05 0.08 Numerical Proposition 1 Proposition 2 0.3 0.25 −0.05 −0.1 0 5 10 −0.1 0 15 5 10 15 0.06 0.04 0.02 0 −0.02 Numerical Proposition 1 Sample −0.04 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5 Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 104 (right). Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of 2 = 0.004 and σ 2 = 0.1, with ρgrid /ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less “ringing” than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 104 (a = 9.67, right). As expected, the addition of the 1/a2 -correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/ log(ρ). The is larger than the na¨ve estimate (1/a2 )2 ∼ 1/(log(ρ))4 based on the first correction term from Proposition ı 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down. 2.2 Other kernels Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff sc on the range of s in the EK via σ 2 /ρ = S(sc ), so that ˜ h(s) = [1 + S(sc )/S(s)]−1 . The EK will then have the limiting form given in Proposi˜ tion 1 if h(s) approaches a step function Θ(sc − s), i.e. if it becomes infinitely “steep” around the point s = sc for sc → ∞. A quantitative criterion for this is that the slope ˜ |h (sc )| should become much larger than 1/sc , the inverse of the range of the step func˜ tion. Since h (s) = S (s)S(sc )S −2 (s)[1 + S(sc )/S(s)]−2 , this is equivalent to requiring that −sc S (sc )/4S(sc ) ∝ −d log S(sc )/d log sc must diverge for sc → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s. A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales 2 ; the asymptotic behaviour of sc is then governed by the smallest . A less obvious case is the “rational quadratic” k(r) = [1 + (r/l)2 ]−(D+1)/2 which has an exponentially decaying power spectrum S(s) ∝ exp(−2π s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ ).) Proposition 1 then applies, with the width of the EK now scaling as 1/sc ∝ 1/ log(ρ). The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p( ) of lengthscales , k(r) = exp(−r 2 /2 2 )p( ) d . This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, §2.10]. Such a kernel has power spectrum ∞ S(s) = (2π)D/2 D exp(−2π 2 2 s2 )p( ) d (6) 0 and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2π 0 s), is obtained for p( ) ∝ −D−2 exp(− 2 /2 2 ). More generally, because the exponential 0 1/s D factor in equation 6 acts like a cutoff for > 1/s, one estimates S(s) ∼ 0 p( ) d for large s. This will decay more strongly than any power of 1/s for s → ∞ if p( ) itself decreases more strongly than any power of for → 0. Any such choice of p( ) will therefore yield a kernel to which Proposition 1 applies. 3 Understanding GP Learning Using the Equivalent Kernel We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves. To understand the asymptotic behaviour of for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by fEK (x) = h(x − x )y(x )dx in the continuum limit of “smoothed out” input locations. We assume as before 2 that y = target + noise, i.e. y(x) = η(x) + ν(x) where E[ν(x)ν(x )] = (σ∗ /ρ)δ(x − x ). 2 Here σ∗ denotes the true noise variance, as opposed to the noise variance assumed in the 2 EK; the scaling of σ∗ with ρ is explained in footnote 1. For a fixed target η, the MSE is = ( dx)−1 [η(x) − fEK (x)]2 dx. Averaging over the noise process ν and target function η gives in Fourier space 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 ds [1 + σ 2 /(ρS(s))]2 (7) where Sη (s) is the power spectrum of the prior over target functions. In the case S(s) = 2 Sη (s) and σ 2 = σ∗ where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error B and simplifies to B = (σ 2 /ρ) [1 + σ 2 /(ρS(s))]−1 ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] = σ2 2 ˜ ˜ Sη (s)[1 − h(s)]2 + (σ∗ /ρ)h2 (s) ds = ρ α=2 0.5 0.03 0.025 ε 0.02 0.015 0.01 α=4 0.1 0.005 0 −0.005 1 0.05 1 0.5 0.5 0 0 −0.5 −0.5 −1 25 −1 50 100 ρ 250 500 Figure 2: Left: plot of the asymptotic form of the EK (sc /r)J1 (2πsc r) for D = 2 and ρ = 1225. Right: log-log plot of against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates. on the MSE of standard GP prediction from finite datasets. In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression. Treating the denominator in the expression for B again as a hard cutoff at s = sc , which is justified for large ρ, one obtains for an SE target and learner ≈ σ 2 sc /ρ ∝ (log(ρ))D/2 /ρ. To get analogous predictions for the mismatched case, one can write equation 7 as = 2 σ∗ ρ [1 + σ 2 /(ρS(s))] − σ 2 /(ρS(s)) ds + [1 + σ 2 /(ρS(s))]2 Sη (s) ds. [S(s)ρ/σ 2 + 1]2 2 The first integral is smaller than (σ∗ /σ 2 ) B and can be neglected as long as B . In the second integral we can again make the cutoff approximation—though now with s having ∞ to be above sc – to get the scaling ∝ sc sD−1 Sη (s) ds. For target functions with a power-law decay Sη (s) ∝ s−α of the power spectrum at large s this predicts ∝ sD−α ∝ c (log(ρ))(D−α)/2 . So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ∼ (log(ρ)) −1/2 , and for the Matern-class covariance function k(r) = (1 + r/ ) exp(−r/ ) (which has power spectrum ∝ (3/ 2 + 4π 2 s2 )−2 , so α = 4) we get ∼ (log(ρ))−3/2 . These predictions were tested experimentally using a GP learner with SE covariance function ( = 0.1 and assumed noise level σ 2 = 0.1) against targets from the OU and Matern-class priors (with 2 = 0.05) and with noise level σ∗ = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of on log(ρ), in Figure 2(right) we make a log-log plot of against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ. 3.1 Using the Equivalent Kernel in Kernel Regression Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ−1 i h(x∗ − xi )yi at x∗ . Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction. Letting · denote averaging over noise, training input points and the test point and setting fη (x∗ ) = h(x, x∗ )η(x)dx, the average MSE of the EK predictor is pred = [η(x) − (1/ρ) i h(x, xi )yi ]2 = [η(x) − fη (x)]2 + = σ2 ρ 2 σ∗ ρ h2 (x, x )dx + 1 ρ h2 (x, x )η 2 (x )dx − 2 (σ 2 /ρ)Sη (s)/S 2 (s) + σ∗ /σ 2 η2 ds + 2 /(ρS(s))]2 [1 + σ ρ 1 ρ 2 fη (x) ds [1 + σ 2 /(ρS(s))]2 Here we have set η 2 = ( dx)−1 η 2 (x) dx = Sη (s) ds for the spatial average of the 2 squared target amplitude. Taking the matched case, (Sη (s) = S(s) and σ∗ = σ 2 ) as an example, the first term (which is the one we get for the prediction from “smoothed out” training inputs, see eq. 7) is of order σ 2 sD /ρ, while the second one is ∼ η 2 sD /ρ. Thus c c both terms scale in the same way, but the ratio of the second term to the first is the signal2 2 to-noise ratio η /σ , which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x∗ ) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets. In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner’s covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves. Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views. References [1] B. W. Silverman. Annals of Statistics, 12:898–916, 1984. [2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599–621. Kluwer Academic, 1998. [3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990. [4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219–269, 1995. [5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1991. Third Edition. [6] C. Thomas-Agnan. Numerical Algorithms, 13:21–32, 1996. [7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985. [8] B. Sch¨ lkopf and A. Smola. Learning with Kernels. MIT Press, 2002. o [9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999. [10] M. Opper and F. Vivarelli. In NIPS 11, pages 302–308, 1999. [11] P. Sollich and A. Halees. Neural Computation, 14:1393–1428, 2002. [12] P. Sollich. In NIPS 14, pages 519–526, 2002.
5 0.55764979 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
Author: Saharon Rosset, Robert Tibshirani, Ji Zhu, Trevor J. Hastie
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. 1
6 0.53842771 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
7 0.51395249 94 nips-2004-Kernels for Multi--task Learning
8 0.46273768 168 nips-2004-Semigroup Kernels on Finite Sets
9 0.45213988 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.44848266 42 nips-2004-Computing regularization paths for learning multiple kernels
11 0.42395103 22 nips-2004-An Investigation of Practical Approximate Nearest Neighbor Algorithms
12 0.41961113 30 nips-2004-Binet-Cauchy Kernels
13 0.41239464 51 nips-2004-Detecting Significant Multidimensional Spatial Clusters
14 0.41046977 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
15 0.41003031 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
16 0.40258175 188 nips-2004-The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space
17 0.40131798 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
18 0.40128359 34 nips-2004-Breaking SVM Complexity with Cross-Training
19 0.38900149 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
20 0.38769472 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
topicId topicWeight
[(13, 0.108), (15, 0.146), (26, 0.063), (31, 0.026), (33, 0.179), (35, 0.021), (39, 0.037), (50, 0.05), (76, 0.249), (81, 0.012)]
simIndex simValue paperId paperTitle
Author: Tobias Blaschke, Laurenz Wiskott
Abstract: In contrast to the equivalence of linear blind source separation and linear independent component analysis it is not possible to recover the original source signal from some unknown nonlinear transformations of the sources using only the independence assumption. Integrating the objectives of statistical independence and temporal slowness removes this indeterminacy leading to a new method for nonlinear blind source separation. The principle of temporal slowness is adopted from slow feature analysis, an unsupervised method to extract slowly varying features from a given observed vectorial signal. The performance of the algorithm is demonstrated on nonlinearly mixed speech data. 1
2 0.92727894 30 nips-2004-Binet-Cauchy Kernels
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: We propose a family of kernels based on the Binet-Cauchy theorem and its extension to Fredholm operators. This includes as special cases all currently known kernels derived from the behavioral framework, diffusion processes, marginalized kernels, kernels on graphs, and the kernels on sets arising from the subspace angle approach. Many of these kernels can be seen as the extrema of a new continuum of kernel functions, which leads to numerous new special cases. As an application, we apply the new class of kernels to the problem of clustering of video sequences with encouraging results. 1
3 0.87181187 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
Author: Volker Roth
Abstract: The problem of detecting “atypical objects” or “outliers” is one of the classical topics in (robust) statistics. Recently, it has been proposed to address this problem by means of one-class SVM classifiers. The main conceptual shortcoming of most one-class approaches, however, is that in a strict sense they are unable to detect outliers, since the expected fraction of outliers has to be specified in advance. The method presented in this paper overcomes this problem by relating kernelized one-class classification to Gaussian density estimation in the induced feature space. Having established this relation, it is possible to identify “atypical objects” by quantifying their deviations from the Gaussian model. For RBF kernels it is shown that the Gaussian model is “rich enough” in the sense that it asymptotically provides an unbiased estimator for the true density. In order to overcome the inherent model selection problem, a cross-validated likelihood criterion for selecting all free model parameters is applied. 1
same-paper 4 0.84792411 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
Author: Changjiang Yang, Ramani Duraiswami, Larry S. Davis
Abstract: The computation and memory required for kernel machines with N training samples is at least O(N 2 ). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N ). We also give an error bound for the approximation, and provide experimental results on the UCI datasets. 1
5 0.78694671 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
Author: Yoshitatsu Matsuda, Kazunori Yamaguchi
Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efficiently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efficient and effective in large-size natural image processing.
6 0.74896967 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
7 0.73247278 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
8 0.73172259 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
9 0.72903067 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
10 0.72873461 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
11 0.72760785 124 nips-2004-Multiple Alignment of Continuous Time Series
12 0.72686255 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
13 0.72644126 131 nips-2004-Non-Local Manifold Tangent Learning
14 0.72638249 177 nips-2004-Supervised Graph Inference
15 0.7250728 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
16 0.72246075 178 nips-2004-Support Vector Classification with Input Data Uncertainty
17 0.72214717 70 nips-2004-Following Curved Regularized Optimization Solution Paths
18 0.72087604 168 nips-2004-Semigroup Kernels on Finite Sets
19 0.7204963 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
20 0.71931356 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations