nips nips2009 nips2009-128 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. [sent-6, score-0.638]
2 We analyze this problem in the case of regression and the kernel ridge regression algorithm. [sent-7, score-0.455]
3 We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. [sent-8, score-0.519]
4 We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. [sent-9, score-0.227]
5 Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique. [sent-10, score-0.217]
6 1 Introduction Learning algorithms based on kernels have been used with much success in a variety of tasks [17,19]. [sent-11, score-0.383]
7 , kernel ridge regression and support vector regression (SVR) [16, 22], and general dimensionality reduction algorithms such as kernel PCA (KPCA) [18] all benefit from kernel methods. [sent-14, score-0.963]
8 Positive definite symmetric (PDS) kernel functions implicitly specify an inner product in a high-dimension Hilbert space where large-margin solutions are sought. [sent-15, score-0.293]
9 So long as the kernel function used is PDS, convergence of the training algorithm is guaranteed. [sent-16, score-0.325]
10 However, in the typical use of these kernel method algorithms, the choice of the PDS kernel, which is crucial to improved performance, is left to the user. [sent-17, score-0.254]
11 A less demanding alternative is to require the user to instead specify a family of kernels and to use the training data to select the most suitable kernel out of that family. [sent-18, score-0.719]
12 There is a large recent body of literature addressing various aspects of this problem, including deriving efficient solutions to the optimization problems it generates and providing a better theoretical analysis of the problem both in classification and regression [1, 8, 9, 11, 13, 15, 21]. [sent-20, score-0.115]
13 However, despite the substantial progress made in the theoretical understanding and the design of efficient algorithms for the problem of learning such linear combinations of kernels, no method seems to reliably give improvements over baseline methods. [sent-22, score-0.262]
14 For example, the learned linear combination does not consistently outperform either the uniform combination of base kernels or simply the best single base kernel (see, for example, UCI dataset experiments in [9, 12], see also NIPS 2008 workshop). [sent-23, score-1.057]
15 This suggests exploring other non-linear families of kernels to obtain consistent and significant performance improvements. [sent-24, score-0.383]
16 Non-linear combinations of kernels have been recently considered by [23]. [sent-25, score-0.52]
17 Another method, hierarchical multiple learning [3], considers learning a linear combination of an exponential number of linear kernels, which can be efficiently represented as a product of sums. [sent-27, score-0.146]
18 However, in [3] the base kernels are restricted to concatenation kernels, where the base kernels apply to disjoint subspaces. [sent-29, score-1.004]
19 For this approach the authors provide an effective and efficient algorithm and some performance improvement is actually observed for regression problems in very high dimensions. [sent-30, score-0.139]
20 This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. [sent-31, score-0.638]
21 We analyze that problem in the case of regression using the kernel ridge regression (KRR) algorithm. [sent-32, score-0.455]
22 We show how to simplify its optimization problem from a minimax problem to a simpler minimization problem and prove that the global solution of the optimization problem always lies on the boundary. [sent-33, score-0.286]
23 We give a projection-based gradient descent algorithm for solving this minimization problem that is shown empirically to converge in few iterations. [sent-34, score-0.209]
24 Furthermore, we give a necessary and sufficient condition for this algorithm to reach a global optimum. [sent-35, score-0.109]
25 Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique. [sent-36, score-0.217]
26 In Section 2, we introduce the non-linear family of kernels considered. [sent-38, score-0.427]
27 Section 3 discusses the learning problem, formulates the optimization problem, and presents our solution. [sent-39, score-0.108]
28 In Section 4, we study the performance of our algorithm for learning nonlinear combinations of kernels in regression (NKRR) on several publicly available datasets. [sent-40, score-0.728]
29 2 Kernel Family This section introduces and discusses the family of kernels we consider for our learning kernel problem. [sent-41, score-0.739]
30 , Kp be a finite set of kernels that we combine to define more complex kernels. [sent-45, score-0.383]
31 In much of the previous work on learning kernels, the family of kernels considered is that of linear or convex combinations of some base kernels. [sent-47, score-0.762]
32 Here, we consider polynomial combinations of higher degree d ≥ 1 of the base kernels with non-negative coefficients of the form: k k (1) µk1 ···kp ≥ 0. [sent-48, score-0.749]
33 µk1 ···kp K1 1 · · · Kp p , Kµ = 0≤k1 +···+kp ≤d, ki ≥0, i∈[0,p] Any kernel function Kµ of this form is PDS since products and sums of PDS kernels are PDS [4]. [sent-49, score-0.698]
34 k k Note that Kµ is in fact a linear combination of the PDS kernels K1 1 · · ·Kp p . [sent-50, score-0.432]
35 , kp ), µk1 ···kp can k be written as a product of non-negative coefficients µ1 , . [sent-55, score-0.393]
36 Then, the 1 general form of the polynomial combinations we consider becomes k k µk1 ···kp K1 1 · · · Kp p , k k µk1 · · · µkp K1 1 · · · Kp p + p 1 K= (2) (k1 ,. [sent-59, score-0.195]
37 The choice of the set I and its size depends on the sample size m and possible prior knowledge about relevant kernel combinations. [sent-67, score-0.254]
38 The second sum of equation (2) defining our general family of kernels represents a linear combination of PDS kernels. [sent-68, score-0.476]
39 In the following, we focus on kernels that have the form of the first sum and that are thus non-linear in the parameters µ1 , . [sent-69, score-0.383]
40 More specifically, we consider kernels Kµ defined by k k µk1 · · · µkp K1 1 · · · Kp p , p 1 Kµ = (3) k1 +···+kp =d where µ = (µ1 , . [sent-73, score-0.383]
41 For the ease of presentation, our analysis is given for the case d = 2, where the quadratic kernel can be given the following simpler expression: p µk µl Kk Kl . [sent-77, score-0.359]
42 Kµ = (4) k,l=1 But, the extension to higher-degree polynomials is straightforward and our experiments include results for degrees d up to 4. [sent-78, score-0.158]
43 1 Optimization Problem We consider a standard regression problem where the learner receives a training sample of size m, S = ((x1 , y1 ), . [sent-80, score-0.103]
44 The family of hypotheses Hµ out of which the learner selects a hypothesis is the reproducing kernel Hilbert space (RKHS) associated to a PDS kernel function Kµ : X × X → R as defined in the previous section. [sent-84, score-0.552]
45 Unlike standard kernel-based regression algorithms however, here, both the parameter vector µ defining the kernel Kµ and the hypothesis are learned using the training sample S. [sent-85, score-0.395]
46 The learning kernel algorithm we consider is derived from kernel ridge regression (KRR). [sent-86, score-0.706]
47 , ym ]⊤ ∈ Rm denote the vector of training labels and let Kµ denote the Gram matrix of the kernel Kµ for the sample S: [Kµ ]i,j = Kµ (xi , xj ), for all i, j ∈ [1, m]. [sent-90, score-0.323]
48 2 Algorithm Formulation For learning linear combinations of kernels, a typical technique consists of applying the minimax theorem to permute the min and max operators, which can lead to optimization problems computationally more efficient to solve [8, 12]. [sent-98, score-0.256]
49 Instead, our method for learning non-linear kernels and solving the min-max problem in equation (6) consists of first directly solving the inner maximization problem. [sent-100, score-0.412]
50 (9) Plugging the optimal expression of α in the min-max optimization yields the following equivalent minimization in terms of µ only: min F (µ) = y⊤ (Kµ + λI)−1 y. [sent-102, score-0.119]
51 Although the original min-max problem has been reduced to a simpler minimization problem, the function F is not convex in general as illustrated by Figure 1. [sent-104, score-0.123]
52 Thus, standard interiorpoint or gradient methods are not guaranteed to be successful at finding a global optimum. [sent-106, score-0.102]
53 Algorithm 1 illustrates a general gradient descent algorithm for the norm-2 bounded setting which projects µ back to the feasible set M2 after each gradient step (projecting to M1 is very similar). [sent-108, score-0.197]
54 5 µ1 0 1 µ2 Figure 1: Example plots for F defined over two linear base kernels generated from the first two features of the sonar dataset. [sent-121, score-0.54]
55 Because of the closed form expression (10), we do not alternate between solving for the dual variables and performing a gradient step in the kernel parameters. [sent-129, score-0.376]
56 We only need to optimize with respect to the kernel parameters. [sent-130, score-0.254]
57 3 Algorithm Properties We first explicitly calculate the gradient of the objective function for the optimization problem (10). [sent-132, score-0.107]
58 Assume that µ⋆ > 0 is a stationary point, then, by Proposition 2, F (µ⋆ ) = y⊤ (Kµ⋆ + ⊤ λI)−1 y = y λ y , which implies that y is an eigenvector of (Kµ⋆ +λI)−1 with eigenvalue λ−1 . [sent-153, score-0.119]
59 The previous propositions are sufficient to show that the gradient descent algorithm will not become stuck at a local minimum while searching the interior of a convex set M and, furthermore, they indicate that the optimum is found at the boundary. [sent-159, score-0.19]
60 The following proposition gives a necessary and sufficient condition for the convexity of F on a convex region C. [sent-160, score-0.177]
61 If the boundary region defined by µ − µ0 = Λ is contained in this convex region, then Algorithm 1 is guaranteed to converge to a global optimum. [sent-161, score-0.132]
62 The function F : µ → y⊤ (Kµ + λI)−1 y is convex over the convex set C iff the following condition holds for all µ ∈ C and all u: M, N − 1 5 F ≥ 0, (18) Data Parkinsons Iono Sonar Breast m 194 351 208 683 p 21 34 60 9 lin. [sent-168, score-0.131]
63 ii i ii (23) i=1 A sufficient condition for this inequality to hold is that each term (4[Kµ ]2 Vii − 1) be non-negative, ii or equivalently that 4K2 V − I µ mini p k=1 µk [Kk ]ii ≥ λ 3 I. [sent-237, score-0.136]
64 4 Empirical Results To test the advantage of learning non-linear kernel combinations, we carried out a number of experiments on publicly available datasets. [sent-239, score-0.364]
65 The datasets are chosen to demonstrate the effectiveness of the algorithm under a number of conditions. [sent-240, score-0.102]
66 For general performance improvement, we chose a number of UCI datasets frequently used in kernel learning experiments, e. [sent-241, score-0.325]
67 For learning with thousands of kernels, we chose the sentiment analysis dataset of Blitzer et. [sent-244, score-0.137]
68 4 0 4000 1000 2000 3000 # bigrams 4000 5000 Figure 2: The performance of baseline and learned quadratic kernels (plus or minus one standard deviation) versus the number of bigrams (and kernels) used. [sent-267, score-0.748]
69 1 UCI Datasets We first analyzed the performance of the kernels learned as quadratic combinations. [sent-269, score-0.485]
70 We associated a base kernel to each feature, which computes the product of this feature between different examples. [sent-273, score-0.412]
71 We compared both linear and quadratic combinations, each with a baseline (uniform), norm-1-regularized and norm-2-regularized weighting using µ0 = 1 corresponding to the weights of the baseline kernel. [sent-274, score-0.202]
72 The parameters λ and Λ were selected via 10-fold cross validation and the error reported was based on 30 random 50/50 splits of the entire dataset into training and test sets. [sent-275, score-0.175]
73 For the gradient descent algorithm, we started with η = 1 and reduced it by a factor of 0. [sent-276, score-0.107]
74 The results, which are presented in Table 1, are in line with previous ones reported for learning kernels on these datasets [7,8,12,15]. [sent-282, score-0.49]
75 They indicate that learning quadratic combination kernels can sometimes offer improvements and that it clearly does not degrade with respect to the performance of the baseline kernel. [sent-283, score-0.621]
76 The learned quadratic combination performs well, particularly on tasks where the number of features was large compared to the number of points. [sent-284, score-0.151]
77 This suggests that the learned kernel is better regularized than the plain quadratic kernel and can be advantageous is scenarios where over-fitting is an issue. [sent-285, score-0.678]
78 Each base kernel computes the product between the counts of a particular n-gram for the given pair of points. [sent-288, score-0.412]
79 Such kernels have a direct connection to count-based rational kernels, as described in [8]. [sent-289, score-0.383]
80 We used the sentiment analysis dataset of Blitzer et. [sent-290, score-0.108]
81 This dataset contains text-based user reviews found for products on amazon. [sent-292, score-0.11]
82 The product reviews fall into two categories: electronics and kitchen-wares, each with 2,000 data-points. [sent-295, score-0.129]
83 We used the performance of the uniformly weighted quadratic combination kernel as a baseline, and showed the improvement when learning the kernel with norm-1 or norm-2 regularization using µ0 = 1 corresponding to the weights of the baseline kernel. [sent-300, score-0.8]
84 As shown by Figure 2, the learned kernels significantly improved over the baseline quadratic kernel in both the kitchen and electronics categories. [sent-301, score-0.911]
85 Using 900 training points and about 3,600 bigrams, and thus kernels, each iteration of the algorithm took approximately 25 7 KRR, with (dashed) and without (solid) learning 0. [sent-303, score-0.1]
86 20 1st degree 2nd degree 3rd degree 4th degree 0. [sent-305, score-0.208]
87 For all polynomials, we compared un-weighted, standard KRR (solid lines) with norm-2 regularized kernel learning (dashed lines). [sent-308, score-0.283]
88 For 4th degree polynomials we observed a clear performance improvement, especially for medium amount of training data (subsampling factor of 10-50). [sent-309, score-0.284]
89 Here too, we used polynomial kernels over the features, but this time we experimented with polynomials with degrees as high as 4. [sent-318, score-0.599]
90 Again, we made the assumption that all coefficients of µ are in the form of products of µi s (see Section 2), thus only 8 kernel parameters needed to be estimated. [sent-319, score-0.287]
91 We split the data into 10,000 examples for training and 10,000 examples for testing, and, to investigate the effect of the sample size on learning kernels, subsampled the training data so that only a fraction from 1 to 100 was used. [sent-320, score-0.105]
92 The parameters λ and Λ were determined by 10-fold cross validation on the training data, and results are reported on the test data, see Figure 3. [sent-321, score-0.101]
93 For lower degree polynomials, the performance was essentially the same, but for 4th degree polynomials we observed a significant performance improvement of learning kernels over the uniformly weighted KRR, especially for a medium amount of training data (subsampling factor of 10-50). [sent-323, score-0.789]
94 This result corroborates the finding on the UCI dataset, that learning kernels is better regularized than plain unweighted KRR and can be advantageous is scenarios where overfitting is an issue. [sent-327, score-0.48]
95 5 Conclusion We presented an analysis of the problem of learning polynomial combinations of kernels in regression. [sent-328, score-0.607]
96 This extends learning kernel ideas and helps explore kernel combinations leading to better performance. [sent-329, score-0.674]
97 We proved that the global solution of the optimization problem always lies on the boundary and gave a simple projection-based gradient descent algorithm shown empirically to converge in few iterations. [sent-330, score-0.3]
98 We also gave a necessary and sufficient condition for that algorithm to converge to a global optimum. [sent-331, score-0.146]
99 Finally, we reported the results of several experiments on publicly available datasets demonstrating the benefits of learning polynomial combinations of kernels. [sent-332, score-0.417]
100 Exploring large feature spaces with hierarchical multiple kernel learning. [sent-350, score-0.254]
wordName wordTfidf (topN-words)
[('ku', 0.413), ('kernels', 0.383), ('kp', 0.354), ('pds', 0.291), ('kernel', 0.254), ('krr', 0.166), ('polynomials', 0.158), ('combinations', 0.137), ('kk', 0.126), ('base', 0.119), ('bigrams', 0.097), ('proposition', 0.096), ('lj', 0.09), ('ik', 0.088), ('publicly', 0.081), ('cortes', 0.075), ('ijkl', 0.073), ('ridge', 0.071), ('mohri', 0.069), ('baseline', 0.069), ('regression', 0.065), ('quadratic', 0.064), ('stationary', 0.063), ('sentiment', 0.062), ('psd', 0.062), ('micchelli', 0.062), ('yy', 0.062), ('subsampling', 0.059), ('electronics', 0.059), ('polynomial', 0.058), ('gradient', 0.057), ('kr', 0.056), ('uk', 0.056), ('delve', 0.055), ('nkrr', 0.055), ('degree', 0.052), ('kl', 0.051), ('vec', 0.05), ('rmse', 0.05), ('descent', 0.05), ('convex', 0.05), ('optimization', 0.05), ('combination', 0.049), ('blitzer', 0.049), ('rp', 0.048), ('uci', 0.046), ('dataset', 0.046), ('global', 0.045), ('kitchen', 0.044), ('corinna', 0.044), ('init', 0.044), ('family', 0.044), ('datasets', 0.042), ('vii', 0.042), ('simpler', 0.041), ('improvement', 0.041), ('minimax', 0.04), ('regularization', 0.04), ('plain', 0.039), ('argyriou', 0.039), ('jj', 0.039), ('product', 0.039), ('training', 0.038), ('learned', 0.038), ('sonar', 0.038), ('xr', 0.038), ('expression', 0.037), ('converge', 0.037), ('medium', 0.036), ('mercer', 0.036), ('courant', 0.036), ('reported', 0.036), ('google', 0.036), ('ii', 0.035), ('demonstrating', 0.034), ('rm', 0.034), ('tr', 0.034), ('cients', 0.034), ('products', 0.033), ('algorithm', 0.033), ('street', 0.032), ('minimization', 0.032), ('condition', 0.031), ('ym', 0.031), ('reviews', 0.031), ('xs', 0.03), ('learning', 0.029), ('advantageous', 0.029), ('discusses', 0.029), ('eigenvalue', 0.028), ('ki', 0.028), ('eigenvector', 0.028), ('splits', 0.028), ('coef', 0.028), ('lies', 0.028), ('dual', 0.028), ('improvements', 0.027), ('york', 0.027), ('cross', 0.027), ('effectiveness', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
2 0.22587298 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
3 0.19530867 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
4 0.18942456 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9
5 0.15758072 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
Author: Liefeng Bo, Cristian Sminchisescu
Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a setlevel feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art in three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1
6 0.15278181 95 nips-2009-Fast subtree kernels on graphs
7 0.15048678 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
8 0.14686345 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
9 0.1360084 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
10 0.12022319 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
11 0.11084911 33 nips-2009-Analysis of SVM with Indefinite Kernels
12 0.10710309 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
13 0.094979838 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
14 0.08727017 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming
15 0.08459805 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning
16 0.082166478 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
17 0.078100912 81 nips-2009-Ensemble Nystrom Method
18 0.073046789 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
19 0.071977198 72 nips-2009-Distribution Matching for Transduction
20 0.068183996 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
topicId topicWeight
[(0, -0.224), (1, 0.15), (2, -0.076), (3, 0.177), (4, -0.121), (5, -0.063), (6, -0.013), (7, 0.212), (8, -0.105), (9, 0.038), (10, 0.153), (11, -0.177), (12, -0.017), (13, 0.044), (14, -0.118), (15, 0.064), (16, -0.141), (17, 0.047), (18, -0.026), (19, 0.008), (20, 0.034), (21, 0.015), (22, -0.004), (23, -0.022), (24, 0.01), (25, -0.009), (26, 0.006), (27, -0.041), (28, -0.038), (29, 0.018), (30, -0.105), (31, -0.053), (32, -0.055), (33, -0.129), (34, 0.014), (35, -0.024), (36, -0.068), (37, 0.095), (38, 0.036), (39, 0.028), (40, -0.022), (41, 0.061), (42, 0.026), (43, -0.054), (44, -0.014), (45, 0.087), (46, -0.001), (47, 0.061), (48, 0.034), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.9605931 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
2 0.81423998 119 nips-2009-Kernel Methods for Deep Learning
Author: Youngmin Cho, Lawrence K. Saul
Abstract: We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels as well as deep belief nets. 1
3 0.77776569 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
Author: Kenji Fukumizu, Arthur Gretton, Gert R. Lanckriet, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example. 1
4 0.76170987 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
Author: Liefeng Bo, Cristian Sminchisescu
Abstract: In visual recognition, the images are frequently modeled as unordered collections of local features (bags). We show that bag-of-words representations commonly used in conjunction with linear classifiers can be viewed as special match kernels, which count 1 if two local features fall into the same regions partitioned by visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse, motivating research into the design of match kernels that more accurately measure the similarity between local features. However, it is impractical to use such kernels for large datasets due to their significant computational cost. To address this problem, we propose efficient match kernels (EMK) that map local features to a low dimensional feature space and average the resulting vectors to form a setlevel feature. The local feature maps are learned so their inner products preserve, to the best possible, the values of the specified kernel function. Classifiers based on EMK are linear both in the number of images and in the number of local features. We demonstrate that EMK are extremely efficient and achieve the current state of the art in three difficult computer vision datasets: Scene-15, Caltech-101 and Caltech-256. 1
5 0.72731853 95 nips-2009-Fast subtree kernels on graphs
Author: Nino Shervashidze, Karsten M. Borgwardt
Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & G¨ rtner scales as O(n2 4d h). Key to this efficiency is the observation that the a Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
6 0.71201909 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
8 0.65988076 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test
9 0.65095514 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
10 0.59565264 33 nips-2009-Analysis of SVM with Indefinite Kernels
11 0.58501816 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
12 0.58214253 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
13 0.52997655 81 nips-2009-Ensemble Nystrom Method
14 0.48613235 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
15 0.41283131 72 nips-2009-Distribution Matching for Transduction
16 0.39642122 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
17 0.39314094 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
18 0.38349834 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning
19 0.38240656 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
20 0.38108608 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
topicId topicWeight
[(7, 0.021), (18, 0.207), (24, 0.043), (25, 0.068), (35, 0.038), (36, 0.167), (39, 0.04), (58, 0.119), (61, 0.032), (71, 0.051), (81, 0.016), (86, 0.109), (91, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.86774564 128 nips-2009-Learning Non-Linear Combinations of Kernels
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper studies the general problem of learning kernels based on a polynomial combination of base kernels. We analyze this problem in the case of regression and the kernel ridge regression algorithm. We examine the corresponding learning kernel optimization problem, show how that minimax problem can be reduced to a simpler minimization problem, and prove that the global solution of this problem always lies on the boundary. We give a projection-based gradient descent algorithm for solving the optimization problem, shown empirically to converge in few iterations. Finally, we report the results of extensive experiments with this algorithm using several publicly available datasets demonstrating the effectiveness of our technique.
2 0.8494767 112 nips-2009-Human Rademacher Complexity
Author: Xiaojin Zhu, Bryan R. Gibson, Timothy T. Rogers
Abstract: We propose to use Rademacher complexity, originally developed in computational learning theory, as a measure of human learning capacity. Rademacher complexity measures a learner’s ability to fit random labels, and can be used to bound the learner’s true error based on the observed training sample error. We first review the definition of Rademacher complexity and its generalization bound. We then describe a “learning the noise” procedure to experimentally measure human Rademacher complexities. The results from empirical studies showed that: (i) human Rademacher complexity can be successfully measured, (ii) the complexity depends on the domain and training sample size in intuitive ways, (iii) human learning respects the generalization bounds, (iv) the bounds can be useful in predicting the danger of overfitting in human learning. Finally, we discuss the potential applications of human Rademacher complexity in cognitive science. 1
3 0.82803982 204 nips-2009-Replicated Softmax: an Undirected Topic Model
Author: Geoffrey E. Hinton, Ruslan Salakhutdinov
Abstract: We introduce a two-layer undirected graphical model, called a “Replicated Softmax”, that can be used to model and automatically extract low-dimensional latent semantic representations from a large unstructured collection of documents. We present efficient learning and inference algorithms for this model, and show how a Monte-Carlo based method, Annealed Importance Sampling, can be used to produce an accurate estimate of the log-probability the model assigns to test data. This allows us to demonstrate that the proposed model is able to generalize much better compared to Latent Dirichlet Allocation in terms of both the log-probability of held-out documents and the retrieval accuracy.
4 0.75946152 104 nips-2009-Group Sparse Coding
Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow
Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification. 1
5 0.75858206 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm
Author: Rong Jin, Shijun Wang, Yang Zhou
Abstract: In this paper, we examine the generalization error of regularized distance metric learning. We show that with appropriate constraints, the generalization error of regularized distance metric learning could be independent from the dimensionality, making it suitable for handling high dimensional data. In addition, we present an efficient online learning algorithm for regularized distance metric learning. Our empirical studies with data classification and face recognition show that the proposed algorithm is (i) effective for distance metric learning when compared to the state-of-the-art methods, and (ii) efficient and robust for high dimensional data.
6 0.75683981 191 nips-2009-Positive Semidefinite Metric Learning with Boosting
7 0.75568801 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
8 0.75504524 72 nips-2009-Distribution Matching for Transduction
9 0.75353062 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
10 0.75280732 79 nips-2009-Efficient Recovery of Jointly Sparse Vectors
11 0.74999112 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
12 0.74967325 76 nips-2009-Efficient Learning using Forward-Backward Splitting
13 0.7481069 27 nips-2009-Adaptive Regularization of Weight Vectors
14 0.74772233 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
15 0.74745238 129 nips-2009-Learning a Small Mixture of Trees
16 0.74577045 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
17 0.74551427 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning
18 0.74491382 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
19 0.74391532 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
20 0.7437405 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes