nips nips2008 nips2008-122 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernel-based inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. [sent-8, score-0.234]
2 However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. [sent-9, score-0.049]
3 Based on the intuition that a good kernel-based inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. [sent-10, score-0.105]
4 The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. [sent-11, score-0.056]
5 Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. [sent-12, score-0.26]
6 Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. [sent-13, score-0.232]
7 1 Introduction Regularized Least Squares (RLS) algorithms have been drawing people’s attention since they were proposed due to their ability to avoid over-fitting problems and to express solutions as kernel expansions in terms of the training data [4, 9, 12, 13]. [sent-14, score-0.134]
8 Various modifications of RLS are made to improve its performance either from the viewpoint of manifold [1] or in a more generalized form [7, 11]. [sent-15, score-0.128]
9 For a constant function f = c, a nonzero term ||f ||K is penalized in both RLS and LapRLS [1]. [sent-18, score-0.19]
10 As a result, for a distribution generated by a nonzero constant function, the resulting regression function by both RLS and LapRLS is not a constant, as illustrated in the left diagram in Fig. [sent-19, score-0.229]
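As a quick illustration of this effect, the minimal sketch below fits standard RLS (kernel ridge regression) to labels generated by a constant function; the sample size, Gaussian RBF bandwidth, and regularization weight are illustrative assumptions rather than the settings behind the figure discussed here, but the fitted values come out visibly below the constant because ||f||_K is penalized.

```python
import numpy as np

# Minimal RLS (kernel ridge regression) sketch on labels generated by a constant
# function y = c.  Sample size, bandwidth t, and gamma are illustrative
# assumptions, not the settings of the paper's figure.
rng = np.random.default_rng(0)
l, c, gamma, t = 40, 2.0, 0.1, 0.05
x = rng.uniform(0.0, 1.0, size=l)
y = np.full(l, c)                                   # labels from the constant function

def rbf(a, b, t=0.05):
    # Gaussian RBF kernel matrix K[i, j] = exp(-(a_i - b_j)^2 / (4t))
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (4.0 * t))

K = rbf(x, x, t)
alpha = np.linalg.solve(K + gamma * l * np.eye(l), y)   # standard RLS solution
x_test = np.linspace(0.0, 1.0, 5)
print(rbf(x_test, x, t) @ alpha)   # noticeably below c = 2.0: the fit is not constant
```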
11 In this work, we aim to provide a new viewpoint for supervised or semi-supervised learning problems. [sent-22, score-0.046]
12 By such a viewpoint we can provide a general condition under which constant functions should not be penalized. [sent-23, score-0.132]
13 The basic idea is that, if a learning algorithm can learn an inductive function f (x) from examples generated by a joint probability distribution P on X × R, then the learned function f (x) and the marginal PX represent a new distribution on X × R, from which there is a re-learned function r(x). [sent-24, score-0.108]
14 Because the re-learned function depends on the underlying kernel, the difference f (x) − r(x) depends on f (x) and the kernel, and from this point of view, we name this work "learning with consistency between inductive functions and kernels". [sent-26, score-0.018]
15 The over-penalization of constant functions in the term ||f ||K causes the phenomenon that smaller γ can achieve better results. [sent-50, score-0.199]
16 On the other hand, the overfitting phenomenon when γ = 0 suggests the necessity of the regularization term. [sent-51, score-0.037]
17 Middle diagram: r(x) is very smooth, and f (x)−r(x) is the uneven part of f (x) that remains; therefore f (x)−r(x) should be penalized, while f is over-penalized in ||f ||K . [sent-53, score-0.259]
18 Right diagram: the proposed model has a stability property, so that a large variation of γ results in only small changes of the curves, suggesting a right way of penalizing functions. [sent-54, score-0.063]
19 2 Background The RKHS Theory enables us to express solutions of RLS as kernel expansions in terms of the training data. [sent-55, score-0.134]
20 Let X be a compact domain or manifold, ν be a Borel measure on X, and K : X × X → R be a Mercer kernel, then there is an associated Hilbert space RKHS HK of functions X → R with the corresponding norm || · ||K . [sent-58, score-0.089]
21 Moreover, an operator LK can be defined on HK as $(L_K f)(x) = \int_X f(y)K(x, y)\,d\nu(y)$, where $L^2_\nu(X)$ is the Hilbert space of square-integrable functions on X with the scalar product $\langle f, g\rangle_\nu = \int_X f(x)g(x)\,d\nu(x)$. [sent-62, score-0.061]
22 Given a Mercer kernel and a set of labeled examples $(x_i, y_i)$, $i = 1, \ldots, l$, [sent-63, score-0.163]
23 there are two popular inductive learning algorithms: RLS [12, 13] and the Nadaraya-Watson Formula [5, 8, 14]. [sent-66, score-0.086]
24 By the standard Tikhonov regularization, RLS is a special case of the following functional extremum problem: $f^* = \arg\min_{f \in H_K} \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma\|f\|_K^2$ (1), where V is some loss function. [sent-67, score-0.045]
25 The Nadaraya-Watson Formula is based on local weighted averaging, and it comes with a closed form: $r(x) = \sum_{i=1}^{l} y_i K(x, x_i) \big/ \sum_{i=1}^{l} K(x, x_i)$. (3) [sent-70, score-0.256]
26 The formula has a similar appearance as Eq. [sent-71, score-0.032]
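The closed form in Eq. (3) above is straightforward to implement; the sketch below is a minimal version with a Gaussian RBF kernel, where the bandwidth and the toy data are assumptions for illustration only.

```python
import numpy as np

# A minimal sketch of the Nadaraya-Watson formula of Eq. (3),
# r(x) = sum_i y_i K(x, x_i) / sum_i K(x, x_i), with a Gaussian RBF kernel.
# The bandwidth t and the toy data are assumptions for illustration only.
def nadaraya_watson(x_query, x_train, y_train, t=0.05):
    w = np.exp(-(x_query[:, None] - x_train[None, :]) ** 2 / (4.0 * t))  # K(x, x_i)
    return (w @ y_train) / w.sum(axis=1)

x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2.0 * np.pi * x_train)
print(nadaraya_watson(np.array([0.25, 0.5]), x_train, y_train))
```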
27 Let p(x) be a probability density function over X, P (x) be the corresponding cumulative distribution function, and f (x) be an inductive function. [sent-73, score-0.086]
28 If the examples $(x_i, y_i)$, $i = 1, \ldots, l$, are sampled from the function $y = f(x)$, then a Re-learned Function can be expressed as $r(x) = \lim_{l\to\infty} \frac{\sum_{i=1}^{l} f(x_i)K(x, x_i)}{\sum_{i=1}^{l} K(x, x_i)} = \frac{\int_X f(\alpha)K(x, \alpha)\,dP(\alpha)}{\int_X K(x, \alpha)\,dP(\alpha)} = \frac{L_K(f)}{\int_X K(x, \alpha)\,dP(\alpha)}$ (4), based on $f(x)$ and $P(x)$. [sent-77, score-0.21]
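A finite-sample version of Eq. (4) simply replaces the limit by a kernel-weighted average over samples drawn from P. In the sketch below the sample size, bandwidth, and the uniform choice of P are illustrative assumptions; note that a constant inductive function is reproduced exactly, anticipating the later claim that constant functions incur no penalty of the form $\|f - L_K(f)\|$.

```python
import numpy as np

# A sketch of the empirical re-learned function in Eq. (4): r(x) is the
# Nadaraya-Watson average of f at points x_i drawn from P.  For a constant f
# the weighted average returns that constant exactly.  Sample size, bandwidth,
# and the choice of P are illustrative assumptions.
rng = np.random.default_rng(1)
x_samples = rng.uniform(0.0, 1.0, size=2000)       # x_i drawn from P (uniform here)

def relearn(x_query, x_samples, f, t=0.05):
    w = np.exp(-(x_query[:, None] - x_samples[None, :]) ** 2 / (4.0 * t))
    return (w @ f(x_samples)) / w.sum(axis=1)       # empirical version of Eq. (4)

x_q = np.array([0.2, 0.5, 0.8])
print(relearn(x_q, x_samples, lambda x: np.full_like(x, 3.0)))  # ~[3, 3, 3]
print(relearn(x_q, x_samples, lambda x: np.sin(6.0 * x)))       # smoothed sin(6x)
```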
29 Consequently, f (x) − r(x) represents the intrinsically uneven part of f (x), which we will penalize. [sent-79, score-0.033]
30 This intuition is illustrated in the middle diagram in Fig. [sent-80, score-0.103]
31 Throughout this paper, we assume that $\int_X K(x, \alpha)\,dP(\alpha)$ is a constant, and for simplicity all kernels are normalized by $K / \int_X K(x, \alpha)\,dP(\alpha)$ so that $r(x) = L_K(f)$. [sent-82, score-0.109]
32 3 Partially-penalized Regularization For a given kernel K and an inductive function f , LK (f ) is the prediction function produced by K through the Nadaraya-Watson Formula. [sent-84, score-0.157]
33 Instead of $\|f\|_K^2$ in Eq. (1), penalizing the inconsistent part $f(x) - L_K(f)$ leads to the following Partially-penalized Regularization problem: $f^* = \arg\min_{f \in H_K} \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma\|f - L_K(f)\|_K^2$. (5) [sent-86, score-0.11]
34 If $\langle f_1, f_2\rangle_K = 0$, then $\|f_1 - L_K(f_1) + f_2 - L_K(f_2)\|_K^2 = \|f_1 - L_K(f_1)\|_K^2 + \|f_2 - L_K(f_2)\|_K^2$. It is well-known that the operator $L_K$ is compact, self-adjoint, and positive with respect to $L^2_\nu(X)$, and by the Spectral Theorem [2, 3], its eigenfunctions $e_1(x), e_2(x), \ldots$ [sent-90, score-0.024]
35 Let $f_1 = \sum_i a_i e_i(x)$ and $f_2 = \sum_i b_i e_i(x)$; then $f_1 - L_K(f_1) = \sum_i a_i e_i(x) - L_K(\sum_i a_i e_i(x)) = \sum_i a_i e_i(x) - \sum_i \lambda_i a_i e_i(x) = \sum_i (1 - \lambda_i) a_i e_i(x)$, and similarly, $f_2 - L_K(f_2) = \sum_i (1 - \lambda_i) b_i e_i(x)$. [sent-97, score-1.398]
36 By the discussions in [1], we have $\langle e_i, e_j\rangle_\nu = 0$ if $i \neq j$ and $\langle e_i, e_i\rangle_\nu = 1$; $\langle e_i, e_j\rangle_K = 0$ if $i \neq j$ and $\langle e_i, e_i\rangle_K = 1/\lambda_i$. [sent-98, score-0.881]
37 If we consider the situation that $a_i, b_i \geq 0$ for all $i \geq 1$, then $\langle f_1, f_2\rangle_K = 0$ implies that $a_i b_i = 0$ for all $i \geq 1$, and consequently $\langle f_1 - L_K(f_1), f_2 - L_K(f_2)\rangle_K = \sum_i (1 - \lambda_i)^2 a_i b_i \langle e_i(x), e_i(x)\rangle_K = 0$. [sent-99, score-0.58]
38 Theorem 2 Let $\mu_j(x)$ be a basis of the null space $H_0$ of the operator $I - L_K$, i.e. [sent-102, score-0.024]
39 Under Assumption 1, the minimizer of the optimization problem in Eq. [sent-105, score-0.038]
40 Any function $f \in H_K$ can be uniquely decomposed into a component $f_{\parallel}$ in the linear subspace spanned by the kernel functions $\{K(x_i, \cdot)\}_{i=1}^{l}$, and a component $f_{\perp}$ orthogonal to it. [sent-107, score-0.108]
41 By the reproducing property and the fact that $\langle f_{\perp}, K(x_i, \cdot)\rangle = 0$ for $1 \leq i \leq l$, we have $f(x_j) = \langle f, K(x_j, \cdot)\rangle = \langle \sum_{i=1}^{l} \alpha_i K(x_i, \cdot), K(x_j, \cdot)\rangle$. [sent-109, score-0.043]
42 (5) depend only on the value of the coefficients $\{\alpha_i\}_{i=1}^{l}$ and the Gram matrix of the kernel function. [sent-111, score-0.102]
43 (5) must have $\|f_{\perp} - L_K(f_{\perp})\|_K^2 = 0$, and therefore admits a representation $f^*(x) = f_{\perp} + \sum_{i=1}^{l} \alpha_i K(x_i, x)$. [sent-114, score-0.021]
44 3.1 Partially-penalized Regularized Least Squares (PRLS) Algorithm. In this section, we focus our attention on the case that $V(x_i, y_i, f) = (y_i - f(x_i))^2$, i.e. [sent-116, score-0.027]
45 $\ldots, \alpha_l]^T$, $K$ is the $l \times l$ Gram matrix $K_{ij} = K(x_i, x_j)$, and $\tilde{K}$ and $\bar{K}$ are reconstructed $l \times l$ matrices with $\tilde{K}_{ij} = \langle K(x_i, x), L_K(K(x_j, x))\rangle_K$ and $\bar{K}_{ij} = \langle L_K(K(x_i, x)), L_K(K(x_j, x))\rangle_K$. [sent-125, score-0.087]
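One possible way to assemble these matrices numerically and solve the squared-loss problem is sketched below. This is a rough derivation under stated assumptions, not the paper's exact closed form: by the reproducing property, $\tilde{K}_{ij}$ and $\bar{K}_{ij}$ become integrals against P, which are estimated here by Monte Carlo, while the null-space component of Theorem 2 and the kernel normalization assumed earlier are ignored for brevity. The data, bandwidth, and gamma are illustrative.

```python
import numpy as np

# Rough PRLS sketch for squared loss: minimize (1/l)||y - K a||^2 + gamma * a^T M a
# with M = K - 2*Kt + Kb, where Kt_ij = <K(x_i,.), L_K K(x_j,.)>_K and
# Kb_ij = <L_K K(x_i,.), L_K K(x_j,.)>_K.  By the reproducing property these are
# Kt_ij = int K(x_i,u) K(x_j,u) dP(u) and
# Kb_ij = int int K(x_i,u) K(u,v) K(x_j,v) dP(u) dP(v),
# estimated below by Monte Carlo with samples z ~ P.  Not the paper's formulas;
# the null-space term and the kernel normalization are ignored for brevity.
rng = np.random.default_rng(2)

def rbf(a, b, t=0.05):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (4.0 * t))

l, m, gamma = 30, 1000, 0.1
x = rng.uniform(0.0, 1.0, size=l)                 # labeled inputs
y = 2.0 + 0.3 * np.sin(4.0 * x)                   # toy targets
z = rng.uniform(0.0, 1.0, size=m)                 # Monte Carlo samples from P

K = rbf(x, x)                                     # Gram matrix K_ij = K(x_i, x_j)
A = rbf(x, z)                                     # A_is = K(x_i, z_s)
B = rbf(z, z)                                     # B_st = K(z_s, z_t)
Kt = A @ A.T / m                                  # ~ int K(x_i,u) K(x_j,u) dP(u)
Kb = A @ B @ A.T / m**2                           # ~ double integral with K(u,v)
M = K - 2.0 * Kt + Kb                             # regularizer of ||f - L_K f||_K^2

# First-order condition of the quadratic objective: (K K + gamma*l*M) a = K y
alpha = np.linalg.solve(K @ K + gamma * l * M, K @ y)
print(rbf(np.array([0.1, 0.5, 0.9]), x) @ alpha)  # predictions at a few test points
```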
46 3.2 The PLapRLS Algorithm. The idea in the previous section can also be extended to LapRLS in the manifold regularization framework [1]. [sent-140, score-0.119]
47 In the manifold setting, the smoothness on the data adjacency graph should be considered, and Eq. [sent-141, score-0.082]
48 (5) is modified as $f^* = \arg\min_{f \in H_K} \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A\|f - L_K(f)\|_K^2 + \frac{\gamma_I}{(u+l)^2}\sum_{i,j=1}^{l+u}(f(x_i) - f(x_j))^2 W_{ij}$ (12), where $W_{ij}$ are edge weights in the data adjacency graph. [sent-142, score-0.045]
49 For this optimization problem, the result in Theorem 2 can be modified slightly as: Theorem 3 Under Assumption 1, the minimizer of the optimization problem in Eq. [sent-144, score-0.038]
50 (12) admits an expansion $f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x)$. [sent-145, score-0.021]
51 Heat Kernels and the Computation of $\tilde{K}$ and $\bar{K}$. In this section we will illustrate the computation of $\tilde{K}$ and $\bar{K}$ in the case of heat kernels. [sent-168, score-0.232]
52 The basic facts about heat kernels are excerpted from [6], and for more materials, see [10]. [sent-169, score-0.341]
53 Given a manifold M and points x and y, the heat kernel Kt (x, y) is a special solution to the heat equation with a special initial condition called the delta function δ(x−y). [sent-170, score-0.617]
54 More specifically, δ(x−y) describes a unit heat source at position y with no heat in other positions. [sent-171, score-0.464]
55 Equation (15) describes the heat flow throughout a geometric manifold with initial conditions. [sent-174, score-0.314]
56 Then there exists a function K ∈ C ∞ (R+ × M × M), called the heat kernel, which satisfies the following properties for all x, y ∈ M, with Kt (x, y) = K(t, x, y): (1) Kt (x, y) defines a Mercer kernel. [sent-176, score-0.232]
57 (4) $1 = \int_M K_t(x, y)\,1\,dy$; and (5) when $M = \mathbb{R}^m$, $Lf$ is simplified as $\sum_i \frac{\partial^2 f}{\partial x_i^2}$, and the heat kernel takes the Gaussian RBF form $K_t(x, y) = (4\pi t)^{-m/2} e^{-\frac{\|x-y\|^2}{4t}}$. [sent-180, score-0.038]
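As a sanity check of this form, the snippet below evaluates the Gaussian RBF heat kernel on the real line and verifies Property (4) by a simple grid quadrature; the width t and the grid are arbitrary choices for illustration.

```python
import numpy as np

# Numerical check of the Gaussian RBF heat kernel on R^1,
# K_t(x, y) = (4*pi*t)^(-m/2) * exp(-(x - y)^2 / (4t)),
# and of Property (4), int K_t(x, y) dy = 1, via a grid quadrature.
t = 0.1

def heat_kernel(x, y, t, m=1):
    return (4.0 * np.pi * t) ** (-m / 2.0) * np.exp(-(x - y) ** 2 / (4.0 * t))

x = 0.3
y = np.linspace(x - 10.0, x + 10.0, 200001)   # wide enough to capture all the mass
dy = y[1] - y[0]
print(np.sum(heat_kernel(x, y, t) * dy))      # ~1.0, matching Property (4)
```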
58 From Theorem 2, we know that the functions in the null space H0 = {f ∈ HK |f − LK (f ) = 0} should not be penalized. [sent-185, score-0.067]
59 Although there may be looser assumptions that can guarantee the validity of the result in Theorem 2, there are two assumptions in this work: X is compact and $\int_X K(x, \alpha)\,dP(\alpha)$ in Eq. [sent-186, score-0.052]
60 Next we discuss the constant functions and the linear functions. [sent-188, score-0.086]
61 Under the two assumptions, a constant function c should not be penalized, because $c = \int_X c\,K(x, \alpha)p(\alpha)\,d\alpha \big/ \int_X K(x, \alpha)p(\alpha)\,d\alpha$, i.e. [sent-190, score-0.049]
62 For heat kernels, if P(x) is uniformly distributed on M, then by Property 4 in Theorem 4, $\int_X K(x, \alpha)\,dP(\alpha)$ is a constant, and so c should not be penalized. [sent-193, score-0.269]
63 For polynomial kernels, the theory cannot guarantee that constant functions should not be penalized even with a uniform distribution P (x). [sent-194, score-0.235]
64 For example, considering the polynomial kernel $xy + 1$ on the interval $X = [0, 1]$ and the uniform distribution on X, $\int_X (xy + 1)\,dP(y) = \int_0^1 (xy + 1)\,dy = x/2 + 1$ is not a constant. [sent-195, score-0.164]
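This one-line computation is easy to check numerically; the snippet below approximates the integral on a grid and recovers x/2 + 1 for a few values of x.

```python
import numpy as np

# Check the counterexample: for the polynomial kernel x*y + 1 on X = [0, 1]
# with the uniform distribution, int_0^1 (x*y + 1) dy = x/2 + 1, which
# depends on x, so it is not a constant.
y = np.linspace(0.0, 1.0, 100001)
for x in (0.0, 0.5, 1.0):
    print(x, np.mean(x * y + 1.0))   # ~1.0, 1.25, 1.5, i.e. x/2 + 1
```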
65 It is shown in Section 5.3 that not penalizing constant functions in polynomial kernels will result in much worse accuracy. [sent-197, score-0.275]
66 The reason for this phenomenon is that constant functions may not be smooth in the feature space produced by the polynomial kernel under some distributions. [sent-198, score-0.214]
67 The readers can deduce an example for p(x) such that $\int_0^1 (xy + 1)\,dP(y)$ happens to be a constant. [sent-199, score-0.028]
68 In the case when X is a closed ball $B_r$ with radius r, P(x) is uniformly distributed over $B_r$, and K is the Gaussian RBF kernel, $a^T x$ should not be penalized when r is big enough. [sent-201, score-0.189]
69 Since r is big enough, we have $\int_{\mathbb{R}^n} \cdot\,dx \approx \int_{B_r} \cdot\,dx$ and $\int_{B_r} K_t(x, y)\,dy \approx 1$, and so $a^T x = \int_{\mathbb{R}^n} K_t(x, y)\,a^T y\,dy \approx \int_{B_r} K_t(x, y)\,a^T y\,dy \approx L_K(a^T x)$. [sent-202, score-0.096]
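The step $\int_{\mathbb{R}^n} K_t(x, y)\,a^T y\,dy = a^T x$ can be checked numerically: $K_t(x,\cdot)$ is the density of $N(x, 2tI)$, so the integral is the mean of $a^T Y$ under that Gaussian. The snippet below does this by Monte Carlo; the vectors a, x and the width t are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check that int K_t(x, y) a^T y dy = a^T x on R^n, using the fact
# that K_t(x, .) is exactly the N(x, 2tI) density.  a, x, t are arbitrary.
rng = np.random.default_rng(4)
n, t = 3, 0.1
a = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.4])
y = x + np.sqrt(2.0 * t) * rng.standard_normal((500000, n))   # y ~ N(x, 2tI)
print(np.mean(y @ a), a @ x)   # the Monte Carlo average matches a^T x
```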
70 5 Experiments In this section, we evaluate the proposed algorithms PRLS and PLapRLS on a toy dataset (size: 40), a medium-sized dataset (size: 3,119), and a large-sized dataset (size: 20,000), and provide a counter example for constant functions on another dataset (size: 9,298). [sent-205, score-0.273]
71 We use the Gaussian RBF kernels in the first three datasets, and use polynomial kernels to provide a counter example on the last dataset. [sent-206, score-0.301]
72 The data and results for the toy dataset are illustrated in the left diagram and the right diagram in Fig. [sent-208, score-0.224]
73 The dataset contains utterances of 150 subjects who spoke the name of each letter of the alphabet twice. (Note that a subset of $\mathbb{R}^n$ is compact if and only if it is closed and bounded.) [sent-212, score-0.136]
74 This is the reason why we cannot talk about Rn directly. [sent-214, score-0.021]
75 The speakers were grouped into 5 sets of 30 speakers each. [sent-216, score-0.102]
76 The data of the first 30 speakers forms a training set of 1,560 examples, and that of the last 29 speakers forms the test set. [sent-217, score-0.102]
77 To simulate a real-world situation, we use 30 binary classification problems corresponding to 30 splits of the training data, where all 52 utterances of one speaker were labeled and all the rest were left unlabeled. [sent-219, score-0.13]
78 2, we can see that both PRLS and PLapRLS make significant performance improvements over their corresponding counterparts on both unlabeled data and test set. [sent-228, score-0.069]
79 5.2 UCI Dataset Letter about Printed Letter Recognition. In Dataset Letter, there are 16 features for each example, and there are 26 classes representing the upper-case printed letters. [sent-230, score-0.038]
80 The first 400 examples were taken to form the training set. [sent-231, score-0.022]
81 For each of the four algorithms RLS, PRLS, LapRLS, and PLapRLS, for each of the 26 one-versus-all binary classification tasks, and for each of 10 runs, two examples for each class were randomly labeled. [sent-236, score-0.022]
82 For each algorithm, the averages over all the 260 one-versus-all binary classification error rates for unlabeled 398 examples and test set are listed respectively as follows: (5. [sent-237, score-0.114]
83 From the results, we can see that RLS is improved on both unlabeled examples and test set. [sent-244, score-0.091]
84 The fact that there is no error in the total 260 tasks for LapRLS and PLapRLS on unlabeled examples suggests that the data is distributed in a curved manifold. [sent-245, score-0.137]
85 On a curved manifold, the heat kernels do not take the Gaussian RBF form, and so PLapRLS using the Gaussian RBF form cannot achieve its best. [sent-246, score-0.368]
86 This is the reason why we can observe that PLapRLS is slightly worse than LapRLS on the test set. [sent-247, score-0.021]
87 This suggests the need for extensive further investigation of heat kernels on manifolds. [sent-248, score-0.341]
88 5.3 A Counter Example in Handwritten Digit Recognition. Note that polynomial kernels with degree 3 were used on the USPS dataset in [1], and 2 images for each class were randomly labeled. [sent-250, score-0.18]
89 (2), then the averages of 45 pairwise binary classification error rates are 8. [sent-253, score-0.023]
90 41% for unlabeled 398 images and 8,898 images in the test set respectively. [sent-255, score-0.069]
91 If constant functions are not l penalized, then we should use f ∗ (x) = i=1 αi K(xi , x) + a, and the corresponding error rates are 9. [sent-256, score-0.109]
92 By this example, we show that leaving constant functions outside the regularization term is dangerous; however, it is fortunate that we have a theory to guide this in Section 4: if X is compact and X K(x, α)dP (α) in Eq. [sent-259, score-0.175]
93 (4) is a constant, then constant functions should not be penalized. [sent-260, score-0.086]
94 6 Conclusion A novel learning scheme is proposed based on a new viewpoint of penalizing the inconsistent part between inductive functions and kernels. [sent-261, score-0.253]
95 In theoretical aspects, we have three important claims: (1) On a compact domain or manifold, if the denominator in Eq. [sent-262, score-0.052]
96 (4) is a constant, then there is a new Representer Theorem; (2) The same conditions become a sufficient condition under which constant functions should not be penalized; and (3) under the same conditions, a function belongs to the null space if and only if the function should not be penalized. [sent-263, score-0.116]
97 Empirically, we claim that the novel learning scheme can achieve accuracy improvement in practical applications. [sent-264, score-0.019]
98 Acknowledgments The work described in this paper was supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. [sent-265, score-0.019]
99 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-270, score-0.112]
100 Generalized regularized least-squares learning with predefined features in a Hilbert space. [sent-295, score-0.039]
wordName wordTfidf (topN-words)
[('lk', 0.564), ('rls', 0.363), ('prls', 0.304), ('plaprls', 0.266), ('heat', 0.232), ('laprls', 0.232), ('kt', 0.168), ('ei', 0.134), ('hk', 0.123), ('representer', 0.114), ('penalized', 0.113), ('kernels', 0.109), ('xi', 0.105), ('diagram', 0.086), ('inductive', 0.086), ('dp', 0.082), ('manifold', 0.082), ('kernel', 0.071), ('unlabeled', 0.069), ('rbf', 0.065), ('br', 0.064), ('speaker', 0.057), ('kij', 0.057), ('isolet', 0.057), ('xy', 0.057), ('ai', 0.057), ('xj', 0.056), ('compact', 0.052), ('mercer', 0.051), ('speakers', 0.051), ('letter', 0.051), ('constant', 0.049), ('counter', 0.047), ('theorem', 0.047), ('viewpoint', 0.046), ('vs', 0.045), ('penalizing', 0.044), ('labeled', 0.043), ('bi', 0.041), ('regularized', 0.039), ('haixuan', 0.038), ('printed', 0.038), ('wenye', 0.038), ('ydy', 0.038), ('minimizer', 0.038), ('regularization', 0.037), ('functions', 0.037), ('polynomial', 0.036), ('dataset', 0.035), ('uneven', 0.033), ('kx', 0.033), ('px', 0.033), ('formula', 0.032), ('gram', 0.031), ('null', 0.03), ('utterances', 0.03), ('tikhonov', 0.03), ('squares', 0.03), ('hilbert', 0.03), ('lf', 0.028), ('readers', 0.028), ('nonzero', 0.028), ('yi', 0.027), ('wij', 0.027), ('curved', 0.027), ('yl', 0.027), ('discussions', 0.027), ('kong', 0.026), ('dy', 0.025), ('ej', 0.025), ('hong', 0.025), ('rn', 0.025), ('reproducing', 0.024), ('operator', 0.024), ('dx', 0.023), ('expansions', 0.023), ('rates', 0.023), ('express', 0.023), ('examples', 0.022), ('admits', 0.021), ('inconsistent', 0.021), ('reason', 0.021), ('big', 0.02), ('rkhs', 0.019), ('grants', 0.019), ('closed', 0.019), ('property', 0.019), ('distributed', 0.019), ('scheme', 0.019), ('nitely', 0.018), ('consequently', 0.018), ('arg', 0.018), ('uniformly', 0.018), ('name', 0.018), ('uci', 0.017), ('laplacian', 0.017), ('width', 0.017), ('solutions', 0.017), ('illustrated', 0.017), ('irwin', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1
2 0.13593483 193 nips-2008-Regularized Co-Clustering with Dual Supervision
Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic
Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1
3 0.11283199 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
4 0.081201874 44 nips-2008-Characteristic Kernels on Groups and Semigroups
Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1
5 0.072944976 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1
6 0.069790021 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
7 0.066447139 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
8 0.063633204 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
9 0.061708227 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
10 0.061680667 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
11 0.061519988 113 nips-2008-Kernelized Sorting
12 0.059212636 138 nips-2008-Modeling human function learning with Gaussian processes
13 0.058644414 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
14 0.0568421 78 nips-2008-Exact Convex Confidence-Weighted Learning
15 0.056327257 178 nips-2008-Performance analysis for L\ 2 kernel classification
16 0.056097575 62 nips-2008-Differentiable Sparse Coding
17 0.054906067 194 nips-2008-Regularized Learning with Networks of Features
18 0.05392972 151 nips-2008-Non-parametric Regression Between Manifolds
19 0.052739125 119 nips-2008-Learning a discriminative hidden part model for human action recognition
20 0.052413344 202 nips-2008-Robust Regression and Lasso
topicId topicWeight
[(0, -0.151), (1, -0.06), (2, -0.071), (3, 0.04), (4, 0.022), (5, 0.007), (6, 0.017), (7, -0.024), (8, 0.009), (9, -0.016), (10, 0.16), (11, -0.073), (12, -0.058), (13, -0.035), (14, 0.007), (15, 0.045), (16, 0.003), (17, 0.015), (18, -0.014), (19, 0.015), (20, -0.056), (21, -0.027), (22, -0.002), (23, -0.081), (24, 0.021), (25, -0.05), (26, 0.034), (27, -0.037), (28, 0.122), (29, 0.033), (30, -0.03), (31, -0.015), (32, 0.026), (33, 0.08), (34, -0.044), (35, -0.07), (36, 0.089), (37, 0.03), (38, 0.047), (39, -0.022), (40, -0.034), (41, 0.108), (42, -0.105), (43, 0.079), (44, 0.05), (45, 0.025), (46, 0.016), (47, 0.068), (48, -0.003), (49, -0.161)]
simIndex simValue paperId paperTitle
same-paper 1 0.90165085 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1
2 0.57039821 193 nips-2008-Regularized Co-Clustering with Dual Supervision
Author: Vikas Sindhwani, Jianying Hu, Aleksandra Mojsilovic
Abstract: By attempting to simultaneously partition both the rows (examples) and columns (features) of a data matrix, Co-clustering algorithms often demonstrate surprisingly impressive performance improvements over traditional one-sided row clustering techniques. A good clustering of features may be seen as a combinatorial transformation of the data matrix, effectively enforcing a form of regularization that may lead to a better clustering of examples (and vice-versa). In many applications, partial supervision in the form of a few row labels as well as column labels may be available to potentially assist co-clustering. In this paper, we develop two novel semi-supervised multi-class classification algorithms motivated respectively by spectral bipartite graph partitioning and matrix approximation formulations for co-clustering. These algorithms (i) support dual supervision in the form of labels for both examples and/or features, (ii) provide principled predictive capability on out-of-sample test data, and (iii) arise naturally from the classical Representer theorem applied to regularization problems posed on a collection of Reproducing Kernel Hilbert Spaces. Empirical results demonstrate the effectiveness and utility of our algorithms. 1
3 0.56412148 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
Author: Andrew Smith, Hongyuan Zha, Xiao-ming Wu
Abstract: We study the convergence and the rate of convergence of a local manifold learning algorithm: LTSA [13]. The main technical tool is the perturbation analysis on the linear invariant subspace that corresponds to the solution of LTSA. We derive a worst-case upper bound of errors for LTSA which naturally leads to a convergence result. We then derive the rate of convergence for LTSA in a special case. 1
4 0.53732109 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
5 0.53468382 44 nips-2008-Characteristic Kernels on Groups and Semigroups
Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur
Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1
6 0.53364664 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
7 0.49547133 113 nips-2008-Kernelized Sorting
8 0.49367303 151 nips-2008-Non-parametric Regression Between Manifolds
9 0.45425487 178 nips-2008-Performance analysis for L\ 2 kernel classification
10 0.45013535 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
11 0.44559145 126 nips-2008-Localized Sliced Inverse Regression
12 0.43538105 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
13 0.43294773 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
14 0.43014759 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
15 0.4287017 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
16 0.41341034 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
17 0.41076264 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
18 0.40771317 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
19 0.39159346 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
20 0.38790256 225 nips-2008-Supervised Bipartite Graph Inference
topicId topicWeight
[(6, 0.045), (7, 0.077), (12, 0.03), (28, 0.116), (57, 0.046), (59, 0.013), (63, 0.013), (77, 0.498), (83, 0.044)]
simIndex simValue paperId paperTitle
Author: Christoph Kolodziejski, Bernd Porr, Minija Tamosiunaite, Florentin Wörgötter
Abstract: In this theoretical contribution we provide mathematical proof that two of the most important classes of network learning - correlation-based differential Hebbian learning and reward-based temporal difference learning - are asymptotically equivalent when timing the learning with a local modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation based perspective that is more closely related to the biophysics of neurons. 1
2 0.88607579 61 nips-2008-Diffeomorphic Dimensionality Reduction
Author: Christian Walder, Bernhard Schölkopf
Abstract: This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets. 1
same-paper 3 0.87092614 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1
4 0.85381955 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1
5 0.70820612 216 nips-2008-Sparse probabilistic projections
Author: Cédric Archambeau, Francis R. Bach
Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a preprocessing tool for the construction of template attacks. 1
6 0.58620119 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
7 0.57706654 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
8 0.52781463 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
9 0.52117056 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
10 0.51109725 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
11 0.50577426 200 nips-2008-Robust Kernel Principal Component Analysis
12 0.49623135 44 nips-2008-Characteristic Kernels on Groups and Semigroups
13 0.49433219 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
14 0.48799536 195 nips-2008-Regularized Policy Iteration
15 0.47805038 238 nips-2008-Theory of matching pursuit
16 0.46905217 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks
17 0.46882436 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
18 0.46811748 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
19 0.46745902 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
20 0.46344727 151 nips-2008-Non-parametric Regression Between Manifolds