nips nips2010 nips2010-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Optimal learning rates for Kernel Conjugate Gradient regression Nicole Krämer, Weierstrass Institute, Mohrenstr. [sent-1, score-0.199]
2 Abstract We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. [sent-6, score-0.717]
3 This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. [sent-7, score-0.21]
4 The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. [sent-8, score-0.577]
5 Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. [sent-9, score-0.576]
6 If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. [sent-10, score-0.264]
7 The order of the learning rates matches state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. [sent-11, score-0.417]
8 1 Introduction The contribution of this paper is the learning theoretical analysis of kernel-based least squares regression in combination with conjugate gradient techniques. [sent-12, score-0.371]
9 Following the kernelization principle, we implicitly map the data into a reproducing kernel Hilbert space H with a kernel k. [sent-19, score-0.387]
10 We denote by Kn = (1/n) (k(Xi, Xj))_{1 ≤ i,j ≤ n} ∈ R^{n×n} the normalized kernel matrix and by Υ = (Y1, ..., Yn) the response vector. [sent-20, score-0.165]
11 The task is to find coefficients α such that the function defined by the normalized kernel expansion fα(X) = (1/n) Σ_{i=1}^{n} αi k(Xi, X) is an adequate estimator of the true regression function f∗. [sent-24, score-0.254]
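As a small illustration (not taken from the paper), the following Python sketch evaluates such a normalized kernel expansion at new points; the Gaussian kernel choice and the function names are assumptions made only for the example.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix (k(a, b)) for the rows of A and B; sigma is an illustrative choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def predict(alpha, X_new, X_train, kernel=rbf_kernel):
    """Evaluate the normalized kernel expansion f_alpha(x) = (1/n) * sum_i alpha_i k(X_i, x)."""
    n = len(X_train)
    return kernel(X_new, X_train) @ alpha / n
```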
12 It is well-known that to avoid overfitting, some form of regularization is needed. [sent-29, score-0.11]
13 Perhaps the most well-known one is α = (Kn + λI)−1 Υ, (2) known alternatively as kernel ridge regression, Tikhonov’s regularization, least squares support vector machine, or MAP Gaussian process regression. [sent-33, score-0.357]
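For comparison, the closed form (2) admits a direct one-line implementation; this hedged sketch assumes the normalized kernel matrix Kn is supplied as a dense numpy array.

```python
import numpy as np

def kernel_ridge_alpha(K, y, lam):
    """Closed-form coefficients alpha = (K_n + lambda * I)^{-1} y of kernel ridge regression (2)."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)
```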
14 This type of regularization, which we refer to as linear regularization methods, is directly inspired by the theory of inverse problems. [sent-37, score-0.145]
15 Popular examples include, as particular cases, kernel ridge regression, principal components regression, and L2-boosting. [sent-38, score-0.249]
16 In this paper, we study conjugate gradient (CG) techniques in combination with early stopping for the regularization of the kernel-based learning problem (1). [sent-41, score-0.725]
17 In the learning context considered here, regularization corresponds to early stopping. [sent-52, score-0.155]
18 Algorithm 1 displays the computation of the CG kernel coefficients αm defined by (5). [sent-57, score-0.165]
19 Algorithm 1 Kernel Conjugate Gradient regression. Input: kernel matrix Kn, response vector Υ, maximum number of iterations m. Initialization: α0 = 0n; r1 = Υ; d1 = Υ; t1 = Kn Υ. [sent-58, score-0.216]
20 for i = 1, ..., m do: ti = ti / ‖ti‖_{Kn}; di = di / ‖ti‖_{Kn} (normalization of the basis, resp. update directions); γi = ⟨ti, Υ⟩_{Kn} (projection of Υ on basis vector). [sent-61, score-0.354]
21 αi = αi−1 + γi di (update); ri+1 = ri − γi ti (residuals); di+1 = ri+1 − di ⟨ti, Kn ri+1⟩_{Kn}; ti+1 = Kn di+1 (new update, resp. basis directions). [sent-63, score-0.252]
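The following Python sketch transcribes Algorithm 1 as reconstructed above; it is meant as illustrative, runnable pseudocode rather than the authors' implementation, and the scaling of the Kn inner product is an assumption (the iterates do not depend on that scaling).

```python
import numpy as np

def kernel_cg(K, y, m):
    """Kernel CG regression coefficients alpha_1, ..., alpha_m (Algorithm 1 as reconstructed above).

    K : normalized kernel matrix K_n, y : response vector, m : maximum number of iterations.
    The K_n inner product is assumed to be <u, v>_{K_n} = u^T K v / n."""
    n = len(y)
    ip = lambda u, v: u @ K @ v / n                  # <u, v>_{K_n}
    norm = lambda u: np.sqrt(max(ip(u, u), 0.0))

    alpha = np.zeros(n)
    r, d, t = y.copy(), y.copy(), K @ y              # residuals, update direction, basis vector
    path = []
    for _ in range(m):
        nt = norm(t)
        if nt <= 1e-12 * max(norm(y), 1e-300):       # residual exhausted: at most n steps needed
            break
        t, d = t / nt, d / nt                        # normalization of basis / update direction
        gamma = ip(t, y)                             # projection of y on the basis vector
        alpha = alpha + gamma * d                    # coefficient update
        r = r - gamma * t                            # residual update
        d = r - d * ip(t, K @ r)                     # new update direction
        t = K @ d                                    # new basis vector
        path.append(alpha.copy())
    return path
```

Where the reconstruction of the update formulas is in doubt, the direct Krylov least-squares computation sketched after equation (9) below can serve as a cross-check.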
22 However, the polynomial qm is not fixed but depends on Υ as well, making the CG method nonlinear in the sense that the coefficients αm depend on Υ in a nonlinear fashion. [sent-65, score-0.206]
23 We remark that in machine learning, conjugate gradient techniques are often used as fast solvers for operator equations, e.g. [sent-66, score-0.303]
24 We stress that in this paper, we study conjugate gradients as a regularization approach for kernel-based learning, where regularization is ensured via early stopping. [sent-69, score-0.559]
25 Moreover, a similar conjugate gradient approach for non-definite kernels has been proposed and empirically evaluated by Ong et al [17]. [sent-75, score-0.161]
26 In particular, we establish the existence of early stopping rules that lead to optimal convergence rates. [sent-77, score-0.362]
27 We first assume that the kernel space H is separable and that the kernel function is measurable. [sent-80, score-0.33]
28 Furthermore, for all results, we make the (relatively standard) assumption that the kernel is bounded: k(x, x) ≤ κ for all x ∈ X. [sent-82, score-0.191]
29 In particular, the first assumption implies that not only the noise, but also the target function f ∗ is bounded in supremum norm, while the second assumption does not put any additional restriction on the target function. [sent-87, score-0.295]
30 The regularity of the target function f ∗ is measured in terms of a source condition as follows. [sent-88, score-0.251]
31 The kernel integral operator is given by K : L2(PX) → L2(PX), g ↦ ∫ k(·, x) g(x) dPX(x). [sent-89, score-0.351]
32 The regularity of the kernel operator K with respect to the marginal distribution PX is measured in terms of the so-called effective dimensionality condition, defined by the two parameters s ∈ (0, 1), D ≥ 0 and the condition ED(s, D): tr(K(K + λI)^{-1}) ≤ D^2 (κ^{-1} λ)^{-s} for all λ ∈ (0, 1]. [sent-94, score-0.474]
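As a hedged illustration, the empirical plug-in analogue of the quantity bounded in ED(s, D) can be computed directly from the normalized kernel matrix; the condition itself is stated for the population integral operator K, so this is only an approximation.

```python
import numpy as np

def empirical_effective_dimension(K, lam):
    """Plug-in analogue of tr(K (K + lam * I)^{-1}) computed from the normalized kernel matrix."""
    n = K.shape[0]
    return float(np.trace(np.linalg.solve(K + lam * np.eye(n), K)))
```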
33 It is known that the best attainable rates of convergence, as a function of the number of examples n, are determined by the parameters r and s in the above conditions: It was shown in [6] that the minimax learning rate given these two parameters is lower bounded by O(n−2r/(2r+s) ). [sent-96, score-0.229]
34 It is not difficult to prove from (4) and (5) that ‖Υ − Kn αn‖_{Kn} = 0, so that the above type of stopping rule always has m ≤ n. [sent-104, score-0.337]
35 1 Inner case without knowledge of the effective dimension The inner case corresponds to r ≥ 1/2, i.e. [sent-106, score-0.092]
36 For some constants τ > 1 and 1 > γ > 0, we consider the discrepancy stopping rule with the threshold sequence Λm = 4τ ( κ ‖αm‖_{Kn} √(κ log(2γ^{-1})/n) + M √(log(2γ^{-1})/n) ). (7) [sent-109, score-0.432]
37 For technical reasons, we consider a slight variation of the rule in that we stop at step m − 1 instead of m if qm(0) ≥ 4κ log(2γ^{-1})/n, where qm is the iteration polynomial such that αm = qm(Kn)Υ. [sent-110, score-0.525]
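A minimal sketch of the discrepancy-principle stopping loop follows; the threshold sequence Λm is passed in by the caller, since the exact constants of (7) depend on the reconstruction above, and the 1/n scaling of the Kn-norm is an assumption.

```python
import numpy as np

def kn_norm(u, K):
    """||u||_{K_n}, assuming <u, v>_{K_n} = u^T K v / n."""
    return np.sqrt(max(u @ K @ u / len(u), 0.0))

def discrepancy_stop(alphas, K, y, thresholds):
    """Return the first (m, alpha_m) with ||y - K alpha_m||_{K_n} < Lambda_m."""
    for m, (alpha, lam) in enumerate(zip(alphas, thresholds), start=1):
        if kn_norm(y - K @ alpha, K) < lam:
            return m, alpha
    return len(alphas), alphas[-1]            # fall back to the last computed iterate
```

With the Algorithm 1 sketch above, `alphas = kernel_cg(K, y, len(y))` supplies the iterates.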
38 Suppose that Y is bounded (Bounded), and that the source condition SC(r, ρ) holds for r ≥ 1/2. [sent-115, score-0.119]
39 With probability 1 − 2γ, the estimator f̃m obtained by the (modified) discrepancy stopping rule (7) satisfies ‖f̃m − f∗‖_2^2 ≤ c(r, τ)(M + ρ)^2 ( log^2(2γ^{-1}) / n )^{2r/(2r+1)}. [sent-116, score-0.816]
40 2 Optimal rates in inner case We now introduce a stopping rule yielding order-optimal convergence rates as a function of the two parameters r and s in the “inner” case (r ≥ 1/2, which is equivalent to saying that the target function belongs to H almost surely). [sent-119, score-0.868]
41 For some constant τ > 3/2 and 1 > γ > 0, we consider the discrepancy stopping rule with the fixed threshold Λm ≡ Λ = τ M √κ ( 4D √6 log(γ^{-1}) / √n )^{(2r+1)/(2r+s)}. (8) [sent-120, score-0.432]
42 Suppose that the noise fulfills the Bernstein assumption (Bernstein), that the source condition SC(r, ρ) holds for r ≥ 1/2, and that ED(s, D) holds. [sent-123, score-0.098]
43 With probability 1 − 3γ, the estimator f̂m obtained by the discrepancy stopping rule (8) satisfies ‖f̂m − f∗‖_2^2 ≤ c(r, τ)(M + ρ)^2 ( 16 · 6 · D^2 log^2(γ^{-1}) / n )^{2r/(2r+s)}. [sent-124, score-0.816]
44 3 Optimal rates in outer case, given additional unlabeled data We now turn to the “outer” case. [sent-127, score-0.244]
45 We use the same threshold (8) as in the previous section for the stopping rule, except that the factor M is replaced by max(M, ρ). [sent-146, score-0.244]
46 Then with probability 1 − 3γ, the estimator f̂m obtained by the discrepancy stopping rule defined above satisfies ‖f̂m − f∗‖_2^2 ≤ c(r, τ)(M + ρ)^2 ( 16 · 6 · D^2 log^2(γ^{-1}) / n )^{2r/(2r+s)}. [sent-151, score-0.816]
47 f ∗ ∈ H almost surely – we provide two different consistent stopping criteria. [sent-155, score-0.318]
48 However, an interesting feature of stopping rule (7) is that the rule itself does not depend on the a priori knowledge of the regularity parameter r, while the achieved learning rate does (and with the optimal dependence in r when s = 1). [sent-158, score-0.556]
49 1 implies that the obtained rule is automatically adaptive with respect to the regularity of the target function. [sent-160, score-0.303]
50 This contrasts with the results obtained in [1] for linear regularization schemes of the form (3) (also in the case s = 1), for which the choice of the regularization parameter λ leading to optimal learning rates required the knowledge of r beforehand. [sent-161, score-0.414]
51 2 provides the order-optimal convergence rate in the inner case (up to a log factor). [sent-163, score-0.101]
52 1 however, is that the stopping rule is no longer adaptive, that is, it depends on the a priori knowledge of parameters r and s. [sent-165, score-0.37]
53 We observe that previously obtained results for linear regularization schemes of the form (2) in [6] and of the form (3) in [5], also rely on the a priori knowledge of r and s to determine the appropriate regularization parameter λ. [sent-166, score-0.299]
54 The outer case – when the target function does not lie in the reproducing kernel Hilbert space H – is more challenging and to some extent less well understood. [sent-167, score-0.192]
55 The fact that additional assumptions are made is not a particular artefact of CG methods, but also appears in the studies of other regularization techniques. [sent-168, score-0.133]
56 [5] (to study linear regularization of the form (3)) and assume that we have sufficient additional unlabeled data in order to ensure learning rates that are optimal as a function of the number of labeled data. [sent-171, score-0.305]
57 For regularized M-estimation schemes studied in [20], availability of unlabeled data is not required, but a condition is imposed of the form ‖f‖_∞ ≤ C ‖f‖_H^p ‖f‖_2^{1−p} for all f ∈ H and some p ∈ (0, 1]. [sent-173, score-0.157]
58 In [13], assumptions on the supremum norm of the eigenfunctions of the kernel integral operator are made (see [20] for an in-depth discussion on this type of assumptions). [sent-174, score-0.43]
59 In the context of learning, our approach is most closely linked to Partial Least Squares (PLS) [21] and its kernel extension [18]. [sent-176, score-0.165]
60 In [8, 14], consistency properties are provided for linear PLS under the assumption that the target function f ∗ depends on a finite known number of orthogonal latent components. [sent-178, score-0.112]
61 These findings were recently extended to the nonlinear case and without the assumption of a latent components model [3], but all results come without optimal rates of convergence. [sent-179, score-0.197]
62 For the slightly different CG approach studied by Ong et al. [17], bounds on the difference between the empirical risks of the CG approximation and of the target function are derived in [16], but no bounds on the generalization error are given. [sent-180, score-0.136]
63 4 Proofs Convergence rates for regularization methods of the type (2) or (3) have been studied by casting kernel learning methods into the framework of inverse problems (see [9]). [sent-181, score-0.458]
64 We first define the empirical evaluation operator Tn : g ∈ H ↦ Tn g := (g(X1), ..., g(Xn)) ∈ R^n, and the empirical integral operator Tn∗ : u = (u1, ..., un) ∈ R^n ↦ Tn∗ u := (1/n) Σ_{i=1}^{n} ui k(Xi, ·) ∈ H. [sent-183, score-0.142]
66 Using the reproducing property of the kernel, it can be readily checked that Tn and Tn∗ are adjoint operators, i.e. [sent-190, score-0.086]
67 Based on these facts, equation (5) can be rewritten as fm = arg min_{f ∈ Km(Tn∗ Υ, Sn)} ‖Tn∗ Υ − Sn f‖_H, (9) where Sn = Tn∗ Tn is a self-adjoint operator of H, called the empirical covariance operator. [sent-194, score-0.339]
68 This definition corresponds to that of the “usual” conjugate gradient algorithm formally applied to the so-called normal equation (in H) Sn fα = Tn∗ Υ, which is obtained from (1) by left multiplication by Tn∗. [sent-195, score-0.185]
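To make the Krylov-space characterization in (5)/(9) concrete, here is a hedged sketch that computes αm directly as the minimizer of ‖Υ − Kn α‖_{Kn} over span{Υ, Kn Υ, ..., Kn^{m−1} Υ}; it assumes ‖u‖_{Kn}^2 = u^T Kn u / n and is intended only as a slow reference against which an iterative CG implementation can be checked.

```python
import numpy as np

def cg_alpha_via_krylov(K, y, m):
    """alpha_m as the minimizer of ||y - K alpha||_{K_n} over span{y, K y, ..., K^{m-1} y}.

    Direct (non-iterative) reference computation of the iterate characterized by (5)/(9);
    numerically ill-conditioned for larger m, so use it only as a sanity check."""
    n = len(y)
    V = np.empty((n, m))
    v = y.copy()
    for j in range(m):                        # Krylov basis: y, K y, ..., K^{m-1} y
        V[:, j] = v
        v = K @ v
    w, U = np.linalg.eigh(K / n)              # symmetric square root of K/n
    W = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T
    c, *_ = np.linalg.lstsq(W @ (K @ V), W @ y, rcond=None)
    return V @ c
```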
69 ∫ k(·, x) g(x) dPX(x) = E[k(X, ·)g(X)] ∈ H, and the change-of-space operator T : g ∈ H → g ∈ L2(PX). [sent-197, score-0.142]
70 (11) With probability at least 1 − γ, the deviation ‖Tn∗ Υ − S fH‖ is bounded by a quantity of order 1/√n. We note that f∗ = T fH implies that the target function f∗ coincides with a function fH belonging to H (remember that T is just the change-of-space operator). [sent-206, score-0.086]
71 1 Nemirovskii’s result on conjugate gradient regularization rates We recall a sharp result due to Nemirovskii [15] establishing convergence rates for conjugate gradient methods in a deterministic context. [sent-209, score-0.771]
72 Consider the linear equation Az∗ = b, where A is a bounded linear operator over a Hilbert space H. [sent-212, score-0.213]
73 Assume that the above equation has a solution and denote z∗ its minimal norm solution; assume further that a self-adjoint operator Ā and an element b̄ ∈ H are known such that ‖A − Ā‖ ≤ δ, ‖b − b̄‖ ≤ ε, (12) (with δ and ε known positive numbers). [sent-213, score-0.198]
74 Consider the CG algorithm based on the noisy operator Ā and data b̄, giving the output at step m: zm = arg min_{z ∈ Km(Ā, b̄)} ‖Āz − b̄‖^2. (13) [sent-214, score-0.22]
75 The discrepancy principle stopping rule is defined as follows. [sent-215, score-0.456]
76 Consider a fixed constant τ > 1 and define m̄ = min{ m ≥ 0 : ‖Āzm − b̄‖ < τ(δ ‖zm‖ + ε) }. [sent-216, score-0.078]
77 Consider a minor variation of this rule: m̃ = m̄ if qm̄(0) < η δ^{-1}, and m̃ = max(0, m̄ − 1) otherwise, where qm is the degree m − 1 polynomial such that zm = qm(Ā) b̄, and η is an arbitrary positive constant such that η < 1/τ. [sent-218, score-0.51]
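For illustration, the discrepancy rule itself can be sketched in a few lines; the iterates z_1, z_2, ... are assumed to be produced by any CG routine for the perturbed system, and τ = 1.5 is an arbitrary example value.

```python
import numpy as np

def nemirovskii_stop(iterates, A_bar, b_bar, delta, eps, tau=1.5):
    """First m with ||A_bar z_m - b_bar|| < tau * (delta * ||z_m|| + eps)."""
    for m, z in enumerate(iterates, start=1):
        if np.linalg.norm(A_bar @ z - b_bar) < tau * (delta * np.linalg.norm(z) + eps):
            return m
    return len(iterates)   # threshold never reached within the given iterates
```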
78 We apply Nemirovskii's result in our setting (assuming r ≥ 1/2): by identifying the approximate operator and data as Ā = Sn and b̄ = Tn∗ Υ, we see that the CG algorithm considered by Nemirovskii (13) is exactly (9), more precisely with the identification zm = fm. [sent-225, score-0.393]
79 The operator norm is upper bounded by the Hilbert-Schmidt norm, so that the deviation inequality for the operators is actually stronger than what is needed. [sent-233, score-0.268]
80 We consider the discrepancy principle stopping rule associated to these parameters, the choice η = 1/(2τ), and θ = 1/2, thus obtaining the result, since [sent-234, score-0.456]
81 ‖A^{1/2}(zm − z∗)‖^2 = ‖S^{1/2}(f̂m − fH)‖_H^2 = ‖f̂m − f∗‖_2^2. [sent-235, score-0.173]
82 The above proof shows that an application of Nemirovskii's fundamental result for CG regularization of inverse problems under deterministic noise (on the data and the operator) allows us to obtain our first result. [sent-238, score-0.176]
83 1 is too coarse to prove the optimal rates of convergence taking into account the effective dimension parameter. [sent-247, score-0.225]
84 of the form (S + λI)^{-1/2} (Tn∗ Υ − T∗ f∗) for the data, and ‖(S + λI)^{-1/2} (Sn − S)‖_HS for the operator (with an appropriate choice of λ > 0), respectively. [sent-250, score-0.142]
85 Deviations of this form were introduced and used in [5, 6] to obtain sharp rates in the framework of Tikhonov’s regularization (2) and of the more general linear regularization schemes of the form (3). [sent-251, score-0.414]
86 Bounds on deviations of this form can be obtained via a Bernstein-type concentration inequality for Hilbert-space valued random variables. [sent-252, score-0.075]
87 On the one hand, the results concerning linear regularization schemes of the form (3) do not apply to the nonlinear CG regularization. [sent-253, score-0.179]
88 5 Conclusion In this work, we derived early stopping rules for kernel Conjugate Gradient regression that provide optimal learning rates towards the true target function. [sent-262, score-0.769]
89 Depending on the setting that we study, the rates are in some cases adaptive with respect to the regularity of the target function. [sent-263, score-0.358]
90 We also note that theoretically well-grounded model selection rules can generally help cross-validation in practice by providing a well-calibrated parametrization of regularizer functions, or, as is the case here, of thresholds used in the stopping rule. [sent-267, score-0.274]
91 One crucial property used in the proofs is that the proposed CG regularization schemes can be conveniently cast in the reproducing kernel Hilbert space H, as displayed e.g. [sent-268, score-0.406]
92 This point is the main technical justification for why we focus on (5) rather than kernel PLS. [sent-271, score-0.165]
93 Obtaining optimal convergence rates also valid for Kernel PLS is an important future direction and should build on the present work. [sent-272, score-0.191]
94 Another important direction for future efforts is the derivation of stopping rules that do not depend on the confidence parameter γ. [sent-273, score-0.274]
95 Currently, this dependence prevents us from going from convergence in high probability to convergence in expectation, which would be desirable. [sent-274, score-0.086]
96 Perhaps more importantly, it would be of interest to find a stopping rule that is adaptive to both parameters r (target function regularity) and s (effective dimension parameter) without their a priori knowledge. [sent-275, score-0.401]
97 We recall that our first stopping rule is adaptive to r but at the price of being worst-case in s. [sent-276, score-0.368]
98 In the literature on linear regularization methods, the optimal choice of regularization parameter is also non-adaptive, be it when considering optimal rates with respect to r only [1] or to both r and s [5]. [sent-277, score-0.368]
99 An approach to alleviate this problem is to use a hold-out sample for model selection; this was studied theoretically in [7] for linear regularization methods (see also [4] for an account of the properties of hold-out in a general setup). [sent-278, score-0.11]
100 Learning bounds for kernel regression using effective data dimensionality. [sent-413, score-0.275]
wordName wordTfidf (topN-words)
[('kn', 0.522), ('cg', 0.363), ('tn', 0.269), ('stopping', 0.244), ('nemirovskii', 0.211), ('fm', 0.173), ('kernel', 0.165), ('rates', 0.148), ('operator', 0.142), ('qm', 0.136), ('squares', 0.126), ('conjugate', 0.121), ('fh', 0.111), ('px', 0.11), ('regularization', 0.11), ('discrepancy', 0.095), ('rule', 0.093), ('regularity', 0.093), ('km', 0.088), ('target', 0.086), ('pls', 0.085), ('zm', 0.078), ('bernstein', 0.076), ('blanchard', 0.068), ('sn', 0.066), ('ti', 0.066), ('sc', 0.065), ('krylov', 0.064), ('inner', 0.058), ('hilbert', 0.057), ('reproducing', 0.057), ('partial', 0.053), ('regression', 0.051), ('caponnetto', 0.051), ('outer', 0.049), ('surely', 0.049), ('operators', 0.047), ('bounded', 0.047), ('unlabeled', 0.047), ('schemes', 0.046), ('di', 0.045), ('early', 0.045), ('integral', 0.044), ('ong', 0.043), ('convergence', 0.043), ('deviations', 0.041), ('hs', 0.04), ('gradient', 0.04), ('condition', 0.04), ('estimator', 0.038), ('rosipal', 0.037), ('exy', 0.037), ('vito', 0.037), ('wold', 0.037), ('inverse', 0.035), ('potsdam', 0.034), ('az', 0.034), ('attainable', 0.034), ('effective', 0.034), ('concentration', 0.034), ('ridge', 0.033), ('priori', 0.033), ('least', 0.033), ('norm', 0.032), ('source', 0.032), ('theorem', 0.032), ('rosasco', 0.032), ('kr', 0.032), ('population', 0.032), ('proof', 0.031), ('adaptive', 0.031), ('rules', 0.03), ('ri', 0.03), ('cients', 0.029), ('adjoint', 0.029), ('proofs', 0.028), ('tikhonov', 0.028), ('warped', 0.028), ('satis', 0.028), ('noiseless', 0.027), ('assumption', 0.026), ('almost', 0.025), ('remember', 0.025), ('ful', 0.025), ('stress', 0.025), ('bounds', 0.025), ('replaced', 0.025), ('supremum', 0.024), ('principle', 0.024), ('polynomial', 0.024), ('coef', 0.024), ('regularized', 0.024), ('rn', 0.024), ('equation', 0.024), ('ideas', 0.023), ('depending', 0.023), ('nonlinear', 0.023), ('assumptions', 0.023), ('belongs', 0.023), ('subspaces', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
2 0.24321038 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
3 0.15058097 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
4 0.10735481 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
Author: Tobias Glasmachers
Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1
5 0.098167263 133 nips-2010-Kernel Descriptors for Visual Recognition
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
6 0.096356958 153 nips-2010-Learning invariant features using the Transformed Indian Buffet Process
7 0.088697784 124 nips-2010-Inductive Regularized Learning of Kernel Functions
8 0.085635446 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
9 0.083642751 108 nips-2010-Graph-Valued Regression
10 0.082841769 243 nips-2010-Smoothness, Low Noise and Fast Rates
11 0.081317134 174 nips-2010-Multi-label Multiple Kernel Learning by Stochastic Approximation: Application to Visual Object Recognition
12 0.081245624 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.080051899 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
14 0.076683365 281 nips-2010-Using body-anchored priors for identifying actions in single images
15 0.075746141 148 nips-2010-Learning Networks of Stochastic Differential Equations
16 0.074390925 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
17 0.073681213 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
18 0.071576245 101 nips-2010-Gaussian sampling by local perturbations
19 0.06899368 233 nips-2010-Scrambled Objects for Least-Squares Regression
20 0.068513364 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
topicId topicWeight
[(0, 0.192), (1, 0.027), (2, 0.088), (3, 0.034), (4, 0.157), (5, 0.069), (6, 0.133), (7, -0.003), (8, 0.038), (9, 0.019), (10, -0.019), (11, -0.035), (12, -0.126), (13, -0.056), (14, -0.038), (15, 0.008), (16, 0.101), (17, 0.093), (18, 0.012), (19, 0.005), (20, -0.057), (21, -0.017), (22, -0.003), (23, 0.009), (24, 0.165), (25, 0.016), (26, -0.067), (27, 0.025), (28, -0.06), (29, -0.006), (30, -0.029), (31, -0.012), (32, -0.046), (33, 0.139), (34, -0.013), (35, 0.181), (36, -0.142), (37, 0.051), (38, 0.068), (39, 0.014), (40, 0.005), (41, 0.08), (42, 0.043), (43, -0.035), (44, -0.115), (45, 0.059), (46, -0.101), (47, -0.145), (48, 0.063), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.93065566 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
2 0.78656769 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
3 0.72438502 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
4 0.59631735 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
Author: Tobias Glasmachers
Abstract: Steinwart was the first to prove universal consistency of support vector machine classification. His proof analyzed the ‘standard’ support vector machine classifier, which is restricted to binary classification problems. In contrast, recent analysis has resulted in the common belief that several extensions of SVM classification to more than two classes are inconsistent. Countering this belief, we prove the universal consistency of the multi-class support vector machine by Crammer and Singer. Our proof extends Steinwart’s techniques to the multi-class case. 1
5 0.56923002 279 nips-2010-Universal Kernels on Non-Standard Input Spaces
Author: Andreas Christmann, Ingo Steinwart
Abstract: During the last years support vector machines (SVMs) have been successfully applied in situations where the input space X is not necessarily a subset of Rd . Examples include SVMs for the analysis of histograms or colored images, SVMs for text classification and web mining, and SVMs for applications from computational biology using, e.g., kernels for trees and graphs. Moreover, SVMs are known to be consistent to the Bayes risk, if either the input space is a complete separable metric space and the reproducing kernel Hilbert space (RKHS) H ⊂ Lp (PX ) is dense, or if the SVM uses a universal kernel k. So far, however, there are no kernels of practical interest known that satisfy these assumptions, if X ⊂ Rd . We close this gap by providing a general technique based on Taylor-type kernels to explicitly construct universal kernels on compact metric spaces which are not subset of Rd . We apply this technique for the following special cases: universal kernels on the set of probability measures, universal kernels based on Fourier transforms, and universal kernels for signal processing. 1
6 0.4887552 233 nips-2010-Scrambled Objects for Least-Squares Regression
7 0.4621124 280 nips-2010-Unsupervised Kernel Dimension Reduction
8 0.4422752 124 nips-2010-Inductive Regularized Learning of Kernel Functions
9 0.42171863 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
10 0.40139526 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls
11 0.39631766 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
12 0.39124939 148 nips-2010-Learning Networks of Stochastic Differential Equations
14 0.38980621 223 nips-2010-Rates of convergence for the cluster tree
15 0.38603243 133 nips-2010-Kernel Descriptors for Visual Recognition
16 0.36913151 134 nips-2010-LSTD with Random Projections
17 0.36711758 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
18 0.36208999 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
19 0.36060363 281 nips-2010-Using body-anchored priors for identifying actions in single images
20 0.35784644 72 nips-2010-Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions
topicId topicWeight
[(13, 0.046), (27, 0.05), (30, 0.078), (35, 0.031), (38, 0.205), (45, 0.189), (50, 0.062), (52, 0.032), (60, 0.059), (77, 0.035), (78, 0.027), (90, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.82365698 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
2 0.77406102 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
Author: Oliver Stegle, Dominik Janzing, Kun Zhang, Joris M. Mooij, Bernhard Schölkopf
Abstract: We propose a novel method for inferring whether X causes Y or vice versa from joint observations of X and Y . The basic idea is to model the observed data using probabilistic latent variable models, which incorporate the effects of unobserved noise. To this end, we consider the hypothetical effect variable to be a function of the hypothetical cause variable and an independent noise term (not necessarily additive). An important novel aspect of our work is that we do not restrict the model class, but instead put general non-parametric priors on this function and on the distribution of the cause. The causal direction can then be inferred by using standard Bayesian model selection. We evaluate our approach on synthetic data and real-world data and report encouraging results. 1
3 0.75822306 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
Author: Han Liu, Xi Chen
Abstract: We propose a new nonparametric learning method based on multivariate dyadic regression trees (MDRTs). Unlike traditional dyadic decision trees (DDTs) or classification and regression trees (CARTs), MDRTs are constructed using penalized empirical risk minimization with a novel sparsity-inducing penalty. Theoretically, we show that MDRTs can simultaneously adapt to the unknown sparsity and smoothness of the true regression functions, and achieve the nearly optimal rates of convergence (in a minimax sense) for the class of (α, C)-smooth functions. Empirically, MDRTs can simultaneously conduct function estimation and variable selection in high dimensions. To make MDRTs applicable for large-scale learning problems, we propose a greedy heuristics. The superior performance of MDRTs are demonstrated on both synthetic and real datasets. 1
4 0.75707257 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
5 0.7434985 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
Author: Malik Magdon-Ismail
Abstract: We define a data dependent permutation complexity for a hypothesis set H, which is similar to a Rademacher complexity or maximum discrepancy. The permutation complexity is based (like the maximum discrepancy) on dependent sampling. We prove a uniform bound on the generalization error, as well as a concentration result which means that the permutation estimate can be efficiently estimated.
6 0.73991305 117 nips-2010-Identifying graph-structured activation patterns in networks
8 0.73654902 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
9 0.7355839 222 nips-2010-Random Walk Approach to Regret Minimization
10 0.73460817 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
11 0.73385322 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
12 0.73377877 243 nips-2010-Smoothness, Low Noise and Fast Rates
13 0.73282504 282 nips-2010-Variable margin losses for classifier design
14 0.73259032 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
15 0.73174673 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
16 0.72982317 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
17 0.72655874 265 nips-2010-The LASSO risk: asymptotic results and real world examples
18 0.72651833 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
19 0.72641945 280 nips-2010-Unsupervised Kernel Dimension Reduction
20 0.72591192 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models