jmlr jmlr2005 jmlr2005-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone
Abstract: Many works have related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular, it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless, the connection with inverse problems was considered only for the discrete (finite sample) problem, and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows us to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency. © 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone.
Reference: text
sentIndex sentText sentNum sentScore
1 DISI, Università di Genova, Genova, Italy Editor: Peter Bartlett Abstract Many works have related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. [sent-10, score-0.4]
2 In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. [sent-11, score-0.252]
3 Nevertheless, the connection with inverse problems was considered only for the discrete (finite sample) problem, and the probabilistic aspects of learning from examples were not taken into account. [sent-12, score-0.225]
4 From a theoretical point of view, this allows us to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. [sent-14, score-0.395]
5 The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. [sent-15, score-0.154]
6 By means of standard results on the approximation term, the consistency of the algorithm easily follows. [sent-16, score-0.129]
7 Keywords: statistical learning, inverse problems, regularization theory, consistency. © 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. [sent-17, score-0.403]
8 (2000) and references therein), this is in essence the idea underlying regularization techniques for ill-posed problems (Tikhonov and Arsenin, 1977; Engl et al. [sent-26, score-0.126]
9 , 2002) and the point of view of regularization is indeed not new to learning (Poggio and Girosi, 1992; Evgeniou et al. [sent-29, score-0.126]
10 In particular it allowed one to cast a large class of algorithms in a common framework, namely regularization networks or regularized kernel methods (Evgeniou et al. [sent-31, score-0.245]
11 Nevertheless, a careful analysis shows that a rigorous mathematical connection between learning theory and the theory of ill-posed inverse problems is not straightforward, since the settings underlying the two theories are different. [sent-33, score-0.236]
12 In fact learning theory is intrinsically probabilistic whereas the theory of inverse problems is mostly deterministic. [sent-34, score-0.229]
13 Statistical methods were recently applied in the context of inverse problems (Kaipio and Somersalo, 2005). [sent-35, score-0.148]
14 Recently the connection between learning and inverse problems was considered in the restricted setting in which the elements of the input space are fixed and not probabilistically drawn (Mukherjee et al. [sent-37, score-0.19]
15 , 1996) and when the noise level is fixed and known, the problem is well studied in the context of inverse problems (Bertero et al. [sent-40, score-0.24]
16 Though such a setting is indeed close to the algorithmic one, from a theoretical perspective it is not general enough to allow a consistency analysis of a given algorithm, since it does not account for the random sampling that provides the data. [sent-46, score-0.129]
17 First, we study the mathematical connections between learning theory and inverse problems theory. [sent-50, score-0.171]
18 when the probability distribution is known) to show that the discrete inverse problem which is solved in practice can be seen as the stochastic discretization of an infinite dimensional inverse problem. [sent-53, score-0.296]
19 Second, we exploit the established connection to study the regularized least-squares algorithm. [sent-57, score-0.161]
20 This is achieved through the definition of a natural notion of discretization noise, which provides a straightforward relation between the number of available data and the noise affecting the problem. [sent-58, score-0.157]
21 Classical regularization theory results can be easily adapted to the needs of learning. [sent-59, score-0.149]
22 As the major aim of the paper was to investigate the relation between learning from examples and inverse problems, we only prove convergence, without dealing with rates. [sent-64, score-0.197]
23 Several theoretical results are available on regularized kernel methods for a large class of loss functions. [sent-67, score-0.148]
24 In Steinwart (2004) it is proved that such results as well as other probabilistic bounds can be used to derive consistency results without convergence rates. [sent-69, score-0.19]
25 For the specific case of the regularized least-squares algorithm, a functional analytic approach to deriving consistency results was proposed in Cucker and Smale (2002a) and eventually refined in De Vito et al. [sent-70, score-0.425]
26 However, none of the mentioned papers exploits the connection with inverse problems. [sent-74, score-0.19]
27 The arguments used to derive our results are close to those used in the study of stochastic inverse problems discussed in Vapnik (1998). [sent-75, score-0.148]
28 From the algorithmic point of view, Ong and Canu (2004) apply techniques other than Tikhonov regularization in the context of learning. [sent-76, score-0.126]
29 In particular several iterative algorithms are considered and convergence with respect to the regularization parameter (semiconvergence) is proved. [sent-77, score-0.152]
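As a point of reference for what semiconvergence means, here is a small illustration added by the editor (this is not the algorithm of Ong and Canu (2004); it is a generic Landweber iteration on a made-up diagonal toy problem, and every value in it is an illustrative assumption): on noisy data the reconstruction error of an iterative method first decreases and then grows again, so the iteration number itself plays the role of a regularization parameter.

import numpy as np

# Deterministic toy inverse problem: diagonal operator with decaying singular values.
s = np.array([1.0, 0.3, 0.1, 0.03, 0.01])
A = np.diag(s)
f_true = np.ones(5)
g_delta = A @ f_true + 0.05 * np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # noisy datum

tau = 1.0 / np.max(s) ** 2          # Landweber step size, tau <= 1 / ||A||^2
f = np.zeros(5)
errors = []
for k in range(50000):
    f = f + tau * A.T @ (g_delta - A @ f)   # Landweber step f_{k+1} = f_k + tau A^T (g_delta - A f_k)
    errors.append(np.linalg.norm(f - f_true))

best = int(np.argmin(errors))
print(f"smallest error, reached at iteration {best}: {errors[best]:.2f}")
print(f"error after {len(errors)} iterations: {errors[-1]:.2f}")
# The error first decreases and then grows again as the iteration starts fitting the noise:
# stopping early acts as regularization (semiconvergence).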
30 After recalling the main concepts and notation of statistical learning (Section 2) and of inverse problems (Section 3), in Section 4 we develop a formal connection between the two theories. [sent-79, score-0.213]
31 Given the sample z, the aim of learning theory is to find a function fz : X → R such that fz (x) is a good estimate of the output y when a new input x is given. [sent-95, score-0.625]
32 The function fz is called the estimator, and the map providing fz for any training set z is called the learning algorithm. [sent-96, score-0.627]
33 $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f]$ is small with high probability. [sent-103, score-0.374]
34 Since ρ is unknown, the above difference is studied by means of a probabilistic bound $B(\lambda, \ell, \eta)$, which is a function depending on the regularization parameter λ, the number of examples ℓ and the confidence level 1 − η, such that $\mathbf{P}\left\{ I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le B(\lambda, \ell, \eta) \right\} \ge 1 - \eta$. [sent-104, score-0.56]
35 In particular, the learning algorithm is consistent if it is possible to choose the regularization parameter, as a function of the available data $\lambda = \lambda(\ell, z)$, in such a way that $\lim_{\ell \to +\infty} \mathbf{P}\left\{ I[f_z^{\lambda(\ell,z)}] - \inf_{f \in \mathcal{H}} I[f] \ge \varepsilon \right\} = 0, \quad (2)$ for every ε > 0. [sent-109, score-0.53]
36 The above convergence in probability is usually called (weak) consistency of the algorithm (see Devroye et al. [sent-110, score-0.155]
37 (4) In particular, in the learning algorithm (1) we choose the penalty term $\Omega(f) = \|f\|_{\mathcal{H}}^2$, so that, by a standard convex analysis argument, the minimizer $f_z^\lambda$ exists, is unique, and can be computed by solving a finite-dimensional linear problem (Wahba, 1990). [sent-126, score-0.345]
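For concreteness, the following sketch (added here for illustration; the Gaussian kernel, the toy data and every variable name are assumptions, not prescriptions of the paper) shows the finite-dimensional computation behind (1): by the representer theorem the minimizer has the form $f_z^\lambda(x) = \sum_i c_i K(x, x_i)$, with coefficients solving the ℓ×ℓ linear system $(K + \ell\lambda I)c = y$.

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)), a Mercer kernel on R
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

def regularized_least_squares(x, y, lam, sigma=1.0):
    # Representer theorem: f_z^lambda(.) = sum_i c_i K(., x_i),
    # with (K + ell*lam*I) c = y for the empirical risk plus lam*||f||_H^2 functional.
    ell = len(x)
    K = gaussian_kernel(x, x, sigma)
    c = np.linalg.solve(K + ell * lam * np.eye(ell), y)
    return lambda t: gaussian_kernel(np.atleast_1d(t), x, sigma) @ c

# Toy sample: y = sin(2x) + bounded noise, so |y| <= M for some finite M.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.uniform(-1, 1, 30)
f = regularized_least_squares(x, y, lam=1e-2)
print(float(f(0.5)))   # estimate of the regression function at a new input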
38 With the above choices, we will show that the consistency of the regularized least squares algorithm can be deduced using the theory of linear inverse problems we review in the next section. [sent-127, score-0.477]
39 Ill-Posed Inverse Problems and Regularization In this section we give a very brief account of the main concepts of linear inverse problems and regularization theory (see Tikhonov and Arsenin (1977); Groetsch (1984); Bertero et al. [sent-129, score-0.297]
40 Finding the function f satisfying the above equation, given A and g, is the linear inverse problem associated to (5). [sent-135, score-0.148]
41 Existence and uniqueness can be restored by introducing the Moore-Penrose generalized solution $f^\dagger$, defined as the minimal norm solution of the least squares problem $\min_{f \in \mathcal{H}} \|Af - g\|_{\mathcal{K}}^2$. [sent-137, score-0.137]
42 This is a problem since the exact datum g is not known, but only a noisy datum $g_\delta \in \mathcal{K}$ is given, where $\|g - g_\delta\|_{\mathcal{K}} \le \delta$. [sent-141, score-0.163]
43 A crucial issue is the choice of the regularization parameter λ as a function of the noise. [sent-143, score-0.126]
44 In particular, λ must be selected, as a function of the noise level δ and the data $g_\delta$, in such a way that the regularized solution $f_\delta^{\lambda(\delta, g_\delta)}$ converges to the generalized solution, that is, $\lim_{\delta \to 0} \| f_\delta^{\lambda(\delta, g_\delta)} - f^\dagger \|_{\mathcal{H}} = 0$, for any g such that $f^\dagger$ exists. [sent-145, score-0.243]
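A minimal numerical sketch of this convergence requirement (added here; the matrix, the noise model and the a priori rule λ(δ) = δ are arbitrary illustrative choices, not taken from the paper): the Tikhonov solution $f_\delta^\lambda = (A^*A + \lambda I)^{-1}A^* g_\delta$ is compared with the generalized solution $f^\dagger$ as the noise level δ decreases.

import numpy as np

rng = np.random.default_rng(1)
# Ill-conditioned forward operator A and exact datum g = A f_true.
A = np.vander(np.linspace(0, 1, 40), 8, increasing=True)   # nearly collinear columns
f_true = rng.standard_normal(8)
g = A @ f_true

f_dagger = np.linalg.pinv(A) @ g       # Moore-Penrose generalized solution f†

def tikhonov(A, g_delta, lam):
    # f_delta^lambda = (A^T A + lam I)^{-1} A^T g_delta
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ g_delta)

for delta in [1e-1, 1e-2, 1e-3, 1e-4]:
    noise = rng.standard_normal(len(g))
    g_delta = g + delta * noise / np.linalg.norm(noise)    # ||g - g_delta|| <= delta
    lam = delta                                            # a simple a priori rule lambda(delta)
    err = np.linalg.norm(tikhonov(A, g_delta, lam) - f_dagger)
    print(f"delta={delta:.0e}  ||f_delta^lambda - f_dagger|| = {err:.3e}")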
45 Nonetheless regularization is needed since the problems are usually ill conditioned and lead to unstable solutions. [sent-150, score-0.154]
46 Comparing (9) and (10), it is clear that while studying the convergence of the residual we do not have to assume that the generalized solution exists. [sent-152, score-0.15]
47 We conclude this section noting that the above formalism can be easily extended to the case of a noisy operator $A_\delta : \mathcal{H} \to \mathcal{K}$, where $\|A - A_\delta\| \le \delta$ and $\|\cdot\|$ is the operator norm (Tikhonov et al. [sent-153, score-0.248]
48 Learning as an Inverse Problem The similarity between regularized least squares and Tikhonov regularization is apparent comparing Problems (1) and (7). [sent-156, score-0.303]
49 • To treat the problem of learning in the setting of ill-posed inverse problems we have to define a direct problem by means of a suitable operator A between two Hilbert spaces H and K . [sent-158, score-0.239]
50 (11) Moreover, if P is the projection onto the closure of the range of A, that is, the closure of the image of H in L2(X, ν), then the definition of projection gives $\inf_{f \in \mathcal{H}} \|Af - g\|^2_{L^2(X,\nu)} = \|g - Pg\|^2_{L^2(X,\nu)}$. [sent-171, score-0.187]
51 (12) Given f ∈ H, clearly PAf = Af, so that $I[f] - \inf_{f \in \mathcal{H}} I[f] = \|Af - g\|^2_{L^2(X,\nu)} - \|g - Pg\|^2_{L^2(X,\nu)} = \|Af - Pg\|^2_{L^2(X,\nu)}$, (13) which is the square of the residual of f. [sent-172, score-0.17]
52 Now, comparing (11) and (6), it is clear that the expected risk admits a minimizer fH on the hypothesis space H if and only if fH is precisely the generalized solution f † of the linear inverse problem A f = g. [sent-173, score-0.269]
53 It is straightforward to check that $\frac{1}{\ell} \sum_{i=1}^{\ell} (f(x_i) - y_i)^2 = \|A_x f - y\|^2_{E}$, so that the estimator $f_z^\lambda$ given by the regularized least squares algorithm (see Problem (1)) is the Tikhonov regularized solution of the discrete problem $A_x f = y$. [sent-186, score-0.649]
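This identification can be checked numerically. The sketch below is an added illustration: the explicit finite-dimensional feature map (and all names in it) are assumptions made only so that the sampling operator becomes a plain matrix; it verifies that the Tikhonov solution of the discrete problem, $(A_x^*A_x + \lambda I)^{-1}A_x^*y$ with $A_x^*A_x = \Phi^T\Phi/\ell$ and $A_x^*y = \Phi^T y/\ell$, coincides with the usual kernel-matrix solution of Problem (1).

import numpy as np

rng = np.random.default_rng(2)
ell, d = 50, 6
Phi = rng.standard_normal((ell, d))   # rows phi(x_i): explicit feature map, K(x, x') = phi(x).phi(x')
y = rng.uniform(-1, 1, ell)
lam = 0.1

# Inverse-problem form: f_z^lambda = (A_x^* A_x + lam I)^{-1} A_x^* y,
# with A_x^* A_x = Phi^T Phi / ell and A_x^* y = Phi^T y / ell.
w_operator = np.linalg.solve(Phi.T @ Phi / ell + lam * np.eye(d), Phi.T @ y / ell)

# Kernel form: (K + ell*lam*I) c = y, then w = Phi^T c by the representer theorem.
K = Phi @ Phi.T
c = np.linalg.solve(K + ell * lam * np.eye(ell), y)
w_kernel = Phi.T @ c

print(np.allclose(w_operator, w_kernel))   # True: both give the same regularized estimator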
54 Second, in statistical learning theory, the key quantity is the residual of the solution, which is a weaker measure than the reconstruction error, usually studied in the inverse problem setting. [sent-189, score-0.304]
55 In particular, consistency requires a weaker kind of convergence than the one usually studied in the context of inverse problems . [sent-190, score-0.328]
56 Regularization, Stochastic Noise and Consistency Table 1 compares the classical framework of inverse problems (see Section 3) with the formulation of learning proposed above. [sent-196, score-0.148]
57 Given the above premise our derivation of consistency results is developed in two steps: we first study the residual of the solution by means of a measure of the noise due to discretization, then we show a possible way to give a probabilistic evaluation of the noise previously introduced. [sent-201, score-0.422]
58 When the projection of the regression function is not in the range of the operator A the ideal solution fH does not exist. [sent-203, score-0.205]
59 1 Bounding the Residual of the Tikhonov Solution In this section we study the dependence of the minimizer of the Tikhonov functional on the operator A and the data g. [sent-206, score-0.135]
60 The Tikhonov solutions of Problems (14) and (15) can be written as $f^\lambda = (A^* A + \lambda I)^{-1} A^* g$ and $f_z^\lambda = (A_x^* A_x + \lambda I)^{-1} A_x^* y$ (see for example Engl et al. [sent-209, score-0.301]
61 The above equations show that $f_z^\lambda$ and $f^\lambda$ depend only on $A_x^* A_x$ and $A^* A$, which are operators from H into H, and on $A_x^* y$ and $A^* g$, which are elements of H. [sent-211, score-0.326]
62 Theorem 2 Given λ > 0, the following inequality holds: $\left| \, \|A f_z^\lambda - Pg\|_{L^2(X,\nu)} - \|A f^\lambda - Pg\|_{L^2(X,\nu)} \, \right| \le \frac{\delta_1}{2\sqrt{\lambda}} + \frac{M\delta_2}{4\lambda}$ for any training set $z \in U_\delta$. [sent-215, score-0.301]
63 of the inequality is exactly the residual of the regularized solution whereas the second term represents the approximation error, which does not depend on the sample. [sent-221, score-0.243]
64 Our bound quantifies the difference between the residual of the regularized solutions of the exact and noisy problems in terms of the noise level δ = (δ1 , δ2 ). [sent-222, score-0.324]
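To see the trade-off hidden in the bound, here is a tiny arithmetic illustration (the values of M, δ1 and δ2 are made up; only the expression δ1/(2√λ) + Mδ2/(4λ) comes from the reconstruction of Theorem 2 above): the sample term blows up as λ → 0, so λ has to be balanced against the approximation term $\|Af^\lambda - Pg\|$.

import math

M, delta1, delta2 = 1.0, 0.05, 0.05          # illustrative noise levels, not from the paper

def sample_error_bound(lam):
    # delta_1 / (2 sqrt(lam)) + M delta_2 / (4 lam), the right-hand side of Theorem 2
    return delta1 / (2 * math.sqrt(lam)) + M * delta2 / (4 * lam)

for lam in [1.0, 0.1, 0.01, 0.001]:
    print(f"lambda={lam:<6} sample-error bound = {sample_error_bound(lam):.3f}")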
65 Our result bounds the residual both from above and below and is obtained introducing the collection Uδ of training sets compatible with a certain noise level δ. [sent-224, score-0.164]
66 In the context of inverse problems a noise estimate is a part of the available data whereas in learning problems we need a probabilistic analysis. [sent-229, score-0.25]
67 An interesting aspect in our approach is that the collection of training sets compatible with a certain noise level δ does not depend on the regularization parameter λ. [sent-236, score-0.193]
68 Since through data-dependent parameter choices the regularization parameter becomes a function of the given sample, λ = λ(ℓ, z), in general some further analysis is needed to ensure that the bounds hold uniformly w. [sent-238, score-0.126]
69 The following is a trivial modification of a classical result in the context of inverse problems (see for example Engl et al. [sent-243, score-0.148]
70 Proposition 4 Let $f^\lambda$ be the Tikhonov regularized solution of the problem $Af = g$; then the following convergence holds: $\lim_{\lambda \to 0^+} \|A f^\lambda - Pg\|_{L^2(X,\nu)} = 0$. [sent-247, score-0.202]
71 We are now in the position to derive the consistency result that we present in the following section. [sent-254, score-0.129]
72 Theorem 5 Given 0 < η < 1, λ > 0 and ℓ ∈ N, the following inequality holds with probability greater than 1 − η: $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le \left[ \left( \frac{M\kappa}{2\sqrt{\lambda}} + \frac{M\kappa^2}{4\lambda} \right) \psi\!\left( \frac{8}{\ell} \log\frac{4}{\eta} \right) + \|A f^\lambda - Pg\|_{L^2(X,\nu)} \right]^2$; moreover, for a suitable choice of the regularization parameter $\lambda = \lambda(\ell, z)$, $\lim_{\ell \to +\infty} \mathbf{P}\left\{ I[f_z^{\lambda(\ell,z)}] - \inf_{f \in \mathcal{H}} I[f] \ge \varepsilon \right\} = 0$. [sent-258, score-0.832]
73 It is interesting to note that since δ = δ(ℓ) we have an equivalence between the limit ℓ → ∞, usually studied in learning theory, and the limit δ → 0, usually considered for inverse problems. [sent-263, score-0.173]
74 Our result presents the formal connection between the consistency approach considered in learning theory, and the regularization-stability convergence property used in ill-posed inverse problems. [sent-264, score-0.345]
75 We now briefly compare our result with previous work on the consistency of the regularized least squares algorithm. [sent-266, score-0.306]
76 Recently, several works studied the consistency property and the related convergence rate of learning algorithms inspired by Tikhonov regularization. [sent-267, score-0.18]
77 For regression problems, in Bousquet and Elisseeff (2002) a large class of loss functions is considered and a bound of the form $I[f_z^\lambda] - I_z[f_z^\lambda] \le O\!\left( \frac{1}{\lambda\sqrt{\ell}} \right)$ is proved, where $I_z[f_z^\lambda]$ is the empirical error. [sent-270, score-0.954]
78 Such a bound allows one to prove consistency using the error decomposition in Steinwart (2004). [sent-271, score-0.129]
79 The square loss was considered in Zhang (2003) where, using leave-one-out techniques, the following bound in expectation was proved: $E_z(I[f_z^\lambda]) \le O\!\left( \frac{1}{\ell} \right)$. [sent-272, score-0.33]
80 (2004) to derive a bound of the form $I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] \le \left( S(\lambda, \ell) + \|A f^\lambda - Pg\|_{L^2(X,\nu)} \right)^2$, where $S(\lambda, \ell)$ is a data-independent bound on $\|f_z^\lambda - f^\lambda\|$; we see that Theorem 4 gives $S(\lambda, \ell) \le O\!\left( \frac{1}{\sqrt{\ell}\,\lambda} \right)$ whereas Theorem 2 gives $O\!\left( \frac{\log \ell}{\sqrt{\ell}\,\lambda^2} \right)$. [sent-274, score-0.702]
81 Proof [of Theorem 2] The idea of the proof is to note that, by the triangle inequality, we can write $\left| \, \|A f_z^\lambda - Pg\|_{L^2(X,\nu)} - \|A f^\lambda - Pg\|_{L^2(X,\nu)} \, \right| \le \|A f_z^\lambda - A f^\lambda\|_{L^2(X,\nu)}$ (17) so that we can focus on the difference between the discrete and continuous solutions. [sent-285, score-0.628]
82 By a simple algebraic computation we have that $f_z^\lambda - f^\lambda = (A_x^* A_x + \lambda I)^{-1} A_x^* y - (A^* A + \lambda I)^{-1} A^* g = [(A_x^* A_x + \lambda I)^{-1} - (A^* A + \lambda I)^{-1}] A_x^* y + (A^* A + \lambda I)^{-1}(A_x^* y - A^* g)$ (18) $= (A^* A + \lambda I)^{-1}(A^* A - A_x^* A_x)(A_x^* A_x + \lambda I)^{-1} A_x^* y + (A^* A + \lambda I)^{-1}(A_x^* y - A^* g)$. [sent-286, score-0.301]
83 Since $\|y\|_E \le M$, (19) and (20) give $\left| \, \|A f_z^\lambda - Pg\|_{L^2} - \|A f^\lambda - Pg\|_{L^2} \, \right| \le \frac{M}{4\lambda} \|A^* A - A_x^* A_x\|_{\mathcal{L}(\mathcal{H})} + \frac{1}{2\sqrt{\lambda}} \|A_x^* y - A^* g\|_{\mathcal{H}}$ so that the theorem is proved. [sent-294, score-0.301]
84 Finally we combine the above results to prove the consistency of the regularized least squares algorithm. [sent-300, score-0.306]
85 Proof [Theorem 4] Theorem 1 gives $\|A f_z^\lambda - Pg\|_{L^2(X,\nu)} \le \frac{1}{2\sqrt{\lambda}} \delta_1 + \frac{M}{4\lambda} \delta_2 + \|A f^\lambda - Pg\|_{L^2(X,\nu)}$. [sent-301, score-0.301]
86 Equation (13) and the estimates for the noise levels δ1 and δ2 given by Theorem 2 ensure that $\|A f_z^\lambda - Pg\|_{L^2(X,\nu)} = \sqrt{ I[f_z^\lambda] - \inf_{f \in \mathcal{H}} I[f] } \le \left( \frac{M\kappa}{2\sqrt{\lambda}} + \frac{M\kappa^2}{4\lambda} \right) \psi\!\left( \frac{8}{\ell} \log\frac{4}{\eta} \right) + \|A f^\lambda - Pg\|_{L^2(X,\nu)}$, and (16) simply follows taking the square of the above inequality. [sent-302, score-0.468]
87 Let now $\lambda = O(\ell^{-b})$ with $0 < b < \frac{1}{2}$; the consistency of the regularized least-squares algorithm is then proved by inverting the relation between ε and η and using the result of Proposition 4 (see Appendix). [sent-303, score-0.21]
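A quick numerical check of why the exponent must satisfy 0 < b < 1/2 (an added illustration: the constants M, κ, η are made up, and the function ψ in the bound is approximated by a square root, as suggested by the asymptotics g(t) = √t + o(√t) recalled in the appendix): with λ = ℓ^(−b) the sample part of the reconstructed bound behaves like ℓ^(b − 1/2) and therefore vanishes as ℓ grows, while it would not for b ≥ 1/2.

import math

M, kappa, eta, b = 1.0, 1.0, 0.05, 0.25      # illustrative constants and an exponent 0 < b < 1/2

def sample_term(ell):
    lam = ell ** (-b)                                     # regularization schedule lambda = ell^(-b)
    prob = math.sqrt(8.0 / ell * math.log(4.0 / eta))     # approximates psi((8/ell) log(4/eta))
    return (M * kappa / (2 * math.sqrt(lam)) + M * kappa ** 2 / (4 * lam)) * prob

for ell in [100, 1000, 10000, 100000]:
    print(f"ell={ell:<7} sample term ~ {sample_term(ell):.4f}")   # decreases like ell^(b - 1/2)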
88 Conclusions In this paper we analyse the connection between the theory of statistical learning and the theory of ill-posed problems. [sent-305, score-0.119]
89 As a consequence, the consistency of the algorithm is traced back to the well known convergence property of the Tikhonov regularization. [sent-307, score-0.155]
90 Moreover, since, in general, the expected risk I[f] for an arbitrary loss function does not define a metric, the relation between the expected risk and the residual is not clear. [sent-311, score-0.203]
91 Further problems are the choice of the regularization parameter, for example by means of the generalized Morozov principle (Engl et al. [sent-312, score-0.126]
92 , 1996) and the extension of our analysis to a wider class of regularization algorithms. [sent-313, score-0.126]
93 Proposition 6 The operator A is a Hilbert-Schmidt operator from H into L2(X, ν) and $A^* \phi = \int_X \phi(x) K_x \, d\nu(x)$, (21) $A^* A = \int_X \langle \cdot, K_x \rangle_{\mathcal{H}} K_x \, d\nu(x)$, (22) where φ ∈ L2(X, ν), the first integral converges in norm and the second one in trace norm. [sent-325, score-0.207]
94 Since A*A is a positive operator and $|\langle K_x, e_n \rangle_{\mathcal{H}}|^2$ is a positive function, by the monotone convergence theorem, $\mathrm{Tr}(A^* A) = \sum_n \int_X |\langle e_n, K_x \rangle_{\mathcal{H}}|^2 \, d\nu(x) = \int_X \sum_n |\langle e_n, K_x \rangle_{\mathcal{H}}|^2 \, d\nu(x) = \int_X \langle K_x, K_x \rangle_{\mathcal{H}} \, d\nu(x) = \int_X K(x, x) \, d\nu(x) < \kappa^2$ and the thesis follows. [sent-342, score-0.192]
95 Corollary 7 The sampling operator $A_x : \mathcal{H} \to E$ is a Hilbert-Schmidt operator and $A_x^* y = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i K_{x_i}$, (23) $A_x^* A_x = \frac{1}{\ell} \sum_{i=1}^{\ell} \langle \cdot, K_{x_i} \rangle_{\mathcal{H}} K_{x_i}$. (24) [sent-343, score-0.182]
96 The thesis follows, observing that g is the inverse of $\frac{t^2}{1+t}$ and that $g(t) = \sqrt{t} + o(\sqrt{t})$. [sent-372, score-0.148]
97 Best choices for regularization parameters in learning theory: on the bias-variance problem. [sent-434, score-0.126]
98 Regularization of inverse problems, volume 375 of Mathematics and its Applications. [sent-465, score-0.148]
99 The theory of Tikhonov regularization for Fredholm equations of the first kind, volume 105 of Research Notes in Mathematics. [sent-483, score-0.149]
100 Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. [sent-522, score-0.156]
wordName wordTfidf (topN-words)
[('tikhonov', 0.346), ('kx', 0.319), ('fz', 0.301), ('smale', 0.245), ('pg', 0.227), ('ax', 0.218), ('rosasco', 0.194), ('aponnetto', 0.181), ('iovannini', 0.181), ('ito', 0.181), ('nverse', 0.165), ('xamples', 0.165), ('inverse', 0.148), ('roblem', 0.138), ('consistency', 0.129), ('regularization', 0.126), ('cucker', 0.123), ('regularized', 0.119), ('vito', 0.111), ('fh', 0.106), ('engl', 0.099), ('residual', 0.097), ('operator', 0.091), ('bertero', 0.082), ('zhou', 0.074), ('inf', 0.073), ('noise', 0.067), ('arsenin', 0.066), ('disi', 0.066), ('genova', 0.066), ('datum', 0.061), ('anyway', 0.061), ('earning', 0.061), ('hilbert', 0.06), ('squares', 0.058), ('steinwart', 0.058), ('caponnetto', 0.055), ('evgeniou', 0.052), ('iz', 0.049), ('kxi', 0.049), ('umberto', 0.049), ('unige', 0.049), ('yurinsky', 0.049), ('minimizer', 0.044), ('poggio', 0.044), ('connection', 0.042), ('ideal', 0.042), ('noisy', 0.041), ('mol', 0.041), ('mukherjee', 0.038), ('probabilistic', 0.035), ('closure', 0.034), ('gy', 0.034), ('reconstruction', 0.034), ('girosi', 0.034), ('andrea', 0.033), ('arbib', 0.033), ('ernesto', 0.033), ('francesca', 0.033), ('giovannini', 0.033), ('kaipio', 0.033), ('kurkova', 0.033), ('lorenzo', 0.033), ('modena', 0.033), ('noyaux', 0.033), ('odone', 0.033), ('pinelis', 0.033), ('polar', 0.033), ('analyse', 0.031), ('italy', 0.031), ('lim', 0.03), ('proposition', 0.029), ('loss', 0.029), ('devroye', 0.029), ('groetsch', 0.028), ('ill', 0.028), ('log', 0.027), ('solution', 0.027), ('stability', 0.027), ('risk', 0.027), ('reproducing', 0.027), ('convergence', 0.026), ('continuous', 0.026), ('therein', 0.025), ('niyogi', 0.025), ('norm', 0.025), ('en', 0.025), ('operators', 0.025), ('studied', 0.025), ('estimator', 0.025), ('adjoint', 0.025), ('spectral', 0.024), ('rkhs', 0.024), ('relation', 0.023), ('theory', 0.023), ('hypothesis', 0.023), ('projection', 0.023), ('continuously', 0.023), ('recalling', 0.023), ('regression', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 47 jmlr-2005-Learning from Examples as an Inverse Problem
Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone
Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE
2 0.2284632 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.13674702 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
Author: Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, Vladimir Temlyakov
Abstract: This paper is concerned with the construction and analysis of a universal estimator for the regression problem in supervised learning. Universal means that the estimator does not depend on any a priori assumptions about the regression function to be estimated. The universal estimator studied in this paper consists of a least-square fitting procedure using piecewise constant functions on a partition which depends adaptively on the data. The partition is generated by a splitting procedure which differs from those used in CART algorithms. It is proven that this estimator performs at the optimal convergence rate for a wide class of priors on the regression function. Namely, as will be made precise in the text, if the regression function is in any one of a certain class of approximation spaces (or smoothness spaces of order not exceeding one – a limitation resulting because the estimator uses piecewise constants) measured relative to the marginal measure, then the estimator converges to the regression function (in the least squares sense) with an optimal rate of convergence in terms of the number of samples. The estimator is also numerically feasible and can be implemented on-line. Keywords: distribution-free learning theory, nonparametric regression, universal algorithms, adaptive approximation, on-line algorithms c 2005 Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore and Vladimir Temlyakov. B INEV, C OHEN , DAHMEN , D E VORE AND T EMLYAKOV
4 0.063800134 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
5 0.06260623 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
6 0.047707155 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
7 0.04367606 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
8 0.035775606 64 jmlr-2005-Semigroup Kernels on Measures
9 0.035769422 11 jmlr-2005-Algorithmic Stability and Meta-Learning
10 0.034508646 67 jmlr-2005-Stability of Randomized Learning Algorithms
11 0.034083635 3 jmlr-2005-A Classification Framework for Anomaly Detection
12 0.032323778 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
13 0.029747155 6 jmlr-2005-A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs
14 0.029480103 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
15 0.029352147 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
16 0.028747482 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
17 0.026367459 36 jmlr-2005-Gaussian Processes for Ordinal Regression
18 0.026023418 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
19 0.025807124 39 jmlr-2005-Information Bottleneck for Gaussian Variables
20 0.025489675 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
topicId topicWeight
[(0, 0.177), (1, 0.14), (2, 0.149), (3, 0.394), (4, 0.061), (5, 0.46), (6, 0.061), (7, -0.077), (8, -0.001), (9, 0.076), (10, 0.056), (11, 0.169), (12, -0.18), (13, -0.069), (14, 0.156), (15, -0.001), (16, 0.01), (17, -0.135), (18, 0.05), (19, -0.087), (20, -0.0), (21, 0.021), (22, -0.046), (23, 0.019), (24, 0.045), (25, 0.044), (26, 0.004), (27, -0.021), (28, -0.072), (29, 0.01), (30, 0.117), (31, -0.007), (32, -0.018), (33, -0.066), (34, 0.015), (35, 0.021), (36, -0.031), (37, -0.021), (38, 0.036), (39, -0.035), (40, 0.005), (41, -0.001), (42, -0.003), (43, 0.065), (44, -0.054), (45, -0.016), (46, -0.085), (47, -0.024), (48, 0.003), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.94979382 47 jmlr-2005-Learning from Examples as an Inverse Problem
Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone
Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE
2 0.76368535 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.48118651 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
Author: Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, Vladimir Temlyakov
Abstract: This paper is concerned with the construction and analysis of a universal estimator for the regression problem in supervised learning. Universal means that the estimator does not depend on any a priori assumptions about the regression function to be estimated. The universal estimator studied in this paper consists of a least-square fitting procedure using piecewise constant functions on a partition which depends adaptively on the data. The partition is generated by a splitting procedure which differs from those used in CART algorithms. It is proven that this estimator performs at the optimal convergence rate for a wide class of priors on the regression function. Namely, as will be made precise in the text, if the regression function is in any one of a certain class of approximation spaces (or smoothness spaces of order not exceeding one – a limitation resulting because the estimator uses piecewise constants) measured relative to the marginal measure, then the estimator converges to the regression function (in the least squares sense) with an optimal rate of convergence in terms of the number of samples. The estimator is also numerically feasible and can be implemented on-line. Keywords: distribution-free learning theory, nonparametric regression, universal algorithms, adaptive approximation, on-line algorithms c 2005 Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore and Vladimir Temlyakov. B INEV, C OHEN , DAHMEN , D E VORE AND T EMLYAKOV
4 0.21796708 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
Author: Alain Rakotomamonjy, Stéphane Canu
Abstract: This work deals with a method for building a reproducing kernel Hilbert space (RKHS) from a Hilbert space with frame elements having special properties. Conditions on existence and a method of construction are given. Then, these RKHS are used within the framework of regularization theory for function approximation. Implications on semiparametric estimation are discussed and a multiscale scheme of regularization is also proposed. Results on toy and real-world approximation problems illustrate the effectiveness of such methods. Keywords: regularization, kernel, frames, wavelets
5 0.21271479 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
6 0.17762744 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
7 0.17445266 3 jmlr-2005-A Classification Framework for Anomaly Detection
8 0.15952305 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
9 0.15063669 11 jmlr-2005-Algorithmic Stability and Meta-Learning
10 0.13393483 64 jmlr-2005-Semigroup Kernels on Measures
11 0.13309081 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
12 0.13163251 67 jmlr-2005-Stability of Randomized Learning Algorithms
13 0.12460899 36 jmlr-2005-Gaussian Processes for Ordinal Regression
14 0.12306388 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.11542511 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
16 0.10924076 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
17 0.10585448 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.10425354 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
19 0.10265175 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
20 0.10156034 41 jmlr-2005-Kernel Methods for Measuring Independence
topicId topicWeight
[(13, 0.017), (17, 0.018), (19, 0.019), (36, 0.034), (37, 0.028), (42, 0.014), (43, 0.047), (47, 0.02), (52, 0.06), (70, 0.019), (76, 0.485), (88, 0.105), (90, 0.027), (94, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.68922973 47 jmlr-2005-Learning from Examples as an Inverse Problem
Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone
Abstract: Many works related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless the connection with inverse problem was considered only for the discrete (finite sample) problem and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency c 2005 Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini and Francesca Odone. D E V ITO , ROSASCO , C APONNETTO , D E G IOVANNINI AND O DONE
2 0.29920608 48 jmlr-2005-Learning the Kernel Function via Regularization
Author: Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of finding an optimal kernel from a prescribed convex set of kernels K for learning a real-valued function by regularization. We establish for a wide variety of regularization functionals that this leads to a convex optimization problem and, for square loss regularization, we characterize the solution of this problem. We show that, although K may be an uncountable set, the optimal kernel is always obtained as a convex combination of at most m + 2 basic kernels, where m is the number of data examples. In particular, our results apply to learning the optimal radial kernel or the optimal dot product kernel.
3 0.26870981 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.26784056 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
5 0.26555187 3 jmlr-2005-A Classification Framework for Anomaly Detection
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs
6 0.26551133 64 jmlr-2005-Semigroup Kernels on Measures
7 0.26352566 39 jmlr-2005-Information Bottleneck for Gaussian Variables
8 0.26317468 20 jmlr-2005-Clustering with Bregman Divergences
9 0.26298174 44 jmlr-2005-Learning Module Networks
10 0.26269144 71 jmlr-2005-Variational Message Passing
11 0.26214793 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
12 0.26027969 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
13 0.2595022 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.25841427 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
15 0.25549588 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
16 0.25515747 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
17 0.25494552 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
18 0.25478497 36 jmlr-2005-Gaussian Processes for Ordinal Regression
19 0.2530289 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
20 0.25178978 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods