jmlr jmlr2005 jmlr2005-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Maurer
Abstract: A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for meta-algorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
Reference: text
sentIndex sentText sentNum sentScore
1 COM Adalbertstrasse 55, D-80799 München, Germany Editor: Tommi Jaakkola Abstract A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. [sent-2, score-0.341]
2 We generally use m to denote the size of the ordinary samples and n for the size of the meta samples. [sent-59, score-0.282]
3 To state generalization error bounds for meta-algorithms, we need to define a statistical measure of the performance of an algorithm A with respect to an environment E , analogous to the risk R (c, D) of a hypothesis c with respect to a task D. [sent-68, score-0.37]
4 The risk (1) measures the expected loss of a hypothesis for future data drawn from the task distribution D, so the analogous quantity for an algorithm should measure the expected loss of the hypothesis returned by the algorithm for future tasks drawn from the environmental distribution E . [sent-69, score-0.396]
5 A corresponding experiment involves the random draw of a task D from E , training the algorithm with a sample S drawn randomly and independently from D, and applying the resulting hypothesis to data randomly drawn from D. [sent-70, score-0.212]
6 (4) The transfer risk R (A, E ) measures how well the algorithm A is adapted to the environment E . [sent-72, score-0.307]
7 The idea is to bound R (A (S) , E ) in terms of S with high probability in S, as S is drawn from the environment E for every environment E . [sent-77, score-0.409]
8 For example set l = lemp with the empirical estimator lemp (A, S) = (1/m) ∑_{i=1}^m l (A (S), zi). [sent-86, score-1.648]
9 Notice that (7) has exactly the same structure as an ordinary generalization error bound (3), where D has been replaced with D, S with S, A with A, l with l, and B with Π. [sent-89, score-0.22]
10 Because it controls future values of the estimator, a two-argument function Π satisfying (7) will be called an estimator prediction bound for A with respect to the estimator l. [sent-91, score-0.383]
11 The simplest case, where a nontrivial estimator prediction bound can be found, occurs when A searches only a finite set of algorithms, but there are many other possibilities, some are listed in Section 3. [sent-92, score-0.235]
12 Methods for deriving ordinary generalization error bounds often use an intermediate bound on the estimation error |R (A (S) , D) − l (A, S)| , valid for all distributions with high probability in S, for example by bounding the complexity of a hypothesis space searched by A. [sent-95, score-0.339]
13 Algorithmic stability is also useful at a different level to prove that a meta-algorithm A has an estimator prediction bound. [sent-109, score-0.228]
14 Then for every S, S\i and every ordinary sample S we have |lemp (A (S), S) − lemp (A (S\i), S)| ≤ β. [sent-116, score-1.689]
15 , Sn) drawn from (DE)^n, the inequality R (A (S), E) ≤ (1/n) ∑_{i=1}^n lemp (A (S), Si) + 2β + (4nβ + M) √(ln (1/δ) / (2n)) + 2β. [sent-126, score-0.829]
16 In Section 3 we show how to obtain estimator prediction bounds from standard results in learning theory. [sent-149, score-0.233]
17 In Section 5 we attempt a comparison of our bounds to ordinary generalization error bounds and compare our method and results to the approach taken by J. [sent-151, score-0.263]
18 Given such a task D and a hypothesis c ∈ C and a loss function l we use R (c, D) = Ez∼D [l (c, z)] to denote the risk (=expected loss) of the hypothesis c in task D w. [sent-179, score-0.26]
19 The leave-one-out estimator lloo and the empirical estimator lemp are the functions (the notation is from Bousquet, Elisseeff, 2002) lloo, lemp : A (C, Z) × (Z m) → [0, M] defined for A ∈ A (C, Z) and S = (z1, . [sent-186, score-2.262]
20 , zm) ∈ Z m by lloo (A, S) = (1/m) ∑_{i=1}^m l (A (S\i), zi), where S\i generally denotes the sample S with the i-th element deleted, and lemp (A, S) = (1/m) ∑_{i=1}^m l (A (S), zi). [sent-189, score-1.121]
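The two estimators above can be made concrete with a short sketch. Everything in the block below is a hypothetical illustration, not code from the paper: the "algorithm" simply predicts the sample mean and the loss is squared error, so lemp trains once on the full sample, while lloo retrains on each deleted sample S\i and tests on the held-out point.

```python
# Sketch of the empirical and leave-one-out estimators, assuming a toy
# "algorithm" that returns the sample mean and squared loss l(c, z) = (c - z)^2.

def mean_algorithm(sample):
    """A(S): return the hypothesis (here just a number) trained on S."""
    return sum(sample) / len(sample)

def loss(c, z):
    return (c - z) ** 2

def l_emp(A, S):
    """Empirical estimator: train once on S, average the loss over S."""
    c = A(S)
    return sum(loss(c, z) for z in S) / len(S)

def l_loo(A, S):
    """Leave-one-out estimator: retrain on S with point i deleted, test on z_i."""
    m = len(S)
    total = 0.0
    for i in range(m):
        S_minus_i = S[:i] + S[i + 1:]
        total += loss(A(S_minus_i), S[i])
    return total / m

S = [0.0, 2.0]
print(l_emp(mean_algorithm, S))  # mean is 1.0, so (1 + 1)/2 = 1.0
print(l_loo(mean_algorithm, S))  # each deleted point is distance 2 away: 4.0
```

The example also shows why lloo is the more pessimistic (and less biased) of the two: the tested point never participates in training.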
21 A meta-learning task is specified by an environment E ∈ M1 (M1 (Z)) which models the drawing of learning tasks D ∼ E . [sent-198, score-0.216]
22 The induced distribution DE models the probability DE (S) for an m-sample S to arise when a task D is drawn from the environment E , followed by m independent draws of examples from the same distribution D. [sent-201, score-0.26]
23 Given an environment E ∈ M1 (M1 (Z)), an algorithm A ∈ A (C, Z) and a loss function l : C × Z → [0, M] the transfer risk of A in the environment E w. [sent-205, score-0.487]
24 It gives the expected risk of the hypothesis A (S) for a task D randomly drawn from the environment and the sample S randomly drawn from this task. [sent-209, score-0.387]
25 Given an m-sample S, the object A (S) (S) is the hypothesis returned by the algorithm A (S), when trained with an ordinary sample S. [sent-219, score-0.243]
26 A function Π : (0, 1] × ∪_{n=1}^∞ (Z m)^n → [0, M] is an estimator prediction bound for the meta-algorithm A ∈ A (A (C, Z), Z m) with respect to the estimator l : A (C, Z) × (Z m) → [0, M] iff ∀D ∈ M1 (Z m), ∀δ > 0, Dn {S : ES∼D [l (A (S), S)] ≤ Π (δ, S)} ≥ 1 − δ. [sent-221, score-0.235]
27 (10) An estimator prediction bound is formally equivalent to an ordinary generalization bound under the identifications Z ↔ Z m , C ↔ A (C, Z) , l ↔ l, A ↔ A, B ↔ Π. [sent-222, score-0.47]
28 Given an estimator l : A (C, Z) × (Z m) → [0, M] the empirical meta-estimator lemp is the function lemp : A (A (C, Z), Z m) × (Z m)^n → [0, M] defined for A ∈ A (A (C, Z), Z m) and S = (S1, . [sent-224, score-1.622]
29 , Sn) ∈ (Z m)^n by lemp (A, S) = (1/n) ∑_{i=1}^n l (A (S), Si). [sent-227, score-0.737]
30 The meta-estimator lloo is defined analogously. [sent-228, score-0.246]
31 For example if l = lloo then (lloo)emp (A, S) = (1/n) ∑_{i=1}^n lloo (A (S), Si). [sent-230, score-0.246]
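Continuing the toy sketch above (all names are mine, not the paper's), the meta-estimators simply average a base estimator over the n ordinary samples in the meta-sample; here (lemp)emp for a meta-algorithm that, for simplicity, ignores the meta-sample and always returns the same base algorithm.

```python
# Sketch of the empirical meta-estimator (l_emp)_emp: average the base
# estimator l_emp over the n ordinary samples in the meta-sample.
# The constant meta-algorithm below is a hypothetical stand-in.

def mean_algorithm(sample):
    return sum(sample) / len(sample)

def l_emp(A, S):
    c = A(S)
    return sum((c - z) ** 2 for z in S) / len(S)

def constant_meta_algorithm(S_meta):
    """A(S): here it ignores the meta-sample and returns a fixed base algorithm."""
    return mean_algorithm

def meta_l_emp(A_meta, S_meta):
    """(l_emp)_emp(A, S) = (1/n) * sum_i l_emp(A(S), S_i)."""
    A = A_meta(S_meta)
    return sum(l_emp(A, S_i) for S_i in S_meta) / len(S_meta)

S_meta = [[0.0, 2.0], [1.0, 1.0]]
print(meta_l_emp(constant_meta_algorithm, S_meta))  # (1.0 + 0.0)/2 = 0.5
```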
32 Ordinary learning: sample S = (z1, . . . , zm) ∈ Z m; hypothesis c ∈ C; algorithm A ∈ A (C, Z); loss l : C × Z → [0, M]; learning task D ∈ M1 (Z); empirical estimator lemp (A, S) = (1/m) ∑_{i=1}^m l (A (S), zi); risk R (c, D) = Ez∼D [l (c, z)]; bound: generalization error. Meta learning: S = (S1, . [sent-234, score-0.978]
33 Correspondingly an estimator prediction bound is not a generalization error bound for the transfer risk. [sent-242, score-0.422]
34 Estimator Prediction Bounds In this section we give examples of estimator prediction bounds obtained from established results of statistical learning theory. [sent-265, score-0.233]
35 Anthony, Bartlett, 1999) give, for any δ > 0, ∀D, Dm {S : sup_{c∈H} [R (c, D) − (1/m) ∑_{j=1}^m l (c, zj)] ≤ √(ln (K/δ) / (2m))} ≥ 1 − δ, (11) which gives the following generalization error bound for A: ∀D ∈ M1 (Z), ∀δ > 0, Dm {S : R (A (S), D) ≤ B (δ, S)} ≥ 1 − δ with B (δ, S) = lemp (A, S) + √((ln K + ln (1/δ)) / (2m)). [sent-277, score-0.256]
36 Substituting Z m for Z, A (C, Z) for C, l = lemp or l = lloo for l, and a finite set of algorithms {A1, . [sent-280, score-0.983]
37 , AK} for the finite set of hypotheses {c1, . . . , cK}, we arrive at the following statement: Every meta-algorithm A such that A (S) ∈ {A1, . [sent-286, score-0.871]
38 , AK} for every meta-sample S = (S1, . . . , Sn) has the estimator prediction bound ∀D ∈ M1 (Z m), ∀δ > 0, Dn {S : ES∼D [l (A (S), S)] ≤ Π (δ, S)} ≥ 1 − δ with Π (δ, S) = lemp (A, S) + √((ln K + ln (1/δ)) / (2n)). (12) [sent-292, score-1.072]
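The finite-class estimator prediction bound just stated can be sketched numerically; the function name below is my own, and the bound is Π(δ, S) = lemp(A, S) + √((ln K + ln(1/δ)) / (2n)).

```python
import math

def finite_class_bound(l_emp_value, K, n, delta):
    """Pi(delta, S) = l_emp + sqrt((ln K + ln(1/delta)) / (2n)): the estimator
    prediction bound for a meta-algorithm choosing among K candidate algorithms
    from a meta-sample of size n."""
    return l_emp_value + math.sqrt((math.log(K) + math.log(1.0 / delta)) / (2 * n))

# With a single candidate (K = 1) and delta = 1 the complexity term vanishes.
print(finite_class_bound(0.1, K=1, n=50, delta=1.0))  # 0.1
```

The complexity term grows only logarithmically in the number of candidates K, but shrinks like 1/√n in the number of observed tasks.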
39 (13) which implies the following generalization error bound, valid for every algorithm A searching only the hypothesis space H: B (δ, S) = lemp (A, S) + inf {t : 4 N1 (t/8, F (H, l), 2m) e^{−t² m/32} ≤ δ}. (14) [sent-297, score-0.899]
40 Suppose now that H ⊆ A (C, Z) is a space of algorithms and fix an estimator l = lloo or l = lemp. [sent-298, score-1.131]
41 Every meta-algorithm A such that A (S) ∈ H for all S has thus the estimator prediction bound Π (δ, S) = lemp (A, S) + inf {t : 4 N1 (t/8, F (H, l), 2n) e^{−t² n/32} ≤ δ}. (15) [sent-300, score-1.01]
42 Then for any learning task D ∈ M1 (Z) and any positive integer m, with probability greater than 1 − δ in a sample S drawn from Dm, R (A (S), D) ≤ lloo (A, S) + β + (4mβ + M) √(ln (1/δ) / (2m)) and R (A (S), D) ≤ lemp (A, S) + 2β + (4mβ + M) √(ln (1/δ) / (2m)). [sent-304, score-0.455]
43 These bounds are good if we can show uniform β-stability with β ≈ 1/m^a, with a > 1/2. [sent-305, score-0.782]
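The two stability bounds above can be sketched as plain formulas (the helper names are mine, not the paper's); the second loop illustrates the remark that β ≈ 1/m^a with a > 1/2 makes the complexity term vanish as m grows.

```python
import math

def loo_stability_bound(l_loo_value, beta, m, M, delta):
    """R(A(S), D) <= l_loo + beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2m))."""
    return l_loo_value + beta + (4 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

def emp_stability_bound(l_emp_value, beta, m, M, delta):
    """R(A(S), D) <= l_emp + 2*beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2m))."""
    return l_emp_value + 2 * beta + (4 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

# With beta = 1/m (so a = 1 > 1/2), the term 4*m*beta*sqrt(1/(2m)) tends to zero.
for m in (100, 10000):
    print(emp_stability_bound(0.2, beta=1.0 / m, m=m, M=1.0, delta=0.05))
```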
44 The notion of uniform stability easily transfers to meta-algorithms to give estimator prediction bounds. [sent-306, score-0.228]
45 Fix an estimator l = lloo or l = lemp and suppose that the meta-algorithm satisfies the following condition: For every meta sample S= (S1 , . [sent-307, score-1.308]
46 Theorem 2 then gives the estimator prediction bounds Πloo (δ, S) = lloo (A, S) + β + (4nβ + M) √(ln (1/δ) / (2n)) (16) and Πemp (δ, S) = lemp (A, S) + 2β + (4nβ + M) √(ln (1/δ) / (2n)). (17) [sent-311, score-1.316]
47 Also |ES∼Dm [lemp (A, S) − lloo (A, S)]| ≤ (1/m) ∑_{i=1}^m ES [|l (A (S), zi) − l (A (S\i), zi)|] ≤ (1/m) ∑_{i=1}^m ES [β] = β. [sent-320, score-0.298]
48 Suppose now that we have an estimator prediction bound Π for the meta-algorithm A with respect to the estimator l, so that, for all δ > 0, ∀D ∈ M1 (Z m), Dn {S : ES∼D [l (A (S), S)] ≤ Π (δ, S)} ≥ 1 − δ, (18) where the estimator l : A (C, Z) × Z m → [0, M] refers to either lemp or lloo. [sent-321, score-1.514]
49 When l = lloo the bound (18) is already powerful by itself. [sent-323, score-0.293]
50 , Sn) define A (S) to be A (S) = arg min_A (1/n) ∑_{i=1}^n lloo (A, Si). [sent-334, score-0.246]
51 Applying the estimator prediction bound (12) for this type of algorithm in combination with (19) above then gives, for any E and with probability greater than 1 − δ in the meta sample drawn from (DE)^n, ED∼E [ES∼D^{m−1} [R (A (S) (S), D)]] ≤ (1/n) ∑_{i=1}^n lloo (A (S), Si) + √(ln (K/δ) / (2n)). (20) [sent-339, score-0.742]
52 A similar result should hold if lloo is replaced by any other, nearly unbiased estimator. [sent-340, score-0.246]
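The selection rule behind (20) can be sketched directly: over a finite set of candidate algorithms, return the one minimizing the average leave-one-out error on the meta-sample. The constant-predictor candidates below are hypothetical stand-ins, not from the paper.

```python
# Sketch of a meta-algorithm over a finite candidate set {A_1, ..., A_K}:
# A(S) = argmin_A (1/n) * sum_i l_loo(A, S_i).

def make_constant_algorithm(c):
    """A candidate base algorithm that ignores its sample and predicts c."""
    return lambda sample: c

def l_loo(A, S):
    m = len(S)
    return sum((A(S[:i] + S[i + 1:]) - S[i]) ** 2 for i in range(m)) / m

def select_by_loo(candidates, S_meta):
    """Return the candidate with the smallest average leave-one-out error."""
    def avg_loo(A):
        return sum(l_loo(A, S_i) for S_i in S_meta) / len(S_meta)
    return min(candidates, key=avg_loo)

candidates = [make_constant_algorithm(0.0), make_constant_algorithm(1.0)]
S_meta = [[0.0, 0.0], [0.0, 0.2]]
best = select_by_loo(candidates, S_meta)
print(best([]))  # the constant-0 predictor wins on this environment: 0.0
```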
53 For more sophisticated meta-algorithms we need to consider the case l = lemp . [sent-343, score-0.737]
54 In this case an estimator prediction bound only bounds the expected empirical error lemp (A (S) , S) of A (S) for a sample S drawn from DE , but it does not give any generalization guarantee for the hypothesis A (S) (S). [sent-344, score-1.163]
55 For example A (S) could be some single-nearest-neighbour algorithm for which we would have lemp (A (S) , S) = 0 for almost all S, but A (S) would have poor generalization performance. [sent-345, score-0.762]
56 D,S The estimator prediction bound controls the first term above, so it remains to bound the second term which is independent of S. [sent-347, score-0.282]
57 We need to bound the expected estimation error of the estimator l uniformly for all distributions D and all algorithms A (S) for all meta-samples S. [sent-348, score-0.247]
58 Then for every environment E , with probability greater than 1 − δ in S as drawn from (DE )n we have R (A (S) , E ) ≤ Π (δ, S) + ε. [sent-351, score-0.23]
59 Proof For any D, S and arbitrary η we have ES∼Dm [R (A (S) (S) , D)] ≤ ES∼Dm [lemp (A (S) , S)] + ES∼Dm [|R (A (S) (S) , D) − lemp (A (S) , S)|] ≤ ES∼Dm [lemp (A (S) , S)] + B (η) + Mη, where (21) was used in the last inequality. [sent-352, score-0.737]
60 Using the results in Section 3, now on the level of ordinary learning, we see that the above theorem can be applied • if every A (S) selects a hypothesis from a finite set H (S) of choices with |H (S)| ≤ K for all S. [sent-355, score-0.252]
61 Taking the expectation D ∼ E and using the estimator prediction bound (18) with DE in place of D gives the result in just as in the proof of the previous theorem. [sent-367, score-0.235]
62 Theorem 1 now follows immediately from Theorem 6 and from the estimator prediction bound (17) in Section 3. [sent-368, score-0.235]
63 The estimator prediction bound Π (δ, S) will typically depend on the size n of the meta-sample S = (S1 , . [sent-370, score-0.235]
64 The ordinary generalization error bound (examples in Section 3) applies to a situation where a sample S has already been drawn from an unknown task D and the estimator lemp (A, S) already has a definite value. [sent-383, score-1.198]
65 It typically has the structure ∀D, Dm {S : R (A (S) , D) ≤ lemp (A, S) + ε0 } ≥ 1 − δ where ε0 is a bound on the estimation error. [sent-384, score-0.798]
66 To get ε0 our method always requires some condition (uniform bounds on estimation errors, β-stability) on the algorithms A (S), which is also sufficient to prove an ordinary generalization error bound for such algorithms A (S). [sent-387, score-0.279]
67 The corresponding estimation errors are about the same in our bounds and in the ordinary generalization error bounds. [sent-388, score-0.232]
68 Comparing the two bounds therefore involves a comparison of the estimator prediction bound Π (δ, S) to a ’generic’ value of the estimator lemp (A, S). [sent-393, score-1.165]
69 But as the size n of the meta-sample S becomes large, corresponding to an experienced meta-learner, this additional term tends to zero, and Π (δ, S) is likely to win over the ’generic’ lemp (A, S), because A (S) is likely to outperform the ’generic’ algorithm A on the meta-sample S. [sent-395, score-0.737]
70 While it is easy to define a generic value of S (simply taking S ∼ DE if some environment E is given), it is not so clear how we should pick a generic algorithm A. [sent-397, score-0.24]
71 The generic value of lemp (A, S) is then Γ = ES∼DE [(1/K) ∑_{k=1}^K lemp (Ak, S)]. [sent-403, score-1.52]
72 The meta-algorithm to consider for comparison is A (S) = arg min_{A ∈ {A1, . [sent-404, score-0.871]
73 , AK}} (1/n) ∑_{S∈S} lemp (A, S), with the estimator prediction bound ES∼DE [lemp (A (S), S)] ≤ min_{k=1,...,K} [(1/n) ∑_{Si∈S} lemp (Ak, Si)] + √(ln (K/δM) / (2n)) = Π (δM, S), (22) where δM is the confidence parameter associated with the draw of the meta-sample S. [sent-407, score-1.039]
74 Now let ∆ (S) = (1/K) ∑_{k=1}^K (1/n) ∑_{S∈S} lemp (Ak, S) − min_{k=1,...,K} (1/n) ∑_{S∈S} lemp (Ak, S). [sent-408, score-1.474]
75 So for large meta-samples S our bounds will very probably be true and better than the generic value of ordinary generalization bounds by a margin of roughly ∆ (S). [sent-411, score-0.309]
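The margin ∆(S) by which meta-selection beats the "generic" candidate can be sketched in a few lines; the candidate error values are toy numbers of my own, not results from the paper.

```python
# Sketch of Delta(S): the average over K candidates of their mean empirical
# error on the meta-sample, minus the minimum such mean; this is the margin by
# which selecting the best candidate beats picking a candidate at random.

def delta_gap(per_candidate_mean_errors):
    K = len(per_candidate_mean_errors)
    avg = sum(per_candidate_mean_errors) / K
    return avg - min(per_candidate_mean_errors)

# Two hypothetical candidates with mean empirical errors 0.2 and 0.4:
# the gap is (0.2 + 0.4)/2 - 0.2 = 0.1 (up to float rounding).
print(delta_gap([0.2, 0.4]))
```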
76 We have sketched a version of this approach which can be applied both to ordinary and to meta algorithms in Section 3. [sent-423, score-0.282]
77 In our framework it is natural to study covering numbers for the space of algorithms HH = AH : H ∈H and use them to derive an estimator prediction bound (15) as outlined in Section 3. [sent-431, score-0.253]
78 Putting together the estimator prediction bound (15), the uniform bound on the estimation error (13) and Theorem 5, we arrive at Corollary 7 Let ε0 = inf_{γ>0} [γ + 4 sup_{H∈H} N1 (γ/8, F (H, l), 2m) e^{−γ² m/32}] and, for δ > 0, ε1 = inf {t : 4 N1 (t/8, F (H, l), 2n) e^{−t² n/32} ≤ δ}. [sent-433, score-0.406]
79 Then for any environment E, with probability at least 1 − δ in the draw of a meta-sample S from (DE)^n, we have R (AH(S), E) ≤ (1/n) ∑_{Si∈S} lemp (AH(S), Si) + ε1 + ε0. [sent-434, score-0.902]
80 Baxter (2000) also defines capacities for H, but aims at giving a bound on sup_{H∈H} [ED∼E [inf_{c∈H} R (c, D)] − (1/n) ∑_{Si∈S} lemp (AH, Si)], valid with high probability in S as drawn from (DE)^n for any E. [sent-437, score-0.917]
81 The inequality erE (H (S)) = ED∼E [ES∼Dm [inf_{c∈H(S)} R (c, D)]] ≤ ED∼E [ES∼Dm [R (AH(S) (S), D)]] = R (AH(S), E) (28) shows that our bounds on the transfer risk also provide bounds on (27). [sent-441, score-0.287]
82 This is similar to the estimator prediction bounds in our approach and contrary to our bounds on the transfer risk. [sent-443, score-0.393]
83 If corresponding capacity bounds held for all hypothesis spaces H ∈ H, a bound on the transfer risk R AH (S) , E would result from the bound on (27) in a way parallel to our approach (in Baxter, 2000 a bound on the transfer risk comparable to our bounds is never stated). [sent-445, score-0.627]
84 In Baxter (2000) Theorem 2, to get erE (H (S)) ≤ (1/n) ∑_{Si∈S} lemp (AH(S), Si) + ε with probability at least 1 − δ in S, it is required that n ≥ (256/ε²) ln (8 C (ε/32, H∗) / δ), (29) and there is an additional condition on m. [sent-452, score-0.787]
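Condition (29) can be sketched as a required meta-sample size. Since the capacity C(ε/32, H*) is problem-dependent, it is supplied by hand below; the function name is my own.

```python
import math

def baxter_required_n(epsilon, capacity, delta):
    """Right-hand side of condition (29): n >= (256/eps^2) * ln(8*C/delta),
    where `capacity` stands in for the problem-dependent C(eps/32, H*)."""
    return (256.0 / epsilon ** 2) * math.log(8.0 * capacity / delta)

# Even with trivial capacity C = 1, the requirement grows like 1/eps^2:
print(baxter_required_n(0.1, capacity=1.0, delta=0.05))
```

This makes concrete the comparison in the text: the dominant cost in (29) is the 1/ε² factor, with the capacity entering only logarithmically.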
85 To compare (29) with our bound (25), we disregard the constants (which are better in (25)) and concentrate on a comparison of the complexity measures C (ε, H∗ ) and N1 (ε, F (HH , lemp ) , n). [sent-453, score-0.8]
86 Proposition 9 For all ε, n N1 (ε, F (HH , lemp ) , n) ≤ C (ε, H∗ ) . [sent-459, score-0.737]
87 Note that for H ∈ H we have inf_{c∈H} (1/m) ∑_{i=1}^m l (c, zi) = lemp (AH, S). [sent-464, score-0.763]
88 , ΨN} ⊆ L1 (M1 (Z)) such that for every H ∈ H there is some i such that ε ≥ dQS (H∗, Ψi) = (1/n) ∑_{j=1}^n |H∗ (DSj) − Ψi (DSj)| = (1/n) ∑_{j=1}^n |lemp (AH, Sj) − Ψi (DSj)|. [sent-476, score-0.761]
89 (30) On the other hand we have F (HH , lemp ) |S = (lemp (AH , S1 ) , . [sent-477, score-0.737]
90 , lemp (AH, Sn)) : H ∈ H, so, setting xi ∈ Rn with (xi)j = Ψi (DSj), we see from (30) that every member of F (HH, lemp) |S is within d1-distance ε of some xi. [sent-480, score-1.498]
91 The formula for the empirical loss of A (S) is, using (32) and (33), lemp (A, S) = (1/m) ∑_{i=1}^m ((Gα)i − yi)² = (1/m) ∑_{i=1}^m (((G + mλI) α)i − yi − mλαi)² = (1/m) ∑_{i=1}^m (−mλαi)² = mλ² ∑_{i=1}^m αi². [sent-515, score-0.769]
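The identity lemp(A, S) = mλ² ∑ αᵢ² can be checked numerically on a tiny example. The kernel, data, and variable names below are hypothetical choices of mine; α solves (G + mλI)α = y, where G is the Gram matrix.

```python
# Check l_emp(A, S) = m * lam^2 * sum(alpha_i^2) for kernel regularized least
# squares, with alpha = (G + m*lam*I)^{-1} y. Toy data, linear kernel, and an
# explicit 2x2 solve keep the sketch self-contained.

lam = 0.5
x = [1.0, 2.0]
y = [1.0, 2.0]
m = len(x)

# Gram matrix of the linear kernel k(s, t) = s * t.
G = [[xi * xj for xj in x] for xi in x]

# Solve (G + m*lam*I) alpha = y explicitly for the 2x2 case.
a, b = G[0][0] + m * lam, G[0][1]
c, d = G[1][0], G[1][1] + m * lam
det = a * d - b * c
alpha = [(d * y[0] - b * y[1]) / det, (-c * y[0] + a * y[1]) / det]

# Empirical error computed directly from the predictions (G alpha)_i ...
pred = [sum(G[i][j] * alpha[j] for j in range(m)) for i in range(m)]
l_direct = sum((pred[i] - y[i]) ** 2 for i in range(m)) / m

# ... and via the closed form m * lam^2 * sum(alpha_i^2).
l_identity = m * lam ** 2 * sum(ai ** 2 for ai in alpha)

print(abs(l_direct - l_identity) < 1e-12)  # True
```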
92 , Sn ), drawn from (DE )n for some environment E , and suppose that we have used some ’primer’ algorithm A0 (for example the regression algorithm above for an appropriate value of λ = λ0 ) to train corresponding regression functions hk = A0 (Sk ) ∈ H. [sent-521, score-0.297]
93 Therefore A (S) is ordinary regularized least squares regression with the modified inner product . [sent-565, score-0.231]
94 The first condition, essential for the estimator prediction bound, is satisfied by virtue of the following proposition which is proven in the next subsection: 989 M AURER Corollary 10 The algorithm A is uniformly β -stable w. [sent-581, score-0.244]
95 , Sn) is the same as S, with only some Sk deleted, then |lemp (A (S), S) − lemp (A (S′), S)| ≤ β for every sample S ∈ Z m, with β = 4µ / (λ (n − 1)). [sent-590, score-1.517]
96 Substitution in Theorem 1 gives, for every environment E, with probability at least 1 − δ in a meta-sample S drawn from (DE)^n, R (A (S), E) ≤ (1/n) ∑_{Si∈S} lemp (A (S), Si) + 8µ / (λ (n − 1)) + (16µn / (λ (n − 1)) + 4µ/λ) √(ln (1/δ) / (2n)) + 4µ / (mλ). (36) [sent-591, score-1.001]
97 The bound gives a performance guarantee of the algorithm applied to future tasks on the basis of the empirical term (lemp)emp (A, S) = (1/n) ∑_{Si∈S} lemp (A (S), Si). (37) [sent-592, score-0.82]
98 A more principled approach would involve the direct minimization of (1/n) ∑_{Si∈S} lemp (A, Si) + N (A), where N (A) would be some meta-regularizer. [sent-598, score-0.737]
99 We have to show that |lemp (A (S), S) − lemp (A (S′), S)| ≤ 4µ / (λ (n − 1)) for every sample S = (z1, . [sent-631, score-1.517]
100 Using the formula (34) for the empirical error in regularised least squares regression then gives lemp (A (S), S) − lemp (A (S′), S) = mλ² (‖α‖² − ‖α′‖²) ≤ 4µ / (λ (n − 1)). [sent-650, score-1.515]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for meta-algorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
2 0.081176311 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
3 0.055889171 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
4 0.051920321 67 jmlr-2005-Stability of Randomized Learning Algorithms
Author: Andre Elisseeff, Theodoros Evgeniou, Massimiliano Pontil
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.1 Keywords: stability, randomized learning algorithms, sensitivity analysis, bagging, bootstrap methods, generalization error, leave-one-out error.
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
6 0.035769422 47 jmlr-2005-Learning from Examples as an Inverse Problem
7 0.035745706 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
8 0.034572948 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
9 0.032311205 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
10 0.031763289 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features
11 0.030910371 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
12 0.030609298 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
13 0.027677497 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
14 0.026839502 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
15 0.02516488 22 jmlr-2005-Concentration Bounds for Unigram Language Models
16 0.024379959 29 jmlr-2005-Efficient Margin Maximizing with Boosting
17 0.024180938 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
18 0.023640562 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
19 0.023401771 36 jmlr-2005-Gaussian Processes for Ordinal Regression
20 0.02291017 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
simIndex simValue paperId paperTitle
same-paper 1 0.94633621 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechnism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learners performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for metaalgorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
2 0.25248563 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
Author: John Langford
Abstract: We discuss basic prediction theory and its impact on classification success evaluation, implications for learning algorithm design, and uses in learning algorithm execution. This tutorial is meant to be a comprehensive compilation of results which are both theoretically rigorous and quantitatively useful. There are two important implications of the results presented here. The first is that common practices for reporting results in classification should change to use the test set bound. The second is that train set bounds can sometimes be used to directly motivate learning algorithms. Keywords: sample complexity bounds, classification, quantitative bounds
3 0.18224062 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
Author: Theodoros Evgeniou, Charles A. Micchelli, Massimiliano Pontil
Abstract: We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning particularly when there are many related tasks but few data per task. Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms
Author: Savina Andonova Jaeger
Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes
5 0.13180704 47 jmlr-2005-Learning from Examples as an Inverse Problem
Author: Ernesto De Vito, Lorenzo Rosasco, Andrea Caponnetto, Umberto De Giovannini, Francesca Odone
Abstract: Many works have related learning from examples to regularization techniques for inverse problems, emphasizing the strong algorithmic and conceptual analogy of certain learning algorithms with regularization algorithms. In particular, it is well known that regularization schemes such as Tikhonov regularization can be effectively used in the context of learning and are closely related to algorithms such as support vector machines. Nevertheless, the connection with inverse problems was considered only for the discrete (finite sample) problem, and the probabilistic aspects of learning from examples were not taken into account. In this paper we provide a natural extension of such analysis to the continuous (population) case and study the interplay between the discrete and continuous problems. From a theoretical point of view, this allows us to draw a clear connection between the consistency approach in learning theory and the stability convergence property in ill-posed inverse problems. The main mathematical result of the paper is a new probabilistic bound for the regularized least-squares algorithm. By means of standard results on the approximation term, the consistency of the algorithm easily follows. Keywords: statistical learning, inverse problems, regularization theory, consistency
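To make the Tikhonov-regularization connection concrete, here is its simplest one-dimensional linear form — a generic textbook sketch, not the paper's operator-theoretic population setting:

```python
def tikhonov_1d(xs, ys, lam):
    # One-dimensional special case of w = (X'X + lam*I)^(-1) X'y.
    # The lam term keeps the "inverse" well conditioned (the ill-posed
    # inverse problem becomes stable), at the price of some shrinkage.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


xs = [1.0, 2.0, 3.0]
ys = [3.0 * x for x in xs]          # noiseless data from y = 3x

w_ls = tikhonov_1d(xs, ys, 0.0)     # plain least squares: exactly 3
w_reg = tikhonov_1d(xs, ys, 10.0)   # regularization shrinks the estimate
print(w_ls, w_reg)                  # 3.0, then a value below 3.0
```

As lam tends to 0 with enough data, the regularized solution approaches the unregularized one, which is the discrete shadow of the consistency/stability interplay the abstract describes.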
6 0.1284253 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
7 0.12812436 67 jmlr-2005-Stability of Randomized Learning Algorithms
8 0.12698369 13 jmlr-2005-Analysis of Variance of Cross-Validation Estimators of the Generalization Error
9 0.12550016 29 jmlr-2005-Efficient Margin Maximizing with Boosting
10 0.10279776 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
11 0.10199444 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
12 0.10187338 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
13 0.098103583 3 jmlr-2005-A Classification Framework for Anomaly Detection
14 0.097198717 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
15 0.094242595 36 jmlr-2005-Gaussian Processes for Ordinal Regression
16 0.091459043 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
17 0.090498202 10 jmlr-2005-Adaptive Online Prediction by Following the Perturbed Leader
18 0.088775367 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
19 0.085893326 66 jmlr-2005-Smooth ε-Insensitive Regression by Loss Symmetrization
20 0.082996801 42 jmlr-2005-Large Margin Methods for Structured and Interdependent Output Variables
topicId topicWeight
[(3, 0.354), (13, 0.011), (17, 0.015), (19, 0.023), (36, 0.027), (37, 0.026), (43, 0.063), (47, 0.017), (52, 0.089), (59, 0.026), (70, 0.026), (74, 0.02), (88, 0.125), (90, 0.028), (94, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.68075627 11 jmlr-2005-Algorithmic Stability and Meta-Learning
Author: Andreas Maurer
Abstract: A mechanism of transfer learning is analysed, where samples drawn from different learning tasks of an environment are used to improve the learner's performance on a new task. We give a general method to prove generalisation error bounds for such meta-algorithms. The method can be applied to the bias learning model of J. Baxter and to derive novel generalisation bounds for meta-algorithms searching spaces of uniformly stable algorithms. We also present an application to regularized least squares regression. Keywords: algorithmic stability, meta-learning, learning to learn
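The base notion behind the paper's stability assumption can be illustrated numerically. The toy below is not the paper's meta-level bound; it only shows uniform stability in the usual Bousquet–Elisseeff sense for a 1-D regularized least-squares learner: deleting a single training example perturbs the learned weight by an amount that shrinks roughly like 1/m. All constants and the deterministic data generator are illustrative choices.

```python
def ridge_1d(xs, ys, lam):
    # 1-D regularized least squares: w = (sum x*y) / (sum x^2 + lam * m).
    m = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * m)


def stability(m, lam=1.0):
    # Largest change in the learned weight when a single example is deleted.
    xs = [((i * 37) % 100) / 100.0 + 0.01 for i in range(m)]  # deterministic "data"
    ys = [2.0 * x for x in xs]
    w = ridge_1d(xs, ys, lam)
    return max(abs(w - ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam))
               for i in range(m))


s20, s200 = stability(20), stability(200)
print(s20, s200)  # the leave-one-out perturbation shrinks as m grows
```

It is this per-algorithm stability that the paper's method lifts to the meta level, where entire tasks drawn from the environment play the role of individual samples.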
2 0.42766994 64 jmlr-2005-Semigroup Kernels on Measures
Author: Marco Cuturi, Kenji Fukumizu, Jean-Philippe Vert
Abstract: We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures, such as entropy or generalized variance, to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the “kernel trick”. Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark handwritten-digit classification task to illustrate the validity of the approach. Keywords: kernels on measures, semigroup theory, Jensen divergence, generalized variance, reproducing kernel Hilbert space
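A minimal discrete instance of the "function of their sum" property is the entropy kernel k(μ, μ′) = exp(−h((μ + μ′)/2)) on probability vectors, where h is Shannon entropy: the value depends on the two measures only through their sum. This is a simplified sketch (no temperature parameter, discrete measures only), not the paper's full construction.

```python
from math import log, exp


def entropy(p):
    # Shannon entropy of a discrete probability vector (natural log).
    return -sum(q * log(q) for q in p if q > 0.0)


def entropy_kernel(p, q):
    # k(p, q) = exp(-h((p + q) / 2)): depends on p and q only via p + q,
    # the defining feature of a semigroup kernel on measures.
    mix = [(a + b) / 2.0 for a, b in zip(p, q)]
    return exp(-entropy(mix))


p = [0.5, 0.3, 0.2]
q = [0.05, 0.05, 0.9]
print(entropy_kernel(p, q))  # symmetric, strictly between 0 and 1
```

Symmetry is immediate from the definition, and positive definiteness of this family is the subject of the paper's integral-representation results.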
3 0.42320862 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
Author: Cheng Soon Ong, Alexander J. Smola, Robert C. Williamson
Abstract: This paper addresses the problem of choosing a kernel suitable for estimation with a support vector machine, hence further automating machine learning. This goal is achieved by defining a reproducing kernel Hilbert space on the space of kernels itself. Such a formulation leads to a statistical estimation problem similar to the problem of minimizing a regularized risk functional. We state the equivalent representer theorem for the choice of kernels and present a semidefinite programming formulation of the resulting optimization problem. Several recipes for constructing hyperkernels are provided, as well as the details of common machine learning problems. Experimental results for classification, regression and novelty detection on UCI data show the feasibility of our approach. Keywords: learning the kernel, capacity control, kernel methods, support vector machines, representer theorem, semidefinite programming
4 0.41892093 3 jmlr-2005-A Classification Framework for Anomaly Detection
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: One way to describe anomalies is by saying that anomalies are not concentrated. This leads to the problem of finding level sets for the data generating density. We interpret this learning problem as a binary classification problem and compare the corresponding classification risk with the standard performance measure for the density level problem. In particular it turns out that the empirical classification risk can serve as an empirical performance measure for the anomaly detection problem. This allows us to compare different anomaly detection algorithms empirically, i.e. with the help of a test set. Furthermore, by the above interpretation we can give a strong justification for the well-known heuristic of artificially sampling “labeled” samples, provided that the sampling plan is well chosen. In particular this enables us to propose a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency. Finally, we report some experiments which compare our SVM to other commonly used methods including the standard one-class SVM. Keywords: unsupervised learning, anomaly detection, density levels, classification, SVMs
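The heuristic of sampling artificial "labeled" background points can be sketched in a few lines: cover the domain with uniform reference points, then call a test point normal when its local fraction of real data exceeds its local fraction of background, i.e. when the estimated density exceeds the uniform reference level. The radius, sample sizes, and score rule below are illustrative choices, not the paper's SVM formulation.

```python
# Real data concentrated on [-0.5, 0.5); artificial uniform background on [-5, 5].
real = [i / 100.0 - 0.5 for i in range(100)]
background = [i / 10.0 - 5.0 for i in range(101)]


def is_normal(x, r=0.3):
    # Compare local fractions of real vs. background points within radius r:
    # "normal" means the local density of real data beats the uniform level.
    near_real = sum(1 for p in real if abs(p - x) <= r) / len(real)
    near_bg = sum(1 for p in background if abs(p - x) <= r) / len(background)
    return near_real > near_bg


print(is_normal(0.1), is_normal(4.5))  # True, False
```

Replacing this counting rule with a classifier trained on the (real = 1, background = 0) labels gives the binary-classification reduction the abstract argues for.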
5 0.41772959 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels
Author: Roni Khardon, Rocco A. Servedio
Abstract: Recent work has introduced Boolean kernels with which one can learn linear threshold functions over a feature space containing all conjunctions of length up to k (for any 1 ≤ k ≤ n) over the original n Boolean features in the input space. This motivates the question of whether maximum margin algorithms such as Support Vector Machines can learn Disjunctive Normal Form expressions in the Probably Approximately Correct (PAC) learning model by using this kernel. We study this question, as well as a variant in which structural risk minimization (SRM) is performed where the class hierarchy is taken over the length of conjunctions. We show that maximum margin algorithms using the Boolean kernels do not PAC learn t(n)-term DNF for any t(n) = ω(1), even when used with such an SRM scheme. We also consider PAC learning under the uniform distribution and show that if the kernel uses conjunctions of length ω̃(√n) then the maximum margin hypothesis will fail on the uniform distribution as well. Our results concretely illustrate that margin based algorithms may overfit when learning simple target functions with natural kernels. Keywords: computational learning theory, kernel methods, PAC learning, Boolean functions
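A standard Boolean kernel of this kind can be computed in closed form. The monotone version below counts conjunctions of size 1..k that are satisfied by both inputs; such conjunctions use only coordinates where both vectors are 1, so the count is a sum of binomial coefficients. The paper's kernel may also allow negated literals or the empty conjunction, so treat this as one common variant rather than the exact object studied.

```python
from math import comb


def boolean_kernel(x, y, k):
    # Number of monotone conjunctions of size 1..k true on both x and y:
    # every such conjunction uses only coordinates where x and y are both 1,
    # so the count is sum_{i=1}^{k} C(s, i) with s = |x AND y|.
    s = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    return sum(comb(s, i) for i in range(1, k + 1))


print(boolean_kernel([1, 1, 1, 0], [1, 1, 1, 1], 2))  # C(3,1) + C(3,2) = 6
```

The implicit feature space here has one coordinate per conjunction, which is why the kernel trick is needed: the dimension grows combinatorially in k even though the kernel value itself is cheap to evaluate.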
6 0.41737357 67 jmlr-2005-Stability of Randomized Learning Algorithms
7 0.41234881 20 jmlr-2005-Clustering with Bregman Divergences
8 0.41229758 44 jmlr-2005-Learning Module Networks
9 0.41215697 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.41201749 71 jmlr-2005-Variational Message Passing
11 0.41019887 48 jmlr-2005-Learning the Kernel Function via Regularization
12 0.40863124 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
13 0.40856805 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
14 0.40653259 54 jmlr-2005-Managing Diversity in Regression Ensembles
15 0.40474606 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve
16 0.40277794 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
17 0.40224001 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
18 0.40200588 36 jmlr-2005-Gaussian Processes for Ordinal Regression
19 0.40064082 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
20 0.40036139 60 jmlr-2005-On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning