jmlr jmlr2006 jmlr2006-16 knowledge-graph by maker-knowledge-mining

16 jmlr-2006-Bounds for Linear Multi-Task Learning


Source: pdf

Author: Andreas Maurer

Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 COM, Adalbertstrasse 55, D-80799 München, Germany. Editor: Nello Cristianini. Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. [sent-2, score-0.098]

2 For a hypothesis f ∈ F let er(f) be the expected error and êr(f) the empirical error on a training sample S of size n (drawn iid from the underlying task distribution), respectively. [sent-12, score-0.057]

3 Anthony and Bartlett (1999), that with probability greater than 1 − δ we have for every f ∈ F the error bound er(f) ≤ êr(f) + √( (ln|F| + ln(1/δ)) / (2n) ). (1) [sent-15, score-0.185]
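To make the role of each quantity in (1) concrete, here is a minimal sketch (Python with NumPy, not code from the paper) that evaluates the reconstructed bound; the class size |F|, sample size n and confidence δ are illustrative placeholders.

```python
import numpy as np

def single_task_bound(emp_error, card_F, n, delta):
    """Hoeffding + union bound (1): er(f) <= êr(f) + sqrt((ln|F| + ln(1/δ)) / (2n))."""
    return emp_error + np.sqrt((np.log(card_F) + np.log(1.0 / delta)) / (2.0 * n))

# illustrative numbers only
print(single_task_bound(emp_error=0.10, card_F=1e6, n=5000, delta=0.05))
```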

4 Consider a set H of classifiers h : Y → {0, 1}; for a cleverly chosen preprocessor g ∈ G it will likely be the case that we find some h ∈ H such that h ◦ g has the same or even a smaller empirical error than the best f ∈ F . [sent-17, score-0.06]

5 M AURER The situation is improved if we have a set of m different learning tasks with corresponding task distributions and samples S1 , . [sent-20, score-0.049]

6 hm ◦ g for each of the m tasks where the preprocessing map g ∈ G is constrained to be the same for all tasks and only the hl ∈ H specialize to each task l at hand. [sent-27, score-0.129]

7 Again Hoeffding's inequality and a union bound imply that with probability greater than 1 − δ we have for all (h1 , . [sent-28, score-0.049]

8 , hm) ∈ H^m and every g ∈ G: (1/m) ∑_{l=1}^m er^l(hl ◦ g) ≤ (1/m) ∑_{l=1}^m êr^l(hl ◦ g) + √( (ln|H| + (1/m)(ln|G| + ln(1/δ))) / (2n) ). (2) [sent-31, score-0.228]
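A small hedged sketch of how the estimation terms of (1) and (2) compare as the number of tasks m grows; the reconstructed form of (2) is assumed, and all class sizes and sample sizes below are invented for illustration.

```python
import numpy as np

def single_task_penalty(card_F, n, delta):
    # estimation term of (1)
    return np.sqrt((np.log(card_F) + np.log(1 / delta)) / (2 * n))

def multi_task_penalty(card_H, card_G, n, m, delta):
    # estimation term of (2): the ln|G| and ln(1/δ) contributions are shared by the m tasks
    return np.sqrt((np.log(card_H) + (np.log(card_G) + np.log(1 / delta)) / m) / (2 * n))

print("single task:", single_task_penalty(card_F=1e10, n=1000, delta=0.05))
for m in (1, 10, 100):
    print("m =", m, multi_task_penalty(card_H=100, card_G=1e8, n=1000, m=m, delta=0.05))
```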

9 The remaining term, which bounds the estimation error, now exhibits the advantage of multi-task learning: Sharing the preprocessor implies sharing its cost of estimation, and the entropy contribution arising from the selection of g ∈ G decreases with the number of learning tasks. [sent-36, score-0.071]

10 Since by assumption |H| ≪ |F|, the estimation error in the multi-task bound (2) can become much smaller than in the single-task case (1) if the number m of tasks becomes large. [sent-37, score-0.062]

11 The choice of the preprocessor g ∈ G can also be viewed as the selection of the hypothesis space H ◦ g. [sent-38, score-0.058]

12 This leads to an alternative formulation of multi-task learning, where the common object is a hypothesis space chosen from a class of hypothesis spaces (in this case {H ◦ g : g ∈ G}), and the classifiers for the individual tasks are all chosen from the selected hypothesis space. [sent-39, score-0.071]

13 Here we prefer the functional formulation of selecting a preprocessor instead of a hypothesis space, because it is more intuitive and sufficient in the situations which we consider. [sent-40, score-0.058]

14 The learner now searches for a multi-classifier hv ◦ T = (h_{v1} ◦ T, . [sent-43, score-0.062]

15 , h_{vm} ◦ T) where the preprocessing operator T ∈ A is the same for all tasks and only the vectors vl specialize to each task l at hand. [sent-46, score-0.199]

16 The desired multi-classifier hv ◦ T should have a small value of the average error er(hv ◦ T) = (1/m) ∑_{l=1}^m er^l(h_{vl} ◦ T) = (1/m) ∑_{l=1}^m Pr{ sign⟨T X^l, vl⟩ ≠ Y^l }, where X^l and Y^l are the random variables modeling input-values and labels for the l-th task. [sent-47, score-0.213]

17 , vm ∈ H with ‖vl‖ ≤ 1 and all bounded symmetric operators T on H with ‖T‖_HS ≥ 1, and for all γ ∈ (0, 1) that er(hv ◦ T) ≤ êr_γ(v ◦ T) + (8 ‖T‖_HS / (γ √n)) √( ‖C‖_HS + 1/m ) + √( ln( 4‖T‖_HS / (δγ) ) / (2nm) ). [sent-56, score-0.322]

18 Here êr_γ(v ◦ T) is a margin-based empirical error estimate, bounded by the relative number of examples (X_i^l, Y_i^l) in the total training sample for all tasks l where Y_i^l ⟨T X_i^l, vl⟩ < γ (see section 4). [sent-57, score-0.14]
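The following sketch evaluates the bound as reconstructed above; ‖T‖_HS would come from the chosen preprocessor and ‖C‖_HS from the (generally unknown) total covariance operator, so all inputs here are illustrative placeholders rather than quantities from the paper.

```python
import numpy as np

def multi_task_margin_bound(emp_margin_error, T_hs, C_hs, n, m, gamma, delta):
    """Reconstructed bound: er <= êr_γ + (8‖T‖_HS/(γ√n)) sqrt(‖C‖_HS + 1/m)
    + sqrt(ln(4‖T‖_HS/(δγ)) / (2nm))."""
    estimation = 8 * T_hs / (gamma * np.sqrt(n)) * np.sqrt(C_hs + 1.0 / m)
    confidence = np.sqrt(np.log(4 * T_hs / (delta * gamma)) / (2 * n * m))
    return emp_margin_error + estimation + confidence

# toy numbers: a rank-4 projection (‖T‖_HS = 2) and a low-dimensional mixture (‖C‖_HS = 0.05)
print(multi_task_margin_bound(0.1, T_hs=2.0, C_hs=0.05, n=2000, m=50, gamma=0.2, delta=0.05))
```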

19 C is the total covariance operator corresponding to the mixture of all the task-input-distributions in H. [sent-59, score-0.091]

20 In this case learning amounts to the selection of a d-dimensional subspace M of H and of an m-tuple of vectors vl in M (components of vl orthogonal to M are irrelevant to the projected data). [sent-64, score-0.209]

21 All operators T ∈ Pd satisfy ‖T‖_HS = √d, which can then be substituted in the above bound. [sent-65, score-0.048]
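A quick numerical sanity check (a NumPy sketch, not from the paper) that a rank-d orthogonal projection indeed has Hilbert-Schmidt (Frobenius) norm √d, the value substituted into the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, d = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((dim, d)))  # orthonormal basis of a random d-dim subspace
P = Q @ Q.T                                         # orthogonal projection onto that subspace
print(np.linalg.norm(P, "fro"), np.sqrt(d))         # both equal sqrt(d) up to rounding
```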

22 Almost to the contrary: Suppose that the input data are distributed uniformly on M ∩ S1 where M is a k-dimensional subspace in H and S1 is the sphere consisting of vectors with unit norm in H. [sent-69, score-0.07]

23 If we compare the second term on the right hand side to the estimation error bound in (2), we can recognize a certain similarity: Loosely speaking we can identify ‖T‖²_HS/m with the cost of estimating the operator T, and ‖T‖²_HS ‖C‖_HS with the cost of finding the linear classifiers v1 , . [sent-73, score-0.072]

24 The main inequality of the theorem then becomes er(hv ◦ T) ≤ êr_γ(v ◦ T) + (2 ‖T²‖_HS^{1/2} / ((1 − ε) γ √n)) ( ‖C²‖_HS + 3/m )^{1/4} + √( ln( ‖T²‖_HS^{1/2} / (δγε²) ) / (2nm) ). [sent-79, score-0.13]

25 If T is an orthogonal projection with d-dimensional range then ‖T²‖_HS^{1/2} = d^{1/4}, so for a large number of tasks m the bound on the estimation error becomes approximately 2 d^{1/4} ‖C‖_HS^{1/2} / (γ √n). [sent-81, score-0.06]

26 The results stated above give some insights, but they have the practical disadvantage of being unobservable, because they depend on the properties of the covariance operator C, which in turn depends on an unknown data distribution. [sent-87, score-0.091]

27 One way to solve this problem is using the fact that the finite-sample approximations to the covariance operator have good concentration properties (see Theorem 8 below). [sent-88, score-0.091]

28 , vm ∈ H with ‖vl‖ ≤ 1 and all bounded symmetric operators T on H with ‖T‖_HS ≥ 1, and for all γ ∈ (0, 1) that er(hv ◦ T) ≤ êr_γ(v ◦ T) + (8 ‖T‖_HS / (γ √n)) √( (1/(mn)) ‖Ĉ(X)‖_Fr + 1/m ) + √( 9 ln( 8‖T‖_HS / (δγ) ) / (2nm) ), where Ĉ(X) is the total Gramian matrix of the data X. [sent-92, score-0.401]
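Since every quantity in this version is observable, it can be evaluated directly from the training inputs. The sketch below does this for synthetic data; the interpretation of Ĉ(X) as the Gramian of all mn inputs is my reading of the reconstructed statement, and the data and parameters are placeholders.

```python
import numpy as np

def data_dependent_bound(emp_margin_error, X, T_hs, gamma, delta):
    """X: array of shape (m, n, dim) with the inputs of all tasks.
    Evaluates the reconstructed data-dependent bound using the Frobenius
    norm of the total Gramian of the mn inputs."""
    m, n, _ = X.shape
    Z = X.reshape(m * n, -1)
    gram = Z @ Z.T                                    # total Gramian, (mn) x (mn)
    gram_term = np.linalg.norm(gram, "fro") / (m * n)
    estimation = 8 * T_hs / (gamma * np.sqrt(n)) * np.sqrt(gram_term + 1.0 / m)
    confidence = np.sqrt(9 * np.log(8 * T_hs / (delta * gamma)) / (2 * n * m))
    return emp_margin_error + estimation + confidence

rng = np.random.default_rng(1)
m, n, dim = 10, 50, 30
X = rng.standard_normal((m, n, dim))
X /= np.linalg.norm(X, axis=-1, keepdims=True)        # unit-norm inputs
print(data_dependent_bound(0.1, X, T_hs=2.0, gamma=0.2, delta=0.05))
```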

29 In section 2 we introduce the necessary terminology and results on Hilbert-Schmidt operators and in section 3 the covariance operator of random elements in a Hilbert space. [sent-98, score-0.139]

30 In section 6 we give bounds for non-interacting systems, which are essentially equivalent to single-task learning, and derive bounds for proper, interacting multi-task learning, including the above mentioned results. [sent-102, score-0.086]

31 With HS we denote the real vector space of operators T on H satisfying ∑_{i=1}^∞ ‖T e_i‖² < ∞ for every orthonormal basis (e_i)_{i=1}^∞ of H. [sent-110, score-0.069]

32 Then x = ∑_k ⟨x, e_k⟩ e_k, so by boundedness of T, ⟨T x, y⟩ = ⟨T ∑_k ⟨x, e_k⟩ e_k, y⟩ = ∑_k ⟨x, e_k⟩ ⟨T e_k, y⟩ = ∑_k ⟨T e_k, G_{x,y} e_k⟩ = ⟨T, G_{x,y}⟩_HS, which is (viii). [sent-141, score-0.486]

33 Similarly ⟨G_{x,y}, G_{x′,y′}⟩_HS = ∑_k ⟨G_{x,y} e_k, G_{x′,y′} e_k⟩ = ∑_k ⟨e_k, x⟩ ⟨e_k, x′⟩ ⟨y, y′⟩ = ⟨x, x′⟩ ⟨y, y′⟩, which gives (vii) and (vi). [sent-142, score-0.324]

34 Then ∑_{l=1}^m ⟨T w_l, v_l⟩ ≤ B ‖T‖_HS ( ∑_{l,r} |⟨w_l, w_r⟩| )^{1/2} and ∑_{l=1}^m ⟨T w_l, v_l⟩ ≤ B m^{1/2} ‖T*T‖_HS^{1/2} ( ∑_{l,r} ⟨w_l, w_r⟩² )^{1/4}. Proof: Without loss of generality assume B = 1. [sent-152, score-0.458]

35 Using Lemma 4 (viii), Schwarz' inequality in HS and Lemma 4 (vii) we have ∑_{l=1}^m ⟨T w_l, v_l⟩ = ⟨T, ∑_{l=1}^m G_{w_l,v_l}⟩_HS ≤ ‖T‖_HS ‖ ∑_{l=1}^m G_{w_l,v_l} ‖_HS = ‖T‖_HS ( ∑_{l,r} ⟨w_l, w_r⟩ ⟨v_l, v_r⟩ )^{1/2} ≤ ‖T‖_HS ( ∑_{l,r} |⟨w_l, w_r⟩| )^{1/2}. [sent-153, score-0.474]

36 An operator T is called trace-class if ∑_{i=1}^∞ ⟨T e_i, e_i⟩ is an absolutely convergent series for every orthonormal basis (e_i)_{i=1}^∞ of H. [sent-157, score-0.099]

37 If A ⊂ HS∗ is a set of symmetric and bounded operators in H we use the notation ‖A‖_HS = sup{ ‖T‖_HS : T ∈ A } and A² = { T² : T ∈ A }. [sent-159, score-0.086]

38 Passing to the space HS of Hilbert-Schmidt operators the above construction can be carried out again: By Lemma 4 (i), E[‖Q_X‖_HS] = E[‖X‖²] < ∞, so there is a unique operator E[Q_X] ∈ HS such that E[⟨Q_X, T⟩_HS] = ⟨E[Q_X], T⟩_HS for all T ∈ HS. [sent-166, score-0.107]

39 Definition 6 The operator E[Q_X] is called the covariance operator of X. [sent-167, score-0.15]

40 Lemma 7 The covariance operator E [QX ] ∈ HS+ has the following properties. [sent-168, score-0.091]

41 We have, with orthonormal basis (e_k)_{k=1}^∞ and using (ii), tr(E[Q_X]) = ∑_k ⟨E[Q_X] e_k, e_k⟩ = ∑_k E[⟨e_k, X⟩ ⟨e_k, X⟩] = E[‖X‖²], which gives (iii). [sent-176, score-0.363]

42 We have given this derivation to illustrate the tendency of the Hilbert-Schmidt norm of the covariance operator of a distribution concentrated on unit vectors to decrease with the effective dimensionality of the distribution. [sent-182, score-0.129]
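Both facts, the trace identity tr(E[Q_X]) = E‖X‖² and the shrinking Hilbert-Schmidt norm for distributions spread over many dimensions, are easy to check on a finite-dimensional empirical estimate; a small sketch with synthetic unit vectors (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
N, dim = 20000, 20
X = rng.standard_normal((N, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points uniform on the unit sphere

C_hat = X.T @ X / N                             # empirical covariance operator (1/N) Σ x x^T
print(np.trace(C_hat), np.mean(np.sum(X**2, axis=1)))   # both ≈ 1 (trace identity)
print(np.linalg.norm(C_hat, "fro"), 1 / np.sqrt(dim))   # HS norm ≈ 1/sqrt(dim)
```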

43 The fact that HS is a separable Hilbert space, just as H, allows us to define an operator E[T] whenever T is a random variable with values in HS and E[‖T‖_HS] < ∞. [sent-184, score-0.073]

44 Then for all δ > 0 with probability greater than 1 − δ we have ‖ (1/m) ∑_{i=1}^m E[T_i] − (1/m) ∑_{i=1}^m T_i ‖_HS ≤ (2/√m) ( 1 + √( ln(1/δ) / 2 ) ). [sent-191, score-0.081]

45 The theorem then shows that the covariance operator E [QX ] can be approximated in HS-norm with high probability by the empirical estimates (1/m) ∑i QXi . [sent-193, score-0.128]

46 We will use the notations x = (x_i^l)_{(i,l)=(1,1)}^{(n,m)} for generic members of (X^n)^m and z = (z_i^l)_{(i,l)=(1,1)}^{(n,m)} = (x, y) = (x_i^l, y_i^l)_{(i,l)=(1,1)}^{(n,m)} for generic members of ((X × {−1, 1})^n)^m. [sent-212, score-0.636]

47 , hm and interpret hl (x) as the label assigned to the vector x when the task is known to be l. [sent-217, score-0.059]

48 The average error of a multiclassifier h is the quantity er(h) = (1/m) ∑_{l=1}^m Pr{ h_l(X^l) ≠ Y^l }, which is just the misclassification probability averaged over all tasks. [sent-218, score-0.06]

49 Here we consider zero-threshold classifiers hf which arise as follows: Suppose that F is a class of vector valued functions f : X → Rm with f = f 1 , . [sent-220, score-0.1]

50 A function f ∈ F defines a multi-classifier hf = h1 , . [sent-224, score-0.1]

51 Theorem 9 Let ε, δ ∈ (0, 1). (i) With probability greater than 1 − δ it holds for all f ∈ F and all γ ∈ (0, 1) that er(h_f) ≤ êr_γ(f) + (1/(γ(1 − ε))) E[R̂_nm(F)(X)] + √( ln(1/(δγε)) / (2nm) ). [sent-230, score-0.111]

52 (ii) With probability greater than 1 − δ it holds for all f ∈ F and all γ ∈ (0, 1) that er(h_f) ≤ êr_γ(f) + (1/(γ(1 − ε))) R̂_nm(F)(X) + √( 9 ln(2/(δγε)) / (2nm) ). [sent-231, score-0.111]

53 The empirical Rademacher complexity of a class F of functions f : X → R^m is the function R̂_nm(F) defined on (X^n)^m by R̂_nm(F)(x) = E_σ sup_{f∈F} (2/(mn)) ∑_{l=1}^m ∑_{i=1}^n σ_i^l f^l(x_i^l). [sent-239, score-0.113]
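For the linear classes considered below the inner supremum has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo over the Rademacher signs. This is an illustrative sketch with synthetic data, not code from the paper; the class FB of per-task linear functionals with ‖v_l‖ ≤ B is assumed.

```python
import numpy as np

def empirical_rademacher_FB(X, B, n_draws=200, seed=0):
    """Monte Carlo estimate of R̂_nm(F_B)(x) for the class f^l(x) = <v_l, x>, ‖v_l‖ <= B.
    For this class the sup over v_l equals B * ‖Σ_i σ_i^l x_i^l‖ for each sign draw.
    X has shape (m, n, dim)."""
    rng = np.random.default_rng(seed)
    m, n, _ = X.shape
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=(m, n))
        w = np.einsum("li,lid->ld", sigma, X)        # w_l = Σ_i σ_i^l x_i^l
        vals.append(2 * B / (m * n) * np.linalg.norm(w, axis=1).sum())
    return float(np.mean(vals))

rng = np.random.default_rng(3)
m, n, dim = 10, 50, 30
X = rng.standard_normal((m, n, dim))
X /= np.linalg.norm(X, axis=-1, keepdims=True)
print(empirical_rademacher_FB(X, B=1.0))            # should lie below 2B / sqrt(n)
```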

54 The last one expresses the dependence of the estimation error on the confidence parameter δ and a model-selection penalty ln(1/(γε)) for the choice of the margin γ. [sent-242, score-0.072]

55 The 1/√(nm) decay however implies that even for moderate values of m and n the parameter ε in Theorem 9 can be chosen very small, so that the factor 1/(1 − ε) in the second term on the right of the two bounds is very close to unity. [sent-245, score-0.083]

56 We write Cl for the covariance operator E [QX l ] and C for the total covariance operator C = (1/m) ∑l Cl , corresponding to a uniform mixture of distributions. [sent-258, score-0.182]

57 Let B > 0, let T be a fixed symmetric, bounded linear operator on H with ‖T‖_∞ ≤ 1, and let A be a set of symmetric, bounded linear operators T on H, all satisfying ‖T‖_∞ ≤ 1. [sent-260, score-0.107]

58 The algorithms which choose from FB and FB ◦ T are essentially trivial extensions of linear single-task learning, where the tasks do not interact in the selection of the individual classifiers vi , which are chosen independently. [sent-277, score-0.051]

59 In the case of FB ◦ T the preprocessing operator T is chosen before seeing the training data. [sent-278, score-0.059]

60 Here the preprocessing operator T is selected from A in response to the data. [sent-281, score-0.059]

61 The constraint that T be the same for all tasks forces an interaction of tasks in the choice of T and (v1 , . [sent-282, score-0.07]

62 , vm ), deliberately aiming for a low empirical error. [sent-285, score-0.088]

63 Lemma 11 We have R̂_nm(FB)(x) ≤ (2B/(nm)) ∑_{l=1}^m ( ∑_{i=1}^n ‖x_i^l‖² )^{1/2} and E[R̂_nm(FB)(X)] ≤ (2B/(m√n)) ∑_{l=1}^m ( E‖X^l‖² )^{1/2} = (2B/(m√n)) ∑_{l=1}^m ( tr C_l )^{1/2}. Proof: Using Schwarz' and Jensen's inequality and the independence of the σ_i^l we get [sent-290, score-0.41]

64 R̂_nm(FB)(x) = E_σ sup_{v_1,…,v_m : ‖v_l‖ ≤ B} (2/(nm)) ∑_{l=1}^m ⟨ ∑_{i=1}^n σ_i^l x_i^l , v_l ⟩ ≤ (2B/(nm)) ∑_{l=1}^m E_σ ‖ ∑_{i=1}^n σ_i^l x_i^l ‖ ≤ (2B/(nm)) ∑_{l=1}^m ( E_σ ‖ ∑_{i=1}^n σ_i^l x_i^l ‖² )^{1/2} = (2B/(nm)) ∑_{l=1}^m ( ∑_{i=1}^n ‖x_i^l‖² )^{1/2}. [sent-293, score-1.388]

65 Jensen's inequality then gives the second conclusion. The first bound in the lemma is just the average of the bounds given by Bartlett and Mendelson on the empirical complexities for the various task-components of the sample. [sent-294, score-0.095]

66 For inputs constrained to the unit sphere in H, when ‖X^l‖ = 1, both bounds become 2B/√n, which sets the mark for comparison with the interacting case FB ◦ A . [sent-295, score-0.093]
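A hedged sketch of the first bound of Lemma 11, confirming that it collapses to 2B/√n for unit-norm inputs (synthetic data, NumPy; not code from the paper):

```python
import numpy as np

def lemma11_bound(X, B):
    """(2B/(nm)) Σ_l ( Σ_i ‖x_i^l‖² )^{1/2} for X of shape (m, n, dim)."""
    m, n, _ = X.shape
    per_task = np.sqrt((X ** 2).sum(axis=(1, 2)))    # sqrt(Σ_i ‖x_i^l‖²) per task
    return 2 * B / (n * m) * per_task.sum()

rng = np.random.default_rng(4)
m, n, dim = 10, 50, 30
X = rng.standard_normal((m, n, dim))
X /= np.linalg.norm(X, axis=-1, keepdims=True)       # unit-sphere inputs
print(lemma11_bound(X, B=1.0), 2 * 1.0 / np.sqrt(n)) # both equal 2B / sqrt(n)
```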

67 For motivation we next look at the case FB ◦ T , working with a fixed linear preprocessor T of operator norm bounded by 1. [sent-296, score-0.128]

68 Using the above bound we obtain R̂_nm(FB ◦ T)(x) = R̂_nm(FB)(T x) ≤ (2B/(nm)) ∑_{l=1}^m ( ∑_{i=1}^n ‖T x_i^l‖² )^{1/2}, (4) which is always bounded by 2B/√n, because ‖T x‖ ≤ ‖x‖ for all x. [sent-297, score-0.087]

69 (8) Proof Fix x and define vectors w_l = w_l(σ) = ∑_{i=1}^n σ_i^l x_i^l depending on the Rademacher variables σ_i^l. [sent-310, score-0.484]

70 Then by Lemma 5 and Jensen's inequality R̂(FB ◦ A)(x) = E_σ sup_{T∈A} sup_{v_1, . [sent-311, score-0.056]

71 …, v_m : ‖v_l‖ ≤ B} (2/(nm)) ∑_{l=1}^m ⟨T w_l, v_l⟩ ≤ (2B ‖A‖_HS/(nm)) E_σ ( ∑_{l,r} |⟨w_l, w_r⟩| )^{1/2} ≤ (2B ‖A‖_HS/(nm)) ( ∑_{l,r} E_σ |⟨w_l, w_r⟩| )^{1/2}. (9) Now we have E_σ ‖w_l‖² = ∑_{i=1}^n ∑_{j=1}^n E_σ[σ_i^l σ_j^l] ⟨x_i^l, x_j^l⟩ [sent-314, score-0.64]

72 = ∑_{i=1}^n ‖x_i^l‖². (10) Also, for l ≠ r, we get, using Jensen's inequality and independence of the Rademacher variables, (E_σ[|⟨w_l, w_r⟩|])² ≤ E_σ[⟨w_l, w_r⟩²] = ∑_{i,j,i′,j′=1}^n E_σ[σ_i^l σ_j^r σ_{i′}^l σ_{j′}^r] ⟨x_i^l, x_j^r⟩ ⟨x_{i′}^l, x_{j′}^r⟩ = ∑_{i,j=1}^n ⟨x_i^l, x_j^r⟩². (11) [sent-316, score-1.366]

73 To prove (6) first use the second part of Lemma 5 and Jensen's inequality to get R̂(FB ◦ A)(x) ≤ (2B ‖A²‖_HS^{1/2} / (n√m)) ( ∑_{l,r} E_σ ⟨w_l, w_r⟩² )^{1/4}. (13) [sent-319, score-0.154]

74 (13) l,r Now we have Eσ σl σlj σl σlj ≤ δi j δi j + δii δ j j + δi j δ ji so i i Eσ wl , wl n 2 ≤ ∑ xil 2 2 xlj 2 + 2 xil , xlj i, j=1 n ∑ ≤ 2 2 2 xil n ∑ + i=1 2 xil , xlj ≤ 2n2 + i, j=1 n ∑ xil , xlj 2 , i, j=1 where we used xil ≤ 1. [sent-320, score-2.206]

75 Inserting this together with (11) in (13) gives R̂_nm(FB ◦ A)(x) ≤ (2B ‖A²‖_HS^{1/2} / (n√m)) ( ∑_{l,r: l≠r} E_σ ⟨w_l, w_r⟩² + ∑_{l=1}^m E_σ ⟨w_l, w_l⟩² )^{1/4} ≤ (2B ‖A²‖_HS^{1/2} / (n√m)) ( ∑_{l,r=1}^m ∑_{i,j=1}^n ⟨x_i^l, x_j^r⟩² + 2mn² )^{1/4}, (14) which is (6). [sent-321, score-0.662]

76 In a similar way we obtain from (14) that E[R̂_nm(FB ◦ A)(X)] ≤ (2B ‖A²‖_HS^{1/2} / (n√m)) ( m² n² E‖ (1/(mn)) ∑_{l=1}^m ∑_{i=1}^n Q_{X_i^l} ‖²_HS + 2mn + 3mn² )^{1/4}, which gives (8). [sent-323, score-0.173]

77 Inserting the bounds of Theorem 12 in Theorem 9 immediately gives Theorem 13: Let A be a set of bounded, symmetric operators in H and ε, δ ∈ (0, 1). (i) With probability greater than 1 − δ it holds for all f = (v1 , . [sent-324, score-0.111]

78 , vm) ◦ T ∈ FB ◦ A and all γ ∈ (0, 1) that er(h_f) ≤ êr_γ(f) + (1/(γ(1 − ε))) A + √( ln(1/(δγε)) / (2nm) ), where A is either A = (2B ‖A‖_HS / √n) √( ‖C‖_HS + 1/m ) (15) or A = (2B ‖A²‖_HS^{1/2} / √n) ( ‖C²‖_HS + 3/m )^{1/4}. [sent-327, score-0.165]

79 , vm) ◦ T ∈ FB ◦ A and for all γ ∈ (0, 1) that er(h_f) ≤ êr_γ(f) + (1/(γ(1 − ε))) A(X) + √( 9 ln(2/(δγε)) / (2nm) ), where the random variable A(X) is either A(X) = (2B ‖A‖_HS / √n) √( (1/(mn)) ‖Ĉ(x)‖_Fr + 1/m ) or A(X) = (2B ‖A²‖_HS^{1/2} / √n) ( ( (1/(mn)) ‖Ĉ(x)‖_Fr )² + 2/m )^{1/4}. [sent-331, score-0.323]

80 We finally extend this result from uniformly bounded sets A of operators to the set HS∗ of all symmetric Hilbert-Schmidt operators. [sent-332, score-0.066]

81 For α ∈ (0, 1] set A(α) = { T ∈ HS∗ : ‖T‖_HS ≤ 1/α } and consider the events F(α1, α2, δ) = { ∃ f ∈ FB ◦ A(α2) such that er(h_f) > êr_γ(f) + (2B / (α1 γ (1 − ε) √n)) √( ‖C‖_HS + 1/m ) + √( ln(1/(δγε)) / (2nm) ) }. [sent-339, score-0.091]

82 , vm) ∈ FB and T ∈ HS∗ with ‖T‖_HS ≥ 1 and all γ ∈ (0, 1) that er(h_f) ≤ êr_γ(f) + (2B ‖T‖_HS / (γ (1 − ε) √n)) √( ‖C‖_HS + 1/m ) + √( ln( ‖T‖_HS / (δγε²) ) / (2nm) ). [sent-347, score-0.165]

83 We will look at the behavior of both of our bounds for noninteracting (single-task) and interacting (multi-task) learners as we decrease the signal amplitude s. [sent-369, score-0.076]

84 The bounds which we use are implied by Lemma 11 and Theorem 9 for the non-interacting case and Theorem 13 for the interacting case, and state that each of the following two statements holds with probability at least 1 − δ: 1. [sent-370, score-0.061]

85 Here C_s = (1/m) ∑_l E[ Q_{s X^l + (1−s) X^N} ] is the total covariance operator for the noisy mixture problem. [sent-385, score-0.091]

86 By independence of X^N and X^l and the mean-zero assumption for X^N we obtain from Lemmas 4 and 7 that C_s = (1/m) ∑_l ( s² E[Q_{X^l}] + (1 − s)² E[Q_{X^N}] ) = s² C + (1 − s)² E[Q_{X^N}], where C would be the total covariance operator for the original problem. [sent-386, score-0.091]

87 To bound the HS-norm of this operator we now introduce a simplifying assumption of homogeneity for the noise distribution: • X N is distributed uniformly on a k-dimensional unit-sphere centered at the origin. [sent-387, score-0.072]

88 The quotient of (19) to the non-interacting (17) is ‖A‖_HS √( s² ‖C‖_HS + 1/m + 1/√k ), and the interacting bound will be better than the non-interacting bound whenever this expression is less than unity. [sent-391, score-0.062]

89 This is more likely to happen when the signal amplitude s is small, and the dimension k of the noise distribution and the number of tasks m are large. [sent-392, score-0.05]
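These effects can be seen directly by evaluating the reconstructed quotient ‖A‖_HS √(s²‖C‖_HS + 1/m + 1/√k) for a few signal amplitudes; all numbers below are illustrative, not taken from the paper.

```python
import numpy as np

def quotient(A_hs, C_hs, s, m, k):
    """Ratio of the interacting to the non-interacting estimation-error bound
    (reconstructed form): ‖A‖_HS * sqrt(s²‖C‖_HS + 1/m + 1/sqrt(k))."""
    return A_hs * np.sqrt(s**2 * C_hs + 1.0 / m + 1.0 / np.sqrt(k))

# A = d-dimensional projections: ‖A‖_HS = sqrt(d)
d, C_hs, m, k = 4, 1.0, 100, 400
for s in (1.0, 0.3, 0.1):
    print(s, quotient(np.sqrt(d), C_hs, s, m, k))    # values below 1 favour the interacting bound
```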

90 We suspect that the relevant information for all m tasks lies in a low-dimensional subspace (d is small). If one believes these criteria to be met, then one can use an algorithm such as the one developed in Ando and Zhang (2005) to minimize the interacting bound above, with A = Pd. [sent-405, score-0.084]
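The sketch below is not the Ando and Zhang (2005) algorithm; it is only a minimal illustration of the kind of procedure the text alludes to: fit per-task linear classifiers (here by ridge regression, an assumption) and take the shared rank-d projection onto the leading singular subspace of the stacked task weight vectors, which yields a T ∈ Pd with ‖T‖_HS = √d for the bound above.

```python
import numpy as np

def shared_projection_sketch(X, Y, d, lam=1e-2):
    """Illustrative sketch only (not the Ando & Zhang, 2005 algorithm):
    fit independent ridge-regression weights per task, then take the shared
    rank-d projection onto the leading singular subspace of the stacked weights.
    X: (m, n, dim) inputs, Y: (m, n) labels in {-1, +1}."""
    m, n, dim = X.shape
    W = np.zeros((m, dim))
    for l in range(m):
        A = X[l].T @ X[l] + lam * np.eye(dim)
        W[l] = np.linalg.solve(A, X[l].T @ Y[l])      # per-task ridge fit
    U, _, _ = np.linalg.svd(W.T, full_matrices=False)
    U = U[:, :d]
    return U @ U.T                                    # projection T with ‖T‖_HS = sqrt(d)

rng = np.random.default_rng(5)
m, n, dim, d = 8, 60, 20, 3
X = rng.standard_normal((m, n, dim))
Y = np.sign(rng.standard_normal((m, n)))
T = shared_projection_sketch(X, Y, d)
print(np.linalg.norm(T, "fro"), np.sqrt(d))           # ‖T‖_HS = sqrt(d) enters the bound
```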

91 Theorem 16 Let F be a [0, 1]^m-valued function class on a space X, and X = (X_i^l). Then with probability greater than 1 − δ we have for all f^1, …, f^m ∈ F that (1/m) ∑_{l=1}^m E[f^l(X_1^l)] ≤ (1/(mn)) ∑_{l=1}^m ∑_{i=1}^n f^l(X_i^l) + R_nm(F) + √( ln(1/δ) / (2mn) ). [sent-413, score-0.14]

92 We also have, with probability greater than 1 − δ, for all f^1, …, f^m ∈ F, that (1/m) ∑_{l=1}^m E[f^l(X_1^l)] ≤ (1/(mn)) ∑_{l=1}^m ∑_{i=1}^n f^l(X_i^l) + R̂_nm(F)(X) + √( 9 ln(2/δ) / (2mn) ). [sent-417, score-0.14]

93 Proof Let Ψ be the function on X^{mn} given by Ψ(x) = sup_{f∈F} (1/m) ∑_{l=1}^m ( E[f^l(X_1^l)] − (1/n) ∑_{i=1}^n f^l(x_i^l) ), and let X′ be an iid copy of the X^{mn}-valued random variable X. [sent-418, score-0.195]

94 Hence E[Ψ(X)] ≤ E_X E_σ sup_{f∈F} (2/(mn)) ∑_{l=1}^m ∑_{i=1}^n σ_i^l f^l(X_i^l) = R_nm(F). [sent-420, score-0.099]

95 Now fix x ∈ X^{mn} and let x′ ∈ X^{mn} be as x, except for one modified coordinate x_i^l. [sent-421, score-0.476]

96 So by the one-sided version of the bounded difference inequality (see McDiarmid, 1998), Pr{ Ψ(X) > E[Ψ(X)] + √( ln(1/δ) / (2mn) ) } ≤ δ. [sent-423, score-0.077]

97 , f^m ∈ F we have er(h_f) = (1/m) ∑_l E[ 1_{(−∞,0]}( Y_1^l f^l(X_1^l) ) ] ≤ (1/m) ∑_l E[ φ_γ( Y_1^l f^l(X_1^l) ) ], and (1/(mn)) ∑_{l=1}^m ∑_{i=1}^n φ_γ( Y_i^l f^l(X_i^l) ) = êr_γ(f). (21) [sent-452, score-0.188]

98 Then with probability greater than 1 − δ we have for all f ∈ F that er(h_f) ≤ êr_γ(f) + γ⁻¹ R_nm(F) + √( ln(1/δ) / (2mn) ). [sent-455, score-0.081]

99 We also have with probability greater than 1 − δ for all f ∈ F that er(h_f) ≤ êr_γ(f) + γ⁻¹ R̂_nm(F)(X) + √( 9 ln(2/δ) / (2mn) ). [sent-456, score-0.141]

100 er(h_f) > êr_{α2}(f) + (1/α1) R_nm(F) + √( ln(1/δ) / (2mn) ).


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hs', 0.74), ('qx', 0.371), ('xil', 0.318), ('rnm', 0.239), ('fb', 0.165), ('hf', 0.1), ('vl', 0.091), ('wl', 0.083), ('ek', 0.081), ('mn', 0.079), ('vm', 0.074), ('aurer', 0.073), ('ulti', 0.073), ('hv', 0.062), ('ln', 0.061), ('ando', 0.06), ('operator', 0.059), ('nm', 0.058), ('rademacher', 0.057), ('wr', 0.055), ('operators', 0.048), ('preprocessor', 0.046), ('inear', 0.042), ('tei', 0.04), ('xrj', 0.04), ('fr', 0.04), ('pd', 0.037), ('interacting', 0.036), ('tasks', 0.035), ('twl', 0.034), ('xlj', 0.033), ('covariance', 0.032), ('hl', 0.03), ('er', 0.03), ('lemma', 0.027), ('multiclassi', 0.027), ('preprocessors', 0.027), ('yil', 0.027), ('bartlett', 0.026), ('bounds', 0.025), ('jensen', 0.024), ('theorem', 0.023), ('norm', 0.023), ('cl', 0.022), ('earning', 0.022), ('anthony', 0.021), ('orthonormal', 0.021), ('schwarz', 0.02), ('viii', 0.02), ('exx', 0.02), ('gramian', 0.02), ('qy', 0.02), ('zil', 0.02), ('sup', 0.02), ('greater', 0.02), ('ei', 0.019), ('tr', 0.018), ('hilbert', 0.018), ('symmetric', 0.018), ('iv', 0.017), ('iid', 0.017), ('mendelson', 0.017), ('sphere', 0.017), ('ii', 0.017), ('vi', 0.016), ('xl', 0.016), ('inequality', 0.016), ('amplitude', 0.015), ('hm', 0.015), ('unit', 0.015), ('subspace', 0.015), ('jr', 0.015), ('lj', 0.015), ('rm', 0.015), ('empirical', 0.014), ('task', 0.014), ('pr', 0.014), ('separable', 0.014), ('andreas', 0.013), ('qxi', 0.013), ('reed', 0.013), ('sei', 0.013), ('supt', 0.013), ('tek', 0.013), ('zhang', 0.013), ('bound', 0.013), ('vii', 0.013), ('orthogonal', 0.012), ('substitution', 0.012), ('orthogonality', 0.012), ('readers', 0.012), ('iii', 0.012), ('ex', 0.012), ('hypothesis', 0.012), ('margin', 0.011), ('cristianini', 0.011), ('baxter', 0.011), ('frobenius', 0.011), ('homogeneously', 0.011), ('gx', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 16 jmlr-2006-Bounds for Linear Multi-Task Learning

Author: Andreas Maurer

Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning

2 0.042179119 44 jmlr-2006-Large Scale Transductive SVMs

Author: Ronan Collobert, Fabian Sinz, Jason Weston, Léon Bottou

Abstract: We show how the concave-convex procedure can be applied to transductive SVMs, which traditionally require solving a combinatorial search problem. This provides for the first time a highly scalable algorithm in the nonlinear case. Detailed experiments verify the utility of our approach. Software is available at http://www.kyb.tuebingen.mpg.de/bs/people/fabee/transduction. html. Keywords: transduction, transductive SVMs, semi-supervised learning, CCCP

3 0.030989174 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

4 0.029068204 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: that is, the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working under the restriction of limited computation, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of convex variational methods. This stability result provides additional incentive, apart from the obvious benefit of unique global optima, for using message-passing methods based on convex variational relaxations. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. Keywords: graphical model, Markov random field, belief propagation, variational method, parameter estimation, prediction error, algorithmic stability

5 0.027851071 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

Author: Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

Abstract: We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework. Keywords: semi-supervised learning, graph transduction, regularization, kernel methods, manifold learning, spectral graph theory, unlabeled data, support vector machines

6 0.027157608 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

7 0.025136465 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

8 0.024523448 48 jmlr-2006-Learning Minimum Volume Sets

9 0.023527667 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

10 0.023454076 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

11 0.020458246 45 jmlr-2006-Learning Coordinate Covariances via Gradients

12 0.019885216 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

13 0.019640043 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

14 0.019175399 93 jmlr-2006-Universal Kernels

15 0.018500127 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

16 0.017361356 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

17 0.017143371 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

18 0.017137302 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

19 0.016316371 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

20 0.016050804 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.085), (1, -0.034), (2, -0.017), (3, -0.061), (4, 0.006), (5, 0.024), (6, -0.007), (7, 0.007), (8, -0.091), (9, -0.015), (10, 0.002), (11, 0.036), (12, -0.027), (13, -0.026), (14, -0.007), (15, -0.003), (16, 0.005), (17, 0.037), (18, -0.074), (19, 0.069), (20, -0.008), (21, -0.002), (22, 0.085), (23, -0.067), (24, -0.154), (25, -0.13), (26, -0.18), (27, 0.07), (28, -0.083), (29, 0.035), (30, -0.08), (31, 0.089), (32, 0.17), (33, -0.265), (34, 0.23), (35, 0.222), (36, 0.261), (37, 0.33), (38, -0.257), (39, 0.281), (40, -0.054), (41, 0.15), (42, -0.249), (43, 0.092), (44, 0.256), (45, 0.034), (46, -0.018), (47, -0.069), (48, 0.125), (49, -0.139)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97995466 16 jmlr-2006-Bounds for Linear Multi-Task Learning

Author: Andreas Maurer

Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning

2 0.16806002 48 jmlr-2006-Learning Minimum Volume Sets

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

3 0.16693228 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

4 0.15664609 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L 2 1 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n − 2 ) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2 . Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

5 0.13178827 14 jmlr-2006-An Efficient Implementation of an Active Set Method for SVMs    (Special Topic on Machine Learning and Optimization)

Author: Katya Scheinberg

Abstract: We propose an active set algorithm to solve the convex quadratic programming (QP) problem which is the core of the support vector machine (SVM) training. The underlying method is not new and is based on the extensive practice of the Simplex method and its variants for convex quadratic problems. However, its application to large-scale SVM problems is new. Until recently the traditional active set methods were considered impractical for large SVM problems. By adapting the methods to the special structure of SVM problems we were able to produce an efficient implementation. We conduct an extensive study of the behavior of our method and its variations on SVM problems. We present computational results comparing our method with Joachims’ SVM light (see Joachims, 1999). The results show that our method has overall better performance on many SVM problems. It seems to have a particularly strong advantage on more difficult problems. In addition this algorithm has better theoretical properties and it naturally extends to the incremental mode. Since the proposed method solves the standard SVM formulation, as does SVMlight , the generalization properties of these two approaches are identical and we do not discuss them in the paper. Keywords: active set methods, support vector machines, quadratic programming

6 0.12142745 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

7 0.1154447 44 jmlr-2006-Large Scale Transductive SVMs

8 0.11011954 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

9 0.097842216 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

10 0.097172119 54 jmlr-2006-Learning the Structure of Linear Latent Variable Models

11 0.097071663 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations

12 0.087841071 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

13 0.0856907 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections     (Special Topic on Machine Learning and Optimization)

14 0.080632806 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

15 0.080329865 93 jmlr-2006-Universal Kernels

16 0.080274589 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

17 0.076469183 53 jmlr-2006-Learning a Hidden Hypergraph

18 0.076400042 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

19 0.073683478 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

20 0.073501781 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.028), (36, 0.06), (45, 0.021), (50, 0.064), (63, 0.023), (76, 0.018), (78, 0.02), (81, 0.045), (84, 0.023), (85, 0.436), (90, 0.037), (91, 0.034), (96, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76855659 16 jmlr-2006-Bounds for Linear Multi-Task Learning

Author: Andreas Maurer

Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning

2 0.70882183 49 jmlr-2006-Learning Parts-Based Representations of Data

Author: David A. Ross, Richard S. Zemel

Abstract: Many perceptual models and theories hinge on treating objects as a collection of constituent parts. When applying these approaches to data, a fundamental problem arises: how can we determine what are the parts? We attack this problem using learning, proposing a form of generative latent factor model, in which each data dimension is allowed to select a different factor or part as its explanation. This approach permits a range of variations that posit different models for the appearance of a part. Here we provide the details for two such models: a discrete and a continuous one. Further, we show that this latent factor model can be extended hierarchically to account for correlations between the appearances of different parts. This permits modeling of data consisting of multiple categories, and learning these categories simultaneously with the parts when they are unobserved. Experiments demonstrate the ability to learn parts-based representations, and categories, of facial images and user-preference data. Keywords: parts, unsupervised learning, latent factor models, collaborative filtering, hierarchical learning

3 0.27082723 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds

4 0.26680452 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

Author: Sayan Mukherjee, Qiang Wu

Abstract: We introduce an algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and estimate how they covary. An efficient implementation with respect to both memory and time is given. The utility of the algorithm is illustrated on simulated data as well as a gene expression data set. An error analysis is given for the convergence of the estimate of the classification function and its gradient to the true classification function and true gradient. Keywords: Tikhnonov regularization, variable selection, reproducing kernel Hilbert space, generalization bounds, classification

5 0.26466548 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

6 0.26095128 66 jmlr-2006-On Model Selection Consistency of Lasso

7 0.25789884 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

8 0.25692657 70 jmlr-2006-Online Passive-Aggressive Algorithms

9 0.25586921 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

10 0.25494096 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

11 0.25485811 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

12 0.25339359 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

13 0.25057423 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

14 0.25011405 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

15 0.24978842 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

16 0.24771468 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

17 0.24730901 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

18 0.24580601 36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution

19 0.24578893 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

20 0.24542955 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers