jmlr jmlr2012 jmlr2012-111 knowledge-graph by maker-knowledge-mining

111 jmlr-2012-Structured Sparsity and Generalization


Source: pdf

Author: Andreas Maurer, Massimiliano Pontil

Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. [sent-8, score-0.131]

2 The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. [sent-9, score-0.214]

3 A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. [sent-11, score-0.205]

4 The regularizer is expressed as an infimum convolution which involves a set M of linear transformations (see Equation (1) below). [sent-15, score-0.146]

5 As we shall see, this regularizer generalizes, depending on the choice of the set M , the regularizers used by several learning algorithms, such as ridge regression, the Lasso, the group Lasso (Yuan and Lin, 2006), multiple kernel learning (Lanckriet et al. [sent-16, score-0.25]

6 , 2004), the group Lasso with overlap (Obozinski et al. [sent-18, score-0.088]

7 We give a bound on the Rademacher average of the linear function class associated with this regularizer. [sent-21, score-0.068]

8 In particular, the bound applies to the Lasso in a separable Hilbert space or to multiple kernel learning with a countable number of kernels, under certain finite second-moment conditions. [sent-23, score-0.205]

9 Let M be an at most countable set of symmetric bounded linear operators on H such that for every x ∈ H with x ≠ 0 there is some linear operator M ∈ M with Mx ≠ 0, and such that sup_{M∈M} |||M||| < ∞, where ||| · ||| is the operator norm. [sent-26, score-0.129]

10 We show that the chosen notation is justified, because ‖·‖_M is indeed a norm on the subspace of H where it is finite, and the dual norm is, for every z ∈ H, given by ‖z‖_{M*} = sup_{M∈M} ‖Mz‖. [sent-30, score-0.264]
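As a concrete illustration of this dual norm, here is a minimal numerical sketch for a finite family of operators, assuming each operator is stored as a NumPy matrix (the names `dual_norm` and `ops` are illustrative, not from the paper):

```python
import numpy as np

def dual_norm(z, ops):
    # ||z||_{M*} = sup over M in the family of ||M z||.
    # `ops` is a finite list of (d, d) arrays standing in for the (possibly
    # countable) set M of symmetric bounded operators.
    return max(np.linalg.norm(M @ z) for M in ops)

# Example: coordinate projections on R^3, for which ||z||_{M*} = ||z||_inf.
ops = [np.diag(e) for e in np.eye(3)]
z = np.array([1.0, -2.0, 0.5])
print(dual_norm(z, ops))  # prints 2.0
```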

11 Given a bound on R_M(x) we obtain uniform bounds on the estimation error, for example using the following standard result (adapted from Bartlett and Mendelson 2002), where the Lipschitz function φ is to be interpreted as a loss function. [sent-43, score-0.093]

12 Let X = (X_1, . . . , X_n) be a vector of iid random variables with values in H, let X be iid to X_1, let φ : R → [0, 1] have Lipschitz constant L, and let δ ∈ (0, 1). [sent-47, score-0.126]

13 Then with probability at least 1 − δ in the draw of X it holds, for every β ∈ R^d with ‖β‖_M ≤ 1, that E φ(⟨β, X⟩) ≤ (1/n) Σ_{i=1}^n φ(⟨β, X_i⟩) + L R_M(X) + √(9 ln(2/δ) / (2n)). [sent-48, score-0.124]

14 A similar (slightly better) bound is obtained if R_M(X) is replaced by its expectation R_M = E R_M(X) (see Bartlett and Mendelson 2002). [sent-49, score-0.068]
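To make the shape of this bound concrete, the following hedged sketch simply evaluates its right-hand side from an empirical loss, a Rademacher bound, a Lipschitz constant and a confidence level; all names are illustrative and the function is only a numerical reading of the inequality as reconstructed above:

```python
import math

def uniform_bound(empirical_loss, rademacher_bound, lipschitz, n, delta):
    # Right-hand side of: E phi(<beta, X>) <= (1/n) sum_i phi(<beta, X_i>)
    #                     + L * R_M(X) + sqrt(9 * ln(2/delta) / (2 * n)).
    confidence_term = math.sqrt(9.0 * math.log(2.0 / delta) / (2.0 * n))
    return empirical_loss + lipschitz * rademacher_bound + confidence_term

# Example: n = 1000 samples, empirical loss 0.2, Rademacher bound 0.05, 1-Lipschitz loss.
print(uniform_bound(0.2, 0.05, 1.0, n=1000, delta=0.05))
```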

15 The following is the main result of this paper and leads to consistency proofs and finite sample generalization guarantees for all algorithms which use a regularizer of the form (1). [sent-50, score-0.082]

16 Then R_M(x) ≤ (2^{3/2}/n) √(sup_{M∈M} Σ_{i=1}^n ‖Mx_i‖²) (2 + √(ln(Σ_{M∈M} Σ_i ‖Mx_i‖² / sup_{N∈M} Σ_j ‖Nx_j‖²))) ≤ (2^{3/2}/n) √(Σ_{i=1}^n ‖x_i‖²_{M*}) (2 + √(ln |M|)). [sent-57, score-0.594]
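A sketch of how the first (data-dependent) bound of Theorem 2, as reconstructed above, could be evaluated for a finite family of operators and a finite sample; the function and variable names are placeholders rather than anything from the paper:

```python
import math
import numpy as np

def theorem2_bound(xs, ops):
    # R_M(x) <= (2^{3/2}/n) * sqrt(sup_M sum_i ||M x_i||^2)
    #           * (2 + sqrt(ln(sum_M sum_i ||M x_i||^2 / sup_N sum_j ||N x_j||^2)))
    n = xs.shape[0]
    per_op = np.array([sum(np.linalg.norm(M @ x) ** 2 for x in xs) for M in ops])
    sup_term = per_op.max()
    return (2 ** 1.5 / n) * math.sqrt(sup_term) * (2 + math.sqrt(math.log(per_op.sum() / sup_term)))

# Example: 50 points in R^4 with the four coordinate projections as M.
rng = np.random.default_rng(0)
xs = rng.standard_normal((50, 4))
ops = [np.diag(e) for e in np.eye(4)]
print(theorem2_bound(xs, ops))
```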

17 In this case we can draw the following conclusion: If we have an a priori bound on ‖X‖_{M*} for some data distribution, say ‖X‖_{M*} ≤ C, and X = (X_1, . [sent-63, score-0.068]

18 , X_n), with X_i iid to X, then R_M(X) ≤ (2^{3/2} C/√n) (2 + √(ln |M|)), thus passing from a data-dependent to a distribution dependent bound. [sent-66, score-0.187]

19 But the first bound in Theorem 2 can be considerably smaller than the second and may be finite even if M is infinite. [sent-71, score-0.068]

20 Corollary 3 Under the conditions of Theorem 2 we have R_M(x) ≤ (2^{3/2}/n) √(sup_{M∈M} Σ_i ‖Mx_i‖²) (2 + √(ln((1/n) Σ_{M∈M} Σ_i ‖Mx_i‖²))) + 2/√n. [sent-73, score-0.284]

21 To obtain a distribution dependent bound we retain the condition ‖X‖_{M*} ≤ C and replace finiteness of M by the condition that R² := E Σ_{M∈M} ‖MX‖² < ∞. (3) [sent-76, score-0.068]

22 Taking the expectation in Corollary 3 and using Jensen's inequality then gives a bound on the expected Rademacher complexity: R_M ≤ (2^{3/2} C/√n) (2 + √(ln R²)) + 2/√n. [sent-77, score-0.256]

23 We note that the numerical implementation and practical application of specific cases of the regularizer described here have been addressed in detail in a number of papers. [sent-85, score-0.082]

24 Examples Before giving the examples we mention a great simplification in the definition of the norm ‖·‖_M which occurs when the members of M have mutually orthogonal ranges. [sent-94, score-0.177]

25 If, in addition, every member of M is an orthogonal projection P, the norm further simplifies to ‖β‖_M = Σ_{P∈M} ‖Pβ‖, and the quantity R² occurring in the second moment condition (3) simplifies to R² = E Σ_{P∈M} ‖PX‖² = E ‖X‖². [sent-96, score-0.14]

26 , Xn ) will be a generic iid random vector of data points, Xi ∈ H, and X will be a generic data variable, iid to Xi . [sent-100, score-0.126]

27 Then ‖β‖_M = ‖β‖, ‖z‖_{M*} = ‖z‖, and the bound on the empirical Rademacher complexity becomes R_M(x) ≤ (2^{5/2}/n) √(Σ_i ‖x_i‖²), worse by a constant factor of 2^{3/2} than the corresponding result in Bartlett and Mendelson (2002), a tribute paid to the generality of our result. [sent-104, score-0.094]

28 The bound on R_M(x) now reads R_M(x) ≤ (2^{3/2}/n) √(Σ_i ‖x_i‖²_∞) (2 + √(ln d)). [sent-111, score-0.24]

29 If ‖X‖_∞ ≤ 1 almost surely we obtain R_M(X) ≤ (2^{3/2}/√n) (2 + √(ln d)), which agrees with the bound in Kakade et al. [sent-112, score-0.218]
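For the Lasso case, the data-dependent bound just stated can be computed directly from the ℓ∞ norms of the sample points; a hedged sketch (the array layout and names below are assumptions):

```python
import math
import numpy as np

def lasso_rademacher_bound(xs):
    # (2^{3/2}/n) * sqrt(sum_i ||x_i||_inf^2) * (2 + sqrt(ln d))
    n, d = xs.shape
    sq_inf_norms = np.max(np.abs(xs), axis=1) ** 2
    return (2 ** 1.5 / n) * math.sqrt(sq_inf_norms.sum()) * (2 + math.sqrt(math.log(d)))

# With ||x_i||_inf <= 1 this is at most (2^{3/2}/sqrt(n)) * (2 + sqrt(ln d)).
xs = np.random.default_rng(1).uniform(-1.0, 1.0, size=(200, 100))
print(lasso_rademacher_bound(xs))
```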

30 Our last bound is useless if d ≥ e^n or if d is infinite. [sent-114, score-0.068]

31 But whenever the norm of the data has finite second moments we can use Corollary 3 and inequality (4) to obtain R_M(X) ≤ (2^{3/2}/√n) (2 + √(ln E‖X‖²)) + 2/√n. [sent-115, score-0.24]

32 Then ‖β‖_M = Σ_k α_k^{-1} |β_k| and ‖z‖_{M*} = sup_k α_k |z_k|. [sent-130, score-0.16]

33 So in this case the second moment bound is enforced by the weighting sequence. [sent-135, score-0.1]

34 The ranges of the P_{J_ℓ} then provide an orthogonal decomposition of R^d and the above mentioned simplifications also apply. [sent-148, score-0.09]

35 The algorithm which uses ‖β‖_M as a regularizer is called the group Lasso (see, for example, Yuan and Lin 2006). [sent-150, score-0.136]
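A minimal sketch of the group Lasso regularizer for disjoint groups, with an illustrative partition of the coordinates (the paper provides no code; everything below is an assumption made for illustration):

```python
import numpy as np

def group_lasso_norm(beta, groups):
    # ||beta||_M = sum over groups J_l of the Euclidean norm of beta restricted to J_l.
    return sum(np.linalg.norm(beta[idx]) for idx in groups)

beta = np.array([1.0, -2.0, 0.0, 3.0, 4.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]  # a partition of {0, ..., 5}
print(group_lasso_norm(beta, groups))  # sqrt(5) + 3 + 4
```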

36 , r} then we get R_M(X) ≤ (2^{3/2}/√n) (2 + √(ln r)) (5), in complete symmetry with the Lasso and essentially the same as given in Kakade et al. [sent-155, score-0.124]

37 5 Overlapping Groups In the previous examples the members of M always had mutually orthogonal ranges, which gave a simple appearance to the norm β M . [sent-160, score-0.177]

38 If the ranges are not mutually orthogonal, the norm has a more complicated form. [sent-161, score-0.123]

39 For example, in the group Lasso setting, if the groups Jℓ cover {1, . [sent-162, score-0.078]

40 , d}, but are not disjoint, we obtain the regularizer of Obozinski et al. [sent-165, score-0.082]

41 , r} then the Rademacher complexity of the set of linear functionals with Ωoverlap (β) ≤ 1 is bounded as in (5), in complete equivalence to the bound for the group Lasso. [sent-170, score-0.145]

42 The same bound also holds for the class satisfying Ω_group(β) ≤ 1, where the function Ω_group is defined, for every β ∈ R^d, as Ω_group(β) = Σ_{ℓ=1}^r ‖P_{J_ℓ} β‖, which has been proposed by Jenatton et al. [sent-171, score-0.068]

43 The bound obtained from this simple comparison may however be quite loose. [sent-175, score-0.068]

44 6 Regularizers Generated from Cones Our next example considers structured sparsity regularizers as in Micchelli et al. [sent-177, score-0.139]

45 (2011) that Ω_Λ is a norm and that the dual norm is given by ‖z‖_{Λ*} = sup{ (Σ_{j=1}^d µ_j z_j²)^{1/2} : µ_j = λ_j/‖λ‖_1 with λ ∈ Λ }. [sent-181, score-0.126]

46 If E(Λ) is finite and x is a sample then the Rademacher complexity of the class with Ω_Λ(β) ≤ 1 is bounded by (2^{3/2}/n) √(Σ_{i=1}^n ‖x_i‖²_{Λ*}) (2 + √(ln |E(Λ)|)). [sent-186, score-0.15]

47 7 Kernel Learning This is the most general case to which the simplification applies: Suppose that H is the direct sum H = ⊕ j∈J H j of an at most countable number of Hilbert spaces H j . [sent-188, score-0.099]

48 Then ‖β‖_M = Σ_{j∈J} ‖P_j β‖ and ‖z‖_{M*} = sup_{j∈J} ‖P_j z‖. [sent-190, score-0.16]

49 Let φ_j : X → H_j be the feature map representation associated with kernel K_j, so that, for every x, t ∈ X, K_j(x, t) = ⟨φ_j(x), φ_j(t)⟩ (for background on kernel methods see, for example, Shawe-Taylor and Cristianini 2004). [sent-195, score-0.076]

50 Define the kernel matrix K_j = (K_j(x_i, x_k))_{i,k=1}^n. [sent-200, score-0.101]

51 Using this notation the bound in Theorem 2 reads R((φ(x_1), . . . , [sent-201, score-0.09]

52 φ(x_n))) ≤ (2^{3/2}/n) √(sup_{j∈J} tr K_j) (2 + √(ln(Σ_{j∈J} tr K_j / sup_{j∈J} tr K_j))). [sent-204, score-0.444]

53 In particular, if J is finite and K_j(x, x) ≤ 1 for every x ∈ X and j ∈ J, then the bound reduces to (2^{3/2}/√n) (2 + √(ln |J|)), essentially in agreement with Cortes et al. [sent-205, score-0.192]
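Since the kernel-learning form of the bound depends on the data only through the traces of the Gram matrices, it is easy to evaluate; the following sketch is illustrative only, with an assumed construction of one-dimensional linear kernels:

```python
import math
import numpy as np

def mkl_rademacher_bound(kernel_matrices, n):
    # (2^{3/2}/n) * sqrt(sup_j tr K_j) * (2 + sqrt(ln(sum_j tr K_j / sup_j tr K_j)))
    traces = np.array([np.trace(K) for K in kernel_matrices])
    sup_tr = traces.max()
    return (2 ** 1.5 / n) * math.sqrt(sup_tr) * (2 + math.sqrt(math.log(traces.sum() / sup_tr)))

# Example: Gram matrices of one-dimensional linear kernels on random features.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
kernels = [np.outer(X[:, j], X[:, j]) for j in range(5)]
print(mkl_rademacher_bound(kernels, n=100))
```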

54 We conclude this section by noting that, for every set M we may choose a set of kernels such that empirical risk minimization with the norm ‖·‖_M is equivalent to multiple kernel learning with kernels K_M(x, t) = ⟨Mx, Mt⟩, M ∈ M. [sent-211, score-0.15]

55 The following concentration inequality, known as the bounded difference inequality (see McDiarmid 1998), goes back to the work of Hoeffding (1963). [sent-238, score-0.088]

56 Theorem 4 Let F : X^n → R and write B² = Σ_{k=1}^n sup_{y_1, y_2 ∈ X, x ∈ X^n} (F(x_{k←y_1}) − F(x_{k←y_2}))². [sent-240, score-0.16]

57 Then ∫_δ^∞ exp(−t²/(2a²)) dt ≤ (a²/δ) exp(−δ²/(2a²)). [sent-247, score-0.171]

58 Thus ∫_δ^∞ exp(−t²/(2a²)) dt = a ∫_{δ/a}^∞ e^{−t²/2} dt ≤ (a²/δ) ∫_{δ/a}^∞ t e^{−t²/2} dt = (a²/δ) exp(−δ²/(2a²)). [sent-249, score-0.373]

59 2 Properties of the Regularizer In this section, we show that the regularizer in Equation (1) is indeed a norm and we derive the associated dual norm. [sent-251, score-0.134]

60 Condition 6 M is an at most countable set of symmetric bounded linear operators on a real separable Hilbert space H such that (a) for every x ∈ H with x ≠ 0, there exists M ∈ M such that Mx ≠ 0, and (b) sup_{M∈M} |||M||| < ∞ if q = 1 and Σ_{M∈M} |||M|||^p < ∞ if q > 1. [sent-257, score-0.129]

61 For z ∈ H the norm of the linear functional β ∈ ℓ_q(M) ↦ ⟨β, z⟩ is ‖z‖_{M_q*} = sup_{M∈M} ‖Mz‖ if q = 1, and (Σ_{M∈M} ‖Mz‖^p)^{1/p} if q > 1. [sent-263, score-0.212]

62 If w = (w_M)_{M∈M} is an H-valued sequence indexed by M, then the linear functional v ∈ V_q(M) ↦ Σ_{M∈M} ⟨v_M, w_M⟩ has norm ‖w‖_{V_q(M)*} = sup_{M∈M} ‖w_M‖ if q = 1, and (Σ_{M∈M} ‖w_M‖^p)^{1/p} if q > 1. [sent-266, score-0.212]

63 By Condition 6(b) and Hölder's inequality A is a bounded linear transformation whose kernel K is therefore closed, making the quotient space V_q(M)/K into a Banach space with quotient norm ‖w + K‖_Q = inf{ ‖v‖_{V_q(M)} : w − v ∈ K }. [sent-270, score-0.234]

64 The range of A is ℓ_q(M) and becomes a Banach space with the norm ‖Â⁻¹(β)‖_Q. [sent-272, score-0.086]

65 Then ‖z‖_{M_q*} = sup{⟨z, β⟩ : ‖β‖_{M_q} ≤ 1} = sup{⟨z, Av⟩ : ‖v‖_{V_q(M)} ≤ 1} = sup{⟨A*z, v⟩ : ‖v‖_{V_q(M)} ≤ 1} = ‖A*z‖_{V_q(M)*} = sup_{M∈M} ‖Mz‖ if q = 1, or (Σ_{M∈M} ‖Mz‖^p)^{1/p} if q > 1. [sent-282, score-0.64]

66 Proposition 8 If the ranges of the members of M are mutually orthogonal then for β ∈ ℓ_1(M), ‖β‖_M = Σ_{M∈M} ‖M⁺β‖, where M⁺ is the pseudoinverse of M. [sent-283, score-0.159]

67 Proof The ranges of the members of M provide an orthogonal decomposition of H, so β = Σ_{M∈M} M M⁺ β, where we used the fact that M M⁺ is the orthogonal projection onto the range of M. [sent-284, score-0.178]

68 3 Bounds for the ℓ1 (M )-Norm Regularizer We use the bounded difference inequality to derive a concentration inequality for linearly transformed random vectors. [sent-288, score-0.152]

69 By the triangle inequality Σ_{k=1}^n sup_{y_1, y_2 ∈ [−1,1], x ∈ [−1,1]^n} (F(x_{k←y_1}) − F(x_{k←y_2}))² ≤ Σ_{k=1}^n sup ‖M(x_{k←y_1} − x_{k←y_2})‖² = Σ_{k=1}^n sup_{y_1, y_2 ∈ [−1,1]} (y_1 − y_2)² ‖Me_k‖² ≤ 4 ‖M‖²_HS. [sent-298, score-0.544]

70 (ii) If ε is orthonormal then it follows from Jensen's inequality that E‖Mε‖ ≤ (E ‖Σ_{i=1}^n ε_i Me_i‖²)^{1/2} = (Σ_{i=1}^n ‖Me_i‖²)^{1/2} = ‖M‖_HS. [sent-300, score-0.104]

71 We now use integration by parts, a union bound and the above concentration inequality to derive a bound on the expectation of the supremum of the norms Mε . [sent-305, score-0.272]

72 Lemma 10 Let M be an at most countable set of linear transformations M : R^n → H and ε = (ε_1, . [sent-309, score-0.163]

73 Then E sup_{M∈M} ‖Mε‖ ≤ √2 sup_{M∈M} ‖M‖_HS (2 + √(ln(Σ_{M∈M} ‖M‖²_HS / sup_{M∈M} ‖M‖²_HS))). [sent-313, score-0.444]
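A small Monte Carlo sketch comparing the left-hand side of this lemma with the bound, for randomly generated finite-dimensional transformations; everything here, from the operator construction to the sample size, is an assumption made for illustration:

```python
import math
import numpy as np

def lemma10_bound(ops):
    # sqrt(2) * sup_M ||M||_HS * (2 + sqrt(ln(sum_M ||M||_HS^2 / sup_M ||M||_HS^2)))
    hs_sq = np.array([np.sum(M ** 2) for M in ops])  # squared Hilbert-Schmidt norms
    sup_sq = hs_sq.max()
    return math.sqrt(2) * math.sqrt(sup_sq) * (2 + math.sqrt(math.log(hs_sq.sum() / sup_sq)))

rng = np.random.default_rng(3)
n = 20
ops = [rng.standard_normal((4, n)) / math.sqrt(n) for _ in range(10)]  # maps R^n -> R^4

# Monte Carlo estimate of E sup_M ||M eps|| over Rademacher vectors eps.
eps = rng.choice([-1.0, 1.0], size=(2000, n))
estimate = np.mean([max(np.linalg.norm(M @ e) for M in ops) for e in eps])
print(estimate, "<=", lemma10_bound(ops))
```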

74 We now use integration by parts: E sup_{M∈M} ‖Mε‖ = ∫_0^∞ Pr{sup_{M∈M} ‖Mε‖ > t} dt ≤ M_∞ + δ + ∫_{M_∞+δ}^∞ Pr{sup_{M∈M} ‖Mε‖ > t} dt ≤ M_∞ + δ + Σ_{M∈M} ∫_{M_∞+δ}^∞ Pr{‖Mε‖ > t} dt, where we have introduced a parameter δ ≥ 0 and M_∞ denotes sup_{M∈M} ‖M‖_HS. [sent-315, score-0.708]

75 Substitution in the previous chain of inequalities and using Hölder's inequality (in the ℓ_1/ℓ_∞ version) gives E sup_{M∈M} ‖Mε‖ ≤ M_∞ + δ + (1/δ) Σ_{M∈M} ‖M‖²_HS exp(−δ²/(2M_∞²)). (8) [sent-318, score-0.259]

76 We now set δ = M_∞ √(2 ln(e Σ_{M∈M} ‖M‖²_HS / M_∞²)). [sent-319, score-0.124]

77 The substitution makes the last term in (8) smaller than M_∞/(√2 e), and since 1 + 1/(√2 e) < √2, we obtain E sup_{M∈M} ‖Mε‖ ≤ √2 M_∞ (1 + √(ln(e Σ_{M∈M} ‖M‖²_HS / M_∞²))). [sent-321, score-0.319]

78 Finally we use √(ln(es)) ≤ 1 + √(ln s) for s ≥ 1. [sent-322, score-0.248]

79 We have R_M(x) = (2/n) E sup_{β : ‖β‖_M ≤ 1} ⟨β, Σ_{i=1}^n ε_i x_i⟩ ≤ (2/n) E ‖Σ_{i=1}^n ε_i x_i‖_{M*} = (2/n) E sup_{M∈M} ‖M_x ε‖. [sent-328, score-0.372]

80 Applying Lemma 10 to the set of transformations {M_x : M ∈ M} gives R_M(x) ≤ (2^{3/2}/n) sup_{M∈M} ‖M_x‖_HS (2 + √(ln(Σ_{M∈M} ‖M_x‖²_HS / sup_{M∈M} ‖M_x‖²_HS))). Substitution of ‖M_x‖²_HS = Σ_{i=1}^n ‖Mx_i‖² [sent-329, score-0.188]

81 gives the first inequality of Theorem 2, and sup_{M∈M} ‖M_x‖²_HS ≤ Σ_{i=1}^n sup_{M∈M} ‖Mx_i‖² = Σ_{i=1}^n ‖x_i‖²_{M*} gives the second inequality. [sent-330, score-0.41]

82 For A, B > 0 and n ∈ N this implies that A ln(B/A) = n [(A/n) ln(B/n) − (A/n) ln(A/n)] ≤ A ln(B/n) + n/e. (9) [sent-332, score-0.372]

83 Now multiply out the first inequality of Theorem 2 and use (9) with A = sup_{M∈M} Σ_{i=1}^n ‖Mx_i‖² and B = Σ_{M∈M} Σ_i ‖Mx_i‖², and finally use √(a + b) ≤ √a + √b. [sent-333, score-0.348]

84 The second result is not dimension free, but it approaches the bound in Theorem 2 for arbitrarily large dimensions. [sent-342, score-0.068]

85 The proof is based on the following Lemma 12 Let M be an at most countable set of linear transformations M : R^n → H and ε = (ε_1, . [sent-346, score-0.189]

86 We rewrite the expectation appearing in the right hand side using integration by parts and a change of variable as E[‖Mε‖^p] = ∫_0^∞ Pr{‖Mε‖^p > t} dt ≤ A^p + p ∫_0^∞ Pr{‖Mε‖^p > s^p + A^p} s^{p−1} ds, (11) where A ≥ 0. [sent-352, score-0.127]

87 This allows us to bound Pr{‖Mε‖^p > s^p + A^p} ≤ Pr{‖Mε‖ > λ^{(p−1)/p} s + (1 − λ)^{(p−1)/p} A}. Combining Equations (11) and (12), choosing A = (1 − λ)^{(1−p)/p} ‖M‖_HS, and making the change of variable t = λ^{(p−1)/p} s + (1 − λ)^{(p−1)/p} A . [sent-354, score-0.068]

88 One can verify that the leading constant in our bound is smaller than the one in Cortes et al. [sent-366, score-0.068]

89 Theorem 13 Under the conditions of Theorem 11, R_{M_q}(x) ≤ (4 |M|^{1/p}/n) √(sup_{M∈M} Σ_i ‖Mx_i‖²) (2 + √(ln(Σ_{M∈M} Σ_i ‖Mx_i‖² / sup_{N∈M} Σ_i ‖Nx_i‖²))). [sent-371, score-0.284]

90 The key step in the proof of Theorem 13 is the following Lemma 14 Let M be a finite set of linear transformations M : R^n → H and ε = (ε_1, . [sent-373, score-0.09]

91 Then (Σ_M E‖Mε‖^p)^{1/p} ≤ 2 |M|^{1/p} sup_{M∈M} ‖M‖_HS (2 + √(ln(Σ_M ‖M‖²_HS / sup_{N∈M} ‖N‖²_HS))). Proof If t ≥ 0 and Σ_M ‖Mε‖^p > t^p, then there must exist some M ∈ M such that ‖Mε‖^p > t^p/|M|, which in turn implies that ‖Mε‖ > t/|M|^{1/p}, so Pr{(Σ_M ‖Mε‖^p)^{1/p} > t} ≤ Σ_M Pr{‖Mε‖ > t/|M|^{1/p}}. [sent-377, score-0.284]

92 It then follows from a union bound that Pr{(Σ_M ‖Mε‖^p)^{1/p} > t} ≤ Σ_{M∈M} exp(−t²/(4 |M|^{2/p} ‖M‖²_HS)), [sent-378, score-0.103]

93 where we used the subgaussian concentration inequality Lemma 9(ii) with r = 2. [sent-379, score-0.088]

94 We now substitute δ = 2 |M|^{1/p} sup_{M∈M} ‖M‖_HS √(ln(e Σ_M ‖M‖²_HS / sup_{N∈M} ‖N‖²_HS)) and use 1 + 1/e ≤ 2 to arrive at the conclusion. [sent-381, score-0.16]

95 This gives (E Σ_M ‖M_x ε‖^p)^{1/p} ≤ 2 |M|^{1/p} sup_{M∈M} ‖M_x‖_HS (2 + . . . ). We now proceed as in the proof of Theorem 11 to obtain the result. [sent-383, score-0.186]

96 Conclusion and Future Work We have presented a bound on the Rademacher average for linear function classes described by infimum convolution norms which are associated with a class of bounded linear operators on a Hilbert space. [sent-386, score-0.098]

97 When the bound is applied to specific cases (ℓ2 , ℓ1 , mixed ℓ1 /ℓ2 norms) it recovers existing bounds (up to small changes in the constants). [sent-388, score-0.093]

98 Specifically, we have shown that the bound can be applied in infinite dimensional settings, provided that the moment condition (3) is satisfied. [sent-390, score-0.1]

99 We have also applied the bound to multiple kernel learning. [sent-391, score-0.106]

100 While in the standard case the bound is only slightly worse in the constants, the bound is potentially smaller and applies to the more general case in which there is a countable set of kernels, provided the expectation of the sum of the kernels is bounded. [sent-392, score-0.265]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hs', 0.579), ('vq', 0.265), ('mx', 0.258), ('mxi', 0.211), ('aurer', 0.177), ('ontil', 0.177), ('vm', 0.164), ('pj', 0.164), ('sup', 0.16), ('parsity', 0.136), ('tructured', 0.136), ('lasso', 0.128), ('ln', 0.124), ('mq', 0.11), ('mz', 0.106), ('supm', 0.106), ('eneralization', 0.105), ('pr', 0.104), ('dt', 0.101), ('countable', 0.099), ('rademacher', 0.088), ('regularizer', 0.082), ('regularizers', 0.076), ('mvm', 0.076), ('rm', 0.073), ('rmq', 0.071), ('supn', 0.071), ('bound', 0.068), ('cortes', 0.065), ('inequality', 0.064), ('banach', 0.064), ('transformations', 0.064), ('iid', 0.063), ('xk', 0.063), ('mendelson', 0.06), ('orthogonal', 0.056), ('kakade', 0.055), ('group', 0.054), ('mei', 0.053), ('trk', 0.053), ('norm', 0.052), ('micchelli', 0.051), ('obozinski', 0.05), ('hilbert', 0.046), ('morales', 0.045), ('maurer', 0.045), ('jenatton', 0.045), ('meir', 0.041), ('structured', 0.04), ('orthonormal', 0.04), ('bartlett', 0.039), ('andreas', 0.038), ('kernel', 0.038), ('mutually', 0.037), ('av', 0.037), ('baldassarre', 0.035), ('mol', 0.035), ('shimamura', 0.035), ('exp', 0.035), ('substitution', 0.035), ('theorem', 0.035), ('inf', 0.034), ('overlap', 0.034), ('ranges', 0.034), ('ying', 0.033), ('xn', 0.033), ('lanckriet', 0.033), ('moment', 0.032), ('members', 0.032), ('kloft', 0.031), ('kernels', 0.03), ('baraniuk', 0.03), ('panchenko', 0.03), ('rd', 0.03), ('operators', 0.03), ('assertion', 0.027), ('nx', 0.027), ('meier', 0.027), ('xi', 0.026), ('proof', 0.026), ('integration', 0.026), ('surely', 0.026), ('bounds', 0.025), ('ledoux', 0.025), ('lighten', 0.025), ('lemma', 0.025), ('pontil', 0.025), ('groups', 0.024), ('concentration', 0.024), ('functionals', 0.023), ('quotient', 0.023), ('massimiliano', 0.023), ('sparsity', 0.023), ('supremum', 0.022), ('wm', 0.022), ('lounici', 0.022), ('pd', 0.022), ('campbell', 0.022), ('reads', 0.022), ('geer', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 111 jmlr-2012-Structured Sparsity and Generalization

Author: Andreas Maurer, Massimiliano Pontil

Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.

2 0.1314522 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Author: Jian Huang, Cun-Hui Zhang

Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

3 0.120288 97 jmlr-2012-Regularization Techniques for Learning with Matrices

Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari

Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning

4 0.10603953 80 jmlr-2012-On Ranking and Generalization Bounds

Author: Wojciech Rejchel

Abstract: The problem of ranking is to predict or to guess the ordering between objects on the basis of their observed features. In this paper we consider ranking estimators that minimize the empirical convex risk. We prove generalization bounds for the excess risk of such estimators with rates that are 1 faster than √n . We apply our results to commonly used ranking algorithms, for instance boosting or support vector machines. Moreover, we study the performance of considered estimators on real data sets. Keywords: convex risk minimization, excess risk, support vector machine, empirical process, U-process

5 0.10405479 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

Author: Marius Kloft, Gilles Blanchard

Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity

6 0.091413528 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

7 0.089220352 82 jmlr-2012-On the Necessity of Irrelevant Variables

8 0.083892792 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

9 0.082575671 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

10 0.07959675 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity

11 0.076310806 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

12 0.061691433 73 jmlr-2012-Multi-task Regression using Minimal Penalties

13 0.061633173 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

14 0.061065122 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

15 0.056280449 91 jmlr-2012-Plug-in Approach to Active Learning

16 0.054273341 13 jmlr-2012-Active Learning via Perfect Selective Classification

17 0.050649036 59 jmlr-2012-Linear Regression With Random Projections

18 0.050553657 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors

19 0.050481822 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics

20 0.049939625 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.236), (1, 0.102), (2, -0.222), (3, 0.113), (4, -0.002), (5, 0.003), (6, -0.021), (7, -0.06), (8, -0.005), (9, 0.001), (10, 0.006), (11, -0.094), (12, 0.07), (13, -0.018), (14, 0.007), (15, -0.244), (16, 0.098), (17, 0.03), (18, -0.009), (19, -0.003), (20, 0.084), (21, 0.023), (22, 0.182), (23, -0.158), (24, 0.048), (25, -0.096), (26, 0.19), (27, 0.015), (28, -0.007), (29, -0.046), (30, -0.02), (31, 0.056), (32, 0.197), (33, -0.009), (34, 0.039), (35, 0.057), (36, 0.096), (37, 0.029), (38, -0.038), (39, 0.087), (40, 0.026), (41, -0.023), (42, 0.052), (43, 0.055), (44, 0.098), (45, -0.032), (46, 0.07), (47, -0.105), (48, -0.021), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93201268 111 jmlr-2012-Structured Sparsity and Generalization

Author: Andreas Maurer, Massimiliano Pontil

Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.

2 0.60626227 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications

Author: Jian Huang, Cun-Hui Zhang

Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity

3 0.54718739 97 jmlr-2012-Regularization Techniques for Learning with Matrices

Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari

Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning

4 0.49969047 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

5 0.48074332 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

Author: Marius Kloft, Gilles Blanchard

Abstract: We derive an upper bound on the local Rademacher complexity of ℓ p -norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case p = 1 only while our analysis covers all cases 1 ≤ p ≤ ∞, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely α fast convergence rates of the order O(n− 1+α ), where α is the minimum eigenvalue decay rate of the individual kernels. Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity

6 0.46307305 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment

7 0.45710599 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

8 0.45417419 80 jmlr-2012-On Ranking and Generalization Bounds

9 0.44338927 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression

10 0.44274664 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods

11 0.44112077 82 jmlr-2012-On the Necessity of Irrelevant Variables

12 0.36833474 73 jmlr-2012-Multi-task Regression using Minimal Penalties

13 0.3310504 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

14 0.32727116 91 jmlr-2012-Plug-in Approach to Active Learning

15 0.31171957 59 jmlr-2012-Linear Regression With Random Projections

16 0.27707103 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

17 0.27268058 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection

18 0.27120739 13 jmlr-2012-Active Learning via Perfect Selective Classification

19 0.26652858 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise

20 0.25396574 4 jmlr-2012-A Kernel Two-Sample Test


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.03), (21, 0.069), (26, 0.054), (29, 0.03), (49, 0.043), (64, 0.012), (69, 0.013), (75, 0.083), (77, 0.01), (79, 0.022), (80, 0.32), (92, 0.131), (96, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73426151 111 jmlr-2012-Structured Sparsity and Generalization

Author: Andreas Maurer, Massimiliano Pontil

Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.

2 0.50876719 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting

Author: Matus Telgarsky

Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy

3 0.50153655 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning

Author: Sangkyun Lee, Stephen J. Wright

Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification

4 0.49950999 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO

Author: Ji Liu, Peter Wonka, Jieping Ye

Abstract: We consider the following sparse signal recovery (or feature selection) problem: given a design matrix X ∈ Rn×m (m ≫ n) and a noisy observation vector y ∈ Rn satisfying y = Xβ∗ + ε where ε is the noise vector following a Gaussian distribution N(0, σ2 I), how to recover the signal (or parameter vector) β∗ when the signal is sparse? The Dantzig selector has been proposed for sparse signal recovery with strong theoretical guarantees. In this paper, we propose a multi-stage Dantzig selector method, which iteratively refines the target signal β∗ . We show that if X obeys a certain condition, then with a large probability ˆ the difference between the solution β estimated by the proposed method and the true solution β∗ measured in terms of the ℓ p norm (p ≥ 1) is bounded as ˆ β − β∗ p ≤ C(s − N)1/p log m + ∆ σ, where C is a constant, s is the number of nonzero entries in β∗ , the risk of the oracle estimator ∆ is independent of m and is much smaller than the first term, and N is the number of entries of β∗ larger √ than a certain value in the order of O (σ log m). The proposed √ method improves the estimation √ bound of the standard Dantzig selector approximately from Cs1/p log mσ to C(s − N)1/p log mσ where the value N depends on the number of large entries in β∗ . When N = s, the proposed algorithm achieves the oracle solution with a high probability, where the oracle solution is the projection of the observation vector y onto true features. In addition, with a large probability, the proposed method can select the same number of correct features under a milder condition than the Dantzig selector. Finally, we extend this multi-stage procedure to the LASSO case. Keywords: multi-stage, Dantzig selector, LASSO, sparse signal recovery

5 0.4987736 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality

Author: Lan Xue, Annie Qu

Abstract: The varying-coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying-coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying-coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying-coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the least absolute shrinkage and selection operator (LASSO), the adaptive LASSO and the smoothly clipped absolute deviation (SCAD) approaches. Keywords: coordinate decent algorithm, difference convex programming, L0 - regularization, large-p small-n, model selection, nonparametric function, oracle property, truncated L1 penalty

6 0.498191 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches

7 0.49778581 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning

8 0.49479714 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods

9 0.49348 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions

10 0.49245709 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class

11 0.49138123 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss

12 0.49036074 73 jmlr-2012-Multi-task Regression using Minimal Penalties

13 0.49035889 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints

14 0.4879148 13 jmlr-2012-Active Learning via Perfect Selective Classification

15 0.48783171 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming

16 0.48702165 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers

17 0.48644546 82 jmlr-2012-On the Necessity of Irrelevant Variables

18 0.48597139 80 jmlr-2012-On Ranking and Generalization Bounds

19 0.48165411 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels

20 0.48128149 97 jmlr-2012-Regularization Techniques for Learning with Matrices