jmlr jmlr2008 jmlr2008-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we study the asymptotic group selection consistency of the group Lasso. [sent-6, score-0.15]
2 When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. [sent-8, score-0.124]
3 In the case of a fixed number of covariates, the Lasso does recover the sparsity pattern if and only if a certain simple condition on the generating covariance matrices is satisfied (Yuan and Lin, 2007). [sent-26, score-0.116]
4 A related Lasso-type procedure is the group Lasso, where the covariates are assumed to be clustered in groups, and instead of summing the absolute values of each individual loading, the sum of Euclidean norms of the loadings in each group is used. [sent-30, score-0.146]
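To make the penalty just described concrete, here is a minimal Python/NumPy sketch (illustrative only, not the paper's code; the 1/(2n) least-squares normalization and all variable names are assumptions) that evaluates a group Lasso objective of the form least-squares loss plus λ Σ_j d_j ‖w_j‖:

    import numpy as np

    def group_lasso_objective(w, X, y, groups, d, lam):
        # groups: list of integer index arrays, one per group of covariates
        # d: per-group weights d_j; lam: regularization parameter lambda_n
        n = X.shape[0]
        residual = y - X @ w
        loss = residual @ residual / (2.0 * n)
        penalty = sum(dj * np.linalg.norm(w[g]) for dj, g in zip(d, groups))
        return loss + lam * penalty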
5 Intuitively, this should drive all the weights in one group to zero together, and thus lead to group selection (Yuan and Lin, 2006). [sent-31, score-0.114]
6 In Section 2, we extend the consistency results of the Lasso to the group Lasso, showing that similar correlation conditions are necessary and sufficient conditions for consistency. [sent-32, score-0.114]
7 The group Lasso essentially replaces groups of size one by groups of size larger than one. [sent-39, score-0.103]
8 We extend the consistency results of the group Lasso to this nonparametric case, by using covariance operators and appropriate notions of functional analysis. [sent-47, score-0.2]
9 In this paper, we study path consistency of the group Lasso and of multiple kernel learning, and in simulations we use the publicly available code for the algorithm of Bach et al. [sent-139, score-0.124]
10 Note that the conditions involve the covariances between all active groups $X_j$, $j \in J$, and all inactive groups $X_i$, $i \in J^c$. [sent-143, score-0.133]
11 In the case of the group Lasso, even with a finite fixed number of groups, our results are not as strong, as we can only get the strict condition as sufficient and the weak condition as necessary. [sent-148, score-0.113]
12 The following theorem shows that if condition (4) is satisfied, any regularization parameter that satisfies certain decay conditions will lead to a consistent estimator; thus the strong condition (4) is sufficient for path consistency: Theorem 2 Assume (A1-3). [sent-152, score-0.106]
13 Then the group Lasso estimate $\hat w$ defined in Eq. (1) converges in probability to $w$, and the group sparsity pattern $J(\hat w) = \{j,\ \hat w_j \neq 0\}$ converges in probability to $J$ (i. [sent-154, score-0.146]
14 On the other hand, the result (and the similar one for the Lasso) is rather disappointing regarding the applicability of the group Lasso as a practical group selection method, as Theorem 3 states that if the weak correlation condition (5) is not satisfied, we cannot have consistency. [sent-162, score-0.148]
15 6 Refinements of Consistency Conditions Our current results state that the strict condition (4) is sufficient for joint consistency of the group Lasso, while the weak condition (5) is only necessary. [sent-169, score-0.161]
16 The main technical reason for those differences is that in dimension one, the set of vectors of unit norm is finite (two possible values), and thus regular squared norm consistency leads to estimates of the signs of the loadings (i. [sent-171, score-0.105]
17 Assume the weak condition (5) is satisfied and that, for all $i \in J^c$ such that $\frac{1}{d_i}\big\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j/\|w_j\|)w_J\big\| = 1$, we have $\Delta^\top \Sigma_{X_JX_i}\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}\!\Big(\tfrac{d_j}{\|w_j\|}\Big(I_{p_j} - \tfrac{w_jw_j^\top}{\|w_j\|^2}\Big)\Big)\Delta > 0$, (6) with $\Delta = -\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j/\|w_j\|)w_J$. [sent-176, score-0.81]
18 Then the group Lasso estimate $\hat w$ defined in Eq. (1) converges in probability to $w$, and the group sparsity pattern $J(\hat w) = \{j,\ \hat w_j \neq 0\}$ converges in probability to $J$. [sent-178, score-0.146]
19 The previous theorem does not contradict the fact that condition (4) is necessary for path-consistency in the Lasso case: indeed, if $w_j$ has dimension one (i. [sent-182, score-0.417]
20 We can also further refine the necessary condition results in Theorem 3: as stated in Theorem 3, the group Lasso estimator may be both consistent in terms of norm and sparsity pattern only if condition (5) is satisfied. [sent-188, score-0.185]
21 However, if we require only consistent sparsity pattern estimation, then we may allow the regularization parameter $\lambda_n$ to converge to a strictly positive limit $\lambda_0$. [sent-189, score-0.109]
22 If there exists $\lambda_0 > 0$ such that the solution of problem (7) has the correct sparsity pattern, then the group Lasso estimate with $\lambda_n \to \lambda_0$ will have a consistent sparsity pattern. [sent-191, score-0.147]
23 Thus, even when condition (5) is not satisfied, we may have consistent estimation of the sparsity pattern but inconsistent estimation of the loading vectors. [sent-195, score-0.163]
24 In this section, we consider sufficient conditions that do not depend on the loading vector, but only on the sparsity pattern J and of course on the covariance matrices. [sent-217, score-0.127]
25 The following condition is sufficient for consistency of the group Lasso, for all possible loading vectors $w$ with sparsity pattern $J$: $C(\Sigma_{XX}, d, J) = \max_{i \in J^c}\ \max_{\|u_j\|\leq 1,\ j\in J}\ \frac{1}{d_i}\,\big\|\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j)\,u_J\big\| < 1$. [sent-218, score-0.217]
26 This leads to the following sufficient condition for consistency of the group Lasso (which extends the condition of Yuan and Lin, 2007): $\max_{i\in J^c}\ \frac{1}{d_i}\sum_{j\in J} d_j\,\Big\|\sum_{k\in J}\Sigma_{X_iX_k}\big(\Sigma_{X_JX_J}^{-1}\big)_{kj}\Big\| < 1$. [sent-220, score-0.216]
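A rough numerical sketch of how a loading-independent condition of this kind could be checked from a joint covariance matrix is given below (Python/NumPy). The exact norm and normalization used in the paper's condition may differ, so this is only an illustration of the quantity: a maximum over inactive groups of a weighted norm of the blocks of $\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}$.

    import numpy as np

    def check_group_condition(Sigma, groups, J, d):
        # Sigma: joint covariance matrix; groups: list of index arrays;
        # J: list of active group indices; d: per-group weights.
        idx_J = np.concatenate([groups[j] for j in J])
        Sigma_JJ_inv = np.linalg.inv(Sigma[np.ix_(idx_J, idx_J)])
        values = []
        for i in range(len(groups)):
            if i in J:
                continue
            # Sigma_{X_i X_J} Sigma_{X_J X_J}^{-1}
            M = Sigma[np.ix_(groups[i], idx_J)] @ Sigma_JJ_inv
            total, offset = 0.0, 0
            for j in J:
                pj = len(groups[j])
                total += d[j] * np.linalg.norm(M[:, offset:offset + pj], 2)
                offset += pj
            values.append(total / d[i])
        return max(values) < 1.0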
27 (2004a), we can instead consider regularization by the square of the block ℓ1-norm: $\min_{w\in\mathbb{R}^p,\, b\in\mathbb{R}}\ \frac{1}{2n}\big\|\bar{Y} - \bar{X}w - b 1_n\big\|^2 + \frac{\mu_n}{2}\Big(\sum_{j=1}^m d_j\|w_j\|\Big)^2$. [sent-238, score-0.458]
28 (1), by minimizing in closed form with respect to $b$, to obtain: $\min_{w\in\mathbb{R}^p}\ \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX}w + \frac{1}{2}w^\top\hat\Sigma_{XX}w + \frac{\mu_n}{2}\Big(\sum_{j=1}^m d_j\|w_j\|\Big)^2$. [sent-243, score-0.422]
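As an illustration of this reformulation (a sketch under assumed notation, not the paper's code), the centered empirical covariances and the squared block-norm objective can be evaluated as follows:

    import numpy as np

    def squared_blocknorm_objective(w, X, y, groups, d, mu):
        # Evaluates 0.5*S_yy - S_yX w + 0.5*w' S_XX w + (mu/2)*(sum_j d_j ||w_j||)^2
        # with centered empirical covariances.
        n = X.shape[0]
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        S_XX = Xc.T @ Xc / n
        S_Xy = Xc.T @ yc / n
        S_yy = yc @ yc / n
        block_norm = sum(d[j] * np.linalg.norm(w[g]) for j, g in enumerate(groups))
        return 0.5 * S_yy - S_Xy @ w + 0.5 * w @ S_XX @ w + 0.5 * mu * block_norm ** 2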
29 2): Proposition 8 A vector $w \in \mathbb{R}^p$ with sparsity pattern $J = \{j,\ w_j \neq 0\}$ is optimal for problem (12) if and only if $\forall j \in J^c,\ \big\|\hat\Sigma_{X_jX}w - \hat\Sigma_{X_jY}\big\| \leq \mu_n d_j \big(\sum_{i=1}^m d_i\|w_i\|\big)$, and $\forall j \in J,\ \hat\Sigma_{X_jX}w - \hat\Sigma_{X_jY} = -\mu_n \big(\sum_{i=1}^m d_i\|w_i\|\big)\, d_j\, \frac{w_j}{\|w_j\|}$. [sent-246, score-0.175]
30 Note the correspondence at the optimum between optimal solutions of the two optimization problems in Eq. [sent-247, score-0.352]
31 But since the relationship between $\lambda_n$ and $\mu_n$ at the optimum is $\lambda_n = \mu_n\big(\sum_{i=1}^m d_i\|\hat w_i\|\big)$, and $\sum_{i=1}^m d_i\|\hat w_i\|$ converges to a constant whenever $\hat w$ is consistent, it applies as well with minor modifications (in particular, to deal with the case where $J$ is empty, which requires $\mu_n = \infty$). [sent-252, score-0.135]
32 Covariance Operators and Multiple Kernel Learning We now extend the previous consistency results to the case of nonparametric estimation, where each group is a potentially infinite dimensional space of functions. [sent-254, score-0.133]
33 In this nonparametric context, covariance operators constitute appropriate tools for the statistical analysis and are becoming standard in the theoretical analysis of kernel methods (Fukumizu et al. [sent-262, score-0.126]
34 When the kernel is linear, the covariance operator is exactly the covariance matrix, and the Hilbert-Schmidt norm is the Frobenius norm, while the operator norm is the maximum singular value (also referred to as the spectral norm). [sent-285, score-0.205]
35 , m, then we can naturally define the cross-covariance operators $\Sigma_{X_iX_j}$ from $\mathcal{F}_j$ to $\mathcal{F}_i$ such that for all $(f_i, f_j) \in \mathcal{F}_i \times \mathcal{F}_j$, $\langle f_i, \Sigma_{X_iX_j} f_j\rangle_{\mathcal{F}_i} = \mathrm{cov}(f_i(X_i), f_j(X_j)) = E(f_i(X_i)f_j(X_j)) - (E f_i(X_i))(E f_j(X_j))$. [sent-318, score-0.295]
36 We can now define a joint block covariance operator on F = F1 × · · · × Fm following the block structure of covariance matrices in Section 2. [sent-323, score-0.11]
37 Throughout this paper we will make the assumption that those operators $C_{X_iX_j}$ are compact for $i \neq j$: compact operators can be characterized as limits of finite rank operators, or as operators that can be diagonalized on a countable basis with a spectrum composed of a sequence tending to zero (see, e. [sent-328, score-0.209]
38 (2008) requires considering consistency for norms weaker than the RKHS norms (Caponnetto and de Vito, 2005; Steinwart, 2001), and is left for future research. [sent-403, score-0.114]
39 (12), where empirical covariance matrices are replaced by empirical covariance operators: $\min_{f\in\mathcal{F}}\ \frac{1}{2}\hat\Sigma_{YY} - \langle f, \hat\Sigma_{XY}\rangle_{\mathcal{F}} + \frac{1}{2}\langle f, \hat\Sigma_{XX} f\rangle_{\mathcal{F}} + \frac{\mu_n}{2}\Big(\sum_{j=1}^m d_j\|f_j\|_{\mathcal{F}_j}\Big)^2$. [sent-447, score-0.15]
40 The main reason for rewriting the conditions in terms of correlation operators rather than covariance operators is that correlation operators are invertible by assumption, while covariance operators are not as soon as the Hilbert spaces are infinite dimensional. [sent-473, score-0.283]
41 The following theorems give necessary and sufficient conditions for the path consistency of the nonparametric group Lasso (see proofs in Appendix C. [sent-474, score-0.121]
42 If condition (16) is satisfied, then for any sequence $\mu_n$ such that $\mu_n \to 0$ and $\mu_n n^{1/2} \to +\infty$, any sequence of nonparametric group Lasso estimates $\hat f$ converges in probability to $f$ and the sparsity pattern $J(\hat f) = \{j,\ \hat f_j \neq 0\}$ converges in probability to $J$. [sent-477, score-0.199]
43 The optimal function may then be written as $f_j = \eta_j \sum_{i=1}^n \alpha_i\, k_j(\cdot, x_{ij})$. [sent-506, score-0.122]
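A small sketch of this kernel expansion follows (hypothetical names; the weights eta, the dual coefficients alpha and the kernel matrices are assumed to come from an MKL solver such as the publicly available code mentioned earlier):

    import numpy as np

    def mkl_function_values(K_list, eta, alpha):
        # K_list: per-group kernel matrices K_j evaluated on the training points,
        # eta: kernel weights eta_j, alpha: shared dual coefficients.
        # Returns the per-group components f_j and their sum f = sum_j f_j
        # at the training points, following f_j = eta_j * sum_i alpha_i k_j(., x_ij).
        f_blocks = [eta_j * (K_j @ alpha) for eta_j, K_j in zip(eta, K_list)]
        return f_blocks, np.sum(f_blocks, axis=0)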
44 Finally, we state a corollary of Proposition 13 that shows that, under our assumptions regarding the correlation operator, we have a unique solution to the nonparametric group Lasso problem with probability tending to one (see proof in Appendix A. [sent-518, score-0.128]
45 Adaptive Group Lasso and Multiple Kernel Learning In previous sections, we have shown that specific necessary and sufficient conditions are needed for path consistency of the group Lasso and multiple kernel learning. [sent-531, score-0.124]
46 We denote by $\hat w$ any minimizer of $\frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX}w + \frac{1}{2}w^\top\hat\Sigma_{XX}w + \frac{\mu_n}{2}\Big(\sum_{j=1}^m \|\hat w^{LS}_j\|^{-\gamma}\,\|w_j\|\Big)^2$. [sent-539, score-0.352]
47 The unique minimizer $\hat f^{\kappa_n}$ of $\frac{1}{2}\hat\Sigma_{YY} - \langle\hat\Sigma_{XY}, f\rangle_{\mathcal{F}} + \frac{1}{2}\langle f, \hat\Sigma_{XX}f\rangle_{\mathcal{F}} + \frac{\kappa_n}{2}\sum_{j=1}^m\|f_j\|^2_{\mathcal{F}_j}$ converges in probability to $f$ if $\kappa_n \to 0$ and $\kappa_n^{-1} n^{-1/2} \to 0$. [sent-547, score-0.247]
48 Since the least-squares estimate is consistent and we have an upper bound on its convergence rate, we follow Zou (2006) and use it to define adaptive weights $d_j$ for which we get both sparsity and regular consistency without any conditions on the value of the correlation operators. [sent-549, score-0.187]
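A minimal sketch of this adaptive weighting in Python/NumPy (the small ridge term added for numerical stability and the centering step are assumptions, not the paper's exact procedure):

    import numpy as np

    def adaptive_group_weights(X, y, groups, gamma=1.0, ridge=1e-8):
        # Least-squares estimate (slightly ridged for numerical safety),
        # then d_j = ||w_j^LS||^(-gamma) as in the adaptive scheme described above.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        p = X.shape[1]
        w_ls = np.linalg.solve(Xc.T @ Xc + ridge * np.eye(p), Xc.T @ yc)
        return np.array([np.linalg.norm(w_ls[g]) ** (-gamma) for g in groups])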
49 1 Groups of Finite Sizes In the finite dimensional group case, we sampled X ∈ R p from a normal distribution with zero mean vector and a covariance matrix of size p = 8 for m = 4 groups of size p j = 2, j = 1, . [sent-574, score-0.129]
50 We selected Card(J) = 2 groups at random and sampled nonzero loading vectors as follows: (a) sample each loading from independent standard normal distributions, (b) rescale those to unit norm, (c) rescale those by a scaling which is uniform at random between 1/3 and 1. [sent-581, score-0.151]
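A sketch of this simulation protocol (the covariance structure and the noise level are placeholders, since they are not fully specified in the extract):

    import numpy as np

    def sample_synthetic(n, seed=0, noise_std=0.1):
        rng = np.random.default_rng(seed)
        m, pj = 4, 2                                   # m = 4 groups of size p_j = 2, so p = 8
        p = m * pj
        groups = [np.arange(j * pj, (j + 1) * pj) for j in range(m)]
        A = rng.standard_normal((p, p))
        Sigma = A @ A.T / p                            # placeholder covariance matrix
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        J = rng.choice(m, size=2, replace=False)       # Card(J) = 2 active groups
        w = np.zeros(p)
        for j in J:
            wj = rng.standard_normal(pj)
            wj /= np.linalg.norm(wj)                   # (b) rescale to unit norm
            w[groups[j]] = rng.uniform(1.0 / 3.0, 1.0) * wj   # (c) uniform scaling in [1/3, 1]
        y = X @ w + noise_std * rng.standard_normal(n)
        return X, y, w, groups, J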
51 This allows [page/figure residue: regularization-path panels labelled 'consistent / inconsistent − adaptive (γ = 1)' and '− non adaptive', x-axis −log(µ), y-axis P(correct pattern)]. [sent-599, score-0.167]
55 When it is less than one, then we get with [page/figure residue: probability-of-correct-pattern curves for n = 10^1, ..., 10^5 versus −log(µ)]. [sent-648, score-0.218]
57 We compare them to the population values η_j, both in terms of values, [figure residue: 'inconsistent − non adaptive' and 'inconsistent − adaptive (γ=1)' panels]. [sent-683, score-0.218]
58 [figure residue: P(correct pattern) vs. −log(µ) for n = 10^1, ..., 10^5] Figure 4: Synthetic example where consistency condition in Eq. [sent-691, score-0.215]
59 In particular, under practical assumptions regarding the distributions the data are sampled from, we have provided necessary and sufficient conditions for model consistency of the group Lasso and its nonparametric version, multiple kernel learning. [sent-699, score-0.146]
60 [figure residue: legend 'correct sparsity, regular consistency', 'correct sparsity, no regular consistency', 'incorrect sparsity']. [sent-700, score-0.186]
61 Figure 6: Functions to be estimated in the synthetic nonparametric group Lasso experiments (left: consistent case, right: inconsistent case). [sent-721, score-0.118]
62 The current work could be extended in several ways: first, a more detailed study of the limiting distributions of the group Lasso and adaptive group Lasso estimators could be carried out, extending the analysis of Zou (2006) or Juditsky and Nemirovski (2000) and Wu et al. [sent-722, score-0.145]
63 [page/figure residue: regularization-path panels 'consistent / inconsistent − adaptive (γ = 2)' and '− non adaptive', x-axis −log(µ)]. [sent-728, score-0.167]
65 Figure 7: Regularization paths for the group Lasso for two weighting schemes (left: non adaptive, right: adaptive) and two different population densities (top: strict consistency condition satisfied, bottom: weak condition not satisfied). [sent-744, score-0.214]
66 We consider the Lagrangian with dual variables $(\beta_j, \gamma_j) \in \mathbb{R}^{p_j}\times\mathbb{R}$ such that $\|\beta_j\| \leq \gamma_j$: $\mathcal{L}(w, v, \beta, \gamma) = \frac{1}{2}\hat\Sigma_{YY} - \hat\Sigma_{YX}w + \frac{1}{2}w^\top\hat\Sigma_{XX}w + \lambda_n d^\top v - \sum_{j=1}^m \big(\beta_j^\top w_j + \gamma_j v_j\big)$. [sent-754, score-0.386]
67 Since the dual and the primal problems are strictly feasible, strong duality holds and the KKT conditions for reduced primal/dual variables (w, β) are ˆ ˆ ∀ j, β j = ΣX j X w − ΣX jY m ∀ j, ∑ d j w j = j=1 ∀ j, βj dj wj + wj (stationarity - 1) , βi 1 max µn i=1,. [sent-772, score-0.822]
68 Complementary slackness for the second order cone implies that: βj dj wj + wj i=1,. [sent-779, score-0.787]
69 ,m βi = 0, di βj dj = max max if and only if, either (a) w j = 0, or (b) w j = 0 and i=1,. [sent-782, score-0.125]
70 ,m di j=1 α 1n = 0 ∀ j, ¯ −X j α dj (stationarity - 1) , (stationarity - 3) , wj + wj (α Ki α)1/2 =0 i=1,. [sent-833, score-0.829]
71 Complementary slackness for the second order cone leads to: ¯ −X j α dj wj + wj (α Ki α)1/2 = 0, i=1,. [sent-837, score-0.787]
72 5 Proof of Proposition 14 What makes this proposition non-obvious is the fact that the covariance operator $\Sigma_{XX}$ is not invertible in general. [sent-854, score-0.167]
73 Thus, non-uniqueness implies that the empirical covariance matrix of the random variables $g_j(X_j)$, $j \in J$, is not invertible. [sent-863, score-0.122]
74 Since $\tilde w$ is consistent and $\lambda_n n^{1/2} \to +\infty$, for each $i \in J^c$, $\frac{1}{d_i\lambda_n}\big(\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J\big)$ converges in probability to $\frac{1}{d_i}\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j/\|w_j\|)w_J$, which is of norm strictly smaller than one because condition (4) is satisfied. [sent-893, score-0.151]
75 Thus the probability that w is indeed optimal, which is equal ˜ to 1 1 ˆ ˆ ˆ ˆ 1 ΣXiY − ΣXi XJ wJ ˜ ˜ P ∀i ∈ Jc , ∏c P di λn ΣXiY − ΣXi XJ wJ 1 , d i λn i∈J is tending to 1, which implies the theorem. [sent-894, score-0.108]
76 The first term is $O_p(n^{-1/2}\lambda_n) = o_p(\lambda_n^2)$, while the last ones are equal to $\|w_j + \lambda_n\Delta_j\| - \|w_j\| = \lambda_n \frac{w_j^\top}{\|w_j\|}\Delta_j + o_p(\lambda_n)$. [sent-920, score-0.704]
77 Given $\tilde w$ defined through Lemmas 20 and 21, we need to satisfy optimality condition (2) for all $i \in J^c$, with probability tending to one. [sent-925, score-0.104]
78 1, that the optimality condition is indeed satisfied with probability tending to one. [sent-927, score-0.104]
79 (25), we get: $\frac{1}{\lambda_n}\big(\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde w_J\big) = O_p(n^{-1/2}\lambda_n^{-1}) + \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j/\|w_j\|)w_J + \lambda_n\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}\!\Big(\tfrac{d_j}{\|w_j\|}\Big(I_{p_j} - \tfrac{w_jw_j^\top}{\|w_j\|^2}\Big)\Big)\Delta + o_p(\lambda_n) = A + \lambda_n B + o_p(\lambda_n) + O_p(n^{-1/2}\lambda_n^{-1})$. [sent-934, score-0.724]
80 Moreover, if $E_1$ is true, then the group Lasso estimate has the correct sparsity pattern if and only if, for all $i \in J^c$, $\big\|\hat\Sigma_{X_iX_J}(\tilde w_J - w_J) - \hat\Sigma_{X_i\varepsilon}\big\| \leq \lambda_n d_i = \lambda_0 n^{-1/2} d_i$. [sent-943, score-0.218]
81 Thus, the probability P(maxi∈Jc si /di λ0 ) converges to P max c i∈J 1 σ(ΣXi XJ Σ−1XJ vJ − vi ) − λ0 ΣXi XJ Σ−1XJ Diag(d j / w j )wJ XJ XJ di λ0 . [sent-952, score-0.11]
82 Under the event $E_1$, which has probability tending to one, we have correct pattern selection if and only if $\max_{i\in J^c} s_i/d_i \leq \lambda_0$, which leads to $P\big(\max_{i\in J^c}\frac{1}{d_i}\big\|\sigma t_i - \lambda_0\Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}\mathrm{Diag}(d_j/\|w_j\|)w_J\big\| \leq \lambda_0\big)$, where $t_i = \Sigma_{X_iX_J}\Sigma_{X_JX_J}^{-1}v_J - v_i$. [sent-953, score-0.163]
83 (2007), which states that the empirical covariance estimator converges in probability at rate $O_p(n^{-1/2})$ to the population covariance operators: Lemma 22 Assume (A4) and (A6). [sent-959, score-0.105]
84 2 Proof of Theorem 11 We now extend Lemma 20 to covariance operators, which requires using the alternative formulation and a slower rate of decrease for the regularization parameter: Lemma 25 Let $\tilde f_J$ be any minimizer of $\frac{1}{2}\hat\Sigma_{YY} - \langle\hat\Sigma_{X_JY}, f_J\rangle_{\mathcal{F}_J} + \frac{1}{2}\langle f_J, \hat\Sigma_{X_JX_J}f_J\rangle_{\mathcal{F}_J} + \frac{\mu_n}{2}\Big(\sum_{j\in J} d_j\|f_j\|_{\mathcal{F}_j}\Big)^2$. [sent-972, score-0.149]
85 n Proof Note that from Cauchy-Schwarz inequality, we have: 2 ∑ dj j∈J f j Fj 1/2 1/2 = ∑ d j f j Fj × j∈J ∑ dj j∈J f j Fj ∑ j∈J 1/2 dj 2 f j Fj 1/2 f j Fj dj fj 2 j F f j Fj , 5. [sent-975, score-0.487]
86 We consider the unique minimizer f¯J of the following cost function, built by replacing the regularization by its upperbound, 1ˆ 1 µn ˆ ˆ F( fJ ) = ΣYY − ΣXJY , fJ FJ + fJ , ΣXJ XJ fJ FJ + 2 2 2 ∑ dj f j Fj j∈J dj fj 2 j F ∑ fj F . [sent-978, score-0.593]
87 Note that $D$ is upper-bounded and lower-bounded, as a self-adjoint operator, by strictly positive constants times the identity operator (with probability tending to one), that is, $D_{\max} I_{\mathcal{F}_J} \succeq D \succeq D_{\min} I_{\mathcal{F}_J}$ with $D_{\min}, D_{\max} > 0$. [sent-980, score-0.103]
88 fi F j Since the right hand side of the previous equation corresponds to a continuously differentiable function of fJ around fJ (with upper-bounded derivatives around fJ ), we have: ∇ fi Fn ( f¯J ) − 0 Fi 1/2 Cµn fJ − f¯J FJ = µn O p (µn + n−1/2 µ−1 ). [sent-990, score-0.128]
89 Since, by Proposition 14, the solution is unique with probability tending to one, we need to prove that, with probability tending to one, $\tilde f$ is optimal for the problem in Eq. [sent-999, score-0.106]
90 Thus $P\big(\frac{1}{d_i\,\mu_n\,\|f\|_d}\,\|\hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\tilde f_J\|_{\mathcal{F}_i} \leq 1\big)$ is tending to 1, which implies the theorem (using the same arguments as in the proof of Theorem 2 in Appendix B. [sent-1025, score-0.134]
91 With probability tending to one, we have the optimality condition (15): ˆ ˆ ˆ ˆ ΣXJ ε + ΣXJ XJ fJ = ΣXJY = ΣXJ XJ fˆJ + µn fˆ If we denote Dn = n1/2 µn fˆ d Diag(d j / d Diag(d j / fˆj F j ) fˆJ . [sent-1035, score-0.104]
92 We have: ˆ ˆ ˆ n1/2 vi , ΣXiY − ΣXi XJ fˆJ Fi = n1/2 vi , ΣXi ε Fi + wJ , ΣXJ XJ n1/2 (fJ − fˆJ ) FJ + o p (1) ˆ ˆ = n1/2 vi , ΣXi ε Fi + wJ , µ0 D fJ − n1/2 ΣXJ ε FJ + o p (1) ˆ ˆ = wJ , µ0 D fJ FJ +n1/2 vi , ΣXi ε Fi −n1/2 wJ , ΣXJ ε FJ + o p (1). [sent-1041, score-0.12]
93 2 shows that we have: 2 EEn (1 − 1/n)σ2 vi , ΣXi Xi vi Fi + σ2 wJ , ΣXJ XJ wJ FJ − 2σ2 vi , ΣXi XJ wJ Fi min min min 2 2 = (1 − 1/n)(σmin vi , ΣXi Xi vi Fi − σmin vi , ΣXi XJ wJ Fi ). [sent-1047, score-0.18]
94 Thus the ˆ ˆ probability P n1/2 vi , ΣXiY − ΣXi XJ fˆJ Fi di fˆ d vi Fi + 1 converges to a strictly positive limit (note that fˆ d can be replaced by f d without changing the result). [sent-1054, score-0.16]
95 Since $\mu_n n^{1/2} \to \mu_0 < \infty$, this implies that $P\big(\mu_n^{-1}\langle v_i, \hat\Sigma_{X_iY} - \hat\Sigma_{X_iX_J}\hat f_J\rangle_{\mathcal{F}_i} > d_i\,\|\hat f\|_d\,\|v_i\|_{\mathcal{F}_i}\big)$ is asymptotically strictly positive (i. [sent-1055, score-0.135]
96 Σ CX X C−1 Diag(d j / f j F j )gJ di Xi Xi i J XJ XJ Fi Since with probability tending to one J( fˆ) = J, with probability tending to one, we have from optimality condition (15), and the usual line of arguments (see Eq. [sent-1061, score-0.212]
97 Following the same argument as in the proof of Theorem 11, (and because µn n1/2 → +∞ as a consequence of Proposition 26), the first term in the last expression (divided by µn ) converges to 1/2 −1 vi = ΣXi Xi CXi XJ CXJ XJ f d Diag(d j / f j F j )gJ By assumption vi > di f d . [sent-1064, score-0.152]
98 Thus there is a non-vanishing probability of being strictly larger than $d_i$, which implies that with non-vanishing probability the optimality condition (14) is not satisfied, which is a contradiction. [sent-1068, score-0.153]
99 Proof of Results on Adaptive Group Lasso In this appendix, we give proofs of the consistency of the adaptive group Lasso procedures. [sent-1089, score-0.142]
100 We consider the eigenbasis of the non centered covariance operators on each F j , j = 1, . [sent-1123, score-0.12]
wordName wordTfidf (topN-words)
[('xj', 0.822), ('wj', 0.352), ('fj', 0.207), ('xx', 0.153), ('lasso', 0.114), ('cxj', 0.099), ('diag', 0.094), ('dj', 0.07), ('bach', 0.07), ('asso', 0.067), ('fi', 0.064), ('xiy', 0.064), ('roup', 0.057), ('di', 0.055), ('tending', 0.053), ('xi', 0.052), ('group', 0.051), ('onsistency', 0.051), ('consistency', 0.048), ('cxi', 0.047), ('jc', 0.044), ('xjc', 0.044), ('ernel', 0.043), ('adaptive', 0.043), ('loading', 0.042), ('non', 0.041), ('ultiple', 0.041), ('covariance', 0.04), ('operators', 0.039), ('proposition', 0.039), ('yy', 0.037), ('px', 0.035), ('norms', 0.033), ('fukumizu', 0.032), ('sparsity', 0.032), ('condition', 0.031), ('dn', 0.03), ('vi', 0.03), ('operator', 0.03), ('xjy', 0.029), ('xy', 0.029), ('yuan', 0.026), ('groups', 0.026), ('converges', 0.025), ('kernel', 0.025), ('inconsistent', 0.025), ('regularization', 0.024), ('gj', 0.023), ('earning', 0.022), ('nonparametric', 0.022), ('ls', 0.022), ('jw', 0.02), ('optimality', 0.02), ('lanckriet', 0.02), ('strictly', 0.02), ('hilbert', 0.02), ('consistent', 0.02), ('rkhs', 0.02), ('norm', 0.02), ('ek', 0.019), ('xw', 0.018), ('baker', 0.018), ('cxx', 0.018), ('dgj', 0.018), ('hilbertian', 0.018), ('uj', 0.018), ('vj', 0.018), ('invertible', 0.017), ('ki', 0.017), ('wls', 0.017), ('vaart', 0.017), ('regular', 0.017), ('dual', 0.016), ('integrable', 0.016), ('correlation', 0.015), ('der', 0.015), ('minimizer', 0.015), ('fm', 0.014), ('dk', 0.014), ('bn', 0.014), ('theorem', 0.014), ('lemma', 0.013), ('slackness', 0.013), ('pattern', 0.013), ('dx', 0.013), ('weighting', 0.012), ('dmin', 0.012), ('fn', 0.012), ('weights', 0.012), ('stationarity', 0.012), ('correct', 0.012), ('proof', 0.012), ('primal', 0.012), ('dimensional', 0.012), ('zou', 0.012), ('dfj', 0.012), ('jy', 0.012), ('kernels', 0.012), ('square', 0.012), ('covariates', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999899 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
2 0.11238918 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
3 0.068077974 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added ℓ1-norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive ℓ1-norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
4 0.066013388 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.060655516 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
6 0.051370114 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.04008558 52 jmlr-2008-Learning from Multiple Sources
8 0.038603973 86 jmlr-2008-SimpleMKL
9 0.036374837 92 jmlr-2008-Universal Multi-Task Kernels
10 0.034278046 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
11 0.032715946 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
12 0.032194566 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
13 0.031106317 83 jmlr-2008-Robust Submodular Observation Selection
14 0.02982525 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
15 0.027522493 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
16 0.026812937 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
17 0.025808163 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
18 0.024439165 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
19 0.023791395 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
20 0.022571769 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
topicId topicWeight
[(0, 0.131), (1, -0.055), (2, -0.073), (3, -0.001), (4, 0.094), (5, -0.061), (6, -0.063), (7, 0.066), (8, -0.053), (9, -0.042), (10, 0.052), (11, -0.099), (12, -0.033), (13, 0.122), (14, 0.213), (15, -0.196), (16, 0.036), (17, 0.001), (18, -0.278), (19, 0.075), (20, 0.082), (21, 0.011), (22, 0.146), (23, -0.029), (24, 0.031), (25, -0.065), (26, -0.074), (27, 0.246), (28, -0.02), (29, -0.093), (30, 0.008), (31, 0.046), (32, -0.025), (33, -0.168), (34, 0.006), (35, 0.051), (36, -0.067), (37, 0.234), (38, -0.093), (39, -0.01), (40, -0.227), (41, -0.082), (42, 0.216), (43, -0.028), (44, -0.297), (45, -0.113), (46, 0.009), (47, 0.006), (48, -0.079), (49, 0.166)]
simIndex simValue paperId paperTitle
same-paper 1 0.97136635 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
2 0.6641922 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
3 0.23248044 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
Author: Liviu Panait, Karl Tuyls, Sean Luke
Abstract: This paper presents the dynamics of multiple learning agents from an evolutionary game theoretic perspective. We provide replicator dynamics models for cooperative coevolutionary algorithms and for traditional multiagent Q-learning, and we extend these differential equations to account for lenient learners: agents that forgive possible mismatched teammate actions that resulted in low rewards. We use these extended formal models to study the convergence guarantees for these algorithms, and also to visualize the basins of attraction to optimal and suboptimal solutions in two benchmark coordination problems. The paper demonstrates that lenience provides learners with more accurate information about the benefits of performing their actions, resulting in higher likelihood of convergence to the globally optimal solution. In addition, the analysis indicates that the choice of learning algorithm has an insignificant impact on the overall performance of multiagent learning algorithms; rather, the performance of these algorithms depends primarily on the level of lenience that the agents exhibit to one another. Finally, the research herein supports the strength and generality of evolutionary game theory as a backbone for multiagent learning. Keywords: multiagent learning, reinforcement learning, cooperative coevolution, evolutionary game theory, formal models, visualization, basins of attraction
4 0.21765471 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
5 0.15947337 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
6 0.1415835 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
7 0.12864302 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
9 0.123236 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
10 0.12145147 86 jmlr-2008-SimpleMKL
11 0.11217836 52 jmlr-2008-Learning from Multiple Sources
12 0.11028875 96 jmlr-2008-Visualizing Data using t-SNE
13 0.10705356 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
14 0.10615925 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
15 0.099568948 80 jmlr-2008-Ranking Individuals by Group Comparisons
16 0.093217552 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.09199129 83 jmlr-2008-Robust Submodular Observation Selection
18 0.090998553 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
19 0.089203194 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
20 0.085351907 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
topicId topicWeight
[(0, 0.034), (5, 0.011), (12, 0.307), (40, 0.032), (54, 0.04), (58, 0.069), (61, 0.065), (66, 0.097), (76, 0.016), (78, 0.03), (88, 0.062), (92, 0.039), (94, 0.036), (99, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.78044617 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
Author: Imhoi Koo, Rhee Man Kil
Abstract: This paper presents a new method of model selection for regression problems using the modulus of continuity. For this purpose, we suggest the prediction risk bounds of regression models using the modulus of continuity which can be interpreted as the complexity of functions. We also present the model selection criterion referred to as the modulus of continuity information criterion (MCIC) which is derived from the suggested prediction risk bounds. The suggested MCIC provides a risk estimate using the modulus of continuity for a trained regression model (or an estimation function) while other model selection criteria such as the AIC and BIC use structural information such as the number of training parameters. As a result, the suggested MCIC is able to discriminate the performances of trained regression models, even with the same structure of training models. To show the effectiveness of the proposed method, the simulation for function approximation using the multilayer perceptrons (MLPs) was conducted. Through the simulation for function approximation, it was demonstrated that the suggested MCIC provides a good selection tool for nonlinear regression models, even with the limited size of data. Keywords: regression models, multilayer perceptrons, model selection, information criteria, modulus of continuity
3 0.477694 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
4 0.42538434 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
5 0.40920717 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
6 0.39734334 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
7 0.38941634 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
8 0.38815385 57 jmlr-2008-Manifold Learning: The Price of Normalization
9 0.38592035 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
10 0.38348848 56 jmlr-2008-Magic Moments for Structured Output Prediction
11 0.38337076 9 jmlr-2008-Active Learning by Spherical Subdivision
12 0.38332674 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
13 0.38145611 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
14 0.38039276 83 jmlr-2008-Robust Submodular Observation Selection
15 0.37977767 86 jmlr-2008-SimpleMKL
17 0.37911761 58 jmlr-2008-Max-margin Classification of Data with Absent Features
18 0.37450743 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
19 0.37327582 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
20 0.37240112 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections