jmlr jmlr2011 jmlr2011-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation 1. [sent-11, score-0.107]
2 Introduction We consider the problem of estimating a sparse signal in the presence of noise. [sent-12, score-0.15]
3 It has been shown (…, 2009) that considering related estimation tasks jointly can improve estimation performance. [sent-15, score-0.102]
4 (2011) propose to minimize the penalized least squares objective with a mixed (2, 1)-norm on the coefficients as the penalty term. [sent-36, score-0.223]
5 Negahban and Wainwright (2009) use the mixed (∞, 1)-norm to penalize the coefficients and focus on the exact recovery of the non-zero pattern of the regression coefficients, rather than the support set S. [sent-38, score-0.255]
6 For the rather limited case of k = 2, the authors show that when the regression problems do not share a common support, it may be harmful to consider them jointly using the mixed (∞, 1)-norm penalty. [sent-39, score-0.19]
7 In particular, to the best of our knowledge, there has been no previous work that sharply characterizes the performance of different penalization schemes on the problem of selecting the relevant variables in the multi-task setting. [sent-49, score-0.128]
8 Our results provide a sharp characterization of the sparsity patterns under which the Lasso procedure performs better than the group Lasso. [sent-53, score-0.261]
9 Similarly, our results characterize how the group Lasso (with the mixed (2, 1) norm) can perform better when each non-zero row is dense. [sent-54, score-0.352]
10 However, the Normal means model (3) allows us to analyze the estimation procedure in (4) and focus on the scaling of the important parameters (n, k, p, s, ε, µmin ) for the success of the support recovery. [sent-120, score-0.174]
11 Using the model (3) and the estimation procedure in (4), we are able to identify regimes in which estimating the support is more efficient using the ordinary Lasso than with the multi-task Lasso and vice versa. [sent-121, score-0.156]
12 An alternative representation of the model is $Y_{ij} \sim N(\xi_{ij}\mu_{ij}, \sigma^2)$ for $j \in [k],\ i \in S$, and $Y_{ij} \sim N(0, \sigma^2)$ for $j \in [k],\ i \in S^c$, where $\xi_{ij}$ is a Bernoulli random variable with success probability ε. [sent-123, score-0.155]
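As a quick illustration of this representation, the sketch below draws a data matrix Y from it. The parameter values, the choice ε = k^(−β), and the function name are illustrative assumptions for this sketch, not the paper's experimental settings.

```python
import numpy as np

def simulate_normal_means(p=128, k=64, s=8, beta=0.75, mu_min=1.0, sigma=1.0, seed=0):
    """Draw Y from the two-group representation of model (3).

    Rows in S have entries N(xi_ij * mu_min, sigma^2) with xi_ij ~ Bernoulli(eps),
    where eps = k**(-beta) is assumed; rows outside S are pure noise N(0, sigma^2).
    """
    rng = np.random.default_rng(seed)
    eps = k ** (-beta)                        # assumed within-row sparsity level
    S = rng.choice(p, size=s, replace=False)  # indices of the non-zero rows
    mu = np.zeros((p, k))
    xi = rng.binomial(1, eps, size=(s, k))    # Bernoulli selectors for rows in S
    mu[S] = xi * mu_min
    Y = mu + sigma * rng.standard_normal((p, k))
    return Y, mu, S

Y, mu, S = simulate_normal_means()
print(Y.shape, sorted(S))
```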
13 Under the model (3), we analyze penalized least squares procedures of the form $\hat\mu = \operatorname*{argmin}_{\mu \in \mathbb{R}^{p\times k}} \tfrac{1}{2}\|Y - \mu\|_F^2 + \mathrm{pen}(\mu)$ (4), where $\|A\|_F = (\sum_{jk} A_{jk}^2)^{1/2}$ is the Frobenius norm, pen(·) is a penalty function, and µ is a p × k matrix of means. [sent-126, score-0.146]
14 the mixed (2, 1)-norm penalty $\mathrm{pen}(\mu) = \lambda \sum_{i\in[p]} \|\theta_i\|_2$, which corresponds to the multi-task Lasso formulation in Obozinski et al. [sent-129, score-0.192]
15 the mixed (∞, 1)-norm penalty $\mathrm{pen}(\mu) = \lambda \sum_{i\in[p]} \|\theta_i\|_\infty$, which corresponds to the multi-task Lasso formulation in Negahban and Wainwright (2009); we denote the resulting estimate by $\hat\mu_{\ell_1/\ell_\infty}$. [sent-132, score-0.192]
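For concreteness, here is a small sketch of these penalties and of the objective (4) as plain NumPy functions; the entrywise ℓ1 penalty of the ordinary Lasso is included alongside the two mixed norms for comparison. This is only an illustration of the definitions, not code from the paper.

```python
import numpy as np

def pen_l1(mu, lam):
    """Entrywise l1 penalty (ordinary Lasso): lam * sum_ij |mu_ij|."""
    return lam * np.abs(mu).sum()

def pen_l1_l2(mu, lam):
    """Mixed (2,1)-norm penalty: lam * sum over rows of the row l2 norms."""
    return lam * np.linalg.norm(mu, ord=2, axis=1).sum()

def pen_l1_linf(mu, lam):
    """Mixed (inf,1)-norm penalty: lam * sum over rows of the row sup norms."""
    return lam * np.abs(mu).max(axis=1).sum()

def objective(Y, mu, lam, pen):
    """Penalized least squares objective (4): 0.5 * ||Y - mu||_F^2 + pen(mu)."""
    return 0.5 * np.linalg.norm(Y - mu, "fro") ** 2 + pen(mu, lam)
```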
16 These results are complemented with necessary conditions for the recovery of the support set S. [sent-135, score-0.095]
17 There is a large literature on penalized least squares estimation with concave penalties, as introduced in Fan and Li (2001). [sent-137, score-0.098]
18 Our result can be interpreted as follows: for any estimation procedure there exists a model given by (3) with non-zero elements equal to µmin such that the estimation procedure will make an error when identifying the set S with probability bounded away from zero. [sent-144, score-0.147]
19 We establish the sufficient conditions on the signal strength µmin for the Lasso and both variants of the group Lasso under which these procedures can correctly identify the set of nonzero rows S. [sent-146, score-0.493]
20 By comparing the lower bounds with the sufficient conditions, we are able to identify regimes in which each procedure is optimal for the problem of identifying the set of non-zero rows S. [sent-147, score-0.167]
21 Furthermore, we point out that using the popular group Lasso with the mixed (∞, 1) norm can be disastrous when features are not perfectly shared among tasks. [sent-148, score-0.344]
22 We start by analyzing the lower bound for any procedure for the problem of identifying the set of non-zero rows in Section 2. [sent-152, score-0.122]
23 In Section 3 we provide sufficient conditions on the signal strength µmin for the Lasso and the group Lasso to be able to detect the set of non-zero rows S. [sent-153, score-0.441]
24 Theorem 1 Let $\mu_{\min}^2 = \mu_{\min}^2(n, k, p, s, \varepsilon, \beta) = \sigma^2 \ln\big(1 + u + \sqrt{2u + u^2}\big)$, where $u = \ln\big(1 + \alpha^2(p-s+1)/2\big) \big/ \big(2k^{1-2\beta}\big)$. [sent-180, score-0.578]
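Taking the reconstructed display above at face value, the lower-bound threshold can be evaluated numerically; the sketch below is only a reading aid for the theorem and inherits any imprecision in the reconstruction, so it should not be treated as the paper's exact constant.

```python
import numpy as np

def mu_min_squared_lower_bound(p, s, k, beta, alpha, sigma=1.0):
    """Evaluate mu_min^2 = sigma^2 * ln(1 + u + sqrt(2u + u^2)) with
    u = ln(1 + alpha^2 * (p - s + 1) / 2) / (2 * k**(1 - 2*beta)),
    following the reconstructed statement of Theorem 1 (a sketch, not a quote)."""
    u = np.log(1.0 + alpha ** 2 * (p - s + 1) / 2.0) / (2.0 * k ** (1.0 - 2.0 * beta))
    return sigma ** 2 * np.log(1.0 + u + np.sqrt(2.0 * u + u ** 2))

print(mu_min_squared_lower_bound(p=1024, s=8, k=64, beta=0.25, alpha=0.05))
```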
25 The result can be interpreted in words in the following way: whatever the estimation procedure $\hat\mu$, there exists some matrix M ∈ M[µmin, s] such that the probability of incorrectly identifying the support S(M) is bounded away from zero. [sent-182, score-0.137]
26 The parameter α′ controls the probability of making a type one error, $P[\exists i \in [p] : i \in \hat S(\hat\mu) \text{ and } i \notin S] \le \alpha'$; that is, the parameter α′ upper bounds the probability that some zero row of the matrix M is estimated as a non-zero row. [sent-187, score-0.221]
27 Likewise, the parameter δ′ controls the probability of making a type two error, $P[\exists i \in [p] : i \notin \hat S(\hat\mu) \text{ and } i \in S] \le \delta'$; that is, the parameter δ′ upper bounds the probability that some non-zero row of the matrix M is estimated as a zero row. [sent-188, score-0.221]
28 Setting the right hand side to α′/(p − s) in the above display, we obtain that λ can be set as $\lambda = \sigma\sqrt{2\ln\big(2k(p-s)/(\sqrt{2\pi}\,\alpha')\big)}$ (8), and (7) holds as soon as $\sqrt{2\ln\big(2k(p-s)\big)} \ge 1$. [sent-199, score-0.43]
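Because the Normal means design is orthogonal, the ℓ1-penalized estimate reduces to entrywise soft thresholding; the sketch below applies that closed form with λ chosen as in (8). The closed form is a standard fact assumed here, and the function name and interface are illustrative, not quoted from the paper.

```python
import numpy as np

def lasso_support(Y, s, alpha_prime, sigma=1.0):
    """Entrywise soft thresholding for the l1 penalty in the Normal means model,
    with lambda chosen as in display (8) (reconstructed form, treated as an assumption)."""
    p, k = Y.shape
    lam = sigma * np.sqrt(2.0 * np.log(2.0 * k * (p - s) / (np.sqrt(2.0 * np.pi) * alpha_prime)))
    mu_hat = np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)
    S_hat = np.flatnonzero(np.abs(mu_hat).sum(axis=1) > 0)  # rows with any surviving entry
    return mu_hat, S_hat
```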
29 Suppose µmin satisfies one of the following two cases: (i) $\mu_{\min} = \sigma\sqrt{2r\ln k}$ where $r > \big(\sqrt{1 + C_{k,p,s}} - \sqrt{1-\beta}\big)^2$ with $C_{k,p,s} = \ln\big(2(p-s)/(\sqrt{2\pi}\,\alpha')\big)/\ln k$ and $\lim_{n\to\infty} C_{k,p,s} \in [0, \infty)$; (ii) $\mu_{\min} \ge \lambda$ when $\lim_{n\to\infty} \ln k/\ln(p-s) = 0$ and $k^{1-\beta}/2 \ge \ln(s/\delta')$. [sent-205, score-0.804]
30 The two different cases describe two different regimes characterized by the ratio of ln k and ln(p − s). [sent-209, score-0.246]
31 Now we can compare the lower bound on $\mu_{\min}^2$ from Theorem 1 and the upper bound from Theorem 2. [sent-210, score-0.163]
32 We have that when β < 1/2 the lower bound is of the order $O\big(k^{\beta-1/2}\sqrt{\ln(p-s)}\big)$ and the upper bound is of the order $O\big(\ln(k(p-s))\big)$. [sent-212, score-0.276]
33 Ignoring the logarithmic terms in p and s, we have that the lower bound is of the order $O(k^{\beta-1/2})$ and the upper bound is of the order $O(\ln k)$, which implies that the Lasso does not achieve the lower bound when the non-zero rows are dense. [sent-213, score-0.157]
34 When the non-zero rows are sparse, β > 1/2, both the lower and upper bounds are of the order $O(\ln k)$ (ignoring the terms depending on p and s). [sent-214, score-0.111]
35 2 Upper Bounds for the Group Lasso Recall that the group Lasso estimator is given as $\hat\mu_{\ell_1/\ell_2} = \operatorname*{argmin}_{\mu\in\mathbb{R}^{p\times k}} \tfrac{1}{2}\|Y-\mu\|_F^2 + \lambda\sum_{i\in[p]}\|\theta_i\|_2$, where $\theta_i = (\mu_{ij})_{j\in[k]}$. [sent-216, score-0.193]
36 The group Lasso estimator can be obtained in closed form as the result of the following row-wise thresholding operation (see, for example, Friedman et al. [sent-217, score-0.169]
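A minimal sketch of that row-wise thresholding operation, assuming the standard group soft-thresholding form for an orthogonal design (the function name and the support rule are illustrative choices):

```python
import numpy as np

def group_soft_threshold(Y, lam):
    """Row-wise group soft thresholding:
    theta_hat_i = max(0, 1 - lam / ||Y_i||_2) * Y_i, i.e. row i is set to zero
    whenever its l2 norm does not exceed lam."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))  # guard against zero rows
    mu_hat = scale * Y
    S_hat = np.flatnonzero(norms.ravel() > lam)  # estimated set of non-zero rows
    return mu_hat, S_hat
```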
37 Then the test $\varphi_\alpha$ satisfies $P[\varphi_\alpha = 1] \le \alpha$ when $f = 0$ and $P[\varphi_\alpha = 0] \le \delta$ for all $f$ such that $\|f\|_2^2 \ge 2(\sqrt{5}+4)\,\sigma^2\,\ln\!\big(2e/(\alpha\delta)\big)\sqrt{n}$. [sent-225, score-0.201]
38 Then $P[\hat S(\hat\mu_{\ell_1/\ell_2}) \neq S] \le \alpha$ if $\mu_{\min} \ge \sigma\sqrt{\frac{2(\sqrt{5}+4)\,k^{\beta-1/2}}{1-c}\,\ln\!\Big(\frac{2e(2s-\delta')(p-s)}{\alpha'\delta'}\Big)}$, where $c = \sqrt{2\ln(2s/\delta')/k^{1-\beta}}$. [sent-230, score-0.201]
39 Using Theorem 1 and Theorem 4 we can compare the lower bound on $\mu_{\min}^2$ and the upper bound. [sent-233, score-0.14]
40 When each non-zero row is dense, that is, when β < 1/2, both the lower and upper bounds are of the order $O(k^{\beta-1/2})$ (ignoring the logarithmic terms in p and s). [sent-235, score-0.105]
41 This suggests that the group Lasso performs better than the Lasso when there is a lot of feature sharing between different tasks. [sent-236, score-0.198]
42 However, when β > 1/2, that is, in the sparse non-zero row regime, we see that the lower bound is of the order $O(\ln k)$ whereas the upper bound is of the order $O(k^{\beta-1/2})$. [sent-238, score-0.151]
43 This implies that the group Lasso does not have optimal dependence on k in the sparse non-zero row setting. [sent-239, score-0.245]
44 3 Upper Bounds for the Group Lasso with the Mixed (∞, 1) Norm In this section, we analyze the group Lasso estimator with the mixed (∞, 1) norm, defined as $\hat\mu_{\ell_1/\ell_\infty} = \operatorname*{argmin}_{\mu\in\mathbb{R}^{p\times k}} \tfrac{1}{2}\|Y-\mu\|_F^2 + \lambda\sum_{i\in[p]}\|\theta_i\|_\infty$, where $\theta_i = (\mu_{ij})_{j\in[k]}$. [sent-241, score-0.323]
45 Suppose that the penalty parameter λ is set as $\lambda = k\sigma\sqrt{2\ln\big(k(p-s)\big)}$. [sent-248, score-0.263]
46 Then $P[\hat S(\hat\mu_{\ell_1/\ell_\infty}) \neq S] \le \alpha$ if $\mu_{\min} \ge \frac{1+\tau}{1-c}\,k^{\beta-1}\lambda$, where $c = \sqrt{2\ln(2s/\delta')/k^{1-\beta}}$ and $\tau = \sigma\sqrt{2k\ln(2s/\delta')}\big/\lambda$. [sent-251, score-0.201]
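In the Normal means setting the (∞, 1)-penalized estimate also decouples across rows: each row is a proximal problem for λ‖·‖_∞, which by Moreau's decomposition equals the residual of a Euclidean projection onto the ℓ1 ball of radius λ. The sketch below uses a standard sorting-based projection routine and is an illustration under that assumption, not the authors' implementation; the function names are made up for this sketch.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a vector v onto {x : ||x||_1 <= radius}, radius > 0."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]            # sorted magnitudes, descending
    css = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, len(v) + 1) > (css - radius))[-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def prox_linf(y, lam):
    """prox of lam*||.||_inf via Moreau: y minus its projection onto the l1 ball of radius lam."""
    return y - project_l1_ball(y, lam)

def group_linf_estimate(Y, lam):
    """Row-wise (inf,1) group Lasso estimate in the Normal means model."""
    mu_hat = np.vstack([prox_linf(row, lam) for row in Y])
    S_hat = np.flatnonzero(np.abs(mu_hat).max(axis=1) > 0)
    return mu_hat, S_hat
```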
47 Comparing upper bounds for the Lasso and the group Lasso with the mixed (2, 1) norm with the result of Theorem 6, we can see that both the Lasso and the group Lasso have better dependence on k than the group Lasso with the mixed (∞, 1) norm. [sent-254, score-0.864]
48 This suggests that we should be very cautious when using the group Lasso with the mixed (∞, 1) norm, since as soon as the tasks do not share exactly the same features, the other two procedures have much better performance at identifying the set of non-zero rows. [sent-256, score-0.43]
49 Simulation Results We conduct a small-scale empirical study of the performance of the Lasso and the group Lasso (both with the mixed (2, 1) norm and with the mixed (∞, 1) norm). [sent-258, score-0.474]
50 In particular, we demonstrate that as the minimum signal level µmin varies in the model (3), our theory sharply determines the points at which the probability of successfully identifying the non-zero rows of the matrix M transitions from 0 to 1 for the different procedures. [sent-260, score-0.359]
51 The sparsity of each non-zero row is controlled by changing the parameter β in {0, 0. [sent-264, score-0.145]
52 Figure 1 plots the probability of success as a function of the signal strength. [sent-279, score-0.281]
53 On the same figure we plot the probability of success for the group Lasso with both (2, 1) and (∞, 1)-mixed norms. [sent-280, score-0.324]
54 Figure 1 plots probability of success as a function of the parameter ρ, which controls the signal strength. [sent-286, score-0.319]
55 This probability transitions very sharply from 0 to 1. [sent-287, score-0.16]
56 A rectangle on a horizontal line represents points at which the probability P[S = S] is between 0. [sent-288, score-0.192]
57 From each subfigure in Figure 1, we can observe that the probability of success for the Lasso transitions from 0 to 1 for the same value of the parameter ρ for different values of p, which indicates that, except for constants, our theory correctly characterizes the scaling of µmin . [sent-291, score-0.27]
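A minimal Monte Carlo version of the experiment behind these figures can be assembled from the model above; the signal grid, the repetition count, and the plug-in `estimator` interface are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def recovery_probability(estimator, rho_grid, n_rep=50, p=128, k=64, s=8,
                         beta=0.75, sigma=1.0, seed=0):
    """Estimate P[S_hat == S] as a function of the signal-strength multiplier rho.
    `estimator` maps a data matrix Y to an estimated support (array of row indices)."""
    rng = np.random.default_rng(seed)
    probs = []
    for rho in rho_grid:
        hits = 0
        for _ in range(n_rep):
            eps = k ** (-beta)
            S = rng.choice(p, size=s, replace=False)
            mu = np.zeros((p, k))
            # note: a row of S may draw all-zero selectors, in which case exact recovery is impossible
            mu[S] = rng.binomial(1, eps, size=(s, k)) * rho
            Y = mu + sigma * rng.standard_normal((p, k))
            hits += set(estimator(Y)) == set(S)
        probs.append(hits / n_rep)
    return np.array(probs)
```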
58 In addition, we can see that the Lasso outperforms the group Lasso (with (2, 1)-mixed norm) when each non-zero row is very sparse (the parameter β is close to one). [sent-292, score-0.245]
59 2 Group Lasso Next, we focus on the empirical performance of the group Lasso with the mixed (2, 1) norm. [sent-294, score-0.299]
60 Figure 2 plots the probability of success as a function of the signal strength. [sent-295, score-0.281]
61 Using Theorem 4, we set $\mu_{\mathrm{group}} = \sigma\sqrt{\frac{2(\sqrt{5}+4)\,k^{\beta-1/2}}{1-c}\,\ln\!\Big(\frac{(2s-\delta')(p-s)}{\alpha'\delta'}\Big)}$ (14), where c is defined in Theorem 4. [sent-296, score-0.267]
62 Figure 2 plots probability of success as a function of the parameter ρ, which controls the signal strength. [sent-300, score-0.319]
63 A rectangle on a horizontal line represents points at which the probability P[S = S] is between 0. [sent-301, score-0.192]
64 From each subfigure in Figure 2, we can observe that the probability of success for the group Lasso transitions from 0 to 1 at the same value of the parameter ρ for different values of p, which indicates that, except for constants, our theory correctly characterizes the scaling of µmin. [sent-304, score-0.439]
65 We observe also that the group Lasso outperforms the Lasso when each non-zero row is not too sparse, that is, when there is a considerable overlap of features between different tasks. [sent-305, score-0.222]
66 3 Group Lasso with the Mixed (∞, 1) Norm Next, we focus on the empirical performance of the group Lasso with the mixed (∞, 1) norm. [sent-307, score-0.299]
67 Figure 3 plots the probability of success as a function of the signal strength. [sent-308, score-0.281]
68 Figure 3 plots probability of success as a function of the parameter ρ, which controls the signal strength. [sent-312, score-0.319]
69 A rectangle on a horizontal line represents points at which the probability P[S = S] is between 0. [sent-313, score-0.192]
71 Figure 1: The probability of success for the Lasso for the problem of estimating S, plotted against the signal strength, which is varied as a multiple of µlasso defined in (13). [sent-339, score-0.313]
72 A rectangle on each horizontal line represents points at which the probability P[S = S] is between 0. [sent-340, score-0.192]
73 To the left of the rectangle the probability is smaller than 0. [sent-343, score-0.146]
74 Different subplots represent the probability of success as the sparsity parameter β changes. [sent-346, score-0.301]
75 From each subfigure in Figure 3, we can observe that the probability of success for the group Lasso transitions from 0 to 1 at the same value of the parameter ρ for different values of p, which indicates that, except for constants, our theory correctly characterizes the scaling of µmin. [sent-347, score-0.439]
76 We also observe that the group Lasso with the mixed (∞, 1) norm never outperforms the Lasso or the group Lasso with the mixed (2, 1) norm. [sent-348, score-0.643]
77 [Figure panel heading: Probability of successful support recovery: group Lasso; sparsity parameter β.] [sent-349, score-0.193]
80 Figure 2: The probability of success for the group Lasso for the problem of estimating S, plotted against the signal strength, which is varied as a multiple of µgroup defined in (14). [sent-373, score-0.482]
81 A rectangle on each horizontal line represents points at which the probability P[S = S] is between 0. [sent-374, score-0.192]
82 To the left of the rectangle the probability is smaller than 0. [sent-377, score-0.146]
83 Different subplots represent the probability of success as the sparsity parameter β changes. [sent-380, score-0.301]
84 Under many scenarios, the group lasso outperforms the lasso. [sent-383, score-0.79]
85 The ℓ1 /ℓ2 penalty seems to be a much better choice for the group lasso than the ℓ1 /ℓ∞ . [sent-384, score-0.852]
87 Figure 3: The probability of success for the group Lasso with the mixed (∞, 1) norm for the problem of estimating S, plotted against the signal strength, which is varied as a multiple of $\mu_\infty$ defined in (15). [sent-409, score-0.784]
88 A rectangle on each horizontal line represents points at which the probability P[S = S] is between 0. [sent-410, score-0.192]
89 To the left of the rectangle the probability is smaller than 0. [sent-413, score-0.146]
90 Different subplots represent the probability of success as the sparsity parameter β changes. [sent-416, score-0.301]
91 Since in many practical situations one does not know how much overlap there is between different tasks, it would be useful to combine the Lasso and the group Lasso in order to improve the performance. [sent-418, score-0.169]
92 For example, one can take the union of the Lasso and the group Lasso estimates, $\hat S = \hat S(\hat\mu_{\ell_1}) \cup \hat S(\hat\mu_{\ell_1/\ell_2})$. [sent-419, score-0.169]
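A direct sketch of this combination, assuming the thresholding estimators sketched earlier (the two λ values are left as inputs and the function name is illustrative):

```python
import numpy as np

def combined_support(Y, lasso_lam, group_lam):
    """Union of the Lasso and (2,1) group Lasso support estimates: S_hat = S_l1 U S_l1/l2."""
    soft = np.sign(Y) * np.maximum(np.abs(Y) - lasso_lam, 0.0)   # entrywise soft thresholding
    s_l1 = np.flatnonzero(np.abs(soft).sum(axis=1) > 0)
    row_norms = np.linalg.norm(Y, axis=1)                        # group thresholding on row norms
    s_l1_l2 = np.flatnonzero(row_norms > group_lam)
    return np.union1d(s_l1, s_l1_l2)
```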
93 $\ldots, a_k) = \mathbb{E}_Z\,\mathbb{E}_m\,\mathbb{E}_\xi \exp\!\Big(-\frac{Z\mu_{\min}^2}{2} + \mu_{\min}\sum_{j\in m}\xi_j a_j\Big) = \mathbb{E}_Z \exp\!\Big(-\frac{Z\mu_{\min}^2}{2}\Big)\,\mathbb{E}_m\prod_{j\in m}\cosh(\mu_{\min}a_j)$. [sent-516, score-0.264]
94 The following holds: $P_0(g^2) = P_0\Big[\mathbb{E}_{Z',Z}\exp\!\Big(-\frac{(Z+Z')\mu_{\min}^2}{2}\Big)\,\mathbb{E}_{m,m'}\prod_{j\in m}\cosh(\mu_{\min}a_j)\prod_{j\in m'}\cosh(\mu_{\min}a_j)\Big] = \mathbb{E}_{Z',Z}\exp\!\Big(-\frac{(Z+Z')\mu_{\min}^2}{2}\Big)\,\mathbb{E}_{m,m'}\Big[\prod_{j\in m\cap m'}\int\cosh^2(\mu_{\min}a_j)\varphi(a_j)\,da_j \prod_{j\in m\triangle m'}\int\cosh(\mu_{\min}a_j)\varphi(a_j)\,da_j\Big]$, where we use $m\triangle m'$ to denote $(m\cup m')\setminus(m\cap m')$. [sent-518, score-0.24]
95 By direct calculation, we have that $\int\cosh^2(\mu_{\min}a_j)\varphi(a_j)\,da_j = \exp(\mu_{\min}^2)\cosh(\mu_{\min}^2)$ and $\int\cosh(\mu_{\min}a_j)\varphi(a_j)\,da_j = \exp(\mu_{\min}^2/2)$. [sent-519, score-0.176]
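Both identities can be checked numerically against the standard normal density; the quadrature below (using SciPy, with an arbitrary test value for µmin) is only a sanity check, not part of the proof.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu = 0.7  # arbitrary test value standing in for mu_min

# int cosh^2(mu * a) phi(a) da versus exp(mu^2) * cosh(mu^2)
lhs1, _ = quad(lambda a: np.cosh(mu * a) ** 2 * norm.pdf(a), -np.inf, np.inf)
print(lhs1, np.exp(mu ** 2) * np.cosh(mu ** 2))

# int cosh(mu * a) phi(a) da versus exp(mu^2 / 2)
lhs2, _ = quad(lambda a: np.cosh(mu * a) * norm.pdf(a), -np.inf, np.inf)
print(lhs2, np.exp(mu ** 2 / 2))
```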
96 [The first parameter denotes the total number of stones in an urn, the second parameter denotes the number of stones we are going to sample without replacement from the urn and the last parameter denotes the fraction of white stones in the urn. [sent-522, score-0.213]
97 By convexity, it follows that $P_0(g^2) \le \mathbb{E}_{Z,Z'}\,\mathbb{E}_X\big[\cosh(\mu_{\min}^2)^X\big] \le \mathbb{E}_{Z,Z'}\exp\!\Big(Z\ln\!\Big(1 + \frac{Z'}{k}\big(\cosh(\mu_{\min}^2)-1\big)\Big)\Big) = \mathbb{E}_{Z'}\,\mathbb{E}_Z\exp\!\Big(Z\ln\!\Big(1 + \frac{uZ'}{k}\Big)\Big)$, where $\mu_{\min}^2 = \ln\big(1+u+\sqrt{2u+u^2}\big)$ with $u = \ln\big(1+\alpha^2 T/2\big)\big/\big(2k^{1-2\beta}\big)$. [sent-525, score-0.931]
98 Continuing with our calculations, we have that $P_0(g^2) \le \mathbb{E}_{Z'}\exp\!\big(k\ln(1+k^{-(1+\beta)}uZ')\big) \le \mathbb{E}_{Z'}\exp\!\big(k^{-\beta}uZ'\big) = \exp\!\Big(k\ln\!\big(1+k^{-\beta}(\exp(k^{-\beta}u)-1)\big)\Big)$ (19) $\le \exp\!\Big(k^{1-\beta}\big(\exp(k^{-\beta}u)-1\big)\Big) \le \exp\!\big(2k^{1-2\beta}u\big) = 1 + \frac{\alpha^2 T}{2}$, where the last inequality follows since $k^{-\beta}u < 1$ for all large p. [sent-526, score-0.594]
99 Therefore, using Lemma 3 with δ = δ′/(2s − δ′), it follows that $P[S_k(i) \le \lambda] \le \delta'/(2s)$ for all $i \in S$ when $\mu_{\min} \ge \sigma\sqrt{2(\sqrt{5}+4)\,k^{\beta-1/2}\,\ln\!\big(2e(2s-\delta')(p-s)/(\alpha'\delta')\big)}$. [sent-543, score-0.201]
100 A note on the group lasso and a sparse group lasso. [sent-582, score-0.982]
wordName wordTfidf (topN-words)
[('lasso', 0.621), ('kolar', 0.272), ('ln', 0.201), ('ecovery', 0.182), ('nion', 0.182), ('cosh', 0.17), ('group', 0.169), ('afferty', 0.154), ('wasserman', 0.131), ('mixed', 0.13), ('upport', 0.127), ('infty', 0.127), ('success', 0.116), ('lounici', 0.115), ('strength', 0.113), ('rectangle', 0.107), ('signal', 0.1), ('sc', 0.095), ('sparsity', 0.092), ('min', 0.088), ('bin', 0.084), ('ez', 0.083), ('normal', 0.073), ('baraud', 0.073), ('recovery', 0.071), ('sk', 0.07), ('negahban', 0.069), ('pen', 0.069), ('pm', 0.067), ('sharply', 0.064), ('penalty', 0.062), ('rows', 0.059), ('transitions', 0.057), ('ui', 0.055), ('rk', 0.055), ('mladen', 0.054), ('stones', 0.054), ('subplots', 0.054), ('row', 0.053), ('yi', 0.052), ('larry', 0.051), ('mk', 0.051), ('liu', 0.047), ('obozinski', 0.046), ('horizontal', 0.046), ('tn', 0.046), ('regimes', 0.045), ('norm', 0.045), ('wainwright', 0.042), ('identifying', 0.04), ('odd', 0.04), ('probability', 0.039), ('han', 0.038), ('sara', 0.038), ('controls', 0.038), ('arlot', 0.036), ('cmu', 0.036), ('characterizes', 0.035), ('da', 0.035), ('estimation', 0.034), ('tasks', 0.034), ('massimiliano', 0.033), ('alexandre', 0.033), ('theorem', 0.033), ('penalties', 0.033), ('exp', 0.032), ('tsybakov', 0.032), ('varied', 0.031), ('penalized', 0.031), ('jian', 0.031), ('karim', 0.031), ('regression', 0.03), ('sharing', 0.029), ('procedures', 0.029), ('upper', 0.029), ('penalization', 0.029), ('geer', 0.028), ('uz', 0.028), ('soon', 0.028), ('display', 0.028), ('estimating', 0.027), ('going', 0.027), ('cients', 0.027), ('sub', 0.026), ('ordinary', 0.026), ('studying', 0.026), ('plots', 0.026), ('turlach', 0.025), ('screening', 0.025), ('kim', 0.025), ('zou', 0.025), ('ak', 0.024), ('support', 0.024), ('urn', 0.024), ('pj', 0.024), ('argmin', 0.024), ('bound', 0.023), ('sparse', 0.023), ('correctly', 0.023), ('bounds', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
2 0.22652583 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
3 0.16496071 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
4 0.10552069 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding c 2011 Shuheng Zhou, Philipp R¨ timann, Min Xu and Peter B¨ hlmann. u u ¨ ¨ Z HOU , R UTIMANN , X U AND B UHLMANN
5 0.10510471 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
6 0.083981164 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.078603074 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
8 0.068024896 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
9 0.066242509 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
10 0.063522018 104 jmlr-2011-X-Armed Bandits
11 0.061248247 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
12 0.060971323 35 jmlr-2011-Forest Density Estimation
13 0.060642213 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
14 0.052394513 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
15 0.045392018 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
16 0.04063772 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
17 0.040348098 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
18 0.038636185 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
19 0.035235245 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
20 0.03359805 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
topicId topicWeight
[(0, 0.222), (1, 0.026), (2, 0.25), (3, 0.284), (4, 0.022), (5, 0.017), (6, -0.024), (7, -0.028), (8, -0.052), (9, 0.01), (10, 0.233), (11, -0.273), (12, 0.064), (13, -0.257), (14, 0.024), (15, -0.085), (16, -0.135), (17, 0.177), (18, 0.009), (19, -0.042), (20, -0.063), (21, -0.067), (22, -0.08), (23, 0.024), (24, 0.09), (25, 0.021), (26, -0.111), (27, -0.061), (28, 0.05), (29, -0.044), (30, -0.041), (31, 0.025), (32, -0.066), (33, 0.024), (34, -0.023), (35, -0.027), (36, -0.024), (37, 0.043), (38, 0.015), (39, -0.016), (40, 0.011), (41, -0.012), (42, -0.037), (43, -0.086), (44, -0.142), (45, -0.044), (46, -0.033), (47, -0.015), (48, -0.04), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.97964972 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
2 0.88145399 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
3 0.57734334 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
4 0.50082874 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
5 0.42693764 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding c 2011 Shuheng Zhou, Philipp R¨ timann, Min Xu and Peter B¨ hlmann. u u ¨ ¨ Z HOU , R UTIMANN , X U AND B UHLMANN
6 0.36021274 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.35058504 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
8 0.3334547 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
9 0.32054025 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
10 0.30999321 104 jmlr-2011-X-Armed Bandits
11 0.27764139 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.25975516 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
13 0.20663506 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
14 0.19850756 35 jmlr-2011-Forest Density Estimation
15 0.18600135 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
16 0.17710592 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
17 0.17194714 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
18 0.17057064 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
19 0.16795182 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
20 0.16069703 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
topicId topicWeight
[(4, 0.647), (9, 0.027), (10, 0.013), (24, 0.026), (31, 0.052), (32, 0.015), (41, 0.014), (73, 0.018), (78, 0.084)]
simIndex simValue paperId paperTitle
1 0.95447993 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Cand` s and e Recht (2009), Cand` s and Tao (2009), and Keshavan et al. (2009). The reconstruction is accome plished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
same-paper 2 0.93571049 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
3 0.92991132 54 jmlr-2011-Learning Latent Tree Graphical Models
Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY
4 0.64672279 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
5 0.56147784 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
6 0.53615451 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.52929997 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
9 0.48753434 104 jmlr-2011-X-Armed Bandits
10 0.46528393 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
11 0.44324541 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
12 0.44267067 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
13 0.44055498 35 jmlr-2011-Forest Density Estimation
14 0.43910712 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.42850688 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.42733651 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
17 0.42034405 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
18 0.41882879 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
19 0.41640505 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
20 0.39430746 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing