jmlr jmlr2013 jmlr2013-72 knowledge-graph by maker-knowledge-mining

72 jmlr-2013-Multi-Stage Multi-Task Feature Learning


Source: pdf

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. [sent-11, score-0.171]

2 In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. [sent-12, score-0.166]

3 To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. [sent-13, score-0.112]

4 Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. [sent-15, score-0.135]

5 Multi-task feature learning, which aims to learn a common set of shared features, has received a lot of interest in machine learning recently, due to the popularity of various sparse learning formulations and their successful applications in many problems. [sent-36, score-0.158]

6 In this paper, we focus on a specific multi-task feature learning setting, in which we learn the features specific to each task as well as the common features shared among tasks. [sent-37, score-0.199]

7 Although many multi-task feature learning algorithms have been proposed in the past, many of them require the relevant features to be shared by all tasks. [sent-38, score-0.143]

8 Jalali et al. (2010) proposed an ℓ1 + ℓ1,∞ regularized formulation, called the “dirty model”, to leverage the common features shared among tasks. [sent-42, score-0.126]

9 The dirty model allows a certain feature to be shared by some tasks but not all tasks. [sent-43, score-0.226]

10 The ℓ1 + ℓ1,∞ regularizer is a convex relaxation of an ℓ0-type regularizer, for which a globally optimal solution can be obtained. [sent-48, score-0.143]

11 However, a convex regularizer is known to be too loose to approximate the ℓ0-type one and often achieves suboptimal performance (it either requires restrictive conditions or obtains a suboptimal error bound) (Zou and Li, 2008; Zhang, 2010, 2012; Zhang and Zhang, 2012; Shen et al. [sent-49, score-0.105]

12 However, the non-convex formulation is usually difficult to solve and a globally optimal solution cannot be obtained in most practical problems. [sent-53, score-0.125]

13 Moreover, the solution of the non-convex formulation heavily depends on the specific optimization algorithms employed. [sent-54, score-0.1]

14 We propose a non-convex formulation, called the capped-ℓ1,ℓ1 regularized model, for multi-task feature learning. [sent-57, score-0.105]

15 The proposed model aims to simultaneously learn the features specific to each task as well as the common features shared among tasks. [sent-58, score-0.138]

16 To address the reproducibility issue of the non-convex formulation, we show that the solution generated by the MSMTFL algorithm is unique (i.e., reproducible). [sent-62, score-0.125]

17 We define N_n as {1, · · · , n} and N(µ, σ²) as the normal distribution with mean µ and variance σ². For a d × m matrix W and sets I_i ⊆ N_d × {i}, I ⊆ N_d × N_m, we let w_{I_i} be the d × 1 vector with the j-th entry being w_{ji}, if (j, i) ∈ I_i, and 0, otherwise. [sent-77, score-0.506]

18 We also let W_I be a d × m matrix with the (j, i)-th entry being w_{ji}, if (j, i) ∈ I, and 0, otherwise. [sent-78, score-0.506]

19 The Proposed Formulation and Algorithm In this section, we first propose a non-convex formulation for multi-task feature learning, based on the capped-ℓ1 , ℓ1 regularization. [sent-85, score-0.123]

20 We consider learning a weight matrix W = [w1 , · · · , wm ] ∈ Rd×m (wi ∈ Rd , i ∈ Nm ) consisting of the weight vectors for m linear predictive models: yi ≈ fi (Xi ) = Xi wi , i ∈ Nm . [sent-90, score-0.247]

21 In this paper, we propose a non-convex multi-task feature learning formulation to learn these m models simultaneously, based on the capped-ℓ1 ,ℓ1 regularization. [sent-91, score-0.123]

22 In this paper, we focus on the following quadratic loss function: l(W) = Σ_{i=1}^{m} (1/(m n_i)) ||X_i w_i − y_i||². [sent-95, score-0.168]
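
As a concrete reading of this loss together with the capped-ℓ1,ℓ1 penalty of Equation (1), here is a minimal numpy sketch; the function name and the data layout (one column of W per task) are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def capped_l1l1_objective(W, Xs, ys, lam, theta):
    """Quadratic multi-task loss plus the capped-l1,l1 regularizer.

    W     : (d, m) weight matrix, one column per task
    Xs    : list of m design matrices, Xs[i] of shape (n_i, d)
    ys    : list of m response vectors, ys[i] of shape (n_i,)
    lam   : regularization parameter lambda
    theta : capping threshold
    """
    m = len(Xs)
    # l(W) = sum_i (1 / (m * n_i)) * ||X_i w_i - y_i||^2
    loss = sum(
        np.sum((Xs[i] @ W[:, i] - ys[i]) ** 2) / (m * Xs[i].shape[0])
        for i in range(m)
    )
    # capped-l1,l1 penalty: lambda * sum_j min(||w^j||_1, theta),
    # where w^j is the j-th row of W
    row_l1 = np.sum(np.abs(W), axis=1)
    penalty = lam * np.sum(np.minimum(row_l1, theta))
    return loss + penalty
```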

23 Thus, under the formulation in Equation (1), some features can be shared by some tasks but not all the tasks. [sent-98, score-0.187]

24 Therefore, the proposed formulation can leverage the common features shared among tasks. [sent-99, score-0.144]

25 The first one is the capped-ℓ1 feature learning formulation: capped-ℓ1: min_{W ∈ R^{d×m}} l(W) + λ Σ_{j=1}^{d} Σ_{i=1}^{m} min(|w_{ji}|, θ). (3) [sent-102, score-0.632]

26 Although the optimal solution of formulation (3) has a similar sparse pattern to that of the proposed capped-ℓ1,ℓ1 multi-task feature learning (i.e., [sent-103, score-0.204]

27 the optimal solution can have many zero rows and some entries of a non-zero row can be zero), the models for different tasks decouple and thus formulation (3) is equivalent to single-task feature learning. [sent-105, score-0.309]

28 The second one is the capped-ℓ1,ℓ2 multi-task feature learning formulation: capped-ℓ1,ℓ2: min_{W ∈ R^{d×m}} l(W) + λ Σ_{j=1}^{d} min(||w^j||, θ). (4) [sent-107, score-0.231]

29 Due to the use of the capped-ℓ1,ℓ2 penalty, the optimal solution W⋆ of formulation (4) has many zero rows. [sent-108, score-0.1]

30 This is obviously different from the motivation of the proposed capped-ℓ1,ℓ1 multi-task feature learning, that is, some features are shared by some tasks but not all the tasks. [sent-112, score-0.186]
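
To make the difference among the three capped penalties concrete, the following sketch evaluates each of them on the same weight matrix; this is illustrative code under the row/column conventions assumed above, not code from the paper.

```python
import numpy as np

def capped_penalties(W, theta):
    """Return the three capped penalties (without the lambda factor)."""
    # capped-l1,l1: cap the l1-norm of each row of W
    cap_l1l1 = np.sum(np.minimum(np.sum(np.abs(W), axis=1), theta))
    # capped-l1: cap every entry individually (the tasks decouple)
    cap_l1 = np.sum(np.minimum(np.abs(W), theta))
    # capped-l1,l2: cap the l2-norm of each row of W
    cap_l1l2 = np.sum(np.minimum(np.linalg.norm(W, axis=1), theta))
    return cap_l1l1, cap_l1, cap_l1l2

# Example rows: all-zero, shared by only some tasks, shared by all tasks
W = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.5, 0.5, 0.5]])
print(capped_penalties(W, theta=1.0))
```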

31 Algorithm 1 (MSMTFL: Multi-Stage Multi-Task Feature Learning): (1) initialize λ_j^{(0)} = λ; (2) for ℓ = 1, 2, · · · do; (3) let Ŵ^{(ℓ)} be a solution of min_{W ∈ R^{d×m}} l(W) + Σ_{j=1}^{d} λ_j^{(ℓ−1)} ||w^j||_1; (4) let λ_j^{(ℓ)} = λ I(||(ŵ^{(ℓ)})^j||_1 < θ) for all j ∈ N_d; (5) end for. [sent-123, score-0.162]
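
The multi-stage procedure can be sketched as follows. The inner weighted-ℓ1 solver below is a plain proximal-gradient (ISTA) routine written only for illustration (the paper does not prescribe this solver), and the function names, iteration count, and step-size choice are assumptions. Because the row-wise ℓ1 penalty is a sum over entries, each per-stage subproblem decouples over the m tasks, which the loop exploits.

```python
import numpy as np

def weighted_l1_task(X, y, lam_vec, m, n_iter=500):
    """Solve min_w (1/(m*n)) * ||X w - y||^2 + sum_j lam_vec[j] * |w_j|
    by proximal gradient (ISTA); a simple illustrative solver."""
    n, d = X.shape
    w = np.zeros(d)
    # Lipschitz constant of the smooth part's gradient: (2/(m*n)) * sigma_max(X)^2
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / (m * n)
    step = 1.0 / L
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y) / (m * n)
        z = w - step * grad
        # soft-thresholding with per-coordinate thresholds
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam_vec, 0.0)
    return w

def msmtfl(Xs, ys, lam, theta, n_stages=5):
    """Sketch of Algorithm 1 (MSMTFL)."""
    m = len(Xs)
    d = Xs[0].shape[1]
    lam_vec = np.full(d, lam)      # lambda_j^(0) = lambda
    W = np.zeros((d, m))
    for _ in range(n_stages):
        # weighted-l1 stage: the problem decouples over the m tasks
        for i in range(m):
            W[:, i] = weighted_l1_task(Xs[i], ys[i], lam_vec, m)
        # reweighting: penalize only rows whose l1-norm is below theta
        row_l1 = np.sum(np.abs(W), axis=1)
        lam_vec = lam * (row_l1 < theta).astype(float)
    return W
```

Terminating after the first stage recovers the plain ℓ1-regularized (Lasso) solution, matching the discussion below.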

32 Fix v = v̂^{(ℓ)} = [I(||(ŵ^{(ℓ)})^1||_1 < θ), · · · , I(||(ŵ^{(ℓ)})^d||_1 < θ)]^T: Ŵ^{(ℓ+1)} = argmin_W { l(W) + λ(v̂^{(ℓ)})^T h(W) − λ g⋆(v̂^{(ℓ)}) } = argmin_W { l(W) + λ(v̂^{(ℓ)})^T h(W) } (12), which corresponds to Step 3 of Algorithm 1. [sent-156, score-0.150]

33 If we terminate the algorithm at ℓ = 1, the MSMTFL algorithm is equivalent to the ℓ1-regularized multi-task feature learning algorithm (Lasso). [sent-162, score-0.105]

34 Specifically, a penalty is imposed in the current stage if the ℓ1 -norm of some row of W in the last stage is smaller than the threshold θ; otherwise, no penalty is imposed. [sent-165, score-0.185]

35 In other words, MSMTFL in the current stage tends to shrink the small rows of W and keep the large rows of W in the last stage. [sent-166, score-0.164]

36 It may incorrectly keep the irrelevant rows (which should have been zero rows) or shrink the relevant rows (which should have been large rows) to be zero vectors. [sent-168, score-0.088]

37 That is, the error bound in the current stage improves the one in the last stage. [sent-172, score-0.101]

38 Taking limits on both sides of Equation (13) with ℓ ∈ K → ∞, we have f(W⋆, v⋆) ≤ f(W, v⋆), ∀W ∈ R^{d×m}, which implies W⋆ ∈ argmin_W f(W, v⋆) = argmin_W { l(W) + λ(v⋆)^T h(W) − λ g⋆(v⋆) } = argmin_W { l(W) + λ(v⋆)^T h(W) }. [sent-188, score-0.225]

39 Note that the objective function of Equation (1) can be written as a difference of two convex functions: l(W) + λ Σ_{j=1}^{d} min(||w^j||_1, θ) = (l(W) + λ||W||_{1,1}) − λ Σ_{j=1}^{d} max(||w^j||_1 − θ, 0). (16) [sent-192, score-0.148]

40 In the sequel, we will conduct analysis in two aspects: reproducibility and the parameter estimation performance. [sent-200, score-0.12]

41 Proof: Equation (5) can be decomposed into m independent smaller minimization problems: ŵ_i^{(ℓ)} = argmin_{w_i ∈ R^d} (1/(m n_i)) ||X_i w_i − y_i||² + Σ_{j=1}^{d} λ_j^{(ℓ−1)} |w_{ji}|. [sent-208, score-1.012]

42 We define Ê = { j ∈ N_d : (2/(mn)) |x_j^T (y − X ŵ)| = λ_j } and ŝ = sign( (2/(mn)) X_Ê^T (y − X ŵ) ), where X_Ê denotes the matrix composed of the columns of X indexed by Ê. [sent-211, score-0.094]

43 Then, the optimal solution ŵ of Equation (18) satisfies ŵ_{N_d\Ê} = 0 and ŵ_Ê = argmin_{w_Ê ∈ R^{|Ê|}} (1/(mn)) ||X_Ê w_Ê − y||² + Σ_{j∈Ê} λ_j |w_j| (19), where w_Ê denotes the vector composed of the entries of w indexed by Ê. [sent-212, score-0.144]

44 Although the solution may not be globally optimal, we show in the next section that the solution has good performance in terms of the parameter estimation error bound. [sent-220, score-0.159]

45 Remark 3: Zhang (2010, 2012) studies the capped-ℓ1 regularized formulation for the single-task setting and proposes a multi-stage algorithm for that formulation. [sent-221, score-0.134]

46 Define F̄_i = {(j, i) : w̄_{ji} ≠ 0} and F̄ = ∪_{i∈N_m} F̄_i. [sent-243, score-0.479]

47 Jalali et al. (2010) gave an ℓ∞,∞-norm error bound ||Ŵ_Dirty − W̄||_{∞,∞} = O(√(ln(dm/η)/n)) as well as a sign consistency result between Ŵ and W̄. [sent-260, score-0.088]

48 (2010) presented an ℓ∞,∞ -norm parameter estimation error bound and hence a sign consistency result can be obtained. [sent-268, score-0.121]

49 Remark 12 The capped-ℓ1 regularized formulation in Zhang (2010) is a special case of our formulation when m = 1. [sent-272, score-0.168]

50 The key difficulty is that, in order to achieve a similar support recovery result for the formulation in Equation (1), we need to assume that each row of the underlying sparse weight matrix W̄ is either a zero vector or a vector composed of all nonzero entries. [sent-278, score-0.196]

51 Lemma 13: Let Ῡ = [ε̄_1, · · · , ε̄_m] with ε̄_i = [ε̄_{1i}, · · · , ε̄_{di}]^T = (1/n) X_i^T (X_i w̄_i − y_i) (i ∈ N_m). [sent-285, score-0.145]

52 Let λ̂_{Ḡ_i} = min_{(j,i)∈Ḡ_i} λ̂_{ji}, λ̂_Ḡ = min_i λ̂_{Ḡ_i} and λ̂_{0i} = max_j λ̂_{ji}, λ̂_0 = max_i λ̂_{0i}. [sent-296, score-0.958]

53 If 2||ε̄_i||_∞ < λ̂_{Ḡ_i}, then the following inequality holds at any stage ℓ ≥ 1: Σ_{i=1}^{m} Σ_{(j,i)∈Ḡ_i} |ŵ_{ji}^{(ℓ)}| ≤ ( (2||Ῡ||_{∞,∞} + λ̂_0) / (λ̂_Ḡ − 2||Ῡ||_{∞,∞}) ) Σ_{i=1}^{m} Σ_{(j,i)∈Ḡ_i^c} |Δŵ_{ji}^{(ℓ)}|. [sent-297, score-1.073]

54 Lemma 15: Using the notations of Lemma 14, we denote Ḡ = Ḡ(ℓ) = H̄^c ∩ {(j, i) : λ̂_{ji}^{(ℓ−1)} = λ} = ∪_{i∈N_m} Ḡ_i, with H̄ being defined as in Lemma 13 and Ḡ_i ⊆ N_d × {i}. [sent-301, score-0.507]

55 ||Ŵ^{(ℓ)} − W̄||_{2,1} ≤ √( 8m ( 4||Ῡ_{Ḡ(ℓ)^c}||²_F + Σ_{(j,i)∈F̄} (λ̂_{ji}^{(ℓ−1)})² ) ) / ρ⁻_min(2r̄ + s) (29). Lemma 15 is established based on Lemma 14, by considering the relationship between Equation (22) and Equation (27), and the specific definition of Ḡ = Ḡ(ℓ). [sent-306, score-0.525]

56 Equation (29) provides a parameter estimation error bound in terms of the ℓ2,1-norm, via ||Ῡ_{Ḡ(ℓ)^c}||²_F and the regularization parameters λ̂_{ji}^{(ℓ−1)} (see the definition of λ̂_{ji}(λ^{(ℓ−1)}) in Lemma 14). [sent-307, score-0.537]

57 This is the result directly used in the proof of Theorem 8. [sent-308, score-0.958]

58 Lemma 16: Let λ̂_{ji} = λ I(||ŵ^j||_1 < θ), j ∈ N_d, ∀i ∈ N_m, with some Ŵ ∈ R^{d×m}. [sent-310, score-0.479]

59 Then under the condition of Equation (20), we have: Σ_{(j,i)∈F̄} λ̂²_{ji} ≤ Σ_{(j,i)∈H̄} λ̂²_{ji} ≤ mλ² ||Ŵ_H̄ − W̄_H̄||²_{2,1} / θ². [sent-312, score-0.958]

60 Lemma 16 establishes an upper bound of Σ_{(j,i)∈F̄} λ̂²_{ji} by ||Ŵ_H̄ − W̄_H̄||²_{2,1}, which is critical for building the recursive relationship between ||Ŵ^{(ℓ)} − W̄||_{2,1} and ||Ŵ^{(ℓ−1)} − W̄||_{2,1} in the proof of Theorem 8. [sent-313, score-0.502]

61 ||Ŵ^{(ℓ)} − W̄||²_{2,1} ≤ 8m ( 4||Ῡ_{Ḡ(ℓ)^c}||²_F + Σ_{(j,i)∈F̄} (λ̂_{ji}^{(ℓ−1)})² ) / (ρ⁻_min(2r̄ + s))² ≤ 78m ( 4u + (37/36) mλ²θ^{−2} ||Ŵ^{(ℓ−1)} − W̄||²_{2,1} ) / (ρ⁻_min(2r̄ + s))² ≤ 312mu / (ρ⁻_min(2r̄ + s))² + 0. [sent-322, score-0.571]

62 Substituting Equation (31) into the above inequality, we verify Theorem 8. [sent-335, score-0.092]

63 This requires that the solution obtained by Algorithm 1 at each stage is sparse, which is consistent with the sparsity of W̄ in Assumption 1. [sent-337, score-0.114]

64 DirtyMTL: the dirty model multi-task feature learning algorithm with λ_s||P||_{1,1} + λ_b||Q||_{1,∞} (W = P + Q) as the regularizer (Jalali et al., 2010). [sent-344, score-0.185]
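
For reference, the DirtyMTL regularizer λ_s||P||_{1,1} + λ_b||Q||_{1,∞} with W = P + Q can be written as a short helper; this is a sketch of the penalty only (not of the fitting procedure of Jalali et al.), and the function name is an assumption.

```python
import numpy as np

def dirty_regularizer(P, Q, lam_s, lam_b):
    """lam_s * ||P||_{1,1} + lam_b * ||Q||_{1,inf} with W = P + Q.

    ||P||_{1,1}   : sum of absolute values of all entries (elementwise sparsity)
    ||Q||_{1,inf} : sum over rows of the largest absolute entry (row sparsity)
    """
    l11 = np.sum(np.abs(P))
    l1inf = np.sum(np.max(np.abs(Q), axis=1))
    return lam_s * l11 + lam_b * l1inf
```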

65 We use MSMTFL-type algorithms (similar to Algorithm 1) to solve the capped-ℓ1 and capped-ℓ1,ℓ2 regularized feature learning problems (details are provided in Appendix C). [sent-346, score-0.105]

66 Synthetic data experiments: We generate synthetic data by setting the number of tasks as m, with each task having n samples of dimensionality d; each element of the data matrix X_i ∈ R^{n×d} (i ∈ N_m) for the i-th task is sampled i.i.d. [sent-348, score-0.13]

67 from the Gaussian distribution N(0, σ²); the responses are computed as y_i = X_i w̄_i + δ_i (i ∈ N_m). [sent-357, score-0.145]
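
A sketch of this data-generating process is below; the sparsity pattern of W̄, the seed handling, and the constants are placeholders rather than the exact experimental settings of the paper.

```python
import numpy as np

def generate_synthetic(m, n, d, sigma_x, sigma_noise, n_nonzero_rows, seed=0):
    """Generate multi-task data with y_i = X_i @ w_bar_i + delta_i."""
    rng = np.random.default_rng(seed)
    # ground-truth weight matrix: a few nonzero rows shared across tasks
    W_bar = np.zeros((d, m))
    rows = rng.choice(d, size=n_nonzero_rows, replace=False)
    W_bar[rows, :] = rng.normal(0.0, 1.0, size=(n_nonzero_rows, m))
    Xs, ys = [], []
    for i in range(m):
        X = rng.normal(0.0, sigma_x, size=(n, d))     # entries i.i.d. N(0, sigma_x^2)
        delta = rng.normal(0.0, sigma_noise, size=n)  # additive noise
        Xs.append(X)
        ys.append(X @ W_bar[:, i] + delta)
    return Xs, ys, W_bar
```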

68 We first report the averaged parameter estimation error ||Ŵ^{(ℓ)} − W̄||_{2,1} versus stage (ℓ) plots for MSMTFL (Figure 1). [sent-359, score-0.108]

69 We then report the averaged parameter estimation error ||Ŵ − W̄||_{2,1} in comparison with six algorithms under different parameter settings (Figure 2 and Figure 3). [sent-363, score-0.144]

70 We observe that the parameter estimation errors of the capped-ℓ1,ℓ1, capped-ℓ1 and capped-ℓ1,ℓ2 regularized feature learning formulations solved by MSMTFL-type algorithms are the smallest among all algorithms. [sent-365, score-0.138]

71 Thus, we have 6 tasks, each corresponding to a time point, and the sample sizes of the 6 tasks are 648, 642, 293, 569, 389 and 87, respectively. [sent-377, score-0.114]

72 [Figure panels: parameter estimation error (ℓ2,1) under the setting m=15, n=40, d=250, σ=0.…] [sent-396, score-0.174]

73 We evaluate the six algorithms in terms of the normalized mean squared error (nMSE) and the averaged mean squared error (aMSE), which are commonly used in multi-task learning problems (Zhang and Yeung, 2010; Zhou et al. [sent-413, score-0.136]
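
The two metrics can be computed as below; the normalization used for nMSE (each task's MSE divided by the variance of its targets) follows a common multi-task learning convention and is an assumption here, since the exact definitions are not quoted in this summary.

```python
import numpy as np

def nmse(y_true_list, y_pred_list):
    """Normalized MSE: per-task MSE divided by the target variance, averaged over tasks."""
    vals = [np.mean((yt - yp) ** 2) / np.var(yt)
            for yt, yp in zip(y_true_list, y_pred_list)]
    return float(np.mean(vals))

def amse(y_true_list, y_pred_list):
    """Averaged MSE: per-task MSE, averaged over tasks."""
    vals = [np.mean((yt - yp) ** 2) for yt, yp in zip(y_true_list, y_pred_list)]
    return float(np.mean(vals))
```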

74 5 for CapL1,L2 (the settings of θ/λ for CapL1,L1, CapL1 and CapL1,L2 are based on the relationships of ||w^j||_1, |w_{ji}| and ||w^j||, where w^j ∈ R^{1×m} and w_{ji} are the j-th row and the (j, i)-th entry of W, respectively). [sent-451, score-1.018]

75 5 for CapL1,L2 (the settings of θ/λ for CapL1,L1, CapL1 and CapL1,L2 are based on the relationships of ||w^j||_1, |w_{ji}| and ||w^j||, where w^j ∈ R^{1×m} and w_{ji} are the j-th row and the (j, i)-th entry of W, respectively). [sent-478, score-1.018]

76 Conclusions: In this paper, we propose a non-convex formulation for multi-task feature learning, which learns the specific features of each task as well as the common features shared among tasks. [sent-484, score-0.261]

77 The nonconvex formulation adopts the capped-ℓ1 ,ℓ1 regularizer to better approximate the ℓ0 -type one than the commonly used convex regularizer. [sent-485, score-0.142]

78 Although the solution may not be globally optimal, we theoretically show that it has good performance in terms of the parameter estimation error bound. [sent-489, score-0.121]

79 Table 1: Comparison of six feature learning algorithms on the MRI data set in terms of the averaged nMSE and aMSE (standard deviation), which are averaged over 10 random splittings. [sent-574, score-0.197]

80 Table 2: Comparison of six feature learning algorithms on the Isolet data set in terms of the averaged nMSE and aMSE (standard deviation), which are averaged over 10 random splittings. [sent-660, score-0.197]

81 First, we will explore the conditions under which a globally optimal solution of the proposed formulation can be obtained by the MSMTFL algorithm. [sent-664, score-0.125]

82 We let w_{I_i} be a d × 1 vector with the j-th entry being w_{ji}, if (j, i) ∈ I_i, and 0, otherwise. [sent-677, score-0.506]

83 We also let W_I be a d × m matrix with the (j, i)-th entry being w_{ji}, if (j, i) ∈ I, and 0, otherwise. [sent-678, score-0.506]

84 Proof of Lemma 13: For the j-th entry of ε̄_i (j ∈ N_d): |ε̄_{ji}| = (1/n) |x_j^{(i)T} (X_i w̄_i − y_i)| = (1/n) |x_j^{(i)T} δ_i|, where x_j^{(i)} is the j-th column of X_i. [sent-679, score-0.651]

85 According to Lemma 18, we have, ∀t > 0: Pr(|ε̄_{ji}| ≥ t) ≤ 2 exp(−nt² / (2σ² ρ⁺_i(1))) ≤ 2 exp(−nt² / (2σ² ρ⁺(1))). [sent-681, score-0.479]

86 We note that X_i ŵ_i − y_i = X_i ŵ_i − X_i w̄_i + X_i w̄_i − y_i, and we can rewrite the above equation into the following form: 2A_i Δŵ_i = −2ε̄_i − λ̂_i ⊙ sign(ŵ_i). [sent-686, score-0.703]

87 Thus, for all v ∈ R^d, we have 2v^T A_i Δŵ_i = −2v^T ε̄_i − Σ_{j=1}^{d} λ̂_{ji} v_j sign(ŵ_{ji}). [sent-687, score-0.958]

88 The last equality above is due to N_d × {i} = Ḡ_i ∪ (F̄_i ∪ Ḡ_i)^c ∪ F̄_i and Δŵ_{ji} = ŵ_{ji}, ∀(j, i) ∈ F̄_i^c ⊇ Ḡ_i. [sent-689, score-0.958]

89 Thus, all conditions of Lemma 14 are satisfied, by noticing the relationship between Equation (22) and Equation (27). [sent-694, score-0.479]

90 1/2 (38) ¯ ¯ In the above derivation, the second equality is due to Ii = Ji ∪ Fi ∪ (Fi c ∩ Gic ); the third equality is ˆ ji = λ ≥ 2 ǫi ∞ ≥ 2 ǫJ ∞ and due to Ii ∩ Gi = Ji ; the second inequality follows from ∀( j, i) ∈ Ji , λ ¯ ¯i ¯ i ⊆ G c ⊆ Ii . [sent-720, score-0.518]

91 Notice that ||x_i|| ≤ a(||y_i|| + ||z_i||) ⇒ ||X||²_{2,1} ≤ m||X||²_F = m Σ_i ||x_i||² ≤ 2ma²(||Y||²_F + ||Z||²_F). [sent-722, score-0.533]

92 Thus, we have ||ΔŴ_I||_{2,1} ≤ √( 8m ( 4||Ῡ_{Ḡ(ℓ)^c}||²_F + Σ_{(j,i)∈F̄} (λ̂_{ji}^{(ℓ−1)})² ) ) / ρ⁻_min(2r̄ + s). [sent-723, score-0.092]

93 ||ΔŴ||_{2,1} ≤ √( 8m ( 4||Ῡ_{Ḡ(ℓ)^c}||²_F + Σ_{(j,i)∈F̄} (λ̂_{ji}^{(ℓ−1)})² ) ) / ρ⁻_min(2r̄ + s) ≤ √( 8m ( 4||Ῡ||²_{∞,∞} |Ḡ(ℓ)^c| + r̄mλ² ) ) / ρ⁻_min(2r̄ + s) [sent-727, score-0.204]

94 ≤ √8.1 mλ√r̄ / ρ⁻_min(2r̄ + s), where the first inequality is due to Equation (39); the second inequality is due to s ≥ r̄ (assumption in Theorem 8), λ̂_{ji} ≤ λ and r̄m = |H̄| ≥ |F̄|; and the third inequality follows from Equation (35) and ||Ῡ||²_{∞,∞} ≤ (1/144)λ². [sent-729, score-0.596]

95 For each (j, i) ∈ F̄: ||w̄^j − ŵ^j||_1 ≥ ||w̄^j||_1 − ||ŵ^j||_1 ≥ 2θ − θ = θ. [sent-743, score-0.234]

96 Lemma 19: The following inequality holds: π_i(k_i, s_i) ≤ (s_i^{1/2}/2) √( ρ⁺_i(s_i)/ρ⁻_i(k_i + s_i) − 1 ). [sent-749, score-0.113]

97 Then for any w_i ∈ R^d, we have max(0, w_i^T A_i w_i) ≥ ρ⁻_i(k_i + s_i) ( ||w_{I_i}|| − π_i(k_i + s_i, s_i) ||w_{Ḡ_{I_i}}||_1 / s_i ) ||w_{I_i}||. [sent-751, score-0.401]

98 Lemma 21: Let ε̄_i = [ε̄_{1i}, · · · , ε̄_{di}]^T = (1/n) X_i^T (X_i w̄_i − y_i) (i ∈ N_m), and H̄_i ⊆ N_d × {i}. [sent-752, score-0.145]

99 Let λ_{ji}^{(ℓ)} = λ I(|ŵ_{ji}^{(ℓ)}| < θ) (j = 1, · · · , d, i = 1, · · · , m), where ŵ_{ji}^{(ℓ)} is the (j, i)-th entry of Ŵ^{(ℓ)} and I(·) denotes the {0, 1}-valued indicator function. [sent-770, score-1.464]

100 A general theory of concave regularization for high dimensional sparse estimation problems. [sent-1000, score-0.102]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ji', 0.479), ('msmtfl', 0.395), ('ulti', 0.339), ('dirtymtl', 0.215), ('wii', 0.173), ('tage', 0.169), ('wg', 0.169), ('wi', 0.145), ('gi', 0.14), ('nm', 0.124), ('equation', 0.123), ('gic', 0.116), ('jalali', 0.116), ('nmse', 0.113), ('eature', 0.106), ('fi', 0.102), ('amse', 0.097), ('ong', 0.094), ('hang', 0.087), ('reproducibility', 0.087), ('zhang', 0.085), ('tsinghua', 0.079), ('wti', 0.079), ('wj', 0.078), ('rd', 0.076), ('stage', 0.076), ('lasso', 0.068), ('dirty', 0.068), ('mri', 0.064), ('sign', 0.063), ('formulation', 0.062), ('feature', 0.061), ('gong', 0.06), ('earning', 0.059), ('paramter', 0.058), ('regularizer', 0.056), ('rni', 0.056), ('wgi', 0.056), ('shared', 0.054), ('obozinski', 0.052), ('averaged', 0.05), ('lemma', 0.05), ('multitask', 0.048), ('ln', 0.046), ('min', 0.046), ('traning', 0.045), ('regularized', 0.044), ('rows', 0.044), ('capped', 0.044), ('sparse', 0.043), ('tasks', 0.043), ('inequality', 0.039), ('wh', 0.039), ('solution', 0.038), ('isolet', 0.038), ('si', 0.037), ('six', 0.036), ('ti', 0.034), ('parameswaran', 0.034), ('wgic', 0.034), ('estimation', 0.033), ('row', 0.033), ('ki', 0.033), ('negahban', 0.032), ('rip', 0.032), ('xe', 0.032), ('composed', 0.031), ('synthetic', 0.031), ('interpretations', 0.029), ('nips', 0.029), ('followings', 0.029), ('jieping', 0.029), ('vt', 0.029), ('arg', 0.029), ('notations', 0.028), ('features', 0.028), ('task', 0.028), ('entry', 0.027), ('recovery', 0.027), ('incoherence', 0.027), ('xi', 0.027), ('concave', 0.026), ('speakers', 0.026), ('error', 0.025), ('globally', 0.025), ('intuitive', 0.025), ('ye', 0.024), ('lounici', 0.024), ('aii', 0.024), ('remark', 0.024), ('convex', 0.024), ('critical', 0.023), ('changshui', 0.023), ('gist', 0.023), ('grippo', 0.023), ('kolar', 0.023), ('mmse', 0.023), ('mni', 0.023), ('mtl', 0.023), ('pinghua', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

2 0.10006315 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont

Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction

3 0.093896009 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

4 0.075359397 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

5 0.057845507 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

Author: Julien Mairal, Bin Yu

Abstract: We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs. Keywords: convex and non-convex optimization, network flow optimization, graph sparsity

6 0.057138912 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

7 0.053834636 114 jmlr-2013-The Rate of Convergence of AdaBoost

8 0.05331634 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

9 0.04873706 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

10 0.045606069 86 jmlr-2013-Parallel Vector Field Embedding

11 0.043626167 90 jmlr-2013-Quasi-Newton Method: A New Direction

12 0.042212486 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

13 0.042079758 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

14 0.041363951 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

15 0.040599771 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

16 0.038042549 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

17 0.03768618 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

18 0.037341315 76 jmlr-2013-Nonparametric Sparsity and Regularization

19 0.036635082 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

20 0.035994984 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, 0.029), (2, 0.041), (3, 0.034), (4, 0.116), (5, -0.104), (6, -0.016), (7, -0.075), (8, 0.075), (9, -0.019), (10, 0.019), (11, -0.01), (12, -0.107), (13, 0.126), (14, 0.054), (15, -0.039), (16, -0.069), (17, 0.001), (18, -0.053), (19, -0.1), (20, 0.081), (21, 0.019), (22, 0.026), (23, 0.184), (24, 0.081), (25, -0.019), (26, -0.087), (27, 0.067), (28, -0.038), (29, -0.11), (30, 0.081), (31, -0.018), (32, 0.03), (33, -0.116), (34, -0.057), (35, -0.049), (36, 0.072), (37, 0.154), (38, 0.226), (39, 0.035), (40, 0.07), (41, 0.009), (42, 0.035), (43, 0.075), (44, 0.142), (45, 0.194), (46, 0.033), (47, -0.038), (48, -0.061), (49, 0.148)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93433255 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

2 0.5388791 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Pierre Dupont

Abstract: We describe a Bayesian method for group feature selection in linear regression problems. The method is based on a generalized version of the standard spike-and-slab prior distribution which is often used for individual feature selection. Exact Bayesian inference under the prior considered is infeasible for typical regression problems. However, approximate inference can be carried out efficiently using Expectation Propagation (EP). A detailed analysis of the generalized spike-and-slab prior shows that it is well suited for regression problems that are sparse at the group level. Furthermore, this prior can be used to introduce prior knowledge about specific groups of features that are a priori believed to be more relevant. An experimental evaluation compares the performance of the proposed method with those of group LASSO, Bayesian group LASSO, automatic relevance determination and additional variants used for group feature selection. The results of these experiments show that a model based on the generalized spike-and-slab prior and the EP algorithm has state-of-the-art prediction performance in the problems analyzed. Furthermore, this model is also very useful to carry out sequential experimental design (also known as active learning), where the data instances that are most informative are iteratively included in the training set, reducing the number of instances needed to obtain a particular level of prediction accuracy. Keywords: group feature selection, generalized spike-and-slab priors, expectation propagation, sparse linear model, approximate inference, sequential experimental design, signal reconstruction

3 0.4603647 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

4 0.44271141 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

5 0.42461655 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

6 0.39691943 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

7 0.3451058 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

8 0.34325728 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

9 0.33201122 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

10 0.33077168 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

11 0.32163602 15 jmlr-2013-Bayesian Canonical Correlation Analysis

12 0.31382659 33 jmlr-2013-Dimension Independent Similarity Computation

13 0.31109783 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

14 0.30542889 90 jmlr-2013-Quasi-Newton Method: A New Direction

15 0.30313638 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

16 0.3016296 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

17 0.29647714 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

18 0.29085776 68 jmlr-2013-Machine Learning with Operational Costs

19 0.28322509 114 jmlr-2013-The Rate of Convergence of AdaBoost

20 0.28212947 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (5, 0.129), (6, 0.055), (10, 0.084), (20, 0.018), (23, 0.037), (68, 0.048), (70, 0.029), (75, 0.04), (84, 0.33), (85, 0.054), (87, 0.035), (89, 0.011), (93, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71405989 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

Author: Pinghua Gong, Jieping Ye, Changshui Zhang

Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning

2 0.47881559 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{−1} + (N/m)^{−2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{−1} + (N/m)^{−3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{−1} + (N/m)^{−3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.47230798 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

4 0.4706839 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions

5 0.46981245 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.46744546 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.46692303 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

8 0.46634778 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

9 0.4659749 76 jmlr-2013-Nonparametric Sparsity and Regularization

10 0.46470308 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

11 0.46376756 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

12 0.46179405 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

13 0.46168938 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

14 0.46077302 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

15 0.46026874 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

16 0.46010169 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

17 0.45983011 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

18 0.45972425 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

19 0.45586723 120 jmlr-2013-Variational Algorithms for Marginal MAP

20 0.4555743 73 jmlr-2013-Multicategory Large-Margin Unified Machines