jmlr jmlr2012 jmlr2012-112 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1/l2-norm and the l1/l∞-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require O(1/√ε) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. [sent-6, score-0.109]
2 We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. [sent-8, score-0.132]
3 Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian 1. [sent-9, score-0.202]
4 In particular, (1)-(2) cannot be cast into a nonoverlapping group lasso problem as done by Jacob et al. [sent-33, score-0.108]
5 Liu and Ye (2010) also apply FISTA to solve (3), but in each iteration, they transform the computation of the proximal operator associated with the combined penalty term into an equivalent constrained smooth problem and solve it by Nesterov’s accelerated gradient descent method (Nesterov, 2005). [sent-40, score-0.109]
6 (2010) apply the accelerated proximal gradient method to (1) with the l1/l∞ penalty and propose a network flow algorithm to solve the proximal problem associated with Ω_{l1/l∞}(x). [sent-42, score-0.15]
7 , 2011) and FISTA for solving an equivalent constrained version of problem (1) (to be introduced in Section 2) in an augmented Lagrangian method framework. [sent-49, score-0.089]
8 Specifically, we make the following contributions in this paper: • We build a general framework based on the augmented Lagrangian method, under which learning problems with both l1 /l2 and l1 /l∞ regularizations (and their variants) can be solved. [sent-50, score-0.089]
9 A Variable-Splitting Augmented Lagrangian Framework In this section, we present a unified framework, based on variable splitting and the augmented Lagrangian method for solving (1) with both l1 /l2 - and l1 /l∞ -regularizations. [sent-62, score-0.089]
10 To solve (4), we apply the augmented Lagrangian method (Hestenes, 1969; Powell, 1972; Nocedal and Wright, 1999; Bertsekas, 1999) to it. [sent-77, score-0.089]
11 This method, Algorithm 1, minimizes the augmented Lagrangian L(x, y, v) = (1/2)‖Ax − b‖² − v^T(Cx − y) + (1/(2µ))‖Cx − y‖² + Ω̃(y) (5) exactly for a given Lagrange multiplier v in every iteration, followed by an update to v. [sent-78, score-0.157]
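To make the objective in (5) concrete, the following is a minimal numerical sketch (not the paper's code) of evaluating the augmented Lagrangian for the weighted l1/l2 group penalty; the argument names, the `groups` index lists and the weights `w` are illustrative assumptions.

```python
import numpy as np

def aug_lagrangian(x, y, v, A, b, C, mu, lam, groups, w):
    """Evaluate L(x, y, v) = (1/2)||Ax - b||^2 - v^T (Cx - y)
    + (1/(2 mu)) ||Cx - y||^2 + Omega~(y), where Omega~ is the
    weighted l1/l2 group penalty on the split variable y."""
    r = C @ x - y                                    # splitting residual Cx - y
    data_fit = 0.5 * np.sum((A @ x - b) ** 2)        # smooth loss term
    penalty = lam * sum(w[j] * np.linalg.norm(y[g])  # group-wise l2 norms
                        for j, g in enumerate(groups))
    return data_fit - v @ r + (r @ r) / (2.0 * mu) + penalty
```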
12 2: for l = 0, 1, · · · do; 3: (x^{l+1}, y^{l+1}) ← arg min_{x,y} L(x, y, v^l); 4: v^{l+1} ← v^l − (1/µ)(Cx^{l+1} − y^{l+1}); 5: Update µ according to the chosen updating scheme. [sent-82, score-0.302]
13 We instead seek an approximate minimizer of the augmented Lagrangian via the abstract subroutine ApproxAugLagMin(x, y, v). [sent-85, score-0.116]
14 Theorem 1 Let α_l := L(x^l, y^l, v^l) − inf_{x∈R^m, y∈R^M} L(x, y, v^l) and F* be the optimal value of problem (4). [sent-87, score-0.383]
15 (6) Then, the sequence {v^l} converges to v*, which satisfies inf_{x∈R^m, y∈R^M} {F_obj(x, y) − (v*)^T(Cx − y)} = F*, while the sequence {(x^l, y^l)} satisfies lim_{l→∞} (Cx^l − y^l) = 0 and lim_{l→∞} F_obj(x^l, y^l) = F*. [sent-89, score-0.441]
16 The condition (6) requires the augmented Lagrangian subproblem be solved with increasing accuracy. [sent-90, score-0.113]
17 2: for l = 0, 1, · · · do; 3: (x^{l+1}, y^{l+1}) ← ApproxAugLagMin(x^l, y^l, v^l), to compute an approximate minimizer of L(x, y, v^l); 4: v^{l+1} ← v^l − (1/µ)(Cx^{l+1} − y^{l+1}); 5: Update µ according to the chosen updating scheme. [sent-93, score-0.524]
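The outer loop of Algorithm 2 can be sketched as below; `approx_aug_lag_min` stands for any of the inner solvers discussed later (ADAL, APLM-S, FISTA-p) and `update_mu` for one of the µ-updating schemes, so all names, signatures and defaults here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def alm_outer_loop(approx_aug_lag_min, update_mu, x0, y0, C,
                   mu0=1.0, max_outer=500, eps_out=1e-4):
    """Inexact augmented Lagrangian method (Algorithm 2 sketch):
    approximately minimize L(x, y, v^l), update the multiplier
    v^{l+1} = v^l - (1/mu)(C x^{l+1} - y^{l+1}), then adjust mu."""
    x, y = x0.copy(), y0.copy()
    v = np.zeros(C.shape[0])
    mu = mu0
    for l in range(max_outer):
        x, y = approx_aug_lag_min(x, y, v, mu)   # inner solver (inexact)
        r = C @ x - y                            # constraint violation Cx - y
        v = v - r / mu                           # multiplier update
        rel_primal = np.linalg.norm(r) / max(np.linalg.norm(C @ x),
                                             np.linalg.norm(y), 1e-12)
        if rel_primal <= eps_out:                # simplified stopping test
            break
        mu = update_mu(mu, l)                    # chosen updating scheme
    return x, y, v
```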
18 Methods for Approximately Minimizing the Augmented Lagrangian. In this section, we use the overlapping Group Lasso penalty Ω(x) = λ ∑_{s∈S} w_s ‖x_s‖ to illustrate the optimization algorithms under discussion. [sent-98, score-0.096]
19 , 2010)2 approximately minimizes the augmented Lagrangian by minimizing (5) with respect to x and y alternatingly and then updates the Lagrange multiplier v on each iteration (e. [sent-103, score-0.157]
20 2: x^{l+1} ← arg min_x L(x, y^l, v^l); 3: y^{l+1} ← arg min_y L(x^{l+1}, y, v^l); 4: return x^{l+1}, y^{l+1}. [sent-109, score-0.53]
21 Note that the problem solved in Line 3 of Algorithm 3, y^{l+1} = arg min_y L(x^{l+1}, y, v^l) ≡ arg min_y { (1/(2µ))‖d^l − y‖² + Ω̃(y) }, (7) where d^l = Cx^{l+1} − µv^l, is group-separable and hence can be solved in parallel. [sent-124, score-0.204]
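Since the splitting duplicates overlapping features in y, the groups form disjoint blocks of y and (7) decouples; the sketch below solves it by block soft-thresholding for the l1/l2 penalty (the `groups`, weights `w` and function names are illustrative assumptions).

```python
import numpy as np

def prox_group_l2(d, groups, w, lam, mu):
    """Solve min_y (1/(2 mu)) ||d - y||^2 + lam * sum_s w_s ||y_s||
    block by block: each block of y is a scaled (possibly zeroed)
    copy of the corresponding block of d."""
    y = np.zeros_like(d)
    for j, g in enumerate(groups):         # g: index list of one group in y
        block = d[g]
        nrm = np.linalg.norm(block)
        shrink = max(0.0, 1.0 - lam * w[j] * mu / max(nrm, 1e-12))
        y[g] = shrink * block              # block soft-thresholding
    return y
```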
22 1 PARTIAL LINEARIZATION AND CONVERGENCE RATE ANALYSIS. Let us define F(x, y) := f(x, y) + g(y) = L(x, y; v) and L_ρ(x, y, ȳ, γ) := f(x, y) + g(ȳ) + γ^T(ȳ − y) + (1/(2ρ))‖ȳ − y‖², (14) where γ is the Lagrange multiplier in the augmented Lagrangian (14) corresponding to problem (13). [sent-155, score-0.124]
23 We now present our partial-split alternating linearization algorithm to implement ApproxAugLagMin(x, y, v) in Algorithm 2. [sent-156, score-0.093]
24 2: for k = 0, 1, · · · until stopping criterion is satisfied do; 3: (x^{k+1}, y^{k+1}) ← arg min_{x,y} L_ρ(x, y, ȳ^k, γ^k). [sent-160, score-0.371]
25 When this condition is violated, a skipping step occurs in which the value of y is set to its value in the previous iteration (Line 5) and L_ρ is re-minimized with respect to x (Line 6) to ensure convergence. [sent-169, score-0.106]
26 Let us define a regular iteration of Algorithm 4 to be an iteration where no skipping step occurs, that is, Lines 5 and 6 are not executed. [sent-170, score-0.139]
27 Likewise, we define a skipping iteration to be an iteration where a skipping step occurs. [sent-171, score-0.212]
28 For ρ ≤ 1/L_y(f), the iterates (x^k, ȳ^k) in Algorithm 4 satisfy F(x^k, ȳ^k) − F(x*, y*) ≤ ‖ȳ^0 − y*‖² / (2ρ(k + k_n)), ∀k, (20) where (x*, y*) is an optimal solution to (10), and k_n is the number of regular iterations among the first k iterations. [sent-174, score-0.667]
29 3, we will discuss the case where the iterations entirely consist of skipping steps. [sent-180, score-0.096]
30 First, observe that since ρ = µ, arg min_ȳ L_ρ(x, y, ȳ, ∇_y f(x, y)) ≡ arg min_ȳ { ∇_y f(x, y)^T ȳ + (1/(2µ))‖y − ȳ‖² + g(ȳ) } ≡ arg min_ȳ { (1/(2µ))‖d − ȳ‖² + λ ∑_s ‖ȳ_s‖ }, where d = Cx − µv. [sent-186, score-0.176]
31 4, we present an algorithm FISTA-p, which is a special version of FAPLM-S in which every iteration is a skipping iteration and which has a much simpler form than FAPLM-S, while having essentially the same iteration complexity. [sent-198, score-0.172]
32 It is also possible to apply ALM-S directly, which splits both x and y, to solve the augmented Lagrangian subproblem. [sent-199, score-0.089]
33 Since λmax (AT A) is usually not known, it is necessary to perform a backtracking line-search on ρ to ensure that F(xk+1 , yk+1 ) ≤ Lρ (xk+1 , yk+1 , xk , yk , γk ). [sent-209, score-0.403]
34 3 ISTA: Partial Linearization (ISTA-p) We can also minimize the augmented Lagrangian (5), which we write as L (x, y, v) = f (x, y) + g(y) with f (x, y) and g(y) defined as in (11) and (12), using a variant of ISTA that only linearizes f (x, y) with respect to the y variables. [sent-219, score-0.089]
35 We compute y⁺ by y⁺ = arg min_{y′} L_ρ(x, y, y′, ∇_y f(x, y)) = arg min_{y′} ∑_j { (1/(2µ))‖y′_j − d_{y,j}‖² + λ‖y′_j‖ }, (24) where d_y = Cx − µv. [sent-223, score-0.16]
36 Now, given y⁺, we solve for x⁺ by x⁺ = arg min_{x′} f(x′, y⁺) = arg min_{x′} { (1/2)‖Ax′ − b‖² − v^T(Cx′ − y⁺) + (1/(2µ))‖Cx′ − y⁺‖² }. (25) The algorithm that implements subroutine ApproxAugLagMin(x, y, v) in Algorithm 2 by ISTA with partial linearization is stated below as Algorithm 5. [sent-225, score-0.163]
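Setting the gradient of (25) to zero gives the linear system (A^T A + (1/µ)C^T C)x = A^T b + C^T v + (1/µ)C^T y⁺, so the x-update is a single (possibly iterative) linear solve. The conjugate-gradient sketch below is illustrative, not the paper's implementation (which may instead use PCG or a Cholesky factorization).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def x_update(A, b, C, v, y_plus, mu, x_init=None):
    """Minimize (25) by solving its normal equations
    (A^T A + (1/mu) C^T C) x = A^T b + C^T v + (1/mu) C^T y+."""
    n = A.shape[1]
    def hess_vec(x):                      # matrix-vector product with the Hessian
        return A.T @ (A @ x) + (C.T @ (C @ x)) / mu
    H = LinearOperator((n, n), matvec=hess_vec)
    rhs = A.T @ b + C.T @ v + (C.T @ y_plus) / mu
    x, _ = cg(H, rhs, x0=x_init)          # iterative linear solve
    return x
```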
37 2: for k = 0, 1, · · · until stopping criterion is satisfied do; 3: x^{k+1} ← arg min_x f(x; ȳ^k); 4: ȳ^{k+1} ← arg min_y L_ρ(x^{k+1}, ȳ^k, y, ∇_y f(x^{k+1}, ȳ^k)); 5: end for; 6: return (x^{K+1}, ȳ^{K+1}). As we remarked in Section 3. [sent-229, score-1.079]
38 2, Algorithm 5 is equivalent to Algorithm 4 (APLM-S) where every iteration is a skipping iteration. [sent-230, score-0.106]
39 For ρ ≤ 1/L_y(f), the iterates (x^k, ȳ^k) in Algorithm 5 satisfy F(x^k, ȳ^k) − F(x*, y*) ≤ ‖ȳ^0 − y*‖² / (2ρk), where (x*, y*) is an optimal solution to (10). [sent-233, score-0.604]
40 Specifically, x⁺ = d_x and y⁺_j = (d_{y,j}/‖d_{y,j}‖) max(0, ‖d_{y,j}‖ − λρ), j = 1, . . ., (27) where d = (d_x, d_y). [sent-257, score-0.111]
41 Using ISTA to solve the outer augmented Lagrangian (5) subproblem is equivalent to taking only skipping steps in ALM-S. [sent-261, score-0.269]
42 4: (d_x, d_y) ← (z^k_x, z^k_y) − ρ∇f(z^k_x, z^k_y); 5: x^{k+1} ← d_x; 6: ȳ^{k+1}_j ← (d_{y,j}/‖d_{y,j}‖) max(0, ‖d_{y,j}‖ − λρ), j = 1, . [sent-268, score-0.802]
43 7: t_{k+1} ← (1 + √(1 + 4t_k²))/2; 8: z^{k+1}_x ← x^{k+1} + ((t_k − 1)/t_{k+1})(x^{k+1} − x^k); 9: z^{k+1}_y ← ȳ^{k+1} + ((t_k − 1)/t_{k+1})(ȳ^{k+1} − ȳ^k); 10: end for; 11: return (x^{K+1}, ȳ^{K+1}). FISTA (resp. [sent-272, score-0.806]
44 ISTA-p), where we minimize with respect to x a linearized approximation f̃(x, z^k) := f(x^k, z^k) + ∇_x f(x^k, z^k)^T(x − x^k) + (1/(2ρ))‖x − x^k‖² of the quadratic objective function f(x, z^k) in (25). [sent-274, score-1.074]
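A minimal sketch of the accelerated (FISTA-style) inner loop built from the pieces above: a gradient step at the extrapolated point z^k, a prox step on the y-part, and the standard t_k momentum update. `grad_f` and `prox_g` are assumed callables taking (point, step) — e.g., a wrapper around the group shrinkage sketched earlier — not the paper's code.

```python
import numpy as np

def fista_inner(grad_f, prox_g, x0, y0, rho, n_iter=100):
    """Accelerated proximal gradient on F(x, y) = f(x, y) + g(y):
    step of length rho from the extrapolated point (zx, zy), prox on
    the y-part, then update t_k and the extrapolation point."""
    x, y = x0.copy(), y0.copy()
    zx, zy = x.copy(), y.copy()
    t = 1.0
    for _ in range(n_iter):
        gx, gy = grad_f(zx, zy)             # gradient of the smooth part at z^k
        x_new = zx - rho * gx               # plain gradient step in x
        y_new = prox_g(zy - rho * gy, rho)  # prox (shrinkage) step in y
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_new
        zx = x_new + beta * (x_new - x)     # momentum extrapolation
        zy = y_new + beta * (y_new - y)
        x, y, t = x_new, y_new, t_new
    return x, y
```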
45 Overlapping Group l1/l∞-Regularization. The subproblems with respect to y (or ȳ) involved in all the algorithms presented in the previous sections take the following form: min_y (1/(2ρ))‖c − y‖² + Ω̃(y), (28) where Ω̃(y) = λ ∑_{s∈S} w_s ‖y_s‖_∞ in the case of l1/l∞-regularization. [sent-281, score-0.098]
46 For each s ∈ S, the subproblem in (28) is of the form min_{y_s} (1/2)‖c_s − y_s‖² + ρλ‖y_s‖_∞. [sent-286, score-0.165]
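One standard way to solve this per-group subproblem uses the Moreau decomposition: since the dual norm of l∞ is l1, the prox of τ‖·‖_∞ is the residual of a Euclidean projection onto the l1-ball of radius τ (here τ = ρλ, or ρλw_s with group weights). The routine below is an illustrative sketch, not the paper's specialized solver.

```python
import numpy as np

def project_l1_ball(c, tau):
    """Euclidean projection of c onto {z : ||z||_1 <= tau}
    (sort-based algorithm)."""
    a = np.abs(c)
    if a.sum() <= tau:
        return c.copy()
    u = np.sort(a)[::-1]                  # |c| sorted in decreasing order
    cssv = np.cumsum(u) - tau
    idx = np.arange(1, len(u) + 1)
    k = idx[u - cssv / idx > 0][-1]       # number of active components
    theta = cssv[k - 1] / k               # soft-threshold level
    return np.sign(c) * np.maximum(a - theta, 0.0)

def prox_linf(c, tau):
    """argmin_y 0.5*||c - y||^2 + tau*||y||_inf
    = c - P_{||.||_1 <= tau}(c) by Moreau decomposition."""
    return c - project_l1_ball(c, tau)
```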
47 [Table 1 residue: the relative dual residual s^{l+1} used by each algorithm, built from quantities such as ‖C^T(y^{l+1} − y^l)‖/‖C^T y^l‖, ‖C^T(ȳ^{K+1} − z^K)‖/‖C^T z^K‖, ‖C^T(ȳ^{K+1} − y^{K+1})‖/‖C^T y^{K+1}‖ and ‖(x̄^{K+1} − z^K_x, ȳ^{K+1} − z^K_y)‖/‖(z^K_x, z^K_y)‖, together with the corresponding inner-iteration relative residuals.] [sent-309, score-0.266]
48 We used stopping criteria based on the primal and dual residuals suggested by Boyd et al. [sent-317, score-0.095]
49 The maximum number of outer iterations was set to 500, and the tolerance for the outer loop was set at ε_out = 10^{-4}. [sent-320, score-0.189]
50 The number of inner iterations was capped at 2000, and the tolerance at the l-th outer iteration for the inner loops was ε_l. [sent-321, score-0.147]
51 Our termination criterion for the outer iterations was max{r^l, s^l} ≤ ε_out, where r^l = ‖Cx^l − y^l‖ / max{‖Cx^l‖, ‖y^l‖} is the outer relative primal residual and s^l is the relative dual residual, which is given for each algorithm in Table 1. [sent-322, score-0.62]
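A sketch of this outer test, assuming the inner solver reports its last relative gradient residual as s^l (all names here are illustrative):

```python
import numpy as np

def outer_converged(x, y, C, s_l, eps_out=1e-4):
    """Outer stopping test max{r^l, s^l} <= eps_out with
    r^l = ||Cx - y|| / max{||Cx||, ||y||} the relative primal residual
    and s_l the algorithm-specific relative dual residual (Table 1)."""
    Cx = C @ x
    r_l = np.linalg.norm(Cx - y) / max(np.linalg.norm(Cx),
                                       np.linalg.norm(y), 1e-12)
    return max(r_l, s_l) <= eps_out
```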
52 Recall that K + 1 is the index of the last inner iteration of the l-th outer iteration; for example, for APLM-S, (xl+1 , yl+1 ) takes the value of the last inner iterate (xK+1 , yK+1 ). [sent-323, score-0.178]
53 We stopped the inner iterations when the maximum of the relative primal residual and the relative objective gradient for the inner problem was less than ε_l. [sent-324, score-0.211]
54 We see there that s^{l+1} can be obtained directly from the relative gradient residual computed in the last inner iteration of the l-th outer iteration. [sent-326, score-0.23]
55 However, it is not possible to evaluate the sub-optimality quantity αl in (6) exactly because the optimal value of the augmented Lagrangian L (x, y, vl ) is not known in advance. [sent-332, score-0.207]
56 We use the maximum of the relative primal and dual residuals (max{r^l, s^l}) as a surrogate for α_l for two reasons: First, it has been shown (Boyd et al. [sent-337, score-0.089]
57 , 2010) that rl and sl are closely related to αl . [sent-338, score-0.092]
58 Second, the quantities r^l and s^l are readily available as by-products of the inner and outer iterations. [sent-339, score-0.206]
59 However, since we terminate the outer iterations at ε_out > 0, it is not necessary to solve the subproblems to an accuracy much higher than the one for the outer loop. [sent-343, score-0.21]
60 On the other hand, it is also important for ε_l to decrease to below ε_out, since s^l is closely related to the quantities involved in the inner stopping criteria. [sent-344, score-0.12]
61 In our context, this criterion essentially requires that the absolute dual residual is less than a fraction of the absolute primal residual. [sent-348, score-0.129]
62 99, w_x^0 is a constant, and w_y is an auxiliary variable updated in each outer iteration by w_y^{l+1} = w_y^l − (1/µ^2) C^T(ȳ^{K+1} − z^K). [sent-350, score-0.116]
63 2 Strategies for Updating µ The penalty parameter µ in the outer augmented Lagrangian (5) not only controls the infeasibility in the constraint Cx = y, but also serves as the step-length in the y-subproblem (and the x-subproblem in the case of FISTA). [sent-353, score-0.197]
64 The second strategy is a dynamic scheme based on the values rl and sl (Boyd et al. [sent-359, score-0.14]
65 Hence, to keep r^l and s^l approximately balanced in each outer iteration, our scheme updated µ as follows: µ^{l+1} ← max{βµ^l, µ_min} if r^l > τ s^l; µ^{l+1} ← min{µ^l/β, µ_max} if s^l > τ r^l; µ^{l+1} ← µ^l otherwise, where we set µ_max = 10, µ_min = 10^{-6}, τ = 10 and β = 0. [sent-363, score-0.293]
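A sketch of this balancing rule; the constants mirror the values quoted in the text except β, whose value is truncated above and is set to an assumed 0.5 here.

```python
def update_mu_dynamic(mu, r_l, s_l, beta=0.5, tau=10.0,
                      mu_min=1e-6, mu_max=10.0):
    """Keep the primal residual r^l and dual residual s^l balanced:
    shrink mu (down to mu_min) when the primal residual dominates,
    grow it (up to mu_max) when the dual residual dominates."""
    if r_l > tau * s_l:
        return max(beta * mu, mu_min)
    if s_l > tau * r_l:
        return min(mu / beta, mu_max)
    return mu
```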
66 We instead adopted a simple scheme of decreasing µ by half every 10 outer iterations. [sent-466, score-0.109]
67 It is, however, well suited for our partial linearization methods (APLM-S and FISTA-p), since there is no need for the augmented Lagrangian framework. [sent-518, score-0.139]
68 The l1 model (31) was treated as a special case of the group regularization model (32), with each group containing only one component of the feature vector. [sent-525, score-0.118]
69 For the Ridge/Elastic-Net penalty model, we applied FISTA-p directly without the outer augmented Lagrangian layer. [sent-526, score-0.197]
70 The solutions for the l1 /l2 , l1 /l∞ , and Lasso models were not strictly sparse in the sense that those supposedly zero feature coefficients had non-zero (albeit extremely small) magnitudes, since we enforced the linear constraints Cx = y through an augmented Lagrangian approach. [sent-527, score-0.089]
71 Interestingly, the majority of the APLM-S inner iterations consisted of a skipping step for the tests on synthetic data. [Table residue: classification accuracy (percent) for the l1/l2, l1/l∞, l1, and ridge + elastic-net models.] [sent-549, score-0.127]
72 It may seem straightforward to apply FISTA directly to the Lasso problem (31) without the augmented Lagrangian framework. [sent-572, score-0.089]
73 In (19), by letting (x, y) = (x*, y*) and ȳ = y^n, we get (p, q) = (x^{n+1}, y^{n+1}), and 2ρ(F(x*, y*) − F(x^{n+1}, y^{n+1})) ≥ ‖y^{n+1} − y*‖² − ‖y^n − y*‖². [sent-603, score-0.232]
74 Adding (42) to (41), we get 2ρ(2F(x*, y*) − F(x^{n+1}, y^{n+1}) − F(x^{n+1}, ȳ^{n+1})) ≥ ‖ȳ^{n+1} − y*‖² − ‖y^n − y*‖². [sent-606, score-0.116]
75 Since ȳ^{n+1} = y^n, it follows that 2ρ(F(x*, y*) − F(x^{n+1}, ȳ^{n+1})) ≥ ‖ȳ^{n+1} − y*‖² − ‖y^n − y*‖². [sent-608, score-0.232]
76 , k − 1 and observing that 2|I| + |I^c| = k + k_n, we obtain 2ρ[(k + k_n)F(x*, y*) − ∑_{n=0}^{k−1} F(x^{n+1}, y^{n+1}) − ∑_{n∈I} F(x^{n+1}, ȳ^{n+1})] ≥ ∑_{n=0}^{k−1} (‖ȳ^{n+1} − y*‖² − ‖ȳ^n − y*‖²) = ‖ȳ^k − y*‖² − ‖ȳ^0 − y*‖² ≥ −‖ȳ^0 − y*‖². (45) [sent-612, score-0.156]
77 (46) Similarly, for n ∈ I, if we let (x, y) = (x^n, y^n) instead of (x*, y*) in (41), we have 2ρ(F(x^n, y^n) − F(x^{n+1}, y^{n+1})) ≥ ‖y^{n+1} − y^n‖² ≥ 0. [sent-614, score-0.348]
78 (47) For n ∈ I^c, y^{n+1} = y^n; from (15), since x^{n+1} = arg min_x F(x, y) with y = y^n = y^{n+1}, 2ρ(F(x^n, y^n) − F(x^{n+1}, y^{n+1})) ≥ 0. [sent-615, score-0.418]
79 (48) Hence, from (46)–(48), F(x^n, y^n) ≥ F(x^n, ȳ^n) ≥ F(x^{n+1}, y^{n+1}) ≥ F(x^{n+1}, ȳ^{n+1}). [sent-616, score-0.232]
80 Then, we have ∑_{n=0}^{k−1} F(x^{n+1}, y^{n+1}) ≥ k F(x^k, y^k) and ∑_{n∈I} F(x^{n+1}, ȳ^{n+1}) ≥ k_n F(x^k, ȳ^k). (49) [sent-617, score-0.624]
81 Combining (45) and (49) yields 2ρ(k + k_n)(F(x*, y*) − F(x^k, ȳ^k)) ≥ −‖ȳ^0 − y*‖². [sent-618, score-0.322]
82 Derivation of the Stopping Criteria In this section, we show that the quantities that we use in our stopping criteria correspond to the primal and dual residuals (Boyd et al. [sent-620, score-0.095]
83 , 2010) for the outer iterations and the gradient residuals for the inner iterations. [sent-621, score-0.16]
84 (52) Since y^{k+1} = z^k, the primal residual is thus ȳ^{k+1} − y^{k+1} = ȳ^{k+1} − z^k. [sent-624, score-0.539]
85 It follows from the optimality of x^{k+1} in Line 3 of Algorithm 6 that A^T(Ax^{k+1} − b) − C^T v^l + (1/µ)C^T(Cx^{k+1} − ȳ^{k+1}) + (1/µ)C^T(ȳ^{k+1} − z^k) = 0 ⇒ ∇_x f(x^{k+1}, ȳ^{k+1}) = (1/µ)C^T(z^k − ȳ^{k+1}). [sent-625, score-0.336]
86 Similarly, from the optimality of ȳ^{k+1} in Line 4, we have that 0 ∈ ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, z^k) + (1/ρ)(ȳ^{k+1} − z^k) = ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, ȳ^{k+1}) − (1/µ)(ȳ^{k+1} − z^k) + (1/ρ)(ȳ^{k+1} − z^k) = ∂g(ȳ^{k+1}) + ∇_y f(x^{k+1}, ȳ^{k+1}), where the last step follows from µ = ρ. [sent-626, score-0.872]
87 Hence, we see that (1/µ)C^T(z^k − ȳ^{k+1}) is the gradient residual corresponding to (51), while (52) is satisfied in every inner iteration. [sent-627, score-0.114]
88 APLM-S: The primal residual is ȳ^{k+1} − y^{k+1} from (50). [sent-628, score-0.103]
89 Clearly, the primal residual is (x̄^{k+1} − z^k_x, ȳ^{k+1} − z^k_y) since (x^{k+1}, y^{k+1}) ≡ (z^k_x, z^k_y). [sent-631, score-0.757]
90 From the optimality of (x̄^{k+1}, ȳ^{k+1}), it follows that 0 = ∇_x f(z^k_x, z^k_y) + (1/ρ)(x̄^{k+1} − z^k_x), and 0 ∈ ∂g(ȳ^{k+1}) + ∇_y f(z^k_x, z^k_y) + (1/ρ)(ȳ^{k+1} − z^k_y). [sent-632, score-0.872]
91 Here, we simply use (1/ρ)(x̄^{k+1} − z^k_x) and (1/ρ)(ȳ^{k+1} − z^k_y) to approximate the gradient residuals. [sent-633, score-0.459]
92 The dual residual is ∇L(x^{l+1}) − C^T(v^l − (1/µ)(Cx^{l+1} − y^{l+1})), recalling that v^{l+1} = v^l − (1/µ)(Cx^{l+1} − y^{l+1}). [sent-637, score-0.204]
93 The above, together with ∂Ω̃(y^{l+1}) + v^l − (1/µ)(Cx^{l+1} − y^{l+1}), is simply the gradient of the augmented Lagrangian (5) evaluated at (x^l, y^l, v^l). [sent-638, score-0.495]
94 Now, since the objective function of an inner iteration is the augmented Lagrangian with v = vl , the dual residual for an outer iteration is readily available from the gradient residual computed for the last inner iteration of the outer iteration. [sent-639, score-0.703]
95 Consistency of the group lasso and multiple kernel learning. [sent-794, score-0.108]
96 Fast alternating linearization methods for minimizing the sum of two convex functions. [sent-1047, score-0.093]
97 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-1186, score-0.137]
98 The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. [sent-1193, score-0.15]
99 A primal-dual algorithm for group sparse regularization with overlapping groups. [sent-1230, score-0.11]
100 Group lasso with overlaps: the latent group lasso approach. [sent-1248, score-0.167]
wordName wordTfidf (topN-words)
[('adal', 0.538), ('fista', 0.495), ('yk', 0.302), ('zk', 0.218), ('yl', 0.147), ('vl', 0.118), ('oldfarb', 0.116), ('proxflow', 0.116), ('proxgrad', 0.116), ('yn', 0.116), ('irection', 0.109), ('lternating', 0.109), ('xk', 0.101), ('cx', 0.1), ('parsity', 0.093), ('tructured', 0.093), ('augmented', 0.089), ('outer', 0.083), ('lagrangian', 0.08), ('skipping', 0.073), ('mairal', 0.068), ('ct', 0.068), ('cpu', 0.067), ('ethods', 0.065), ('sl', 0.063), ('cxl', 0.061), ('dct', 0.061), ('residual', 0.06), ('lasso', 0.059), ('xl', 0.056), ('approxauglagmin', 0.055), ('ax', 0.054), ('ista', 0.052), ('iters', 0.052), ('linearization', 0.05), ('group', 0.049), ('algs', 0.048), ('ys', 0.047), ('primal', 0.043), ('breast', 0.043), ('arg', 0.043), ('alternating', 0.043), ('goldfarb', 0.042), ('overlapping', 0.041), ('combettes', 0.041), ('jenatton', 0.041), ('avg', 0.041), ('proximal', 0.041), ('ly', 0.037), ('dy', 0.037), ('xn', 0.035), ('multiplier', 0.035), ('pcg', 0.035), ('eckstein', 0.034), ('miny', 0.034), ('pesquet', 0.034), ('iteration', 0.033), ('inner', 0.031), ('ws', 0.03), ('obozinski', 0.029), ('chen', 0.029), ('structured', 0.029), ('rl', 0.029), ('groups', 0.027), ('aplm', 0.027), ('breastcancerdata', 0.027), ('dyn', 0.027), ('ogl', 0.027), ('peyre', 0.027), ('subroutine', 0.027), ('rx', 0.027), ('minx', 0.027), ('jacob', 0.027), ('cancer', 0.027), ('ry', 0.026), ('dual', 0.026), ('lagrange', 0.026), ('stopping', 0.026), ('scheme', 0.026), ('penalty', 0.025), ('bertsekas', 0.025), ('ic', 0.024), ('subproblem', 0.024), ('rmse', 0.024), ('qin', 0.023), ('updating', 0.023), ('iterations', 0.023), ('gradient', 0.023), ('dynamic', 0.022), ('scalability', 0.022), ('sec', 0.022), ('boyd', 0.021), ('subproblems', 0.021), ('lipschitz', 0.021), ('fob', 0.02), ('ttkk', 0.02), ('cholesky', 0.02), ('kn', 0.02), ('accelerated', 0.02), ('regularization', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1/l2-norm and the l1/l∞-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require O(1/√ε) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
2 0.061065122 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
3 0.048823543 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
Author: Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján
Abstract: We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: “what are the implicit statistical assumptions of feature selection criteria based on mutual information?”. To answer this, we adopt a different strategy than is usual in the feature selection literature—instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature ‘relevancy’ and ‘redundancy’, our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples. Keywords: feature selection, mutual information, conditional likelihood
4 0.048413597 39 jmlr-2012-Estimation and Selection via Absolute Penalized Convex Minimization And Its Multistage Adaptive Applications
Author: Jian Huang, Cun-Hui Zhang
Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity
5 0.036154177 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
6 0.036054522 34 jmlr-2012-Dynamic Policy Programming
7 0.035404529 97 jmlr-2012-Regularization Techniques for Learning with Matrices
8 0.034005586 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
9 0.03335809 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
10 0.032844275 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
11 0.031912763 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
12 0.031341642 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
13 0.02846716 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
14 0.028112976 87 jmlr-2012-PAC-Bayes Bounds with Data Dependent Priors
15 0.026249411 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
16 0.02562706 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
17 0.025085483 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
18 0.024608731 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
19 0.023703603 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
20 0.02334537 91 jmlr-2012-Plug-in Approach to Active Learning
topicId topicWeight
[(0, -0.121), (1, 0.029), (2, -0.016), (3, -0.015), (4, 0.013), (5, 0.046), (6, -0.002), (7, -0.086), (8, -0.003), (9, 0.007), (10, -0.003), (11, 0.066), (12, 0.044), (13, -0.011), (14, 0.051), (15, -0.006), (16, 0.017), (17, 0.027), (18, 0.041), (19, 0.082), (20, -0.039), (21, 0.002), (22, 0.147), (23, -0.049), (24, -0.033), (25, -0.074), (26, 0.124), (27, -0.077), (28, 0.15), (29, -0.05), (30, 0.002), (31, 0.231), (32, 0.278), (33, -0.036), (34, 0.374), (35, 0.142), (36, 0.124), (37, 0.026), (38, 0.048), (39, -0.178), (40, -0.06), (41, 0.018), (42, -0.081), (43, 0.031), (44, -0.054), (45, -0.017), (46, -0.019), (47, -0.105), (48, -0.004), (49, -0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.95806849 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1/l2-norm and the l1/l∞-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require O(1/√ε) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
2 0.56733739 27 jmlr-2012-Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection
Author: Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján
Abstract: We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: “what are the implicit statistical assumptions of feature selection criteria based on mutual information?”. To answer this, we adopt a different strategy than is usual in the feature selection literature—instead of trying to define a criterion, we derive one, directly from a clearly specified objective function: the conditional likelihood of the training labels. While many hand-designed heuristic criteria try to optimize a definition of feature ‘relevancy’ and ‘redundancy’, our approach leads to a probabilistic framework which naturally incorporates these concepts. As a result we can unify the numerous criteria published over the last two decades, and show them to be low-order approximations to the exact (but intractable) optimisation problem. The primary contribution is to show that common heuristics for information based feature selection (including Markov Blanket algorithms as a special case) are approximate iterative maximisers of the conditional likelihood. A large empirical study provides strong evidence to favour certain classes of criteria, in particular those that balance the relative size of the relevancy/redundancy terms. Overall we conclude that the JMI criterion (Yang and Moody, 1999; Meyer et al., 2008) provides the best tradeoff in terms of accuracy, stability, and flexibility with small data samples. Keywords: feature selection, mutual information, conditional likelihood
3 0.3079434 111 jmlr-2012-Structured Sparsity and Generalization
Author: Andreas Maurer, Massimiliano Pontil
Abstract: We present a data dependent generalization bound for a large class of regularized algorithms which implement structured sparsity constraints. The bound can be applied to standard squared-norm regularization, the Lasso, the group Lasso, some versions of the group Lasso with overlapping groups, multiple kernel learning and other regularization schemes. In all these cases competitive results are obtained. A novel feature of our bound is that it can be applied in an infinite dimensional setting such as the Lasso in a separable Hilbert space or multiple kernel learning with a countable number of kernels. Keywords: empirical processes, Rademacher average, sparse estimation.
4 0.26467291 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
Author: Jian Huang, Cun-Hui Zhang
Abstract: The ℓ1 -penalized method, or the Lasso, has emerged as an important tool for the analysis of large data sets. Many important results have been obtained for the Lasso in linear regression which have led to a deeper understanding of high-dimensional statistical problems. In this article, we consider a class of weighted ℓ1 -penalized estimators for convex loss functions of a general form, including the generalized linear models. We study the estimation, prediction, selection and sparsity properties of the weighted ℓ1 -penalized estimator in sparse, high-dimensional settings where the number of predictors p can be much larger than the sample size n. Adaptive Lasso is considered as a special case. A multistage method is developed to approximate concave regularized estimation by applying an adaptive Lasso recursively. We provide prediction and estimation oracle inequalities for single- and multi-stage estimators, a general selection consistency theorem, and an upper bound for the dimension of the Lasso estimator. Important models including the linear regression, logistic regression and log-linear models are used throughout to illustrate the applications of the general results. Keywords: variable selection, penalized estimation, oracle inequality, generalized linear models, selection consistency, sparsity
6 0.24737088 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
7 0.22298963 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
8 0.20125431 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
9 0.19873114 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
10 0.19790855 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
11 0.19045655 12 jmlr-2012-Active Clustering of Biological Sequences
12 0.17439666 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
13 0.16580917 55 jmlr-2012-Learning Algorithms for the Classification Restricted Boltzmann Machine
14 0.16506726 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
15 0.15633829 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
16 0.15514162 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
17 0.15373315 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
18 0.15194167 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
19 0.14647238 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
20 0.14645377 97 jmlr-2012-Regularization Techniques for Learning with Matrices
topicId topicWeight
[(7, 0.019), (21, 0.043), (26, 0.542), (29, 0.02), (35, 0.015), (49, 0.015), (56, 0.011), (64, 0.019), (69, 0.015), (75, 0.039), (77, 0.015), (92, 0.049), (96, 0.092)]
simIndex simValue paperId paperTitle
1 0.97745889 79 jmlr-2012-Oger: Modular Learning Architectures For Large-Scale Sequential Processing
Author: David Verstraeten, Benjamin Schrauwen, Sander Dieleman, Philemon Brakel, Pieter Buteneers, Dejan Pecevski
Abstract: Oger (OrGanic Environment for Reservoir computing) is a Python toolbox for building, training and evaluating modular learning architectures on large data sets. It builds on MDP for its modularity, and adds processing of sequential data sets, gradient descent training, several crossvalidation schemes and parallel parameter optimization methods. Additionally, several learning algorithms are implemented, such as different reservoir implementations (both sigmoid and spiking), ridge regression, conditional restricted Boltzmann machine (CRBM) and others, including GPU accelerated versions. Oger is released under the GNU LGPL, and is available from http: //organic.elis.ugent.be/oger. Keywords: Python, modular architectures, sequential processing
same-paper 2 0.88615113 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty l1/l2-norm and the l1/l∞-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building-blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require O(1/√ε) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
Author: Michael U. Gutmann, Aapo Hyvärinen
Abstract: We consider the task of estimating, from observed data, a probabilistic model that is parameterized by a finite number of parameters. In particular, we are considering the situation where the model probability density function is unnormalized. That is, the model is only specified up to the partition function. The partition function normalizes a model so that it integrates to one for any choice of the parameters. However, it is often impossible to obtain it in closed form. Gibbs distributions, Markov and multi-layer networks are examples of models where analytical normalization is often impossible. Maximum likelihood estimation can then not be used without resorting to numerical approximations which are often computationally expensive. We propose here a new objective function for the estimation of both normalized and unnormalized models. The basic idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise. With this approach, the normalizing partition function can be estimated like any other parameter. We prove that the new estimation method leads to a consistent (convergent) estimator of the parameters. For large noise sample sizes, the new estimator is furthermore shown to behave like the maximum likelihood estimator. In the estimation of unnormalized models, there is a trade-off between statistical and computational performance. We show that the new method strikes a competitive trade-off in comparison to other estimation methods for unnormalized models. As an application to real data, we estimate novel two-layer models of natural image statistics with spline nonlinearities. Keywords: statistics unnormalized models, partition function, computation, estimation, natural image
4 0.86528778 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
Author: Prateek Jain, Brian Kulis, Jason V. Davis, Inderjit S. Dhillon
Abstract: Metric and kernel learning arise in several machine learning applications. However, most existing metric learning algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are often limited to the transductive setting and do not generalize to new data points. In this paper, we study the connections between metric learning and kernel learning that arise when studying metric learning as a linear transformation learning problem. In particular, we propose a general optimization framework for learning metrics via linear transformations, and analyze in detail a special case of our framework—that of minimizing the LogDet divergence subject to linear constraints. We then propose a general regularized framework for learning a kernel matrix, and show it to be equivalent to our metric learning framework. Our theoretical connections between metric and kernel learning have two main consequences: 1) the learned kernel matrix parameterizes a linear transformation kernel function and can be applied inductively to new data points, 2) our result yields a constructive method for kernelizing most existing Mahalanobis metric learning formulations. We demonstrate our learning approach by applying it to large-scale real world problems in computer vision, text mining and semi-supervised kernel dimensionality reduction. Keywords: divergence metric learning, kernel learning, linear transformation, matrix divergences, logdet
5 0.56211352 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
Author: Chunhua Shen, Junae Kim, Lei Wang, Anton van den Hengel
Abstract: The success of many machine learning and pattern recognition methods relies heavily upon the identification of an appropriate distance metric on the input data. It is often beneficial to learn such a metric from the input training data, instead of using a default one such as the Euclidean distance. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite programming is often used to enforce this constraint, but does not scale well and is not easy to implement. B OOST M ETRIC is instead based on the observation that any positive semidefinite matrix can be decomposed into a linear combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting methods are easy to implement, efficient, and can accommodate various types of constraints. We extend traditional boosting algorithms in that its weak learner is a positive semidefinite matrix with trace and rank being one rather than a classifier or regressor. Experiments on various data sets demonstrate that the proposed algorithms compare favorably to those state-of-the-art methods in terms of classification accuracy and running time. Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, Lagrange duality, large margin nearest neighbor
6 0.53602451 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
7 0.53391844 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
8 0.52483135 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
9 0.49972194 18 jmlr-2012-An Improved GLMNET for L1-regularized Logistic Regression
10 0.49002314 98 jmlr-2012-Regularized Bundle Methods for Convex and Non-Convex Risks
11 0.48739463 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
12 0.47782198 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
13 0.47731516 65 jmlr-2012-MedLDA: Maximum Margin Supervised Topic Models
14 0.47665408 54 jmlr-2012-Large-scale Linear Support Vector Regression
15 0.47465461 21 jmlr-2012-Bayesian Mixed-Effects Inference on Classification Performance in Hierarchical Data Sets
16 0.46966758 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
17 0.46681309 103 jmlr-2012-Sampling Methods for the Nyström Method
18 0.46159053 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.44085309 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
20 0.43901163 119 jmlr-2012-glm-ie: Generalised Linear Models Inference & Estimation Toolbox