jmlr jmlr2011 jmlr2011-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
Reference: text
sentIndex sentText sentNum sentScore
1 To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. [sent-11, score-0.536]
2 On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. [sent-12, score-0.272]
3 We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. [sent-14, score-0.264]
4 Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers 1. [sent-15, score-0.349]
5 Our first strategy uses proximal gradient methods, which have proven to be effective in this context, essentially because of their fast convergence rates and their ability to deal with large problems (Nesterov, 2007; Beck and Teboulle, 2009). [sent-45, score-0.233]
6 The second strategy we consider exploits proximal splitting methods (see Combettes and Pesquet, 2008, 2010; Goldfarb and Ma, 2009; Tomioka et al. [sent-47, score-0.272]
7 This is the main contribution of the paper, which allows us to use proximal gradient methods in the context of structured sparsity. [sent-51, score-0.286]
8 • We present proximal splitting methods for solving structured sparse regularized problems. [sent-53, score-0.36]
9 Then, we illustrate our algorithms with different tasks: video background subtraction, estimation of hierarchical structures for dictionary learning of natural image patches (Jenatton et al. [sent-57, score-0.174]
10 Interestingly, this is not the first time that network flow optimization tools have been used to solve sparse regularized problems with proximal methods. [sent-67, score-0.293]
11 with a structured sparse prior, and topographic dictionary learning of natural image patches (Hyvärinen et al. [sent-71, score-0.287]
12 This paper extends an earlier version (Mairal et al., 2010b) by adding new experiments (CUR matrix factorization, wavelet image denoising and topographic dictionary learning), presenting the proximal splitting methods, providing the full proofs of the optimization results, and adding numerous discussions. [sent-75, score-0.56]
13 Section 3 is devoted to proximal gradient algorithms, and Section 4 to proximal splitting methods. [sent-95, score-0.505]
14 Here G is a set of groups of variables, which are subsets of {1, . . . , p}; the vector wg in R|g| represents the coefficients of w indexed by g in G , the scalars ηg are positive weights, and ‖·‖ denotes the ℓ2 - or ℓ∞ -norm. [sent-153, score-0.178]
15 • When the groups overlap, Ω is still a norm and sets groups of variables to zero together (Jenatton et al. [sent-170, score-0.168]
16 It is possible to show for instance that Ω(w) = min_{(z_g)_{g∈G} ∈ R_+^{|G|}} (1/2) ∑_{g∈G} ( η_g^2 ‖w_g‖_2^2 / z_g + z_g ). [sent-202, score-0.331]
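As a quick sanity check of this variational identity (our own toy illustration, not code from the paper), the inner minimization has the closed form z_g = η_g ‖w_g‖_2 for each group, which recovers Ω(w) = ∑_g η_g ‖w_g‖_2; the group structure, weights, and vector below are made-up values.

```python
import numpy as np

# Toy data: w in R^5 with two overlapping groups (hypothetical values).
w = np.array([0.5, -1.0, 2.0, 0.0, 3.0])
groups = [[0, 1, 2], [2, 3, 4]]    # overlapping: variable 2 belongs to both groups
eta = [1.0, 0.5]                   # positive weights eta_g

# Direct evaluation: Omega(w) = sum_g eta_g * ||w_g||_2.
omega_direct = sum(e * np.linalg.norm(w[g]) for g, e in zip(groups, eta))

# Variational form: the minimizer of (1/2) * (eta_g^2 ||w_g||^2 / z_g + z_g) over z_g > 0
# is z_g = eta_g * ||w_g||_2 (AM-GM), so plugging it in must give the same value.
z = [e * np.linalg.norm(w[g]) for g, e in zip(groups, eta)]
omega_variational = 0.5 * sum(
    e ** 2 * np.linalg.norm(w[g]) ** 2 / zg + zg
    for g, e, zg in zip(groups, eta, z)
)

print(omega_direct, omega_variational)  # both print the same number
```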
17 Following Jenatton et al. (2010a, 2011), who tackled the particular case of hierarchical norms, we propose to use proximal gradient methods, which we now introduce. [sent-227, score-0.256]
18 The simplest version of this class of methods linearizes at each iteration the function f around the current estimate w̃, and this estimate is updated as the (unique by strong convexity) solution of the proximal problem: min_{w∈R^p} f(w̃) + ∇f(w̃)^⊤(w − w̃) + λΩ(w) + (L/2)‖w − w̃‖_2^2. [sent-243, score-0.233]
19 5 In addition, when the nonsmooth term Ω is not present, the previous proximal problem exactly leads to the standard gradient update rule. [sent-246, score-0.233]
20 More generally, we define the proximal operator: Definition 1 (Proximal Operator) The proximal operator associated with our regularization term λΩ, which we denote by Prox_λΩ, is the function that maps a vector u ∈ R^p to the unique solution of min_{w∈R^p} (1/2)‖u − w‖_2^2 + λΩ(w). [sent-247, score-0.534]
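To show how this operator is typically used, here is a minimal proximal gradient (ISTA-style) loop; it is a sketch under our own naming, not the authors' implementation, and it only assumes a gradient callable for the smooth term f, a black-box prox callable, and a Lipschitz constant L giving the fixed step size 1/L. The least-squares f in the usage example is an illustrative choice.

```python
import numpy as np

def proximal_gradient(grad_f, prox, L, w0, n_iter=100):
    """Basic proximal gradient iteration: w <- Prox_{lambda*Omega/L}(w - grad_f(w)/L).

    grad_f : callable returning the gradient of the smooth term f at w
    prox   : callable implementing Prox_{lambda*Omega/L} (Definition 1, scaled by 1/L)
    L      : upper bound on the Lipschitz constant of grad_f
    """
    w = w0.copy()
    for _ in range(n_iter):
        w = prox(w - grad_f(w) / L)
    return w

# Illustrative use with f(w) = 0.5*||y - Xw||^2 and the l1-norm for Omega.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 10)), rng.standard_normal(20)
lam = 0.1
L = np.linalg.norm(X, 2) ** 2                      # Lipschitz constant of X^T(Xw - y)
prox_l1_step = lambda u: np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)
w_hat = proximal_gradient(lambda w: X.T @ (X @ w - y), prox_l1_step, L, np.zeros(10))
```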
21 What makes proximal methods appealing to solve sparse decomposition problems is that this operator can often be computed in closed form. [sent-249, score-0.336]
22 For instance: • When Ω is the ℓ1 -norm, that is, Ω(w) = ‖w‖_1, the proximal operator is the well-known elementwise soft-thresholding operator: ∀ j ∈ {1, . . . , p}, u_j ↦ sign(u_j) max(|u_j| − λ, 0). [sent-250, score-0.277]
23 When the groups in G are disjoint and form a partition of {1, . . . , p}, the proximal problem is separable in every group, and the solution is a generalization of the soft-thresholding operator to groups of variables: [sent-257, score-0.508]
24 ∀g ∈ G, u_g ↦ u_g − Π_{‖·‖_2 ≤ λ}[u_g] = 0 if ‖u_g‖_2 ≤ λ, and ((‖u_g‖_2 − λ)/‖u_g‖_2) u_g otherwise. [sent-258, score-0.324]
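Both closed forms translate directly into a few lines of numpy. The snippet below is a hedged sketch with our own function names and a toy input, not the paper's code; the group version assumes disjoint groups, since separability fails once groups overlap.

```python
import numpy as np

def prox_l1(u, lam):
    """Elementwise soft-thresholding: prox of lam * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def prox_group_l2(u, groups, lam, eta):
    """Group soft-thresholding: prox of lam * sum_g eta_g ||w_g||_2, for DISJOINT groups.

    Each block u_g is shrunk toward zero: it vanishes when ||u_g||_2 <= lam*eta_g,
    and is rescaled by (1 - lam*eta_g/||u_g||_2) otherwise.
    """
    w = u.copy()
    for g, e in zip(groups, eta):
        norm = np.linalg.norm(u[g])
        w[g] = 0.0 if norm <= lam * e else (1.0 - lam * e / norm) * u[g]
    return w

u = np.array([0.3, -2.0, 1.5, -0.2])
print(prox_l1(u, lam=0.5))
print(prox_group_l2(u, groups=[[0, 1], [2, 3]], lam=0.5, eta=[1.0, 1.0]))
```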
25 Note, however, that fast convergence rates can also be achieved while solving approximately the proximal problem (see Schmidt et al. [sent-259, score-0.233]
26 g_{|G|}, and if we define Prox^g as (a) the proximal operator u_g ↦ Prox_{λη_g ‖·‖}(u_g) on the subspace corresponding to group g and (b) the identity on its orthogonal complement, Jenatton et al. [sent-277, score-0.392]
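The composition result alluded to here, valid when every two groups are either disjoint or nested (tree-structured collections), can be prototyped as follows. This is our own illustrative sketch for the ℓ2 case, not the authors' code: groups contained in other groups are applied first, which is enforced below by sorting groups by increasing size.

```python
import numpy as np

def prox_group(u, g, thresh):
    """Prox of thresh * ||.||_2 on the coordinates in g, identity elsewhere."""
    w = u.copy()
    norm = np.linalg.norm(u[g])
    w[g] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * u[g]
    return w

def prox_tree(u, groups, lam, eta):
    """Compose group proximal operators for tree-structured (nested or disjoint) groups.

    Sorting by increasing group size respects inclusion, so deeper (smaller) groups
    are processed before the larger groups that contain them.
    """
    order = sorted(range(len(groups)), key=lambda i: len(groups[i]))
    w = u.copy()
    for i in order:
        w = prox_group(w, groups[i], lam * eta[i])
    return w

# Toy hierarchy on 3 variables: leaf groups {0} and {1}, root group {0,1,2}.
u = np.array([0.5, 2.0, 10.0])
print(prox_tree(u, groups=[[0], [1], [0, 1, 2]], lam=1.0, eta=[1.0, 1.0, 1.0]))
```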
27 2 Dual of the Proximal Operator We now show that, for a set G of general overlapping groups, a convex dual of the proximal problem (3) can be reformulated as a quadratic min-cost flow problem. [sent-287, score-0.365]
28 The following lemma is due to Jenatton et al. (2010a, 2011): Lemma 2 (Dual of the proximal problem, Jenatton et al., 2010a, 2011). [sent-290, score-0.233]
29 For all arcs in E, we define a non-negative capacity constant, and as done classically in the network flow literature (Ahuja et al. [sent-310, score-0.234]
30 (5) Indeed: • the only arcs with a cost are those leading to the sink, which have the form (j, t), where j is the index of a variable in {1, . . . , p}. [sent-343, score-0.19]
31 By flow conservation, the flow on an arc (s, g) is ∑_{j∈g} ξ_j^g, which should be less than λη_g by capacity constraints; • all other arcs have the form (g, j), where g is in G and j is in g. [sent-348, score-0.338]
32 When some groups are included in others, the canonical graph can be simplified to yield a graph with a smaller number of edges. [sent-351, score-0.185]
33 By definition, a parametric max-flow problem consists in solving, for every value of a parameter, a max-flow problem on a graph whose arc capacities depend on this parameter. [sent-383, score-0.183]
34 Figure 1: Graph representation of simple proximal problems with different group structures G . [sent-388, score-0.267]
35 All other arcs in the graph have zero cost and infinite capacity. [sent-392, score-0.227]
36 Algorithm 1 Computation of the proximal operator for overlapping groups. [sent-397, score-0.335]
37 3: Return: w = u − ξ (optimal solution of the proximal problem). [sent-402, score-0.233]
38 if there exists j in Vu such that ξ_j ≠ γ_j then 5: Denote by (s,V + ) and (V − ,t) the two disjoint subsets of (V, s,t) separated by the minimum (s,t)-cut of the graph, and remove the arcs between V + and V − . [sent-410, score-0.19]
39 At this point, it is possible to show that the value of the optimal min-cost flow on all arcs between V + and V − is necessarily zero. [sent-419, score-0.19]
40 In our context, we use it to monitor the convergence of the proximal method through a duality gap, hence defining a proper optimality criterion for problem (1). [sent-455, score-0.299]
41 In the network problem associated with (7), the capacities on the arcs (s, g), g ∈ G , are set to τη_g, and the capacities on the arcs (j, t), j in {1, . . . , p}, are set to κ_j. [sent-469, score-0.464]
42 Solving problem (7) amounts to finding the smallest value of τ, such that there exists a flow saturating all the capacities κ j on the arcs leading to the sink t. [sent-473, score-0.256]
43 Function dualNorm(V = Vu ∪Vgr , E) 1: τ ← (∑ j∈Vu κ j )/(∑g∈Vgr ηg ) and set the capacities of arcs (s, g) to τηg for all g in Vgr . [sent-481, score-0.232]
44 Optimization with Proximal Splitting Methods We now present proximal splitting algorithms (see Combettes and Pesquet, 2008, 2010; Tomioka et al. [sent-488, score-0.272]
45 The key is to introduce additional variables z_g in R^{|g|}, one for every group g in G, and equivalently reformulate Equation (1) as: min_{w∈R^p, (z_g∈R^{|g|})_{g∈G}} f(w) + λ ∑_{g∈G} η_g ‖z_g‖ s.t. ∀g ∈ G, z_g = w_g. (8) [sent-501, score-0.256]
46 The issue of overlapping groups is removed, but new constraints are added, and as in Section 3, the method introduces additional variables which induce a memory cost of O(∑_{g∈G} |g|). [sent-503, score-0.458]
47 It introduces dual variables ν_g in R^{|g|} for all g in G, and defines the augmented Lagrangian: L(w, (z_g)_{g∈G}, (ν_g)_{g∈G}) ≜ f(w) + ∑_{g∈G} ( λη_g ‖z_g‖ + ν_g^⊤(z_g − w_g) + (γ/2)‖z_g − w_g‖_2^2 ), where γ > 0 is a parameter. [sent-507, score-0.401]
48 We are not aware of any efficient algorithm providing the exact solution of the proximal operator associated to a sum of ℓ2 -norms, which would be necessary for using (accelerated) proximal gradient methods. [sent-513, score-0.51]
49 , 2010a, 2011), but such a procedure would be computationally expensive and would require being able to deal with approximate computations of the proximal operators (e. [sent-517, score-0.233]
50 (2010) for computing the proximal operator associated to hierarchical norms, and independently in the same context as ours by Boyd et al. [sent-524, score-0.3]
51 The augmented Lagrangian is in fact the classical Lagrangian (see Boyd and Vandenberghe, 2004) of the following optimization problem, which is equivalent to Equation (8): min_{w∈R^p, (z_g∈R^{|g|})_{g∈G}} f(w) + λ ∑_{g∈G} η_g ‖z_g‖ + ∑_{g∈G} (γ/2) ‖z_g − w_g‖_2^2 s.t. ∀g ∈ G, z_g = w_g. [sent-527, score-0.271]
52 The solution can be obtained in closed form: for all g in G, z_g ← prox_{(λη_g/γ)‖·‖}[ w_g − ν_g/γ ]. [sent-532, score-0.168]
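Putting the updates together, the ADMM scheme for the reformulation (8) can be sketched as below for a least-squares data-fitting term f(w) = 0.5‖y − Xw‖². This is an illustrative outline with our own naming and a dense linear solver, not the authors' implementation; the w-update reduces to a linear system only because f is quadratic here, and D counts how many groups contain each variable.

```python
import numpy as np

def prox_l2(v, thresh):
    """Prox of thresh * ||.||_2 (group soft-thresholding)."""
    n = np.linalg.norm(v)
    return np.zeros_like(v) if n <= thresh else (1.0 - thresh / n) * v

def admm_overlapping_groups(X, y, groups, lam, eta, gamma=1.0, n_iter=200):
    """ADMM on:  min_w 0.5*||y - Xw||^2 + lam * sum_g eta_g ||z_g||_2  s.t. z_g = w_g."""
    n, p = X.shape
    z = [np.zeros(len(g)) for g in groups]     # one copy z_g per group
    nu = [np.zeros(len(g)) for g in groups]    # dual variables nu_g
    D = np.zeros(p)                            # D[j] = number of groups containing j
    for g in groups:
        D[g] += 1.0
    A = X.T @ X + gamma * np.diag(D)           # fixed matrix of the w-update
    for _ in range(n_iter):
        # w-update: (X^T X + gamma*D) w = X^T y + sum_g S_g^T (gamma*z_g + nu_g)
        b = X.T @ y
        for g, zg, nug in zip(groups, z, nu):
            b[g] += gamma * zg + nug
        w = np.linalg.solve(A, b)
        # z-update: z_g <- prox_{(lam*eta_g/gamma) ||.||_2}( w_g - nu_g/gamma )
        z = [prox_l2(w[g] - nug / gamma, lam * e / gamma)
             for g, e, nug in zip(groups, eta, nu)]
        # dual update: nu_g <- nu_g + gamma * (z_g - w_g)
        nu = [nug + gamma * (zg - w[g]) for g, zg, nug in zip(groups, z, nu)]
    return w
```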
53 It is easy to show that every step can be obtained efficiently, as long as one knows how to compute the proximal operator associated to the functions f˜i in closed form. [sent-552, score-0.277]
54 2 Wavelet Denoising with Structured Sparsity We now illustrate the results of Section 3, where a single large-scale proximal operator (p ≈ 250 000) associated to a sum of ℓ∞ -norms has to be computed. [sent-608, score-0.277]
55 Considering a natural quad-tree for wavelet coefficients (see Mallat, 1999), this norm takes the form of Equation (2) with one group per wavelet coefficient that contains the coefficient and all its descendants in the tree. [sent-631, score-0.194]
56 • a sum of ℓ∞ -norms with overlapping groups representing 2 × 2 spatial neighborhoods in the wavelet domain. [sent-633, score-0.192]
57 This regularization encourages neighboring wavelet coefficients to be set to zero together, which was also exploited in the past in block-thresholding approaches for wavelet denoising (Cai, 1999). [sent-634, score-0.189]
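For concreteness, the overlapping 2×2 neighborhood groups behind Ωgrid can be enumerated in a few lines. The snippet is only a hypothetical illustration of how such a group structure might be encoded (flat index lists over an h-by-w grid of wavelet coefficients), not the data structure used in the paper.

```python
def grid_groups_2x2(h, w):
    """All overlapping 2x2 neighborhoods of an h-by-w grid, as lists of flat indices."""
    idx = lambda i, j: i * w + j
    groups = []
    for i in range(h - 1):
        for j in range(w - 1):
            groups.append([idx(i, j), idx(i, j + 1), idx(i + 1, j), idx(i + 1, j + 1)])
    return groups

groups = grid_groups_2x2(4, 4)   # toy 4x4 grid -> 9 overlapping groups of 4 coefficients
print(len(groups), groups[0])
```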
58 Table 1: PSNR measured for the denoising of 12 standard images when the regularization function is the ℓ1 -norm, the tree-structured norm Ωtree , and the structured norm Ωgrid , and improvement in PSNR compared to the ℓ1 -norm (IPSNR). [sent-708, score-0.172]
59 Solving exactly the proximal problem with Ωgrid for an image with p = 512 × 512 = 262 144 pixels (and therefore approximately the same number of groups) takes approximately 4 to 6 seconds on a single core of a 3. [sent-713, score-0.266]
60 Problem (9) has a smooth convex data-fitting term and brings into play a sparsity-inducing norm with overlapping groups of variables (the rows and the columns of W). [sent-735, score-0.185]
61 We therefore put an additional structured regularization term Ω on e, where the groups in G are all the overlapping 3×3-squares on the image. [sent-761, score-0.204]
62 The problem of dictionary learning, originally introduced by Olshausen and Field (1996), is a matrix factorization problem which aims at representing these signals as linear combinations of dictionary elements that are the columns of a matrix X = [x1 , . . . , x p ]. [sent-817, score-0.164]
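To recall how the proximal tools above plug into this factorization, here is a schematic alternating scheme: a sparse coding step (proximal gradient on the codes for a fixed dictionary) followed by a simple dictionary update with column renormalization. It is a simplified sketch under our own notation (Y holds one signal per column, prox is any of the operators discussed earlier), not the algorithm used in the paper.

```python
import numpy as np

def dict_learning_sketch(Y, p, prox, lam, n_outer=20, n_inner=50):
    """Alternate between sparse coding of the columns of Y and a dictionary update.

    Y    : m x n matrix of signals (one signal per column)
    p    : number of dictionary elements
    prox : callable (w, t) -> Prox_{t*Omega}(w), applied column-wise to the codes
    """
    m, n = Y.shape
    rng = np.random.default_rng(0)
    X = rng.standard_normal((m, p))
    X /= np.linalg.norm(X, axis=0)                     # unit-norm dictionary columns
    W = np.zeros((p, n))
    for _ in range(n_outer):
        # Sparse coding: for fixed X, minimize 0.5*||y_i - X w_i||^2 + lam*Omega(w_i).
        L = max(np.linalg.norm(X, 2) ** 2, 1e-12)
        for _ in range(n_inner):
            G = X.T @ (X @ W - Y)
            W = np.column_stack(
                [prox(W[:, i] - G[:, i] / L, lam / L) for i in range(n)]
            )
        # Dictionary update: least squares in X, then renormalize the columns.
        X = Y @ np.linalg.pinv(W)
        X /= np.maximum(np.linalg.norm(X, axis=0), 1e-12)
    return X, W
```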
63 (2009), such a result can be reproduced with a dictionary learning formulation, using a structured norm for Ω. [sent-841, score-0.165]
64 Following their formulation, we organize the p dictionary elements on a √p × √p grid, and consider p overlapping groups that are 3 × 3 or 4 × 4 spatial neighborhoods on the grid (to avoid boundary effects, we assume the grid to be cyclic). [sent-842, score-0.209]
65 (2009) uses a subgradient descent algorithm to solve them; we use the proximal splitting methods presented in Section 4 instead. [sent-881, score-0.272]
66 (2010a), the dictionary elements are embedded in a predefined tree T , via a particular instance of the structured norm Ω, which we refer to as Ωtree , and call G the underlying set of groups. [sent-896, score-0.198]
67 Our experiments use the algorithm of Beck and Teboulle (2009) based on our proximal operator, with weights ηg set to 1. [sent-937, score-0.233]
68 Conclusion We have presented new optimization methods for solving sparse structured problems involving sums of ℓ2 - or ℓ∞ -norms of any (overlapping) groups of variables. [sent-940, score-0.182]
69 In particular, the proximal operator for the sum of ℓ∞ -norms can be cast as a specific form of quadratic min-cost flow problem, for which we proposed an efficient and simple algorithm. [sent-942, score-0.277]
70 In addition to making it possible to resort to accelerated gradient methods, an efficient computation of the proximal operator offers more generally a certain modularity, in that it can be used as a building-block for other optimization problems. [sent-952, score-0.302]
71 A case in point is dictionary learning where proximal problems come up and have to be solved repeatedly in an inner-loop. [sent-953, score-0.315]
72 (2009), or total-variation based penalties, whose proximal operators are also based on minimum cost flow problems (Chambolle and Darbon, 2009). [sent-955, score-0.233]
73 Let G′ = (V, E ′ , s,t) be a graph sharing the same set of vertices, source and sink as G, but with a different arc set E ′ . [sent-963, score-0.165]
74 • For every arc (g, j) in E, with (g, j) in Vgr × Vu , there exists a unique path in E ′ from g to j with zero costs and infinite capacities on every arc of the path. [sent-966, score-0.273]
75 • Conversely, if there exists a path in E ′ between a vertex g in Vgr and a vertex j in Vu , then there exists an arc (g, j) in E. [sent-967, score-0.181]
76 Moreover, the values of the optimal flow on the arcs ( j,t), j in Vu , are the same on G and G′ . [sent-969, score-0.19]
77 Proof We first notice that on both G and G′ , the cost of a flow on the graph only depends on the flow on the arcs ( j,t), j in Vu , which we have denoted by ξ in E. [sent-970, score-0.227]
78 We now use the concept of path flow, which is a flow vector in G carrying the same positive value on every arc of a directed path between two nodes of G. [sent-972, score-0.174]
79 According to the definition of graph equivalence introduced in the Lemma, it is easy to show that there is a bijection between the arcs in E, and the paths in E ′ with positive capacities on every arc. [sent-974, score-0.269]
80 More precisely, for every arc a in E, we consider its equivalent path in E ′ , with a path flow carrying the same amount of flow as a. [sent-976, score-0.174]
81 We now build a flow π in G, by associating to each path flow in the decomposition of π′ , an arc in E carrying the same amount of flow. [sent-982, score-0.175]
82 Convergence Analysis We show in this section the correctness of Algorithm 1 for computing the proximal operator, and of Algorithm 2 for computing the dual norm Ω⋆ . [sent-986, score-0.309]
83 1 Computation of the Proximal Operator We first prove that our algorithm converges and that it finds the optimal solution of the proximal problem. [sent-988, score-0.233]
84 (Jenatton et al., 2010a, 2011) The primal-dual variables (w, ξ) are respectively solutions of the primal (3) and dual (4) problems if and only if the dual variable ξ is feasible for the problem (4) and w = u − ∑_{g∈G} ξ^g, with, for every g in G, either w_g^⊤ ξ^g_g = ‖w_g‖_∞ ‖ξ^g_g‖_1 or w_g = 0. [sent-992, score-0.34]
85 Conversely, arcs from V + to t are all saturated, whereas there can be non-saturated arcs from V − to t. [sent-999, score-0.38]
86 • The cut goes through all arcs going from V + to t, and all arcs going from s to V − . [sent-1003, score-0.432]
87 Arcs going from s to V − are saturated, as well as arcs going from V + to t. [sent-1006, score-0.242]
88 Proposition 7 (Correctness of Algorithm 1) Algorithm 1 solves the proximal problem of Equation (3). [sent-1028, score-0.233]
89 We first notice that g ∩Vu+ = g since there are no arcs between V + and V − in E (see the properties of the cuts discussed before this proposition). [sent-1056, score-0.19]
90 This result is relatively intuitive: (s,V + ) and (V − ,t) being an (s,t)-cut, all arcs between s and V − are saturated, while there are unsaturated arcs between s and V + ; one therefore expects the residuals u j − ξ j to decrease on the V + side, while increasing on the V − side. [sent-1060, score-0.38]
91 The sets Vu++ , Vu+− , Vgr++ and Vgr+− are defined by comparing ‖wg‖∞ with max_{j∈Vu} |u_j − γ_j|; as previously, we denote V +− ≜ Vu+− ∪ Vgr+− and V ++ ≜ Vu++ ∪ Vgr++ . [sent-1070, score-0.218]
92 According to the definition of the different sets above, we observe that no arcs are going from V ++ to V +− , that is, for all g in Vgr++ , g ∩ Vu+− = ∅. [sent-1073, score-0.216]
93 j∈Vu+− +− Let j ∈ Vu+− , if ξ j = 0 then for some g ∈ Vgr such that j receives some flow from g, which +− from the optimality conditions (15) implies w j = wg ∞ ; by definition of Vgr , wg ∞ > u j − γ j . [sent-1075, score-0.247]
94 2 Computation of the Dual Norm Ω⋆ As for the proximal operator, the computation of the dual norm Ω⋆ can itself be shown to solve another network flow problem, based on the following variational formulation, which extends a previous result from Jenatton et al. [sent-1087, score-0.309]
95 ∀ g ∈ G , ‖z_g‖_∞ ≤ α_g and ∑_{g∈G} η_g α_g ≤ 1, that is, with the additional |G| conic constraints ‖z_g‖_∞ ≤ α_g. [sent-1096, score-0.222]
96 The algorithm splits the vertex set V into two nonempty parts V + and V − and we remark that there are no arcs going from V + to V − , and no flow going from V − to V + . [sent-1132, score-0.269]
97 Since the arcs going from s to V − are saturated, we have − that ∑g∈Vgr τηg ≤ ∑ j∈Vu− κ j . [sent-1133, score-0.216]
98 Since there are no arcs going from V + to V − , τ⋆ is feasible for Equation (17) when replacing V by V − and we have that τ⋆ ≥ τ′ and then τ′ = τ⋆ . [sent-1137, score-0.216]
99 Moreover, we refer to the proximal operator associated with λΩ as proxλΩ . [sent-1149, score-0.277]
100 A proximal decomposition method for solving convex variational inverse problems. [sent-1372, score-0.285]
wordName wordTfidf (topN-words)
[('vu', 0.484), ('vgr', 0.46), ('ow', 0.312), ('proximal', 0.233), ('jenatton', 0.201), ('arcs', 0.19), ('etwork', 0.118), ('cur', 0.116), ('admm', 0.112), ('zg', 0.111), ('wg', 0.109), ('airal', 0.106), ('bozinski', 0.106), ('arc', 0.104), ('onvex', 0.1), ('enatton', 0.095), ('xw', 0.092), ('computeflow', 0.087), ('tructured', 0.083), ('dictionary', 0.082), ('ug', 0.081), ('ptimization', 0.077), ('parsity', 0.077), ('groups', 0.069), ('bach', 0.067), ('wavelet', 0.065), ('proxflow', 0.062), ('mairal', 0.061), ('overlapping', 0.058), ('prox', 0.057), ('structured', 0.053), ('hochbaum', 0.05), ('mahoney', 0.05), ('topographic', 0.048), ('dual', 0.046), ('capacity', 0.044), ('operator', 0.044), ('drineas', 0.043), ('pesquet', 0.043), ('capacities', 0.042), ('tomioka', 0.042), ('canonical', 0.042), ('combettes', 0.041), ('gj', 0.039), ('splitting', 0.039), ('kavukcuoglu', 0.038), ('obozinski', 0.038), ('cehver', 0.037), ('gallo', 0.037), ('graph', 0.037), ('duality', 0.037), ('patches', 0.036), ('sparse', 0.035), ('denoising', 0.035), ('beck', 0.034), ('goldberg', 0.034), ('group', 0.034), ('subtraction', 0.033), ('tree', 0.033), ('sg', 0.033), ('teboulle', 0.033), ('image', 0.033), ('ows', 0.032), ('psnr', 0.032), ('gap', 0.032), ('gv', 0.031), ('olshausen', 0.031), ('primal', 0.03), ('norm', 0.03), ('optimality', 0.029), ('foreground', 0.028), ('hong', 0.028), ('boyd', 0.028), ('convex', 0.028), ('vertex', 0.027), ('jacob', 0.027), ('dct', 0.026), ('saturated', 0.026), ('sprechmann', 0.026), ('norms', 0.026), ('augmented', 0.026), ('borwein', 0.026), ('going', 0.026), ('optimization', 0.025), ('bien', 0.025), ('ford', 0.025), ('garrigues', 0.025), ('simp', 0.025), ('xwi', 0.025), ('decomposition', 0.024), ('regularization', 0.024), ('sparsity', 0.024), ('carrying', 0.024), ('sink', 0.024), ('path', 0.023), ('hierarchical', 0.023), ('nesterov', 0.023), ('dictionaries', 0.023), ('equation', 0.022), ('bertsekas', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
2 0.34657097 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.12821434 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
4 0.075541474 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
5 0.073348656 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
6 0.071178056 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.057584289 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
8 0.05677478 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
9 0.049461201 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
10 0.045392018 97 jmlr-2011-Union Support Recovery in Multi-task Learning
11 0.041656815 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
12 0.036488529 101 jmlr-2011-Variable Sparsity Kernel Learning
13 0.033961389 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
14 0.033456415 50 jmlr-2011-LPmade: Link Prediction Made Easy
15 0.032875609 105 jmlr-2011-lp-Norm Multiple Kernel Learning
16 0.030344991 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
17 0.02873629 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
18 0.028720669 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
19 0.028419379 54 jmlr-2011-Learning Latent Tree Graphical Models
20 0.027798438 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
topicId topicWeight
[(0, 0.206), (1, 0.037), (2, 0.339), (3, 0.232), (4, -0.138), (5, 0.02), (6, 0.033), (7, -0.018), (8, 0.251), (9, -0.087), (10, -0.073), (11, 0.247), (12, 0.026), (13, 0.152), (14, 0.033), (15, -0.012), (16, 0.042), (17, -0.223), (18, 0.189), (19, 0.037), (20, 0.041), (21, 0.033), (22, 0.045), (23, 0.005), (24, -0.056), (25, -0.015), (26, 0.118), (27, 0.047), (28, -0.114), (29, -0.012), (30, -0.032), (31, -0.023), (32, 0.035), (33, 0.117), (34, 0.01), (35, -0.038), (36, -0.11), (37, 0.034), (38, -0.055), (39, 0.061), (40, -0.049), (41, 0.052), (42, -0.037), (43, -0.098), (44, -0.011), (45, 0.01), (46, -0.036), (47, -0.026), (48, -0.026), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95354265 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
2 0.82711953 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.55008096 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
4 0.34597391 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
5 0.237147 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
6 0.23407185 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
7 0.19795445 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
8 0.190368 101 jmlr-2011-Variable Sparsity Kernel Learning
9 0.18137668 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
10 0.17251475 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.17248377 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
12 0.16185847 50 jmlr-2011-LPmade: Link Prediction Made Easy
13 0.15564038 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.15369967 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
15 0.14994562 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
16 0.14742279 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
17 0.13948624 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
18 0.13217928 105 jmlr-2011-lp-Norm Multiple Kernel Learning
19 0.13121922 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
20 0.12763904 61 jmlr-2011-Logistic Stick-Breaking Process
topicId topicWeight
[(4, 0.06), (9, 0.256), (10, 0.036), (11, 0.015), (24, 0.025), (27, 0.011), (31, 0.066), (32, 0.016), (41, 0.024), (60, 0.016), (73, 0.028), (78, 0.051), (90, 0.024), (92, 0.267)]
simIndex simValue paperId paperTitle
same-paper 1 0.81902552 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
2 0.70192158 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.66143739 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity
4 0.54338664 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
5 0.47383302 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
6 0.45646918 59 jmlr-2011-Learning with Structured Sparsity
7 0.44356263 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
8 0.44231573 91 jmlr-2011-The Sample Complexity of Dictionary Learning
9 0.43938541 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
10 0.42584184 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
11 0.41711107 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.41163048 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.40762943 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.40233365 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
15 0.40124959 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
16 0.39112738 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
17 0.38817948 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.3824378 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
19 0.37980843 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
20 0.37915879 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals