jmlr jmlr2011 jmlr2011-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
Reference: text
sentIndex sentText sentNum sentScore
1 This leads to a specific set of allowed nonzero patterns for the solutions of such problems. [sent-9, score-0.32]
2 We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. [sent-10, score-0.563]
3 This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. [sent-11, score-0.224]
4 As mentioned above, the ℓ1 -norm focuses only on cardinality and cannot easily specify side information about the patterns of nonzero coefficients (“nonzero patterns”) induced in the solution, since they are all theoretically possible. [sent-69, score-0.32]
5 Group ℓ1 -norms (Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and Zhang, 2010) consider a partition of all variables into a certain number of subsets and penalize the sum of the Euclidean norms of each one, leading to selection of groups rather than individual variables. [sent-70, score-0.224]
6 Moreover, recent works have considered overlapping but nested groups in constrained situations such as trees and directed acyclic graphs (Zhao et al.). [sent-71, score-0.197]
7 In this paper, we consider all possible sets of groups and characterize exactly what type of prior knowledge can be encoded by considering sums of norms of overlapping groups of variables. [sent-74, score-0.399]
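To make the penalty concrete, here is a minimal sketch of such a norm Ω(w) = Σ_{G∈G} ‖d^G ∘ w‖₂, a sum of Euclidean norms over possibly overlapping groups. The uniform weights d^G used by default are an illustrative assumption, not the paper's weighting scheme.

```python
import numpy as np

def omega(w, groups, weights=None):
    """Structured norm Omega(w) = sum_G || d^G o w ||_2 over (possibly overlapping) groups.

    `groups` is a list of index lists; `weights` optionally gives, for each group,
    a vector of per-variable weights d^G (uniform weights are assumed by default).
    """
    total = 0.0
    for k, G in enumerate(groups):
        d = np.ones(len(G)) if weights is None else np.asarray(weights[k])
        total += np.linalg.norm(d * w[np.asarray(G)])
    return total

w = np.array([1.0, -2.0, 0.0, 0.5])
groups = [[0, 1], [1, 2, 3]]          # overlapping groups: variable 1 belongs to both
print(omega(w, groups))               # sqrt(1 + 4) + sqrt(4 + 0 + 0.25)
```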
8 Before describing how to go from groups to nonzero patterns (or equivalently zero patterns), we show that it is possible to “reverse-engineer” a given set of nonzero patterns, that is, to build the unique minimal set of groups that will generate these patterns. [sent-75, score-0.84]
9 As will be shown in Section 3, for each set of groups, a notion of hull of a nonzero pattern may be naturally defined. [sent-78, score-0.406]
10 For example, in the particular case of the two-dimensional planar grid considered in this paper, this hull is exactly the axis-aligned bounding box or, with additional orientations, the regular convex hull. [sent-79, score-0.239]
11 We show that, in our framework, the allowed nonzero patterns are exactly those equal to their hull, and that the hull of the relevant variables is consistently estimated under certain conditions, both in low and high-dimensional settings. [sent-80, score-0.532]
12 Groups and Sparsity Patterns: We now study the relationship between the norm Ω defined in Equation (1) and the nonzero patterns the estimated vector ŵ is allowed to have. [sent-157, score-0.345]
13 We first characterize the set of nonzero patterns, then we provide forward and backward procedures to go back and forth from groups to patterns. [sent-158, score-0.372]
14 Intuitively, for a certain subset of groups G′ ⊆ G, the vectors wG associated with the groups G ∈ G′ will be exactly equal to zero, leading to a set of zeros which is the union of these groups, ∪_{G∈G′} G. [sent-165, score-0.332]
15 Note that instead of considering the set of zero patterns Z, it is also convenient to manipulate nonzero patterns, and we define P = { ∩_{G∈G′} G^c ; G′ ⊆ G } = { Z^c ; Z ∈ Z }. [sent-169, score-0.32]
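Since zero patterns are unions of groups and nonzero patterns are their complements, a brute-force enumeration of Z and P can be sketched as follows; it is exponential in the number of groups and meant only for tiny examples, and the function name is illustrative.

```python
from itertools import combinations

def zero_and_nonzero_patterns(groups, p):
    """Enumerate Z = {union of G in G' : G' subset of G} and P = {Z^c : Z in Z}.

    Brute force over all subsets of groups; exponential in len(groups),
    intended only as an illustration on small examples.
    """
    full = frozenset(range(p))
    zero_patterns = set()
    for r in range(len(groups) + 1):
        for subset in combinations(groups, r):
            zero_patterns.add(frozenset().union(*map(frozenset, subset)))
    nonzero_patterns = {full - z for z in zero_patterns}
    return zero_patterns, nonzero_patterns

groups = [[0, 1], [1, 2], [3]]
Z, P = zero_and_nonzero_patterns(groups, p=4)
print(sorted(map(sorted, P)))
```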
16 For µ > 0, any solution of the problem in Equation (2) with at most k − 1 nonzero coefficients has a zero pattern in Z = { ∪_{G∈G′} G ; G′ ⊆ G } almost surely. [sent-208, score-0.216]
17 • ℓ2 -norm: the set of allowed nonzero patterns is composed of the empty set and the full set {1, . . . , p}. [sent-216, score-0.32]
18 Two natural questions now arise: (1) starting from the groups G , is there an efficient way to generate the set of nonzero patterns P ; (2) conversely, and more importantly, given P , how can the groups G —and hence the norm Ω(w)—be designed? [sent-224, score-0.677]
19 2 General Properties of G , Z and P We now study the different properties of the set of groups G and its corresponding sets of patterns Z and P . [sent-227, score-0.298]
20 Closedness: The set of zero patterns Z (respectively, the set of nonzero patterns P) is closed under union (respectively, intersection), that is, for all K ∈ N and all z1, . . . , zK ∈ Z, we have z1 ∪ · · · ∪ zK ∈ Z (and similarly, the intersection of any nonzero patterns remains in P). [sent-230, score-0.452]
21 For instance, if we consider a sequence (see Figure 4), we cannot take P to be the set of contiguous patterns with length two, since the intersection of two such patterns may result in a singleton (that does not belong to P). [sent-239, score-0.292]
22 For instance, if the set G is formed by all vertical and horizontal half-spaces when the variables are organized in a two-dimensional grid (see Figure 5), the hull of a subset I ⊂ {1, . . . , p} is simply the axis-aligned bounding box of I. [sent-261, score-0.212]
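Assuming, consistently with the definitions above, that the G-adapted hull of I is the complement of the union of all groups that do not intersect I, it can be computed directly; the sketch below illustrates this on the prefix/suffix interval groups of the sequence case (names are illustrative).

```python
def hull(I, groups, p):
    """G-adapted hull of I: complement of the union of all groups disjoint from I."""
    I = set(I)
    covered_by_zeros = set()
    for G in groups:
        if not I & set(G):          # group does not intersect I -> it may be set to zero
            covered_by_zeros |= set(G)
    return set(range(p)) - covered_by_zeros

# prefix and suffix intervals on a sequence of p = 5 variables (0-indexed)
p = 5
groups = [list(range(k)) for k in range(1, p)] + [list(range(k, p)) for k in range(1, p)]
print(hull({1, 3}, groups, p))      # -> {1, 2, 3}: the smallest contiguous pattern containing I
```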
23 When half-planes with all possible orientations are considered (some orientations, e.g., ±π/4, are shown in Figure 6), the hull becomes the regular convex hull. [sent-267, score-0.27]
24 We can also build the corresponding DAG for the set of zero patterns Z ⊃ G , which is a super-DAG of the DAG of groups (see Figure 3 for examples). [sent-273, score-0.298]
25 Note that we also obtain the isomorphic DAG for the nonzero patterns P, although it corresponds to the poset (P, ⊂): this DAG will be used in the active set algorithm presented in Section 4. [sent-274, score-0.442]
26 Namely, from a union-closed set of zero patterns Z, we can build back a minimal set of groups G by iteratively pruning away, in the DAG corresponding to Z, all sets that are unions of their parents. [sent-288, score-0.322]
27 Algorithm 1 Backward procedure Input: Intersection-closed family of nonzero patterns P . [sent-297, score-0.32]
28 Output: Collection of zero patterns Z and nonzero patterns P . [sent-316, score-0.452]
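A minimal sketch of the pruning idea behind this backward procedure, stated here on the (union-closed) family of zero patterns: the minimal groups are exactly the zero patterns that cannot be written as unions of the zero patterns strictly contained in them. This is an illustrative reconstruction, not the paper's implementation.

```python
def minimal_groups(zero_patterns):
    """Given a union-closed family of zero patterns, keep only the patterns that
    are NOT the union of the zero patterns strictly contained in them.
    These 'union-irreducible' elements form the minimal generating set of groups."""
    zero_patterns = [frozenset(z) for z in zero_patterns if z]   # drop the empty pattern
    minimal = []
    for z in zero_patterns:
        proper_subsets = [other for other in zero_patterns if other < z]
        union_of_children = frozenset().union(*proper_subsets) if proper_subsets else frozenset()
        if union_of_children != z:
            minimal.append(z)
    return minimal

# zero patterns generated by groups {0,1}, {1,2}, {3} (plus all their unions)
Z = [set(), {0, 1}, {1, 2}, {3}, {0, 1, 2}, {0, 1, 3}, {1, 2, 3}, {0, 1, 2, 3}]
print(sorted(map(sorted, minimal_groups(Z))))   # recovers [[0, 1], [1, 2], [3]]
```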
29 Sequences: Given p variables organized in a sequence, if we want only contiguous nonzero patterns, the backward algorithm will lead to the set of groups that are the left intervals [1, k] and the right intervals [k, p] of the sequence. [sent-328, score-0.422]
30 Imposing the contiguity of the nonzero patterns is for instance relevant for the diagnosis of tumors based on arrayCGH profiles (Rapaport et al.). [sent-335, score-0.32]
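A short self-contained check of this claim (prefix and suffix intervals generate exactly the contiguous nonzero patterns); the enumeration is brute force and only intended for small p.

```python
from itertools import combinations

def sequence_groups(p):
    """Left intervals [1, k] and right intervals [k, p] (0-indexed prefixes/suffixes)."""
    return [list(range(k)) for k in range(1, p)] + [list(range(k, p)) for k in range(1, p)]

p = 5
groups = sequence_groups(p)
full = frozenset(range(p))
zero_patterns = {frozenset().union(*map(frozenset, s))
                 for r in range(len(groups) + 1) for s in combinations(groups, r)}
nonzero_patterns = {full - z for z in zero_patterns}
contiguous = all(max(s) - min(s) + 1 == len(s) for s in nonzero_patterns if s)
print(contiguous)   # True: every allowed nonzero pattern is a contiguous segment
```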
31 Figure 4: (Left and middle) The set of groups (blue areas) to penalize in order to select contiguous patterns in a sequence. [sent-337, score-0.326]
32 (Right) An example of such a nonzero pattern (red dotted area) with its corresponding zero pattern (hatched area). [sent-338, score-0.244]
33 Larger sets of convex patterns can be obtained by adding to G half-planes with orientations other than vertical and horizontal. [sent-344, score-0.212]
34 For instance, if we use planes with angles that are multiples of π/4, the nonzero patterns of P can have polygonal shapes with up to 8 faces. [sent-345, score-0.32]
35 In this sense, if we keep on adding half-planes with finer orientations, the nonzero patterns of P can be described by polygonal shapes with an increasingly larger number of faces. [sent-346, score-0.32]
36 (Right) An example of nonzero pattern (red dotted area) recovered in this setting, with its corresponding zero pattern (hatched area). [sent-360, score-0.244]
37 Moreover, the two-dimensional rectangular patterns described previously are adapted to finding bounding boxes in static images (Harzallah et al.). [sent-369, score-0.201]
38 (Right) An example of nonzero pattern (red dotted area) recovered in this setting, with its corresponding zero pattern (hatched area). [sent-373, score-0.244]
39 In this case, the nested groups we consider are defined based on the following groups of variables: g_{k,θ} = {(i, j) ∈ {1, . . . , p1} × {1, . . . , p2} ; i cos(θ) + j sin(θ) ≤ k}. [sent-391, score-0.354]
40 In the example of the rectangular groups (see Figure 5), we have four orientations, with θ ∈ {0, π/2, −π/2, π}. [sent-400, score-0.235]
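A sketch of how such half-plane groups could be generated on a p1 × p2 grid; the exact inequality defining g_{k,θ} is a reconstruction from the text above, and the sweep over thresholds k is an implementation assumption.

```python
import numpy as np

def half_plane_groups(p1, p2, thetas=(0.0, np.pi/2, -np.pi/2, np.pi)):
    """Groups g_{k,theta} = {(i, j) : i*cos(theta) + j*sin(theta) <= k} on a p1 x p2 grid.
    Thresholds k are taken at the distinct projection values of the grid points."""
    groups = []
    coords = [(i, j) for i in range(p1) for j in range(p2)]
    for theta in thetas:
        projections = sorted({round(i*np.cos(theta) + j*np.sin(theta), 8) for i, j in coords})
        for k in projections[:-1]:   # the last threshold would give the full grid
            g = [(i, j) for i, j in coords
                 if i*np.cos(theta) + j*np.sin(theta) <= k + 1e-9]
            groups.append(g)
    return groups

groups = half_plane_groups(4, 4)
print(len(groups), groups[0])   # e.g., the first half-plane for theta = 0
```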
41 We present in this section an active set algorithm (Algorithm 3) that finds a solution for Problem (2) by considering increasingly larger active sets and checking global optimality at each step. [sent-414, score-0.231]
42 When the rectangular groups are used, the total complexity of this method is in O(s max{p1. [sent-415, score-0.235]
43 Our active set algorithm extends to general overlapping groups the work of Bach (2008a), by further assuming that it is computationally possible to have a time complexity polynomial in the number of variables p. [sent-422, score-0.323]
44 In active set methods, the set of nonzero variables, denoted by J, is built incrementally, and the problem is solved only for this reduced set of variables, adding the constraint w_{J^c} = 0 to Equation (3). [sent-429, score-0.292]
45 Given a set J ⊆ {1, . . . , p} which belongs to the set of allowed nonzero patterns P, we denote by GJ the set of active groups, that is, the set of groups G ∈ G such that G ∩ J ≠ ∅. [sent-442, score-0.59]
46 The previous proposition enables us to derive the duality gap for the optimization Problem (4) reduced to the active set of variables J. [sent-454, score-0.206]
47 Figure 7: The active set (black) and the candidate patterns of variables, that is, the variables in K\J (hatched in black) that can become active. [sent-467, score-0.258]
48 The fringe groups are exactly the groups that have the hatched areas. [sent-468, score-0.402]
49 Figure 8: The active set (black) and the candidate patterns of variables, that is, the variables in K\J (hatched in black) that can become active. [sent-471, score-0.258]
50 Active Set Algorithm: We can interpret the active set algorithm as a walk through the DAG of nonzero patterns allowed by the norm Ω. [sent-475, score-0.449]
51 The parents ΠP (J) of J in this DAG are exactly the patterns containing the variables that may enter the active set at the next iteration of Algorithm 3. [sent-476, score-0.258]
52 The groups that are exactly at the boundaries of the active set (referred to as the fringe groups) are FJ = {G ∈ (GJ )c ; ∄G′ ∈ (GJ )c , G ⊆ G′ }, that is, the groups that are not contained by any other inactive groups. [sent-477, score-0.457]
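In terms of set operations alone, the active groups G_J and the fringe groups F_J can be computed straight from these definitions (strict containment is used here for "not contained by any other inactive group"); a small sketch with illustrative names:

```python
def active_groups(J, groups):
    """G_J: groups intersecting the active variables J."""
    J = set(J)
    return [G for G in groups if J & set(G)]

def fringe_groups(J, groups):
    """F_J: inactive groups not strictly contained in any other inactive group."""
    J = set(J)
    inactive = [set(G) for G in groups if not (J & set(G))]
    return [G for G in inactive if not any(G < H for H in inactive)]

groups = [[0], [0, 1], [0, 1, 2], [3], [2, 3], [1, 2, 3]]
print(active_groups({0}, groups))   # [[0], [0, 1], [0, 1, 2]]
print(fringe_groups({0}, groups))   # [{1, 2, 3}]  -- the largest inactive group
```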
53 In simple settings, for example, when G is the set of rectangular groups, the correspondence between groups and variables is straightforward since we have FJ = ∪_{K∈ΠP(J)} GK \ GJ (see Figure 7). [sent-478, score-0.257]
54 The heuristic for the sufficient condition (Sε) implies that, to go from groups to variables, we simply consider the group G ∈ FJ violating the sufficient condition the most and then take all the patterns of variables K ∈ ΠP(J) such that K ∩ G ≠ ∅ to enter the active set. [sent-487, score-0.45]
55 A direct consequence of this heuristic is that it is possible for the algorithm to jump over the right active set and to consider instead a (slightly) larger active set as optimal. [sent-489, score-0.208]
56 However, if the active set is larger than the optimal set, then (it can be proved that) the sufficient condition (S0) is satisfied, and the reduced problem, which we solve exactly, will still output the correct nonzero pattern. [sent-490, score-0.292]
57 Algorithmic Complexity: We analyze in detail the time complexity of the active set algorithm when we consider sets of groups G such as those presented in the examples of Section 3. [sent-513, score-0.27]
58 For such choices of G, the fringe groups FJ reduce to the largest groups of each orientation and therefore |FJ| ≤ |Θ|. [sent-517, score-0.353]
59 Given an active set J, both the necessary and sufficient conditions require access to the direct parents ΠP(J) of J in the DAG of nonzero patterns. [sent-519, score-0.292]
60 In simple settings, for example, when G is the set of rectangular groups, this operation can be performed in O(1) (it just corresponds to scanning the (up to) four patterns at the edges of the current rectangular hull). [sent-520, score-0.27]
61 However, for more general orientations, computing ΠP(J) requires finding the smallest nonzero patterns that we can generate from the groups in FJ, reduced to the stripe of variables around the current hull. [sent-521, score-0.508]
62 In particular, the weights have to take into account that some variables belonging to several overlapping groups are penalized multiple times. [sent-551, score-0.219]
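As a purely illustrative placeholder, and not the weighting scheme actually used in the paper, one simple way to account for multiply-covered variables is to scale each entry by the inverse of its coverage:

```python
from collections import Counter

def overlap_weights(groups, p):
    """Illustrative weights d_j^G = 1 / (number of groups containing j).
    Only a placeholder for 'compensate variables that are penalized multiple times';
    this is NOT the weighting scheme used in the paper."""
    coverage = Counter(j for G in groups for j in G)
    return [[1.0 / coverage[j] for j in G] for G in groups]

groups = [[0, 1], [1, 2, 3]]
print(overlap_weights(groups, p=4))   # variable 1 (covered twice) gets weight 1/2 in both groups
```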
63 It is natural to rewrite G as ∪_{θ∈Θ} Gθ, where Θ is the set of the orientations of the groups in G (for more details, see Section 3). [sent-555, score-0.222]
64 The main point is that, since P is closed under intersection, the two procedures described below actually lead to the same set of allowed nonzero patterns: a) Simply considering the nonzero pattern of ŵ. [sent-558, score-0.404]
65 b) Taking the intersection of the nonzero patterns obtained for each ŵθ, θ in Θ. [sent-559, score-0.32]
66 In addition, this procedure is a variable selection technique that therefore needs a second step for estimating the loadings (restricted to the selected nonzero pattern). [sent-561, score-0.206]
67 Considering the set of nonzero patterns P derived in Section 3, we can only hope to estimate the correct hull of the generating sparsity pattern, since Theorem 2 states that other patterns occur with zero probability. [sent-571, score-0.689]
68 The noise is an i.i.d. vector with Gaussian distributions with mean zero and variance σ2 > 0, and w ∈ Rp is the population sparse vector; we denote by J the G-adapted hull of its nonzero pattern. [sent-586, score-0.378]
69 Note that estimating the G -adapted hull of w is equivalent to estimating the nonzero pattern of w if and only if this nonzero pattern belongs to P . [sent-587, score-0.622]
70 Conversely, if G is misspecified, recovering the hull of the nonzero pattern of w may be irrelevant, which is for instance the case if w = (w1, 0) ∈ R2 and G = {{1}, {1, 2}}. [sent-589, score-0.406]
71 The following Theorem gives the sufficient and necessary conditions under which the hull of the generating pattern is consistently estimated. [sent-600, score-0.238]
72 The procedure that consists in intersecting the nonzero patterns obtained for different models of Slasso will be referred to as Intersected Structured-lasso, or simply ISlasso. [sent-636, score-0.32]
73 We further assume that these nonzero components are either organized on a sequence or on a two-dimensional grid (see Figure 9). [sent-640, score-0.213]
74 Specifically, given a group G ∈ G and a variable j ∈ G, the corresponding weight d_j^G is all the smaller as the variable j already belongs to other groups with the same orientation. [sent-645, score-0.228]
75 To this end, we compute the probability of correct hull estimation when we consider diamond-shaped generating patterns of |J| = 24 variables on a 20×20-dimensional grid (see Figure 9h). [sent-652, score-0.389]
76 (Right column, in white) generating patterns that we use for the 20×20-dimensional grid experiments; again, those patterns all form the same hull of 24 variables, that is, the diamond-shaped convex in (h). [sent-661, score-0.523]
77 For the grid setting, the hull is defined based on the set of groups that are half-planes, with orientations that are multiple of π/4 (see Section 3. [sent-663, score-0.437]
78 Figure 10: Consistent hull estimation: probability of correct hull estimation versus the consistency condition (Ω^c)∗[Q_{J^c J} Q_{JJ}^{−1} r_J]. [sent-671, score-0.404]
79 After repeating this computation 20 times for each Q, we eventually report in Figure 10 the probabilities of correct hull estimation versus the consistency condition (Ω^c)∗[Q_{J^c J} Q_{JJ}^{−1} r_J]. [sent-675, score-0.214]
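The vector inside the dual norm can be computed directly from Q; evaluating the dual norm (Ω^c)∗ itself depends on the group structure, but in the special case of the plain ℓ1-norm it reduces to the ℓ∞-norm and one recovers the familiar irrepresentable condition. A sketch under that simplifying assumption (r_J taken as the sign vector of w_J; all names illustrative):

```python
import numpy as np

def l1_consistency_condition(Q, J, w):
    """Compute || Q_{J^c J} Q_{JJ}^{-1} sign(w_J) ||_inf, the classical irrepresentable
    quantity (condition: value < 1); for structured norms the outer l_inf-norm would be
    replaced by the appropriate dual norm."""
    p = Q.shape[0]
    J = np.asarray(sorted(J))
    Jc = np.asarray([j for j in range(p) if j not in set(J)])
    r_J = np.sign(w[J])                          # for the l1-norm, r_J is the sign vector
    v = Q[np.ix_(Jc, J)] @ np.linalg.solve(Q[np.ix_(J, J)], r_J)
    return np.max(np.abs(v))

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 6))
Q = A.T @ A / 50                                  # a well-conditioned sample covariance
w = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
print(l1_consistency_condition(Q, J=[0, 1, 2], w=w))
```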
80 Structured Variable Selection: We show in this experiment that the prior information we encode through the norm Ω improves the ability of the model to recover spatially structured nonzero patterns. [sent-678, score-0.243]
81 For different sample sizes n ∈ {100, 200, 300, 400, 500, 700, 1000}, we consider the probabilities of correct recovery and the (normalized) Hamming distance to the true nonzero patterns. [sent-681, score-0.219]
82 For the realization of a random setting {J, w, X, ε}, we compute an entire regularization path over which we collect the closest Hamming distance to the true nonzero pattern and whether it has been exactly recovered for some µ. [sent-682, score-0.216]
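The normalized Hamming distance between an estimated and the true nonzero pattern is simply the fraction of coordinates on which the two supports disagree; a short helper for completeness (names are illustrative):

```python
import numpy as np

def normalized_hamming(pattern_hat, pattern_true, p):
    """Normalized Hamming distance between two supports, as a fraction of p."""
    a = np.zeros(p, dtype=bool); a[list(pattern_hat)] = True
    b = np.zeros(p, dtype=bool); b[list(pattern_true)] = True
    return np.sum(a != b) / p

print(normalized_hamming({0, 1, 2}, {1, 2, 3}, p=10))   # 0.2
```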
83 The models learned with such weights do not manage to recover the correct nonzero patterns (in that case, the best model found on the path corresponds to the empty solution, with a normalized Hamming distance of |J|/p). [sent-686, score-0.32]
84 In the grid case, two sets of groups G are considered, the rectangular groups with or without the ±π/4-groups (denoted by (π/4) in the legend). [sent-720, score-0.426]
85 Moreover, adding the ±π/4-groups to the rectangular groups makes it possible to recover a nonzero pattern closer to the generating pattern. [sent-727, score-0.471]
86 This is illustrated on Figure 11d where the error of ISlasso with only rectangular groups (in black) corresponds to the selection of the smallest rectangular box around the generating pattern. [sent-728, score-0.324]
87 We additionally evaluate the relevance of the contiguous (or convex) prior by varying the number of zero variables that are contained in the hull (see Figure 9). [sent-731, score-0.24]
88 The prediction error is understood here as ‖X_test(ŵ − w)‖₂² / ‖X_test w‖₂², where ŵ denotes the OLS estimate, performed on the nonzero pattern found by the model considered. [sent-733, score-0.216]
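A sketch of this evaluation step (OLS refit restricted to the selected pattern, then relative prediction error on test data); all variable names are illustrative and the data are synthetic:

```python
import numpy as np

def refit_and_score(X_train, y_train, X_test, w_true, support):
    """OLS refit restricted to the selected support, then relative prediction error
    || X_test (w_hat - w_true) ||_2^2 / || X_test w_true ||_2^2."""
    p = X_train.shape[1]
    w_hat = np.zeros(p)
    S = sorted(support)
    if S:  # least-squares refit on the selected variables only
        w_hat[S], *_ = np.linalg.lstsq(X_train[:, S], y_train, rcond=None)
    num = np.linalg.norm(X_test @ (w_hat - w_true)) ** 2
    den = np.linalg.norm(X_test @ w_true) ** 2
    return num / den

rng = np.random.default_rng(1)
p, n = 20, 200
w_true = np.zeros(p); w_true[3:7] = rng.standard_normal(4)
X_train, X_test = rng.standard_normal((n, p)), rng.standard_normal((n, p))
y_train = X_train @ w_true + 0.1 * rng.standard_normal(n)
print(refit_and_score(X_train, y_train, X_test, w_true, support={3, 4, 5, 6}))
```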
89 Indeed, as soon as the hull of the generating pattern does not contain too many zero variables, Slasso/ISlasso outperform Lasso. [sent-741, score-0.238]
90 In fact, the sample complexity of the Lasso depends on the number of nonzero variables in w (Wainwright, 2009) as opposed to the size of the hull for Slasso/ISlasso. [sent-742, score-0.4]
91 This also explains why the error for Slasso/ISlasso is almost constant with respect to the number of nonzero variables (since the hull has a constant size). [sent-743, score-0.4]
92 9 As predicted by the complexity analysis of the active set algorithm (see the end of Section 4), considering the set of rectangular groups with or without the ±π/4-groups results in the same complexity (up to constant terms). [sent-750, score-0.339]
93 [Figure panels: (a) Sequence setting, (b) Grid setting; x-axis: proportion of nonzero variables in the hull (%); sample size n = 500.] [sent-770, score-0.8]
94 In the grid case, two sets of groups G are considered, the rectangular groups with or without the ±π/4-groups (denoted by (π/4) in the legend). [sent-779, score-0.426]
95 Two sets of groups G are considered, the rectangular groups with or without the ±π/4-groups (denoted by (π/4) in the legend). [sent-785, score-0.401]
96 Conclusion We have shown how to incorporate prior knowledge on the form of nonzero patterns for linear supervised learning. [sent-788, score-0.32]
97 The restrictions LJ : RJ → R and ΩJ : RJ → R of L and Ω are continuously differentiable functions on {w ∈ RJ : w_I ≠ 0}, with respective gradients ∇LJ(w) = (∂LJ(w)/∂w_j)_{j∈J}^⊤ and ∇ΩJ(w) = (w_j ∑_{G∈G_I, G∋j} (d_j^G)² ‖d^G ∘ w‖₂^{−1})_{j∈J}^⊤. [sent-873, score-0.486]
98 Among all groups G ∈ (GJ )c , the ones with the maximum values are the largest ones, that is, those in the fringe groups FJ = {G ∈ (GJ )c ; ∄G′ ∈ (GJ )c , G ⊆ G′ }. [sent-975, score-0.353]
99 We first prove that we must have, with high probability, ‖ŵ_G‖∞ > 0 for all G ∈ GJ, proving that the hull of the active set of ŵ_J is exactly J. [sent-1025, score-0.294]
100 Let I ⊆ {1, . . . , p} and GI = {G ∈ G ; G ∩ I ≠ ∅} ⊆ G, that is, the set of active groups when the variables I are selected. [sent-1077, score-0.292]
wordName wordTfidf (topN-words)
[('wj', 0.486), ('qjc', 0.331), ('gj', 0.266), ('jc', 0.25), ('lj', 0.208), ('rj', 0.206), ('hull', 0.19), ('nonzero', 0.188), ('dj', 0.178), ('groups', 0.166), ('patterns', 0.132), ('election', 0.12), ('uj', 0.116), ('enatton', 0.112), ('jj', 0.108), ('orms', 0.108), ('nducing', 0.108), ('active', 0.104), ('jenatton', 0.099), ('tructured', 0.099), ('qjj', 0.098), ('audibert', 0.091), ('gi', 0.087), ('slasso', 0.086), ('bach', 0.083), ('socp', 0.08), ('dg', 0.078), ('qj', 0.075), ('gk', 0.074), ('islasso', 0.073), ('lasso', 0.07), ('rectangular', 0.069), ('fj', 0.064), ('mairal', 0.056), ('orientations', 0.056), ('dag', 0.049), ('duality', 0.049), ('hatched', 0.049), ('zj', 0.043), ('vj', 0.043), ('wg', 0.037), ('norms', 0.036), ('zhao', 0.036), ('borwein', 0.034), ('obozinski', 0.034), ('overlapping', 0.031), ('gc', 0.031), ('recovery', 0.031), ('gap', 0.031), ('gid', 0.031), ('hjj', 0.031), ('qjjc', 0.031), ('structured', 0.03), ('pattern', 0.028), ('contiguous', 0.028), ('sparsity', 0.027), ('group', 0.026), ('dual', 0.025), ('norm', 0.025), ('grid', 0.025), ('doignon', 0.024), ('flim', 0.024), ('harzallah', 0.024), ('maxg', 0.024), ('rapaport', 0.024), ('unions', 0.024), ('voxels', 0.024), ('wjc', 0.024), ('convex', 0.024), ('consistency', 0.024), ('uniqueness', 0.024), ('optimality', 0.023), ('boyd', 0.023), ('kim', 0.023), ('overlaps', 0.023), ('variables', 0.022), ('vandenberghe', 0.022), ('wy', 0.021), ('squaring', 0.021), ('ei', 0.021), ('fringe', 0.021), ('gramfort', 0.021), ('equation', 0.021), ('spatial', 0.021), ('propositions', 0.02), ('generating', 0.02), ('minp', 0.02), ('dk', 0.02), ('lewis', 0.019), ('ug', 0.019), ('ui', 0.019), ('arraycgh', 0.018), ('falmagne', 0.018), ('poset', 0.018), ('toh', 0.018), ('ujc', 0.018), ('variable', 0.018), ('gm', 0.018), ('backward', 0.018), ('solver', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
2 0.19339783 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
3 0.14335181 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
4 0.12821434 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
5 0.078603074 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
6 0.071478806 59 jmlr-2011-Learning with Structured Sparsity
7 0.067074239 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
8 0.044743177 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
9 0.038465187 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.038079415 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
11 0.032129534 101 jmlr-2011-Variable Sparsity Kernel Learning
12 0.030406119 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
13 0.029638948 105 jmlr-2011-lp-Norm Multiple Kernel Learning
14 0.028509043 12 jmlr-2011-Bayesian Co-Training
15 0.028179182 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
16 0.027434533 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
17 0.027420633 61 jmlr-2011-Logistic Stick-Breaking Process
18 0.027011128 91 jmlr-2011-The Sample Complexity of Dictionary Learning
19 0.026731256 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
20 0.02671078 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
topicId topicWeight
[(0, 0.185), (1, 0.004), (2, 0.273), (3, 0.213), (4, -0.06), (5, -0.004), (6, -0.014), (7, -0.018), (8, 0.093), (9, -0.033), (10, 0.135), (11, 0.004), (12, 0.137), (13, -0.098), (14, 0.008), (15, -0.011), (16, 0.042), (17, -0.237), (18, 0.022), (19, 0.012), (20, 0.059), (21, 0.04), (22, 0.117), (23, 0.109), (24, -0.218), (25, -0.118), (26, 0.164), (27, 0.048), (28, -0.092), (29, -0.072), (30, -0.085), (31, 0.163), (32, 0.08), (33, -0.083), (34, -0.117), (35, 0.099), (36, 0.164), (37, 0.073), (38, 0.099), (39, -0.004), (40, 0.04), (41, -0.017), (42, 0.073), (43, 0.02), (44, 0.065), (45, 0.017), (46, -0.018), (47, 0.003), (48, -0.052), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.94425505 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
2 0.61696309 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
3 0.51662624 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
4 0.42377204 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
5 0.32599691 25 jmlr-2011-Discriminative Learning of Bayesian Networks via Factorized Conditional Log-Likelihood
Author: Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira, Petri Myllymäki
Abstract: We propose an efficient and parameter-free scoring criterion, the factorized conditional log-likelihood (ˆ fCLL), for learning Bayesian network classifiers. The proposed score is an approximation of the conditional log-likelihood criterion. The approximation is devised in order to guarantee decomposability over the network structure, as well as efficient estimation of the optimal parameters, achieving the same time and space complexity as the traditional log-likelihood scoring criterion. The resulting criterion has an information-theoretic interpretation based on interaction information, which exhibits its discriminative nature. To evaluate the performance of the proposed criterion, we present an empirical comparison with state-of-the-art classifiers. Results on a large suite of benchmark data sets from the UCI repository show that ˆ fCLL-trained classifiers achieve at least as good accuracy as the best compared classifiers, using significantly less computational resources. Keywords: Bayesian networks, discriminative learning, conditional log-likelihood, scoring criterion, classification, approximation c 2011 Alexandra M. Carvalho, Teemu Roos, Arlindo L. Oliveira and Petri Myllym¨ ki. a ¨ C ARVALHO , ROOS , O LIVEIRA AND M YLLYM AKI
6 0.25514916 97 jmlr-2011-Union Support Recovery in Multi-task Learning
7 0.23895335 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
8 0.21732102 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
9 0.20284855 59 jmlr-2011-Learning with Structured Sparsity
10 0.20025833 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
11 0.18450296 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
12 0.18014137 12 jmlr-2011-Bayesian Co-Training
13 0.17748405 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
14 0.1747613 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.17391567 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
16 0.16126136 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
17 0.15889461 105 jmlr-2011-lp-Norm Multiple Kernel Learning
18 0.15680014 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
19 0.15420251 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
20 0.146091 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
topicId topicWeight
[(4, 0.065), (9, 0.139), (10, 0.028), (24, 0.032), (31, 0.063), (32, 0.015), (35, 0.359), (41, 0.022), (60, 0.019), (70, 0.011), (71, 0.016), (73, 0.026), (78, 0.059), (90, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.73491657 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
2 0.45587593 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
3 0.44433442 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity ∗. Also at Machine Learning Group, Technische Universit¨ t Berlin, 10587 Berlin, Germany. a †. Parts of this work were done while SS was at the Friedrich Miescher Laboratory, Max Planck Society, 72076 T¨ bingen, Germany. u ‡. Most contributions by AZ were done at the Fraunhofer Institute FIRST, 12489 Berlin, Germany. c 2011 Marius Kloft, Ulf Brefeld, S¨ ren Sonnenburg and Alexander Zien. o K LOFT, B REFELD , S ONNENBURG AND Z IEN
4 0.43936631 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
5 0.37247461 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
6 0.36597991 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.35927844 66 jmlr-2011-Multiple Kernel Learning Algorithms
8 0.34704682 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
9 0.34651947 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
10 0.34544802 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
11 0.33999246 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
12 0.33795533 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.33777201 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
14 0.33725849 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
15 0.33657011 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
16 0.33206201 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
17 0.32712591 104 jmlr-2011-X-Armed Bandits
18 0.32451472 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
19 0.32216108 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
20 0.32205668 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing