jmlr jmlr2011 jmlr2011-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
Reference: text
sentIndex sentText sentNum sentScore
1 More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. [sent-13, score-0.862]
2 Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. [sent-16, score-0.879]
3 Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization. [sent-18, score-0.455]
4 Introduction. Modeling signals as sparse linear combinations of atoms selected from a dictionary has become a popular paradigm in many fields, including signal processing, statistics, and machine learning. [sent-19, score-0.605]
5 In many applied settings, the structure of the problem at hand, such as the spatial arrangement of the pixels in an image or the presence of variables corresponding to several levels of a given factor, induces relationships between dictionary elements. [sent-33, score-0.452]
6 We tackle the resulting nonsmooth convex optimization problem with proximal methods. [sent-63, score-0.42]
7 Concretely, given an m-dimensional signal x along with a dictionary D = [d1, . . . , dp] in R^(m×p)… [sent-68, score-0.459]
8 A particular instance of this problem—known as the proximal problem—is central to our analysis and concentrates on the case where the dictionary D is orthogonal. [sent-77, score-0.74]
9 First, we consider settings where the dictionary is fixed and given a priori, corresponding for instance to a basis of wavelets for the denoising of natural images. [sent-79, score-0.504]
10 Second, we show how one can take advantage of this hierarchical sparse coding in the context of dictionary learning (Olshausen and Field, 1997; Aharon et al. [sent-80, score-0.631]
11 , 2010a), where the dictionary is learned to adapt to the predefined tree structure. [sent-82, score-0.514]
12 This extension of dictionary learning is notably shown to share interesting connections with hierarchical probabilistic topic models. [sent-83, score-0.559]
13 To summarize, the contributions of this paper are threefold: • We show that the proximal operator for a tree-structured sparse regularization can be computed exactly in a finite number of operations using a dual approach. [sent-84, score-0.568]
14 Our approach is equivalent to computing a particular sequence of elementary proximal operators, and has a complexity linear, or close to linear, in the number of variables. [sent-85, score-0.326]
15 • Our method establishes a bridge between hierarchical dictionary learning and hierarchical topic models (Blei et al. [sent-90, score-0.639]
16 … dictionary learning framework and shows how it can be used with tree-structured norms. [sent-111, score-0.414]
17 In traditional sparse coding, we seek to approximate this signal by a sparse linear combination of atoms, or dictionary elements, represented here by the columns of a matrix D = [d1, . . . , dp]. [sent-115, score-0.585]
18 (1) Informally, we want to exploit the structure of T in the following sense: the decomposition of any signal x can involve a dictionary element d j only if the ancestors of d j in the tree T are themselves part of the decomposition. [sent-132, score-0.631]
19 Condition (1) amounts to saying that if a dictionary element is not used in the decomposition, its descendants in the tree should not be used either. [sent-161, score-0.514]
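The hierarchical condition above is easy to check programmatically. Below is a minimal sketch (not code from the paper) that assumes the tree is encoded as a parent-index array, a representation chosen here purely for convenience.

```python
import numpy as np

def respects_hierarchy(alpha, parent, tol=1e-12):
    """Check condition (1): a coefficient may be nonzero only if all of its
    ancestors in the tree are nonzero as well.

    alpha  : (p,) coefficient vector
    parent : (p,) array; parent[j] is the parent index of node j, -1 for the root
    """
    nonzero = np.abs(alpha) > tol
    for j in range(len(alpha)):
        if nonzero[j]:
            a = parent[j]
            while a != -1:           # walk up to the root
                if not nonzero[a]:   # a zero ancestor violates the hierarchy
                    return False
                a = parent[a]
    return True
```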
20 Given such a tree-structured set of groups G and its associated norm Ω, we are interested throughout the paper in the following hierarchical sparse coding problem: min_{α ∈ R^p} f(α) + λΩ(α). (5) [sent-193, score-0.375]
21 In the rest of the paper, we will mostly use the square loss f(α) = (1/2)‖x − Dα‖²₂, with a dictionary D in R^(m×p), but the formulation of Equation (5) extends beyond this context. [sent-202, score-0.414]
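For concreteness, the square-loss objective of Equation (5) could be evaluated as sketched below, assuming the tree-structured groups are supplied as lists of indices (each a node together with its descendants) with weights ω_g. The group representation and function names are illustrative choices, not the authors' code.

```python
import numpy as np

def tree_norm(alpha, groups, weights):
    """Omega(alpha) = sum over groups g of w_g * ||alpha restricted to g||_2,
    where each group contains a node and all of its descendants."""
    return sum(w * np.linalg.norm(alpha[g]) for g, w in zip(groups, weights))

def objective(alpha, x, D, lam, groups, weights):
    """Square-loss hierarchical sparse coding objective of Equation (5)."""
    residual = x - D @ alpha
    return 0.5 * residual @ residual + lam * tree_norm(alpha, groups, weights)
```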
22 (2010) have considered proximal methods for general group structure G when . [sent-234, score-0.373]
23 Optimization. We begin with a brief introduction to proximal methods, necessary to present our contributions. [sent-237, score-0.326]
24 It is worth mentioning that there exist various proximal schemes in the literature that differ in their settings. [sent-239, score-0.326]
25 Solving this problem efficiently and exactly is crucial to enjoy the fast convergence rates of proximal methods. [sent-263, score-0.326]
26 In addition, when the nonsmooth term Ω is not present, the previous proximal problem exactly leads to the standard gradient update rule. [sent-264, score-0.375]
27 More generally, we define the proximal operator: Definition 2 (Proximal Operator) The proximal operator associated with our regularization term λΩ, which we denote by Prox_{λΩ}, is the function that maps a vector u ∈ R^p to the unique solution of min_{v ∈ R^p} (1/2)‖u − v‖²₂ + λΩ(v). [sent-265, score-0.764]
28 What makes proximal methods appealing for solving sparse decomposition problems is that this operator can often be computed in closed form. [sent-267, score-0.496]
29 For instance, when Ω is the ℓ1-norm—that is, Ω(u) = ‖u‖₁—the proximal operator is the well-known elementwise soft-thresholding operator: for all j in {1, . . . , p}, u_j ↦ sign(u_j) · max(|u_j| − λ, 0). [sent-268, score-0.4]
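A minimal NumPy sketch of the elementwise soft-thresholding operator just described:

```python
import numpy as np

def prox_l1(u, lam):
    """Elementwise soft-thresholding: proximal operator of lam * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)
```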
30 When the groups in G form a partition of {1, . . . , p}, the proximal problem is separable in every group, and the solution is a generalization of the soft-thresholding operator to groups of variables: for all g ∈ G, u|g ↦ u|g − Π_{‖·‖* ≤ λω_g}(u|g), where Π_{‖·‖* ≤ λω_g} denotes the orthogonal projection onto the ball of the dual norm ‖·‖* of radius λω_g. [sent-275, score-0.489]
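For the ℓ2 case with non-overlapping groups, the group-wise shrinkage takes a simple closed form. The sketch below assumes groups are given as index arrays; it is an illustration of the standard block soft-thresholding, not the authors' implementation.

```python
import numpy as np

def prox_group_l2(u, groups, lam, weights):
    """Block soft-thresholding when the groups form a partition and ||.|| is
    the l2-norm: subtract the projection onto the dual-norm ball of radius
    lam * w_g, which amounts to shrinking each block toward zero."""
    v = u.copy()
    for g, w in zip(groups, weights):
        norm_g = np.linalg.norm(u[g])
        if norm_g <= lam * w:
            v[g] = 0.0
        else:
            v[g] = (1.0 - lam * w / norm_g) * u[g]
    return v
```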
31 A classical duality result says that the proximal operator for a norm ‖·‖ can be obtained from the orthogonal projection onto the ball of the dual norm ‖·‖*. [sent-286, score-0.469]
32 This is a classical duality result for proximal operators, leading to the different closed forms we have just presented. [sent-289, score-0.387]
33 Lemma 3 (Dual of the proximal problem) Let u ∈ R^p and consider the problem max_{ξ ∈ R^(p×|G|)} −(1/2)‖u − ∑_{g∈G} ξ^g‖²₂ s.t. ‖ξ^g‖* ≤ λω_g and ξ^g_j = 0 if j ∉ g, for all g ∈ G. (8) [sent-299, score-0.326]
34 ξ^h = Π_{‖·‖* ≤ t_h}(u|h − ξ^g), with t_g, t_h > 0. [sent-339, score-0.282]
35 In only one pass over the groups {g, h}, we have in fact reached a solution of the dual formulation presented in Equation (8), and in particular, the solution of the proximal problem (7). [sent-350, score-0.533]
36 In the following proposition, this lemma is extended to general tree-structured sets of groups G: Proposition 5 (Convergence in one pass) Suppose that the groups in G are ordered according to the total order relation of Definition 1, and that the norm ‖·‖ is either the ℓ2- or the ℓ∞-norm. [sent-351, score-0.298]
37 Then, after initializing ξ to 0, a single pass of Algorithm 1 over G in this order yields the solution of the proximal problem (7). [sent-353, score-0.377]
38 Using conic duality, we have derived a dual formulation of the proximal operator, leading to Algorithm 1, which is generic and works for any norm ‖·‖ for which the required dual-norm projections can be computed. [sent-374, score-0.462]
39 In light of the classical relationship between the proximal operator and projection (as discussed in Section 3)… [sent-392, score-0.449]
40 To simplify the notation, we define the proximal operator for a group g in G as Prox^g(u) = Prox_{λω_g‖·‖}(u|g), leaving the entries of u outside g unchanged. [sent-396, score-0.447]
41 Thus, Algorithm 2 in fact performs a sequence of |G| proximal operators, and we have shown the following corollary of Proposition 5: Corollary 6 (Composition of Proximal Operators) Let g1 ⪯ · · · ⪯ gm be the ordering of the groups in G from Definition 1. [sent-398, score-0.326]
42 The proximal operator Prox_{λΩ} associated with the norm Ω can be written as the composition of elementary operators: Prox_{λΩ} = Prox^{g_m} ∘ · · · ∘ Prox^{g_1}. [sent-405, score-0.469]
43 Indeed, in that case each of the proximal operators Prox^g is a scaling operation: v|g ← (1 − λω_g/‖v|g‖₂)₊ · v|g. [sent-429, score-0.356]
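The composition of Corollary 6 can be sketched for the ℓ2 case as below. This is an illustration under an assumed ordering convention: each group (a node plus its descendants) is processed before any group that contains it, i.e., leaf groups first and the root group last; the function names and data layout are choices made here.

```python
import numpy as np

def prox_tree_l2(u, groups_leaf_to_root, lam, weights):
    """Composition of elementary group scalings for the l2 version of Omega.
    groups_leaf_to_root: index arrays (node + descendants), ordered so that a
    group is applied before every group containing it (leaves before root)."""
    v = u.copy()
    for g, w in zip(groups_leaf_to_root, weights):
        norm_g = np.linalg.norm(v[g])
        if norm_g <= lam * w:
            v[g] = 0.0                      # whole subtree is set to zero
        else:
            v[g] *= 1.0 - lam * w / norm_g  # scaling step of Corollary 6
    return v
```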
44 As a reminder, root(g) is not a singleton when several dictionary elements are considered per node. [sent-439, score-0.414]
45 So far, the dictionary D was fixed, for example to a wavelet basis. [sent-440, score-0.501]
46 In the next section, we apply the tools we developed for solving problem (5) efficiently to learn a dictionary D adapted to our hierarchical sparse coding formulation. [sent-441, score-0.631]
47 Application to Dictionary Learning We start by briefly describing dictionary learning. [sent-443, score-0.414]
48 Dictionary learning is a matrix factorization problem that aims at representing these signals as linear combinations of dictionary elements, which are the columns of a matrix D = [d1, . . . , dp]. [sent-449, score-0.446]
49 More precisely, the dictionary D is learned along with a matrix of decomposition coefficients A = [α1, . . . , αn]. [sent-453, score-0.447]
50 However, this classical setting treats each dictionary element independently from the others, and does not exploit possible relationships between them. [sent-466, score-0.414]
51 To embed the dictionary in a tree structure, we therefore replace the ℓ1 -norm by our hierarchical norm and set Ψ = Ω in Equation (10). [sent-467, score-0.663]
52 On the contrary, in the case of dictionary learning, since the atoms are learned, one can argue that the learned dictionary elements will have to match well the hierarchical prior imposed by the regularization. [sent-470, score-0.959]
53 In other words, combining structured regularization with dictionary learning has precisely the advantage that the dictionary elements will self-organize to match the prior. [sent-471, score-0.907]
54 Learning the Dictionary. Optimization for dictionary learning has already been intensively studied. [sent-473, score-0.414]
55 … dictionary learning problem. [sent-484, score-0.414]
56 The optimization of the dictionary D (for A fixed), which we discuss first, is in general easier. [sent-492, score-0.414]
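The alternating structure referred to here (update A with D fixed, then update D with A fixed) can be summarized by the following sketch. It is not the authors' online or block-coordinate implementation: the proximal operator is passed in as a function, and the step sizes, iteration counts, and initialization are illustrative choices.

```python
import numpy as np

def dictionary_learning(X, p, prox, lam, n_outer=30, n_inner=50, seed=0):
    """Alternating sketch: proximal-gradient sparse coding of each signal
    (columns of X) for D fixed, then a projected gradient step on D for A
    fixed, with dictionary columns kept in the unit l2-norm ball."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)            # unit-norm initial atoms
    A = np.zeros((p, n))
    for _ in range(n_outer):
        # sparse coding step (ISTA-like), D fixed
        L = np.linalg.norm(D, 2) ** 2         # Lipschitz constant of the loss gradient
        for _ in range(n_inner):
            grad = D.T @ (D @ A - X)
            for i in range(n):
                A[:, i] = prox(A[:, i] - grad[:, i] / L, lam / L)
        # dictionary step, A fixed: one projected gradient update
        step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
        D -= step * (D @ A - X) @ A.T
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # project columns onto unit ball
    return D, A
```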
57 For natural image patches, the dictionary elements are usually constrained to be in the unit ℓ2-norm ball (i.e., ‖d^j‖₂ ≤ 1). [sent-503, score-0.508]
58 For topic modeling, the dictionary elements are distributions over words and therefore belong to the simplex (i.e., d^j ≥ 0 and ‖d^j‖₁ = 1). [sent-505, score-0.479]
59 The update of each dictionary element amounts to performing a Euclidean projection, which can be computed efficiently (Mairal et al. [sent-508, score-0.414]
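The two Euclidean projections mentioned above can be sketched as follows; the simplex routine uses the standard sort-based algorithm and is an assumption about implementation details rather than code from the paper.

```python
import numpy as np

def project_l2_ball(d):
    """Project a dictionary element onto the unit l2-norm ball."""
    norm = np.linalg.norm(d)
    return d if norm <= 1.0 else d / norm

def project_simplex(d):
    """Euclidean projection onto the probability simplex (nonnegative entries
    summing to one), used for topic-modeling dictionaries."""
    u = np.sort(d)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    cond = u + (1.0 - css) / np.arange(1, len(d) + 1) > 0
    rho = np.nonzero(cond)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(d + theta, 0.0)
```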
60 Although we have not explored locality constraints on the dictionary elements, these have been shown to be particularly relevant to some applications such as patch-based image classification (Yu et al. [sent-511, score-0.465]
61 Furthermore, positivity constraints can be added on the domain of A, by noticing that for our norm Ω and any vector u in R^p, adding these constraints when computing the proximal operator is equivalent to solving min_{v ∈ R^p} (1/2)‖[u]₊ − v‖²₂ + λΩ(v). [sent-518, score-0.469]
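A sketch of the nonnegative variant that follows from the equivalence just stated (and from Lemma 11, which requires Ω to be absolute): clip u at zero, then apply the unconstrained proximal operator.

```python
import numpy as np

def prox_nonneg(u, lam, prox):
    """Proximal operator with an added nonnegativity constraint, assuming the
    norm Omega is absolute: apply the unconstrained prox to [u]_+ = max(u, 0)."""
    return prox(np.maximum(u, 0.0), lam)
```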
62 In Section 3, we have shown that the proximal operator associated with Ω can be computed exactly and efficiently. [sent-528, score-0.4]
63 The problem is therefore amenable to fast proximal algorithms that are well suited to nonsmooth convex optimization. [sent-529, score-0.42]
64 Specifically, we tried the accelerated schemes from both Nesterov (2007) and Beck and Teboulle (2009), and finally opted for the latter since, for a comparable level of precision, fewer calls to the proximal operator are required. [sent-530, score-0.444]
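For reference, an accelerated proximal-gradient (FISTA-style) loop for the square loss can be sketched as below. The step size via the spectral norm, the fixed iteration count, and the function names are choices made here, not details taken from the paper.

```python
import numpy as np

def fista(x, D, lam, prox, n_iter=200):
    """Accelerated proximal gradient for
    min_alpha 0.5 * ||x - D alpha||_2^2 + lam * Omega(alpha),
    where `prox(u, t)` computes Prox_{t * Omega}(u)."""
    p = D.shape[1]
    alpha = np.zeros(p)
    y = alpha.copy()
    t = 1.0
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        alpha_new = prox(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = alpha_new + (t - 1.0) / t_new * (alpha_new - alpha)
        alpha, t = alpha_new, t_new
    return alpha
```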
65 This latter algorithm has an optimal convergence rate in the class of first-order techniques, and also allows for warm restarts, which is crucial in the alternating scheme of dictionary learning. [sent-533, score-0.414]
66 Moreover, the experiments we carry out cover various settings, with notably different sparsity regimes, that is, low, medium and high, respectively corresponding to about 50%, 10% and 1% of the total number of dictionary elements. [sent-556, score-0.46]
67 The dictionary we use is represented in Figure 7. [sent-564, score-0.414]
68 Although the problem involves a small number of variables, that is, p = 151 dictionary elements, it has to be solved repeatedly for tens of thousands of patches, at moderate precision. [sent-565, score-0.414]
69 First, we observe that in most cases, the accelerated proximal scheme performs better than the other approaches. [sent-585, score-0.37]
70 The results in Figure 4 highlight that the accelerated proximal scheme performs overall better than the two other methods. [sent-618, score-0.37]
71 Again, it is important to note that both proximal algorithms yield sparse solutions, which is not the case for SG. [sent-619, score-0.389]
72 Since the basis is here orthonormal, solving the corresponding decomposition problems amounts to computing a single instance of the proximal operator. [sent-636, score-0.359]
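To make the previous point concrete: with an orthonormal dictionary, the whole decomposition reduces to one proximal step on the analysis coefficients. In the sketch below, W is assumed to be an explicit orthogonal matrix; in practice a fast wavelet transform would replace the matrix products.

```python
import numpy as np

def denoise_orthonormal(x, W, lam, prox):
    """When the dictionary W is orthonormal (e.g., a wavelet basis), the
    solution of min_alpha 0.5*||x - W alpha||^2 + lam*Omega(alpha) is
    alpha* = Prox_{lam*Omega}(W^T x); the denoised signal is W alpha*."""
    coeffs = W.T @ x           # analysis (forward transform)
    alpha = prox(coeffs, lam)  # single proximal step
    return W @ alpha           # synthesis (inverse transform)
```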
73 Solving the proximal problem for an image with m = 512 × 512 = 262 144 pixels takes approximately 0. [sent-805, score-0.415]
74 Figure 6: Mean square error multiplied by 100 obtained with 13 structures, with error bars, sorted by number of dictionary elements from 16 to 401. [sent-860, score-0.414]
75 White bars correspond to the flat dictionary model containing the same number of dictionary elements as the tree-structured one. [sent-862, score-0.828]
76 For the first experiment, the dictionary D is learned on Xtr using the formulation of Equation (10), with µ = 0 for Dµ as defined in Equation (11). [sent-868, score-0.414]
77 Figure 7: Learned dictionary with a tree structure of depth 5. [sent-877, score-0.558]
78 The dictionary is learned on 50,000 patches of size 16 × 16 pixels. [sent-880, score-0.511]
79 For each tree structure, we evaluated the performance obtained with the tree-structured dictionary along with a non-structured dictionary containing the same number of elements. [sent-896, score-0.928]
80 For all fractions of missing pixels considered, the tree-structured dictionary outperforms the “unstructured one”, and the most significant improvement is obtained in the noisiest setting. [sent-899, score-0.452]
81 Note that having more dictionary elements is worthwhile when using the tree structure. [sent-900, score-0.514]
82 For each dictionary size, the tree-structured dictionary significantly outperforms the unstructured one. [sent-902, score-0.86]
83 An example of a learned tree-structured dictionary is presented in Figure 7. [sent-903, score-0.414]
84 We similarly use our dictionary learning approach to find low-dimensional representations of text corpora. [sent-915, score-0.414]
85 If we further assume that the entries of D and A are nonnegative, and that the dictionary elements d j have unit ℓ1 -norm, the decomposition (D, A) can be interpreted as the parameters of a topic-mixture model. [sent-920, score-0.447]
86 The regularization Ω induces the organization of these topics on a tree, so that, if a document involves a certain topic, then all ancestral topics in the tree are also present in the topic decomposition. [sent-921, score-0.335]
87 Figure 8 displays an example of a learned dictionary with 13 topics, obtained by using the ℓ∞-norm in Ω and manually selecting λ = 2^(−15). [sent-932, score-0.414]
88 … (2003), (ii) principal component analysis (PCA), (iii) nonnegative matrix factorization (NMF), (iv) standard sparse dictionary learning (denoted by SpDL), and (v) our sparse hierarchical approach (denoted by SpHDL). [sent-957, score-0.62]
89 It should be noted, however, that, unlike some Bayesian methods, dictionary learning by itself does not provide mechanisms for the automatic selection of model hyper-parameters (such as the dictionary size or the topology of the tree). [sent-989, score-0.828]
90 This problem can be viewed as a proximal operator for the nonconvex regularization ∑g∈G δg (v). [sent-999, score-0.554]
91 We can rewrite problem (7) as min_{v ∈ R^p, z ∈ R^|G|} (1/2)‖u − v‖²₂ + λ ∑_{g∈G} ω_g z_g, such that (v|g, z_g) ∈ C for all g ∈ G, by introducing the primal variables z = (z_g)_{g∈G} ∈ R^|G|, with the additional |G| conic constraints (v|g, z_g) ∈ C for g ∈ G. [sent-1037, score-0.52]
92 We now consider the Lagrangian L defined as L(v, z, τ, ξ) = (1/2)‖u − v‖²₂ + λ ∑_{g∈G} ω_g z_g − ∑_{g∈G} (z_g τ_g + v|g^⊤ ξ^g), with the dual variables τ = (τ_g)_{g∈G} in R^|G| and ξ = (ξ^g)_{g∈G} in R^(p×|G|), such that for all g ∈ G, ξ^g_j = 0 if j ∉ g and (ξ^g, τ_g) ∈ C*. [sent-1041, score-0.387]
93 We have that {v, z, τ, ξ} are optimal if and only if: for all g ∈ G, (v|g, z_g) ∈ C; for all g ∈ G, (ξ^g, τ_g) ∈ C*; for all g ∈ G, z_g τ_g − v|g^⊤ ξ^g = 0 (complementary slackness); for all g ∈ G, λω_g − τ_g = 0; and v − u + ∑_{g∈G} ξ^g = 0. [sent-1046, score-0.32]
94 Furthermore, using the fact that for all g ∈ G, (v|g, z_g) ∈ C and (ξ^g, τ_g) = (ξ^g, λω_g) ∈ C*, we obtain the following chain of inequalities for all g ∈ G: λ z_g ω_g = v|g^⊤ ξ^g ≤ ‖v|g‖ ‖ξ^g‖* ≤ z_g ‖ξ^g‖* ≤ λ z_g ω_g, for which equality must hold. [sent-1048, score-0.32]
95 Precisely, by Lemma 9, we need to show that either ξ^g = u|g − ξ^h|g, if ‖u|g − ξ^h|g‖* ≤ t_g, or ‖ξ^g‖* = t_g and ξ^g⊤(u|g − ξ^h|g − ξ^g) = ‖ξ^g‖* ‖u|g − ξ^h|g − ξ^g‖. [sent-1072, score-0.564]
96 Note that the feasibility of ξ^g, that is, ‖ξ^g‖* ≤ t_g, holds by definition of κ^g. [sent-1073, score-0.282]
97 Non-negativity Constraint for the Proximal Operator. The next lemma shows how we can easily add a non-negativity constraint on the proximal operator when the norm Ω is absolute (Stewart and Sun, 1990). [sent-1133, score-0.52]
98 Lemma 11 (Non-negativity constraint for the proximal operator) Let κ ∈ R p and λ > 0. [sent-1135, score-0.326]
99 We keep the same notation as in Lemma 4 and assume from now on that ‖u|g‖_q′ > t_g and ‖u|h‖_q′ > t_g + t_h. [sent-1158, score-0.564]
100 We therefore have ‖u|g − ξ^h|g‖_q′ > t_g and, using again the second optimality condition of Lemma 9, there exists ρ > 0 such that for all j in g, |ξ^g_j|^q = ρ |u_j − ξ^g_j − ξ^h_j|^q. [sent-1171, score-0.322]
wordName wordTfidf (topN-words)
[('dictionary', 0.414), ('proximal', 0.326), ('tg', 0.282), ('roximal', 0.188), ('airal', 0.169), ('bozinski', 0.169), ('mairal', 0.169), ('ierarchical', 0.168), ('zg', 0.16), ('oding', 0.16), ('enatton', 0.152), ('nonconvex', 0.116), ('parse', 0.11), ('bach', 0.104), ('baraniuk', 0.103), ('tree', 0.1), ('ethods', 0.099), ('patches', 0.097), ('jenatton', 0.095), ('ista', 0.094), ('groups', 0.089), ('wavelet', 0.087), ('sphdl', 0.084), ('hierarchical', 0.08), ('donoho', 0.075), ('operator', 0.074), ('coding', 0.074), ('computesqnorm', 0.073), ('sgsqrt', 0.073), ('spdl', 0.073), ('fista', 0.072), ('norm', 0.069), ('dual', 0.067), ('blei', 0.066), ('topics', 0.066), ('children', 0.066), ('topic', 0.065), ('dictionaries', 0.064), ('sparse', 0.063), ('postings', 0.063), ('psnr', 0.062), ('prox', 0.056), ('sg', 0.055), ('recursivescaling', 0.052), ('zhao', 0.052), ('atoms', 0.051), ('lemma', 0.051), ('image', 0.051), ('pass', 0.051), ('projection', 0.049), ('denoising', 0.049), ('nonsmooth', 0.049), ('nonzero', 0.048), ('group', 0.047), ('beck', 0.046), ('sparsity', 0.046), ('convex', 0.045), ('inria', 0.045), ('signal', 0.045), ('accelerated', 0.044), ('sign', 0.044), ('kim', 0.044), ('depth', 0.044), ('ball', 0.043), ('teboulle', 0.043), ('documents', 0.042), ('uroot', 0.042), ('root', 0.041), ('combettes', 0.041), ('wavelets', 0.041), ('structured', 0.041), ('optimality', 0.04), ('pesquet', 0.04), ('socp', 0.04), ('primal', 0.04), ('ancestors', 0.039), ('regularization', 0.038), ('pixels', 0.038), ('obozinski', 0.037), ('mallat', 0.037), ('lda', 0.037), ('sprechmann', 0.036), ('cpu', 0.035), ('projections', 0.035), ('images', 0.034), ('decomposition', 0.033), ('rm', 0.033), ('unstructured', 0.032), ('aharon', 0.032), ('hierarchies', 0.032), ('nr', 0.032), ('counterexample', 0.032), ('signals', 0.032), ('lena', 0.031), ('recursivethresholding', 0.031), ('duality', 0.031), ('wright', 0.03), ('operators', 0.03), ('haar', 0.029), ('borwein', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999839 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
2 0.34657097 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
3 0.27722639 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.14335181 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
5 0.11665761 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
6 0.098160103 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
7 0.097859591 59 jmlr-2011-Learning with Structured Sparsity
8 0.086360477 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
9 0.083491087 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
10 0.083125129 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.068817757 54 jmlr-2011-Learning Latent Tree Graphical Models
12 0.067904018 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
13 0.06180235 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
14 0.060642213 97 jmlr-2011-Union Support Recovery in Multi-task Learning
15 0.056162819 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
16 0.054441825 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
17 0.046985783 105 jmlr-2011-lp-Norm Multiple Kernel Learning
18 0.045216743 101 jmlr-2011-Variable Sparsity Kernel Learning
19 0.041719574 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough
20 0.04127159 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
topicId topicWeight
[(0, 0.302), (1, 0.02), (2, 0.395), (3, 0.322), (4, -0.184), (5, 0.039), (6, 0.064), (7, -0.014), (8, 0.221), (9, -0.142), (10, -0.128), (11, 0.296), (12, -0.082), (13, 0.189), (14, 0.013), (15, 0.027), (16, 0.025), (17, -0.138), (18, 0.153), (19, -0.01), (20, -0.008), (21, 0.044), (22, 0.012), (23, -0.039), (24, -0.002), (25, 0.034), (26, 0.009), (27, -0.011), (28, -0.009), (29, 0.014), (30, -0.001), (31, -0.019), (32, 0.01), (33, 0.022), (34, 0.031), (35, -0.026), (36, -0.053), (37, -0.031), (38, -0.05), (39, -0.01), (40, -0.024), (41, 0.005), (42, -0.003), (43, 0.01), (44, -0.005), (45, 0.019), (46, 0.022), (47, 0.008), (48, -0.027), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.94461036 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
2 0.88871276 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
3 0.62266594 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.48879147 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
5 0.32129246 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
6 0.31241193 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
7 0.29992181 59 jmlr-2011-Learning with Structured Sparsity
8 0.23100205 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
9 0.22384429 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
10 0.21898977 54 jmlr-2011-Learning Latent Tree Graphical Models
11 0.21631777 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
12 0.2131222 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
13 0.20278868 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
14 0.19451387 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
15 0.19323941 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
16 0.18631341 104 jmlr-2011-X-Armed Bandits
17 0.1850467 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.18502934 105 jmlr-2011-lp-Norm Multiple Kernel Learning
19 0.18225473 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
20 0.17584215 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
topicId topicWeight
[(4, 0.058), (9, 0.529), (10, 0.03), (24, 0.031), (31, 0.07), (32, 0.012), (41, 0.016), (60, 0.018), (70, 0.011), (73, 0.028), (78, 0.061), (90, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.90720695 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
2 0.82689118 105 jmlr-2011-lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞-norm MKL in various scenarios. Importantly, empirical applications of ℓp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity
3 0.79332507 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach
Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers
4 0.63111389 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
5 0.52938432 66 jmlr-2011-Multiple Kernel Learning Algorithms
Author: Mehmet Gönen, Ethem Alpaydın
Abstract: In recent years, several methods have been proposed to combine multiple kernels instead of using a single one. These different kernels may correspond to using different notions of similarity or may be using information coming from multiple sources (different representations or different feature subsets). In trying to organize and highlight the similarities and differences between them, we give a taxonomy of and review several multiple kernel learning algorithms. We perform experiments on real data sets for better illustration and comparison of existing algorithms. We see that though there may not be large differences in terms of accuracy, there is difference between them in complexity as given by the number of stored support vectors, the sparsity of the solution as given by the number of used kernels, and training time complexity. We see that overall, using multiple kernels instead of a single one is useful and believe that combining kernels in a nonlinear or data-dependent way seems more promising than linear combination in fusing information provided by simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels. Keywords: support vector machines, kernel machines, multiple kernel learning
6 0.48680156 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
7 0.47437009 59 jmlr-2011-Learning with Structured Sparsity
8 0.46614611 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
9 0.45810413 91 jmlr-2011-The Sample Complexity of Dictionary Learning
10 0.43665412 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
11 0.43209943 48 jmlr-2011-Kernel Analysis of Deep Networks
12 0.42144939 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal
13 0.4201791 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.41615787 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
15 0.40849724 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
16 0.39634985 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
17 0.39102513 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
18 0.3891556 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
19 0.38777745 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals
20 0.38597074 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets