jmlr jmlr2011 jmlr2011-91 knowledge-graph by maker-knowledge-mining

91 jmlr-2011-The Sample Complexity of Dictionary Learning


Source: pdf

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. [sent-12, score-0.656]

2 Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? [sent-13, score-0.552]

3 We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. [sent-15, score-1.219]

4 For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). [sent-17, score-1.251]

5 Keywords: dictionary learning, generalization bound, sparse representation 1. [sent-21, score-0.694]

6 Introduction A common technique in processing signals from X = Rn is to use sparse representations; that is, to approximate each signal x by a “small” linear combination a of elements di from a dictionary D ∈ X^p, so that x ≈ Da = ∑_{i=1}^p a_i d_i. [sent-22, score-0.853]
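
To make the definition concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of the k-sparse representation error h_{Hk,D}(x) = min_{‖a‖₀ ≤ k} ‖Da − x‖₂, computed by brute force over all size-k supports; this is only practical for tiny p and k.

```python
import itertools
import numpy as np

def ksparse_error(D, x, k):
    """Brute-force h_{H_k,D}(x): best L2 error over all supports of size k.

    D : (n, p) dictionary with unit-norm columns, x : (n,) signal.
    Only practical for tiny p and k; real algorithms use OMP, basis pursuit, etc.
    """
    n, p = D.shape
    best = np.linalg.norm(x)          # the empty representation a = 0
    for support in itertools.combinations(range(p), k):
        Ds = D[:, support]            # n x k submatrix of selected atoms
        a, *_ = np.linalg.lstsq(Ds, x, rcond=None)
        best = min(best, np.linalg.norm(Ds @ a - x))
    return best

rng = np.random.default_rng(0)
n, p, k = 8, 12, 3
D = rng.standard_normal((n, p))
D /= np.linalg.norm(D, axis=0)        # normalize columns to the unit sphere
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
print(ksparse_error(D, x, k))
```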

7 The dictionary learning problem is to find a dictionary D minimizing E(D) = Ex∼ν hA,D (x), (2) where ν is a distribution over signals that is known to us only through samples from it. [sent-30, score-1.082]

8 The problem addressed in this paper is the “generalization” (in the statistical learning sense) of dictionary learning: to what extent does the performance of a dictionary chosen based on a finite set of samples indicate its expected error in (2)? [sent-31, score-1.02]

9 This clearly depends on the number of samples and other parameters of the problem such as the dictionary size. [sent-32, score-0.51]

10 In particular, an obvious algorithm is to represent each sample using itself, if the dictionary is allowed to be as large as the sample, but the performance on unseen signals is likely to disappoint. [sent-33, score-0.572]
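
The concern raised here can be seen in a toy experiment (my own sketch, assuming signals drawn uniformly from the unit sphere): taking the dictionary to be the m training signals themselves gives zero empirical error for 1-sparse representation, but a clearly positive error on fresh signals.

```python
import numpy as np

def h1_error(D, X):
    """1-sparse representation error for each row of X given unit-norm atoms in D's columns.

    For unit-norm d_i, min_{c,i} ||x - c d_i||^2 = ||x||^2 - max_i <d_i, x>^2.
    """
    proj = X @ D                                  # inner products <x, d_i>
    sq = (X ** 2).sum(axis=1) - (proj ** 2).max(axis=1)
    return np.sqrt(np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
n, m = 20, 50
train = rng.standard_normal((m, n)); train /= np.linalg.norm(train, axis=1, keepdims=True)
test = rng.standard_normal((m, n));  test /= np.linalg.norm(test, axis=1, keepdims=True)

D = train.T                                       # "represent each sample using itself"
print("train error:", h1_error(D, train).mean())  # 0 by construction
print("test  error:", h1_error(D, test).mean())   # clearly positive
```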

11 To state our goal more quantitatively, assume that an algorithm finds a dictionary D suited to k sparse representation, in the sense that the average representation error Em (D) on the m examples given to the algorithm is low. [sent-34, score-0.639]

12 In particular, such a result means that every procedure for approximate minimization of empirical error (empirical dictionary learning) is also a procedure for approximate dictionary learning (as defined above) in a similar sense. [sent-37, score-1.02]

13 Is it possible to extend the usefulness of dictionary learning techniques to this setting? [sent-41, score-0.51]

14 • Compression: If a signal x has an approximate sparse representation in some commonly known dictionary D, it can be stored or transmitted more economically with reasonable precision. [sent-47, score-0.681]

15 • Denoising: If a signal x has a sparse representation in some known dictionary D, and x̃ = x + ν, where the random noise ν is Gaussian, then the sparse representation found for x̃ will likely be very close to x (for example Chen et al. [sent-52, score-0.81]
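
A toy version of this denoising effect (my own sketch with a made-up noise level, not an experiment from the paper): when the clean signal is itself a dictionary atom, the 1-sparse representation of the noisy signal lands close to the clean signal.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
D = rng.standard_normal((n, p)); D /= np.linalg.norm(D, axis=0)

x = D[:, 0]                                # clean signal: one atom
x_noisy = x + 0.1 * rng.standard_normal(n) # Gaussian noise (hypothetical level)

# 1-sparse representation of the noisy signal: pick the best-correlated atom.
i = np.argmax(np.abs(D.T @ x_noisy))
c = D[:, i] @ x_noisy                      # optimal coefficient for a unit-norm atom
x_hat = c * D[:, i]

print("distance of noisy signal from x :", np.linalg.norm(x_noisy - x))
print("distance after 1-sparse coding  :", np.linalg.norm(x_hat - x))
```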

16 • Compressed sensing: Assuming that a signal x has a sparse representation in some known dictionary D that fulfills certain geometric conditions, this representation can be approximately retrieved with high probability from a small number of random linear measurements of x. [sent-54, score-0.75]

17 The implications of these results are significant when a dictionary D is known that sparsely represents simultaneously many signals. [sent-56, score-0.534]

18 In some applications the dictionary is chosen based on prior knowledge, but in many applications the dictionary is learned based on a finite set of examples. [sent-57, score-1.02]

19 To motivate dictionary learning, consider an image representation used for compression or denoising. [sent-58, score-0.579]

20 Different types of images may have different properties (MRI images are not similar to scenery images), so that learning a dictionary specific to each type of images may lead to improved performance. [sent-59, score-0.573]

21 The benefits of dictionary learning have been demonstrated in many applications (Protter and Elad, 2007; Peyré, 2009). [sent-60, score-0.51]

22 Two extensively used techniques related to dictionary learning are Principal Component Analysis (PCA) and K-means clustering. [sent-61, score-0.51]

23 The former finds a single subspace minimizing the sum of squared representation errors, which is very similar to dictionary learning with A = Hk and p = k. [sent-62, score-0.579]

24 The latter finds a set of locations minimizing the sum of squared distances between each signal and the location closest to it, which is very similar to dictionary learning with A = H1 where p is the number of locations. [sent-63, score-0.552]

25 Thus we could see dictionary learning as PCA with multiple subspaces, or as clustering where multiple locations are used to represent each signal. [sent-64, score-0.51]

26 The only prior work we are aware of that addresses generalization in dictionary learning, by Maurer and Pontil (2010), addresses the convex representation constraint A = Rλ ; we discuss the relation of our work to theirs in Section 2. [sent-79, score-0.634]

27 Our results are: A new approach to dictionary learning generalization. [sent-82, score-0.51]

28 Our first main contribution is an approach to generalization bounds in dictionary learning that is complementary to the approach used by Maurer and Pontil (2010). [sent-83, score-0.64]

29 In Theorem 1 we quantify the complexity of the class of functions hA,D over all dictionaries whose columns have unit length, where A ⊂ Rλ . [sent-86, score-0.301]

30 The main significance of this is that the general statistical behavior they imply occurs in dictionary learning. [sent-94, score-0.51]

31 (1998) of order m^{−1/2} on the k-means clustering problem, which corresponds to dictionary learning for 1-sparse representation, fast rates may be expected only with η > 0, as presented here. [sent-98, score-0.51]

32 Distance from orthogonality is measured by the Babel function (which, for example, upper bounds the magnitude of the maximal inner product between distinct dictionary elements) defined below and discussed in more detail in Section 4. [sent-105, score-0.659]

33 The following proposition, which is proved in Section 3, bounds the 1-norm of the dictionary coefficients for a k sparse representation and also follows from analysis previously done by Donoho and Elad (2003) and Tropp (2004). [sent-113, score-0.714]
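
For concreteness, the Babel function can be computed directly from the Gram matrix; the sketch below (mine, following the standard definition µ_k(D) = max_i max_{Λ⊆{1,…,p}\{i}, |Λ|=k} ∑_{j∈Λ} |⟨di, dj⟩|) sums, for each atom, its k largest absolute inner products with the other atoms.

```python
import numpy as np

def babel(D, k):
    """Babel function mu_k(D) for a dictionary with unit-norm columns.

    For each atom d_i, sum the k largest |<d_i, d_j>| over j != i,
    then take the maximum over i.
    """
    G = np.abs(D.T @ D)
    np.fill_diagonal(G, 0.0)             # exclude <d_i, d_i>
    G.sort(axis=1)                       # ascending per row
    return G[:, -k:].sum(axis=1).max()

rng = np.random.default_rng(0)
n, p = 64, 20
D = rng.standard_normal((n, p)); D /= np.linalg.norm(D, axis=0)
print(babel(D, 3))                       # small for near-orthogonal dictionaries
```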

34 Since low values of the Babel function have implications for representation finding algorithms, this result is of interest also outside the context of dictionary learning. [sent-125, score-0.579]

35 Essentially it means that random dictionaries whose cardinality is sub-exponential in (n − 2)/k² have a low Babel function. [sent-126, score-0.269]

36 The covering number bound of Theorem 1 implies several generalization bounds for the problem of dictionary learning for l1 regularized representations which we give here. [sent-128, score-0.804]

37 Then with probability at least 1 − e−x over the m samples in Em drawn according to ν, for all dictionaries D ⊂ B with cardinality p: E h²_{A,D} ≤ Em h²_{A,D} + (p²/√m)(14λ + (1/2)√(ln(16mλ²))) + √(x/(2m)). [sent-133, score-0.522]

38 Then with probability at least 1 − e−x over the m samples in Em drawn according to ν, for all D with unit length columns: E h_{Rλ,D} ≤ Em h_{Rλ,D} + √(np ln(4√m λ)/(2m)) + √(x/(2m)) + 4/√m. [sent-137, score-0.402]

39 Proposition 3 and Corollary 4 imply certain generalization bounds for the problem of dictionary learning for k sparse representations, which we give here. [sent-146, score-0.7]

40 µk−1 (D) ≤ δ and with unit length columns: E h²_{Hk,D} ≤ Em h²_{Hk,D} + (p²/√m)(14k/(1−δ) + (1/2)√(ln(16m(k/(1−δ))²))) + √(x/(2m)). [sent-151, score-0.286]

41 µk−1 (D) ≤ δ and with unit length columns: E h_{Hk,D} ≤ Em h_{Hk,D} + √(np ln(4√m k/(1−δ))/(2m)) + √(x/(2m)) + 4/√m. [sent-158, score-0.402]

42 µk−1 (D) ≤ δ and with unit length columns: np ln EhHk ,D ≤ 1. [sent-162, score-0.402]
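
To get a rough feel for these rates, the following sketch (mine) evaluates only the order-of-magnitude forms quoted in the abstract, O(√(np ln(mλ)/m)) and O(√(np ln(mk)/m)), with all constants dropped; it is not an evaluation of the exact theorem statements.

```python
import numpy as np

def l1_rate(n, p, m, lam):
    """Dominant term of the l1 bound, up to constants: sqrt(n p ln(m lam) / m)."""
    return np.sqrt(n * p * np.log(m * lam) / m)

def ksparse_rate(n, p, m, k):
    """Dominant term of the k-sparse bound, up to constants: sqrt(n p ln(m k) / m)."""
    return np.sqrt(n * p * np.log(m * k) / m)

n, p, lam, k = 64, 256, 10.0, 5
for m in (10_000, 100_000, 1_000_000):
    print(m, round(l1_rate(n, p, m, lam), 3), round(ksparse_rate(n, p, m, k), 3))
```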

43 Generalization bounds for dictionary learning in feature spaces. [sent-164, score-0.614]

44 We further consider applications of dictionary learning to signals that are not represented as elements in a vector space, or that have a very high (possibly infinite) dimension. [sent-165, score-0.611]

45 The dictionary of elements used for representation could be decided via dictionary learning, and it is natural to choose the dictionary so that the bags of words of documents are approximated well by small linear combinations of those in the dictionary. [sent-175, score-1.638]

46 The dimension free results of Maurer and Pontil (2010) apply most naturally in this setting, and may be combined with our results to cover also dictionaries for k sparse representation, under reasonable assumptions on the kernel. [sent-178, score-0.374]

47 Then with probability at least 1 − e−x over the m samples in Em drawn according to ν, for all D ⊂ R with cardinality p such that ΦD ⊂ BH and µk−1 (ΦD) ≤ δ < 1: E h²_{φ,Hk,D} ≤ Em h²_{φ,Hk,D} + (p²/√m)(14k/(1−δ) + (1/2)√(ln(16m(k/(1−δ))²))) + √(x/(2m)). [sent-180, score-0.281]

48 Let δ < 1, and ν any distribution on R , then with probability at least 1 − e−x over the m samples in Em drawn according to ν, for all dictionaries D ⊂ R of cardinality p s. [sent-190, score-0.269]

49 µk−1 (ΦD) ≤ δ < 1 (where Φ acts like φ on columns): E h_{Hk,D} ≤ Em h_{Hk,D} + γ(√(np ln(√m C kγ²L/(1−δ))/(2αm)) + √(x/(2m)) + 4/√m). [sent-192, score-0.369]

50 The covering number bounds needed to prove this theorem and analogs for the other generalization bounds are proved in Section 5. [sent-193, score-0.336]

51 We also show that in the k sparse representation setting a finite bound on λ does not occur generally; thus an additional restriction, such as the near-orthogonality on the set of dictionaries on which we rely in this setting, is necessary. [sent-196, score-0.395]

52 To prove Theorem 1 and Corollary 4 we first note that the space of all possible dictionaries is a subset of a unit ball in a Banach space of dimension np (with a norm specified below). [sent-202, score-0.445]

53 Thus (see formalization in Proposition 5 of Cucker and Smale, 2002) the space of dictionaries has an ε cover of size (4/ε)^{np}. [sent-203, score-0.314]

54 Then it is enough to show that Ψλ defined as D → hRλ ,D and Φk defined as D → hHk ,D are uniformly Lipschitz (when Φk is restricted to the dictionaries with µk−1 (D) ≤ c < 1). [sent-205, score-0.268]

55 The images of Ψλ and Φk are sets of representation error functions—each dictionary induces a set of precisely representable signals, and a representation error function is simply a map of distances from this set. [sent-213, score-0.669]

56 Proof Let D and D′ be two dictionaries whose corresponding elements are at most ε > 0 far from one another. [sent-217, score-0.28]

57 Lemma 19 The function Φk is a k/(1 − δ)-Lipschitz mapping from the set of normalized dictionaries with µk−1 (D) < δ, with the metric induced by ‖·‖_{1,2}, to C(Sn−1). [sent-237, score-0.302]
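
The Lipschitz behavior behind the covering argument is easy to probe numerically. The sketch below (mine; it uses brute-force k-sparse coding, a random near-orthogonal dictionary, and takes the k/(1 − δ) constant of the extracted Lemma 19 at face value) perturbs each atom slightly and compares the observed change in h_{Hk,D} with the Lipschitz-style bound; it is an empirical illustration, not a proof.

```python
import itertools
import numpy as np

def ksparse_error(D, x, k):
    """Brute-force h_{H_k,D}(x) over all size-k supports (tiny p, k only)."""
    best = np.linalg.norm(x)
    for s in itertools.combinations(range(D.shape[1]), k):
        a, *_ = np.linalg.lstsq(D[:, s], x, rcond=None)
        best = min(best, np.linalg.norm(D[:, s] @ a - x))
    return best

def babel(D, k):
    G = np.abs(D.T @ D); np.fill_diagonal(G, 0.0); G.sort(axis=1)
    return G[:, -k:].sum(axis=1).max()

rng = np.random.default_rng(0)
n, p, k, eps = 30, 10, 2, 1e-3
D = rng.standard_normal((n, p)); D /= np.linalg.norm(D, axis=0)
D2 = D + eps * rng.standard_normal((n, p)); D2 /= np.linalg.norm(D2, axis=0)

delta = babel(D, k - 1)                            # mu_{k-1}(D), < 1 here
col_dist = np.linalg.norm(D - D2, axis=0).max()    # column-wise perturbation size
worst = 0.0
for _ in range(200):
    x = rng.standard_normal(n); x /= np.linalg.norm(x)
    worst = max(worst, abs(ksparse_error(D, x, k) - ksparse_error(D2, x, k)))
print("observed change:", worst, " Lipschitz-style bound:", k / (1 - delta) * col_dist)
```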

58 Proof First we show that for any dictionary D there exist c > 0 and x ∈ Sn−1 such that hHk ,D (x) > c. [sent-242, score-0.51]

59 Then for that c and any dictionary D there exists a set of positive measure on which h_{Hk,D} > c; let q be a point in this set. [sent-247, score-0.51]

60 We now fix the dictionary D; its first k − 1 elements are the standard basis {e1 , . [sent-249, score-0.549]

61 To conclude the generalization bounds of Theorems 7, 8, 10, 11 and 14 from the covering number bounds we have provided, we use the following results. [sent-256, score-0.306]

62 Then with probability at least 1 − exp (−x), we have for all f ∈ F : E f ≤ (1 + β) Em f + K (d, m, β) where K (d, m, β) = 2 9 √ m +2 d+3 3d +1+ 9 √ m +2 d ln (Cm) + x , m d+3 3d 1 + 1 + 2β . [sent-264, score-0.281]

63 The probability that any of the |Fε| differences under the supremum is larger than t may be bounded as P(sup_{f∈Fε} (E f − Em f) ≥ t) ≤ |Fε| · exp(−2mB^{−2}t²) ≤ exp(d ln(C/ε) − 2mB^{−2}t²). [sent-277, score-0.34]

64 In order to control the probability with x as in the statement of the lemma, we take −x = d ln(C/ε) − 2mB^{−2}t², or equivalently we choose t = B√((d ln(C/ε) + x)/(2m)). [sent-278, score-0.506]

65 Using the covering number bound assumption and √(d ln(C/ε)/(2m) + x/(2m)). [sent-280, score-0.379]

66 On the Babel Function The Babel function is one of several metrics defined in the sparse representations literature to quantify an “almost orthogonality” property that dictionaries may enjoy. [sent-283, score-0.339]

67 Definition 24 The coherence of a dictionary D is µ1 (D) = max_{i≠j} |⟨di, dj⟩|. [sent-290, score-0.603]

68 The tightness of the right inequality is witnessed by a dictionary including k + 1 copies of the same element. [sent-295, score-0.532]
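
The inequality being referred to is, on my reading, the standard chain µ1(D) ≤ µk(D) ≤ k·µ1(D); the sketch below checks it on a random dictionary and reproduces the tightness example with k + 1 copies of a single atom.

```python
import numpy as np

def coherence(D):
    """mu_1(D) = max_{i != j} |<d_i, d_j>| for unit-norm columns."""
    G = np.abs(D.T @ D); np.fill_diagonal(G, 0.0)
    return G.max()

def babel(D, k):
    G = np.abs(D.T @ D); np.fill_diagonal(G, 0.0); G.sort(axis=1)
    return G[:, -k:].sum(axis=1).max()

rng = np.random.default_rng(0)
D = rng.standard_normal((40, 15)); D /= np.linalg.norm(D, axis=0)
k = 4
print(coherence(D) <= babel(D, k) <= k * coherence(D))   # True

d = rng.standard_normal(40); d /= np.linalg.norm(d)
D_tight = np.tile(d[:, None], (1, k + 1))                # k+1 copies of one atom
print(babel(D_tight, k), k * coherence(D_tight))         # both equal k
```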

69 A well known generic class of dictionaries with more elements than a basis is that of frames (see Duffin and Schaeffer, 1952), which includes many wavelet systems and filter banks. [sent-303, score-0.28]

70 This may be easily verified by considering the inner products of any dictionary element with any other k elements as a vector in Rk; the frame condition bounds its squared Euclidean norm by B − 1 (we remove the inner product of the element with itself in the frame expression). [sent-307, score-0.779]

71 If the answer is affirmative, it implies that Theorem 11 is quite strong, and representation finding algorithms such as basis pursuit are almost always exact, which might help prove properties of dictionary learning algorithms. [sent-312, score-0.579]

72 While there might be better probability measures on the space of dictionaries, we consider one that seems natural: suppose that a dictionary D is constructed by choosing p unit vectors uniformly from Sn−1 ; what is the probability that µk−1 (D) < δ? [sent-314, score-0.57]
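
That question can be probed empirically; the following Monte Carlo sketch (mine, not the paper's isoperimetric argument) draws p unit vectors uniformly from Sn−1 and estimates the probability that µk−1(D) < δ, which visibly increases with the dimension n.

```python
import numpy as np

def babel(D, k):
    G = np.abs(D.T @ D); np.fill_diagonal(G, 0.0); G.sort(axis=1)
    return G[:, -k:].sum(axis=1).max()

def prob_low_babel(n, p, k, delta, trials=500, seed=0):
    """Estimate Pr[mu_{k-1}(D) < delta] for p atoms drawn uniformly from S^{n-1}."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        D = rng.standard_normal((n, p))
        D /= np.linalg.norm(D, axis=0)   # normalized Gaussians are uniform on the sphere
        hits += babel(D, k - 1) < delta
    return hits / trials

for n in (32, 128, 512):
    print(n, prob_low_babel(n, p=50, k=5, delta=0.5))
```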

73 Dictionary Learning in Feature Spaces We propose in Section 2 a scenario in which dictionary learning is performed in a feature space corresponding to a kernel function. [sent-319, score-0.574]

74 A general feature space, denoted H , is a Hilbert space to which Theorem 6 applies as is, under the simple assumption that the dictionary elements and signals are in its unit ball; this assumption is guaranteed by some kernels such as the Gaussian kernel. [sent-322, score-0.694]

75 The corresponding generalization bound for k sparse representations when the dictionary elements are nearly orthogonal in the feature space is given in Proposition 13. [sent-325, score-0.756]
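
In a feature space one never forms φ explicitly; the k-sparse representation error can be evaluated from kernel values alone, since the minimum over a supported on Λ of ‖φ(x) − (ΦD)_Λ a‖²_H equals κ(x, x) − k_Λ(x)ᵀ K_Λ⁺ k_Λ(x), with K_Λ the Gram matrix of the selected atoms and k_Λ(x) their kernel values against x. The sketch below (mine, with a Gaussian kernel and a made-up bandwidth, brute force over supports) implements exactly that.

```python
import itertools
import numpy as np

def gaussian_kernel(a, b, gamma=0.5):
    """kappa(a, b) = exp(-gamma * ||a - b||^2); note kappa(a, a) = 1."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_ksparse_sq_error(atoms, x, k, kernel=gaussian_kernel):
    """h^2_{phi,H_k,D}(x) using only kernel evaluations (brute force over supports)."""
    p = len(atoms)
    kx = np.array([kernel(d, x) for d in atoms])           # kappa(d_j, x)
    K = np.array([[kernel(di, dj) for dj in atoms] for di in atoms])
    best = kernel(x, x)                                    # empty representation
    for s in itertools.combinations(range(p), k):
        s = list(s)
        Ks, ks = K[np.ix_(s, s)], kx[s]
        a = np.linalg.pinv(Ks) @ ks                        # optimal coefficients
        best = min(best, kernel(x, x) - ks @ a)
    return max(best, 0.0)                                  # guard tiny negative round-off

rng = np.random.default_rng(0)
atoms = [rng.standard_normal(5) for _ in range(8)]
x = rng.standard_normal(5)
print(kernel_ksparse_sq_error(atoms, x, k=3))
```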

76 The results so far show that generalization in dictionary learning can occur despite the potentially infinite dimension of the feature space, without considering practical issues of representation and computation. [sent-328, score-0.663]

77 The elements of a dictionary D are now from R , and we denote ΦD their mapping by φ to H . [sent-331, score-0.549]

78 The cover number bounds depend strongly on the dimension of the space of dictionary elements. [sent-339, score-0.658]

79 Taking H as the space of dictionary elements is the simplest approach, but may lead to vacuous or weak bounds, for example in the case of the Gaussian kernel whose feature space is infinite dimensional. [sent-340, score-0.613]

80 A Hölder feature map φ allows us to bound the cover numbers of the dictionary elements in H using their cover number bounds in R. [sent-348, score-0.893]

81 Note that not every kernel corresponds to a Hölder feature map (the Dirac δ kernel is a counterexample: any two distinct elements are mapped to elements at a mutual distance of √2), and sometimes analyzing the feature map is harder than analyzing the kernel. [sent-349, score-0.275]

82 The next proposition completes this section by giving the cover number bounds for the representation error function classes induced by appropriate kernels, from which various generalization bounds easily follow, such as Theorem 14. [sent-358, score-0.417]

83 Proposition 30 Let R be a set of representations with a cover number bound of (C/ε)^n, and let either φ be uniformly L Hölder of order α on R, or κ be uniformly L Hölder of order 2α on R in each parameter, and let γ = sup_{d∈R} ‖φ(d)‖_H. [sent-359, score-0.328]

84 Then the function classes Wφ,λ and Qφ,k,δ, taken as metric spaces with the supremum norm, have ε covers of cardinalities at most (C(λγL/ε)^{1/α})^{np} and (C(kγ²L/(ε(1 − δ)))^{1/α})^{np}, respectively. [sent-360, score-0.299]

85 If ‖a‖₁ ≤ λ and max_{d∈D} ‖φ(d)‖_H ≤ γ then by considerations applied in Section 3, to obtain an ε cover of the image of dictionaries {min_a ‖(ΦD)a − φ(x)‖_H : D ∈ D}, it is enough to obtain an ε/(λγ) cover of {ΦD : D ∈ D}. [sent-362, score-0.387]

86 If also the feature mapping φ is uniformly L Hölder of order α over R then a (λγL/ε)^{−1/α} cover of the set of dictionaries is sufficient, which as we have seen requires at most (C(λγL/ε)^{1/α})^{np} elements. [sent-363, score-0.555]

87 Conclusions Our work has several implications for the design of dictionary learning algorithms as used in signal, image, and natural language processing. [sent-366, score-0.51]

88 It might thus be useful to modify dictionary learning algorithms so that they obtain dictionaries with low Babel functions, possibly through regularization or through certain convex relaxations. [sent-369, score-0.751]
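
The paper only raises regularization as a possibility; one common smooth surrogate for encouraging near-orthogonality is a penalty on off-diagonal Gram entries such as ‖DᵀD − I‖²_F. The sketch below (my own choice of penalty, step size, and projection, not an algorithm from the paper) takes gradient steps on that penalty and renormalizes the columns, which in practice drives the Babel function down.

```python
import numpy as np

def incoherence_step(D, lr=0.1):
    """One gradient step on 0.25 * ||D^T D - I||_F^2, then renormalize columns.

    The gradient of 0.25 * ||D^T D - I||_F^2 with respect to D is D (D^T D - I).
    """
    G = D.T @ D - np.eye(D.shape[1])
    D = D - lr * D @ G
    return D / np.linalg.norm(D, axis=0)

def babel(D, k):
    G = np.abs(D.T @ D); np.fill_diagonal(G, 0.0); G.sort(axis=1)
    return G[:, -k:].sum(axis=1).max()

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 20)); D /= np.linalg.norm(D, axis=0)
print("before:", babel(D, 3))
for _ in range(50):
    D = incoherence_step(D)
print("after :", babel(D, 3))
```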

89 We view the dependence on µk−1 from an “algorithmic luckiness” perspective (Herbrich and Williamson, 2003): if the data are described by a dictionary with low Babel function the generalization bounds are encouraging. [sent-378, score-0.64]

90 Inequality (5) follows from Lemma 31: Pr[∃g ∈ G : Eg > Em g + √(2 Var g (d ln(Cm) + x)/m) + 2(d ln(Cm) + x)/(3m)] ≤ e−x. [sent-395, score-0.506]

91 Inequality (6) follows from Lemma 33 because √(2 Var g (d ln(Cm) + x)/m) = √(2(d ln(Cm) + x)/m) √(Var g) ≤ √(2(d ln(Cm) + x)/m) (√(Var f) + √(2(d ln(Cm) + x)/m)). [sent-396, score-0.759]

92 We also denote A = Em f + K(d ln(Cm) + x)/(3dm) and B = (d ln(Cm) + x)/m, and note we have shown that with probability at least 1 − exp(−x) we have E f − A ≤ √(2B E f), which by Lemma 34 implies E f ≤ A + B + √(2AB + B²). [sent-399, score-0.281]

93 From Lemma 35 with a = Em f and b = 2(d ln(Cm) + x)/m we find that for every λ > 0, √(2 Em f (d ln(Cm) + x)/m) ≤ λ Em f + (1/(2λ))(d ln(Cm) + x)/m, and the proposition follows. [sent-401, score-0.804]

94 Then Pr[∃g ∈ G : Eg > Em g + √(2 Var g (d ln(Cm) + x)/m) + 2(d ln(Cm) + x)/(3m)] ≤ e−x. [sent-404, score-0.506]

95 Taking a union bound over all |G|, we have: Pr[∃g ∈ G : Eg − Em g > √(2y Var g/m) + 2y/(3m)] ≤ |G| exp(−y) ⇐⇒ Pr[∃g ∈ G : Eg − Em g > √(2y Var g/m) + 2y/(3m)] ≤ exp(ln |G| − y) ⇒ Pr[∃g ∈ G : Eg − Em g > √(2y Var g/m) + 2y/(3m)] ≤ exp(ln((Cm)^d) − y). [sent-414, score-0.362]

96 Then we take −x = ln((Cm)^d) − y ⇐⇒ y = ln((Cm)^d) + x and have: Pr[∃g ∈ G : Eg > Em g + √(2 Var g (d ln(Cm) + x)/m) + 2(d ln(Cm) + x)/(3m)] ≤ exp(−x). [sent-415, score-0.787]

97 Lemma 36 For any a, b ≥ 0, √ √ √ a+b ≤ a+ b Lemma 37 For d, m ≥ 1, x ≥ 0 and C ≥ e we have d ln (Cm) + x 2 9 √ +2 + ≤ m 3m m Proof By the assumptions, 9 √ m 9 √ +2 m d +3 3d d ln (Cm) + x . [sent-424, score-0.506]

98 Then 9 d ln (Cm) + x 2 √ +2 + = m 3m m 9 d ln (Cm) + x 2 √ +2 + m 3m m 9 d ln (Cm) + x + 3 ≤ √ +2 m 3m = 3 d ln (Cm) + x + d d 9 √ +2 m 3m 3 d ln (Cm) + x + d (d ln (Cm) + x) 9 ≤ √ +2 m 3m 9 d + 3 d ln (Cm) + x = √ +2 . [sent-426, score-1.771]

99 K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. [sent-445, score-0.322]

100 Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. [sent-497, score-0.37]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dictionary', 0.51), ('cm', 0.349), ('babel', 0.269), ('ln', 0.253), ('dictionaries', 0.241), ('em', 0.222), ('hhk', 0.171), ('vainsencher', 0.159), ('var', 0.158), ('eg', 0.149), ('annor', 0.135), ('ruckstein', 0.135), ('maurer', 0.125), ('np', 0.116), ('ictionary', 0.115), ('dk', 0.104), ('covering', 0.101), ('ample', 0.094), ('omplexity', 0.083), ('sn', 0.076), ('bounds', 0.075), ('cover', 0.073), ('di', 0.07), ('representation', 0.069), ('lder', 0.069), ('hr', 0.065), ('ful', 0.063), ('signals', 0.062), ('sparse', 0.06), ('generalization', 0.055), ('orthogonality', 0.048), ('pontil', 0.048), ('lls', 0.047), ('proposition', 0.045), ('lemma', 0.043), ('signal', 0.042), ('technion', 0.04), ('donoho', 0.04), ('elements', 0.039), ('representations', 0.038), ('pr', 0.037), ('bruckstein', 0.037), ('ehhk', 0.037), ('isoperimetric', 0.037), ('metric', 0.036), ('da', 0.036), ('kernel', 0.035), ('norm', 0.034), ('earning', 0.033), ('unit', 0.033), ('sphere', 0.032), ('elad', 0.032), ('alfred', 0.032), ('lewicki', 0.031), ('lq', 0.031), ('hk', 0.031), ('supremum', 0.031), ('theorem', 0.03), ('feature', 0.029), ('exp', 0.028), ('shahar', 0.028), ('bor', 0.028), ('bartlett', 0.028), ('cardinality', 0.028), ('coef', 0.027), ('uniformly', 0.027), ('columns', 0.027), ('rn', 0.026), ('inner', 0.026), ('biau', 0.026), ('mairal', 0.026), ('olshausen', 0.026), ('shie', 0.026), ('induced', 0.025), ('bound', 0.025), ('proof', 0.025), ('israel', 0.025), ('cucker', 0.024), ('duf', 0.024), ('ehr', 0.024), ('peyr', 0.024), ('protter', 0.024), ('sparsely', 0.024), ('mendelson', 0.024), ('corollary', 0.023), ('ek', 0.023), ('products', 0.023), ('coherence', 0.023), ('tropp', 0.023), ('frame', 0.023), ('inequality', 0.022), ('michael', 0.022), ('overcomplete', 0.021), ('images', 0.021), ('ball', 0.021), ('ac', 0.021), ('applies', 0.021), ('soc', 0.021), ('mina', 0.021), ('andreas', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

2 0.27722639 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

3 0.14458111 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

Author: Stéphane Gaïffas, Guillaume Lecué

Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars

4 0.089308612 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates

5 0.083981164 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

6 0.07626155 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.074360654 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.071400575 59 jmlr-2011-Learning with Structured Sparsity

9 0.071178056 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

10 0.064646348 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

11 0.060480427 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

12 0.055655804 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

13 0.052957926 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

14 0.052945536 41 jmlr-2011-Improved Moves for Truncated Convex Models

15 0.044459045 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

16 0.044013713 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

17 0.043721411 105 jmlr-2011-lp-Norm Multiple Kernel Learning

18 0.042569477 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

19 0.039891943 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

20 0.039084032 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.244), (1, 0.006), (2, 0.255), (3, 0.285), (4, -0.038), (5, 0.075), (6, 0.039), (7, -0.028), (8, 0.024), (9, -0.131), (10, -0.135), (11, 0.074), (12, -0.105), (13, 0.236), (14, 0.058), (15, 0.035), (16, 0.019), (17, 0.088), (18, -0.025), (19, -0.126), (20, 0.003), (21, -0.119), (22, -0.055), (23, -0.186), (24, 0.157), (25, 0.043), (26, -0.051), (27, -0.118), (28, 0.138), (29, 0.052), (30, 0.003), (31, 0.01), (32, -0.124), (33, -0.02), (34, 0.06), (35, 0.012), (36, -0.042), (37, -0.081), (38, -0.056), (39, -0.06), (40, -0.015), (41, -0.045), (42, 0.045), (43, 0.0), (44, 0.043), (45, -0.068), (46, 0.13), (47, 0.044), (48, 0.019), (49, 0.12)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95439762 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

2 0.62484431 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

3 0.50365859 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

Author: Stéphane Gaïffas, Guillaume Lecué

Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars

4 0.41185153 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach

Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers

5 0.39453405 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by √n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates

6 0.36972708 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

7 0.33643734 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.29843494 97 jmlr-2011-Union Support Recovery in Multi-task Learning

9 0.29826325 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

10 0.28269821 59 jmlr-2011-Learning with Structured Sparsity

11 0.25907722 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

12 0.25606787 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

13 0.23618656 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

14 0.22402695 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

15 0.22329034 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

16 0.22275499 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

17 0.22060615 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

18 0.21586333 41 jmlr-2011-Improved Moves for Truncated Convex Models

19 0.21551414 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

20 0.21415694 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.096), (6, 0.015), (8, 0.315), (9, 0.094), (10, 0.024), (11, 0.01), (24, 0.036), (31, 0.08), (32, 0.027), (41, 0.03), (60, 0.013), (65, 0.021), (73, 0.049), (78, 0.079), (90, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.7154994 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

Author: Mauricio A. Álvarez, Neil D. Lawrence

Abstract: Recently there has been an increasing interest in regression methods that deal with multiple outputs. This has been motivated partly by frameworks like multitask learning, multisensor networks or structured output data. From a Gaussian processes perspective, the problem reduces to specifying an appropriate covariance function that, whilst being positive semi-definite, captures the dependencies between all the data points and across all the outputs. One approach to account for non-trivial correlations between outputs employs convolution processes. Under a latent function interpretation of the convolution transform we establish dependencies between output variables. The main drawbacks of this approach are the associated computational and storage demands. In this paper we address these issues. We present different efficient approximations for dependent output Gaussian processes constructed through the convolution formalism. We exploit the conditional independencies present naturally in the model. This leads to a form of the covariance similar in spirit to the so called PITC and FITC approximations for a single output. We show experimental results with synthetic and real data, in particular, we show results in school exams score prediction, pollution prediction and gene expression data. Keywords: Gaussian processes, convolution processes, efficient approximations, multitask learning, structured outputs, multivariate processes

same-paper 2 0.70423359 91 jmlr-2011-The Sample Complexity of Dictionary Learning

Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein

Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation

3 0.4750855 105 jmlr-2011-lp-Norm Multiple Kernel Learning

Author: Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Alexander Zien

Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1 -norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is ℓ p -norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse and ℓ∞ -norm MKL in various scenarios. Importantly, empirical applications of ℓ p -norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state-of-the-art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/. Keywords: multiple kernel learning, learning kernels, non-sparse, support vector machine, convex conjugate, block coordinate descent, large scale optimization, bioinformatics, generalization bounds, Rademacher complexity

4 0.47143161 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach

Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers

5 0.47140205 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

6 0.46430939 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

7 0.46147859 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

8 0.44830883 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

9 0.44751841 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

10 0.44457212 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

11 0.44363478 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

12 0.44220102 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

13 0.44155627 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

14 0.43506917 104 jmlr-2011-X-Armed Bandits

15 0.42929602 66 jmlr-2011-Multiple Kernel Learning Algorithms

16 0.42874506 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

17 0.42547333 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

18 0.42387798 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.42224535 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

20 0.42207256 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models