jmlr jmlr2011 jmlr2011-59 knowledge-graph by maker-knowledge-mining

59 jmlr-2011-Learning with Structured Sparsity


Source: pdf

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. [sent-6, score-0.414]

2 A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. [sent-8, score-0.424]

3 It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. [sent-9, score-0.724]

4 Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. [sent-10, score-0.478]

5 It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. [sent-11, score-0.357]

6 Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. [sent-12, score-0.51]

7 Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing 1. [sent-13, score-0.459]

8 Since (2) and (3) penalize the coding complexity c(β), we shall call this approach coding complexity regularization. [sent-51, score-0.634]

9 In particular, the theoretical analysis in our companion paper (Huang and Zhang, 2010) for group Lasso fails to yield meaningful bounds for more complex convex relaxation methods that are proposed for general structured sparsity formulations considered in this paper. [sent-56, score-0.446]

10 For this reason, we present an extension of the standard greedy OMP algorithm that can be applied to general structured sparsity problems, and more importantly, meaningful sparse recovery bounds can be obtained for this algorithm. [sent-57, score-0.566]

11 For example, group sparsity has been considered for simultaneous sparse approximation (Wipf and Rao, 2007) and multi-task compressive sensing and learning (Argyriou et al. [sent-63, score-0.495]

12 Huang and Zhang (2010) developed a theory for group Lasso using a concept called strong group sparsity, which is a special case of the general structured sparsity idea considered here. [sent-77, score-0.551]

13 It was shown in Huang and Zhang (2010) that group Lasso is superior to standard Lasso for strongly group-sparse signals, which provides a convincing theoretical justification for using group structured sparsity. [sent-78, score-0.354]

14 While group Lasso works under the strong group sparsity assumption, it doesn’t handle the more general structures considered in this paper. [sent-81, score-0.439]

15 , 2009) considered structured sparsity in the convex relaxation setting, and extended group Lasso to more complicated sparse regularization conditions. [sent-92, score-0.487]

16 Again, since convex relaxation methods are more difficult to analyze in the structured sparsity setting with overlapping groups, a satisfactory theoretical justification remains an open challenge. [sent-94, score-0.354]

17 In other words, it does not provide a common scheme to represent their "models" for different structured sparsity data. [sent-118, score-0.351]

18 In structured sparsity, the cost of F is an upper bound of the coding length of F (number of bits needed to represent F by a computer program) in a pre-chosen prefix coding scheme. [sent-142, score-0.756]

19 The probability model of structured sparse learning is thus: first generate the sparsity pattern F according to probability 2^{−cl(F)}; then generate the coefficients in F. [sent-147, score-0.354]

20 The corresponding structured sparse coding complexity of F is defined as c(F) = |F| + cl(F). [sent-150, score-0.465]

21 A coding length cl(F) is sub-additive if cl(F ∪ F ′ ) ≤ cl(F) + cl(F ′ ), and a coding complexity c(F) is sub-additive if c(F ∪ F ′ ) ≤ c(F) + c(F ′ ). [sent-151, score-0.626]

22 Based on the structured coding complexity of subsets of I, we can now define the structured coding complexity of a sparse coefficient vector β̄ ∈ R^p. [sent-155, score-0.889]

23 Definition 2 Given a coding complexity c(F), the structured sparse coding complexity of a coefficient vector β̄ ∈ R^p is c(β̄) = min{c(F) : supp(β̄) ⊂ F}. [sent-156, score-0.773]
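
As a concrete illustration of Definition 2, the sketch below evaluates c(β̄) by brute force over an explicit, user-supplied family of candidate supports with known coding lengths; the candidate family is an assumption of the example, not something prescribed by the paper.

```python
# Minimal sketch (not from the paper): evaluating Definition 2 over an explicit
# family of candidate supports. Each candidate is a pair (F, cl_F) with F a set
# of feature indices and cl_F its coding length, so c(F) = |F| + cl(F) and
# c(beta) = min{ c(F) : supp(beta) is contained in F }.
import numpy as np

def structured_complexity(beta, candidates):
    """candidates: iterable of (frozenset_of_indices, coding_length) pairs."""
    support = frozenset(np.flatnonzero(beta))
    costs = [len(F) + cl_F for F, cl_F in candidates if support <= F]
    return min(costs) if costs else float("inf")
```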

24 We will later show that if a coefficient vector β̄ has a small coding complexity c(β̄), then β̄ can be effectively learned, with good in-sample prediction performance (in statistical learning) and reconstruction performance (in compressive sensing). [sent-157, score-0.393]

25 While the idea of using coding based penalization is clearly motivated by the minimum description length (MDL) principle, the actual penalty we obtain for structured sparsity problems is different from the standard MDL penalty for model selection. [sent-164, score-0.631]

26 This definition takes advantage of coding complexity, and can also be considered as (a weaker version of) structured RIP. [sent-187, score-0.402]

27 For example, for random projections used in compressive sensing applications, the coding length c(supp(β̄)) is O(k ln p) in standard sparsity, but can be as low as c(supp(β̄)) = O(k) in structured sparsity (if we can guess supp(β̄) approximately correctly). [sent-199, score-0.932]

28 The difference can be significant when p is large and the coding length cl(supp(β̄)) ≪ k ln p. [sent-201, score-0.473]

29 The coding length of the groups is (k/k0) ln(p/k0), which is significantly smaller than k ln p when p is large (see Section 4 for details). [sent-203, score-0.503]

30 The theorem implies that the structured RIP condition is satisfied with sample size n = O(k + (k/k0) ln(p/k0)) in group sparsity (where s = O(k + (k/k0) ln(p/k0))) rather than n = O(k ln p) in standard sparsity (where s = O(k ln p)). [sent-205, score-0.814]
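
The gap between the two bounds is easy to see numerically; the snippet below is illustrative arithmetic only (constants and probability terms ignored), for one hypothetical setting.

```python
# Purely illustrative arithmetic: comparing the complexity terms s that drive
# the two sample-size bounds above, for p = 512 features, k = 64 nonzeros and
# group size k0 = 16.
import math

p, k, k0 = 512, 64, 16
s_standard = k * math.log(p)                      # s = O(k ln p)
s_group = k + (k / k0) * math.log(p / k0)         # s = O(k + (k/k0) ln(p/k0))
print(round(s_standard), round(s_group))          # roughly 399 vs 78
```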

31 As we have pointed out earlier, this number can be significantly smaller than the standard sparsity requirement of n = O(‖β̄‖_0 ln p), if the structure imposed in the formulation is meaningful. [sent-237, score-0.368]

32 Although the result for the coding complexity estimator (2) is better due to its weaker RIP dependency, we shall point out that this does not mean that, for group sparsity, one should use (2) instead of group Lasso in practice. [sent-246, score-0.437]

33 Our generalization, which we refer to as the structured greedy algorithm or simply StructOMP, takes advantage of block structures to approximately solve the structured sparsity formulation (2). [sent-267, score-0.575]

34 It would be worthwhile to mention that the notion of block structures here is different from block sparsity in model-based compressive sensing (Baraniuk et al. [sent-268, score-0.517]

35 Moreover, if there exists a base block B ⊄ F^(k−1) but c(B ∪ F^(k−1)) ≤ c(F^(k−1)), we can always select B and let F^(k) = B ∪ F^(k−1) (this is because it is always beneficial to add more features into F^(k) without additional coding complexity). [sent-305, score-0.38]

36 However, our theoretical analysis shows that if, in addition, the underlying coding scheme can be approximated by block coding using the base blocks employed in the greedy algorithm, then the algorithm is effective in minimizing (2). [sent-310, score-0.827]

37 It is also useful to understand that our result does not imply that the algorithm won’t be effective if the actual coding scheme cannot be approximated by block coding. [sent-312, score-0.417]
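
A simplified sketch of the block-greedy selection loop described in the preceding sentences is given below. The gain criterion (residual correlation per unit of added coding complexity), the complexity bookkeeping and the stopping rule are simplifications rather than the authors' exact StructOMP implementation, and the list of base blocks with their coding costs is assumed to be supplied by the caller.

```python
# Simplified StructOMP-style sketch (not the authors' exact algorithm):
# greedily add the base block with the largest squared-correlation gain per
# unit of added coding complexity, then refit by least squares on the
# selected support.
import numpy as np

def struct_greedy(X, y, blocks, max_complexity):
    """blocks: list of (list_of_indices, coding_cost) pairs."""
    p = X.shape[1]
    F, c = set(), 0.0                    # current support and its complexity
    beta = np.zeros(p)
    while True:
        r = y - X @ beta                 # current residual
        best = None
        for B, cost_B in blocks:
            new = sorted(set(B) - F)
            extra = len(new) + cost_B    # crude bound on c(F ∪ B) − c(F)
            if not new or c + extra > max_complexity:
                continue
            gain = np.linalg.norm(X[:, new].T @ r) ** 2 / extra
            if best is None or gain > best[0]:
                best = (gain, new, extra)
        if best is None:
            break
        _, new, extra = best
        F.update(new)
        c += extra
        idx = sorted(F)
        beta = np.zeros(p)
        beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return beta, F, c
```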

38 The following theorem shows that if c(β̄, B) is small, then one can use the structured greedy algorithm to find a coefficient vector β^(k) that is competitive to β̄, and the coding complexity c(β^(k)) is approximated by the block complexity c(β̄, B). [sent-315, score-0.473]

39 Theorem 9 Suppose the coding scheme is sub-additive. [sent-317, score-0.439]

40 The result shows that in order to approximate a signal β̄ up to accuracy ε, one needs to use coding complexity O(ln(1/ε)) c(β̄, B). [sent-324, score-0.355]

41 Now, consider the case that B contains small blocks and their sub-blocks with equal coding length, and the actual coding scheme can be approximated (up to a constant) by block coding generated by B; that is, c(β̄, B) = O(c(β̄)). [sent-325, score-1.047]

42 In this case we need coding complexity O(s ln(1/ε)) to approximate a signal with coding complexity s. [sent-326, score-0.355]

43 Structured Sparsity Examples Before giving detailed examples, we describe a general coding scheme called block coding, which is an expansion of Definition 8. [sent-340, score-0.401]

44 The basic idea of block coding is to define a coding scheme on a small number of base blocks (a block is a subset of I), and then define a coding scheme on all subsets of I using these base blocks. [sent-341, score-0.506]
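
The construction can be illustrated with a brute-force sketch that enumerates covers of F by base blocks; the one-bit-per-block overhead is an assumption chosen so that the uniform single-element case discussed later in this section reproduces |F| log2(2p).

```python
# Brute-force sketch of the block-coding idea just described: a set F is coded
# by covering it with b base blocks, paying their code lengths plus b extra
# bits, and cl_B(F) is the cheapest such cover. Exhaustive search over covers
# is exponential, so this only illustrates the definition on tiny examples.
from itertools import combinations

def block_coding_length(F, base_blocks):
    """base_blocks: list of (frozenset_of_indices, cl0) pairs; returns cl_B(F)."""
    F = frozenset(F)
    best = float("inf")
    for b in range(1, len(base_blocks) + 1):
        for cover in combinations(base_blocks, b):
            union = frozenset().union(*(B for B, _ in cover))
            if F <= union:
                best = min(best, sum(c for _, c in cover) + b)
    return best
```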

45 We call the coding scheme clB block coding. [sent-347, score-0.401]

46 It is clear from the definition that block coding is sub-additive. [sent-348, score-0.363]

47 From Theorem 9 and the discussions thereafter, we know that under appropriate conditions, a target coefficient vector with a small block coding complexity can be approximately learned using the structured greedy algorithm. [sent-349, score-0.57]

48 This means that the block coding scheme has important algorithmic implications. [sent-350, score-0.401]

49 That is, if a coding scheme can be approximated by block coding with a small number of base blocks, then the corresponding estimation problem can be approximately solved using the structured greedy algorithm. [sent-351, score-0.885]

50 For this reason, we shall pay special attention to block coding approximation schemes for examples discussed below. [sent-352, score-0.381]

51 In particular, a coding scheme cl(·) can be polynomially approximated by block coding if there exists a block coding scheme clB with a polynomial (in p) number of base blocks in B and a positive constant CB independent of p such that clB(F) ≤ CB cl(F). [sent-353, score-1.179]

52 That is, up to a constant, the block coding scheme clB () is dominated by the coding scheme cl(). [sent-354, score-0.725]

53 While it is possible to work with blocks with non-uniform coding schemes, for simplicity the examples provided in this paper only consider blocks with uniform coding, which is similar to the representation used in the Union-of-Subspaces model of Lu and Do (2008). [sent-355, score-0.402]

54 1 Standard Sparsity A simple coding scheme is to code each subset F ⊂ I of cardinality k using k log2(2p) bits, which corresponds to block coding with B consisting only of single-element sets, where each base block has a coding length cl0 = log2 p. [sent-357, score-1.099]

55 A more general version is to consider single-element blocks B = {{j} : j ∈ I}, with a non-uniform coding scheme cl0({j}) = c_j, such that ∑_j 2^{−c_j} ≤ 1. [sent-359, score-0.382]

56 In particular, if a feature j is likely to be nonzero, we should give it a smaller coding length c_j, and if a feature j is likely to be zero, we should give it a larger coding length. [sent-361, score-0.604]
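
A small sketch of this single-element scheme, including the Kraft-style condition ∑_j 2^{−c_j} ≤ 1 stated above; the extra bit charged per selected feature is an assumption made so that the uniform case reproduces |F| log2(2p).

```python
# Sketch of the single-element coding scheme described above: every feature j
# carries its own code length c_j subject to sum_j 2^{-c_j} <= 1, and a support
# F costs |F| (for the coefficients) plus its coding length. The +1 bit per
# selected feature is an assumed per-block overhead, so the uniform choice
# c_j = log2(p) gives cl(F) = |F| log2(2p) as in the text.
import math

def kraft_ok(code_lengths):
    return sum(2.0 ** (-c) for c in code_lengths.values()) <= 1.0 + 1e-12

def standard_coding_complexity(F, code_lengths):
    cl_F = sum(code_lengths[j] + 1 for j in F)   # coding length of the support
    return len(F) + cl_F                         # c(F) = |F| + cl(F)

p = 512
uniform = {j: math.log2(p) for j in range(p)}
assert kraft_ok(uniform)
```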

57 2 Group Sparsity The concept of group sparsity has appeared in various recent work, such as the group Lasso in Yuan and Lin (2006) or multi-task learning in Argyriou et al. [sent-364, score-0.435]

58 The strong group sparsity coding scheme is to give each element in B1 a code-length cl0 of ∞, and each element in BG a code-length cl0 of log2 m. [sent-368, score-0.632]

59 Then the block coding scheme with blocks B = BG ∪ B1 leads to group sparsity, which only looks for signals consisting of the groups. [sent-369, score-0.609]

60 The resulting coding length is: cl(B) = g log2 (2m) if B can be represented as the union of g disjoint groups G j ; and cl(B) = ∞ otherwise. [sent-370, score-0.374]

61 Note that if the support F of the target signal can be expressed as the union of g groups, each of size k0, then the group coding length g log2(2m) can be significantly smaller than the standard sparsity coding length of |F| log2(2p) = g k0 log2(2p). [sent-371, score-1.163]
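
A minimal sketch of the pure group coding length described here; the group list is assumed to be disjoint and supplied by the caller, and the concrete numbers in the comment are one hypothetical setting.

```python
# Sketch of the group coding length discussed above: under the pure group
# scheme, a support that is exactly a union of g of the m disjoint groups costs
# g*log2(2m) bits, and anything else costs infinity. For example, with p = 512
# features, m = 32 groups of size k0 = 16 and g = 4 selected groups, this gives
# 4*log2(64) = 24 bits versus 64*log2(1024) = 640 bits under standard coding.
import math

def group_coding_length(F, groups):
    """groups: disjoint lists of feature indices; F: set of selected indices."""
    F = set(F)
    used = [set(G) for G in groups if set(G) & F]
    covered = set().union(*used) if used else set()
    if covered != F:                    # F is not an exact union of groups
        return math.inf
    return len(used) * math.log2(2 * len(groups))
```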

62 As we shall see later, the smaller coding complexity implies better learning behavior, which is essentially the advantage of using group sparse structure. [sent-372, score-0.478]

63 It was shown by Huang and Zhang (2010) that strong group sparsity defined above also characterizes the performance of group Lasso. [sent-373, score-0.419]

64 An extension of this idea is to allow more general block coding lengths for cl0(G_j) and cl0({j}), so that ∑_{j=1}^{m} 2^{−cl0(G_j)} + ∑_{j=1}^{p} 2^{−cl0({j})} ≤ 1. [sent-375, score-0.395]

65 This leads to non-uniform coding of the groups, so that a group that is more likely to be nonzero is given a smaller coding length. [sent-376, score-0.701]

66 Figure 2: Group sparsity: nodes are variables, and black nodes are selected variables. Group sparsity is a special case of graph sparsity discussed below. [sent-381, score-0.486]

67 If we encode each group uniformly, then the coding length is cl(F) = 2 log2 (12). [sent-387, score-0.458]

68 Therefore the coding length of F is log2 3 times the total number of internal nodes leading to elements of F. [sent-394, score-0.351]

69 The idea is similar to that of the image coding example in the more general graph sparsity scheme which we discuss next. [sent-402, score-0.58]

70 Proposition 10 implies that if F is composed of g connected regions, then the coding length is g log2(2p) + 2 log2(5)|F|, which can be significantly better than the standard sparse coding length of |F| log2(2p). [sent-415, score-0.747]
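
A sketch of how the cost in Proposition 10 could be evaluated for a given support: count connected regions with a breadth-first search and apply the formula quoted above; the graph representation (a neighbor dictionary) is an assumption of the example.

```python
# Sketch of the graph-sparsity cost above: find the connected regions of the
# support F in the feature graph by BFS, then charge g*log2(2p) bits for the
# region anchors plus 2*log2(5) bits per selected node, following the formula
# quoted from Proposition 10.
import math
from collections import deque

def graph_coding_length(F, neighbors, p):
    """F: set of selected nodes; neighbors: dict node -> iterable of neighbors."""
    F = set(F)
    seen, g = set(), 0
    for v in F:
        if v in seen:
            continue
        g += 1                                  # found a new connected region
        queue = deque([v])
        seen.add(v)
        while queue:
            u = queue.popleft()
            for w in neighbors.get(u, ()):
                if w in F and w not in seen:
                    seen.add(w)
                    queue.append(w)
    return g * math.log2(2 * p) + 2 * math.log2(5) * len(F)
```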

71 Figure 4: Graph sparsity: nodes are variables, and black nodes are selected variables Note that group sparsity is a special case of graph sparsity, where each group is one connected region, as shown in Figure 2. [sent-419, score-0.581]

72 From Proposition 10, similar coding complexity can be obtained as long as F can be covered by a small number of connected regions. [sent-422, score-0.378]

73 Tree-structured hierarchical sparsity is also a special case of graph sparsity with a single connected region containing the root (we may take q(root) = 1). [sent-423, score-0.49]

74 Figure 5: Line-structured sparsity: nodes are variables, and black nodes are selected variables The following result shows that under uniform encoding of the nodes q(v) = 1/p for v ∈ G, general graph coding schemes can be polynomially approximated with block coding. [sent-428, score-0.504]

75 The idea is to consider relatively small sized base blocks consisted of nodes that are close together with respect to the graph structure, and then use the induced block coding scheme to approximate the graph coding. [sent-429, score-0.561]

76 The result means that graph sparsity can be polynomially approximated with a block coding scheme if we let q(v) = 1/p for all v ∈ G. [sent-436, score-0.64]

77 For many such models, it is possible to approximate a general random field coding scheme with block coding by using approximation methods in the graphical model literature. [sent-463, score-0.687]

78 Note that graph sparsity is more general than group sparsity; in fact connected regions may be regarded as dynamic groups that are not pre-defined. [sent-486, score-0.456]

79 1 Simulated 1D Signals with Line-Structured Sparsity In the first experiment, we randomly generate a 1D structured sparse signal with values ±1, where data dimension p = 512, sparsity number k = 64 and group number g = 4. [sent-497, score-0.512]
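
A minimal sketch of one way to generate such a signal; the random, non-overlapping placement of the four contiguous groups is an assumption, since the exact layout used in the experiments is not specified in this summary.

```python
# Generate a signal matching the stated setting: p = 512, k = 64 nonzeros with
# values +/-1 arranged in g = 4 contiguous groups placed at random,
# non-overlapping positions (an assumption of this example).
import numpy as np

rng = np.random.default_rng(0)
p, k, g = 512, 64, 4
group_len = k // g                     # 16 consecutive nonzeros per group
beta = np.zeros(p)
starts = rng.choice(p // group_len, size=g, replace=False) * group_len
for s in starts:
    beta[s:s + group_len] = rng.choice([-1.0, 1.0], size=group_len)
```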

80 The graph sparsity concept introduced earlier is used to compute the coding length of sparsity patterns in StructOMP. [sent-500, score-0.754]

81 Our task is to compare the recovery performance of StructOMP to those of OMP, Lasso and group Lasso for these structured sparsity signals under the framework of compressive sensing. [sent-508, score-0.711]

82 Since the sample size n is not big enough, OMP and Lasso do not achieve good recovery results, whereas the StructOMP algorithm achieves near perfect recovery of the original signal. [sent-510, score-0.364]

83 To study how the sample size n affects the recovery performance, we vary the sample size and record the recovery results of the different algorithms. [sent-521, score-0.402]
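
The sweep can be organized as below; this is a generic sketch in which `recover` is a hypothetical placeholder for any of the compared algorithms (OMP, Lasso, group Lasso, StructOMP) and the Gaussian sensing matrix follows the usual compressive sensing setup.

```python
# Sketch of the sample-size sweep described above: for each n, draw a Gaussian
# sensing matrix, run a recovery routine, and record the relative error
# ||beta_hat - beta|| / ||beta|| averaged over a few trials.
import numpy as np

def sample_size_sweep(beta, n_values, recover, trials=10, seed=0):
    rng = np.random.default_rng(seed)
    p = beta.size
    errors = []
    for n in n_values:
        errs = []
        for _ in range(trials):
            X = rng.standard_normal((n, p)) / np.sqrt(n)
            y = X @ beta
            errs.append(np.linalg.norm(recover(X, y) - beta) / np.linalg.norm(beta))
        errors.append(float(np.mean(errs)))
    return errors
```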

84 The results show that the proposed StructOMP can achieve better recovery performance for structured sparsity signals with fewer samples. [sent-525, score-0.515]

85 The effect is much less noticeable with the weakly sparse signal in Figure 11(a) because the necessary structured RIP condition is easier to satisfy for weakly sparse signals (based on our theory). [sent-532, score-0.362]

86 To study how the additive noise affects the recovery performance, we adjust the noise power σ and then record the recovery results by different algorithms. [sent-543, score-0.366]

87 At least for this problem, StructOMP achieves better performance than OverlapLasso and ModelCS, which shows that the proposed StructOMP algorithm can achieve better recovery performance than other structured sparsity algorithms for some problems. [sent-607, score-0.476]

88 The graph sparsity concept introduced earlier is used to compute the coding length of sparsity patterns in StructOMP. [sent-642, score-0.754]

89 As we do not know the predefined groups for group Lasso, we just try group Lasso with several different group sizes (gs=2, 4, 8, 16). [sent-652, score-0.363]

90 In order to study how the sample size n affects the recovery performance, we vary the sample size and record the recovery results of the different algorithms. [sent-654, score-0.402]

91 On the other hand, since in this application the three channels of the color background-subtracted image share the same support set, we can enforce group sparsity across the color channels for each pixel. [sent-775, score-0.431]
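
A minimal sketch of this grouping; stacking the coefficients channel by channel is an assumption of the example.

```python
# Channel grouping described above: for an h-by-w color image whose
# coefficients are stacked channel by channel, each pixel contributes one group
# holding its R, G and B entries, so group sparsity forces the three channels
# to share the same support.
def rgb_pixel_groups(h, w):
    n_pixels = h * w
    return [[i, i + n_pixels, i + 2 * n_pixels] for i in range(n_pixels)]
```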

92 Discussion This paper develops a theory for structured sparsity where prior knowledge allows us to prefer certain sparsity patterns to others. [sent-814, score-0.51]

93 In structured sparsity, the complexity of learning is measured by the coding complexity c(β̄) ≤ ‖β̄‖_0 + cl(supp(β̄)) instead of ‖β̄‖_0 ln p, which determines the complexity in standard sparsity. [sent-832, score-0.623]

94 The theory shows that if the coding length cl(supp(β̄)) is small for a target coefficient vector β̄, then the complexity can be significantly smaller than the corresponding complexity in standard sparsity. [sent-834, score-0.382]

95 The structured greedy algorithm presented in this paper is the first efficient algorithm proposed to handle the general structured sparsity learning. [sent-836, score-0.478]

96 Since Lemma 13 implies that each connected component Fj of F can be covered by 1 + 2(|Fj | − 1)/L connected regions from B , we have clB (Fj ) ≤ (1 + 2(|Fj | − 1)/L)(1 + CG δ) log2 p under the uniform coding on B . [sent-907, score-0.448]

97 Let ε1 = 2/15 and η = 2(1 + 2/ε1)^k e^{−ε2^2/(2σ^2)}; then ε2^2 = 2σ^2[(4k + 1) ln 2 − ln η], and thus ‖P(y − Ey)‖_2 ≤ 1.5 σ √(2(4k + 1) ln 2 − 2 ln η). [sent-925, score-0.62]

98 As an application, we introduce the following concept of weakly sparse compressible target that generalizes the corresponding concept of compressible signal in standard sparsity from the compressive sensing literature (Donoho, 2006). [sent-1069, score-0.568]

99 If we assume the underlying coding scheme is block coding generated by B, then we have min_{u≤s′} ρ_−(s + c(β̄^(u))) ≤ ρ_−(s + s′). [sent-1090, score-0.687]

100 In particular, in compressive sensing applications where σ = 0, we obtain that when the sample size reaches n = O(qs′), the reconstruction performance is ‖β^(k) − β̄‖_2^2 = O(a/s′^q), which matches that of the constrained coding complexity regularization method in (2) up to a constant O(q). [sent-1099, score-0.492]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ey', 0.485), ('structomp', 0.392), ('omp', 0.345), ('coding', 0.286), ('lasso', 0.198), ('sparsity', 0.197), ('gs', 0.189), ('grouplasso', 0.164), ('recovery', 0.163), ('ln', 0.155), ('cl', 0.145), ('etaxas', 0.123), ('structured', 0.116), ('group', 0.111), ('uang', 0.094), ('hang', 0.086), ('compressive', 0.085), ('tructured', 0.082), ('xf', 0.08), ('block', 0.077), ('parsity', 0.077), ('pf', 0.072), ('rip', 0.07), ('connected', 0.07), ('supp', 0.068), ('cg', 0.065), ('sensing', 0.061), ('recovered', 0.06), ('blocks', 0.058), ('greedy', 0.049), ('signal', 0.047), ('cpu', 0.047), ('clb', 0.047), ('modelcs', 0.047), ('sparse', 0.041), ('signals', 0.039), ('weakly', 0.039), ('scheme', 0.038), ('wavelet', 0.036), ('bits', 0.036), ('overlaplasso', 0.035), ('baraniuk', 0.034), ('fj', 0.034), ('subtracted', 0.034), ('image', 0.033), ('nodes', 0.033), ('coef', 0.033), ('length', 0.032), ('groups', 0.03), ('earning', 0.029), ('encode', 0.029), ('exceeding', 0.029), ('huang', 0.028), ('dg', 0.027), ('graph', 0.026), ('union', 0.026), ('pqi', 0.025), ('ratio', 0.025), ('compressible', 0.023), ('constr', 0.023), ('zhang', 0.023), ('sample', 0.023), ('complexity', 0.022), ('stack', 0.022), ('regions', 0.022), ('relaxation', 0.022), ('jacob', 0.022), ('proposition', 0.021), ('cevher', 0.021), ('pz', 0.021), ('mdl', 0.021), ('lemma', 0.021), ('node', 0.02), ('rutgers', 0.02), ('color', 0.02), ('structures', 0.02), ('noise', 0.02), ('target', 0.02), ('overlapping', 0.019), ('cients', 0.019), ('tree', 0.019), ('jk', 0.018), ('averaged', 0.018), ('nonzero', 0.018), ('shall', 0.018), ('bunea', 0.018), ('haupt', 0.018), ('minu', 0.018), ('stopping', 0.018), ('base', 0.017), ('jenatton', 0.016), ('lounici', 0.016), ('background', 0.016), ('superior', 0.016), ('concept', 0.016), ('structure', 0.016), ('approximated', 0.016), ('projection', 0.015), ('pb', 0.015), ('size', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

2 0.22652583 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

3 0.10398453 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning

4 0.097859591 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

5 0.073348656 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

Author: Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, Francis Bach

Abstract: We consider a class of learning problems regularized by a structured sparsity-inducing norm defined as the sum of ℓ2 - or ℓ∞ -norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞ norms can be computed exactly in polynomial time by solving a quadratic min-cost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches. Keywords: convex optimization, proximal methods, sparse coding, structured sparsity, matrix factorization, network flow optimization, alternating direction method of multipliers

6 0.071738638 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

7 0.071478806 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

8 0.071400575 91 jmlr-2011-The Sample Complexity of Dictionary Learning

9 0.057183586 104 jmlr-2011-X-Armed Bandits

10 0.048664059 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

11 0.048063569 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

12 0.047149595 101 jmlr-2011-Variable Sparsity Kernel Learning

13 0.045826349 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

14 0.045479376 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.038374372 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

16 0.037345693 12 jmlr-2011-Bayesian Co-Training

17 0.036795411 54 jmlr-2011-Learning Latent Tree Graphical Models

18 0.035551883 35 jmlr-2011-Forest Density Estimation

19 0.035224609 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

20 0.033822261 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.196), (1, 0.011), (2, 0.254), (3, 0.224), (4, -0.032), (5, 0.026), (6, -0.004), (7, -0.007), (8, 0.061), (9, 0.015), (10, 0.159), (11, -0.206), (12, 0.081), (13, -0.165), (14, 0.078), (15, -0.167), (16, -0.141), (17, 0.131), (18, 0.043), (19, -0.059), (20, -0.146), (21, -0.034), (22, -0.107), (23, 0.04), (24, 0.132), (25, 0.025), (26, -0.171), (27, -0.097), (28, -0.026), (29, -0.12), (30, -0.044), (31, -0.06), (32, -0.002), (33, 0.107), (34, -0.057), (35, -0.088), (36, -0.04), (37, 0.061), (38, -0.059), (39, 0.001), (40, -0.059), (41, 0.009), (42, -0.044), (43, -0.046), (44, -0.166), (45, 0.003), (46, -0.071), (47, 0.064), (48, -0.049), (49, -0.176)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96941686 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

2 0.82620722 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

3 0.47503239 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning

4 0.3591677 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

5 0.30165908 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization

6 0.28016582 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.27371407 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.26830012 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

9 0.25521338 104 jmlr-2011-X-Armed Bandits

10 0.25342685 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

11 0.2443714 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

12 0.24320963 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

13 0.22455159 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

14 0.18766184 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.17161988 101 jmlr-2011-Variable Sparsity Kernel Learning

16 0.17089634 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

17 0.17017317 54 jmlr-2011-Learning Latent Tree Graphical Models

18 0.1539074 12 jmlr-2011-Bayesian Co-Training

19 0.15255968 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model

20 0.14405908 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.158), (9, 0.079), (10, 0.041), (12, 0.33), (24, 0.028), (31, 0.078), (32, 0.014), (41, 0.016), (60, 0.018), (70, 0.015), (71, 0.012), (73, 0.022), (78, 0.059), (90, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72490191 59 jmlr-2011-Learning with Structured Sparsity

Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas

Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing

2 0.71987891 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

Author: Stefano Melacci, Mikhail Belkin

Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization

3 0.50694585 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P; index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning c 2011 Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar and Alan S. Willsky. C HOI , TAN , A NANDKUMAR AND W ILLSKY

4 0.50055772 6 jmlr-2011-A Simpler Approach to Matrix Completion

Author: Benjamin Recht

Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Cand` s and e Recht (2009), Cand` s and Tao (2009), and Keshavan et al. (2009). The reconstruction is accome plished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing

5 0.49452832 97 jmlr-2011-Union Support Recovery in Multi-task Learning

Author: Mladen Kolar, John Lafferty, Larry Wasserman

Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation

6 0.44602546 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.43644509 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

8 0.43441987 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

9 0.43067425 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity

10 0.42534381 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

11 0.41357839 104 jmlr-2011-X-Armed Bandits

12 0.4135153 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

13 0.41203544 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding

14 0.40935385 105 jmlr-2011-lp-Norm Multiple Kernel Learning

15 0.40891278 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

16 0.40738136 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

17 0.40035436 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

18 0.39631253 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.39574617 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

20 0.39408332 7 jmlr-2011-Adaptive Exact Inference in Graphical Models