nips nips2013 nips2013-249 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso.
Reference: text
sentIndex sentText sentNum sentScore
1 Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. [sent-8, score-0.803]
2 With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. [sent-9, score-0.062]
3 We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. [sent-11, score-0.778]
4 However, sophisticated notions of structured sparsity have been recently developed that can encode combinatorial patterns over variable subsets [8]. [sent-14, score-0.113]
5 For example, current structured sparse estimators often adopt an accelerated proximal gradient (APG) strategy [9, 10], which has a low per-step complexity and enjoys an optimal convergence rate among black-box first-order procedures [10]. [sent-16, score-0.179]
6 Unfortunately, APG must also compute a proximal update (PU) of the nonsmooth regularizer during each iteration. [sent-17, score-0.122]
7 Not only does the PU require a highly nontrivial computation for structured regularizers [4]—e. [sent-18, score-0.2]
8 Recently, [6] has demonstrated a class of regularizers where the corresponding PUs can be computed by a sequence of submodular function minimizations, but such an approach remains expensive. [sent-21, score-0.176]
9 Instead, in this paper, we demonstrate that an alternative approach can be more effective for many structured regularizers. [sent-22, score-0.063]
10 Importantly, unlike APG, GCG requires computing the polar of the regularizer, instead of the PU, in each step. [sent-26, score-0.521]
11 This difference allows important new approaches for characterizing and evaluating structured sparse regularizers. [sent-27, score-0.09]
12 Our first main contribution is to characterize a rich class of structured sparse regularizers that allow efficient computation of their polar operator. [sent-28, score-0.748]
13 In particular, motivated by [6], we consider a family of structured sparse regularizers induced by a cost function on variable subsets. [sent-29, score-0.254]
14 By introducing a “lifting” construction, we show how these regularizers can be expressed as linear functions, which, after some reformulation, allows efficient evaluation by a simple linear program (LP). [sent-30, score-0.137]
15 Important examples covered include overlapping group lasso [5] and path regularization in directed acyclic graphs [12]. [sent-31, score-0.207]
16 By exploiting additional structure in these cases, the LP can be reduced to a piecewise-linear objective over a simple domain, allowing further reduction in computation time via smoothing [17]. [sent-32, score-0.125]
17 For example, for the overlapping group lasso with n groups where each variable belongs to at most r groups, the cost of evaluating the polar operator can be reduced from $O(rn^3)$ to $O(rn\sqrt{n}/\epsilon)$ for a desired accuracy of $\epsilon$. [sent-33, score-0.779]
18 Encouraged by the superior performance of GCG in these cases, we then provide a simple reduction of the polar operator to the PU. [sent-34, score-0.606]
19 To illustrate the usefulness of this reduction we provide an efficient new algorithm for solving the fused latent lasso [18]. [sent-36, score-0.285]
20 Recently, [6] has established a principled method for deriving regularizers from a subset cost function $F : 2^{[n]} \to \mathbb{R}_+$ based on defining the gauge: $\Omega_F(w) = \inf\{\gamma \ge 0 : w \in \gamma\,\mathrm{conv}(S_F)\}$, (2) where $S_F = \{w_A : \|w_A\|_p^{\tilde p} = 1/F(A),\ \emptyset \ne A \subseteq [n]\}$. [sent-44, score-0.164]
21 The gauge ΩF defined in (2) is also known as the atomic norm with the set of atoms SF [20]. [sent-47, score-0.149]
22 It will be useful to recall that the polar of a gauge Ω is defined by [19, §15]: $\Omega^\circ(g) := \sup_w \{\langle g, w\rangle : \Omega(w) \le 1\}$. (3) [sent-48, score-0.593]
23 In particular, the polar of a norm is its dual norm. [sent-49, score-0.551]
24 For the specific gauge $\Omega_F$ defined in (2), its polar is simply the support function of $S_F$ [19, Theorem 13. [sent-51, score-0.593]
25 By varying $\tilde p$ and F, one can generate a class of sparsity-inducing regularizers that includes most current proposals [6]. [sent-54, score-0.164]
26 As pointed out in [6], when F is submodular, (4) can be evaluated by a secant method with submodular minimizations ([21, §8. [sent-57, score-0.162]
27 Optimization Algorithms. A standard approach for minimizing (1) is the accelerated proximal gradient (APG) algorithm [9, 10], where each iteration involves solving the proximal update (PU): $w_{k+1} = \arg\min_w \langle d_k, w\rangle + \frac{1}{2s_k}\|w - w_k\|^2 + \lambda\Omega_F(w)$, for some step size $s_k$ and descent direction $d_k$. [sent-62, score-0.193]
28 Unlike APG, GCG only requires the polar operator of the regularizer $\Omega_F$ to be computed in each iteration, given by the argument of (4): $P_F^\circ(g) = \arg\max_{w\in S_F}\langle g, w\rangle = F(C)^{-1/\tilde p}\,\arg\max_{w:\|w\|_p=1}\langle g_C, w\rangle$ for $C = \arg\max_{\emptyset\ne A\subseteq[n]} \|g_A\|_{\tilde p}^{\tilde p}/F(A)$. (5) [sent-65, score-0.752]
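To make (4)-(5) concrete, the following illustrative sketch evaluates the polar and its argument by brute-force enumeration over supports. It is for intuition only (exponential in the dimension) and is not the LP-based evaluation developed later; the cost function F, the choice p > 1, and the conjugate exponent p̃ are assumptions supplied by the caller.

```python
import itertools
import numpy as np

def polar_bruteforce(g, F, p=2.0):
    """Illustrative evaluation of the polar of Omega_F at g (eqs. (4)-(5)).

    F maps a nonempty frozenset of indices A to a positive cost F(A).
    With p > 1 and conjugate exponent pt (1/p + 1/pt = 1), returns
        (polar value, best support C, maximizing atom w),
    where C = argmax_A ||g_A||_pt^pt / F(A).  Exponential in len(g).
    """
    g = np.asarray(g, dtype=float)
    n = len(g)
    pt = p / (p - 1.0)                      # conjugate exponent of p
    best_val, best_C = -np.inf, None
    for k in range(1, n + 1):
        for A in itertools.combinations(range(n), k):
            val = np.sum(np.abs(g[list(A)]) ** pt) / F(frozenset(A))
            if val > best_val:
                best_val, best_C = val, A
    # direction maximizing <g_C, w> over the unit p-norm ball, then rescaled
    d = np.zeros(n)
    idx = list(best_C)
    d[idx] = np.sign(g[idx]) * np.abs(g[idx]) ** (pt - 1.0)
    w = d / np.linalg.norm(d, ord=p) * F(frozenset(best_C)) ** (-1.0 / pt)
    return best_val ** (1.0 / pt), best_C, w

# toy usage with a constant subset cost of 1 over all supports
print(polar_bruteforce([3.0, -1.0, 0.5], F=lambda A: 1.0, p=2.0))
```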
29 Line 5 (local re-optimization): $\{u_i\}_{i=1}^{k} := \arg\min_{\{u_i:\,\mathrm{supp}(u_i)\subseteq A_i\}} f(\sum_i u_i) + \lambda \sum_i F(A_i)^{1/\tilde p}\|u_i\|_p$, where the $\{u_i\}$ are initialized by $u_i = \alpha u_i$ for $i < k$ and $u_i = \beta v_i$ for $i = k$; Line 6: $w_{k+1} \leftarrow \sum_i u_i$; Line 7: end for. [sent-74, score-0.357]
30 Line 3 evaluates the polar operator, which provides a descent direction $v_k$; Line 4 finds the optimal step sizes for combining the current iterate $w_k$ with the direction $v_k$; and Line 5 locally improves the objective (1) by maintaining the same support patterns but re-optimizing the parameters. [sent-76, score-0.635]
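The following minimal sketch mirrors the GCG loop described here. It is not the authors' implementation: the polar oracle, loss, and gradient are caller-supplied callables, the step sizes of line 4 are found by a crude grid search, and the local re-optimization of line 5 (and the totally corrective variant used later in the experiments) is omitted.

```python
import numpy as np

def gcg(f, grad_f, polar_oracle, lam, n, iters=100):
    """Sketch of generalized conditional gradient for min_w f(w) + lam*Omega_F(w).

    polar_oracle(g) returns an atom v in S_F maximizing <g, v> (line 3 of
    Algorithm 1); atoms satisfy Omega_F(v) <= 1, so omega_bar below is an
    upper bound on Omega_F(w) along the iterates.  Line 4's step sizes are
    found here by a crude grid search; line 5 (local re-optimization) is
    omitted.  All names are placeholders.
    """
    w, omega_bar = np.zeros(n), 0.0
    for _ in range(iters):
        v = polar_oracle(-grad_f(w))                       # line 3
        best_obj, best = np.inf, (w, omega_bar)
        for alpha in np.linspace(0.0, 1.0, 21):            # line 4
            for beta in np.linspace(0.0, 2.0, 21):
                cand = alpha * w + beta * v
                bound = alpha * omega_bar + beta           # gauge upper bound
                obj = f(cand) + lam * bound
                if obj < best_obj:
                    best_obj, best = obj, (cand, bound)
        w, omega_bar = best                                # line 6
    return w
```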
31 It has been shown that GCG can find an accurate solution to (1) in $O(1/\epsilon)$ steps, provided only that the polar (5) is computed to accuracy $\epsilon$ [14]. [sent-77, score-0.521]
32 Our main goal in this paper therefore is to extend this GCG approach to structured sparse models by developing efficient algorithms for computing the polar operator for the structured regularizers defined in (2). [sent-80, score-0.866]
33 Our first main contribution is to develop a general class of atomic norm regularizers whose polar operator (5) can be computed efficiently. [sent-82, score-0.79]
34 To begin, consider the case of a (partially) linear function F where there exists a $c \in \mathbb{R}^n$ such that $F(A) = \langle c, 1_A\rangle$ for all $A \in \mathrm{dom}\,F$ (note that the domain need not be a lattice). [sent-83, score-0.072]
35 A few useful regularizers can be generated by linear functions: for example, the $\ell_1$ norm can be derived from $F(A) = \langle 1, 1_A\rangle$ for $|A| = 1$, which is linear. [sent-84, score-0.167]
36 Unfortunately, linearity is too restrictive to capture most structured regularizers of interest; we therefore need to expand the space of functions F we consider. [sent-85, score-0.2]
37 To do so, we introduce the more general class of marginalized linear functions: we say that F is marginalized linear if there exists a nonnegative linear function M on an extended domain $2^{[n+l]}$ such that its marginalization to $2^{[n]}$ is exactly F: $F(A) = \min_{B:\,A\subseteq B\subseteq[n+l]} M(B),\ \forall\, A \subseteq [n]$. [sent-86, score-0.113]
38 The key question is whether the polar $\Omega_F^\circ$ can be efficiently evaluated for such functions. [sent-88, score-0.521]
39 To develop an efficient procedure for computing the polar $\Omega_F^\circ$, first consider the simpler case of computing the polar $\Omega_M^\circ$ for a nonnegative linear function M. [sent-89, score-1.068]
40 Since the effective domain of M need not be the whole space in general, we make use of the specialized polytope: $P := \mathrm{conv}\{1_B : B \in \mathrm{dom}\,M\} \subseteq [0, 1]^{n+l}$. [sent-91, score-0.072]
41 Finally, we note that the same formulation allows the polar to be efficiently computed for a marginalized linear function F via a simple reduction: consider any $g \in \mathbb{R}^n$ and let $[g; 0] \in \mathbb{R}^{n+l}$ denote g padded by l zeros. [sent-100, score-0.554]
42 Then $\Omega_F^\circ(g) = \Omega_M^\circ([g; 0])$ for all $g \in \mathbb{R}^n$ because $\max_{\emptyset\ne A\subseteq[n]} \frac{\|g_A\|_{\tilde p}^{\tilde p}}{F(A)} = \max_{\emptyset\ne A\subseteq[n]} \frac{\|g_A\|_{\tilde p}^{\tilde p}}{\min_{B:A\subseteq B\subseteq[n+l]} M(B)} = \max_{\emptyset\ne A\subseteq B\subseteq[n+l]} \frac{\|g_A\|_{\tilde p}^{\tilde p}}{M(B)} = \max_{B:\emptyset\ne B\subseteq[n+l]} \frac{\|[g; 0]_B\|_{\tilde p}^{\tilde p}}{M(B)}$. [sent-101, score-0.084]
43 Although we have kept our development general so far, the idea is clear: once an appropriate “lifting” has been found so that the polytope Q in (9) can be compactly represented, the polar (5) can be reformulated as the LP (10), for which efficient implementations can be sought. [sent-104, score-0.558]
44 We now demonstrate this new methodology for two important structured regularizers: group sparsity and path coding. [sent-105, score-0.201]
45 Group Sparsity. For a general formulation of group sparsity, let $\mathcal{G} \subseteq 2^{[n]}$ be a set of variable groups (subsets) that possibly overlap [3, 6, 7]. [sent-107, score-0.1]
46 Consider the cost function over variable groups $F_g : 2^{[n]} \to \mathbb{R}_+$ defined by: $F_g(A) = \sum_{G\in\mathcal{G}} c_G\, I(A \cap G \ne \emptyset)$, (12) where $c_G$ is a nonnegative cost and I is an indicator such that $I(\cdot) = 1$ if its argument is true, and 0 otherwise. [sent-109, score-0.128]
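As a quick illustration of the group cost (12), the sketch below evaluates F_g directly; the groups and costs are made-up toy values.

```python
def group_cost(A, groups, costs):
    """F_g(A) = sum of c_G over all groups G that intersect A (eq. (12))."""
    A = set(A)
    return sum(c for G, c in zip(groups, costs) if A & set(G))

# toy example with three overlapping groups
groups, costs = [(0, 1), (1, 2), (2, 3)], [1.0, 1.0, 2.0]
print(group_cost({1}, groups, costs))   # intersects the first two groups -> 2.0
```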
47 Unfortunately, Fg is not linear, so we need to re-express it to recover an efficient polar operator. [sent-111, score-0.521]
48 To do so, augment the domain by adding $l = |\mathcal{G}|$ variables such that each new variable corresponds to a group $G \in \mathcal{G}$. [sent-112, score-0.073]
49 For example, observe that the polytope (7) can now be compactly represented as: $P_g = \{w \in \mathbb{R}^{n+l} : 0 \le w \le 1,\ w_i \le w_G\ \forall\, i \in G \in \mathcal{G}\}$. [sent-120, score-0.062]
50 Moreover, the constraint matrix in (14) is totally unimodular (TUM) since it is the incidence matrix of a bipartite graph (variables and groups); hence $P_g$ is the convex hull of its integral vectors [23]. [sent-122, score-0.093]
51 Using the fact that the scalar σ in (10) admits a closed-form solution $\sigma = \langle 1, \tilde w\rangle$ in this case, the LP (10) can be reduced to: $\max_{\tilde w} \sum_{i\in[n]} \tilde g_i \min_{G:\,i\in G\in\mathcal{G}} \tilde w_G$, subject to $\tilde w \ge 0$, $\sum_{G\in\mathcal{G}} b_G \tilde w_G = 1$. (15) [sent-123, score-0.056]
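For intuition, the reduced LP can be made explicit by introducing auxiliary variables $t_i \le \tilde w_G$ for every group containing i, which turns the inner minimum into linear constraints. The sketch below solves this with SciPy's linprog; it assumes $\tilde g_i = |g_i|^{\tilde p}$ and takes the weights $b_G$ as caller-supplied inputs (e.g. the group costs), since their exact definition is not restated in this excerpt.

```python
import numpy as np
from scipy.optimize import linprog

def group_polar_lp(g_tilde, groups, b):
    """Solve the reduced LP (15): maximize sum_i g_tilde_i * min_{G: i in G} w_G
    over w >= 0 with sum_G b_G w_G = 1, via auxiliary variables t_i <= w_G.

    Assumes every variable belongs to at least one group and g_tilde >= 0.
    Variable order is [w_1..w_L, t_1..t_n]; linprog minimizes, so the
    objective is negated.
    """
    n, L = len(g_tilde), len(groups)
    c = np.concatenate([np.zeros(L), -np.asarray(g_tilde, float)])
    rows = []                                # t_i - w_G <= 0 for each i in G
    for gidx, G in enumerate(groups):
        for i in G:
            r = np.zeros(n + L)
            r[L + i], r[gidx] = 1.0, -1.0
            rows.append(r)
    A_ub, b_ub = np.vstack(rows), np.zeros(len(rows))
    A_eq = np.concatenate([np.asarray(b, float), np.zeros(n)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + L), method="highs")
    return -res.fun, res.x[:L]               # LP value and group weights w_G

# toy instance: 4 variables, 3 overlapping unit-cost groups
g = np.abs(np.array([3.0, -1.0, 0.5, 2.0]))
value, wG = group_polar_lp(g, groups=[(0, 1), (1, 2), (2, 3)], b=[1.0, 1.0, 1.0])
```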
52 (Appendix D considers the case when the groups form a directed acyclic graph.) [sent-128, score-0.068]
53 Path Coding. Another interesting regularizer, recently investigated by [12], is determined by path costs in a directed acyclic graph (DAG) defined over the set of variables $i \in [n]$. [sent-138, score-0.079]
54 A nonnegative cost is associated with each edge including (s, i) and (i, t), so the cost of a path is the sum of its edge costs. [sent-144, score-0.139]
55 A regularizer can then be defined by (2) applied to the cost function $F_p : 2^{[n]} \to \mathbb{R}_+$ given by: $F_p(A) =$ cost of the path, if the nodes in A form an (s, t)-path (unique for a DAG); [sent-145, score-0.163]
56 $F_p(A) = \infty$, if such a path does not exist. (16) Note $F_p$ is not submodular. [sent-146, score-0.059]
57 To efficiently compute the resulting polar, we consider the form (8) using $\tilde g_i = |g_i|^{\tilde p}$ for all i as before: $\Omega^\circ_{M_p}(g) = \max_{0\ne w\in[0,1]^{|T|}} \frac{\langle \tilde g, w\rangle}{\langle b, w\rangle}$, s.t. $\sum_{j:(i,j)\in E} w_{ij} = \sum_{k:(k,i)\in E} w_{ki},\ \forall i \in [n]$. (18) [sent-151, score-0.056]
58 Here the constraints form the well-known flow polytope whose vertices are exactly all the paths in a DAG. [sent-154, score-0.058]
59 Similar to (15), the normalized LP (10) can be simplified by solving for the scalar σ to obtain: $\max_{\tilde w \ge 0} \sum_{i\in[n]} \tilde g_i \big(\sum_{j:(i,j)\in E} \tilde w_{ij} + \sum_{k:(k,i)\in E} \tilde w_{ki}\big)$, s.t. [sent-155, score-0.118]
60 $\langle b, \tilde w\rangle = 1$, $\sum_{k:(k,i)\in E} \tilde w_{ki} = \sum_{j:(i,j)\in E} \tilde w_{ij},\ \forall i \in [n]$. (19) [sent-157, score-0.062]
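On a small DAG one can check the path-coding polar against the definition directly by enumerating every (s, t)-path and maximizing $\|g_A\|_{\tilde p}^{\tilde p}/F_p(A)$; this brute-force sketch is purely illustrative (the toy graph, unit edge costs, and the inputs $\tilde g_i = |g_i|^{\tilde p}$ are assumptions) and does not reflect the LP/smoothing scheme above.

```python
def st_paths(succ, s, t):
    """All s->t paths in a DAG; succ maps a node to [(next_node, edge_cost), ...]."""
    if s == t:
        return [([t], 0.0)]
    paths = []
    for nxt, cost in succ.get(s, []):
        for tail, tail_cost in st_paths(succ, nxt, t):
            paths.append(([s] + tail, cost + tail_cost))
    return paths

def path_polar_bruteforce(g_tilde, succ, s, t):
    """max over (s,t)-paths A of (sum of g_tilde over interior nodes) / path cost,
    i.e. the argument of the polar for the path-coding cost F_p (positive costs)."""
    best, best_nodes = 0.0, None
    for nodes, cost in st_paths(succ, s, t):
        interior = [v for v in nodes if v not in (s, t)]
        score = sum(g_tilde[v] for v in interior) / cost
        if score > best:
            best, best_nodes = score, interior
    return best, best_nodes

# toy DAG with unit edge costs: s -> 0 -> 1 -> t and s -> 1 -> t
succ = {"s": [(0, 1.0), (1, 1.0)], 0: [(1, 1.0)], 1: [("t", 1.0)]}
print(path_polar_bruteforce({0: 2.0, 1: 1.0}, succ, "s", "t"))  # (1.0, [0, 1])
```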
61 To find an $\epsilon$-accurate solution, the cutting plane method takes $O(mn/\epsilon^2)$ computations to optimize the nonsmooth piecewise-linear objective, while APG needs $O(\sqrt{n}/\epsilon)$ steps to optimize the smoothed objective, using a total time cost of $O(m\sqrt{n}/\epsilon)$. [sent-160, score-0.077]
62 4 Generalizing Beyond Atomic Norms. Although we find the above approach to be effective, many useful regularizers are not expressed in the form of an atomic norm (2), which makes evaluation of the polar a challenge and thus creates difficulty in applying Algorithm 1. [sent-163, score-0.735]
63 For example, another important class of structured sparse regularizers is given by an alternative, composite gauge construction: $\Omega_s(w) = \sum_i \kappa_i(w)$, where $\kappa_i$ is a closed gauge that can be different for different i. [sent-164, score-0.39]
64 The polar for such a regularizer is given by $\Omega_s^\circ(g) = \inf\{\max_i \kappa_i^\circ(w^i) : \sum_i w^i = g\}$, (20) where each $w^i$ is an independent vector and $\kappa_i^\circ$ corresponds to the polar of $\kappa_i$ (proof given in Appendix H). [sent-165, score-1.117]
65 Unfortunately, a polar in this form does not appear to be easy to compute. [sent-166, score-0.521]
66 However, for some regularizers in the form (20) the following proximal objective can indeed be computed efficiently: $\mathrm{ArgProx}_\Omega(g) = \arg\min_\theta \tfrac{1}{2}\|g - \theta\|^2 + \Omega(\theta)$. [sent-167, score-0.235]
67 Proposition 2: For any closed gauge Ω, its polar $\Omega^\circ$ can be equivalently expressed by: $\Omega^\circ(g) = \inf\{\zeta \ge 0 : \mathrm{Prox}_{\zeta\Omega}(g) = \tfrac{1}{2}\|g\|^2\}$. [sent-169, score-0.02]
68 Since the left-hand side of the inner constraint is decreasing in ζ, one can efficiently compute the polar $\Omega^\circ$ by a simple root-finding search in ζ. [sent-171, score-0.521]
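A minimal sketch of the reduction in Proposition 2: bisect on ζ and test whether the prox argmin has collapsed to zero (equivalently, whether the proximal objective value has reached ½‖g‖²). Only a prox oracle is required; the ℓ1 norm is used below purely as a test case, for which soft-thresholding is the prox and the ℓ∞ norm is the known polar.

```python
import numpy as np

def polar_via_prox(g, argprox, tol=1e-8):
    """Compute Omega^polar(g) = inf{zeta >= 0 : ArgProx_{zeta*Omega}(g) = 0}
    by bisection, given an oracle argprox(g, zeta) (Proposition 2)."""
    lo, hi = 0.0, 1.0
    while np.linalg.norm(argprox(g, hi)) > tol:   # grow hi until the prox is 0
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(argprox(g, mid)) > tol:
            lo = mid                              # zeta too small
        else:
            hi = mid
    return hi

# test with Omega = l1 norm: prox is soft-thresholding, polar is the l_inf norm
soft = lambda v, z: np.sign(v) * np.maximum(np.abs(v) - z, 0.0)
g = np.array([0.3, -2.0, 1.1])
print(polar_via_prox(g, soft), np.max(np.abs(g)))  # both approximately 2.0
```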
69 Thus, regularizers in the form of (20) can still be accommodated in an efficient GCG method in the form of Algorithm 1. [sent-172, score-0.137]
70 The main motivation for this regularizer arises from biostatistics, where one wishes to identify DNA copy number variations simultaneously for a group of related samples [18]. [sent-183, score-0.102]
71 In this case the total variation norm $\|\cdot\|_{\mathrm{TV}}$ encourages the dictionary to vary smoothly from entry to entry, while the $\ell_p$ norm shrinks the dictionary so that few latent features are selected. [sent-184, score-0.16]
72 Conveniently, $\Omega_p$ decomposes along the columns of W, so one can apply the reduction in Proposition 2 to compute its polar, assuming $\mathrm{Prox}_{\Omega_p}$ can be efficiently computed. [sent-185, score-0.551]
73 Solving $\mathrm{Prox}_{\Omega_p}$ appears non-trivial due to the composition of two overlapping norms; however, [25] showed that for p = 1 it can be computed efficiently by applying the Prox of each of the two norms successively. [sent-186, score-0.543]
74 Therefore, by combining this result with Propositions 2 and 3 we are able to efficiently compute the polar $\Omega_p^\circ$ and hence apply Algorithm 1 to solving (23) with respect to W. [sent-191, score-0.521]
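A sketch of how this combination can look for a single dictionary column with p = 1: the prox of the composite penalty λ1·TV + λ2·ℓ1 is computed by applying a 1-D total-variation prox and then soft-thresholding, as in [25], and the resulting oracle is plugged into the root-finding reduction sketched above. The routine prox_tv1d is an assumed external helper (e.g. a taut-string or dynamic-programming TV solver), and the weights λ1, λ2 are illustrative.

```python
import numpy as np

def prox_fused_column(g, zeta, prox_tv1d, lam_tv=1.0, lam_l1=1.0):
    """ArgProx of zeta * (lam_tv*||.||_TV + lam_l1*||.||_1) at g, for p = 1.

    Follows the successive-prox result of [25]: TV prox first, then
    soft-thresholding.  prox_tv1d(v, weight) is an assumed external oracle
    for the 1-D total-variation prox.
    """
    v = prox_tv1d(np.asarray(g, float), zeta * lam_tv)     # fused/TV step
    return np.sign(v) * np.maximum(np.abs(v) - zeta * lam_l1, 0.0)

# the polar of one column's penalty then follows from the reduction above, e.g.
#   polar_via_prox(g, lambda v, z: prox_fused_column(v, z, prox_tv1d))
```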
75 5 Experiments. To investigate the effectiveness of these computational schemes we considered three applications: group lasso, path coding, and latent fused lasso. [sent-192, score-0.29]
76 Group Lasso: CUR-like Matrix Factorization. Our first experiment considered an example of group lasso that is inspired by CUR matrix factorization [27]. [sent-195, score-0.128]
77 We compared three algorithms: GCG (Algorithm 1) with our polar operator which we call GCG TUM, GCG with the polar operator of [11, Algorithm 2] (GCG Secant), and APG (see Section 2. [sent-205, score-1.152]
78 The polar operator of GCG Secant was implemented with a mex wrapper of a max-flow package [30], while GCG TUM used L-BFGS to find an optimal solution $\{w_G^*\}$ for the smoothed version of the objective. [sent-208, score-0.601]
79 So we sorted $\{w_G\}$ and set the $w_G$ of the smallest k groups to 0, and the $w_G$ of the remaining groups to a common value that satisfies the constraint. [sent-241, score-0.096]
80 Both GCG methods relinquish local optimization (step 5) in Algorithm 1, but use a totally corrective variant of step 4, which allows efficient optimization by L-BFGS-B via pre-computing $X P^\circ_{F_g}(g_k) X$. [sent-247, score-0.074]
81 Thanks to the efficiency of the totally corrective update, almost all of the computation taken by GCG Secant was devoted to the polar operator. [sent-253, score-0.595]
82 Therefore the acceleration proffered by GCG TUM in computing the polar leads to a reduction of overall optimization time by at least 50%. [sent-254, score-0.551]
83 For this problem, we formulate (1) with a path coding regularizer $\Omega_{F_p}$ and the empirical risk: $f(w) = \sum_i \frac{1}{n_i}\log(1 + \exp(-y_i\langle w, x_i\rangle))$, (25) where $n_i$ is the number of examples that share the same label as $y_i$. [sent-259, score-0.146]
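A direct transcription of the class-balanced empirical risk (25), a sketch only, assuming labels y_i ∈ {±1}:

```python
import numpy as np

def class_balanced_logistic(w, X, y):
    """Empirical risk (25): logistic loss where example i is weighted by 1/n_i,
    with n_i the number of training examples sharing example i's label."""
    counts = {label: int(np.sum(y == label)) for label in np.unique(y)}
    weights = np.array([1.0 / counts[yi] for yi in y])
    margins = -y * (X @ w)                     # -y_i <w, x_i>
    return float(np.sum(weights * np.log1p(np.exp(margins))))
```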
84 We again considered three methods: APG, GCG with our polar operator (GCG TUM), and GCG with the polar operator from [12, Algorithm 1], which we label as GCG Secant. [sent-266, score-1.152]
85 We implemented the polar operator for GCG Secant based on Matlab’s built-in shortest path routine graphshortestpath (C++ wrapped by mex). [sent-269, score-0.657]
86 Again, both GCG methods only used totally corrective updates. [sent-272, score-0.074]
87 Figure 2 shows the result for path coding, with the regularization coefficient λ set to $10^{-2}$ and $10^{-3}$ so that the solution is moderately sparse. [sent-274, score-0.078]
88 Latent Fused Lasso. Finally, we compared GCG and APG on the latent fused lasso problem (23). [sent-277, score-0.255]
89 For each basis (column) of the dictionary, we use the model $\tilde W_{ij} = \sum_{s=1}^{S_j} c_s\, I(i_s \le i \le i_s + l_s)$, where $S_j \in \{3, 5, 8, 10\}$ specifies the number of consecutive blocks in the j-th basis, $c_s \in \{\pm1, \pm2, \pm3, \pm4, \pm5\}$, $i_s \in \{1, \ldots$ [sent-280, score-0.065]
90 In these experiments we observed that the polar typically only requires 5 to 6 calls to Prox. [sent-300, score-0.521]
91 [Figure 3: Latent fused lasso; x-axis: CPU time (sec).] [sent-307, score-0.147]
92 Conclusion. We have identified and investigated a new class of structured sparse regularizers whose polar can be reformulated as a linear program with totally unimodular constraints. [sent-308, score-0.813]
93 When these regularizers are plugged into the GCG algorithm, one observes significant reductions in run time for both group lasso and path coding regularization. [sent-310, score-0.224]
94 We have further developed a generic scheme for converting an efficient proximal solver to an efficient method for computing the polar operator. [sent-311, score-0.569]
95 This reduction allowed us to develop a fast new method for latent fused lasso. [sent-312, score-0.209]
96 For future work, we plan to study more general subset cost functions and investigate new structured regularizers amenable to our approach. [sent-313, score-0.227]
97 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-334, score-0.191]
98 Supervised feature selection in graphs with path coding penalties and network flows. [sent-380, score-0.096]
99 A fused lasso latent feature model for analyzing multi-sample aCGH data. [sent-418, score-0.255]
100 An efficient algorithm for a class of fused lasso problems. [sent-465, score-0.223]
wordName wordTfidf (topN-words)
[('gcg', 0.626), ('polar', 0.521), ('apg', 0.255), ('fused', 0.147), ('regularizers', 0.137), ('wg', 0.136), ('tum', 0.125), ('pu', 0.102), ('secant', 0.098), ('sf', 0.096), ('lasso', 0.076), ('prox', 0.074), ('mg', 0.074), ('gauge', 0.072), ('fg', 0.07), ('structured', 0.063), ('path', 0.059), ('fp', 0.058), ('ui', 0.056), ('argprox', 0.056), ('operator', 0.055), ('tv', 0.054), ('group', 0.052), ('dom', 0.051), ('cpu', 0.051), ('regularizer', 0.05), ('lp', 0.05), ('groups', 0.048), ('proximal', 0.048), ('atomic', 0.047), ('tumor', 0.045), ('appendix', 0.045), ('cur', 0.042), ('smoothing', 0.04), ('submodular', 0.039), ('mairal', 0.038), ('coding', 0.037), ('corrective', 0.037), ('wki', 0.037), ('polytope', 0.037), ('totally', 0.037), ('ga', 0.036), ('gi', 0.035), ('ow', 0.035), ('wk', 0.035), ('dictionary', 0.034), ('marginalized', 0.033), ('latent', 0.032), ('conv', 0.03), ('norm', 0.03), ('reduction', 0.03), ('objective', 0.029), ('rn', 0.029), ('gk', 0.029), ('convex', 0.028), ('seconds', 0.028), ('pg', 0.028), ('srbct', 0.028), ('unimodular', 0.028), ('sparsity', 0.027), ('cg', 0.027), ('cost', 0.027), ('sparse', 0.027), ('lifting', 0.026), ('piecewise', 0.026), ('nonnegative', 0.026), ('ls', 0.025), ('breast', 0.025), ('vk', 0.025), ('wij', 0.025), ('wi', 0.025), ('nr', 0.025), ('minimizations', 0.025), ('mex', 0.025), ('spams', 0.025), ('jenatton', 0.024), ('wa', 0.024), ('nonsmooth', 0.024), ('obozinski', 0.024), ('dag', 0.023), ('combinatorial', 0.023), ('bg', 0.023), ('gradient', 0.022), ('routine', 0.022), ('norms', 0.022), ('url', 0.022), ('vertices', 0.021), ('obj', 0.021), ('domain', 0.021), ('arg', 0.021), ('max', 0.021), ('acyclic', 0.02), ('cs', 0.02), ('proposition', 0.02), ('alberta', 0.019), ('composite', 0.019), ('accelerated', 0.019), ('lifted', 0.019), ('moderately', 0.019), ('hal', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
2 0.22810081 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
3 0.11185187 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
Author: Nikhil Rao, Christopher Cox, Rob Nowak, Timothy T. Rogers
Abstract: Multitask learning can be effective when features useful in one task are also useful for other tasks, and the group lasso is a standard method for selecting a common subset of features. In this paper, we are interested in a less restrictive form of multitask learning, wherein (1) the available features can be organized into subsets according to a notion of similarity and (2) features useful in one task are similar, but not necessarily identical, to the features best suited for other tasks. The main contribution of this paper is a new procedure called Sparse Overlapping Sets (SOS) lasso, a convex optimization that automatically selects similar features for related learning tasks. Error bounds are derived for SOSlasso and its consistency is established for squared error loss. In particular, SOSlasso is motivated by multisubject fMRI studies in which functional activity is classified using brain voxels as features. Experiments with real and synthetic data demonstrate the advantages of SOSlasso compared to the lasso and group lasso. 1
4 0.078955829 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
5 0.06473536 268 nips-2013-Reflection methods for user-friendly submodular optimization
Author: Stefanie Jegelka, Francis Bach, Suvrit Sra
Abstract: Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1
6 0.061122548 75 nips-2013-Convex Two-Layer Modeling
7 0.057154261 109 nips-2013-Estimating LASSO Risk and Noise Level
8 0.056949258 215 nips-2013-On Decomposing the Proximal Map
9 0.056527596 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
10 0.056185566 193 nips-2013-Mixed Optimization for Smooth Functions
11 0.055899691 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
12 0.052723955 91 nips-2013-Dirty Statistical Models
13 0.049369492 11 nips-2013-A New Convex Relaxation for Tensor Completion
14 0.049316529 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
15 0.049047559 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
16 0.049022727 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
17 0.046619616 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
18 0.046453543 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
19 0.04480622 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
20 0.044383731 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
topicId topicWeight
[(0, 0.129), (1, 0.038), (2, 0.049), (3, 0.048), (4, 0.022), (5, 0.051), (6, -0.098), (7, -0.029), (8, 0.015), (9, -0.031), (10, 0.045), (11, 0.006), (12, -0.001), (13, -0.111), (14, -0.078), (15, -0.012), (16, 0.031), (17, 0.015), (18, 0.003), (19, -0.01), (20, 0.022), (21, 0.07), (22, 0.047), (23, 0.075), (24, 0.055), (25, -0.059), (26, 0.061), (27, -0.024), (28, 0.074), (29, 0.177), (30, -0.161), (31, 0.065), (32, 0.024), (33, -0.055), (34, -0.078), (35, -0.049), (36, -0.049), (37, 0.094), (38, -0.037), (39, -0.013), (40, -0.002), (41, 0.021), (42, 0.023), (43, -0.022), (44, -0.051), (45, 0.006), (46, 0.022), (47, -0.029), (48, -0.047), (49, 0.023)]
simIndex simValue paperId paperTitle
1 0.90586752 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
same-paper 2 0.89233649 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
3 0.84901321 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
4 0.59138417 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
5 0.56249124 202 nips-2013-Multiclass Total Variation Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James von Brecht
Abstract: Ideas from the image processing literature have recently motivated a new set of clustering algorithms that rely on the concept of total variation. While these algorithms perform well for bi-partitioning tasks, their recursive extensions yield unimpressive results for multiclass clustering tasks. This paper presents a general framework for multiclass total variation clustering that does not rely on recursion. The results greatly outperform previous total variation algorithms and compare well with state-of-the-art NMF approaches. 1
6 0.53953844 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
7 0.53511924 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
8 0.51750976 268 nips-2013-Reflection methods for user-friendly submodular optimization
9 0.51734012 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
10 0.45337257 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
11 0.44949639 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
12 0.43492994 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
13 0.40359589 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
14 0.38311979 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
15 0.38099468 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
16 0.37639797 301 nips-2013-Sparse Additive Text Models with Low Rank Background
17 0.37533611 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
18 0.37160772 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
19 0.36387286 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
20 0.34836563 75 nips-2013-Convex Two-Layer Modeling
topicId topicWeight
[(16, 0.042), (33, 0.124), (34, 0.103), (36, 0.031), (41, 0.046), (49, 0.044), (56, 0.27), (70, 0.087), (85, 0.028), (89, 0.025), (93, 0.053), (95, 0.036)]
simIndex simValue paperId paperTitle
1 0.95654821 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time $O(\log n)$ at any time step n while achieving a nearly-optimal regression rate of $\tilde O(n^{-2/(2+d)})$ in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. 1
2 0.95529348 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
Author: Jing Xiang, Seyoung Kim
Abstract: We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure satisfying the DAG constraint in the second stage. Although this approach is effective in a lowdimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting. In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions. We demonstrate our approach on data simulated from benchmark Bayesian networks and real data. 1
3 0.9541657 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
Author: Josh S. Merel, Roy Fox, Tony Jebara, Liam Paninski
Abstract: In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user’s neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.
4 0.95408791 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
Author: Zheng Wen, Benjamin Van Roy
Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class Q, OCP selects optimal actions over all but at most dimE [Q] episodes, where dimE denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q* does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets. 1
5 0.95385361 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
Author: Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, S. Muthukrishnan
Abstract: Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem. 1
same-paper 6 0.95166391 249 nips-2013-Polar Operators for Structured Sparse Estimation
7 0.94064999 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
8 0.93869215 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
9 0.93674821 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
10 0.93655169 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
11 0.93616372 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
12 0.93552381 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
13 0.93374747 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
14 0.93259627 252 nips-2013-Predictive PAC Learning and Process Decompositions
15 0.93255693 42 nips-2013-Auditing: Active Learning with Outcome-Dependent Query Costs
16 0.93220723 230 nips-2013-Online Learning with Costly Features and Labels
17 0.93206137 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
18 0.92764914 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
19 0.92651385 319 nips-2013-Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints
20 0.92527354 224 nips-2013-On the Sample Complexity of Subspace Learning