nips nips2013 nips2013-56 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. [sent-4, score-0.37]
2 We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. [sent-5, score-1.023]
3 The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. [sent-6, score-1.418]
4 Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. [sent-7, score-0.382]
5 1 Introduction In many scientific areas, an important methodology that has withstood the test of time is the approximation of “complicated” functions by those that are easier to handle. [sent-8, score-0.093]
6 And one would probably also agree that a specific form of approximation should be favored if it suits our ultimate goal well. [sent-15, score-0.1]
7 Despite all this common sense, smooth approximations still dominate in optimization algorithms, bypassing some recent advances in optimizing nonsmooth functions [2, 3]. [sent-16, score-0.407]
8 We consider the composite minimization problem where the objective consists of a smooth loss function and a sum of nonsmooth functions. [sent-18, score-0.453]
9 Such problems have received increasing attention due to the rise of structured sparsity [4], notably the overlapping group lasso [5], the graph-guided fused lasso [6] and some others. [sent-19, score-0.504]
10 Popular gradient-type algorithms dealing with such composite problems include the generic subgradient method [7], (accelerated) proximal gradient (APG) [2, 3], and the smoothed accelerated proximal gradient (S-APG) [8]. [sent-21, score-1.568]
11 The subgradient method is applicable to any nonsmooth function, although the convergence rate is rather slow. [sent-22, score-0.409]
12 APG, being a recent advance, can handle simple functions [9] but for more complicated structured regularizers, an inner iterative procedure is needed, resulting in an overall convergence rate that could be as slow as the subgradient method [10]. [sent-23, score-0.239]
13 Lastly, S-APG simply runs APG on a smooth approximation of the original objective, resulting in a much improved convergence rate. [sent-24, score-0.187]
14 Our work is inspired by the recent advances in nonsmooth optimization [2, 3], whose building block is the proximal map of the nonsmooth function. [sent-25, score-1.249]
15 This proximal map is available in closed form for simple functions but can be quite expensive for more complicated functions such as the sum of nonsmooth functions we consider here. [sent-26, score-1.129]
16 A key observation we make is that oftentimes the proximal map for each individual summand can be easily computed, therefore a bold idea is to simply use the sum of proximal maps, pretending that the proximal map is a linear operator. [sent-27, score-2.05]
17 Somewhat surprisingly, this naive choice, when combined with APG, results in a novel proximal algorithm that is strictly better than S-APG, while keeping per-step complexity unchanged. [sent-28, score-0.67]
18 We justify our method via a new tool from convex analysis—the proximal average [11]. [sent-29, score-0.746]
19 In essence, instead of smoothing the nonsmooth function, we use a nonsmooth approximation whose proximal map is cheap to evaluate; after all, this is all we need to run APG. [sent-30, score-1.379]
20 After recalling the relevant tools from convex analysis in Section 3 we provide the theoretical justification of our method in Section 4. [sent-32, score-0.092]
21 2 Problem Formulation We are interested in solving the following composite minimization problem: min_{x∈R^d} ℓ(x) + f̄(x), where f̄(x) = Σ_{k=1}^{K} α_k f_k(x). (1) [sent-35, score-0.257]
22 Here ℓ is convex with L0-Lipschitz continuous gradient w.r.t. the norm ‖·‖. [sent-36, score-0.147]
23 Each f_k is convex and M_k-Lipschitz continuous w.r.t. the norm ‖·‖. [sent-42, score-0.267]
24 We are interested in the general case where the functions fk need not be differentiable. [sent-47, score-0.211]
25 As mentioned in the introduction, a generic scheme that solves (1) is the subgradient method [7], of which each step requires merely an arbitrary subgradient of the objective. [sent-48, score-0.228]
26 With a suitable stepsize, the subgradient method converges in at most O(1/ε²) steps, where ε > 0 is the desired accuracy. [sent-49, score-0.114]
27 Although general, the subgradient method is exceedingly slow, making it unsuitable for many practical applications. [sent-50, score-0.114]
28 Another recent algorithm for solving (1) is the (accelerated) proximal gradient (APG) [2, 3], of which each iteration needs to compute the proximal map of the nonsmooth part f̄ in (1): P^{1/L0}_{f̄}(x) = argmin_y (L0/2) ‖x − y‖² + f̄(y). [sent-51, score-1.659]
29 (Recall that L0 is the Lipschitz constant of the gradient of the smooth part ℓ in (1).) [sent-52, score-0.157]
30 Provided that the proximal map can be computed in constant time, it can be shown that APG converges within O(1/√ε) steps, significantly better than the subgradient method. [sent-53, score-0.829]
31 For some simple functions, the proximal map indeed is available in closed form; see [9] for a nice survey. [sent-54, score-0.715]
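For instance (an illustrative, standard example rather than one from this paper), the proximal map of the ℓ1 norm is elementwise soft thresholding; the sketch below assumes NumPy and checks the closed form against brute-force minimization on a grid.

```python
import numpy as np

def prox_l1(x, mu):
    """Proximal map of f(x) = ||x||_1 with parameter mu > 0, i.e.
    argmin_y (1/(2*mu)) * ||x - y||^2 + ||y||_1 (elementwise soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

# Sanity check in 1-D against brute-force minimization on a fine grid.
x, mu = 1.3, 0.5
grid = np.linspace(-3.0, 3.0, 200001)
brute = grid[np.argmin((x - grid) ** 2 / (2.0 * mu) + np.abs(grid))]
assert abs(prox_l1(np.array([x]), mu)[0] - brute) < 1e-4
```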
32 However, for more complicated functions such as the one we consider here, the proximal map itself is expensive to compute and an inner iterative subroutine is required. [sent-55, score-0.788]
33 Somewhat disappointingly, recent analysis has shown that such a two-loop procedure can be as slow as the subgradient method [10]. [sent-56, score-0.114]
34 Yet another approach, popularized by Nesterov [8], is to approximate each nonsmooth component fk with a smooth function and then run APG. [sent-57, score-0.575]
35 By carefully balancing the approximation and the convergence requirement of APG, the smoothed accelerated proximal gradient (S-APG) proposed in [8] converges in at most O(√(1/ε² + 1/ε)) steps, again much better than the subgradient method. [sent-58, score-0.94]
36 Each proximal map P^µ_{f_k} can be computed “easily” for any µ > 0. [sent-61, score-0.715]
37 (Accelerated variant) for t = 1, 2, . . . do 3: z_t = y_t − µ∇ℓ(y_t), 4: x_t = Σ_k α_k · P^µ_{f_k}(z_t), 5: η_{t+1} = (1 + √(1 + 4η_t²))/2, 6: y_{t+1} = x_t + ((η_t − 1)/η_{t+1})(x_t − x_{t−1}). [sent-68, score-0.14]
38 (Plain variant) for t = 1, 2, . . . do 3: z_t = x_{t−1} − µ∇ℓ(x_{t−1}), 4: x_t = Σ_k α_k · P^µ_{f_k}(z_t). [sent-74, score-0.076]
39 5: end for. We prefer to leave the exact meaning of “easily” unspecified, but roughly speaking, the proximal map should be no more expensive than computing the gradient of the smooth part so that it does not become the bottleneck. [sent-75, score-0.872]
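The listing below is a minimal sketch (not the authors' code) of the accelerated scheme above, assuming a gradient oracle for the smooth part, one prox oracle per component, weights α_k summing to one, and NumPy; all names are hypothetical.

```python
import numpy as np

def pa_apg(grad_ell, proxes, alphas, x0, mu, num_iters=100):
    """Accelerated scheme sketched above: gradient step on the smooth part,
    then the weighted sum of the individual proximal maps, followed by
    Nesterov momentum. proxes[k](z, mu) should return P^mu_{f_k}(z)."""
    x_prev = x = y = np.asarray(x0, dtype=float)
    eta = 1.0
    for _ in range(num_iters):
        z = y - mu * grad_ell(y)                                # step 3
        x = sum(a * P(z, mu) for a, P in zip(alphas, proxes))   # step 4
        eta_next = (1.0 + np.sqrt(1.0 + 4.0 * eta ** 2)) / 2.0  # step 5
        y = x + ((eta - 1.0) / eta_next) * (x - x_prev)         # step 6
        x_prev, eta = x, eta_next
    return x
```

Dropping steps 5-6 and taking the gradient step at x_{t−1} gives the plain variant.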
40 Unfortunately, in general, there is no known efficient way that reduces the proximal map of the average f̄ to the proximal maps of its individual components f_k; therefore the fast scheme APG is not readily applicable. [sent-78, score-1.544]
41 The main difficulty, of course, is due to the nonlinearity of the proximal map Pµ , when treated as an operator on the function f . [sent-79, score-0.715]
42 Despite this fact, we will “naively” pretend that the proximal map is linear and use Σ_k α_k P^µ_{f_k} in its place. [sent-80, score-0.715]
43 In this example, f_k(x) = ‖x_{g_k}‖, where g_k is a group (subset) of variables and x_g denotes a copy of x with all variables not contained in the group g set to 0. [sent-93, score-0.388]
44 This group regularizer has been proven quite useful in high-dimensional statistics with the capability of selecting meaningful groups of features [5]. [sent-94, score-0.157]
45 Clearly each f_k is convex and 1-Lipschitz continuous w.r.t. the norm ‖·‖. [sent-96, score-0.267]
46 Moreover, the proximal map P^µ_{f_k} is simply a re-scaling of the variables in group g_k, that is, [P^µ_{f_k}(x)]_j = x_j for j ∉ g_k and [P^µ_{f_k}(x)]_j = (1 − µ/‖x_{g_k}‖)_+ · x_j for j ∈ g_k, (3) where (λ)_+ = max{λ, 0}. [sent-102, score-1.028]
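A minimal sketch of (3), assuming NumPy and a hypothetical helper name, with the group given as a list of coordinate indices:

```python
import numpy as np

def prox_group(x, group, mu):
    """Proximal map of f_k(x) = ||x_{g_k}|| as in (3): coordinates outside the
    group are untouched; coordinates inside are scaled by (1 - mu/||x_{g_k}||)_+."""
    y = x.copy()
    nrm = np.linalg.norm(x[group])
    scale = max(1.0 - mu / nrm, 0.0) if nrm > 0 else 0.0
    y[group] = scale * x[group]
    return y

x = np.array([3.0, -4.0, 1.0, 2.0])
print(prox_group(x, group=[0, 1], mu=1.0))  # ||x_g|| = 5, factor 0.8 -> [2.4, -3.2, 1.0, 2.0]
```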
47 This example is an enhanced version of the fused lasso [12], with some graph structure exploited to improve feature selection in biostatistic applications [6]. [sent-105, score-0.263]
48 Specifically, given some graph whose nodes correspond to the feature variables, we let fij (x) = |xi − xj | for every edge (i, j) ∈ E. [sent-106, score-0.093]
49 For a general graph, the proximal map of the regularizer f̄ = Σ_{(i,j)∈E} α_ij f_ij, with α_ij ≥ 0 and Σ_{(i,j)∈E} α_ij = 1, is not easily computable. [sent-107, score-0.822]
50 Similarly to the above, each f_ij is 1-Lipschitz continuous w.r.t. the norm ‖·‖. [sent-108, score-0.084]
51 Moreover, the proximal map P^µ_{f_ij} is easy to compute: [P^µ_{f_ij}(x)]_s = x_s for s ∉ {i, j}, while [P^µ_{f_ij}(x)]_i = x_i − sign(x_i − x_j) min{µ, |x_i − x_j|/2} and [P^µ_{f_ij}(x)]_j = x_j + sign(x_i − x_j) min{µ, |x_i − x_j|/2}. (4) Again, both our assumptions are satisfied. [sent-112, score-0.821]
52 Note that in both examples we could have incorporated weights into the component functions f_k or f_ij, which amounts to changing α_k or α_ij accordingly. [sent-114, score-0.302]
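Likewise, a minimal sketch of (4), again assuming NumPy and a hypothetical helper name:

```python
import numpy as np

def prox_edge(x, i, j, mu):
    """Proximal map of f_ij(x) = |x_i - x_j| as in (4): only coordinates i and j
    move, each by min{mu, |x_i - x_j|/2} towards the other."""
    y = x.copy()
    d = x[i] - x[j]
    step = np.sign(d) * min(mu, abs(d) / 2.0)
    y[i] -= step
    y[j] += step
    return y

x = np.array([2.0, -1.0, 0.5])
print(prox_edge(x, 0, 1, mu=0.4))  # |d| = 3 > 2*mu: both ends move by mu -> [1.6, -0.6, 0.5]
print(prox_edge(x, 0, 2, mu=2.0))  # |d| = 1.5 <= 2*mu: ends meet at their average 1.25
```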
53 3 Technical Tools To justify our new algorithm, we need a few technical tools from convex analysis [14]. [sent-117, score-0.117]
54 Denote Γ0 as the set of all lower semicontinuous proper convex functions f : H → R ∪ {∞}. [sent-119, score-0.106]
55 For any f ∈ Γ0, we define its Moreau envelop (with parameter µ > 0) [14, 15] M^µ_f(x) = min_y (1/(2µ)) ‖x − y‖² + f(y), (5) and correspondingly the proximal map P^µ_f(x) = argmin_y (1/(2µ)) ‖x − y‖² + f(y). (6) [sent-127, score-0.833]
56 Since f is closed convex and ‖·‖² is strongly convex, the proximal map is well-defined and single-valued. [sent-128, score-0.784]
57 As mentioned before, the proximal map is the key component of fast schemes such as APG. [sent-129, score-0.77]
58 We summarize some nice properties of the Moreau envelop and the proximal map as: Proposition 1. [sent-130, score-0.807]
59 ii), albeit being trivial, is the driving force behind the proximal point algorithm [16]. [sent-139, score-0.597]
60 iii) justifies the “niceness” of the Moreau envelop and connects it with the proximal map. [sent-140, score-0.689]
61 And lastly vi), known as Moreau’s identity [15], plays an important role in the early development of convex analysis. [sent-142, score-0.1]
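The individual items of Proposition 1 did not survive the extraction; the Moreau identity referred to in vi) is, in the notation above, the standard decomposition x = P^µ_f(x) + µ P^{1/µ}_{f*}(x/µ). As a quick sanity check (an illustrative choice, not an example from the paper), the sketch below verifies it for f = |·| on the real line, whose conjugate is the indicator of [−1, 1], so that the conjugate's proximal map is simply the projection onto [−1, 1].

```python
import numpy as np

def prox_abs(x, mu):
    """P^mu_f for f(x) = |x|: scalar soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def prox_conj_abs(x):
    """Proximal map of f* for f(x) = |x|: f* is the indicator of [-1, 1], so its
    proximal map (for any parameter) is the projection onto [-1, 1]."""
    return np.clip(x, -1.0, 1.0)

# Moreau's identity: x = P^mu_f(x) + mu * P^{1/mu}_{f*}(x / mu).
for x in np.linspace(-3.0, 3.0, 13):
    for mu in (0.3, 1.0, 2.5):
        assert np.isclose(prox_abs(x, mu) + mu * prox_conj_abs(x / mu), x)
```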
62 Let SCµ ⊆ Γ0 denote the class of µ-strongly convex functions, that is, functions f such that f − µq is convex (where q := ½‖·‖²). [sent-145, score-0.106]
63 Similarly, let SSµ ⊆ Γ0 denote the class of finite-valued functions whose gradient is µ-Lipschitz continuous (w.r.t. the norm ‖·‖). [sent-146, score-0.115]
64 A well-known duality between strong convexity and smoothness is that for f ∈ Γ0 , we have f ∈ SCµ iff f ∗ ∈ SS1/µ , cf. [sent-150, score-0.101]
65 The Moreau envelop map M^µ : Γ0 → SS_{1/µ} that sends f ∈ Γ0 to M^µ_f is bijective, increasing, and concave on any convex subset of Γ0 (under the pointwise order). [sent-156, score-0.324]
66 It is clear that SS_{1/µ} is a convex subset of Γ0, which motivates the definition of the proximal average—the key object to us. [sent-157, score-0.666]
67 Recall that f̄ = Σ_k α_k f_k, with each f_k ∈ Γ0, is the convex combination of the component functions {f_k} under the weights {α_k}. [sent-159, score-0.485]
68 The proximal average A^µ_{f,α}, or simply A^µ when the component functions and weights are clear from context, is the unique function h ∈ Γ0 such that M^µ_h = Σ_{k=1}^{K} α_k M^µ_{f_k}. [sent-171, score-0.699]
69 Indeed, the existence of the proximal average follows from the surjectivity of M^µ while the uniqueness follows from the injectivity of M^µ, both proven in Proposition 2. [sent-172, score-0.631]
70 The main property of the proximal average, as seen from its definition, is that its Moreau envelop is the convex combination of the Moreau envelops of the component functions. [sent-173, score-0.789]
71 It is well-known that as µ → 0, M^µ_f → f pointwise [14], which, under the Lipschitz assumption, can be strengthened to uniform convergence (proof in Appendix B): Proposition 3. [sent-180, score-0.107]
72 For the proximal average, [11] showed that A^µ → f̄ pointwise, which again can be strengthened to uniform convergence (proof follows from (10) and Proposition 3 since A^µ ≥ M^µ_{A^µ}): Proposition 4. [sent-182, score-0.659]
73 As it turns out, S-APG approximates the nonsmooth function f̄ with the smooth function M^µ_{A^µ} while our algorithm operates on the nonsmooth approximation A^µ (note that it can be shown that A^µ is smooth iff some component f_i is smooth). [sent-184, score-0.867]
74 Observe that the proximal average A^µ remains nondifferentiable at 0 while M^µ_{A^µ} is smooth everywhere. [sent-190, score-0.734]
75 For x ≥ 0, f_1 = f_2 = f̄ = A^µ (the red circled line); thus the proximal average A^µ is a strictly tighter approximation than smoothing. [sent-191, score-0.732]
76 In particular M^µ_{A^µ} ≤ A^µ ≤ f̄, meaning that the proximal average A^µ is a better under-approximation of f̄ than M^µ_{A^µ}. [sent-193, score-0.659]
77 Let us compare the proximal average A^µ with the smooth approximation M^µ_{A^µ} on a 1-D example. [sent-194, score-0.79]
78 Our fix is almost trivial: If necessary, we use a bigger Lipschitz constant L0 = 1/µ so that we can compute the proximal map easily. [sent-207, score-0.715]
79 Note that if we could reduce P^{1/L0}_{A^µ} efficiently to P^µ_{A^µ}, we would end up with the optimal (overall) rate O(√(1/ε)), even though we approximate the nonsmooth function f̄ by the proximal average A^µ. [sent-214, score-0.898]
80 It is our inability to (efficiently) relate proximal maps that leads to the sacrifice in convergence rates. [sent-216, score-0.625]
81 5 Discussions To ease our discussion of related works, let us first point out a fact that is not always explicitly recognized, that is, S-APG essentially relies on approximating the nonsmooth function f̄ with M^µ_{A^µ}. [sent-217, score-0.267]
82 The smoothing idea introduced in [8] rests on the superficial max-structure assumption, that is, f(x) = max_{y∈C} ⟨x, y⟩ − h(y), where C is some bounded convex set and h ∈ Γ0. [sent-219, score-0.143]
83 (Recall that f is M-Lipschitz continuous w.r.t. the norm ‖·‖ iff dom f* ⊆ B_{‖·‖}(0, M), the ball centered at the origin with radius M.) [sent-223, score-0.089]
84 Finally, for the general case where f̄ is an average of K nonsmooth functions, the smoothing technique is applied in a component-by-component way, i.e., each f_k is smoothed separately. [sent-230, score-0.437]
85 For comparison, let us recall that S-APG finds an ε-accurate solution in at most O(√((L0 + M²/(2ε)) · 1/ε)) steps, since the Lipschitz constant of the gradient of ℓ + M^µ_{A^µ} is upper bounded by L0 + M²/(2ε) (under the choice of µ in Theorem 1). [sent-233, score-0.081]
86 In some sense it is quite remarkable that the seemingly “naive” approximation that pretends the linearity of the proximal map not only can be justified but also leads to a strictly better result. [sent-237, score-0.944]
87 As mentioned, S-APG approximates f̄ with the smooth function M^µ_{A^µ}. [sent-239, score-0.103]
88 This smooth approximation is beneficial if our capability is limited to smooth functions. [sent-240, score-0.298]
89 Put differently, S-APG implicitly treats applying the fast gradient algorithms as the ultimate goal. [sent-241, score-0.122]
90 However, the recent advances in nonsmooth optimization have broadened the range of fast schemes: it is not smoothness but the proximal map that allows fast convergence. [sent-242, score-1.055]
91 Just as APG improves upon the subgradient method, our approach, with the ultimate goal of enabling efficient computation of the proximal map, improves upon S-APG. [sent-243, score-0.755]
92 To summarize, smoothing is not free and it should be used only when truly needed. [sent-245, score-0.074]
93 6 Experiments We compare the proposed algorithm with S-APG on two important problems: overlapping group lasso and graph-guided fused lasso. [sent-248, score-0.382]
94 See Example 1 and Example 2 for details about the nonsmooth function f̄. [sent-249, score-0.267]
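To make the pieces concrete, the toy sketch below (hypothetical data and helper names, not the paper's experimental setup) assembles a small overlapping group lasso problem with a least-squares loss and runs the plain variant of the iteration from Section 2, reusing the per-group proximal map of (3).

```python
import numpy as np

def prox_group(x, group, mu):
    # Proximal map of ||x_g|| as in (3).
    y = x.copy()
    nrm = np.linalg.norm(x[group])
    y[group] *= max(1.0 - mu / nrm, 0.0) if nrm > 0 else 0.0
    return y

def pa_pg(grad_ell, proxes, alphas, x0, mu, iters=500):
    # Plain variant: gradient step on the smooth part, then the averaged prox.
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        z = x - mu * grad_ell(x)
        x = sum(a * P(z, mu) for a, P in zip(alphas, proxes))
    return x

# Toy overlapping group lasso: least-squares loss plus two overlapping groups.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
L0 = np.linalg.norm(A, 2) ** 2                       # Lipschitz constant of the gradient
grad = lambda x: A.T @ (A @ x - b)
groups = [np.array([0, 1, 2]), np.array([2, 3, 4])]  # the groups overlap at coordinate 2
alphas = [0.5, 0.5]
proxes = [lambda z, mu, g=g: prox_group(z, g, mu) for g in groups]
print(pa_pg(grad, proxes, alphas, np.zeros(5), mu=1.0 / L0))
```

The stepsize µ = 1/L0 is used here only for simplicity; in the paper µ is also tied to the target accuracy, and the gradient step then uses the larger of L0 and 1/µ as the effective Lipschitz constant.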
95 7 Conclusions We have considered the composite minimization problem which consists of a smooth loss and a sum of nonsmooth regularizers. [sent-288, score-0.453]
96 Different from smoothing, we considered a seemingly naive nonsmooth approximation which simply pretends the linearity of the proximal map. [sent-289, score-1.076]
97 Based on the proximal average, a new tool from convex analysis, we proved that the new approximation leads to a novel algorithm that strictly improves the state-of-the-art. [sent-290, score-0.788]
98 Experiments on both overlapping group lasso and graph-guided fused lasso verified the superiority of the proposed method. [sent-291, score-0.48]
99 Smoothing proximal gradient method for general structured sparse regression. [sent-339, score-0.675]
100 Dual averaging and proximal gradient descent for online alternating direction multiplier method. [sent-361, score-0.673]
wordName wordTfidf (topN-words)
[('proximal', 0.597), ('apg', 0.43), ('nonsmooth', 0.267), ('pg', 0.181), ('fk', 0.174), ('fused', 0.165), ('moreau', 0.139), ('map', 0.118), ('pa', 0.118), ('subgradient', 0.114), ('smooth', 0.103), ('lasso', 0.098), ('envelop', 0.092), ('mf', 0.082), ('smoothing', 0.074), ('accelerated', 0.07), ('proposition', 0.07), ('convex', 0.069), ('pretends', 0.068), ('lipschitz', 0.065), ('composite', 0.061), ('group', 0.061), ('fij', 0.06), ('overlapping', 0.058), ('approximation', 0.056), ('pf', 0.055), ('gradient', 0.054), ('fenchel', 0.052), ('yaoliang', 0.052), ('gk', 0.047), ('pointwise', 0.045), ('xgk', 0.045), ('strictly', 0.045), ('ultimate', 0.044), ('assumption', 0.04), ('yurii', 0.04), ('heinz', 0.04), ('lucet', 0.04), ('yves', 0.04), ('iff', 0.04), ('xt', 0.039), ('zt', 0.037), ('functions', 0.037), ('fix', 0.037), ('seyoung', 0.037), ('ralph', 0.037), ('complicated', 0.036), ('justi', 0.036), ('duality', 0.036), ('capability', 0.036), ('groups', 0.036), ('ij', 0.035), ('mk', 0.035), ('linearity', 0.035), ('bauschke', 0.034), ('strengthened', 0.034), ('average', 0.034), ('xj', 0.033), ('iv', 0.032), ('component', 0.031), ('alberta', 0.031), ('lastly', 0.031), ('stepsize', 0.03), ('argminx', 0.03), ('maxy', 0.03), ('naive', 0.028), ('convergence', 0.028), ('dom', 0.027), ('id', 0.027), ('recall', 0.027), ('argmin', 0.026), ('sc', 0.025), ('justify', 0.025), ('smoothness', 0.025), ('seemingly', 0.025), ('yt', 0.025), ('structured', 0.024), ('regularizer', 0.024), ('fast', 0.024), ('continuous', 0.024), ('regularizers', 0.023), ('monotone', 0.023), ('patrick', 0.023), ('tools', 0.023), ('eric', 0.023), ('easily', 0.023), ('clearly', 0.023), ('ax', 0.022), ('norm', 0.022), ('author', 0.022), ('differently', 0.022), ('minimization', 0.022), ('af', 0.022), ('alternating', 0.022), ('smoothed', 0.021), ('tool', 0.021), ('xs', 0.02), ('contend', 0.02), ('legitimate', 0.02), ('keith', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
2 0.28971586 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
3 0.22810081 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
4 0.18076621 268 nips-2013-Reflection methods for user-friendly submodular optimization
Author: Stefanie Jegelka, Francis Bach, Suvrit Sra
Abstract: Recently, it has become evident that submodularity naturally captures widely occurring concepts in machine learning, signal processing and computer vision. Consequently, there is need for efficient optimization procedures for submodular functions, especially for minimization problems. While general submodular minimization is challenging, we propose a new method that exploits existing decomposability of submodular functions. In contrast to previous approaches, our method is neither approximate, nor impractical, nor does it need any cumbersome parameter tuning. Moreover, it is easy to implement and parallelize. A key component of our method is a formulation of the discrete submodular minimization problem as a continuous best approximation problem that is solved through a sequence of reflections, and its solution can be easily thresholded to obtain an optimal discrete solution. This method solves both the continuous and discrete formulations of the problem, and therefore has applications in learning, inference, and reconstruction. In our experiments, we illustrate the benefits of our method on two image segmentation tasks. 1
5 0.16299047 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
6 0.13150673 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
7 0.10381906 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
8 0.099706061 202 nips-2013-Multiclass Total Variation Clustering
9 0.096338421 301 nips-2013-Sparse Additive Text Models with Low Rank Background
10 0.082434863 193 nips-2013-Mixed Optimization for Smooth Functions
11 0.078935817 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
12 0.07720273 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
13 0.075473465 257 nips-2013-Projected Natural Actor-Critic
14 0.073193595 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
15 0.06567324 109 nips-2013-Estimating LASSO Risk and Noise Level
16 0.061534759 219 nips-2013-On model selection consistency of penalized M-estimators: a geometric theory
17 0.061324697 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
18 0.060846549 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
19 0.053746007 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
20 0.053283297 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
topicId topicWeight
[(0, 0.159), (1, 0.036), (2, 0.088), (3, 0.061), (4, 0.008), (5, 0.079), (6, -0.139), (7, -0.003), (8, 0.035), (9, -0.044), (10, 0.016), (11, 0.029), (12, -0.013), (13, -0.142), (14, -0.121), (15, 0.003), (16, 0.037), (17, -0.0), (18, 0.048), (19, 0.07), (20, 0.013), (21, 0.107), (22, 0.134), (23, 0.126), (24, -0.008), (25, -0.1), (26, 0.096), (27, -0.006), (28, 0.104), (29, 0.305), (30, -0.203), (31, 0.107), (32, -0.018), (33, -0.081), (34, -0.167), (35, -0.102), (36, -0.017), (37, 0.196), (38, -0.009), (39, -0.004), (40, 0.04), (41, 0.003), (42, 0.034), (43, -0.016), (44, -0.032), (45, 0.063), (46, 0.03), (47, -0.156), (48, -0.098), (49, 0.068)]
simIndex simValue paperId paperTitle
same-paper 1 0.95904547 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
2 0.9442274 215 nips-2013-On Decomposing the Proximal Map
Author: Yao-Liang Yu
Abstract: The proximal map is the key step in gradient-type algorithms, which have become prevalent in large-scale high-dimensional problems. For simple functions this proximal map is available in closed-form while for more complicated functions it can become highly nontrivial. Motivated by the need of combining regularizers to simultaneously induce different types of structures, this paper initiates a systematic investigation of when the proximal map of a sum of functions decomposes into the composition of the proximal maps of the individual summands. We not only unify a few known results scattered in the literature but also discover several new decompositions obtained almost effortlessly from our theory. 1
3 0.77721024 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
4 0.54966313 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
5 0.54860461 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
Author: Ke Hou, Zirui Zhou, Anthony Man-Cho So, Zhi-Quan Luo
Abstract: Motivated by various applications in machine learning, the problem of minimizing a convex smooth loss function with trace norm regularization has received much attention lately. Currently, a popular method for solving such problem is the proximal gradient method (PGM), which is known to have a sublinear rate of convergence. In this paper, we show that for a large class of loss functions, the convergence rate of the PGM is in fact linear. Our result is established without any strong convexity assumption on the loss function. A key ingredient in our proof is a new Lipschitzian error bound for the aforementioned trace norm–regularized problem, which may be of independent interest.
6 0.51719296 202 nips-2013-Multiclass Total Variation Clustering
7 0.47257197 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
8 0.46514609 268 nips-2013-Reflection methods for user-friendly submodular optimization
9 0.40160203 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses
10 0.37844583 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
11 0.36578718 303 nips-2013-Sparse Overlapping Sets Lasso for Multitask Learning and its Application to fMRI Analysis
12 0.36194795 128 nips-2013-Generalized Method-of-Moments for Rank Aggregation
13 0.33497092 257 nips-2013-Projected Natural Actor-Critic
14 0.32981485 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
15 0.32461777 301 nips-2013-Sparse Additive Text Models with Low Rank Background
16 0.32277852 265 nips-2013-Reconciling "priors" & "priors" without prejudice?
17 0.31908575 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
18 0.31854761 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
19 0.31329513 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
20 0.30787838 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
topicId topicWeight
[(16, 0.039), (33, 0.124), (34, 0.094), (41, 0.033), (49, 0.028), (56, 0.088), (70, 0.36), (85, 0.035), (89, 0.021), (93, 0.066), (95, 0.038)]
simIndex simValue paperId paperTitle
Author: Hesham Mostafa, Lorenz. K. Mueller, Giacomo Indiveri
Abstract: We present a recurrent neuronal network, modeled as a continuous-time dynamical system, that can solve constraint satisfaction problems. Discrete variables are represented by coupled Winner-Take-All (WTA) networks, and their values are encoded in localized patterns of oscillations that are learned by the recurrent weights in these networks. Constraints over the variables are encoded in the network connectivity. Although there are no sources of noise, the network can escape from local optima in its search for solutions that satisfy all constraints by modifying the effective network connectivity through oscillations. If there is no solution that satisfies all constraints, the network state changes in a seemingly random manner and its trajectory approximates a sampling procedure that selects a variable assignment with a probability that increases with the fraction of constraints satisfied by this assignment. External evidence, or input to the network, can force variables to specific values. When new inputs are applied, the network re-evaluates the entire set of variables in its search for states that satisfy the maximum number of constraints, while being consistent with the external input. Our results demonstrate that the proposed network architecture can perform a deterministic search for the optimal solution to problems with non-convex cost functions. The network is inspired by canonical microcircuit models of the cortex and suggests possible dynamical mechanisms to solve constraint satisfaction problems that can be present in biological networks, or implemented in neuromorphic electronic circuits. 1
2 0.90351194 84 nips-2013-Deep Neural Networks for Object Detection
Author: Christian Szegedy, Alexander Toshev, Dumitru Erhan
Abstract: Deep Neural Networks (DNNs) have recently shown outstanding performance on image classification tasks [14]. In this paper we go one step further and address the problem of object detection using DNNs, that is not only classifying but also precisely localizing objects of various classes. We present a simple and yet powerful formulation of object detection as a regression problem to object bounding box masks. We define a multi-scale inference procedure which is able to produce high-resolution object detections at a low cost by a few network applications. State-of-the-art performance of the approach is shown on Pascal VOC. 1
3 0.89415199 157 nips-2013-Learning Multi-level Sparse Representations
Author: Ferran Diego Andilla, Fred A. Hamprecht
Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neurosciences image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that allows to optimize the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows to simultaneously interpret a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world Calcium imaging data. 1
4 0.89159292 16 nips-2013-A message-passing algorithm for multi-agent trajectory planning
Author: Jose Bento, Nate Derbinsky, Javier Alonso-Mora, Jonathan S. Yedidia
Abstract: We describe a novel approach for computing collision-free global trajectories for p agents with specified initial and final configurations, based on an improved version of the alternating direction method of multipliers (ADMM). Compared with existing methods, our approach is naturally parallelizable and allows for incorporating different cost functionals with only minor adjustments. We apply our method to classical challenging instances and observe that its computational requirements scale well with p for several cost functionals. We also show that a specialization of our algorithm can be used for local motion planning by solving the problem of joint optimization in velocity space. 1
5 0.87586945 90 nips-2013-Direct 0-1 Loss Minimization and Margin Maximization with Boosting
Author: Shaodan Zhai, Tian Xia, Ming Tan, Shaojun Wang
Abstract: We propose a boosting method, DirectBoost, a greedy coordinate descent algorithm that builds an ensemble classifier of weak classifiers through directly minimizing empirical classification error over labeled training examples; once the training classification error is reduced to a local coordinatewise minimum, DirectBoost runs a greedy coordinate ascent algorithm that continuously adds weak classifiers to maximize any targeted arbitrarily defined margins until reaching a local coordinatewise maximum of the margins in a certain sense. Experimental results on a collection of machine-learning benchmark datasets show that DirectBoost gives better results than AdaBoost, LogitBoost, LPBoost with column generation and BrownBoost, and is noise tolerant when it maximizes an n′ th order bottom sample margin. 1
same-paper 6 0.83264691 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
7 0.80140692 15 nips-2013-A memory frontier for complex synapses
8 0.70114613 141 nips-2013-Inferring neural population dynamics from multiple partial recordings of the same neural circuit
9 0.69573492 121 nips-2013-Firing rate predictions in optimal balanced networks
10 0.685193 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
11 0.66800374 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
12 0.64668679 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
14 0.62867498 64 nips-2013-Compete to Compute
15 0.61797208 264 nips-2013-Reciprocally Coupled Local Estimators Implement Bayesian Information Integration Distributively
16 0.61090803 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
17 0.60933244 255 nips-2013-Probabilistic Movement Primitives
18 0.60192305 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
19 0.59796274 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
20 0.59588754 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking