nips nips2011 nips2011-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. [sent-7, score-0.915]
2 Proximal-gradient methods and accelerated proximal-gradient methods [1, 2] are among the most important methods for taking advantage of the structure of many of the non-smooth optimization problems that arise in practice. [sent-11, score-0.526]
3 In particular, these methods address composite optimization problems of the form minimize_{x∈Rd} f(x) := g(x) + h(x), (1) where g and h are convex functions but only g is smooth. [sent-12, score-0.217]
4 Proximal-gradient methods are an appealing approach for solving these types of non-smooth optimization problems because of their fast theoretical convergence rates and strong practical performance. [sent-14, score-0.332]
5 While classical subgradient methods only achieve an error level on the objective function of O(1/√k) after k iterations, proximal-gradient methods have an error of O(1/k), while accelerated proximal-gradient methods further reduce this to O(1/k^2) [1, 2]. [sent-15, score-0.587]
6 That is, accelerated proximal-gradient methods for non-smooth convex optimization achieve the same optimal convergence rate that accelerated gradient methods achieve for smooth optimization. [sent-16, score-1.471]
7 Each iteration of a proximal-gradient method requires the calculation of the proximity operator, proxL(y) = argmin_{x∈Rd} { (L/2)||x − y||^2 + h(x) }, (2) where L is the Lipschitz constant of the gradient of g. [sent-17, score-0.659]
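As a concrete illustration (not taken from the paper), here is a minimal Python sketch of two cases where the proximity operator in (2) has a closed form: soft-thresholding when h is a scaled ℓ1-norm, and Euclidean projection when h is the indicator of a box. The function names and the NumPy-based setup are our own assumptions.

```python
import numpy as np

def prox_l1(y, lam, L):
    # Closed form of (2) for h(x) = lam * ||x||_1:
    # argmin_x (L/2)||x - y||^2 + lam * ||x||_1 is soft-thresholding with threshold lam/L.
    return np.sign(y) * np.maximum(np.abs(y) - lam / L, 0.0)

def prox_box_indicator(y, lo, hi):
    # When h is the indicator of the box [lo, hi]^d, (2) reduces to Euclidean
    # projection onto the box, independently of L.
    return np.clip(y, lo, hi)
```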
8 However, in many scenarios the proximity operator may not have an analytic solution, or it may be very expensive to compute this solution exactly. [sent-19, score-0.535]
9 It is known that proximal-gradient methods that use an approximate proximity operator converge under only weak assumptions [16, 17]; we briefly review this and other related work in the next section. [sent-22, score-0.625]
10 In this work we show in several contexts that, provided the error in the proximity operator calculation is controlled in an appropriate way, inexact proximal-gradient strategies achieve the same convergence rates as the corresponding exact methods. [sent-24, score-1.388]
11 In particular, in Section 4 we first consider convex objectives and analyze the inexact proximal-gradient (Proposition 1) and accelerated proximal-gradient (Proposition 2) methods. [sent-25, score-0.858]
12 Note that in these analyses, we also consider the possibility that there is an error in the calculation of the gradient of g. [sent-27, score-0.24]
13 We then present an experimental comparison of various inexact proximal-gradient strategies in the context of solving a structured sparsity problem (Section 5). [sent-28, score-0.517]
14 In the basic proximal-gradient method we choose yk = xk , while in the accelerated proximal-gradient method we choose yk = xk + βk (xk − xk−1 ), where the sequence (βk ) is chosen appropriately. [sent-30, score-1.439]
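The iteration itself is simple to state in code. Below is a hedged Python sketch of the inexact proximal-gradient recursion, where the (possibly inexact) gradient oracle and the approximate proximity operator are supplied as callables; the function signature and the default extrapolation sequence β_k = (k−1)/(k+2) are our assumptions, not an implementation from the paper.

```python
import numpy as np

def inexact_proximal_gradient(grad, prox, x0, L, n_iters, accelerated=False):
    """Sketch of recursion (3): x_k = prox applied to y_{k-1} - (1/L) * (inexact gradient).

    `grad(y)` may return an inexact gradient; `prox(z, k)` may solve (2) only up to
    an accuracy eps_k that it controls internally (e.g. via a duality gap)."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, n_iters + 1):
        if accelerated:
            beta_k = (k - 1.0) / (k + 2.0)   # standard extrapolation sequence
            y = x + beta_k * (x - x_prev)
        else:
            y = x                            # basic variant: y_k = x_k
        x_prev = x
        x = prox(y - grad(y) / L, k)
    return x
```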
15 There is a substantial amount of work on methods that use an exact proximity operator but have an error in the gradient calculation, corresponding to the special case where εk = 0 but ek is non-zero. [sent-31, score-0.98]
16 For example, when the ek are independent, zero-mean, and finite-variance random variables, then proximal-gradient methods achieve the (optimal) error level of O(1/√k) [18, 19]. [sent-32, score-0.331]
17 This contrasts with our analysis, where by allowing the error to change at every iteration we can achieve convergence to the optimal solution. [sent-36, score-0.239]
18 Other authors have analyzed the convergence rate of the gradient and projected-gradient methods with a decreasing sequence of errors [24, 25], but this analysis does not consider the important class of accelerated gradient methods. [sent-38, score-0.992]
19 In contrast, the analysis of [22] allows a decreasing sequence of errors (though convergence rates in this context are not explicitly mentioned) and considers the accelerated projected-gradient method. [sent-39, score-0.801]
20 This non-intuitive oracle leads to a novel analysis of smoothing methods, but leads to slower convergence rates than proximal-gradient methods. [sent-41, score-0.26]
21 The analysis of [21] considers errors in both the gradient and projection operators for accelerated projected-gradient methods, but this analysis requires that the domain of the function is compact. [sent-42, score-0.676]
22 In the context of proximal-point algorithms, there is a substantial literature on using inexact proximity operators with a decreasing sequence of errors, dating back to the seminal work of Rockafellar [26]. [sent-44, score-0.973]
23 Accelerated proximal-point methods with a decreasing sequence of errors have also been examined, beginning with [27]. [sent-45, score-0.216]
24 However, unlike proximal-gradient methods where the proximity operator is only computed with respect to the non-smooth function h, proximal-point methods require the calculation of the proximity operator with respect to the full objective function. [sent-46, score-1.274]
25 In the context of composite optimization problems of the form (1), this requires the calculation of the proximity operator with respect to g + h. [sent-47, score-0.776]
26 Since it ignores the structure of the problem, this proximity operator may be as difficult to compute (even approximately) as the minimizer of the original problem. [sent-48, score-0.535]
27 Convergence of inexact proximal-gradient methods can be established with only weak assumptions on the method used to approximately solve (2). [sent-49, score-0.474]
28 Convergence of inexact proximal-gradient methods can also be established under the assumption that the norms of the errors are summable [17]. [sent-51, score-0.584]
29 However, these prior works did not consider the rate of convergence of inexact proximal-gradient methods, nor did they consider accelerated proximal-gradient methods. [sent-52, score-0.984]
30 Indeed, the authors of [7] chose to use the non-accelerated variant of the proximal-gradient algorithm since even convergence of the accelerated proximal-gradient method had not been established under an inexact proximity operator. [sent-53, score-1.36]
31 While preparing the final version of this work, [28] independently gave an analysis of the accelerated proximal-gradient method with an inexact proximity operator and a decreasing sequence of errors (assuming an exact gradient). [sent-54, score-1.54]
32 However, while we only assume that the proximal problem can be solved up to a certain accuracy, they make the much stronger assumption that the inexact proximity operator yields an εk -subdifferential of h [28, Definition 2. [sent-56, score-1.169]
33 In particular, the terms in εi disappear from the expressions of Ak and its variants appearing in the propositions, leading to the optimal convergence rate with a slower decay of εi. [sent-59, score-0.247]
34 3 Notation and Assumptions In this work, we assume that the smooth function g in (1) is convex and differentiable, and that its gradient g′ is Lipschitz-continuous with constant L, meaning that for all x and y in Rd we have ||g′(x) − g′(y)|| ≤ L||x − y||. [sent-61, score-0.258]
35 This allows h to be any real-valued convex function, but also allows for the possibility that h is an extended real-valued convex function. [sent-71, score-0.192]
36 For example, h could be the indicator function of a convex set, and in this case the proximity operator becomes the projection operator. [sent-72, score-0.654]
37 We will use xk to denote the parameter vector at iteration k, and x∗ to denote a minimizer of f. [sent-73, score-0.331]
38 We use ek to denote the error in the calculation of the gradient at iteration k, and we use εk to denote the error in the proximal objective function achieved by xk, meaning that (L/2)||xk − y||^2 + h(xk) ≤ εk + min_{x∈Rd} { (L/2)||x − y||^2 + h(x) }, (4) where y = yk−1 − (1/L)(g′(yk−1) + ek). [sent-75, score-1.627]
39 Note that the proximal optimization problem (2) is strongly convex and in practice we are often able to obtain such bounds via a duality gap (e. [sent-76, score-0.504]
40 We first consider the basic proximal-gradient method in the convex case: Proposition 1 (Basic proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with yk = xk . [sent-81, score-0.788]
41 This result implies that the well-known O(1/k) convergence rate for the gradient method without errors still holds when both (||ek||) and (√εk) are summable. [sent-85, score-0.7]
42 A sufficient condition to achieve this is that ||ek|| decreases as O(1/k^(1+δ)) while εk decreases as O(1/k^(2+δ′)) for any δ, δ′ > 0. [sent-86, score-0.341]
43 Note that a faster convergence of these two errors will not improve the convergence rate, but will yield a better constant factor. [sent-87, score-0.424]
44 It is interesting to consider what happens if (||ek||) or (√εk) is not summable. [sent-88, score-0.235]
45 For instance, if ||ek|| and √εk decrease as O(1/k), then Ak grows as O(log k) (note that Bk is always smaller than Ak) and the convergence of the function values is in O(log^2 k / k). [sent-89, score-0.422]
46 We now turn to the case of an accelerated proximal-gradient method. [sent-91, score-0.372]
47 (19) and (27)]: Proposition 2 (Accelerated proximal-gradient method - Convexity) Assume (H) and that we iterate recursion (3) with yk = xk + ((k−1)/(k+2)) (xk − xk−1). [sent-93, score-0.592]
48 In this case, we require the series (k||ek||) and (k√εk) to be summable to achieve the optimal O(1/k^2) rate, which is an (unsurprisingly) stronger constraint than in the basic case. [sent-95, score-0.4]
49 A sufficient condition is for ||ek|| and √εk to decrease as O(1/k^(2+δ)) for any δ > 0. [sent-96, score-0.272]
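For concreteness, the error schedules suggested by Propositions 1 and 2 can be written as simple functions; this is an illustrative sketch with arbitrary constants and δ, not code from the paper.

```python
def basic_error_targets(k, delta=0.1, c_grad=1.0, c_prox=1.0):
    # Makes (||e_k||) and (sqrt(eps_k)) summable, as required by Proposition 1:
    # ||e_k|| = O(1/k^(1+delta)),  eps_k = O(1/k^(2+delta)).
    return c_grad / k ** (1.0 + delta), c_prox / k ** (2.0 + delta)

def accelerated_error_targets(k, delta=0.1, c_grad=1.0, c_prox=1.0):
    # Makes (k*||e_k||) and (k*sqrt(eps_k)) summable, as required by Proposition 2:
    # ||e_k|| = O(1/k^(2+delta)),  eps_k = O(1/k^(4+delta)).
    return c_grad / k ** (2.0 + delta), c_prox / k ** (4.0 + delta)
```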
50 Note that, as opposed to Proposition 1 that is stated for the average iterate, this bound is for the last iterate xk . [sent-97, score-0.368]
51 First, if ||ek|| or √εk decreases at a rate of O(1/k^2), then k(||ek|| + √εk) decreases as O(1/k) and Ak grows as O(log k) (note that Bk is always smaller than Ak), yielding a convergence rate of O(log^2 k / k^2) for f(xk) − f(x∗). [sent-99, score-1.073]
52 Also, and perhaps more interestingly, if ||ek|| or √εk decreases at a rate of O(1/k), Eq. [sent-100, score-0.344]
53 More generally, the form of Ak and Bk indicates that errors have a greater effect on the accelerated method than on the basic method. [sent-102, score-0.596]
54 Hence, as also discussed in [22], unlike in the error-free case the accelerated method may not necessarily be better than the basic method because it is more sensitive to errors in the computation. [sent-103, score-0.653]
55 In the case where g is strongly convex it is possible to obtain linear convergence rates that depend on the ratio γ = µ/L as opposed to the sublinear convergence rates discussed above. [sent-104, score-0.555]
56 In particular, we obtain the following convergence rate on the iterates of the basic proximal-gradient method: Proposition 3 (Basic proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with yk = xk. [sent-105, score-0.944]
57 Then, for all k ≥ 1, we have: ||xk − x∗|| ≤ (1 − γ)^k (||x0 − x∗|| + Āk), (7) with Āk = Σ_{i=1}^k (1 − γ)^(−i) (||ei||/L + √(2εi/L)). [sent-106, score-0.343]
58 A consequence of this proposition is that we obtain a linear rate of convergence even in the presence of errors, provided that ||ek|| and √εk decrease linearly to 0. [sent-107, score-0.589]
59 If they do so at a rate of Q < (1 − γ), then the convergence rate of xk −x∗ is linear with constant (1 − γ), as in the error-free algorithm. [sent-108, score-0.6]
60 If we have Q > (1 − γ), then the convergence of xk − x∗ is linear with constant Q . [sent-109, score-0.456]
61 If we have Q = (1 − γ), then ||xk − x∗|| converges to 0 as O(k (1 − γ)^k) = o([(1 − γ) + δ]^k) for all δ > 0. [sent-110, score-0.306]
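To make the "decrease linearly" condition of Proposition 3 concrete, the following sketch generates geometric error targets at a rate Q below (1 − γ); the margin, constants, and function name are our own choices, given only as an illustration.

```python
def geometric_error_targets(k, gamma, margin=0.5, e0=1.0, eps0=1.0):
    # Errors decreasing linearly (geometrically) at a rate Q < 1 - gamma, which by
    # Proposition 3 preserves the (1 - gamma)^k linear rate of the error-free method.
    Q = margin * (1.0 - gamma)                 # any Q strictly below 1 - gamma
    return e0 * Q ** k, eps0 * Q ** (2 * k)    # eps_k decays at Q^2 so sqrt(eps_k) decays at Q
```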
62 Finally, we consider the accelerated proximal-gradient algorithm when g is strongly convex. [sent-111, score-0.411]
63 1]: Proposition 4 (Accelerated proximal-gradient method - Strong convexity) Assume (H), that g is µ-strongly convex, and that we iterate recursion (3) with yk = xk + ((1 − √γ)/(1 + √γ)) (xk − xk−1). [sent-114, score-0.592]
64 This proposition implies that we obtain a linear rate of convergence in the presence of errors, provided that ||ek|| and √εk decrease linearly to 0. [sent-117, score-0.478]
65 Thus, the accelerated inexact proximal-gradient method will have a faster convergence rate than the exact basic proximal-gradient method provided that Q < (1 − γ). [sent-119, score-1.23]
66 Oddly, in our analysis of the strongly convex case, the accelerated method is less sensitive to errors than the basic method. [sent-120, score-0.76]
67 However, unlike the basic method, the accelerated method requires knowing µ in addition to L. [sent-121, score-0.472]
68 If µ is misspecified, then the convergence rate of the accelerated method may be slower than the basic method. [sent-122, score-0.719]
69 5 Experiments We tested the basic inexact proximal-gradient and accelerated proximal-gradient methods on the CUR-like factorization optimization problem introduced in [33] to approximate a given matrix W: min_X (1/2)||W − W X W||_F^2 + λrow Σ_{i=1}^{nr} ||X^i||_p + λcol Σ_{j=1}^{nc} ||Xj||_p. [sent-123, score-0.958]
70 Under an appropriate choice of p, this optimization problem yields a matrix X with sparse rows and sparse columns, meaning that entire rows and columns of the matrix X are set to exactly zero. [sent-124, score-0.185]
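A hedged sketch of evaluating this objective for p = 2 is given below; the variable names and the pairing of λrow with the row sums and λcol with the column sums follow our reading of the (partially garbled) formula above, not code from [33].

```python
import numpy as np

def cur_like_objective(W, X, lam_row, lam_col, p=2):
    # f(X) = 0.5*||W - W X W||_F^2 + lam_row * sum_i ||X^i||_p + lam_col * sum_j ||X_j||_p,
    # where X^i denotes the i-th row and X_j the j-th column of X.
    residual = W - W @ X @ W
    smooth = 0.5 * np.linalg.norm(residual, 'fro') ** 2
    row_pen = lam_row * sum(np.linalg.norm(X[i, :], ord=p) for i in range(X.shape[0]))
    col_pen = lam_col * sum(np.linalg.norm(X[:, j], ord=p) for j in range(X.shape[1]))
    return smooth + row_pen + col_pen
```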
71 In [33], the authors used an accelerated proximal-gradient method and chose p = ∞ since under this choice the proximity operator can be computed exactly. [sent-125, score-0.97]
72 The more natural choice of p = 2 was not explored since in this case there is no known algorithm to exactly compute the proximity operator. [sent-127, score-0.433]
73 In this case, it is possible to very quickly compute an approximate proximity operator using the block coordinate descent (BCD) algorithm presented in [12], which is equivalent to the proximal variant of Dykstra’s algorithm introduced by [34]. [sent-129, score-0.813]
74 In our implementation of the BCD method, we alternate between computing the proximity operator with respect to the rows and to the columns. [sent-130, score-0.535]
75 Since the BCD method allows us to compute a duality gap when solving the proximal problem, we can run the method until the duality gap is below a given error threshold εk to find an xk+1 satisfying (4). [sent-131, score-0.511]
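The stopping logic just described can be sketched as follows. The row/column block updates and the duality-gap computation of the BCD / proximal-Dykstra method are abstracted into callables, since only the termination rule (duality gap below εk) is being illustrated; this is a schematic sketch, not the authors' implementation.

```python
def approximate_prox(y, eps_k, row_step, col_step, duality_gap, max_inner=1000):
    # Schematic inner solver: alternate block updates with respect to the rows and
    # the columns until the duality gap certifies that the current point satisfies (4)
    # with accuracy eps_k.
    x = y.copy()
    state = None   # auxiliary variables carried across the Dykstra-style sweeps
    for _ in range(max_inner):
        x, state = row_step(x, y, state)
        x, state = col_step(x, y, state)
        if duality_gap(x, y, state) <= eps_k:
            break
    return x
```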
76 Rather than assuming we are given the Lipschitz constant L, on the first iteration we set L to 1 and following [2] we double our estimate anytime g(xk) > g(yk−1) + ⟨g′(yk−1), xk − yk−1⟩ + (L/2)||xk − yk−1||^2. [sent-135, score-0.331]
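The doubling rule for the Lipschitz estimate amounts to a one-line check; this sketch assumes NumPy arrays and is our own paraphrase of the rule above, with names chosen for illustration.

```python
import numpy as np

def maybe_double_L(L, g, g_grad, x_k, y_prev):
    # Double the Lipschitz estimate whenever the quadratic upper bound at y_{k-1}
    # is violated at the new point x_k:
    #   g(x_k) > g(y_{k-1}) + <g'(y_{k-1}), x_k - y_{k-1}> + (L/2)||x_k - y_{k-1}||^2.
    diff = x_k - y_prev
    upper_bound = g(y_prev) + np.dot(g_grad(y_prev), diff) + 0.5 * L * np.dot(diff, diff)
    return 2.0 * L if g(x_k) > upper_bound else L
```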
77 We tested three different ways to terminate the approximate proximal problem, each parameterized by a parameter α: • εk = 1/k^α: Running the BCD algorithm until the duality gap is below 1/k^α. [sent-136, score-0.345]
78 Note that the iterates produced by the BCD iterations are sparse, so we expected the algorithms to spend the majority of their time solving the proximity problem. [sent-140, score-0.508]
79 We plot the results after 500 BCD iterations for the first two data sets for the proximal-gradient method in Figure 1, and the accelerated proximal-gradient method in Figure 2. [sent-142, score-0.48]
80 In these plots, the first column varies α using the choice εk = 1/k^α, the second column varies α using the choice εk = α, and the third column varies α using the choice n = α. [sent-144, score-0.252]
81 In the context of proximal-gradient methods the choice of εk = 1/k^3, which is one choice that achieves the fastest convergence rate according to our analysis, gives the best performance across all four data sets. [sent-146, score-0.407]
82 Similar trends are observed for the case of accelerated proximal-gradient methods, though the choice of εk = 1/k^3 (which no longer achieves the fastest convergence rate according to our analysis) no longer dominates the other methods in the accelerated setting. [sent-150, score-1.081]
83 εk = 1/k^4, which is a choice that achieves the fastest convergence rate up to a poly-logarithmic factor, yields better performance than εk = 1/k^3. [sent-158, score-0.327]
84 Interestingly, the only choice that yields the fastest possible convergence rate (εk = 1/k^5) had reasonable performance but did not give the best performance on any data set. [sent-159, score-0.327]
85 This seems to reflect the trade-off between performing inner BCD iterations to achieve a small duality gap and performing outer gradient iterations to decrease the value of f . [sent-160, score-0.398]
86 6 Discussion Alternatives to inexact proximal methods for solving structured sparsity problems are smoothing methods [35] and alternating direction methods [36]. [sent-162, score-0.761]
87 Further, the accelerated smoothing method only has a convergence rate of O(1/k), and the performance of alternating direction methods is often sensitive to the exact choice of their penalty parameter. [sent-165, score-0.774]
88 On the other hand, while our analysis suggests using a sequence of errors like O(1/k α ) for α large enough, the practical performance of inexact proximal-gradients methods will be sensitive to the exact choice of this sequence. [sent-166, score-0.666]
89 Although we have illustrated the use of our results in the context of a structured sparsity problem, inexact proximal-gradient methods are also used in other applications such as total-variation [7, 8] and nuclear-norm [9, 10] regularization. [sent-167, score-0.508]
90 This work provides a theoretical justification for using inexact proximal-gradient methods in these and other applications, and suggests some guidelines for practitioners that do not want to lose the appealing convergence rates of these methods. [sent-168, score-0.664]
91 Further, although our experiments and much of our discussion focus on errors in the calculation of the proximity operator, our analysis also allows for an error in the calculation of the gradient. [sent-169, score-0.788]
92 For example, errors in the calculation of the gradient arise when fitting undirected graphical models and using an iterative method to approximate the gradient of the log-partition function [37]. [sent-171, score-0.485]
93 In our analysis, we assume that the smoothness constant L is known, but it would be interesting to extend methods for estimating L in the exact case [2] to the case of inexact algorithms. [sent-173, score-0.453]
94 In the context of accelerated methods for strongly convex optimization, our analysis also assumes that µ is known, and it would be interesting to explore variants that do not make this assumption. [sent-174, score-0.574]
95 We also note that if the basic proximal-gradient method is given knowledge of µ, then our analysis can be modified to obtain a faster linear convergence rate of (1 − γ)/(1 + γ) instead of (1 − γ) for strongly-convex optimization using a step size of 2/(µ + L), see Theorem 2. [sent-175, score-0.38]
96 Finally, we note that there has been recent interest in inexact proximal Newton-like methods [40], and it would be interesting to analyze the effect of errors on the convergence rates of these methods. [sent-178, score-0.955]
97 First-order methods of smooth convex optimization with inexact oracle. [sent-316, score-0.621]
98 Error bounds and convergence analysis of feasible descent methods: A general approach. [sent-327, score-0.173]
99 Convergence rates of inexact proximal-gradient methods for convex optimization. [sent-356, score-0.578]
100 On accelerated proximal gradient methods for convex-concave optimization, 2008. [sent-370, score-0.694]
wordName wordTfidf (topN-words)
[('proximity', 0.398), ('inexact', 0.39), ('accelerated', 0.372), ('xk', 0.306), ('ek', 0.235), ('proximal', 0.199), ('bcd', 0.177), ('ak', 0.164), ('yk', 0.151), ('convergence', 0.15), ('operator', 0.137), ('errors', 0.124), ('calculation', 0.117), ('convex', 0.096), ('proposition', 0.095), ('gradient', 0.091), ('proximalgradient', 0.087), ('bk', 0.087), ('basic', 0.072), ('rate', 0.072), ('schmidt', 0.07), ('duality', 0.068), ('operators', 0.066), ('iterate', 0.062), ('rates', 0.06), ('optimization', 0.058), ('iterates', 0.058), ('iterations', 0.052), ('fastest', 0.048), ('roux', 0.047), ('recursion', 0.045), ('smooth', 0.045), ('nicolas', 0.045), ('gap', 0.044), ('bauschke', 0.044), ('proxl', 0.044), ('strategies', 0.041), ('convexity', 0.041), ('jenatton', 0.039), ('obozinski', 0.039), ('strongly', 0.039), ('combettes', 0.038), ('dykstra', 0.038), ('summable', 0.038), ('decreases', 0.037), ('decrease', 0.037), ('ei', 0.037), ('siam', 0.036), ('overlapping', 0.036), ('propositions', 0.035), ('context', 0.035), ('decreasing', 0.035), ('choice', 0.035), ('monotone', 0.035), ('approximate', 0.034), ('wright', 0.034), ('arxiv', 0.033), ('terminating', 0.033), ('col', 0.033), ('error', 0.032), ('appealing', 0.032), ('mairal', 0.032), ('methods', 0.032), ('achieve', 0.032), ('tumors', 0.031), ('subsampling', 0.031), ('exact', 0.031), ('composite', 0.031), ('group', 0.03), ('francis', 0.03), ('nowozin', 0.03), ('sra', 0.03), ('row', 0.03), ('sensitive', 0.029), ('method', 0.028), ('rd', 0.027), ('meaning', 0.026), ('varies', 0.026), ('sparsity', 0.026), ('slower', 0.025), ('sequence', 0.025), ('jmlr', 0.025), ('regularizers', 0.025), ('bach', 0.025), ('iteration', 0.025), ('structured', 0.025), ('smoothing', 0.025), ('differentiable', 0.025), ('substantial', 0.024), ('assumptions', 0.024), ('objective', 0.023), ('projection', 0.023), ('column', 0.023), ('descent', 0.023), ('stronger', 0.023), ('variant', 0.022), ('examined', 0.022), ('yields', 0.022), ('outer', 0.022), ('sparse', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
2 0.19513202 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
3 0.14743017 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
4 0.13979286 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
Author: Andrew Cotter, Ohad Shamir, Nati Srebro, Karthik Sridharan
Abstract: Mini-batch algorithms have been proposed as a way to speed-up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice. 1
5 0.10318264 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
6 0.09963958 251 nips-2011-Shaping Level Sets with Submodular Functions
7 0.091985658 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
8 0.08753784 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
9 0.084863789 268 nips-2011-Speedy Q-Learning
10 0.082520105 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
11 0.0820942 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
12 0.078319006 260 nips-2011-Sparse Features for PCA-Like Linear Regression
13 0.072875038 4 nips-2011-A Convergence Analysis of Log-Linear Training
14 0.067528896 202 nips-2011-On the Universality of Online Mirror Descent
15 0.065283783 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
16 0.064298324 72 nips-2011-Distributed Delayed Stochastic Optimization
17 0.064142741 282 nips-2011-The Fast Convergence of Boosting
18 0.060859274 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
19 0.05949283 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.058896445 199 nips-2011-On fast approximate submodular minimization
topicId topicWeight
[(0, 0.164), (1, -0.013), (2, -0.056), (3, -0.116), (4, -0.096), (5, 0.068), (6, -0.067), (7, 0.085), (8, -0.079), (9, 0.056), (10, 0.067), (11, -0.052), (12, -0.065), (13, -0.015), (14, -0.039), (15, 0.053), (16, 0.01), (17, 0.073), (18, 0.015), (19, 0.092), (20, -0.075), (21, 0.047), (22, -0.069), (23, -0.055), (24, -0.132), (25, 0.046), (26, 0.032), (27, -0.062), (28, 0.036), (29, -0.062), (30, -0.07), (31, 0.078), (32, -0.178), (33, 0.056), (34, -0.077), (35, 0.05), (36, -0.058), (37, -0.156), (38, 0.0), (39, -0.148), (40, 0.029), (41, 0.075), (42, -0.048), (43, -0.174), (44, 0.028), (45, -0.115), (46, 0.056), (47, 0.03), (48, -0.038), (49, 0.192)]
simIndex simValue paperId paperTitle
same-paper 1 0.95439339 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
2 0.87262702 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
3 0.69824195 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
4 0.52728158 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
Author: Yichuan Zhang, Charles A. Sutton
Abstract: The performance of Markov chain Monte Carlo methods is often sensitive to the scaling and correlations between the random variables of interest. An important source of information about the local correlation and scale is given by the Hessian matrix of the target distribution, but this is often either computationally expensive or infeasible. In this paper we propose MCMC samplers that make use of quasiNewton approximations, which approximate the Hessian of the target distribution from previous samples and gradients generated by the sampler. A key issue is that MCMC samplers that depend on the history of previous states are in general not valid. We address this problem by using limited memory quasi-Newton methods, which depend only on a fixed window of previous samples. On several real world datasets, we show that the quasi-Newton sampler is more effective than standard Hamiltonian Monte Carlo at a fraction of the cost of MCMC methods that require higher-order derivatives. 1
5 0.50140357 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Author: Benjamin Recht, Christopher Re, Stephen Wright, Feng Niu
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other’s work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude. 1
6 0.4742941 260 nips-2011-Sparse Features for PCA-Like Linear Regression
7 0.46250913 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
8 0.45281428 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
9 0.44373444 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
10 0.4407298 4 nips-2011-A Convergence Analysis of Log-Linear Training
11 0.43904859 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
12 0.43347633 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
13 0.41196814 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
14 0.40957478 268 nips-2011-Speedy Q-Learning
15 0.40521109 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
16 0.38103378 72 nips-2011-Distributed Delayed Stochastic Optimization
17 0.37817132 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
18 0.36984488 73 nips-2011-Divide-and-Conquer Matrix Factorization
19 0.36715555 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
20 0.35129404 251 nips-2011-Shaping Level Sets with Submodular Functions
topicId topicWeight
[(0, 0.047), (4, 0.02), (20, 0.014), (26, 0.026), (31, 0.045), (33, 0.014), (43, 0.056), (45, 0.562), (57, 0.02), (74, 0.054), (83, 0.026), (99, 0.029)]
simIndex simValue paperId paperTitle
1 0.99797541 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
same-paper 2 0.99753845 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.99397445 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(√(1/m))^β), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
4 0.99285561 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f(x) at any query point x ∈ X. We demonstrate a generalization of the ellipsoid algorithm that incurs O(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T. 1
5 0.98929042 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.98489815 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
7 0.98422414 293 nips-2011-Understanding the Intrinsic Memorability of Images
8 0.9490236 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
9 0.91483569 19 nips-2011-Active Classification based on Value of Classifier
10 0.90769535 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
11 0.90407729 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
12 0.90236628 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
13 0.9006151 220 nips-2011-Prediction strategies without loss
14 0.89950442 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
15 0.89818633 78 nips-2011-Efficient Methods for Overlapping Group Lasso
16 0.89325219 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
17 0.88840562 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
18 0.88674718 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
19 0.88578647 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
20 0.88358891 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model