nips nips2012 nips2012-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. [sent-4, score-0.457]
2 We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. [sent-5, score-0.285]
3 The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. [sent-6, score-0.161]
4 For large-scale unconstrained smooth convex problems, two classes of methods have seen the most success: limited memory quasi-Newton methods and non-linear conjugate gradient (CG) methods. [sent-11, score-0.253]
5 Both of these methods generally outperform simpler methods, such as gradient descent. [sent-12, score-0.07]
6 For problems with non-smooth terms and/or constraints, it is possible to generalize gradient descent with proximal gradient descent (which includes projected gradient descent as a sub-case), which is just the application of the forward-backward algorithm [1]. [sent-13, score-0.548]
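As a rough illustration of the forward-backward (proximal gradient) step just mentioned, here is a minimal NumPy sketch for the LASSO objective 0.5*||Ax - b||^2 + lam*||x||_1; the function name and parameters are ours, not the paper's, and with lam = 0 the iteration reduces to plain gradient descent.

```python
import numpy as np

def proximal_gradient_lasso(A, b, lam, step, n_iter=200):
    """Minimal forward-backward (proximal gradient) sketch for
    0.5*||Ax - b||^2 + lam*||x||_1.  With lam = 0 the prox is the identity
    and the iteration reduces to plain gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)          # gradient of the smooth term
        z = x - step * grad               # forward (explicit) step
        # backward (implicit) step: prox of step*lam*||.||_1 = soft-thresholding
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x
```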
7 Unlike gradient descent, quasi-Newton and CG methods are not easy to adapt to problems involving constraints and non-smooth terms. [sent-14, score-0.07]
8 In the limit, as the active set is correctly identified, the methods behave similarly to their unconstrained counterparts. [sent-16, score-0.071]
9 These methods have seen success, but are not as efficient or as elegant as the unconstrained versions. [sent-17, score-0.113]
10 1 Problem statement. Let $\mathcal{H} = (\mathbb{R}^N, \langle\cdot,\cdot\rangle)$ be equipped with the usual Euclidean scalar product $\langle x, y\rangle = \sum_{i=1}^N x_i y_i$ and associated norm $\|x\| = \sqrt{\langle x, x\rangle}$. [sent-20, score-0.077]
11 The domain of f is defined by dom f = {x ∈ H : f(x) < +∞} and f is proper if dom f ≠ ∅. [sent-25, score-0.124]
12 The class of all proper lsc convex functions from H to R ∪ {+∞} is denoted by Γ0(H). [sent-35, score-0.116]
13 The class we consider covers non-smooth convex optimization problems, including those with convex constraints. [sent-43, score-0.175]
14 2 Contributions. This paper introduces a class of scaled norms for which we can compute a proximity operator; these results are significant in their own right, since previous results cover only diagonal scaling (and the diagonal scaling result is trivial). [sent-52, score-0.569]
15 Then, motivated by the discrepancy between constrained and unconstrained performance, we define a class of limited-memory quasi-Newton methods to solve (P) that extends naturally and elegantly from the unconstrained to the constrained case. [sent-53, score-0.289]
16 Most well-known quasi-Newton methods for constrained problems, such as L-BFGS-B [2], are only applicable to box constraints l ≤ x ≤ u. [sent-54, score-0.088]
17 The power of our approach is that it applies to a wide variety of useful non-smooth functionals (see §3. [sent-55, score-0.05]
18 The approach uses the zero-memory SR1 algorithm, and we provide evidence that the non-diagonal term provides significant improvements over diagonal Hessians. [sent-58, score-0.139]
19 1 Quasi-Newton forward-backward splitting: The algorithm. In the following, define the quadratic approximation $Q_B(x) = f(x_k) + \langle \nabla f(x_k), x - x_k\rangle + \tfrac{1}{2}\|x - x_k\|_B^2$ (4), where $B \in S_{++}(N)$. [sent-60, score-0.568]
20 $x_{k+1} = \operatorname{argmin}_x\, Q_{B_k}(x) + h(x)$ (5). Note that this specializes to gradient descent when h = 0. [sent-62, score-0.117]
21 Therefore, if f is a strictly convex quadratic function and one takes $B_k = \nabla^2 f(x_k)$, then we obtain the Newton method. [sent-63, score-0.161]
22 Since f is smooth and can be approximated by a quadratic, and inspired by quasi-Newton methods, this suggests picking $B_k$ as an approximation of the Hessian. [sent-66, score-0.042]
23 Our diagonal+rank-1 quasi-Newton forward-backward splitting algorithm is listed in Algorithm 1, with the quasi-Newton update detailed in Algorithm 2 (see §4). [sent-68, score-0.096]
24 Algorithm 1: Zero-memory Symmetric Rank 1 (0SR1) algorithm to solve min f + h. Require: $x_0 \in \operatorname{dom}(f + h)$, Lipschitz constant estimate L of $\nabla f$, stopping criterion $\epsilon$. 1: for k = 1, 2, 3, . . . [sent-74, score-0.079]
25 do 2: $s_k \leftarrow x_k - x_{k-1}$ 3: $y_k \leftarrow \nabla f(x_k) - \nabla f(x_{k-1})$ 4: Compute $H_k$ via Algorithm 2, and define $B_k = H_k^{-1}$. [sent-77, score-0.748]
26 5: Compute the rank-1 proximity operator (see §3): $\hat{x}_{k+1} \leftarrow \operatorname{prox}^{B_k}_h(x_k - H_k \nabla f(x_k))$ (6) 6: $p_k \leftarrow \hat{x}_{k+1} - x_k$, and terminate if $\|p_k\| < \epsilon$ 7: Line-search along the ray $x_k + t p_k$ to determine $x_{k+1}$, or choose t = 1. [sent-78, score-0.776]
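Below is a minimal sketch of the main loop of Algorithm 1, taking t = 1 (no line search). The helpers grad_f, prox_in_metric, and diag_rank1_update are assumed callables (the latter two are sketched further below, where §3 and §4 are discussed); all names and defaults are ours, not the paper's.

```python
import numpy as np

def zero_sr1(x0, grad_f, prox_in_metric, diag_rank1_update, eps=1e-8, max_iter=500):
    """Sketch of the main loop of Algorithm 1 (0SR1), taking t = 1 (no line search).

    grad_f(x)                : gradient of the smooth term f.
    prox_in_metric(y, d, u)  : assumed to return prox_h^{B_k}(y) where
                               H_k = diag(d) + u u^T and B_k = H_k^{-1}
                               (computable from the rank-1 prox results of Sec. 3
                               together with the Sherman-Morrison formula).
    diag_rank1_update(s, y)  : assumed to return (d, u) as in Algorithm 2.
    """
    x = np.asarray(x0, dtype=float).copy()
    g = grad_f(x)
    x_prev, g_prev = None, None

    for _ in range(max_iter):
        if x_prev is None:
            # no curvature information yet: fall back to a plain diagonal metric
            d, u = np.ones_like(x), np.zeros_like(x)
        else:
            d, u = diag_rank1_update(x - x_prev, g - g_prev)   # steps 2-4

        H_g = d * g + u * (u @ g)                 # H_k * grad f(x_k)
        x_hat = prox_in_metric(x - H_g, d, u)     # step 5, Eq. (6)
        p = x_hat - x
        if np.linalg.norm(p) < eps:               # step 6
            return x_hat
        x_prev, g_prev = x, g
        x, g = x_hat, grad_f(x_hat)               # step 7 with t = 1
    return x
```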
27 2 Relation to prior work. First-order methods: The algorithm in (5) is variously known as proximal descent or the iterated shrinkage/thresholding algorithm (IST or ISTA). [sent-80, score-0.164]
28 The spectral projected gradient (SPG) [4] method was designed as an extension of the BarzilaiBorwein spectral step-length method to constrained problems. [sent-82, score-0.203]
29 In [5], it was extended to non-smooth problems by allowing general proximity operators; the Barzilai-Borwein method [6] uses a specific choice of step-length $t_k$ motivated by quasi-Newton methods. [sent-83, score-0.302]
30 Nesterov acceleration can be viewed as an over-relaxed version of ISTA with a specific, non-constant over-relaxation parameter αk . [sent-87, score-0.049]
31 The general diagonal case was considered in several papers in the 1980s as a simple quasi-Newton method, but never widely adopted. [sent-89, score-0.139]
32 A convergence rate analysis of forward-backward splitting with static and variable Bk where one of the operators is maximal strongly monotone is given in [10]. [sent-91, score-0.158]
33 Active set approaches: Active set methods take a simple step, such as gradient projection, to identify active variables, and then use a more advanced quadratic model to solve for the free variables. [sent-92, score-0.21]
34 A recent bound-constrained solver is ASA [13] which uses a conjugate gradient (CG) solver on the free variables, and shows good results compared to L-BFGS-B, SPG, GENCAN and TRON. [sent-94, score-0.07]
35 We also compare to several active set approaches specialized for 1 penalties: “Orthant-wise Learning” (OWL) [14], “Projected Scaled Sub-gradient + Active Set” (PSSas) [15], “Fixed-point continuation + Active Set” (FPC AS) [16], and “CG + IST” (CGIST) [17]. [sent-95, score-0.053]
36 IPM requires solving a Newton-step equation, so first-order-like "Hessian-free" variants of IPM solve the Newton step approximately, either by approximately solving the equation or by subsampling the Hessian. [sent-97, score-0.041]
37 Yet another approach is to include the non-smooth h term in the quadratic approximation. [sent-99, score-0.046]
38 The projected quasi-Newton (PQN) algorithm [19, 20] is perhaps the most elegant and logical extension of quasi-Newton methods, but it involves solving a sub-iteration. [sent-102, score-0.122]
39 As discussed in [21], it is possible to generalize PQN to general non-smooth problems whenever the proximity operator is known (since, as mentioned above, it is possible to extend SPG to this case). [sent-105, score-0.35]
40 3 Proximity operators and proximal calculus. Due to space limitations, we only recall essential definitions. [sent-106, score-0.217]
41 Further notions and results from convex analysis, as well as proofs, can be found in the supplementary material. [sent-107, score-0.07]
42 Then, for every x ∈ H, the function $z \mapsto \tfrac{1}{2}\|x - z\|^2 + h(z)$ achieves its infimum at a unique point denoted by $\operatorname{prox}_h x$. [sent-110, score-0.227]
43 The uniquely-valued operator $\operatorname{prox}_h : \mathcal{H} \to \mathcal{H}$ thus defined is the proximity operator or proximal mapping of h. [sent-111, score-0.75]
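As a standard concrete instance of this definition (not specific to this paper), the proximity operator of h = λ‖·‖_1 is componentwise soft-thresholding; a minimal sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """prox of lam*||.||_1: the minimizer of 0.5*||x - z||^2 + lam*||z||_1,
    which decouples coordinatewise and gives soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```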
44 1 Proximal calculus in $\mathcal{H}_V$. Throughout, we denote $\operatorname{prox}^V_h = (\mathrm{I}_{\mathcal{H}_V} + V^{-1}\partial h)^{-1}$, where $\partial h$ is the subdifferential of h, the proximity operator of h w.r.t. the norm $\|\cdot\|_V$. [sent-113, score-0.388]
45 Note that since $V \in S_{++}(N)$, the proximity operator $\operatorname{prox}^V_h$ is well-defined. [sent-117, score-0.251]
46 1 Diagonal+rank-1: General case. Theorem 7 (Proximity operator in $\mathcal{H}_V$). [sent-124, score-0.099]
47 Let $h \in \Gamma_0(\mathcal{H})$ and $V = D + uu^T$, where D is diagonal with (strictly) positive diagonal elements $d_i$, and $u \in \mathbb{R}^N$. [sent-125, score-0.315]
48 Then, $\operatorname{prox}^V_h(x) = D^{-1/2} \circ \operatorname{prox}_{h\circ D^{-1/2}}\big(D^{1/2}x - v\big)$, (9) where $v = \alpha D^{-1/2}u$ and $\alpha$ is the unique root of $p(\alpha) = \big\langle u,\; x - D^{-1/2}\circ\operatorname{prox}_{h\circ D^{-1/2}}\circ D^{1/2}\,(x - \alpha D^{-1}u)\big\rangle + \alpha$, (10) which is a Lipschitz continuous and strictly increasing function on $\mathbb{R}$ with Lipschitz constant $1 + \sum_i u_i^2/d_i$. [sent-126, score-0.17]
49 • Computing $\operatorname{prox}^V_h$ amounts to solving a scalar optimization problem that involves the computation of $\operatorname{prox}_{h\circ D^{-1/2}}$. [sent-128, score-0.07]
50 The latter can be much simpler to compute as D is diagonal (beyond the obvious separable case that we will consider shortly). [sent-129, score-0.296]
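The following is a minimal sketch of the reduction in Theorem 7, assuming a user-supplied callable prox_scaled_h(y) that returns prox_{h∘D^{-1/2}}(y); the scalar root of p(α) is found here by plain bisection (exploiting monotonicity) rather than by the sorting or semi-smooth Newton schemes discussed next, and all names are ours.

```python
import numpy as np

def prox_rank1_general(x, d, u, prox_scaled_h, tol=1e-10, max_iter=200):
    """Sketch of Theorem 7: prox_h^V(x) with V = diag(d) + u u^T, d > 0.
    prox_scaled_h(y) is assumed to return prox_{h o D^{-1/2}}(y)."""
    sd = np.sqrt(d)

    def pi(alpha):
        # D^{-1/2} o prox_{h o D^{-1/2}} o D^{1/2} applied to x - alpha * D^{-1} u
        return prox_scaled_h(sd * (x - alpha * u / d)) / sd

    def p(alpha):                       # the scalar function of Eq. (10)
        return u @ (x - pi(alpha)) + alpha

    # p is continuous and strictly increasing with a unique root:
    # grow a bracket, then bisect.
    lo, hi = -1.0, 1.0
    while p(lo) > 0:
        lo *= 2.0
    while p(hi) < 0:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if p(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    alpha = 0.5 * (lo + hi)

    v = alpha * u / sd                  # v = alpha * D^{-1/2} u, as in Eq. (9)
    return prox_scaled_h(sd * x - v) / sd
```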
51 2 Diagonal+rank-1: Separable case The following corollary is key to our novel optimization algorithm. [sent-134, score-0.077]
52 Suppose $h(x) = \sum_{i=1}^N h_i(x_i)$, and $V = D + uu^T$, where D is diagonal with (strictly) positive diagonal elements $d_i$, and $u \in \mathbb{R}^N$. [sent-138, score-0.315]
53 Then, $\operatorname{prox}^V_h(x) = \big(\operatorname{prox}_{h_i/d_i}(x_i - v_i/d_i)\big)_i$, (11) where $v = \alpha u$ and $\alpha$ is the unique root of $p(\alpha) = \big\langle u,\; x - \big(\operatorname{prox}_{h_i/d_i}(x_i - \alpha u_i/d_i)\big)_i\big\rangle + \alpha$, (12) which is a Lipschitz continuous and strictly increasing function on $\mathbb{R}$. [sent-139, score-0.41]
54 Proof: As h is separable and $D \in S_{++}(N)$ is diagonal, applying Theorem 7 yields the desired result. [sent-140, score-0.157]
55 Assume that for $1 \le i \le N$, $\operatorname{prox}_{h_i}$ is piecewise affine on $\mathbb{R}$ with $k_i \ge 1$ segments, i.e. [sent-142, score-0.237]
56 $\operatorname{prox}_{h_i}(x_i) = a_j x_i + b_j$ for $t_j \le x_i \le t_{j+1}$, $j \in \{1, \ldots, k_i\}$. [sent-144, score-0.201]
57 Then $\operatorname{prox}^V_h(x)$ can be obtained exactly by sorting at most the k real values $\big\{\tfrac{d_i}{u_i}(x_i - t_j)\big\}_{(i,j)}$. [sent-149, score-0.064]
58 When $\operatorname{prox}_{h_i}$ is piecewise affine with $k_i$ segments, it is easy to see that $p(\alpha)$ in (12) is also piecewise affine, with slopes and intercepts changing at the k transition points $\big\{\tfrac{d_i}{u_i}(x_i - t_j)\big\}_{(i,j)}$. [sent-157, score-0.423]
59 To get $\alpha$, it is sufficient to isolate the unique segment that intersects the abscissa axis. [sent-164, score-0.043]
60 This can be achieved by sorting the values of the transition points, which costs O(k log k) on average. [sent-165, score-0.064]
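Here is a sketch of the sort-based exact computation for the separable case with h = λ‖·‖_1, whose prox (soft-thresholding) is piecewise affine with three segments; for simplicity p(α) is evaluated at every breakpoint (O(kN)) rather than with the O(k log k) running-slope bookkeeping described above. Inputs are assumed to be NumPy arrays and the function names are ours.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_rank1_l1(x, d, u, lam):
    """Exact prox_h^V(x) for h = lam*||.||_1 and V = diag(d) + u u^T (d > 0),
    found via the root of the piecewise-affine, increasing p(alpha) of Eq. (12)."""
    def p(alpha):
        z = soft_threshold(x - alpha * u / d, lam / d)
        return u @ (x - z) + alpha

    # Breakpoints of p: where a component of x - alpha*u/d crosses +/- lam/d.
    nz = u != 0
    bp = np.concatenate([
        (x[nz] - lam / d[nz]) * d[nz] / u[nz],
        (x[nz] + lam / d[nz]) * d[nz] / u[nz],
    ])
    bp = np.unique(bp)                       # sorted breakpoints

    if bp.size == 0:                         # u == 0: V is diagonal, so alpha = 0
        alpha = 0.0
    else:
        pv = np.array([p(a) for a in bp])    # increasing, since p is increasing
        j = np.searchsorted(pv, 0.0)
        if j == 0:                           # root lies on the leftmost affine piece
            a0, a1 = bp[0] - 1.0, bp[0]
        elif j == bp.size:                   # root lies on the rightmost affine piece
            a0, a1 = bp[-1], bp[-1] + 1.0
        else:                                # root bracketed by consecutive breakpoints
            a0, a1 = bp[j - 1], bp[j]
        p0, p1 = p(a0), p(a1)
        # p is affine on [a0, a1], so linear interpolation there is exact
        alpha = a0 if p1 == p0 else a0 - p0 * (a1 - a0) / (p1 - p0)

    v = alpha * u
    return soft_threshold(x - v / d, lam / d)   # Eq. (11)
```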
61 • Corollary 9 can be extended to the "block" separable case (i.e. [sent-173, score-0.157]
62 separable in subsets of coordinates) when D is piecewise constant along the same block indices. [sent-175, score-0.204]
63 If no closed-form is available, one can appeal to some efficient iterative method to solve (10) (or (12)). [sent-179, score-0.041]
64 The semi-smooth Newton method for the solution of (10) can be stated as the iteration $\alpha_{t+1} = \alpha_t - g(\alpha_t)^{-1} p(\alpha_t)$, (13) where g is a generalized derivative of p. [sent-181, score-0.074]
65 If $\operatorname{prox}_{h\circ D^{-1/2}}$ is Newton differentiable with generalized derivative G, then so is the mapping p, with a generalized derivative $g(\alpha) = 1 + \big\langle u,\; D^{-1/2}\circ G(D^{1/2}x - \alpha D^{-1/2}u)\circ D^{-1/2}u\big\rangle$. Furthermore, g is nonsingular with a uniformly bounded inverse on $\mathbb{R}$. [sent-183, score-0.243]
66 Thus, as p is Newton differentiable with nonsingular generalized derivative whose inverse is also bounded, the general semi-smooth Newton convergence theorem implies that (13) converges superlinearly to the unique root of (10). [sent-188, score-0.252]
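A minimal sketch of iteration (13) for the ℓ1 separable case follows, where a generalized (Newton) derivative of p is simply g(α) = 1 + Σ_{i active} u_i²/d_i, the active components being those on a linear piece of the soft-thresholding prox. Convergence is local (superlinear) per the text; a globalization safeguard, e.g. falling back to bisection, is omitted in this sketch, and the function name is ours.

```python
import numpy as np

def semismooth_newton_alpha(x, d, u, lam, alpha0=0.0, tol=1e-12, max_iter=50):
    """Iteration (13) for the separable case h = lam*||.||_1: p(alpha) from (12)
    with soft-thresholding proxes, and generalized derivative
    g(alpha) = 1 + sum over 'active' i of u_i^2 / d_i."""
    alpha = alpha0
    for _ in range(max_iter):
        z = x - alpha * u / d
        active = np.abs(z) > lam / d                  # components on a linear piece of the prox
        s = np.sign(z) * np.maximum(np.abs(z) - lam / d, 0.0)
        p = u @ (x - s) + alpha                       # Eq. (12)
        g = 1.0 + np.sum(u[active] ** 2 / d[active])  # a Newton derivative of p
        step = p / g
        alpha -= step                                 # Eq. (13)
        if abs(step) < tol:
            break
    return alpha
```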
67 For instance, Table 1 summarizes a few of them, where we can obtain either an exact answer by sorting when possible, or else by minimizing w.r.t. the scalar α. [sent-192, score-0.1]
68 4 A primal rank 1 SR1 algorithm Following the conventional quasi-Newton notation, we let B denote an approximation to the Hessian of f and H denote an approximation to the inverse Hessian. [sent-198, score-0.092]
69 All quasi-Newton methods update an approximation to the (inverse) Hessian that satisfies the secant condition: $H_k y_k = s_k$, $y_k = \nabla f(x_k) - \nabla f(x_{k-1})$, $s_k = x_k - x_{k-1}$. (14) Algorithm 1 follows the SR1 method [24], which uses a rank-1 update to the inverse Hessian approximation at every step. [sent-199, score-1.391]
70 In particular, we use a zero-memory approach, which means that at every iteration, a new diagonal plus rank-one matrix is formed. [sent-204, score-0.139]
71 The other modification is to extend the SR1 method to the general setting of minimizing f + h where f is smooth but h need not be smooth; this further generalizes the case when h is an indicator function of a convex set. [sent-205, score-0.112]
72 Every step of the algorithm replaces f with a quadratic approximation, and keeps h unchanged. [sent-206, score-0.046]
73 Choosing $H_0$ (step length): In our experience, the choice of $H_0$ is best if scaled with a Barzilai-Borwein spectral step length $\tau_{BB2} = \langle s_k, y_k\rangle / \langle y_k, y_k\rangle$ (15) (we call it $\tau_{BB2}$ to distinguish it from the other Barzilai-Borwein step size $\tau_{BB1} = \langle s_k, s_k\rangle / \langle s_k, y_k\rangle \ge \tau_{BB2}$). [sent-208, score-2.18]
74 In SR1 methods, the quantity $\langle s_k - H_0 y_k, y_k\rangle$ must be positive in order to have a well-defined update for $u_k$. [sent-209, score-0.93]
75 The update is: $H_k = H_0 + u_k u_k^T$, where $u_k = (s_k - H_0 y_k)/\sqrt{\langle s_k - H_0 y_k, y_k\rangle}$. [sent-210, score-1.325]
76 12: end if 13: end if 14: return $H_k = H_0 + u_k u_k^T$ {$B_k = H_k^{-1}$ can be computed via the Sherman-Morrison formula}. For this reason, we choose $H_0 = \gamma\,\tau_{BB2}\,\mathrm{I}_{\mathcal{H}}$ with $0 < \gamma < 1$, and thus $0 \le \langle s_k - H_0 y_k, y_k\rangle = (1-\gamma)\langle s_k, y_k\rangle$. [sent-212, score-1.465]
77 If $\langle s_k, y_k\rangle = 0$, then there is no symmetric rank-one update that satisfies the secant condition. [sent-213, score-0.591]
78 The inequality $\langle s_k, y_k\rangle > 0$ is the curvature condition, and it is guaranteed for all strictly convex objectives. [sent-214, score-0.65]
79 Following the recommendation in [26], we skip updates whenever $\langle s_k, y_k\rangle$ cannot be guaranteed to be non-zero given standard floating-point precision. [sent-215, score-0.572]
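A sketch of the zero-memory update described in this section follows (H0 = γ τ_BB2 I, rank-1 term built from the secant condition, update skipped when ⟨s, y⟩ cannot be trusted); the identity fallback on a skipped update and the tolerance are our simplifications, not the paper's Algorithm 2 verbatim. This is the helper assumed by the zero_sr1 sketch given after Algorithm 1 above.

```python
import numpy as np

def diag_rank1_update(s, y, gamma=0.8, skip_tol=1e-8):
    """Sketch of Algorithm 2: zero-memory model H_k = H0 + u u^T with
    H0 = gamma * tau_BB2 * I, returned as (d, u) where d is the diagonal of H0.
    The rank-1 part is skipped when <s, y> cannot be trusted to be positive."""
    sy = s @ y
    if sy <= skip_tol * np.linalg.norm(s) * np.linalg.norm(y):
        # simple fallback used in this sketch: identity diagonal, no rank-1 term
        return np.ones_like(s), np.zeros_like(s)
    tau_bb2 = sy / (y @ y)                  # Barzilai-Borwein step length, Eq. (15)
    d = np.full_like(s, gamma * tau_bb2)    # H0 = gamma * tau_BB2 * I, 0 < gamma < 1
    r = s - d * y                           # s_k - H0 y_k
    u = r / np.sqrt(r @ y)                  # <r, y> = (1 - gamma) <s, y> > 0
    return d, u
```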
80 Figure 1: (a) first LASSO test, (b) second LASSO test. Consider the unconstrained LASSO problem (1). [sent-222, score-0.071]
81 However, this constrained problem has twice the number of variables, and the Hessian of the quadratic part changes from $A^TA$ to $\tilde{A} = \begin{pmatrix} A^TA & -A^TA \\ -A^TA & A^TA \end{pmatrix}$, which necessarily has (at least) n degenerate 0 eigenvalues and adversely affects solvers. [sent-226, score-0.053]
82 A similar situation occurs with the hinge-loss function. [sent-227, score-0.046]
83 Our second example uses a square operator A with dimensions $n = 13^3 = 2197$, chosen as a 3D discrete differential operator. [sent-238, score-0.099]
84 This example stems from a numerical analysis problem to solve a discretized PDE as suggested by [28]. [sent-239, score-0.041]
85 6 Conclusions In this paper, we proposed a novel variable metric (quasi-Newton) forward-backward splitting algorithm, designed to efficiently solve non-smooth convex problems structured as the sum of a smooth term and a non-smooth one. [sent-247, score-0.249]
86 We introduced a class of weighted norms induced by diagonal+rank-1 symmetric positive definite matrices, and proposed a whole framework to compute a proximity operator in the weighted norm. [sent-248, score-0.35]
87 We also provided clear evidence that the non-diagonal term provides significant acceleration over diagonal matrices. [sent-250, score-0.188]
88 Another improvement would be to derive an efficient calculation for rank-2 proximity terms, thus allowing a 0-memory BFGS method. [sent-255, score-0.285]
89 However, in general, one must solve an r-dimensional inner problem using the semismooth Newton method. [sent-257, score-0.154]
90 A final possible extension is to take $B_k$ to be diagonal plus rank-1 on diagonal blocks, since if h is separable, this can still be solved by our algorithm (see Remark 10). [sent-258, score-0.278]
91 Nonmonotone spectral projected gradient methods on convex sets. [sent-303, score-0.22]
92 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. [sent-325, score-0.052]
93 Diagonal preconditioning for first order primal-dual algorithms in convex optimization. [sent-338, score-0.07]
94 Remark on algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. ACM Trans. [sent-359, score-0.094]
95 A new active set algorithm for box constrained optimization. [sent-368, score-0.141]
96 A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. [sent-405, score-0.105]
97 Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. [sent-413, score-0.08]
98 Proximal Newton-type methods for minimizing convex objective functions in composite form. [sent-431, score-0.07]
99 A semismooth Newton method for Tikhonov functionals with sparsity constraints. [sent-445, score-0.163]
100 Tackling box-constrained optimization via a new projected quasi-Newton approach. [sent-473, score-0.115]
wordName wordTfidf (topN-words)
[('proxv', 0.338), ('yk', 0.328), ('proximity', 0.251), ('xk', 0.213), ('sk', 0.207), ('proxh', 0.184), ('newton', 0.179), ('separable', 0.157), ('hv', 0.151), ('proxhi', 0.141), ('diagonal', 0.139), ('bk', 0.135), ('proximal', 0.117), ('ih', 0.115), ('nonseparable', 0.113), ('semismooth', 0.113), ('spg', 0.113), ('hk', 0.104), ('bfgs', 0.103), ('operator', 0.099), ('splitting', 0.096), ('asa', 0.092), ('ista', 0.086), ('cgist', 0.085), ('fpc', 0.085), ('pssas', 0.085), ('hessian', 0.081), ('projected', 0.08), ('fortran', 0.075), ('owl', 0.075), ('pqn', 0.075), ('unconstrained', 0.071), ('convex', 0.07), ('gradient', 0.07), ('ipm', 0.069), ('cg', 0.069), ('uk', 0.067), ('sorting', 0.064), ('operators', 0.062), ('dom', 0.062), ('tj', 0.06), ('projector', 0.059), ('lasso', 0.059), ('bauschke', 0.056), ('caen', 0.056), ('mem', 0.056), ('secant', 0.056), ('ax', 0.056), ('fista', 0.055), ('active', 0.053), ('constrained', 0.053), ('inverse', 0.052), ('siam', 0.052), ('tk', 0.051), ('functionals', 0.05), ('ki', 0.049), ('acceleration', 0.049), ('descent', 0.047), ('schmidt', 0.047), ('piecewise', 0.047), ('fadili', 0.046), ('lsc', 0.046), ('uut', 0.046), ('quadratic', 0.046), ('lipschitz', 0.045), ('strictly', 0.045), ('rn', 0.044), ('nonsingular', 0.043), ('byrd', 0.043), ('combettes', 0.043), ('unique', 0.043), ('corollary', 0.042), ('smooth', 0.042), ('yi', 0.042), ('ui', 0.042), ('elegant', 0.042), ('restart', 0.041), ('subroutines', 0.041), ('solve', 0.041), ('derivative', 0.041), ('root', 0.04), ('rank', 0.04), ('solvers', 0.04), ('scaled', 0.04), ('ist', 0.039), ('nocedal', 0.039), ('stopping', 0.038), ('calculus', 0.038), ('remark', 0.037), ('di', 0.037), ('skip', 0.037), ('else', 0.036), ('hinge', 0.036), ('bb', 0.035), ('box', 0.035), ('optimization', 0.035), ('scalar', 0.035), ('preprint', 0.034), ('calculation', 0.034), ('generalized', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
2 0.33376262 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize over x ∈ R^n: f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1
3 0.26524007 300 nips-2012-Scalable nonconvex inexact proximal splitting
Author: Suvrit Sra
Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
Author: Demba Ba, Behtash Babadi, Patrick Purdon, Emery Brown
Abstract: We consider the problem of recovering a sequence of vectors, (x_k)_{k=0}^K, for which the increments x_k − x_{k−1} are S_k-sparse (with S_k typically smaller than S_1), based on linear measurements (y_k = A_k x_k + e_k)_{k=1}^K, where A_k and e_k denote the measurement matrix and noise, respectively. Assuming each A_k obeys the restricted isometry property (RIP) of a certain order, depending only on S_k, we show that in the absence of noise a convex program, which minimizes the weighted sum of the ℓ1-norm of successive differences subject to the linear measurement constraints, recovers the sequence (x_k)_{k=1}^K exactly. This is an interesting result because this convex program is equivalent to a standard compressive sensing problem with a highly-structured aggregate measurement matrix which does not satisfy the RIP requirements in the standard sense, and yet we can achieve exact recovery. In the presence of bounded noise, we propose a quadratically-constrained convex program for recovery and derive bounds on the reconstruction error of the sequence. We supplement our theoretical analysis with simulations and an application to real video data. These further support the validity of the proposed approach for acquisition and recovery of signals with time-varying sparsity.
5 0.1740265 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
6 0.15282668 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
7 0.1475049 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
8 0.13358642 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
9 0.11875609 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
10 0.10902901 319 nips-2012-Sparse Prediction with the $k$-Support Norm
11 0.10709167 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
12 0.10541451 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.10459798 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
14 0.10366789 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
15 0.09530963 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
16 0.091987975 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
17 0.091881007 179 nips-2012-Learning Manifolds with K-Means and K-Flats
18 0.089856915 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
19 0.088638797 206 nips-2012-Majorization for CRFs and Latent Likelihoods
20 0.088069759 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
topicId topicWeight
[(0, 0.222), (1, 0.045), (2, 0.189), (3, -0.119), (4, 0.129), (5, 0.157), (6, 0.013), (7, -0.158), (8, 0.28), (9, 0.05), (10, -0.124), (11, 0.133), (12, 0.09), (13, -0.049), (14, -0.088), (15, -0.019), (16, -0.064), (17, 0.064), (18, 0.019), (19, -0.006), (20, -0.085), (21, 0.057), (22, 0.02), (23, 0.029), (24, 0.035), (25, -0.026), (26, 0.006), (27, -0.049), (28, 0.003), (29, -0.032), (30, 0.013), (31, -0.001), (32, -0.041), (33, 0.023), (34, 0.027), (35, -0.082), (36, -0.009), (37, -0.035), (38, 0.037), (39, 0.05), (40, 0.019), (41, -0.008), (42, 0.014), (43, 0.045), (44, 0.037), (45, -0.017), (46, -0.013), (47, 0.004), (48, -0.028), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.94859827 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
2 0.94076985 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize over x ∈ R^n: f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1
3 0.90211374 300 nips-2012-Scalable nonconvex inexact proximal splitting
Author: Suvrit Sra
Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
Author: Demba Ba, Behtash Babadi, Patrick Purdon, Emery Brown
Abstract: We consider the problem of recovering a sequence of vectors, (x_k)_{k=0}^K, for which the increments x_k − x_{k−1} are S_k-sparse (with S_k typically smaller than S_1), based on linear measurements (y_k = A_k x_k + e_k)_{k=1}^K, where A_k and e_k denote the measurement matrix and noise, respectively. Assuming each A_k obeys the restricted isometry property (RIP) of a certain order, depending only on S_k, we show that in the absence of noise a convex program, which minimizes the weighted sum of the ℓ1-norm of successive differences subject to the linear measurement constraints, recovers the sequence (x_k)_{k=1}^K exactly. This is an interesting result because this convex program is equivalent to a standard compressive sensing problem with a highly-structured aggregate measurement matrix which does not satisfy the RIP requirements in the standard sense, and yet we can achieve exact recovery. In the presence of bounded noise, we propose a quadratically-constrained convex program for recovery and derive bounds on the reconstruction error of the sequence. We supplement our theoretical analysis with simulations and an application to real video data. These further support the validity of the proposed approach for acquisition and recovery of signals with time-varying sparsity.
5 0.66441035 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach
Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1
6 0.65162247 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
7 0.57439494 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
8 0.56157464 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
9 0.56095445 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
10 0.53217298 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization
11 0.52180928 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
12 0.50131994 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.4677707 34 nips-2012-Active Learning of Multi-Index Function Models
14 0.46480137 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
15 0.46145123 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
16 0.4509438 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
17 0.44925579 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
18 0.43683553 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
19 0.43020228 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
20 0.42957401 35 nips-2012-Adaptive Learning of Smoothing Functions: Application to Electricity Load Forecasting
topicId topicWeight
[(0, 0.098), (17, 0.011), (21, 0.028), (38, 0.203), (39, 0.012), (42, 0.037), (54, 0.027), (55, 0.022), (64, 0.187), (74, 0.027), (76, 0.153), (80, 0.09), (92, 0.04)]
simIndex simValue paperId paperTitle
1 0.89555264 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q ⪰ 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values < 10^{-3}). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
same-paper 2 0.88436973 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
3 0.87041795 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
4 0.86632687 147 nips-2012-Graphical Models via Generalized Linear Models
Author: Eunho Yang, Genevera Allen, Zhandong Liu, Pradeep K. Ravikumar
Abstract: Undirected graphical models, also known as Markov networks, enjoy popularity in a variety of applications. The popular instances of these models such as Gaussian Markov Random Fields (GMRFs), Ising models, and multinomial discrete models, however do not capture the characteristics of data in many settings. We introduce a new class of graphical models based on generalized linear models (GLMs) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate multivariate Markov networks given any univariate exponential distribution, such as Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We also provide examples of non-Gaussian high-throughput genomic networks learned via our GLM graphical models. 1
5 0.84460902 221 nips-2012-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
6 0.81650513 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
7 0.81584561 69 nips-2012-Clustering Sparse Graphs
8 0.81483489 86 nips-2012-Convex Multi-view Subspace Learning
9 0.81415802 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
10 0.81394863 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
11 0.81366855 319 nips-2012-Sparse Prediction with the $k$-Support Norm
12 0.81333256 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
13 0.81330162 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
14 0.81309313 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
15 0.81215996 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
16 0.81171054 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
17 0.81162107 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
18 0.81016415 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
19 0.80925393 125 nips-2012-Factoring nonnegative matrices with linear programs
20 0.80895704 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models