nips nips2012 nips2012-27 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. [sent-4, score-0.457]
2 We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. [sent-5, score-0.285]
3 The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. [sent-6, score-0.161]
4 For large-scale unconstrained smooth convex problems, two classes of methods have seen the most success: limited memory quasi-Newton methods and non-linear conjugate gradient (CG) methods. [sent-11, score-0.253]
5 Both of these methods generally outperform simpler methods, such as gradient descent. [sent-12, score-0.07]
6 For problems with non-smooth terms and/or constraints, it is possible to generalize gradient descent with proximal gradient descent (which includes projected gradient descent as a sub-case), which is just the application of the forward-backward algorithm [1]. [sent-13, score-0.548]
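As a rough illustration of the forward-backward (proximal gradient) step just mentioned, here is a minimal NumPy sketch for the LASSO objective 0.5*||Ax - b||^2 + lam*||x||_1; the function name and parameters are ours, not the paper's, and with lam = 0 the iteration reduces to plain gradient descent.

```python
import numpy as np

def proximal_gradient_lasso(A, b, lam, step, n_iter=200):
    """Minimal forward-backward (proximal gradient) sketch for
    0.5*||Ax - b||^2 + lam*||x||_1.  With lam = 0 the prox is the identity
    and the iteration reduces to plain gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)          # gradient of the smooth term
        z = x - step * grad               # forward (explicit) step
        # backward (implicit) step: prox of step*lam*||.||_1 = soft-thresholding
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x
```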
7 Unlike gradient descent, quasi-Newton and CG methods are not easy to adapt to problems involving constraints and non-smooth terms. [sent-14, score-0.07]
8 In the limit, as the active set is correctly identified, the methods behave similarly to their unconstrained counterparts. [sent-16, score-0.071]
9 These methods have seen success, but are not as efficient or as elegant as the unconstrained versions. [sent-17, score-0.113]
10 1 Problem statement. Let $\mathcal{H} = (\mathbb{R}^N, \langle\cdot,\cdot\rangle)$ be equipped with the usual Euclidean scalar product $\langle x, y\rangle = \sum_{i=1}^N x_i y_i$ and associated norm $\|x\| = \sqrt{\langle x, x\rangle}$. [sent-20, score-0.077]
11 The domain of f is defined by dom f = {x ∈ H : f(x) < +∞} and f is proper if dom f ≠ ∅. [sent-25, score-0.124]
12 The class of all proper lsc convex functions from H to R ∪ {+∞} is denoted by Γ0(H). [sent-35, score-0.116]
13 The class we consider covers non-smooth convex optimization problems, including those with convex constraints. [sent-43, score-0.175]
14 2 Contributions. This paper introduces a class of scaled norms for which we can compute a proximity operator; these results are significant in their own right, since previous results cover only diagonal scaling (and the diagonal scaling result is trivial). [sent-52, score-0.569]
15 Then, motivated by the discrepancy between constrained and unconstrained performance, we define a class of limited-memory quasi-Newton methods to solve (P) that extends naturally and elegantly from the unconstrained to the constrained case. [sent-53, score-0.289]
16 Most well-known quasi-Newton methods for constrained problems, such as L-BFGS-B [2], are only applicable to box constraints l ≤ x ≤ u. [sent-54, score-0.088]
17 The power of our approach is that it applies to a wide variety of useful non-smooth functionals (see §3. [sent-55, score-0.05]
18 The approach uses the zero-memory SR1 algorithm, and we provide evidence that the non-diagonal term provides significant improvements over diagonal Hessians. [sent-58, score-0.139]
19 1 Quasi-Newton forward-backward splitting: The algorithm. In the following, define the quadratic approximation $Q_B(x) = f(x_k) + \langle \nabla f(x_k), x - x_k\rangle + \tfrac{1}{2}\|x - x_k\|_B^2$ (4), where $B \in S_{++}(N)$. [sent-60, score-0.568]
20 $x_{k+1} = \operatorname{argmin}_x\, Q_{B_k}(x) + h(x)$ (5). Note that this specializes to gradient descent when h = 0. [sent-62, score-0.117]
21 Therefore, if f is a strictly convex quadratic function and one takes $B_k = \nabla^2 f(x_k)$, then we obtain the Newton method. [sent-63, score-0.161]
22 Since f is smooth and can be approximated by a quadratic, and inspired by quasi-Newton methods, this suggests picking $B_k$ as an approximation of the Hessian. [sent-66, score-0.042]
23 Our diagonal+rank-1 quasi-Newton forward-backward splitting algorithm is listed in Algorithm 1, with the quasi-Newton update detailed in Algorithm 2 (see §4). [sent-68, score-0.096]
24 Algorithm 1: Zero-memory Symmetric Rank 1 (0SR1) algorithm to solve min f + h. Require: $x_0 \in \operatorname{dom}(f + h)$, Lipschitz constant estimate L of $\nabla f$, stopping criterion $\epsilon$. 1: for k = 1, 2, 3, . . . [sent-74, score-0.079]
25 do 2: $s_k \leftarrow x_k - x_{k-1}$ 3: $y_k \leftarrow \nabla f(x_k) - \nabla f(x_{k-1})$ 4: Compute $H_k$ via Algorithm 2, and define $B_k = H_k^{-1}$. [sent-77, score-0.748]
26 5: Compute the rank-1 proximity operator (see §3): $\hat{x}_{k+1} \leftarrow \operatorname{prox}^{B_k}_h(x_k - H_k \nabla f(x_k))$ (6) 6: $p_k \leftarrow \hat{x}_{k+1} - x_k$, and terminate if $\|p_k\| < \epsilon$ 7: Line-search along the ray $x_k + t p_k$ to determine $x_{k+1}$, or choose t = 1. [sent-78, score-0.776]
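Below is a minimal sketch of the main loop of Algorithm 1, taking t = 1 (no line search). The helpers grad_f, prox_in_metric, and diag_rank1_update are assumed callables (the latter two are sketched further below, where §3 and §4 are discussed); all names and defaults are ours, not the paper's.

```python
import numpy as np

def zero_sr1(x0, grad_f, prox_in_metric, diag_rank1_update, eps=1e-8, max_iter=500):
    """Sketch of the main loop of Algorithm 1 (0SR1), taking t = 1 (no line search).

    grad_f(x)                : gradient of the smooth term f.
    prox_in_metric(y, d, u)  : assumed to return prox_h^{B_k}(y) where
                               H_k = diag(d) + u u^T and B_k = H_k^{-1}
                               (computable from the rank-1 prox results of Sec. 3
                               together with the Sherman-Morrison formula).
    diag_rank1_update(s, y)  : assumed to return (d, u) as in Algorithm 2.
    """
    x = np.asarray(x0, dtype=float).copy()
    g = grad_f(x)
    x_prev, g_prev = None, None

    for _ in range(max_iter):
        if x_prev is None:
            # no curvature information yet: fall back to a plain diagonal metric
            d, u = np.ones_like(x), np.zeros_like(x)
        else:
            d, u = diag_rank1_update(x - x_prev, g - g_prev)   # steps 2-4

        H_g = d * g + u * (u @ g)                 # H_k * grad f(x_k)
        x_hat = prox_in_metric(x - H_g, d, u)     # step 5, Eq. (6)
        p = x_hat - x
        if np.linalg.norm(p) < eps:               # step 6
            return x_hat
        x_prev, g_prev = x, g
        x, g = x_hat, grad_f(x_hat)               # step 7 with t = 1
    return x
```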
27 2 Relation to prior work. First-order methods: The algorithm in (5) is variously known as proximal descent or the iterated shrinkage/thresholding algorithm (IST or ISTA). [sent-80, score-0.164]
28 The spectral projected gradient (SPG) [4] method was designed as an extension of the BarzilaiBorwein spectral step-length method to constrained problems. [sent-82, score-0.203]
29 In [5], it was extended to non-smooth problems by allowing general proximity operators; the Barzilai-Borwein method [6] uses a specific choice of step-length $t_k$ motivated by quasi-Newton methods. [sent-83, score-0.302]
30 Nesterov acceleration can be viewed as an over-relaxed version of ISTA with a specific, non-constant over-relaxation parameter αk . [sent-87, score-0.049]
31 The general diagonal case was considered in several papers in the 1980s as a simple quasi-Newton method, but never widely adopted. [sent-89, score-0.139]
32 A convergence rate analysis of forward-backward splitting with static and variable Bk where one of the operators is maximal strongly monotone is given in [10]. [sent-91, score-0.158]
33 Active set approaches: Active set methods take a simple step, such as gradient projection, to identify active variables, and then use a more advanced quadratic model to solve for the free variables. [sent-92, score-0.21]
34 A recent bound-constrained solver is ASA [13] which uses a conjugate gradient (CG) solver on the free variables, and shows good results compared to L-BFGS-B, SPG, GENCAN and TRON. [sent-94, score-0.07]
35 We also compare to several active set approaches specialized for 1 penalties: “Orthant-wise Learning” (OWL) [14], “Projected Scaled Sub-gradient + Active Set” (PSSas) [15], “Fixed-point continuation + Active Set” (FPC AS) [16], and “CG + IST” (CGIST) [17]. [sent-95, score-0.053]
36 IPM requires solving a Newton-step equation, so first-order-like "Hessian-free" variants of IPM solve the Newton step approximately, either by approximately solving the equation or by subsampling the Hessian. [sent-97, score-0.041]
37 Yet another approach is to include the non-smooth h term in the quadratic approximation. [sent-99, score-0.046]
38 The projected quasi-Newton (PQN) algorithm [19, 20] is perhaps the most elegant and logical extension of quasi-Newton methods, but it involves solving a sub-iteration. [sent-102, score-0.122]
39 As discussed in [21], it is possible to generalize PQN to general non-smooth problems whenever the proximity operator is known (since, as mentioned above, it is possible to extend SPG to this case). [sent-105, score-0.35]
40 3 Proximity operators and proximal calculus. Due to space limitations, we only recall essential definitions. [sent-106, score-0.217]
41 Further notions and results from convex analysis, as well as proofs, can be found in the supplementary material. [sent-107, score-0.07]
42 Then, for every x ∈ H, the function $z \mapsto \tfrac{1}{2}\|x - z\|^2 + h(z)$ achieves its infimum at a unique point denoted by $\operatorname{prox}_h x$. [sent-110, score-0.227]
43 The uniquely-valued operator $\operatorname{prox}_h : \mathcal{H} \to \mathcal{H}$ thus defined is the proximity operator or proximal mapping of h. [sent-111, score-0.75]
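As a standard concrete instance of this definition (not specific to this paper), the proximity operator of h = λ‖·‖_1 is componentwise soft-thresholding; a minimal sketch:

```python
import numpy as np

def prox_l1(x, lam):
    """prox of lam*||.||_1: the minimizer of 0.5*||x - z||^2 + lam*||z||_1,
    which decouples coordinatewise and gives soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```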
44 1 Proximal calculus in $\mathcal{H}_V$. Throughout, we denote $\operatorname{prox}^V_h = (\mathrm{I}_{\mathcal{H}_V} + V^{-1}\partial h)^{-1}$, where $\partial h$ is the subdifferential of h, the proximity operator of h w.r.t. the norm $\|\cdot\|_V$. [sent-113, score-0.388]
45 Note that since $V \in S_{++}(N)$, the proximity operator $\operatorname{prox}^V_h$ is well-defined. [sent-117, score-0.251]
46 1 Diagonal+rank-1: General case. Theorem 7 (Proximity operator in $\mathcal{H}_V$). [sent-124, score-0.099]
47 Let $h \in \Gamma_0(\mathcal{H})$ and $V = D + uu^T$, where D is diagonal with (strictly) positive diagonal elements $d_i$, and $u \in \mathbb{R}^N$. [sent-125, score-0.315]
48 Then, $\operatorname{prox}^V_h(x) = D^{-1/2} \circ \operatorname{prox}_{h\circ D^{-1/2}}\big(D^{1/2}x - v\big)$, (9) where $v = \alpha D^{-1/2}u$ and $\alpha$ is the unique root of $p(\alpha) = \big\langle u,\; x - D^{-1/2}\circ\operatorname{prox}_{h\circ D^{-1/2}}\circ D^{1/2}\,(x - \alpha D^{-1}u)\big\rangle + \alpha$, (10) which is a Lipschitz continuous and strictly increasing function on $\mathbb{R}$ with Lipschitz constant $1 + \sum_i u_i^2/d_i$. [sent-126, score-0.17]
49 • Computing $\operatorname{prox}^V_h$ amounts to solving a scalar optimization problem that involves the computation of $\operatorname{prox}_{h\circ D^{-1/2}}$. [sent-128, score-0.07]
50 The latter can be much simpler to compute as D is diagonal (beyond the obvious separable case that we will consider shortly). [sent-129, score-0.296]
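The following is a minimal sketch of the reduction in Theorem 7, assuming a user-supplied callable prox_scaled_h(y) that returns prox_{h∘D^{-1/2}}(y); the scalar root of p(α) is found here by plain bisection (exploiting monotonicity) rather than by the sorting or semi-smooth Newton schemes discussed next, and all names are ours.

```python
import numpy as np

def prox_rank1_general(x, d, u, prox_scaled_h, tol=1e-10, max_iter=200):
    """Sketch of Theorem 7: prox_h^V(x) with V = diag(d) + u u^T, d > 0.
    prox_scaled_h(y) is assumed to return prox_{h o D^{-1/2}}(y)."""
    sd = np.sqrt(d)

    def pi(alpha):
        # D^{-1/2} o prox_{h o D^{-1/2}} o D^{1/2} applied to x - alpha * D^{-1} u
        return prox_scaled_h(sd * (x - alpha * u / d)) / sd

    def p(alpha):                       # the scalar function of Eq. (10)
        return u @ (x - pi(alpha)) + alpha

    # p is continuous and strictly increasing with a unique root:
    # grow a bracket, then bisect.
    lo, hi = -1.0, 1.0
    while p(lo) > 0:
        lo *= 2.0
    while p(hi) < 0:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if p(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    alpha = 0.5 * (lo + hi)

    v = alpha * u / sd                  # v = alpha * D^{-1/2} u, as in Eq. (9)
    return prox_scaled_h(sd * x - v) / sd
```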
51 2 Diagonal+rank-1: Separable case The following corollary is key to our novel optimization algorithm. [sent-134, score-0.077]
52 Suppose $h(x) = \sum_{i=1}^N h_i(x_i)$, and $V = D + uu^T$, where D is diagonal with (strictly) positive diagonal elements $d_i$, and $u \in \mathbb{R}^N$. [sent-138, score-0.315]
53 Then, $\operatorname{prox}^V_h(x) = \big(\operatorname{prox}_{h_i/d_i}(x_i - v_i/d_i)\big)_i$, (11) where $v = \alpha u$ and $\alpha$ is the unique root of $p(\alpha) = \big\langle u,\; x - \big(\operatorname{prox}_{h_i/d_i}(x_i - \alpha u_i/d_i)\big)_i\big\rangle + \alpha$, (12) which is a Lipschitz continuous and strictly increasing function on $\mathbb{R}$. [sent-139, score-0.41]
54 Proof: As h is separable and $D \in S_{++}(N)$ is diagonal, applying Theorem 7 yields the desired result. [sent-140, score-0.157]
55 Assume that for $1 \le i \le N$, $\operatorname{prox}_{h_i}$ is piecewise affine on $\mathbb{R}$ with $k_i \ge 1$ segments, i.e. [sent-142, score-0.237]
56 $\operatorname{prox}_{h_i}(x_i) = a_j x_i + b_j$ for $t_j \le x_i \le t_{j+1}$, $j \in \{1, \ldots, k_i\}$. [sent-144, score-0.201]
57 Then $\operatorname{prox}^V_h(x)$ can be obtained exactly by sorting at most the k real values $\big\{\tfrac{d_i}{u_i}(x_i - t_j)\big\}_{(i,j)}$. [sent-149, score-0.064]
58 When $\operatorname{prox}_{h_i}$ is piecewise affine with $k_i$ segments, it is easy to see that $p(\alpha)$ in (12) is also piecewise affine, with slopes and intercepts changing at the k transition points $\big\{\tfrac{d_i}{u_i}(x_i - t_j)\big\}_{(i,j)}$. [sent-157, score-0.423]
59 To get $\alpha$, it is sufficient to isolate the unique segment that intersects the abscissa axis. [sent-164, score-0.043]
60 This can be achieved by sorting the values of the transition points, which costs O(k log k) on average. [sent-165, score-0.064]
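Here is a sketch of the sort-based exact computation for the separable case with h = λ‖·‖_1, whose prox (soft-thresholding) is piecewise affine with three segments; for simplicity p(α) is evaluated at every breakpoint (O(kN)) rather than with the O(k log k) running-slope bookkeeping described above. Inputs are assumed to be NumPy arrays and the function names are ours.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_rank1_l1(x, d, u, lam):
    """Exact prox_h^V(x) for h = lam*||.||_1 and V = diag(d) + u u^T (d > 0),
    found via the root of the piecewise-affine, increasing p(alpha) of Eq. (12)."""
    def p(alpha):
        z = soft_threshold(x - alpha * u / d, lam / d)
        return u @ (x - z) + alpha

    # Breakpoints of p: where a component of x - alpha*u/d crosses +/- lam/d.
    nz = u != 0
    bp = np.concatenate([
        (x[nz] - lam / d[nz]) * d[nz] / u[nz],
        (x[nz] + lam / d[nz]) * d[nz] / u[nz],
    ])
    bp = np.unique(bp)                       # sorted breakpoints

    if bp.size == 0:                         # u == 0: V is diagonal, so alpha = 0
        alpha = 0.0
    else:
        pv = np.array([p(a) for a in bp])    # increasing, since p is increasing
        j = np.searchsorted(pv, 0.0)
        if j == 0:                           # root lies on the leftmost affine piece
            a0, a1 = bp[0] - 1.0, bp[0]
        elif j == bp.size:                   # root lies on the rightmost affine piece
            a0, a1 = bp[-1], bp[-1] + 1.0
        else:                                # root bracketed by consecutive breakpoints
            a0, a1 = bp[j - 1], bp[j]
        p0, p1 = p(a0), p(a1)
        # p is affine on [a0, a1], so linear interpolation there is exact
        alpha = a0 if p1 == p0 else a0 - p0 * (a1 - a0) / (p1 - p0)

    v = alpha * u
    return soft_threshold(x - v / d, lam / d)   # Eq. (11)
```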
61 • Corollary 9 can be extended to the "block" separable case (i.e. [sent-173, score-0.157]
62 separable in subsets of coordinates) when D is piecewise constant along the same block indices. [sent-175, score-0.204]
63 If no closed-form is available, one can appeal to some efficient iterative method to solve (10) (or (12)). [sent-179, score-0.041]
64 The semi-smooth Newton method for the solution of (10) can be stated as the iteration $\alpha_{t+1} = \alpha_t - g(\alpha_t)^{-1} p(\alpha_t)$, (13) where g is a generalized derivative of p. [sent-181, score-0.074]
65 If $\operatorname{prox}_{h\circ D^{-1/2}}$ is Newton differentiable with generalized derivative G, then so is the mapping p, with a generalized derivative $g(\alpha) = 1 + \big\langle u,\; D^{-1/2}\circ G(D^{1/2}x - \alpha D^{-1/2}u)\circ D^{-1/2}u\big\rangle$. Furthermore, g is nonsingular with a uniformly bounded inverse on $\mathbb{R}$. [sent-183, score-0.243]
66 Thus, as p is Newton differentiable with nonsingular generalized derivative whose inverse is also bounded, the general semi-smooth Newton convergence theorem implies that (13) converges superlinearly to the unique root of (10). [sent-188, score-0.252]
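A minimal sketch of iteration (13) for the ℓ1 separable case follows, where a generalized (Newton) derivative of p is simply g(α) = 1 + Σ_{i active} u_i²/d_i, the active components being those on a linear piece of the soft-thresholding prox. Convergence is local (superlinear) per the text; a globalization safeguard, e.g. falling back to bisection, is omitted in this sketch, and the function name is ours.

```python
import numpy as np

def semismooth_newton_alpha(x, d, u, lam, alpha0=0.0, tol=1e-12, max_iter=50):
    """Iteration (13) for the separable case h = lam*||.||_1: p(alpha) from (12)
    with soft-thresholding proxes, and generalized derivative
    g(alpha) = 1 + sum over 'active' i of u_i^2 / d_i."""
    alpha = alpha0
    for _ in range(max_iter):
        z = x - alpha * u / d
        active = np.abs(z) > lam / d                  # components on a linear piece of the prox
        s = np.sign(z) * np.maximum(np.abs(z) - lam / d, 0.0)
        p = u @ (x - s) + alpha                       # Eq. (12)
        g = 1.0 + np.sum(u[active] ** 2 / d[active])  # a Newton derivative of p
        step = p / g
        alpha -= step                                 # Eq. (13)
        if abs(step) < tol:
            break
    return alpha
```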
67 For instance, Table 1 summarizes a few of them, where we can obtain either an exact answer by sorting when possible, or else by minimizing w.r.t. the scalar α. [sent-192, score-0.1]
68 4 A primal rank 1 SR1 algorithm Following the conventional quasi-Newton notation, we let B denote an approximation to the Hessian of f and H denote an approximation to the inverse Hessian. [sent-198, score-0.092]
69 All quasi-Newton methods update an approximation to the (inverse) Hessian that satisfies the secant condition: $H_k y_k = s_k$, $y_k = \nabla f(x_k) - \nabla f(x_{k-1})$, $s_k = x_k - x_{k-1}$. (14) Algorithm 1 follows the SR1 method [24], which uses a rank-1 update to the inverse Hessian approximation at every step. [sent-199, score-1.391]
70 In particular, we use a zero-memory approach, which means that at every iteration, a new diagonal plus rank-one matrix is formed. [sent-204, score-0.139]
71 The other modification is to extend the SR1 method to the general setting of minimizing f + h where f is smooth but h need not be smooth; this further generalizes the case when h is an indicator function of a convex set. [sent-205, score-0.112]
72 Every step of the algorithm replaces f with a quadratic approximation, and keeps h unchanged. [sent-206, score-0.046]
73 Choosing $H_0$ (step length): In our experience, the choice of $H_0$ is best if scaled with a Barzilai-Borwein spectral step length $\tau_{BB2} = \langle s_k, y_k\rangle / \langle y_k, y_k\rangle$ (15) (we call it $\tau_{BB2}$ to distinguish it from the other Barzilai-Borwein step size $\tau_{BB1} = \langle s_k, s_k\rangle / \langle s_k, y_k\rangle \ge \tau_{BB2}$). [sent-208, score-2.18]
74 In SR1 methods, the quantity $\langle s_k - H_0 y_k, y_k\rangle$ must be positive in order to have a well-defined update for $u_k$. [sent-209, score-0.93]
75 The update is: $H_k = H_0 + u_k u_k^T$, where $u_k = (s_k - H_0 y_k)/\sqrt{\langle s_k - H_0 y_k, y_k\rangle}$. [sent-210, score-1.325]
76 12: end if 13: end if 14: return $H_k = H_0 + u_k u_k^T$ {$B_k = H_k^{-1}$ can be computed via the Sherman-Morrison formula}. For this reason, we choose $H_0 = \gamma\,\tau_{BB2}\,\mathrm{I}_{\mathcal{H}}$ with $0 < \gamma < 1$, and thus $0 \le \langle s_k - H_0 y_k, y_k\rangle = (1-\gamma)\langle s_k, y_k\rangle$. [sent-212, score-1.465]
77 If $\langle s_k, y_k\rangle = 0$, then there is no symmetric rank-one update that satisfies the secant condition. [sent-213, score-0.591]
78 The inequality $\langle s_k, y_k\rangle > 0$ is the curvature condition, and it is guaranteed for all strictly convex objectives. [sent-214, score-0.65]
79 Following the recommendation in [26], we skip updates whenever $\langle s_k, y_k\rangle$ cannot be guaranteed to be non-zero given standard floating-point precision. [sent-215, score-0.572]
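A sketch of the zero-memory update described in this section follows (H0 = γ τ_BB2 I, rank-1 term built from the secant condition, update skipped when ⟨s, y⟩ cannot be trusted); the identity fallback on a skipped update and the tolerance are our simplifications, not the paper's Algorithm 2 verbatim. This is the helper assumed by the zero_sr1 sketch given after Algorithm 1 above.

```python
import numpy as np

def diag_rank1_update(s, y, gamma=0.8, skip_tol=1e-8):
    """Sketch of Algorithm 2: zero-memory model H_k = H0 + u u^T with
    H0 = gamma * tau_BB2 * I, returned as (d, u) where d is the diagonal of H0.
    The rank-1 part is skipped when <s, y> cannot be trusted to be positive."""
    sy = s @ y
    if sy <= skip_tol * np.linalg.norm(s) * np.linalg.norm(y):
        # simple fallback used in this sketch: identity diagonal, no rank-1 term
        return np.ones_like(s), np.zeros_like(s)
    tau_bb2 = sy / (y @ y)                  # Barzilai-Borwein step length, Eq. (15)
    d = np.full_like(s, gamma * tau_bb2)    # H0 = gamma * tau_BB2 * I, 0 < gamma < 1
    r = s - d * y                           # s_k - H0 y_k
    u = r / np.sqrt(r @ y)                  # <r, y> = (1 - gamma) <s, y> > 0
    return d, u
```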
80 Figure 1: (a) first LASSO test, (b) second LASSO test. Consider the unconstrained LASSO problem (1). [sent-222, score-0.071]
81 However, this constrained problem has twice the number of variables, and the Hessian of the quadratic part changes from $A^TA$ to $\tilde{A} = \begin{pmatrix} A^TA & -A^TA \\ -A^TA & A^TA \end{pmatrix}$, which necessarily has (at least) n degenerate 0 eigenvalues and adversely affects solvers. [sent-226, score-0.053]
82 A similar situation occurs with the hinge-loss function. [sent-227, score-0.046]
83 Our second example uses a square operator A with dimensions $n = 13^3 = 2197$, chosen as a 3D discrete differential operator. [sent-238, score-0.099]
84 This example stems from a numerical analysis problem to solve a discretized PDE as suggested by [28]. [sent-239, score-0.041]
85 6 Conclusions In this paper, we proposed a novel variable metric (quasi-Newton) forward-backward splitting algorithm, designed to efficiently solve non-smooth convex problems structured as the sum of a smooth term and a non-smooth one. [sent-247, score-0.249]
86 We introduced a class of weighted norms induced by diagonal+rank-1 symmetric positive definite matrices, and proposed a whole framework to compute a proximity operator in the weighted norm. [sent-248, score-0.35]
87 We also provided clear evidence that the non-diagonal term provides significant acceleration over diagonal matrices. [sent-250, score-0.188]
88 Another improvement would be to derive an efficient calculation for rank-2 proximity terms, thus allowing a 0-memory BFGS method. [sent-255, score-0.285]
89 However, in general, one must solve an r-dimensional inner problem using the semismooth Newton method. [sent-257, score-0.154]
90 A final possible extension is to take $B_k$ to be diagonal plus rank-1 on diagonal blocks, since if h is separable, this can still be solved by our algorithm (see Remark 10). [sent-258, score-0.278]
91 Nonmonotone spectral projected gradient methods on convex sets. [sent-303, score-0.22]
92 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. [sent-325, score-0.052]
93 Diagonal preconditioning for first order primal-dual algorithms in convex optimization. [sent-338, score-0.07]
94 Remark on algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. ACM Trans. [sent-359, score-0.094]
95 A new active set algorithm for box constrained optimization. [sent-368, score-0.141]
96 A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. [sent-405, score-0.105]
97 Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. [sent-413, score-0.08]
98 Proximal Newton-type methods for minimizing convex objective functions in composite form. [sent-431, score-0.07]
99 A semismooth Newton method for Tikhonov functionals with sparsity constraints. [sent-445, score-0.163]
100 Tackling box-constrained optimization via a new projected quasi-Newton approach. [sent-473, score-0.115]
wordName wordTfidf (topN-words)
[('proxv', 0.338), ('yk', 0.328), ('proximity', 0.251), ('xk', 0.213), ('sk', 0.207), ('proxh', 0.184), ('newton', 0.179), ('separable', 0.157), ('hv', 0.151), ('proxhi', 0.141), ('diagonal', 0.139), ('bk', 0.135), ('proximal', 0.117), ('ih', 0.115), ('nonseparable', 0.113), ('semismooth', 0.113), ('spg', 0.113), ('hk', 0.104), ('bfgs', 0.103), ('operator', 0.099), ('splitting', 0.096), ('asa', 0.092), ('ista', 0.086), ('cgist', 0.085), ('fpc', 0.085), ('pssas', 0.085), ('hessian', 0.081), ('projected', 0.08), ('fortran', 0.075), ('owl', 0.075), ('pqn', 0.075), ('unconstrained', 0.071), ('convex', 0.07), ('gradient', 0.07), ('ipm', 0.069), ('cg', 0.069), ('uk', 0.067), ('sorting', 0.064), ('operators', 0.062), ('dom', 0.062), ('tj', 0.06), ('projector', 0.059), ('lasso', 0.059), ('bauschke', 0.056), ('caen', 0.056), ('mem', 0.056), ('secant', 0.056), ('ax', 0.056), ('fista', 0.055), ('active', 0.053), ('constrained', 0.053), ('inverse', 0.052), ('siam', 0.052), ('tk', 0.051), ('functionals', 0.05), ('ki', 0.049), ('acceleration', 0.049), ('descent', 0.047), ('schmidt', 0.047), ('piecewise', 0.047), ('fadili', 0.046), ('lsc', 0.046), ('uut', 0.046), ('quadratic', 0.046), ('lipschitz', 0.045), ('strictly', 0.045), ('rn', 0.044), ('nonsingular', 0.043), ('byrd', 0.043), ('combettes', 0.043), ('unique', 0.043), ('corollary', 0.042), ('smooth', 0.042), ('yi', 0.042), ('ui', 0.042), ('elegant', 0.042), ('restart', 0.041), ('subroutines', 0.041), ('solve', 0.041), ('derivative', 0.041), ('root', 0.04), ('rank', 0.04), ('solvers', 0.04), ('scaled', 0.04), ('ist', 0.039), ('nocedal', 0.039), ('stopping', 0.038), ('calculus', 0.038), ('remark', 0.037), ('di', 0.037), ('skip', 0.037), ('else', 0.036), ('hinge', 0.036), ('bb', 0.035), ('box', 0.035), ('optimization', 0.035), ('scalar', 0.035), ('preprint', 0.034), ('calculation', 0.034), ('generalized', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
2 0.33376262 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize over x ∈ R^n: f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1
3 0.26524007 300 nips-2012-Scalable nonconvex inexact proximal splitting
Author: Suvrit Sra
Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
Author: Demba Ba, Behtash Babadi, Patrick Purdon, Emery Brown
Abstract: We consider the problem of recovering a sequence of vectors, (x_k)_{k=0}^K, for which the increments x_k − x_{k−1} are S_k-sparse (with S_k typically smaller than S_1), based on linear measurements (y_k = A_k x_k + e_k)_{k=1}^K, where A_k and e_k denote the measurement matrix and noise, respectively. Assuming each A_k obeys the restricted isometry property (RIP) of a certain order, depending only on S_k, we show that in the absence of noise a convex program, which minimizes the weighted sum of the ℓ1-norm of successive differences subject to the linear measurement constraints, recovers the sequence (x_k)_{k=1}^K exactly. This is an interesting result because this convex program is equivalent to a standard compressive sensing problem with a highly-structured aggregate measurement matrix which does not satisfy the RIP requirements in the standard sense, and yet we can achieve exact recovery. In the presence of bounded noise, we propose a quadratically-constrained convex program for recovery and derive bounds on the reconstruction error of the sequence. We supplement our theoretical analysis with simulations and an application to real video data. These further support the validity of the proposed approach for acquisition and recovery of signals with time-varying sparsity.
5 0.1740265 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
6 0.15282668 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
7 0.1475049 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
8 0.13358642 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
9 0.11875609 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
10 0.10902901 319 nips-2012-Sparse Prediction with the $k$-Support Norm
11 0.10709167 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
12 0.10541451 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.10459798 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
14 0.10366789 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
15 0.09530963 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
16 0.091987975 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
17 0.091881007 179 nips-2012-Learning Manifolds with K-Means and K-Flats
18 0.089856915 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
19 0.088638797 206 nips-2012-Majorization for CRFs and Latent Likelihoods
20 0.088069759 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
topicId topicWeight
[(0, 0.222), (1, 0.045), (2, 0.189), (3, -0.119), (4, 0.129), (5, 0.157), (6, 0.013), (7, -0.158), (8, 0.28), (9, 0.05), (10, -0.124), (11, 0.133), (12, 0.09), (13, -0.049), (14, -0.088), (15, -0.019), (16, -0.064), (17, 0.064), (18, 0.019), (19, -0.006), (20, -0.085), (21, 0.057), (22, 0.02), (23, 0.029), (24, 0.035), (25, -0.026), (26, 0.006), (27, -0.049), (28, 0.003), (29, -0.032), (30, 0.013), (31, -0.001), (32, -0.041), (33, 0.023), (34, 0.027), (35, -0.082), (36, -0.009), (37, -0.035), (38, 0.037), (39, 0.05), (40, 0.019), (41, -0.008), (42, 0.014), (43, 0.045), (44, 0.037), (45, -0.017), (46, -0.013), (47, 0.004), (48, -0.028), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.94859827 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
2 0.94076985 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize over x ∈ R^n: f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1
3 0.90211374 300 nips-2012-Scalable nonconvex inexact proximal splitting
Author: Suvrit Sra
Abstract: We study a class of large-scale, nonsmooth, and nonconvex optimization problems. In particular, we focus on nonconvex problems with composite objectives. This class includes the extensively studied class of convex composite objective problems as a subclass. To solve composite nonconvex problems we introduce a powerful new framework based on asymptotically nonvanishing errors, avoiding the common stronger assumption of vanishing errors. Within our new framework we derive both batch and incremental proximal splitting algorithms. To our knowledge, our work is first to develop and analyze incremental nonconvex proximalsplitting algorithms, even if we were to disregard the ability to handle nonvanishing errors. We illustrate one instance of our general framework by showing an application to large-scale nonsmooth matrix factorization. 1
Author: Demba Ba, Behtash Babadi, Patrick Purdon, Emery Brown
Abstract: We consider the problem of recovering a sequence of vectors, (x_k)_{k=0}^K, for which the increments x_k − x_{k−1} are S_k-sparse (with S_k typically smaller than S_1), based on linear measurements (y_k = A_k x_k + e_k)_{k=1}^K, where A_k and e_k denote the measurement matrix and noise, respectively. Assuming each A_k obeys the restricted isometry property (RIP) of a certain order, depending only on S_k, we show that in the absence of noise a convex program, which minimizes the weighted sum of the ℓ1-norm of successive differences subject to the linear measurement constraints, recovers the sequence (x_k)_{k=1}^K exactly. This is an interesting result because this convex program is equivalent to a standard compressive sensing problem with a highly-structured aggregate measurement matrix which does not satisfy the RIP requirements in the standard sense, and yet we can achieve exact recovery. In the presence of bounded noise, we propose a quadratically-constrained convex program for recovery and derive bounds on the reconstruction error of the sequence. We supplement our theoretical analysis with simulations and an application to real video data. These further support the validity of the proposed approach for acquisition and recovery of signals with time-varying sparsity.
5 0.66441035 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
Author: Nicolas L. Roux, Mark Schmidt, Francis R. Bach
Abstract: We propose a new stochastic gradient method for optimizing the sum of a finite set of smooth functions, where the sum is strongly convex. While standard stochastic gradient methods converge at sublinear rates for this problem, the proposed method incorporates a memory of previous gradient values in order to achieve a linear convergence rate. In a machine learning context, numerical experiments indicate that the new algorithm can dramatically outperform standard algorithms, both in terms of optimizing the training error and reducing the test error quickly. 1
6 0.65162247 340 nips-2012-The representer theorem for Hilbert spaces: a necessary and sufficient condition
7 0.57439494 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
8 0.56157464 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
9 0.56095445 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
10 0.53217298 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization
11 0.52180928 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
12 0.50131994 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
13 0.4677707 34 nips-2012-Active Learning of Multi-Index Function Models
14 0.46480137 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
15 0.46145123 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
16 0.4509438 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
17 0.44925579 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
18 0.43683553 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
19 0.43020228 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
20 0.42957401 35 nips-2012-Adaptive Learning of Smoothing Functions: Application to Electricity Load Forecasting
topicId topicWeight
[(0, 0.098), (17, 0.011), (21, 0.028), (38, 0.203), (39, 0.012), (42, 0.037), (54, 0.027), (55, 0.022), (64, 0.187), (74, 0.027), (76, 0.153), (80, 0.09), (92, 0.04)]
simIndex simValue paperId paperTitle
1 0.89555264 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging
Author: Chris Hinrichs, Vikas Singh, Jiming Peng, Sterling Johnson
Abstract: Multiple Kernel Learning (MKL) generalizes SVMs to the setting where one simultaneously trains a linear classifier and chooses an optimal combination of given base kernels. Model complexity is typically controlled using various norm regularizations on the base kernel mixing coefficients. Existing methods neither regularize nor exploit potentially useful information pertaining to how kernels in the input set ‘interact’; that is, higher order kernel-pair relationships that can be easily obtained via unsupervised (similarity, geodesics), supervised (correlation in errors), or domain knowledge driven mechanisms (which features were used to construct the kernel?). We show that by substituting the norm penalty with an arbitrary quadratic function Q ⪰ 0, one can impose a desired covariance structure on mixing weights, and use this as an inductive bias when learning the concept. This formulation significantly generalizes the widely used 1- and 2-norm MKL objectives. We explore the model’s utility via experiments on a challenging Neuroimaging problem, where the goal is to predict a subject’s conversion to Alzheimer’s Disease (AD) by exploiting aggregate information from many distinct imaging modalities. Here, our new model outperforms the state of the art (p-values < 10^{-3}). We briefly discuss ramifications in terms of learning bounds (Rademacher complexity). 1
same-paper 2 0.88436973 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
3 0.87041795 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Author: Arthur Guez, David Silver, Peter Dayan
Abstract: Bayesian model-based reinforcement learning is a formally elegant approach to learning optimal behaviour under model uncertainty, trading off exploration and exploitation in an ideal way. Unfortunately, finding the resulting Bayes-optimal policies is notoriously taxing, since the search space becomes enormous. In this paper we introduce a tractable, sample-based method for approximate Bayesoptimal planning which exploits Monte-Carlo tree search. Our approach outperformed prior Bayesian model-based RL algorithms by a significant margin on several well-known benchmark problems – because it avoids expensive applications of Bayes rule within the search tree by lazily sampling models from the current beliefs. We illustrate the advantages of our approach by showing it working in an infinite state space domain which is qualitatively out of reach of almost all previous work in Bayesian exploration. 1
4 0.86632687 147 nips-2012-Graphical Models via Generalized Linear Models
Author: Eunho Yang, Genevera Allen, Zhandong Liu, Pradeep K. Ravikumar
Abstract: Undirected graphical models, also known as Markov networks, enjoy popularity in a variety of applications. The popular instances of these models such as Gaussian Markov Random Fields (GMRFs), Ising models, and multinomial discrete models, however do not capture the characteristics of data in many settings. We introduce a new class of graphical models based on generalized linear models (GLMs) by assuming that node-wise conditional distributions arise from exponential families. Our models allow one to estimate multivariate Markov networks given any univariate exponential distribution, such as Poisson, negative binomial, and exponential, by fitting penalized GLMs to select the neighborhood for each node. A major contribution of this paper is the rigorous statistical analysis showing that with high probability, the neighborhood of our graphical models can be recovered exactly. We also provide examples of non-Gaussian high-throughput genomic networks learned via our GLM graphical models. 1
5 0.84460902 221 nips-2012-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Chang-shui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel regularizer. To solve the non-convex optimization problem, we propose a MultiStage Multi-Task Feature Learning (MSMTFL) algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. 1
6 0.81650513 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation
7 0.81584561 69 nips-2012-Clustering Sparse Graphs
8 0.81483489 86 nips-2012-Convex Multi-view Subspace Learning
9 0.81415802 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models
10 0.81394863 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
11 0.81366855 319 nips-2012-Sparse Prediction with the $k$-Support Norm
12 0.81333256 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
13 0.81330162 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
14 0.81309313 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
15 0.81215996 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
16 0.81171054 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
17 0.81162107 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
18 0.81016415 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
19 0.80925393 125 nips-2012-Factoring nonnegative matrices with linear programs
20 0.80895704 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models