nips nips2012 nips2012-240 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. [sent-14, score-0.265]
2 The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. [sent-15, score-0.197]
3 We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. [sent-16, score-0.1]
4 The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. [sent-17, score-0.59]
5 These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. [sent-18, score-0.086]
6 Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. [sent-20, score-0.295]
7 1 Introduction Covariance selection, first described in [2], has come to refer to the problem of estimating a normal distribution that has a sparse inverse covariance matrix P, whose non-zero entries correspond to edges in an associated Gaussian Markov Random Field, [3]. [sent-21, score-0.238]
8 A popular approach to covariance selection is to maximize an 1 penalized log likelihood objective, [4]. [sent-22, score-0.118]
9 This approach has also been applied to related problems, such as sparse multivariate regression with covariance estimation, [5], and covariance selection under a Kronecker product structure, [6]. [sent-23, score-0.285]
10 We note that Z* solves the dual problem Z* = arg max_{‖vec(Z)‖_∞ ≤ 1} U(Z), where U(Z) = −log det(S + λZ) + n. [sent-28, score-0.113]
11 The first class employs a piecewise quadratic model in the step computation, and can be seen as a generalization of the sequential quadratic programming method for nonlinear programming [9]; the second class minimizes a smooth quadratic model of F over a chosen orthant face in R^{n^2}. [sent-30, score-0.815]
12 We argue that both types of methods constitute useful tools for solving the sparse inverse covariance matrix estimation problem. [sent-31, score-0.265]
13 An overview of recent work on the sparse inverse covariance estimation problem is given in [10, 11]. [sent-32, score-0.265]
14 First-order methods proposed include block-coordinate descent approaches such as COVSEL [4, 8] and GLASSO [12], the greedy coordinate descent method SINCO [13], projected subgradient methods (PSM) [14], a first-order optimal gradient ascent method [15], and the alternating linearization method ALM [16]. [sent-33, score-0.335]
15 Second-order methods include the inexact interior point method IPM proposed in [17], and the coordinate relaxation method described in [1] and implemented in the QUIC software. [sent-34, score-0.166]
16 2 Newton Methods We can define a Newton iteration for problem (1) by constructing a quadratic, or piecewise quadratic, model of F using first and second derivative information. [sent-37, score-0.082]
17 It is well known [4] that the derivatives of the log likelihood function (4) are given by g := L'(P) = vec(S − P^{-1}) and H := L''(P) = (P^{-1} ⊗ P^{-1}),  (7) where ⊗ denotes the Kronecker product. [sent-38, score-0.072]
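As a concrete illustration of (7), the following sketch evaluates the smooth term and its gradient, assuming the standard form L(P) = trace(SP) − log det P; the function names and the use of NumPy are our own assumptions, not the authors' code.

```python
import numpy as np

def smooth_term(P, S):
    """L(P) = trace(S P) - log det P for symmetric positive definite P."""
    sign, logdet = np.linalg.slogdet(P)
    if sign <= 0:
        return np.inf          # outside the positive definite cone
    return np.trace(S @ P) - logdet

def gradient(P, S):
    """Matrix form of g = vec(S - P^{-1}) from (7)."""
    return S - np.linalg.inv(P)
```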
18 In the Newton-LASSO Method, we approximate the objective function F at the current iterate P_k by the piecewise quadratic model q_k(P) = L(P_k) + g_k^T vec(P − P_k) + (1/2) vec(P − P_k)^T H_k vec(P − P_k) + λ‖vec(P)‖_1,  (8) where g_k = L'(P_k), and similarly for H_k. [sent-40, score-0.793]
19 The trial step of the algorithm is computed as a minimizer of this model, and a backtracking line search ensures that the new iterate lies in the positive definite cone and decreases the objective function F . [sent-41, score-0.246]
20 We note that the minimization of qk is often called the LASSO problem [18] in the case when the unknown is a vector. [sent-42, score-0.121]
21 It is advantageous to perform the minimization of (8) in a reduced space, as follows. [sent-43, score-0.044]
22 Specifically, at the beginning of the k-th iteration we define the set Fk of (free) variables that are allowed to move, and the active set Ak . [sent-46, score-0.104]
23 To do so, we compute the steepest descent direction for the function F, which is given by −(g_k + λZ_k), where (Z_k)_{ij} = 1 if (P_k)_{ij} > 0;  −1 if (P_k)_{ij} < 0;  −1 if (P_k)_{ij} = 0 and [g_k]_{ij} > λ;  1 if (P_k)_{ij} = 0 and [g_k]_{ij} < −λ;  −(1/λ)[g_k]_{ij} if (P_k)_{ij} = 0 and |[g_k]_{ij}| ≤ λ.  (9) [sent-47, score-0.092]
24 The sets F_k, A_k are obtained by considering a small step along this steepest descent direction, as this guarantees descent in q_k(P). [sent-48, score-0.271]
25 For variables satisfying the last condition in (9), a small perturbation of Pij will not decrease the model qk . [sent-49, score-0.152]
26 This suggests defining the active and free sets of variables at iteration k as A_k = {(i, j) | (P_k)_{ij} = 0 and |[g_k]_{ij}| ≤ λ},  F_k = {(i, j) | (P_k)_{ij} ≠ 0 or |[g_k]_{ij}| > λ}.  (10) [sent-50, score-0.181]
27 The algorithm minimizes the model q_k over the set of free variables. [sent-51, score-0.223]
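The sketch below computes the orthant indicator (9) and the active/free masks (10) for matrix iterates; the reading of the last case of (9) as −[g_k]_{ij}/λ, the function names, and the exact-zero tests are assumptions made here for illustration.

```python
import numpy as np

def orthant_indicator(P, G, lam):
    """Z as in (9); P is the current iterate, G its gradient S - inv(P)."""
    Z = np.zeros_like(P, dtype=float)
    Z[P > 0] = 1.0
    Z[P < 0] = -1.0
    zero = (P == 0)                       # exact zeros, as kept by the algorithm
    Z[zero & (G > lam)] = -1.0
    Z[zero & (G < -lam)] = 1.0
    mid = zero & (np.abs(G) <= lam)
    Z[mid] = -G[mid] / lam                # makes the steepest-descent entry vanish
    return Z

def active_free_masks(P, G, lam):
    """Boolean masks for the active set A_k and free set F_k of (10)."""
    active = (P == 0) & (np.abs(G) <= lam)
    return active, ~active
```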
28 Let us define p_F = vec_F(P) to be the vector of free variables, and let p_kF = vec_F(P_k) denote their value at the current iterate, and similarly for other quantities. [sent-52, score-0.142]
29 Let us also define HkF to be the matrix obtained by removing from Hk the columns and rows corresponding to the active variables (with indices in Ak ). [sent-53, score-0.064]
30 The reduced model is given by q_F(P) = L(P_k) + g_kF^T (p_F − p_kF) + (1/2)(p_F − p_kF)^T H_kF (p_F − p_kF) + λ‖p_F‖_1.  (11) [sent-54, score-0.044]
31 The search direction d is defined by d = (d_F, d_A) = (p̂_F − p_kF, 0),  (12) where p̂_F is the minimizer of (11). [sent-55, score-0.137]
32 The algorithm performs a line search along the direction D = mat(d), where the operator mat(·) satisfies mat(vec(D)) = D. [sent-56, score-0.119]
33 It is suggested in [1] that coordinate descent is the most effective iteration for solving the LASSO problem (11). [sent-58, score-0.149]
34 Other iterative methods for solving (11) include gradient projection [19] and iterative shrinkage thresholding algorithms, such as ISTA [20] and FISTA [21]. [sent-60, score-0.132]
35 In section 3 we describe a Newton-LASSO method that employs the FISTA iteration. [sent-61, score-0.078]
36 The convergence properties of inexact Newton-LASSO methods will be the subject of a future study. [sent-64, score-0.055]
37 The Orthant-Based Newton method computes steps by solving a smooth quadratic approximation of F over an appropriate orthant, or more precisely, over an orthant face in R^{n^2}. [sent-65, score-0.874]
38 This orthant face is chosen, as before, by computing the steepest descent direction of F, and is characterized by the matrix Z_k in (9). [sent-66, score-0.499]
39 Specifically, the first four conditions in (9) identify an orthant in R^{n^2} where variables are allowed to move, while the last condition in (9) determines the variables to be held at zero. [sent-67, score-0.359]
40 Therefore, the sets of free and active variables are defined as in (10). [sent-68, score-0.141]
41 If we define z_F = vec_F(Z), then the quadratic model of F over the current orthant face is given by Q_F(P) = L(P_k) + g_F^T (p_F − p_kF) + (1/2)(p_F − p_kF)^T H_F (p_F − p_kF) + λ z_F^T p_F.  (14) [sent-69, score-0.455]
42 The minimizer is p*_F = p_kF − H_F^{-1}(g_F + λ z_F), and the step of the algorithm is given by d = (d_F, d_A) = (p*_F − p_kF, 0).  (15) [sent-70, score-0.051]
43 If p_kF + d lies outside the current orthant, we project it onto this orthant and perform a backtracking line search to obtain the new iterate P_{k+1}, as discussed in section 4. [sent-71, score-0.492]
44 The orthant-based Newton method therefore moves from one orthant face to another, taking advantage of the fact that F is smooth in every orthant in R^{n^2}. [sent-72, score-0.727]
45 Phase I: Determine the sets of fixed and free indices Ak and Fk , using (10). [sent-77, score-0.077]
46 Phase II: Compute the Newton step D given by (12), by minimizing the piecewise quadratic model (11) for the free variables Fk . [sent-79, score-0.24]
47 Globalization: Choose Pk+1 by performing an Armijo backtracking line search starting from Pk + D. [sent-81, score-0.13]
48 Phase I: Determine the active orthant face through the matrix Zk given in (9). [sent-85, score-0.398]
49 Phase II: Compute the Newton direction D given by (15), by minimizing the smooth quadratic model (14) for the free variables Fk . [sent-87, score-0.275]
50 Globalization: Choose Pk+1 in the current orthant by a projected backtracking line search starting from Pk + D. [sent-89, score-0.453]
51 Figure 1: Two classes of Newton methods for the inverse covariance estimation problem (3). [sent-92, score-0.216]
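To make the two-phase structure in Figure 1 concrete, here is a schematic outer loop; the callback interface (phase1, phase2, globalize), the stopping test, and all names are our own simplifications rather than the paper's implementation.

```python
import numpy as np

def two_phase_newton(P0, S, lam, phase1, phase2, globalize, max_iter=50, tol=1e-5):
    """Skeleton shared by the Newton-LASSO and orthant-based methods of Figure 1.
    phase1(P, G) -> (Z, free); phase2(P, G, Z, free) -> direction D (matrix);
    globalize(P, D, Z, free) -> next iterate after a (projected) line search."""
    P = P0.copy()
    for _ in range(max_iter):
        G = S - np.linalg.inv(P)                      # gradient of L, cf. (7)
        Z, free = phase1(P, G)                        # Phase I: orthant face / free set
        if np.max(np.abs(np.where(free, G + lam * Z, 0.0))) < tol:
            break                                     # minimum-norm subgradient is small
        D = phase2(P, G, Z, free)                     # Phase II: Newton-like step
        P = globalize(P, D, Z, free)                  # globalization
    return P
```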
52 A popular orthant-based method for the case when the unknown is a vector is OWL [23]; see also [11]. [sent-94, score-0.327]
53 Rather than using the Hessian (7), OWL employs a quasi-Newton approximation to minimize the reduced quadratic, and applies an alignment procedure to ensure descent. [sent-95, score-0.092]
54 However, for reasons that are difficult to justify, the OWL step employs the reduced inverse Hessian (as opposed to the inverse of the reduced Hessian), and this can give steps of poor quality. [sent-96, score-0.304]
55 3 A Newton-LASSO Method with FISTA Iteration. Let us consider a particular instance of the Newton-LASSO method that employs the Fast Iterative Shrinkage Thresholding Algorithm (FISTA) [21] to solve the reduced subproblem (11). [sent-99, score-0.122]
56 We can apply the ISTA iteration (18) to the reduced quadratic in (11) starting from x0 = vecFk (X0 ) (which is not necessarily equal to pk = vecFk (Pk )). [sent-102, score-0.6]
57 Let x̄ denote the free-variable part of the (approximate) solution of (11) obtained by the FISTA iteration. [sent-105, score-0.108]
58 Phase I of the Newton-LASSO-FISTA method selects the free and active sets, F_k, A_k, as indicated by (10). [sent-106, score-0.14]
59 Phase II applies the FISTA iteration to the reduced problem (11), and sets P_{k+1} ← mat((x̄, 0)), i.e., the free entries take the value x̄ and the active entries are set to zero. [sent-107, score-0.208]
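A minimal sketch of this FISTA subproblem solver is given below, assuming access to a reduced Hessian-vector routine and to an upper bound on the largest eigenvalue of H_F (for instance 1/λ_min(P_k)^2, since H_k = P_k^{-1} ⊗ P_k^{-1}); the names and interface are ours, not the authors'.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_reduced_lasso(p_kF, g_F, hess_vec_F, lam, lipschitz, x0, iters=50):
    """FISTA applied to the reduced piecewise-quadratic model (11).
    hess_vec_F(v) returns H_F v without forming H_F; lipschitz bounds
    the largest eigenvalue of H_F."""
    x = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(iters):
        grad_smooth = g_F + hess_vec_F(y - p_kF)          # gradient of the smooth part
        x_new = soft_threshold(y - grad_smooth / lipschitz, lam / lipschitz)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)     # momentum step
        x, t = x_new, t_new
    return x
```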
60 4 An Orthant-Based Newton-CG Method. We now consider an orthant-based Newton method in which a quadratic model of F is minimized approximately using the conjugate gradient (CG) method. [sent-109, score-0.195]
61 This approach is attractive since, in addition to the usual advantages of CG (optimal Krylov iteration, flexibility), each CG iteration can be efficiently computed by exploiting the structure of the Hessian matrix in (7). [sent-110, score-0.04]
62 Phase I of the orthant-based Newton-CG method computes the matrix Z_k given in (9), which is used to identify an orthant face in R^{n^2}. [sent-111, score-0.426]
63 Having identified the current orthant face, phase II of the method constructs the quadratic model QF in the free variables, and computes an approximate solution by means of the conjugate gradient method, as described in Algorithm 1. [sent-113, score-0.652]
64 Algorithm 1 (Conjugate Gradient Method for Problem (14)). Input: gradient g, orthant indicator z, current iterate P0, maximum number of steps K, residual tolerance ε_r, and regularization parameter λ. [sent-114, score-0.388]
65 The search direction of the method is given by D = mat(d), where d denotes the output of Algorithm 1. [sent-116, score-0.116]
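The following sketch shows the kind of computation Algorithm 1 performs: plain conjugate gradients applied to H_F d = −(g_F + λ z_F) using only Hessian-vector products; the exact stopping rules and safeguards of the paper's Algorithm 1 are not reproduced, and the interface is our assumption.

```python
import numpy as np

def cg_reduced_newton(g_F, z_F, hess_vec_F, lam, max_steps=20, tol=1e-6):
    """Approximately solve H_F d = -(g_F + lam z_F) for the free variables."""
    b = -(g_F + lam * z_F)
    d = np.zeros_like(b)
    r = b.copy()                       # residual b - H_F d (d = 0 initially)
    p = r.copy()
    rs = float(r @ r)
    for _ in range(max_steps):
        if np.sqrt(rs) < tol:
            break
        Hp = hess_vec_F(p)
        alpha = rs / float(p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = float(r @ r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```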
66 If the trial point P_k + D lies outside the current orthant, we perform a projected line search to find a point in the current orthant that yields a decrease in F. [sent-119, score-0.4]
67 Let Π(·) denote the orthogonal projection onto the orthant face defined by Z_k, i.e., the operator that zeroes out every component whose sign disagrees with that orthant face.  (20) [sent-120, score-0.365]
68 The line search computes a steplength α_k to be the largest member of the sequence {1, 1/2, . . .} [sent-123, score-0.149]
69 such that F(Π(P_k + α_k D)) ≤ F(P_k) + σ ∇̃F(P_k)^T (Π(P_k + α_k D) − P_k),  (21) where σ ∈ (0, 1) is a given constant and ∇̃F denotes the minimum norm subgradient of F. [sent-129, score-0.051]
70 The new iterate is defined as Pk+1 = Π(Pk + αk D). [sent-130, score-0.065]
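The projection and the backtracking test (21) can be sketched as follows; the representation of the orthant face by (Z, free), the callables F and min_norm_subgrad, and the default constants are assumptions made here for illustration.

```python
import numpy as np

def project_onto_orthant_face(P, Z, free):
    """Our reading of Pi in (20): active entries, and free entries whose sign
    disagrees with Z, are set to zero; all other entries are kept."""
    return np.where(free & (P * Z >= 0), P, 0.0)

def projected_line_search(P, D, Z, free, F, min_norm_subgrad, sigma=1e-4, max_halvings=30):
    """Backtracking line search enforcing the sufficient-decrease test (21)."""
    g_tilde = min_norm_subgrad(P)
    f0 = F(P)
    alpha = 1.0
    for _ in range(max_halvings):
        P_trial = project_onto_orthant_face(P + alpha * D, Z, free)
        if F(P_trial) <= f0 + sigma * np.sum(g_tilde * (P_trial - P)):
            return P_trial
        alpha *= 0.5
    return P      # keep the current iterate if no acceptable step was found
```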
71 The conjugate gradient method requires computing matrix-vector products involving the reduced Hessian, HkF . [sent-131, score-0.149]
72 For our problem, we have H_kF (p_F − p_kF) = [H_k (p_F − p_kF, 0)]_F = [vec(P_k^{-1} mat((p_F − p_kF, 0)) P_k^{-1})]_F. [sent-132, score-0.124]
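A sketch of this reduced Hessian-vector product, which never forms the n^2 × n^2 matrix, is given below; the boolean-mask representation of the free set and the function name are our own choices.

```python
import numpy as np

def reduced_hessian_vec(P_inv, free, v_F):
    """Compute H_kF v_F via (P^{-1} kron P^{-1}) vec(V) = vec(P^{-1} V P^{-1}).
    P_inv is P_k^{-1}, free an n-by-n boolean mask of free entries, v_F the
    vector of free components (in the same ordering as the mask)."""
    V = np.zeros_like(P_inv)
    V[free] = v_F                      # zero-padded "mat" of the input vector
    W = P_inv @ V @ P_inv              # full Hessian-vector product, O(n^3)
    return W[free]                     # restrict back to the free entries
```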
73 The cost of performing K steps of the CG algorithm is O(Kn^3) operations, and K = n^2 steps are needed to guarantee an exact solution. [sent-134, score-0.052]
74 Our practical implementation computes a small number of CG steps relative to n, K = O(1), and as a result the search direction is not an accurate approximation of the true Newton step. [sent-135, score-0.143]
75 However, such inexact Newton steps achieve a good balance between the computational cost and the quality of the direction. [sent-136, score-0.081]
76 Let us consider an orthant-based method that minimizes the quadratic model (14), where HF is replaced by a limited memory BFGS matrix, which we denote by BF . [sent-139, score-0.178]
77 This matrix is not formed explicitly, but is defined in terms of the difference pairs y_k = g_{k+1} − g_k, s_k = vec(P_{k+1} − P_k).  (23) [sent-140, score-0.193]
78 It can be shown that the minimizer of the model Q_F is given by p*_F = p_F + B_F^{-1}(λ z_F − g_F), where B_F^{-1}(λ z_F − g_F) = (1/θ)(λ z_F − g_F) + (1/θ^2) R_F^T W (I − (1/θ) M W^T R_F R_F^T W)^{-1} M W^T R_F (λ z_F − g_F).  (24) [sent-142, score-0.051]
79 Here R_F is a matrix consisting of the set of unit vectors that span the subspace of free variables, θ is a scalar, W is an n^2 × 2t matrix containing the t most recent correction pairs (23), and M is a 2t × 2t matrix formed by inner products between the correction pairs. [sent-143, score-0.077]
80 The total cost of computing the minimizer p*_F is 2t^2|F| + 4t|F| operations, where |F| is the cardinality of F. [sent-144, score-0.051]
81 Since the memory parameter t in the quasi-Newton updating scheme is chosen to be a small number, say between 5 and 20, the cost of computing the subspace minimizer (24) is quite affordable. [sent-145, score-0.084]
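The linear algebra in (24) can be carried out as in the sketch below, which takes the compact-representation quantities as given; W_F stands for the rows of W belonging to the free variables (i.e., R_F^T W), and how W, M, and θ are assembled from the pairs (23) is omitted, so the interface is our assumption.

```python
import numpy as np

def lbfgs_subspace_minimizer(p_F, g_F, z_F, lam, W_F, M, theta):
    """Evaluate (24): p*_F = p_F + B_F^{-1}(lam z_F - g_F) with the compact
    limited-memory BFGS representation; only a 2t x 2t system is solved."""
    v = lam * z_F - g_F
    two_t = W_F.shape[1]
    A = np.eye(two_t) - (M @ (W_F.T @ W_F)) / theta
    inner = np.linalg.solve(A, M @ (W_F.T @ v))
    Binv_v = v / theta + (W_F @ inner) / theta**2
    return p_F + Binv_v
```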
82 A similar approach was taken in [25] for the related constrained inverse covariance sparsity problem. [sent-146, score-0.189]
83 We have noted above that OWL, which is an orthant-based quasi-Newton method, does not correctly approximate the minimizer (24). [sent-147, score-0.378]
84 6 Numerical Experiments. We generated test problems by first creating a random sparse inverse covariance matrix Σ^{-1}, and then sampling data to compute a corresponding non-sparse empirical covariance matrix S. [sent-149, score-0.356]
85 The number of samples used to compute the sample covariance matrix was 10n. [sent-154, score-0.118]
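A sketch of such a test-problem generator is shown below, in the spirit of the experiments described here; the sparsity pattern, diagonal shift, and all parameter names are our own construction, not the paper's generator.

```python
import numpy as np

def make_test_problem(n, density=0.02, num_samples_factor=10, seed=0):
    """Random sparse SPD inverse covariance Theta and empirical covariance S
    from 10n samples of N(0, inv(Theta))."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n, n))
    mask = rng.random((n, n)) < density
    A[mask] = rng.uniform(-1.0, 1.0, size=mask.sum())
    Theta = (A + A.T) / 2.0
    # Shift the diagonal to make the matrix safely positive definite.
    Theta += (np.abs(np.linalg.eigvalsh(Theta)).max() + 0.1) * np.eye(n)
    Sigma = np.linalg.inv(Theta)
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=num_samples_factor * n)
    S = np.cov(X, rowvar=False, bias=True)
    return Theta, S
```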
86 Orthant-based quasi-Newton method (see section 5); alternating linearization method [26]. [sent-163, score-0.111]
87 The NL-FISTA algorithm terminated the FISTA iteration when the minimum norm subgradient of the LASSO subproblem qF became less than 1/10 of the minimum norm subgradient of F at the previous step. [sent-170, score-0.142]
88 Let us compare the computational cost of the inner iteration techniques used in the Newton-like methods discussed in this paper. [sent-171, score-0.04]
89 (i) Applying K steps of the FISTA iteration requires O(Kn^3) operations. [sent-172, score-0.066]
90 (ii) Coordinate descent, as implemented in [1], requires O(Kn|F|) operations for K coordinate descent sweeps through the set of free variables; (iii) applying K_CG iterations of the CG method costs O(K_CG n^3) operations. [sent-173, score-0.186]
91 The algorithms were terminated when either 10n iterations were executed or the minimum norm subgradient of F was sufficiently small, i.e., below a prescribed tolerance. [sent-174, score-0.051]
92 The Newton-LASSO method with coordinate descent (NL-Coord) is the most efficient when the sparsity level is below 1%, but the methods introduced in this paper, NL-FISTA, OBN-CG and OBN-LBFGS, seem more robust and efficient for problems that are less sparse. [sent-181, score-0.139]
93 Based on these results, OBN-LBFGS appears to be the best choice as a universal solver for the covariance selection problem. [sent-182, score-0.118]
94 Convex optimization techniques for fitting sparse Gaussian graphical models. [sent-421, score-0.049]
95 Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. [sent-441, score-0.076]
96 SINCO-a greedy coordinate ascent method for sparse inverse covariance selection problem. [sent-479, score-0.319]
97 An inexact interior point method for L1-regularized sparse covariance selection. [sent-506, score-0.252]
98 An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. [sent-525, score-0.137]
99 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. [sent-530, score-0.098]
100 A coordinate gradient descent method for nonsmooth separable minimization. [sent-535, score-0.171]
wordName wordTfidf (topN-words)
[('pk', 0.426), ('orthant', 0.297), ('pkf', 0.285), ('quic', 0.264), ('vec', 0.217), ('pf', 0.204), ('alm', 0.192), ('fista', 0.171), ('newton', 0.151), ('card', 0.145), ('gk', 0.129), ('mat', 0.124), ('vecfk', 0.122), ('qk', 0.121), ('covariance', 0.118), ('qf', 0.113), ('ij', 0.113), ('cg', 0.112), ('zf', 0.108), ('zk', 0.106), ('cond', 0.099), ('quadratic', 0.09), ('pij', 0.089), ('fk', 0.087), ('hkf', 0.081), ('hessian', 0.077), ('free', 0.077), ('gf', 0.074), ('rk', 0.074), ('owl', 0.072), ('inverse', 0.071), ('ak', 0.07), ('face', 0.068), ('iterate', 0.065), ('yk', 0.064), ('descent', 0.058), ('nocedal', 0.057), ('iter', 0.057), ('inexact', 0.055), ('backtracking', 0.053), ('phase', 0.052), ('coordinate', 0.051), ('linearization', 0.051), ('minimizer', 0.051), ('subgradient', 0.051), ('sparse', 0.049), ('employs', 0.048), ('rf', 0.048), ('ista', 0.047), ('reduced', 0.044), ('search', 0.044), ('conjugate', 0.043), ('lasso', 0.042), ('piecewise', 0.042), ('direction', 0.042), ('figen', 0.041), ('kcg', 0.041), ('mwt', 0.041), ('obn', 0.041), ('psm', 0.041), ('sinco', 0.041), ('steplength', 0.041), ('vecf', 0.041), ('iteration', 0.04), ('matlab', 0.039), ('thresholding', 0.039), ('hk', 0.038), ('def', 0.036), ('alternating', 0.036), ('globalization', 0.036), ('smooth', 0.035), ('ibm', 0.034), ('steepest', 0.034), ('shrinkage', 0.034), ('line', 0.033), ('ipm', 0.033), ('active', 0.033), ('memory', 0.033), ('kronecker', 0.032), ('rn', 0.032), ('det', 0.032), ('gradient', 0.032), ('computes', 0.031), ('variables', 0.031), ('banerjee', 0.031), ('method', 0.03), ('nl', 0.03), ('watson', 0.03), ('xi', 0.029), ('hf', 0.028), ('el', 0.027), ('glasso', 0.027), ('scheinberg', 0.027), ('estimation', 0.027), ('iterative', 0.027), ('aspremont', 0.026), ('steps', 0.026), ('projected', 0.026), ('minimizes', 0.025), ('ti', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
2 0.2820864 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking
Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki
Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1
3 0.24386117 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki
Abstract: The 1 -regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing 1 -regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned. 1
4 0.1740265 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
5 0.15013637 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize_{x ∈ R^n} f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics. 1
6 0.1467246 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
7 0.12751178 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
8 0.097940974 358 nips-2012-Value Pursuit Iteration
9 0.086026601 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
10 0.084824942 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming
11 0.084103256 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
12 0.080505878 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
13 0.074481383 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
14 0.067177385 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion
15 0.06698668 206 nips-2012-Majorization for CRFs and Latent Likelihoods
16 0.065852158 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
17 0.064362943 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization
18 0.063083746 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
19 0.060551718 204 nips-2012-MAP Inference in Chains using Column Generation
20 0.059655722 254 nips-2012-On the Sample Complexity of Robust PCA
topicId topicWeight
[(0, 0.162), (1, 0.032), (2, 0.094), (3, -0.089), (4, 0.037), (5, 0.129), (6, -0.009), (7, -0.117), (8, 0.11), (9, -0.034), (10, -0.069), (11, 0.005), (12, -0.011), (13, 0.018), (14, 0.002), (15, 0.073), (16, -0.007), (17, 0.043), (18, 0.009), (19, -0.072), (20, 0.066), (21, 0.04), (22, 0.053), (23, 0.065), (24, -0.0), (25, -0.18), (26, 0.142), (27, -0.122), (28, -0.024), (29, -0.031), (30, -0.036), (31, 0.04), (32, 0.053), (33, -0.128), (34, 0.156), (35, -0.159), (36, -0.162), (37, 0.052), (38, -0.092), (39, 0.198), (40, -0.067), (41, 0.072), (42, -0.074), (43, 0.04), (44, -0.054), (45, -0.084), (46, 0.001), (47, 0.004), (48, 0.002), (49, 0.099)]
simIndex simValue paperId paperTitle
same-paper 1 0.94359839 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
2 0.69259983 140 nips-2012-Fusion with Diffusion for Robust Visual Tracking
Author: Yu Zhou, Xiang Bai, Wenyu Liu, Longin J. Latecki
Abstract: A weighted graph is used as an underlying structure of many algorithms like semisupervised learning and spectral clustering. If the edge weights are determined by a single similarity measure, then it hard if not impossible to capture all relevant aspects of similarity when using a single similarity measure. In particular, in the case of visual object matching it is beneficial to integrate different similarity measures that focus on different visual representations. In this paper, a novel approach to integrate multiple similarity measures is proposed. First pairs of similarity measures are combined with a diffusion process on their tensor product graph (TPG). Hence the diffused similarity of each pair of objects becomes a function of joint diffusion of the two original similarities, which in turn depends on the neighborhood structure of the TPG. We call this process Fusion with Diffusion (FD). However, a higher order graph like the TPG usually means significant increase in time complexity. This is not the case in the proposed approach. A key feature of our approach is that the time complexity of the diffusion on the TPG is the same as the diffusion process on each of the original graphs. Moreover, it is not necessary to explicitly construct the TPG in our framework. Finally all diffused pairs of similarity measures are combined as a weighted sum. We demonstrate the advantages of the proposed approach on the task of visual tracking, where different aspects of the appearance similarity between the target object in frame t − 1 and target object candidates in frame t are integrated. The obtained method is tested on several challenge video sequences and the experimental results show that it outperforms state-of-the-art tracking methods. 1
3 0.64445639 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki
Abstract: The 1 -regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing 1 -regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned. 1
4 0.56024688 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar
Abstract: We consider the composite log-determinant optimization problem, arising from the 1 regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost in solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure. 1
5 0.48003519 27 nips-2012-A quasi-Newton proximal splitting method
Author: Stephen Becker, Jalal Fadili
Abstract: A new result in convex analysis on the calculation of proximity operators in certain scaled norms is derived. We describe efficient implementations of the proximity calculation for a useful class of functions; the implementations exploit the piece-wise linear nature of the dual problem. The second part of the paper applies the previous result to acceleration of convex minimization problems, and leads to an elegant quasi-Newton method. The optimization method compares favorably against state-of-the-art alternatives. The algorithm has extensive applications including signal processing, sparse recovery and machine learning and classification. 1
6 0.47614953 282 nips-2012-Proximal Newton-type methods for convex optimization
7 0.44410449 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
8 0.4343029 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
9 0.41980368 327 nips-2012-Structured Learning of Gaussian Graphical Models
10 0.39284056 47 nips-2012-Augment-and-Conquer Negative Binomial Processes
11 0.3862502 254 nips-2012-On the Sample Complexity of Robust PCA
12 0.3776125 358 nips-2012-Value Pursuit Iteration
13 0.37475979 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
14 0.36510193 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
15 0.33314621 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
16 0.3268607 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes
17 0.32472751 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders
18 0.32429084 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression
19 0.3187767 125 nips-2012-Factoring nonnegative matrices with linear programs
20 0.31836051 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
topicId topicWeight
[(0, 0.042), (21, 0.011), (38, 0.553), (42, 0.013), (55, 0.012), (74, 0.016), (76, 0.155), (80, 0.067), (92, 0.023)]
simIndex simValue paperId paperTitle
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general Lp norm. We generalize the Cramer-Rao lower bound and show how the Lp loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing Lp loss in the limit as p goes to zero. 1
2 0.99408352 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1
3 0.99319774 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1
4 0.98891783 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
5 0.975537 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
6 0.97145379 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA
7 0.97048801 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
same-paper 8 0.96669024 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
9 0.95648426 208 nips-2012-Matrix reconstruction with the local max norm
10 0.94917786 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
11 0.9162448 285 nips-2012-Query Complexity of Derivative-Free Optimization
12 0.9116233 86 nips-2012-Convex Multi-view Subspace Learning
13 0.90070868 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
14 0.89755517 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
15 0.89654875 30 nips-2012-Accuracy at the Top
16 0.89414567 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
17 0.89122766 324 nips-2012-Stochastic Gradient Descent with Only One Projection
18 0.88840061 187 nips-2012-Learning curves for multi-task Gaussian process regression
19 0.88821727 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
20 0.88780993 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization