nips nips2012 nips2012-164 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki
Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. [sent-12, score-0.545]
2 In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. [sent-13, score-0.547]
3 Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. [sent-14, score-0.36]
4 G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. [sent-15, score-0.181]
5 This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. [sent-16, score-0.139]
6 Numerical convergence results and timing comparisons for the proposed method are presented. [sent-18, score-0.214]
7 A fundamental example is the problem of estimating the covariance matrix from a dataset of n samples {X^(i)}_{i=1}^n, drawn i.i.d. [sent-21, score-0.302]
8 from a p-dimensional, zero-mean Gaussian distribution with covariance matrix Σ ∈ S^p_{++}, X^(i) ∼ N_p(0, Σ), where S^p_{++} denotes the space of p × p symmetric, positive definite matrices. [sent-23, score-0.302]
9 When n ≥ p, the maximum likelihood covariance estimator Σ̂ is the sample covariance matrix S = (1/n) Σ_{i=1}^n X^(i) (X^(i))^T. [sent-24, score-0.541]
10 In this sample-deficient case (n < p), common throughout several modern applications such as genomics, finance, and the earth sciences, the matrix S is not invertible, and thus cannot be directly used to obtain a well-defined estimator for the inverse covariance matrix Ω := Σ^{-1}. [sent-26, score-0.559]
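A minimal numerical illustration of this point (not taken from the paper): with n < p the sample covariance has rank at most n, so it is singular.

```python
# Minimal sketch (not from the paper): with n < p the sample covariance S has
# rank at most n, so it is singular and cannot be inverted directly.
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 20                      # p variables, n < p samples
X = rng.standard_normal((n, p))    # rows are zero-mean samples
S = (X.T @ X) / n                  # sample covariance for zero-mean data
print(np.linalg.matrix_rank(S))    # at most n = 20 < p = 50, hence not invertible
```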
11 A related problem is the inference of a Gaussian graphical model ([27, 14]), that is, a sparsity pattern in the inverse covariance matrix, Ω. [sent-27, score-0.534]
12 Gaussian graphical models provide a powerful means of dimensionality reduction in high-dimensional data. [sent-28, score-0.09]
13 Moreover, such models allow for discovery of conditional independence relations between random variables since, for multivariate Gaussian data, sparsity in the inverse covariance matrix encodes conditional independences. [sent-29, score-0.507]
14 If a dataset, even one with n ≫ p, is drawn from a normal distribution with sparse inverse covariance matrix Ω, the inverse sample covariance matrix S^{-1} will almost surely be a dense matrix, although the estimates for those Ω_ij which are equal to 0 may be very small in magnitude. [sent-31, score-1.013]
15 As sparse estimates of Ω are more robust than S^{-1}, and since such sparsity may yield easily interpretable models, there exists significant impetus to perform sparse inverse covariance estimation in very high-dimensional, low-sample-size settings. [sent-32, score-0.794]
16 [1] proposed performing such sparse inverse covariance estimation by solving the ℓ1-penalized maximum likelihood estimation problem, Θ*_ρ = argmin_{Θ ∈ S^p_{++}} −log det Θ + ⟨S, Θ⟩ + ρ‖Θ‖_1, (1) where ρ > 0 is a penalty parameter, ⟨S, Θ⟩ = Tr(SΘ), and ‖Θ‖_1 = Σ_{i,j} |Θ_ij|. [sent-34, score-0.861]
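For concreteness, a small sketch of the objective in Problem (1); the function name and the Cholesky-based log-determinant are our own choices, not the paper's code.

```python
import numpy as np

def penalized_negloglik(Theta, S, rho):
    """-log det(Theta) + <S, Theta> + rho * sum_ij |Theta_ij|, for Theta in S^p_++."""
    try:
        L = np.linalg.cholesky(Theta)            # fails if Theta is not positive definite
    except np.linalg.LinAlgError:
        return np.inf                            # outside the positive definite cone
    logdet = 2.0 * np.log(np.diag(L)).sum()      # log det from the Cholesky factor
    return -logdet + np.sum(S * Theta) + rho * np.abs(Theta).sum()
```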
17 For ρ > 0, Problem (1) is strongly convex and hence has a unique solution, which lies in the positive definite cone S^p_{++} due to the log det term, and is hence invertible. [sent-35, score-0.241]
18 Moreover, the ℓ1 penalty induces sparsity in Θ*_ρ, as it is the closest convex relaxation of the ℓ0 penalty, ‖Θ‖_0 = Σ_{i,j} I(Θ_ij ≠ 0), where I(·) is the indicator function [5]. [sent-36, score-0.163]
19 The unique optimal point of problem (1), Θ*_ρ, is both invertible (for ρ > 0) and sparse (for sufficiently large ρ), and can be used as an inverse covariance matrix estimator. [sent-37, score-0.606]
20 In this paper, a proximal gradient method for solving Problem (1) is proposed. [sent-38, score-0.252]
21 The convergence rate s is provided explicitly in terms of S and ρ, and importantly, is related to the condition number of Θ∗ . [sent-40, score-0.156]
22 Section 4 contains the convergence proofs for this algorithm; these constitute the primary mathematical result of this paper. [sent-44, score-0.103]
23 2 Prior Work While several excellent general convex solvers exist (for example, [11] and [4]), these are not always adept at handling high-dimensional problems (i.e., problems with a very large number of variables). [sent-46, score-0.132]
24 As many modern datasets have several thousand variables, numerous authors have proposed efficient algorithms designed specifically to solve the ℓ1-penalized sparse maximum likelihood covariance estimation problem (1). [sent-49, score-0.545]
25 These can be broadly categorized as either primal or dual methods. [sent-50, score-0.289]
26 Following the literature, we refer to primal methods as those which directly solve Problem (1), yielding a concentration estimate. [sent-51, score-0.224]
27 Dual methods [1] yield a covariance matrix by solving the constrained problem, minimize_{U ∈ R^{p×p}} −log det(S + U) − p subject to ‖U‖_∞ ≤ ρ, (3) where the primal and dual variables are related by Θ = (S + U)^{-1}. [sent-52, score-0.682]
28 Both the primal and dual problems can be solved using block methods (also known as “row by row” methods), which sequentially optimize one row/column of the argument at each step until convergence. [sent-53, score-0.407]
29 The primal and dual block problems both reduce to ℓ1-penalized regressions, which can be solved very efficiently. [sent-54, score-0.354]
30 2.1 Dual Methods A number of dual methods for solving Problem (1) have been proposed in the literature. [sent-56, score-0.16]
31 [1] consider a block coordinate descent algorithm to solve the block dual problem, which reduces each optimization step to solving a box-constrained quadratic program. [sent-58, score-0.424]
32 [10] iteratively solve the lasso regression as described in [1], but do so using coordinate-wise descent. [sent-61, score-0.116]
33 Their widely used solver, known as the graphical lasso (glasso), is implemented on CRAN. [sent-62, score-0.168]
34 Global convergence rates of these block coordinate methods are unknown. [sent-63, score-0.216]
35 2.2 Primal Methods Interest in primal methods for solving Problem (1) has been growing for many reasons. [sent-68, score-0.243]
36 One important reason stems from the fact that convergence within a certain tolerance for the dual problem does not necessarily imply convergence within the same tolerance for the primal. [sent-69, score-0.481]
37 [24] proposes a similar method and shows a sublinear convergence rate. [sent-72, score-0.145]
38 Mazumder and Hastie [18] consider block-coordinate descent approaches for the primal problem, similar to the dual approach taken in [10]. [sent-73, score-0.289]
39 Mazumder and Agarwal [17] also solve the primal problem with block-coordinate descent, but at each iteration perform a partial as opposed to complete block optimization, resulting in a decreased computational complexity per iteration. [sent-74, score-0.331]
40 Convergence rates of these primal methods have not been considered in the literature and hence theoretical guarantees are not available. [sent-75, score-0.234]
41 [13] propose a second-order proximal point algorithm, called QUIC, which converges superlinearly in a neighborhood of the optimum. [sent-77, score-0.137]
42 3 Methodology In this section, the graphical iterative shrinkage thresholding algorithm (G-ISTA) for solving the primal problem (1) is presented. [sent-78, score-0.631]
43 A rich body of mathematical and numerical work exists for general iterative shrinkage thresholding and related methods; see, in particular, [3, 8, 19, 20, 21, 25]. [sent-79, score-0.373]
44 The above characterization suggests a method for optimizing problem (4) based on the iteration x_{t+1} = prox_{ζ_t g}(x_t − ζ_t ∇f(x_t)) (8) for some choice of step size ζ_t. [sent-85, score-0.095]
45 This simple method is referred to as an iterative shrinkage-thresholding algorithm (ISTA). [sent-86, score-0.198]
46 For a step size ζ_t ≤ 1/L, the ISTA iterates x_t are known to satisfy F(x_t) − F(x*) = O(1/t), ∀t, (9) where x* is some optimal point, which is to say, they converge to the space of optimal points at a sublinear rate. [sent-87, score-0.34]
47 If no Lipschitz constant L for ∇f is known, the same convergence result still holds for ζ_t chosen such that f(x_{t+1}) ≤ Q_{ζ_t}(x_{t+1}, x_t), (10) where Q_ζ(·, ·) : X × X → R is a quadratic approximation to f, defined by Q_ζ(x, y) = f(y) + ⟨x − y, ∇f(y)⟩ + (1/(2ζ))‖x − y‖². [sent-88, score-0.213]
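A sketch of one generic ISTA update with the backtracking test (10), assuming g = ρ‖·‖_1 so that the proximity operator is entrywise soft-thresholding; the helper names are ours.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximity operator of tau*||.||_1, applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista_step(x, f, grad_f, rho, zeta, c=0.5):
    """One update x+ = prox_{zeta*g}(x - zeta*grad f(x)), shrinking zeta by the
    factor c until f(x+) <= Q_zeta(x+, x) as in (10)."""
    fx, gx = f(x), grad_f(x)
    while True:
        x_new = soft_threshold(x - zeta * gx, zeta * rho)
        diff = x_new - x
        Q = fx + np.sum(diff * gx) + np.sum(diff * diff) / (2.0 * zeta)
        if f(x_new) <= Q:
            return x_new, zeta
        zeta *= c                                # backtrack
```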
48 The general ISTA iteration above can be adapted to the sparse inverse covariance estimation Problem (1). [sent-92, score-0.545]
49 Moreover, by Lemma 2 of the supplemental section, f has a Lipschitz continuous gradient when restricted to the compact domain aI ⪯ Θ ⪯ bI. [sent-99, score-0.155]
50 As in [3], the algorithm uses a backtracking line search for the choice of step size. [sent-105, score-0.101]
51 The procedure terminates when a prespecified duality gap is attained. [sent-106, score-0.107]
52 Note also that the positive definiteness check of Θ_{t+1} during Step (1) of Algorithm 1 is accomplished using a Cholesky decomposition, and the inverse of Θ_{t+1} is computed using that Cholesky factor. [sent-108, score-0.153]
53 Algorithm 1: G-ISTA for Problem (1). Input: sample covariance matrix S, penalty parameter ρ, tolerance ε, backtracking constant c ∈ (0, 1), initial step size ζ_{1,0}, initial iterate Θ_0. [sent-109, score-0.632]
54 (4) Compute duality gap: Δ = −log det(S + U_{t+1}) − p − log det Θ_{t+1} + ⟨S, Θ_{t+1}⟩ + ρ‖Θ_{t+1}‖_1, where (U_{t+1})_{i,j} = min{max{([Θ_{t+1}^{-1}]_{i,j} − S_{i,j}), −ρ}, ρ}. [sent-119, score-0.199]
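Since steps (1)-(3) of Algorithm 1 are not reproduced above, the following is only our sketch of a G-ISTA-style iteration consistent with the stated inputs and the duality-gap step (4): a proximal gradient step on f(Θ) = −log det Θ + ⟨S, Θ⟩ with entrywise soft-thresholding, a Cholesky-based positive definiteness check, and termination on the duality gap. The initializer and the simplified line search (backtracking only until positive definiteness) are assumptions, not the authors' exact procedure.

```python
import numpy as np

def g_ista_sketch(S, rho, eps=1e-5, zeta=1.0, c=0.5, max_iter=1000):
    p = S.shape[0]
    Theta = np.diag(1.0 / (np.diag(S) + rho))       # common diagonal initializer (assumption)
    for _ in range(max_iter):
        grad = S - np.linalg.inv(Theta)             # gradient of -log det(Theta) + <S, Theta>
        step = zeta
        while True:                                 # the paper also enforces condition (10); omitted here
            A = Theta - step * grad
            Theta_new = np.sign(A) * np.maximum(np.abs(A) - step * rho, 0.0)
            try:
                np.linalg.cholesky(Theta_new)       # positive definiteness check
                break
            except np.linalg.LinAlgError:
                step *= c                           # backtrack until Theta_new is positive definite
        Theta = Theta_new
        # Step (4): duality gap, with U obtained by clipping Theta^{-1} - S to [-rho, rho]
        U = np.clip(np.linalg.inv(Theta) - S, -rho, rho)
        sign, logdet_dual = np.linalg.slogdet(S + U)
        _, logdet_prim = np.linalg.slogdet(Theta)
        gap = -logdet_dual - p - logdet_prim + np.sum(S * Theta) + rho * np.abs(Theta).sum()
        if sign > 0 and gap < eps:
            break
    return Theta
```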
55 Choice of initial step size ζ_0: Each iteration of Algorithm 1 requires an initial step size, ζ_0. [sent-123, score-0.218]
56 However, in practice this choice of step is overly cautious; a much larger step can often be taken. [sent-125, score-0.106]
57 If a certain maximum number of backtracks does not result in an accepted step, G-ISTA takes the safe step, λ_min(Θ_t)². [sent-128, score-0.086]
58 Such a safe step can be obtained from λ_max(Θ_t^{-1}), which in turn can be quickly approximated using power iteration. [sent-129, score-0.1]
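A power-iteration sketch of this safe step: it estimates λ_max(Θ_t^{-1}) by repeatedly applying Θ_t^{-1} through linear solves, and returns λ_min(Θ_t)² = 1/λ_max(Θ_t^{-1})². The function name and iteration count are ours.

```python
import numpy as np

def safe_step(Theta, n_iter=20, seed=0):
    """Estimate lambda_min(Theta)^2 via power iteration on Theta^{-1}."""
    v = np.random.default_rng(seed).standard_normal(Theta.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = np.linalg.solve(Theta, v)                # apply Theta^{-1} without forming it
        v = w / np.linalg.norm(w)
    lam_max_inv = v @ np.linalg.solve(Theta, v)      # Rayleigh quotient ~ lambda_max(Theta^{-1})
    return 1.0 / lam_max_inv ** 2                    # = lambda_min(Theta)^2
```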
59 4 Convergence Analysis In this section, linear convergence of Algorithm 1 is discussed. [sent-130, score-0.103]
60 Let Θ_t (t = 0, 1, 2, ...) denote the iterates of Algorithm 1, and Θ*_ρ the optimal solution to Problem (1) for ρ > 0. [sent-134, score-0.135]
61 Assume that the iterates Θ_t of Algorithm 1 satisfy aI ⪯ Θ_t ⪯ bI, ∀t, for some constants 0 < a < b. [sent-137, score-0.219]
62 The step size ζ_t which yields an optimal worst-case contraction bound s(ζ_t) is ζ = 2/(a^{-2} + b^{-2}). [sent-140, score-0.157]
63 The optimal worst-case contraction bound corresponding to ζ = 2/(a^{-2} + b^{-2}) is given by s(ζ) := 1 − 2/(1 + b²/a²). Proof. [sent-142, score-0.104]
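A small worked check of these quantities (our helper, assuming bounds a and b are known): the better conditioned the iterates, the smaller the contraction factor and the faster the linear convergence.

```python
def contraction_bound(a, b):
    """Optimal step 2/(a^-2 + b^-2) and contraction s(zeta) = 1 - 2/(1 + b^2/a^2)."""
    zeta_opt = 2.0 / (a ** -2 + b ** -2)
    s = 1.0 - 2.0 / (1.0 + (b / a) ** 2)
    return zeta_opt, s

print(contraction_bound(1.0, 2.0))    # well conditioned:   s = 0.6    (fast linear rate)
print(contraction_bound(1.0, 10.0))   # poorly conditioned: s ~ 0.98   (slow linear rate)
```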
64 Note that linear convergence of proximal gradient methods for strongly convex objective functions in general has already been proven (see Supplemental section). [sent-144, score-0.37]
65 However, it is possible to specify the constants a and b to yield an explicit rate; this is done in Theorem 2. [sent-147, score-0.075]
66 Then the iterates Θ_t of Algorithm 1 satisfy αI ⪯ Θ_t ⪯ b′I, ∀t, with b′ = ‖Θ*_ρ‖_2 + ‖Θ_0 − Θ*_ρ‖_F ≤ β + √p(β − α). [sent-150, score-0.178]
67 These eigenvalue bounds on Θ_{t+1}, along with Theorem 1, provide a closed-form linear convergence rate for Algorithm 1. [sent-154, score-0.192]
68 Then for a constant step size ζ_t := ζ < α², the iterates of Algorithm 1 converge linearly with a rate of s(ζ) = 1 − 2α²/(α² + (β + √p(β − α))²) < 1. (18) Proof. [sent-158, score-0.241]
69 By Theorem 2, for ζ < α², the iterates Θ_t satisfy αI ⪯ Θ_t ⪯ (‖Θ*_ρ‖_2 + ‖Θ_0 − Θ*_ρ‖_F) I for all t. [sent-159, score-0.178]
70 Note that the contraction constant (equation 18) of Theorem 3 is closely related to the condition number of Θ*_ρ, κ(Θ*_ρ) = λ_max(Θ*_ρ)/λ_min(Θ*_ρ) ≤ β/α, as 1 − 2α²/(α² + (β + √p(β − α))²) ≥ 1 − 2α²/β² ≥ 1 − 2κ(Θ*_ρ)^{-2}. [sent-163, score-0.104]
71 ρ 5 Numerical Results In this section, we provide numerical results for the G-ISTA algorithm. [sent-165, score-0.075]
72 Section 5.3 compares running times of the G-ISTA, glasso [10], and QUIC [13] algorithms. [sent-169, score-0.224]
73 For a fixed p, a p-dimensional inverse covariance matrix Ω was generated with off-diagonal entries drawn i.i.d. [sent-174, score-0.515]
74 85 to simulate a very sparse and a somewhat sparse model). [sent-179, score-0.206]
75 Finally, a multiple of the identity was added to the resulting matrix so that the smallest eigenvalue was equal to 1. [sent-180, score-0.132]
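A sketch of this synthetic model; the exact sparsity level and entry distribution used in the paper are not fully recoverable from the text above, so the values below (sparsity, uniform range, dimensions) are placeholders.

```python
import numpy as np

def make_sparse_precision(p, nonzero_prob=0.15, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(p, p)) * (rng.random((p, p)) < nonzero_prob)
    Omega = np.triu(A, 1)
    Omega = Omega + Omega.T                                        # symmetric, sparse off-diagonal
    Omega += (1.0 - np.linalg.eigvalsh(Omega).min()) * np.eye(p)   # smallest eigenvalue equals 1
    return Omega

Omega = make_sparse_precision(200)
X = np.random.default_rng(1).multivariate_normal(
    np.zeros(200), np.linalg.inv(Omega), size=100)                 # n = 100 samples from N_p(0, Omega^{-1})
```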
76 Table 1: Timing comparisons for p = 2000 dimensional datasets, generated as in Section 5.1. [sent-273, score-0.116]
77 5.2 Demonstration of Convergence Rates The linear convergence rate derived for G-ISTA in Section 4 was shown to be heavily dependent on the conditioning of the final estimator. [sent-277, score-0.19]
78 To demonstrate these results, G-ISTA was run on a synthetic dataset, as described in Section 5.1. [sent-278, score-0.078]
79 For each value of ρ, the numerical optimum was computed to a duality gap of 10^{-10} using G-ISTA. [sent-287, score-0.182]
80 G-ISTA was then run again, and the Frobenius norm argument errors at each iteration were stored. [sent-294, score-0.076]
81 These errors were plotted on a log scale for each value of ρ to demonstrate the dependence of the convergence rate on condition number. [sent-295, score-0.156]
82 5.3 Timing Comparisons The G-ISTA, glasso, and QUIC algorithms were run on synthetic datasets (real datasets are presented in the Supplemental section) of varying p, n, and with different levels of regularization ρ. [sent-298, score-0.16]
83 All algorithms were run to ensure a fixed duality gap, here taken to be 10^{-5}. [sent-299, score-0.098]
84 The comparisons were run on a multicore processor, and it is important to note that the Cholesky factorizations and inversions used by G-ISTA take advantage of the multiple cores. [sent-308, score-0.09]
85 Figure: Frobenius-norm error versus iteration number t, demonstrating linear convergence rates of G-ISTA and the dependence of those rates on κ(Θ*_ρ). [sent-319, score-0.241]
86 On the other hand, the p²-dimensional lasso solve of QUIC and the p-dimensional lasso solve of glasso do not. [sent-321, score-0.516]
87 For this reason, and because Cholesky factorizations and inversions make up the bulk of the computation required by G-ISTA, the CPU time of G-ISTA was typically greater than its wall time by a factor of roughly 4. [sent-322, score-0.115]
88 6 Conclusion In this paper, a proximal gradient method was applied to the sparse inverse covariance problem. [sent-324, score-0.69]
89 Linear convergence was discussed, with a fixed closed-form rate. [sent-325, score-0.103]
90 Numerical results have also been presented, comparing G-ISTA to the widely-used glasso algorithm and the newer, but very fast, QUIC algorithm. [sent-326, score-0.224]
91 To conclude, although second-order methods for the sparse inverse covariance problem have recently been shown to perform well, simple first-order methods cannot be ruled out, as they can also be very competitive in many cases. [sent-333, score-0.495]
92 Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. [sent-338, score-0.153]
93 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. [sent-345, score-0.241]
94 Templates for convex cone problems with applications to sparse signal recovery. [sent-353, score-0.209]
95 Adaptive first-order methods for general sparse inverse covariance selection. [sent-425, score-0.495]
96 A flexible, scalable and efficient algorithmic framework for the Primal graphical lasso. [sent-429, score-0.09]
97 A method of solving a convex programming problem with convergence rate O(1/k²). [sent-435, score-0.103]
98 A note on the lack of symmetry in the graphical lasso. [sent-455, score-0.09]
99 Model selection and estimation in the Gaussian graphical model. [sent-486, score-0.14]
100 Alternating direction method of multipliers for covariance selection models. [sent-491, score-0.239]
wordName wordTfidf (topN-words)
[('quic', 0.37), ('nnz', 0.295), ('covariance', 0.239), ('glasso', 0.224), ('sp', 0.201), ('primal', 0.186), ('inverse', 0.153), ('proximal', 0.137), ('det', 0.135), ('iterates', 0.135), ('stanford', 0.124), ('ista', 0.114), ('yurii', 0.112), ('shrinkage', 0.11), ('mazumder', 0.104), ('contraction', 0.104), ('dual', 0.103), ('sparse', 0.103), ('convergence', 0.103), ('thresholding', 0.1), ('supplemental', 0.097), ('cholesky', 0.093), ('graphical', 0.09), ('iterative', 0.088), ('tolerance', 0.086), ('lasso', 0.078), ('numerical', 0.075), ('banerjee', 0.074), ('bala', 0.074), ('rolfs', 0.074), ('convex', 0.072), ('siam', 0.069), ('hsieh', 0.068), ('xt', 0.067), ('lieven', 0.066), ('proxg', 0.066), ('block', 0.065), ('wall', 0.065), ('duality', 0.064), ('matrix', 0.063), ('zhaosong', 0.061), ('rahul', 0.061), ('issn', 0.061), ('dimensional', 0.06), ('rp', 0.06), ('lipschitz', 0.059), ('gradient', 0.058), ('np', 0.057), ('solving', 0.057), ('comparisons', 0.056), ('tr', 0.056), ('timing', 0.055), ('yuan', 0.053), ('rate', 0.053), ('step', 0.053), ('stephen', 0.052), ('sparsity', 0.052), ('scheinberg', 0.05), ('inversions', 0.05), ('prox', 0.05), ('estimation', 0.05), ('ij', 0.049), ('cpu', 0.049), ('backtracking', 0.048), ('invertible', 0.048), ('aspremont', 0.048), ('rates', 0.048), ('safe', 0.047), ('ca', 0.044), ('proximity', 0.044), ('synthetic', 0.044), ('quadratic', 0.043), ('satisfy', 0.043), ('gap', 0.043), ('sublinear', 0.042), ('iteration', 0.042), ('datasets', 0.041), ('nesterov', 0.041), ('modern', 0.041), ('lemma', 0.041), ('constants', 0.041), ('jonathan', 0.04), ('benjamin', 0.039), ('accepted', 0.039), ('penalty', 0.039), ('theorem', 0.039), ('solve', 0.038), ('core', 0.037), ('eigenvalue', 0.036), ('min', 0.035), ('initial', 0.035), ('cone', 0.034), ('iterate', 0.034), ('conditioning', 0.034), ('yield', 0.034), ('run', 0.034), ('el', 0.033), ('identity', 0.033), ('numerous', 0.033), ('katya', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki
Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.
2 0.28949249 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar
Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost in solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.
3 0.24386117 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
4 0.20474271 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
Author: Aaron Defazio, Tibério S. Caetano
Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.
5 0.1962485 282 nips-2012-Proximal Newton-type methods for convex optimization
Author: Jason Lee, Yuekai Sun, Michael Saunders
Abstract: We seek to solve convex optimization problems in composite form: minimize_{x ∈ R^n} f(x) := g(x) + h(x), where g is convex and continuously differentiable and h : R^n → R is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. We prove such methods are globally convergent and achieve superlinear rates of convergence in the vicinity of an optimal solution. We also demonstrate the performance of these methods using problems of relevance in machine learning and statistics.
6 0.19181187 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
7 0.19025955 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
8 0.17783482 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
9 0.16320577 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
10 0.15280752 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
11 0.14462809 247 nips-2012-Nonparametric Reduced Rank Regression
12 0.1442789 319 nips-2012-Sparse Prediction with the $k$-Support Norm
13 0.13754183 254 nips-2012-On the Sample Complexity of Robust PCA
14 0.13358642 27 nips-2012-A quasi-Newton proximal splitting method
15 0.13136895 324 nips-2012-Stochastic Gradient Descent with Only One Projection
16 0.12150612 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
17 0.11853036 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
18 0.10574768 327 nips-2012-Structured Learning of Gaussian Graphical Models
19 0.097444095 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
20 0.095374756 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
topicId topicWeight
[(0, 0.255), (1, 0.073), (2, 0.211), (3, -0.124), (4, 0.073), (5, 0.171), (6, -0.037), (7, -0.192), (8, 0.016), (9, -0.083), (10, -0.071), (11, -0.034), (12, -0.057), (13, 0.117), (14, 0.102), (15, 0.124), (16, -0.007), (17, 0.012), (18, -0.009), (19, -0.092), (20, 0.136), (21, 0.039), (22, 0.017), (23, 0.009), (24, -0.043), (25, -0.202), (26, 0.141), (27, -0.09), (28, -0.022), (29, 0.01), (30, 0.033), (31, -0.037), (32, -0.005), (33, -0.123), (34, 0.096), (35, -0.05), (36, -0.108), (37, 0.013), (38, -0.023), (39, -0.049), (40, 0.035), (41, 0.106), (42, -0.068), (43, -0.005), (44, -0.068), (45, -0.016), (46, 0.007), (47, -0.036), (48, 0.048), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95223695 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki
Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log(1/ε)) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.
2 0.837843 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen
Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1
3 0.75448805 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar
Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost in solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.
4 0.69643819 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
Author: Piyush Rai, Abhishek Kumar, Hal Daume
Abstract: Multiple-output regression models require estimating multiple parameters, one for each output. Structural regularization is usually employed to improve parameter estimation in such models. In this paper, we present a multiple-output regression model that leverages the covariance structure of the latent model parameters as well as the conditional covariance structure of the observed outputs. This is in contrast with existing methods that usually take into account only one of these structures. More importantly, unlike some of the other existing methods, none of these structures need be known a priori in our model, and are learned from the data. Several previously proposed structural regularization based multiple-output regression models turn out to be special cases of our model. Moreover, in addition to being a rich model for multiple-output regression, our model can also be used in estimating the graphical model structure of a set of variables (multivariate outputs) conditioned on another set of variables (inputs). Experimental results on both synthetic and real datasets demonstrate the effectiveness of our method. 1
5 0.68075889 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin
Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.
6 0.68018544 327 nips-2012-Structured Learning of Gaussian Graphical Models
7 0.66390985 254 nips-2012-On the Sample Complexity of Robust PCA
8 0.65088993 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
9 0.59506953 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
10 0.5707534 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
11 0.5452342 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
12 0.54250842 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
13 0.53305197 282 nips-2012-Proximal Newton-type methods for convex optimization
14 0.53031582 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
15 0.52400804 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
16 0.52309114 27 nips-2012-A quasi-Newton proximal splitting method
17 0.51751894 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
18 0.51192313 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
19 0.50969732 86 nips-2012-Convex Multi-view Subspace Learning
20 0.50867379 319 nips-2012-Sparse Prediction with the $k$-Support Norm
topicId topicWeight
[(0, 0.052), (38, 0.155), (42, 0.013), (74, 0.018), (76, 0.616), (80, 0.038), (92, 0.025)]
simIndex simValue paperId paperTitle
Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch
Abstract: We propose an efficient, generalized, nonparametric, statistical Kolmogorov-Smirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods.
2 0.99176139 286 nips-2012-Random Utility Theory for Social Choice
Author: Hossein Azari, David Parks, Lirong Xia
Abstract: Random utility theory models an agent’s preferences on alternatives by drawing a real-valued score on each alternative (typically independently) from a parameterized distribution, and then ranking the alternatives according to scores. A special case that has received significant attention is the Plackett-Luce model, for which fast inference methods for maximum likelihood estimators are available. This paper develops conditions on general random utility models that enable fast inference within a Bayesian framework through MC-EM, providing concave loglikelihood functions and bounded sets of global maxima solutions. Results on both real-world and simulated data provide support for the scalability of the approach and capability for model selection among general random utility models including Plackett-Luce. 1
3 0.98799282 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
Author: Weiwei Cheng, Willem Waegeman, Volkmar Welker, Eyke Hüllermeier
Abstract: Several machine learning methods allow for abstaining from uncertain predictions. While being common for settings like conventional classification, abstention has been studied much less in learning to rank. We address abstention for the label ranking setting, allowing the learner to declare certain pairs of labels as being incomparable and, thus, to predict partial instead of total orders. In our method, such predictions are produced via thresholding the probabilities of pairwise preferences between labels, as induced by a predicted probability distribution on the set of all rankings. We formally analyze this approach for the Mallows and the Plackett-Luce model, showing that it produces proper partial orders as predictions and characterizing the expressiveness of the induced class of partial orders. These theoretical results are complemented by experiments demonstrating the practical usefulness of the approach. 1
4 0.98700684 28 nips-2012-A systematic approach to extracting semantic information from functional MRI data
Author: Francisco Pereira, Matthew Botvinick
Abstract: This paper introduces a novel classification method for functional magnetic resonance imaging datasets with tens of classes. The method is designed to make predictions using information from as many brain locations as possible, instead of resorting to feature selection, and does this by decomposing the pattern of brain activation into differently informative sub-regions. We provide results over a complex semantic processing dataset that show that the method is competitive with state-of-the-art feature selection and also suggest how the method may be used to perform group or exploratory analyses of complex class structure. 1
5 0.98591226 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
6 0.98493123 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature
7 0.9839893 205 nips-2012-MCMC for continuous-time discrete-state systems
same-paper 8 0.97894973 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
9 0.96602809 247 nips-2012-Nonparametric Reduced Rank Regression
10 0.95554942 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
11 0.91442776 338 nips-2012-The Perturbed Variation
12 0.90552711 142 nips-2012-Generalization Bounds for Domain Adaptation
13 0.89653957 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC
14 0.89288092 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
15 0.8922717 327 nips-2012-Structured Learning of Gaussian Graphical Models
16 0.89049011 291 nips-2012-Reducing statistical time-series problems to binary classification
17 0.88818997 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
18 0.88801444 317 nips-2012-Smooth-projected Neighborhood Pursuit for High-dimensional Nonparanormal Graph Estimation
19 0.88750559 222 nips-2012-Multi-Task Averaging
20 0.88665181 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points