nips nips2010 nips2010-248 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. [sent-5, score-0.316]
2 , the lack of an edge between i and j denotes the conditional independence of y^(i) and y^(j), which corresponds to a zero entry in the inverse covariance matrix Σ^{−1} ([1]). [sent-18, score-0.173]
3 Note that (1) can be rewritten as min_{X∈S^n_{++}} max_{∥U∥_∞≤ρ} − log det X + ⟨Σ̂ + U, X⟩, where ∥U∥_∞ is the largest absolute value of the entries of U. [sent-26, score-0.087]
4 By exchanging the order of max and min, we obtain the dual problem max_{∥U∥_∞≤ρ} min_{X∈S^n_{++}} − log det X + ⟨Σ̂ + U, X⟩, which is equivalent to max_{W∈S^n_{++}} {log det W + n : ∥W − Σ̂∥_∞ ≤ ρ}. (2) [sent-27, score-0.234]
5 Both the primal and dual problems have strictly convex objectives; hence, their optimal solutions are unique. [sent-28, score-0.203]
6 Given a dual solution W, X = W^{−1} is primal feasible, resulting in the duality gap gap := ⟨Σ̂, W^{−1}⟩ + ρ∥W^{−1}∥_1 − n. (3) [sent-29, score-0.316]
7 The primal and the dual SICS problems (1) and (2) are semidefinite programming problems and can be solved via interior point methods (IPMs) in polynomial time. [sent-30, score-0.152]
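For concreteness, the gap (3) can be evaluated directly from any dual feasible W; the minimal numpy sketch below is illustrative (the names duality_gap, Sigma_hat, W and rho are placeholders, not from the paper's code).

```python
import numpy as np

def duality_gap(Sigma_hat, W, rho):
    """Duality gap (3): <Sigma_hat, W^{-1}> + rho * ||W^{-1}||_1 - n,
    where X = W^{-1} is the primal point induced by the dual solution W."""
    n = W.shape[0]
    X = np.linalg.inv(W)  # primal feasible point
    return np.sum(Sigma_hat * X) + rho * np.abs(X).sum() - n
```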
8 [7] proposed a block coordinate descent (BCD) method to solve the dual problem (2). [sent-34, score-0.121]
9 Their method updates one row and one column of W in each iteration by solving a convex quadratic programming problem by an IPM. [sent-35, score-0.14]
10 [5] is based on the same BCD approach as in [7], but it solves each subproblem as a LASSO problem by yet another coordinate descent (CD) method [9]. [sent-37, score-0.086]
11 [10] proposed solving the primal problem (1) by using a BCD method. [sent-39, score-0.118]
12 They formulate the subproblem as a min-max problem and solve it using a prox method proposed by Nemirovski [11]. [sent-40, score-0.154]
13 The SINCO method proposed by Scheinberg and Rish [12] is a greedy CD method applied to the primal problem. [sent-41, score-0.134]
14 All of these BCD and CD approaches lack iteration complexity bounds. [sent-42, score-0.096]
15 A projected gradient method for solving the dual problem (2) that is considered to be state-of-the-art for SICS was proposed by Duchi et al. [sent-44, score-0.127]
16 However, there are no iteration complexity results for it either. [sent-46, score-0.096]
17 Variants of Nesterov’s method [14, 15] have been applied to solve the SICS problem. [sent-47, score-0.061]
18 [16] applied Nesterov’s optimal first-order method to solve the primal problem (1) after smoothing the nonsmooth ℓ1 term, obtaining an iteration complexity bound of O(1/ϵ) for an ϵ-optimal solution, but the implementation in [16] was very slow and did not produce good results. [sent-49, score-0.269]
19 Lu [17] solved the dual problem (2), which is a smooth problem, by Nesterov's algorithm, and improved the iteration complexity to O(1/√ϵ). [sent-50, score-0.18]
20 Yuan [18] proposed an alternating direction method based on an augmented Lagrangian framework (see the ADAL method (8) below). [sent-53, score-0.195]
21 Also, there is no iteration complexity bound for this algorithm. [sent-57, score-0.116]
22 In this paper, we propose an alternating linearization method (ALM) for solving the primal SICS problem. [sent-60, score-0.361]
23 An advantage of solving the primal problem is that the ℓ1 penalty term in the objective function directly promotes sparsity in the optimal inverse covariance matrix. [sent-61, score-0.306]
24 Both methods exploit the special form of the primal problem (1) by alternatingly minimizing one of the terms of the objective function plus an approximation to the other term. [sent-63, score-0.124]
25 As we will show, our method has a theoretically justified interpretation and is based on an algorithmic framework with complexity bounds, while no complexity bound is available for Yuan’s method. [sent-65, score-0.131]
26 Extensive numerical test results on both synthetic data and real problems have shown that our ALM algorithm significantly outperforms other existing algorithms, such as the PSM algorithm proposed by Duchi et al. [sent-67, score-0.083]
27 In Section 2 we briefly review alternating linearization methods for minimizing the sum of two convex functions and establish convergence and iteration complexity results. [sent-71, score-0.341]
28 Finally, we present some numerical results on both synthetic and real data in Section 4 and compare ALM with the PSM algorithm [13] and the VSM algorithm [17]. [sent-73, score-0.063]
29 2 Alternating Linearization Methods. We consider here the alternating linearization method (ALM) for solving the following problem: min_x F(x) ≡ f(x) + g(x), (4) where f and g are both convex functions. [sent-74, score-0.292]
30 , to rewrite (4) as min_{x,y} {f(x) + g(y) : x − y = 0}, (5) and apply an alternating direction augmented Lagrangian method to it. [sent-77, score-0.174]
31 Given a penalty parameter 1/µ, at the k-th iteration, the augmented Lagrangian method minimizes the augmented Lagrangian function L(x, y; λ) := f(x) + g(y) − ⟨λ, x − y⟩ + (1/(2µ))∥x − y∥² with respect to x and y, i.e., [sent-78, score-0.117]
32 it solves the subproblem (x^k, y^k) := arg min_{x,y} L(x, y; λ^k), (6) and updates the Lagrange multiplier λ via λ^{k+1} := λ^k − (x^k − y^k)/µ. (7) [sent-80, score-0.125]
33 Since minimizing L(x, y; λ) with respect to x and y jointly is usually difficult, while doing so with respect to x and y alternatingly can often be done efficiently, the following alternating direction version of the augmented Lagrangian method (ADAL) is often advocated (see, e.g., [sent-81, score-0.206]
34 [20, 21]): x^{k+1} := arg min_x L(x, y^k; λ^k); y^{k+1} := arg min_y L(x^{k+1}, y; λ^k); λ^{k+1} := λ^k − (x^{k+1} − y^{k+1})/µ. (8) [sent-83, score-0.232]
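A minimal sketch of the ADAL iteration (8) for the splitting (5); argmin_x_L and argmin_y_L stand for user-supplied subproblem solvers and are assumptions of this sketch, not routines from the paper.

```python
def adal(argmin_x_L, argmin_y_L, y0, lam0, mu, iters=100):
    """Alternating direction augmented Lagrangian method (8) for
    min_{x,y} f(x) + g(y) subject to x - y = 0, with penalty parameter 1/mu."""
    y, lam = y0, lam0
    for _ in range(iters):
        x = argmin_x_L(y, lam, mu)    # x^{k+1} := argmin_x L(x, y^k; lam^k)
        y = argmin_y_L(x, lam, mu)    # y^{k+1} := argmin_y L(x^{k+1}, y; lam^k)
        lam = lam - (x - y) / mu      # lam^{k+1} := lam^k - (x^{k+1} - y^{k+1}) / mu
    return x, y, lam
```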
35 If we also update λ after we solve the subproblem with respect to x, we get the following symmetric version of the ADAL method. [sent-84, score-0.105]
36 x^{k+1} := arg min_x L(x, y^k; λ_y^k); λ_x^{k+1} := λ_y^k − (x^{k+1} − y^k)/µ; y^{k+1} := arg min_y L(x^{k+1}, y; λ_x^{k+1}); λ_y^{k+1} := λ_x^{k+1} − (x^{k+1} − y^{k+1})/µ. (9) [sent-85, score-0.232]
37 Substituting these relations into (9), we obtain the following equivalent algorithm for solving (4), which we refer to as the alternating linearization minimization (ALM) algorithm. [sent-88, score-0.269]
38 Algorithm 1 Alternating linearization method (ALM) for a smooth problem. Input: x^0 = y^0; for k = 0, 1, · · · do: [sent-89, score-0.162]
39 1. Solve x^{k+1} := arg min_x Q_g(x, y^k) ≡ f(x) + g(y^k) + ⟨∇g(y^k), x − y^k⟩ + (1/(2µ))∥x − y^k∥²; 2. Solve y^{k+1} := arg min_y Q_f(x^{k+1}, y), defined analogously with f linearized at x^{k+1}. [sent-90, score-0.110]
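Each subproblem of Algorithm 1 is a proximal step: minimizing Q_g(x, y^k) over x is the same as computing prox_{µf}(y^k − µ∇g(y^k)), and the second step mirrors it with f and g exchanged. A sketch under that reading, with prox_f, prox_g, grad_f, grad_g as placeholder callables:

```python
def alm_smooth(prox_f, prox_g, grad_f, grad_g, x0, mu, iters=100):
    """Algorithm 1 written with proximal operators: each linearized
    subproblem reduces to a prox step taken at a gradient point."""
    x = y = x0
    for _ in range(iters):
        x = prox_f(y - mu * grad_g(y), mu)  # step 1: minimize Q_g(x, y^k)
        y = prox_g(x - mu * grad_f(x), mu)  # step 2: minimize Q_f(x^{k+1}, y)
    return y
```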
40 Algorithm 2 Alternating linearization method with skipping step. Input: x^0 = y^0; for k = 0, 1, · · · do: [sent-98, score-0.166]
41 1. Solve x^{k+1} := arg min_x Q(x, y^k) ≡ f(x) + g(y^k) − ⟨λ^k, x − y^k⟩ + (1/(2µ))∥x − y^k∥²; [remaining steps not recovered by the extraction] end for. [sent-99, score-0.110]
42 Algorithm 2 is identical to the symmetric ADAL algorithm (9) as long as F(x^{k+1}) ≤ Q(x^{k+1}, y^k) at each iteration (and to Algorithm 1 if g(x) is in C^{1,1} and µ ≤ 1/max{L(f), L(g)}). [sent-104, score-0.051]
43 Algorithm 2 has the following convergence property and iteration complexity bound. [sent-106, score-0.096]
44 For β/L(f) ≤ µ ≤ 1/L(f) where 0 < β ≤ 1, Algorithm 2 satisfies F(y^k) − F(x*) ≤ ∥x^0 − x*∥²/(2µ(k + k_n)), ∀k, (11) where x* is an optimal solution of (4) and k_n is the number of iterations up to the k-th iteration for which F(x^{k+1}) ≤ Q(x^{k+1}, y^k). [sent-111, score-0.088]
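Rearranging (11) makes the O(1/ϵ) rate quoted in the abstract explicit; this is a direct consequence of the theorem under its step-size condition µ ≥ β/L(f):

```latex
F(y^k) - F(x^*) \;\le\; \frac{\|x^0 - x^*\|^2}{2\mu\,(k + k_n)} \;\le\; \epsilon
\qquad\text{whenever}\qquad
k + k_n \;\ge\; \frac{\|x^0 - x^*\|^2}{2\mu\,\epsilon}
\;=\; O\!\left(\frac{L(f)\,\|x^0 - x^*\|^2}{\beta\,\epsilon}\right).
```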
45 Note that the iteration complexity bound in Theorem 2. [sent-117, score-0.116]
46 Nesterov [15, 22] proved that one can obtain an optimal iteration complexity bound of O(1/√ϵ), using only first-order information. [sent-119, score-0.116]
47 A similar technique can be adopted to derive a fast version of Algorithm 2 that has an improved complexity bound of O(1/√ϵ), while keeping the computational effort in each iteration almost unchanged. [sent-123, score-0.136]
48 Moreover, we can apply Algorithm 2 and obtain the complexity bound in Theorem 2. [sent-130, score-0.065]
49 Without the constraint Y ∈ C, only a matrix shrinkage operation is needed, but with this additional constraint the problem becomes harder to solve. [sent-140, score-0.106]
50 Instead of imposing the constraint Y ∈ C, we can obtain feasible solutions by a line search on µ. [sent-143, score-0.059]
51 Hence, if we start the algorithm with X ≽ αI and restrict the step size µ to be sufficiently small, then the iterates of the method will remain in C. [sent-145, score-0.07]
52 Algorithm 3 Alternating linearization method (ALM) for SICS. Input: X^0 = Y^0, µ_0. [sent-150, score-0.138]
53 The first-order optimality conditions for Step 1 in Algorithm 3, ignoring the constraint X ∈ C, are: ∇f(X) − Λ^k + (X − Y^k)/µ_{k+1} = 0. [sent-160, score-0.053]
54 When the constraint X ∈ C is imposed, the optimal solution changes to X^{k+1} := V Diag(γ)V^⊤ with γ_i = max{α/2, (d_i + √(d_i² + 4µ_{k+1}))/2}, i = 1, . . . , n. [sent-166, score-0.065]
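A sketch of this Step 1 computation, assuming f(X) = −log det X + ⟨Σ̂, X⟩ so that ∇f(X) = Σ̂ − X^{−1}; plugging this into the optimality condition above gives X/µ_{k+1} − X^{−1} = Λ^k − Σ̂ + Y^k/µ_{k+1}, which one spectral decomposition solves. The matrix B below and the helper name x_step are reconstructions, not the paper's notation.

```python
import numpy as np

def x_step(Sigma_hat, Y, Lam, mu, alpha=0.0):
    """Step 1 of Algorithm 3 (sketch): solve grad f(X) - Lam + (X - Y)/mu = 0
    with f(X) = -logdet(X) + <Sigma_hat, X>, via one eigendecomposition."""
    B = mu * (Lam - Sigma_hat) + Y
    d, V = np.linalg.eigh(B)                        # B = V diag(d) V^T
    gamma = (d + np.sqrt(d ** 2 + 4.0 * mu)) / 2.0  # positive root of g^2 - d*g - mu = 0
    if alpha > 0.0:
        gamma = np.maximum(gamma, alpha / 2.0)      # eigenvalue floor when X in C is imposed
    return (V * gamma) @ V.T
```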
55 Since g(Y) = ρ∥Y∥_1, it is well known that the solution to (16) is given by Y^{k+1} = shrink(X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})^{−1}), µ_{k+1}ρ), (16) where the “shrinkage operator” shrink(Z, ρ) updates each element Z_ij of the matrix Z by the formula shrink(Z, ρ)_ij = sgn(Z_ij) · max{|Z_ij| − ρ, 0}. [sent-173, score-0.077]
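A sketch of the shrinkage operator and the resulting Y-update (16) in numpy; in practice (X^{k+1})^{−1} would be reused from the spectral decomposition of Step 1 rather than recomputed with a dense inverse, so the version below is only illustrative.

```python
import numpy as np

def shrink(Z, tau):
    """Elementwise soft-thresholding: sgn(Z_ij) * max(|Z_ij| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def y_step(Sigma_hat, X_new, mu, rho):
    """Y^{k+1} = shrink(X^{k+1} - mu * (Sigma_hat - (X^{k+1})^{-1}), mu * rho), i.e. (16)."""
    grad_f = Sigma_hat - np.linalg.inv(X_new)  # gradient of -logdet(X) + <Sigma_hat, X>
    return shrink(X_new - mu * grad_f, mu * rho)
```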
56 The O(n^3) complexity of Step 1, which requires a spectral decomposition, dominates the O(n^2) complexity of Step 2 which requires a simple shrinkage. [sent-174, score-0.115]
57 There is no closed-form solution for the subproblem corresponding to Y when the constraint Y ∈ C is imposed. [sent-175, score-0.13]
58 Thus, the resulting iterates Y k may not be positive definite, while the iterates X k remain so. [sent-177, score-0.098]
59 Eventually due to the convergence of Y k and X k , the Y k iterates become positive definite and the constraint Y ∈ C is satisfied. [sent-178, score-0.08]
60 The two steps of the algorithm can be written as X^{k+1} := arg min_{X∈C} {f(X) + (1/(2µ_{k+1}))∥X − (Y^k + µ_{k+1}Λ^k)∥²_F} (17) and Y^{k+1} := arg min_Y {g(Y) + (1/(2µ_{k+1}))∥Y − (X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})^{−1}))∥²_F}. (18) [sent-181, score-0.124]
61 The SICS problem is trying to optimize two conflicting objectives: on the one hand it tries to find a covariance matrix X^{−1} that best fits the observed data, i.e., [sent-182, score-0.109]
62 is as close to Σ̂ as possible, and on the other hand it tries to obtain a sparse matrix X. [sent-184, score-0.074]
63 The proposed algorithm addresses these two objectives in an alternating manner. [sent-185, score-0.128]
64 Given an initial “guess” of the sparse matrix Y k we update this guess by a subgradient descent step of length µk+1 : Y k + µk+1 Λk . [sent-186, score-0.138]
65 Then problem (17) seeks a solution X that optimizes the first objective (best fit of the data) while adding a regularization term which imposes a Gaussian prior on X whose mean is the current guess for the sparse matrix: Y k + µk+1 Λk . [sent-188, score-0.13]
66 The solution to (17) gives us a guess for the inverse covariance X k+1 . [sent-189, score-0.229]
67 Then problem (18) seeks a sparse solution Y while also imposing a Gaussian prior on Y whose mean is the guess for the inverse covariance matrix, X^{k+1} − µ_{k+1}(Σ̂ − (X^{k+1})^{−1}). [sent-191, score-0.303]
68 Hence the sequence of X^k's is a sequence of positive definite inverse covariance matrices that converges to a sparse matrix, while the sequence of Y^k's is a sequence of sparse matrices that converges to a positive definite inverse covariance matrix. [sent-192, score-0.398]
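Putting the pieces together, a minimal sketch of the alternating loop, reusing the x_step and shrink helpers sketched above. The multiplier update Λ^{k+1} := ∇f(X^{k+1}) − (X^{k+1} − Y^{k+1})/µ is a reconstruction from the optimality condition of the Y-step (it guarantees −Λ^{k+1} ∈ ∂(ρ∥Y^{k+1}∥_1), as used below for the duality gap); the paper's µ-update schedule and stopping rules are omitted here.

```python
import numpy as np

def alm_sics(Sigma_hat, rho, mu=1.0, alpha=1e-6, iters=200):
    """Sketch of an ALM loop for the primal SICS problem via steps (17)-(18)."""
    n = Sigma_hat.shape[0]
    X = np.eye(n)
    Y = np.eye(n)
    Lam = np.zeros((n, n))
    for _ in range(iters):
        X = x_step(Sigma_hat, Y, Lam, mu, alpha)   # (17): data-fitting step
        grad_f = Sigma_hat - np.linalg.inv(X)
        Y = shrink(X - mu * grad_f, mu * rho)      # (18): sparsity-promoting step
        Lam = grad_f - (X - Y) / mu                # keeps -Lam in the subdifferential of rho*||Y||_1
    return X, Y
```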
69 4 Numerical Experiments. In this section, we present numerical results on both synthetic and real data to demonstrate the efficiency of our SICS ALM algorithm. [sent-197, score-0.063]
70 Since −Λ^k ∈ ∂g(Y^k), ∥Λ^k∥_∞ ≤ ρ; hence Σ̂ − Λ^k is a feasible solution to the dual problem (2) as long as it is positive definite. [sent-203, score-0.114]
71 Thus the duality gap at the k-th iteration is given by: Dgap := − log det(X^k) + ⟨Σ̂, X^k⟩ + ρ∥X^k∥_1 − log det(Σ̂ − Λ^k) − n. [sent-204, score-0.13]
72 gap := Dgap/(1 + |pobj| + |dobj|), where pobj and dobj are respectively the objective function values of the primal problem (12) at the point X^k and the dual problem (2) at Σ̂ − Λ^k. [sent-206, score-0.267]
73 Defining d_k(ϕ(x)) ≡ max{1, ϕ(x^k), ϕ(x^{k−1})}, we measure the relative changes of the objective function value F(X) and the iterates X and Y as follows: Frel := |F(X^k) − F(X^{k−1})|/d_k(F(X)), Xrel := ∥X^k − X^{k−1}∥_F/d_k(∥X∥_F), Yrel := ∥Y^k − Y^{k−1}∥_F/d_k(∥Y∥_F). [sent-207, score-0.242]
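A sketch of these stopping quantities in numpy; the tolerances against which they are compared are not given in the extracted text, so none are hard-coded here.

```python
import numpy as np

def primal_dual_gap(Sigma_hat, X, Lam, rho):
    """Dgap = -logdet(X) + <Sigma_hat, X> + rho*||X||_1 - logdet(Sigma_hat - Lam) - n."""
    n = X.shape[0]
    _, logdet_X = np.linalg.slogdet(X)
    _, logdet_W = np.linalg.slogdet(Sigma_hat - Lam)
    return (-logdet_X + np.sum(Sigma_hat * X) + rho * np.abs(X).sum()
            - logdet_W - n)

def relative_changes(F_new, F_old, X_new, X_old, Y_new, Y_old):
    """Frel, Xrel, Yrel with denominators d_k(phi) = max(1, phi^k, phi^{k-1})."""
    d = lambda a, b: max(1.0, a, b)
    Frel = abs(F_new - F_old) / d(F_new, F_old)
    Xrel = np.linalg.norm(X_new - X_old) / d(np.linalg.norm(X_new), np.linalg.norm(X_old))
    Yrel = np.linalg.norm(Y_new - Y_old) / d(np.linalg.norm(Y_new), np.linalg.norm(Y_old))
    return Frel, Xrel, Yrel
```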
74 We compared ALM (i.e., Algorithm 3 with the above stopping criteria and µ updates) with the projected subgradient method (PSM) proposed by Duchi et al. [sent-219, score-0.059]
75 The per-iteration complexity of all three algorithms is roughly the same; hence a comparison of the number of iterations is meaningful. [sent-221, score-0.065]
76 4.1 Experiments on synthetic data. We randomly created test problems using a procedure proposed by Scheinberg and Rish in [12]. [sent-227, score-0.056]
77 For a given dimension n, we first created a sparse matrix U ∈ R^{n×n} with nonzero entries equal to −1 or 1 with equal probability. [sent-230, score-0.095]
78 Then we computed S := (U U^⊤)^{−1} as the true covariance matrix. [sent-231, score-0.085]
79 , Y_p, from the Gaussian distribution N(0, S) by using the mvnrnd function in MATLAB, and computed a sample covariance matrix Σ̂ := (1/p) ∑_{i=1}^p Y_i Y_i^⊤. [sent-236, score-0.109]
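A sketch of this data-generating procedure in numpy (the original uses mvnrnd in MATLAB); the nonzero density and the identity added to U U^⊤ to keep it invertible are illustrative choices, not values from the paper.

```python
import numpy as np

def synthetic_sics_problem(n, p, density=0.05, seed=0):
    """Sparse U with +/-1 nonzeros, true covariance S = (U U^T + I)^{-1},
    and a sample covariance Sigma_hat from p Gaussian draws."""
    rng = np.random.default_rng(seed)
    U = (rng.random((n, n)) < density) * rng.choice([-1.0, 1.0], size=(n, n))
    S = np.linalg.inv(U @ U.T + np.eye(n))   # true covariance matrix
    Y = rng.multivariate_normal(np.zeros(n), S, size=p)
    Sigma_hat = (Y.T @ Y) / p                # sample covariance matrix
    return S, Sigma_hat
```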
80 Since PSM and VSM solve the dual problem (2), the duality gap which is given by (3) is available without any additional spectral decompositions. [sent-240, score-0.204]
[Table residue: iteration counts (iter), duality gaps (Dgap), and CPU times for ALM, PSM, and VSM on synthetic problems of dimension n = 200, 500, 1000, 1500, 2000 and on real problems of dimension n = 587, 692, 834, 1255, 1869.]
85 Solution Sparsity. In this section, we compare the sparsity patterns of the solutions produced by ALM, PSM and VSM. [sent-392, score-0.09]
86 For ALM, the sparsity of the solution is given by the sparsity of Y . [sent-393, score-0.112]
87 Since PSM and VSM solve the dual problem, the primal solution X, obtained by inverting the dual solution W , is never sparse due to floating point errors. [sent-394, score-0.37]
88 Instead, we measure the sparsity of solutions produced by PSM and VSM by appealing to complementary slackness. [sent-396, score-0.09]
89 Specifically, the (i, j)-th element of the inverse covariance matrix is deemed to be nonzero if and only if |W_ij − Σ̂_ij| = ρ. [sent-397, score-0.173]
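A sketch of this complementary-slackness test for a dual solution W returned by PSM or VSM; the tolerance is an addition here, since an exact equality is fragile in floating point.

```python
import numpy as np

def support_from_dual(W, Sigma_hat, rho, tol=1e-8):
    """Entry (i, j) is declared nonzero iff |W_ij - Sigma_hat_ij| = rho (up to tol)."""
    return np.abs(np.abs(W - Sigma_hat) - rho) <= tol
```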
90 For each value of ρ, the first three rows show the number of nonzeros in the solution and the last three rows show the number of entries that are nonzero in the solution produced by one of the methods but are zero in the solution produced by the other method. [sent-399, score-0.18]
91 The sparsity of the ground truth inverse covariance matrix of the synthetic data is 6. [sent-400, score-0.247]
92 [Table residue: nonzero counts for ALM, PSM, and VSM at each value of ρ, with only a handful of disagreeing entries.] In most cases the three methods produce solutions with exactly the same sparsity patterns. [sent-407, score-0.067]
93 An inexact interior point method for ℓ1-regularized sparse covariance selection. [sent-455, score-0.156]
94 Mining brain region connectivity for alzheimer’s disease study via sparse inverse covariance estimation. [sent-474, score-0.199]
95 Sinco - a greedy coordinate ascent method for sparse inverse covariance selection problem. [sent-483, score-0.22]
96 Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. [sent-534, score-0.119]
97 Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. [sent-539, score-0.062]
98 A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). [sent-551, score-0.065]
99 A fast iterative shrinkage-thresholding algorithm for linear inverse problems. [sent-564, score-0.064]
100 Fast alternating linearization methods for minimizing the sum of two convex functions. [sent-571, score-0.245]
wordName wordTfidf (topN-words)
[('alm', 0.535), ('psm', 0.405), ('vsm', 0.405), ('sics', 0.308), ('xk', 0.214), ('dgap', 0.162), ('linearization', 0.117), ('alternating', 0.105), ('primal', 0.092), ('det', 0.087), ('rel', 0.085), ('covariance', 0.085), ('miny', 0.081), ('iter', 0.078), ('cpu', 0.078), ('bcd', 0.071), ('minx', 0.069), ('subproblem', 0.065), ('adal', 0.065), ('scheinberg', 0.065), ('inverse', 0.064), ('dual', 0.06), ('lipschitz', 0.055), ('lagrangian', 0.053), ('qg', 0.052), ('gap', 0.051), ('iteration', 0.051), ('sparse', 0.05), ('iterates', 0.049), ('augmented', 0.048), ('guess', 0.046), ('complexity', 0.045), ('ipm', 0.043), ('toh', 0.043), ('vs', 0.041), ('arg', 0.041), ('solve', 0.04), ('sparsity', 0.039), ('qf', 0.037), ('lu', 0.036), ('duchi', 0.035), ('nesterov', 0.035), ('synthetic', 0.035), ('zij', 0.035), ('solution', 0.034), ('alternatingly', 0.032), ('dms', 0.032), ('dobj', 0.032), ('ngap', 0.032), ('nonzeros', 0.032), ('pobj', 0.032), ('sinco', 0.032), ('terminated', 0.032), ('xrel', 0.032), ('goldfarb', 0.032), ('constraint', 0.031), ('shrink', 0.031), ('preprint', 0.031), ('banerjee', 0.03), ('siam', 0.029), ('yuan', 0.029), ('ieor', 0.028), ('prox', 0.028), ('skipping', 0.028), ('duality', 0.028), ('solutions', 0.028), ('numerical', 0.028), ('kn', 0.027), ('proximal', 0.027), ('solving', 0.026), ('cd', 0.026), ('diag', 0.025), ('spectral', 0.025), ('aspremont', 0.024), ('rish', 0.024), ('smooth', 0.024), ('matrix', 0.024), ('objectives', 0.023), ('glasso', 0.023), ('produced', 0.023), ('convex', 0.023), ('dk', 0.023), ('subproblems', 0.022), ('matlab', 0.022), ('graphical', 0.022), ('optimality', 0.022), ('method', 0.021), ('created', 0.021), ('minimization', 0.021), ('bound', 0.02), ('minutes', 0.02), ('shrinkage', 0.02), ('et', 0.02), ('hence', 0.02), ('effort', 0.02), ('yi', 0.02), ('sun', 0.019), ('updates', 0.019), ('columbia', 0.019), ('subgradient', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
2 0.11348338 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
Author: Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar
Abstract: Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (for e.g. the movement duration or the time point of reaching an intermediate goal) to be specified a priori. Here, we present a methodology that is capable of jointly optimizing the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and control commands offering, for the first time, a practical approach to tackling generic via point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.
3 0.065014474 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
4 0.057883218 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
Author: Hariharan Narayanan, Sanjoy Mitter
Abstract: The hypothesis that high dimensional data tends to lie in the vicinity of a low dimensional manifold is the basis of a collection of methodologies termed Manifold Learning. In this paper, we study statistical aspects of the question of fitting a manifold with a nearly optimal least squared error. Given upper bounds on the dimension, volume, and curvature, we show that Empirical Risk Minimization can produce a nearly optimal manifold using a number of random samples that is independent of the ambient dimension of the space in which data lie. We obtain an upper bound on the required number of samples that depends polynomially on the curvature, exponentially on the intrinsic dimension, and linearly on the intrinsic volume. For constant error, we prove a matching minimax lower bound on the sample complexity that shows that this dependence on intrinsic dimension, volume and curvature is unavoidable. Whether the known lower bound of O(k/ϵ² + (1/ϵ²) log(1/δ)) for the sample complexity of Empirical Risk minimization on k−means applied to data in a unit ball of arbitrary dimension is tight, has been an open question since 1997 [3]. Here ϵ is the desired bound on the error and δ is a bound on the probability of failure. We improve the best currently known upper bound [14] of O(k²/ϵ² + (1/ϵ²) log(1/δ)) to O((k/ϵ²) min{k, log⁴ k} + (1/ϵ²) log(1/δ)). Based on these results, we devise a simple algorithm for k−means and another that uses a family of convex programs to fit a piecewise linear curve of a specified length to high dimensional data, where the sample complexity is independent of the ambient dimension. 1
5 0.056966018 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
Author: Matthias Hein, Thomas Bühler
Abstract: Many problems in machine learning and statistics can be formulated as (generalized) eigenproblems. In terms of the associated optimization problem, computing linear eigenvectors amounts to finding critical points of a quadratic function subject to quadratic constraints. In this paper we show that a certain class of constrained optimization problems with nonquadratic objective and constraints can be understood as nonlinear eigenproblems. We derive a generalization of the inverse power method which is guaranteed to converge to a nonlinear eigenvector. We apply the inverse power method to 1-spectral clustering and sparse PCA which can naturally be formulated as nonlinear eigenproblems. In both applications we achieve state-of-the-art results in terms of solution quality and runtime. Moving beyond the standard eigenproblem should be useful also in many other applications and our inverse power method can be easily adapted to new problems. 1
6 0.055410095 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
7 0.05291957 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
8 0.052077103 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
9 0.051530492 181 nips-2010-Network Flow Algorithms for Structured Sparsity
10 0.051513426 217 nips-2010-Probabilistic Multi-Task Feature Selection
11 0.050319344 162 nips-2010-Link Discovery using Graph Feature Tracking
12 0.049951326 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
13 0.048813652 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
14 0.047533315 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
15 0.043753713 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
16 0.040830027 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
17 0.039188087 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
18 0.03820277 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
19 0.038019482 148 nips-2010-Learning Networks of Stochastic Differential Equations
20 0.036740828 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
topicId topicWeight
[(0, 0.116), (1, 0.029), (2, 0.061), (3, 0.042), (4, 0.027), (5, -0.054), (6, 0.013), (7, 0.012), (8, -0.017), (9, -0.003), (10, -0.0), (11, -0.014), (12, 0.044), (13, 0.034), (14, -0.016), (15, -0.015), (16, -0.013), (17, 0.008), (18, 0.016), (19, 0.023), (20, 0.027), (21, -0.026), (22, 0.001), (23, -0.001), (24, 0.015), (25, -0.045), (26, -0.072), (27, -0.038), (28, 0.076), (29, -0.009), (30, 0.036), (31, 0.027), (32, 0.062), (33, -0.066), (34, -0.071), (35, 0.103), (36, -0.03), (37, -0.099), (38, 0.002), (39, 0.076), (40, 0.016), (41, 0.02), (42, 0.021), (43, -0.022), (44, -0.006), (45, -0.073), (46, 0.065), (47, -0.119), (48, -0.061), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.87518972 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
2 0.72011787 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
Author: Konrad Rawlik, Marc Toussaint, Sethu Vijayakumar
Abstract: Algorithms based on iterative local approximations present a practical approach to optimal control in robotic systems. However, they generally require the temporal parameters (for e.g. the movement duration or the time point of reaching an intermediate goal) to be specified a priori. Here, we present a methodology that is capable of jointly optimizing the temporal parameters in addition to the control command profiles. The presented approach is based on a Bayesian canonical time formulation of the optimal control problem, with the temporal mapping from canonical to real time parametrised by an additional control variable. An approximate EM algorithm is derived that efficiently optimizes both the movement duration and control commands offering, for the first time, a practical approach to tackling generic via point problems in a systematic way under the optimal control framework. The proposed approach, which is applicable to plants with non-linear dynamics as well as arbitrary state dependent and quadratic control costs, is evaluated on realistic simulations of a redundant robotic plant.
3 0.63843554 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an ϵ accurate solution in O(1/ϵ²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ϵ) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ϵ) iterations. 1
4 0.57264572 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
Author: Ronny Luss, Saharon Rosset, Moni Shahar
Abstract: A new algorithm for isotonic regression is presented based on recursively partitioning the solution space. We develop efficient methods for each partitioning subproblem through an equivalent representation as a network flow problem, and prove that this sequence of partitions converges to the global solution. These network flow problems can further be decomposed in order to solve very large problems. Success of isotonic regression in prediction and our algorithm's favorable computational properties are demonstrated through simulated examples as large as 2 × 10^5 variables and 10^7 constraints.
5 0.52284485 181 nips-2010-Network Flow Algorithms for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Francis R. Bach, Guillaume R. Obozinski
Abstract: We consider a class of learning problems that involve a structured sparsityinducing norm defined as the sum of ℓ∞ -norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems. 1
6 0.51531345 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
7 0.51306438 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
8 0.50442916 57 nips-2010-Decoding Ipsilateral Finger Movements from ECoG Signals in Humans
10 0.47330096 257 nips-2010-Structured Determinantal Point Processes
11 0.46343973 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
12 0.4379102 45 nips-2010-CUR from a Sparse Optimization Viewpoint
13 0.43378472 167 nips-2010-Mixture of time-warped trajectory models for movement decoding
14 0.43252873 202 nips-2010-Parallelized Stochastic Gradient Descent
15 0.42611647 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
16 0.41113621 198 nips-2010-Optimal Web-Scale Tiering as a Flow Problem
17 0.40586632 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
18 0.39998344 63 nips-2010-Distributed Dual Averaging In Networks
19 0.38423976 217 nips-2010-Probabilistic Multi-Task Feature Selection
20 0.37311828 118 nips-2010-Implicit Differentiation by Perturbation
topicId topicWeight
[(13, 0.056), (17, 0.021), (24, 0.289), (27, 0.043), (30, 0.067), (35, 0.035), (45, 0.187), (50, 0.061), (52, 0.023), (60, 0.04), (77, 0.041), (78, 0.017), (90, 0.016)]
simIndex simValue paperId paperTitle
1 0.81179631 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions of the problem, and give efficient algorithms which have regret O(√T), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recommendations for slates in every round, and give algorithms with O(√T) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
same-paper 2 0.74376547 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
Author: Katya Scheinberg, Shiqian Ma, Donald Goldfarb
Abstract: Gaussian graphical models are of great interest in statistical learning. Because the conditional independencies between different nodes correspond to zero entries in the inverse covariance matrix of the Gaussian distribution, one can learn the structure of the graph by estimating a sparse inverse covariance matrix from sample data, by solving a convex maximum likelihood problem with an ℓ1 -regularization term. In this paper, we propose a first-order method based on an alternating linearization technique that exploits the problem’s special structure; in particular, the subproblems solved in each iteration have closed-form solutions. Moreover, our algorithm obtains an ϵ-optimal solution in O(1/ϵ) iterations. Numerical experiments on both synthetic and real data from gene association networks show that a practical version of this algorithm outperforms other competitive algorithms. 1
3 0.69213569 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
Author: Mahdi M. Fard, Joelle Pineau
Abstract: This paper introduces the first set of PAC-Bayesian bounds for the batch reinforcement learning problem in finite state spaces. These bounds hold regardless of the correctness of the prior distribution. We demonstrate how such bounds can be used for model-selection in control problems where prior information is available either on the dynamics of the environment, or on the value of actions. Our empirical results confirm that PAC-Bayesian model-selection is able to leverage prior distributions when they are informative and, unlike standard Bayesian RL approaches, ignores them when they are misleading. 1
4 0.66977423 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
Author: Wei Chen, Tie-yan Liu, Zhi-ming Ma
Abstract: This paper is concerned with the generalization analysis on learning to rank for information retrieval (IR). In IR, data are hierarchically organized, i.e., consisting of queries and documents. Previous generalization analysis for ranking, however, has not fully considered this structure, and cannot explain how the simultaneous change of query number and document number in the training data will affect the performance of the learned ranking model. In this paper, we propose performing generalization analysis under the assumption of two-layer sampling, i.e., the i.i.d. sampling of queries and the conditional i.i.d sampling of documents per query. Such a sampling can better describe the generation mechanism of real data, and the corresponding generalization analysis can better explain the real behaviors of learning to rank algorithms. However, it is challenging to perform such analysis, because the documents associated with different queries are not identically distributed, and the documents associated with the same query become no longer independent after represented by features extracted from query-document matching. To tackle the challenge, we decompose the expected risk according to the two layers, and make use of the new concept of two-layer Rademacher average. The generalization bounds we obtained are quite intuitive and are in accordance with previous empirical studies on the performances of ranking algorithms. 1
5 0.61132711 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
Author: Matthias Broecheler, Lise Getoor
Abstract: Continuous Markov random fields are a general formalism to model joint probability distributions over events with continuous outcomes. We prove that marginal computation for constrained continuous MRFs is #P-hard in general and present a polynomial-time approximation scheme under mild assumptions on the structure of the random field. Moreover, we introduce a sampling algorithm to compute marginal distributions and develop novel techniques to increase its efficiency. Continuous MRFs are a general purpose probabilistic modeling tool and we demonstrate how they can be applied to statistical relational learning. On the problem of collective classification, we evaluate our algorithm and show that the standard deviation of marginals serves as a useful measure of confidence. 1
6 0.61046541 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
7 0.61024201 63 nips-2010-Distributed Dual Averaging In Networks
8 0.60905534 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
9 0.60856354 265 nips-2010-The LASSO risk: asymptotic results and real world examples
10 0.60843217 215 nips-2010-Probabilistic Deterministic Infinite Automata
11 0.60834891 148 nips-2010-Learning Networks of Stochastic Differential Equations
12 0.60783452 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
13 0.6075229 117 nips-2010-Identifying graph-structured activation patterns in networks
14 0.60723662 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
15 0.60654414 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
16 0.60637051 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
17 0.60611933 155 nips-2010-Learning the context of a category
18 0.60609853 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
19 0.60536343 222 nips-2010-Random Walk Approach to Regret Minimization
20 0.60446501 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction