nips nips2011 nips2011-161 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n³) of the original ADM-based method to O(rn²), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large-scale applications. Numerical experiments verify that for LRR our LADMAP-based methods are much faster than state-of-the-art algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. [sent-2, score-0.055]
2 To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. [sent-3, score-0.231]
3 For fast convergence, we also allow the penalty to change adaptively according to a novel update rule. [sent-4, score-0.099]
4 We prove the global convergence of LADM with adaptive penalty (LADMAP). [sent-5, score-0.131]
5 As an example, we apply LADMAP to solve low-rank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. [sent-6, score-0.084]
6 By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n³) of the original ADM-based method to O(rn²), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large-scale applications. [sent-7, score-0.088]
7 Many of the problems in these fields can be formulated as the following linearly constrained convex programs: min f(x) + g(y), s.t. A(x) + B(y) = c, (1) [sent-10, score-0.032]
8 where f and g are convex functions (e.g., the nuclear norm ∥·∥∗ [2], Frobenius norm ∥·∥, ℓ2,1 norm ∥·∥2,1 [13], and ℓ1 norm ∥·∥1), and A and B are linear mappings. [sent-14, score-0.116]
9 The accelerated proximal gradient (APG) algorithm [16] is a popular technique due to its guaranteed O(k −2 ) convergence rate, where k is the iteration number. [sent-18, score-0.107]
10 The alternating direction method (ADM) has also regained a lot of attention [11, 15]. [sent-19, score-0.074]
11 However, when A or B is not the identity mapping, the subproblems in ADM may not have closed form solutions. [sent-22, score-0.07]
12 In this paper, we propose a linearized version of ADM (LADM) to overcome the difficulty in solving subproblems. [sent-24, score-0.105]
13 The idea is to linearize the quadratic penalty term and add a proximal term when solving the subproblems. [sent-25, score-0.21]
14 We also allow the penalty parameter to change adaptively and propose a novel and simple rule to update it. [sent-26, score-0.118]
15 Linearization makes the auxiliary variables unnecessary, hence saving memory and avoiding the expensive matrix inversions required to update them. [sent-27, score-0.124]
16 Moreover, without the extra constraints introduced by the auxiliary variables, the convergence is also faster. [sent-28, score-0.059]
17 Using a variable penalty parameter further speeds up the convergence. [sent-29, score-0.072]
18 The global convergence of LADM with adaptive penalty (LADMAP) is also proven. [sent-30, score-0.118]
19 As an example, we apply our LADMAP to solve the low-rank representation (LRR) problem [12]: min ∥Z∥∗ + µ∥E∥2,1, s.t. X = XZ + E, (2) [sent-31, score-0.035]
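To make the mapping to the general template (1) concrete, here is a minimal Python sketch (not from the paper; numpy-based, with illustrative names) of how the LRR problem (2) instantiates f, g, A, B, and c:

```python
import numpy as np

# LRR as an instance of (1): x <-> Z, y <-> E, f = nuclear norm, g = mu * l2,1 norm,
# A(Z) = XZ, B(E) = E, and c = X, so that A(Z) + B(E) = c recovers X = XZ + E.
def make_lrr_instance(X, mu):
    f = lambda Z: np.sum(np.linalg.svd(Z, compute_uv=False))   # ||Z||_*
    g = lambda E: mu * np.sum(np.linalg.norm(E, axis=0))       # mu * ||E||_{2,1}
    A = lambda Z: X @ Z
    B = lambda E: E
    return f, g, A, B, X
```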
20 LRR is an important robust subspace clustering technique and has found wide applications in machine learning and computer vision, e. [sent-34, score-0.083]
21 , motion segmentation, face clustering, and temporal segmentation [12, 14, 6]. [sent-36, score-0.043]
22 However, the existing LRR solver [12] is based on ADM, which suffers from O(n³) computation complexity due to the matrix-matrix multiplications and matrix inversions. [sent-37, score-0.072]
23 Moreover, introducing auxiliary variables also slows down the convergence, as there are more variables and constraints. [sent-38, score-0.032]
24 We show that LADMAP can be successfully applied to LRR, obtaining faster convergence speed than the original solver. [sent-41, score-0.04]
25 By further representing Z as its skinny SVD and utilizing an advanced functionality of the PROPACK [9] package, the complexity of solving LRR by LADMAP becomes only O(rn²), as there are no full-sized matrix-matrix multiplications, where r is the rank of the optimal Z. [sent-42, score-0.141]
26 First, they only proved the convergence of LADM for a specific problem, namely nuclear norm regularization. [sent-47, score-0.097]
27 Their proof utilized some special properties of the nuclear norm, while we prove the convergence of LADM for general problems in (1). [sent-48, score-0.076]
28 Although they mentioned the dynamic updating rule proposed in [8], their proof cannot be straightforwardly applied to the case of variable penalty. [sent-50, score-0.056]
29 So it is difficult to choose an optimal fixed penalty that fits different problems and problem sizes, while our novel updating rule for the penalty, although simple, is effective for different problems and problem sizes. [sent-53, score-0.128]
30 The linearization technique has also been used in other optimization methods. [sent-54, score-0.048]
31 For example, Yin [22] applied this technique to the Bregman iteration for solving compressive sensing problems and proved that the linearized Bregman method converges to an exact solution conditionally. [sent-55, score-0.173]
32 When solving (1) by ADM, one operates on the following augmented Lagrangian function: L(x, y, λ) = f(x) + g(y) + ⟨λ, A(x) + B(y) − c⟩ + (β/2)∥A(x) + B(y) − c∥², (3) where λ is the Lagrange multiplier, ⟨·, ·⟩ is the inner product, and β > 0 is the penalty parameter. [sent-59, score-0.111]
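As a hedged illustration (not the authors' code; numpy-based, names are assumptions), the augmented Lagrangian (3) can be evaluated as follows:

```python
import numpy as np

def augmented_lagrangian(f, g, A, B, x, y, lam, c, beta):
    # L(x, y, lam) = f(x) + g(y) + <lam, A(x) + B(y) - c> + (beta/2) ||A(x) + B(y) - c||^2
    r = A(x) + B(y) - c                                    # constraint residual
    return f(x) + g(y) + np.sum(lam * r) + 0.5 * beta * np.sum(r * r)
```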
33 The usual augmented Lagrange multiplier method is to minimize L w.r.t. x and y simultaneously. [sent-60, score-0.045]
34 ADM decomposes the minimization of L w.r.t. (x, y) into two subproblems that minimize over x and y, respectively. (Footnote 1: Here we switch to bold capital letters in order to emphasize that the variables are matrices.) [sent-68, score-0.055]
35 More specifically, the iterations of ADM go as follows:
xk+1 = arg min_x L(x, yk, λk) = arg min_x f(x) + (β/2)∥A(x) + B(yk) − c + λk/β∥², (4)
yk+1 = arg min_y L(xk+1, y, λk) = arg min_y g(y) + (β/2)∥B(y) + A(xk+1) − c + λk/β∥², (5)
λk+1 = λk + β[A(xk+1) + B(yk+1) − c]. (6) [sent-73, score-0.367]
36 In many machine learning problems, as f and g are matrix or vector norms, the subproblems (4) and (5) usually have closed form solutions when A and B are identities [2, 12, 21]. [sent-74, score-0.084]
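A minimal sketch of one ADM pass (4)-(6), assuming the two subproblem solvers are supplied as oracles (e.g., closed-form proximal maps when A and B are identities); this is illustrative, not the paper's implementation:

```python
def adm_step(x, y, lam, beta, A, B, c, solve_x, solve_y):
    x_new = solve_x(y, lam, beta)                      # (4): argmin_x f(x) + (beta/2)||A(x)+B(y)-c+lam/beta||^2
    y_new = solve_y(x_new, lam, beta)                  # (5): argmin_y g(y) + (beta/2)||B(y)+A(x_new)-c+lam/beta||^2
    lam_new = lam + beta * (A(x_new) + B(y_new) - c)   # (6): multiplier update
    return x_new, y_new, lam_new
```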
37 To overcome this difficulty, a common strategy is to introduce auxiliary variables [12, 1] u and v and reformulate problem (1) into an equivalent one: min f(x) + g(y), s.t. A(u) + B(v) = c, x = u, y = v. [sent-80, score-0.064]
38 Moreover, to update u and v, whose subproblems are least squares problems, expensive matrix inversions are often necessary. [sent-84, score-0.115]
39 To avoid introducing auxiliary variables and still solve subproblems (4) and (5) efficiently, inspired by Yang et al. [sent-86, score-0.102]
40 [20], we propose a linearization technique for (4) and (5). [sent-87, score-0.048]
41 To further accelerate the convergence of the algorithm, we also propose an adaptive rule for updating the penalty parameter. [sent-88, score-0.187]
42 Similarly, subproblem (5) can be approximated by
yk+1 = arg min_y g(y) + (βηB/2)∥y − yk + B∗(λk + β(A(xk+1) + B(yk) − c))/(βηB)∥². (9) [sent-92, score-0.231]
43 The update of the Lagrange multiplier still goes as (6). [sent-93, score-0.093]
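The linearized updates reduce each subproblem to a single proximal step. A sketch under the assumption that prox_f and prox_g (the proximal operators of f and g) and the adjoints A_adj, B_adj are available; names are illustrative:

```python
def ladm_updates(x, y, lam, beta, A, A_adj, B, B_adj, c, eta_A, eta_B, prox_f, prox_g):
    # prox_f(v, t) = argmin_u f(u) + (1/(2t)) ||u - v||^2, and similarly for prox_g.
    gx = A_adj(lam + beta * (A(x) + B(y) - c))                        # gradient of the penalty at x_k
    x_new = prox_f(x - gx / (beta * eta_A), 1.0 / (beta * eta_A))     # linearized (4), i.e. (8)
    gy = B_adj(lam + beta * (A(x_new) + B(y) - c))                    # uses the fresh x_{k+1}
    y_new = prox_g(y - gy / (beta * eta_B), 1.0 / (beta * eta_B))     # linearized (5), i.e. (9)
    return x_new, y_new
```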
44 3 Adaptive Penalty: In previous ADM and LADM approaches [15, 21, 20], the penalty parameter β is fixed. [sent-95, score-0.072]
45 Although they mentioned He et al.’s adaptive updating rule [8] in their papers, the rule is for ADM only. [sent-102, score-0.094]
46 We propose the following adaptive updating strategy for the penalty parameter β: βk+1 = min(βmax, ρβk). (10) (Footnote 2: As in [20], we can also introduce a parameter γ and update λ as λk+1 = λk + γβ[A(xk+1) + B(yk+1) − c].) [sent-103, score-0.155]
47 The value of ρ is defined as
ρ = ρ0, if βk max(√ηA ∥xk+1 − xk∥, √ηB ∥yk+1 − yk∥)/∥c∥ < ε2; ρ = 1, otherwise, (11)
where ρ0 ≥ 1 is a constant. [sent-107, score-0.424]
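A direct transcription of (10)-(11) in Python (illustrative only; array-shape conventions are assumptions):

```python
import numpy as np

def update_beta(beta, beta_max, rho0, eps2, eta_A, eta_B, x_new, x_old, y_new, y_old, c_norm):
    # Grow beta by rho0 only when the iterates have (relatively) stopped moving; otherwise keep it fixed.
    crit = beta * max(np.sqrt(eta_A) * np.linalg.norm(x_new - x_old),
                      np.sqrt(eta_B) * np.linalg.norm(y_new - y_old)) / c_norm
    rho = rho0 if crit < eps2 else 1.0
    return min(beta_max, rho * beta)
```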
48 The condition to assign ρ = ρ0 comes from the analysis on the stopping criteria (see Section 2. [sent-108, score-0.046]
49 Our updating rule is fundamentally different from He et al. [sent-111, score-0.056]
50 ’s for ADM [8], which aims at balancing the errors in the stopping criteria and involves several parameters. [sent-112, score-0.046]
51 4 Convergence of LADMAP: To prove the convergence of LADMAP, we first have the following propositions. [sent-114, score-0.04]
52 ∥xk+1 − xk ∥ → 0, ∥yk+1 − yk ∥ → 0, ∥λk+1 − λk ∥ → 0. [sent-121, score-0.424]
53 Then we can prove the convergence of LADMAP, as stated in the following theorem. [sent-123, score-0.04]
54 Theorem 3 If {βk } is non-decreasing and upper bounded, ηA > ∥A∥2 , and ηB > ∥B∥2 , then the sequence {(xk , yk , λk )} generated by LADMAP converges to a KKT point of problem (1). [sent-124, score-0.222]
55 So the first stopping criterion is the feasibility: ∥A(xk+1) + B(yk+1) − c∥/∥c∥ < ε1. (15) [sent-129, score-0.044]
56 As for the second KKT condition, we rewrite the second part of Proposition 1 as follows:
−βk[ηB(yk+1 − yk) + B∗(A(xk+1 − xk))] − B∗(λ̃k+1) ∈ ∂g(yk+1). (16) [sent-130, score-0.424]
57 So for λ̃k+1 to satisfy the second KKT condition, both βk ηA ∥xk+1 − xk∥ and βk ∥ηB(yk+1 − yk) + B∗(A(xk+1 − xk))∥ should be small enough. [sent-131, score-0.641]
58 This leads to the second stopping criterion: βk max(ηA ∥xk+1 − xk∥/∥A∗(c)∥, ηB ∥yk+1 − yk∥/∥B∗(c)∥) ≤ ε′2. (17) [sent-132, score-0.456]
59 By estimating ∥A∗(c)∥ and ∥B∗(c)∥ by √ηA ∥c∥ and √ηB ∥c∥, respectively, we arrive at the second stopping criterion in use: βk max(√ηA ∥xk+1 − xk∥, √ηB ∥yk+1 − yk∥)/∥c∥ ≤ ε2. (18) [sent-133, score-0.468]
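Putting the two criteria together, a hedged sketch of the termination test (the iteration stops when both (15) and (18) hold; names are illustrative):

```python
import numpy as np

def converged(x_new, x_old, y_new, y_old, A, B, c, beta, eta_A, eta_B, eps1, eps2):
    c_norm = np.linalg.norm(c)
    feasible = np.linalg.norm(A(x_new) + B(y_new) - c) / c_norm < eps1                        # criterion (15)
    stalled = beta * max(np.sqrt(eta_A) * np.linalg.norm(x_new - x_old),
                         np.sqrt(eta_B) * np.linalg.norm(y_new - y_old)) / c_norm < eps2      # criterion (18)
    return feasible and stalled
```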
60 We further introduce acceleration tricks to reduce the computation complexity of each iteration. [sent-142, score-0.064]
61 In the subproblem for updating E, one may apply the ℓ2,1-norm shrinkage operator [12], with a threshold βk⁻¹, to matrix Mk = −XZk + X − Λk/βk. [sent-146, score-0.075]
62 In the subproblem for updating Z, one has to apply the singular value shrinkage operator [2], with a threshold (βk ηX)⁻¹, to matrix Nk = Zk − ηX⁻¹ X^T(XZk + Ek+1 − X + Λk/βk), where ηX > σmax²(X). [sent-147, score-0.103]
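For reference, minimal numpy implementations of the two shrinkage operators used above (generic thresholds; this is a sketch, not the paper's code, and the dense SVD here is exactly what the next subsection replaces with a partial SVD):

```python
import numpy as np

def l21_shrinkage(M, tau):
    # Column-wise shrinkage for the l2,1 norm: scale column m_i by max(0, 1 - tau/||m_i||_2).
    norms = np.maximum(np.linalg.norm(M, axis=0), 1e-12)
    return M * np.maximum(0.0, 1.0 - tau / norms)

def singular_value_shrinkage(N, tau):
    # Soft-threshold the singular values: U diag(max(sigma - tau, 0)) V^T.
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```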
63 2 Acceleration Tricks for LRR: Up to now, LADMAP for LRR is still of complexity O(n³), although partial SVD is already used. [sent-151, score-0.04]
64 This is because forming Mk and Nk requires full-sized matrix-matrix multiplications, e.g., XZk. [sent-152, score-0.048]
65 To break this complexity bound, we introduce a decomposition technique to further accelerate LADMAP for LRR. [sent-155, score-0.062]
66 By representing Zk as its skinny SVD, Zk = Uk Σk Vk^T, some of the full-sized matrix-matrix multiplications are gone: they are replaced by successive reduced-sized matrix-matrix multiplications. [sent-156, score-0.162]
67 For example, when updating E, XZk is computed as ((XUk)Σk)Vk^T, reducing the complexity to O(rn²). [sent-157, score-0.058]
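The saving comes purely from the multiplication order; a short sketch (numpy, illustrative names):

```python
def xz_skinny(X, U, sigma, V):
    # Z = U diag(sigma) V^T is kept in factored (skinny SVD) form with r columns.
    # Evaluating ((X U) diag(sigma)) V^T keeps every intermediate at most n x r,
    # so the cost is O(r n^2) instead of the O(n^3) needed to form Z explicitly.
    return ((X @ U) * sigma) @ V.T
```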
68 Fortunately, in PROPACK the bi-diagonalizing process of Nk is done by the Lanczos procedure [9], which only requires computing the matrix-vector multiplications Nk v and u^T Nk, where u and v are some vectors in the Lanczos procedure. [sent-160, score-0.037]
69 So the computation complexity of partial SVD of Nk is still O(rn²). [sent-162, score-0.04]
70 Consequently, with our acceleration techniques, the complexity of our accelerated LADMAP (denoted as LADMAP(A) for short) for LRR is O(rn²). [sent-163, score-0.074]
71 So one has to predict the number of singular values that are greater than a threshold [11, 20, 16]. [sent-166, score-0.041]
72 Recently, we have modified PROPACK so that it can output the singular values that are greater than a threshold and their corresponding singular vectors. [sent-168, score-0.069]
73 (Footnote 4: When forming Nk explicitly, X^T XZk can be computed as ((X^T(XUk))Σk)Vk^T, whose complexity is still O(rn²), while X^T Ek+1 could also be accelerated as Ek+1 is a column-sparse matrix.) [sent-170, score-0.061]
74 while (15) or (18) is not satisfied do Step 1: Update Ek+1 = arg min_E µ∥E∥2,1 + (βk/2)∥E + (XUk)Σk Vk^T − X + Λk/βk∥². [sent-174, score-0.04]
75 Step 2: Update the skinny SVD (Uk+1 , Σk+1 , Vk+1 ) of Zk+1 . [sent-177, score-0.043]
76 First, compute the partial SVD Ũr Σ̃r Ṽr^T of the implicit matrix Nk, which is bi-diagonalized by the successive matrix-vector multiplication technique described in Section 3.2. [sent-178, score-0.077]
77 Second, Uk+1 = Ũr(:, 1:r′), Σk+1 = Σ̃r(1:r′, 1:r′) − (βk ηX)⁻¹ I, Vk+1 = Ṽr(:, 1:r′), where r′ is the number of singular values in Σ̃r that are greater than (βk ηX)⁻¹. [sent-180, score-0.041]
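In code, Step 2 amounts to truncating and shifting the partial SVD; a sketch assuming the partial SVD factors are already available (e.g., from PROPACK):

```python
import numpy as np

def truncate_svd(U, s, V, thresh):
    # Keep components with singular value > thresh = 1/(beta_k * eta_X) and subtract thresh,
    # which is the singular value shrinkage written directly in factored form:
    # Z_{k+1} = U' diag(s') V'^T.
    keep = s > thresh
    return U[:, keep], s[keep] - thresh, V[:, keep]
```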
78 We test and compare these solvers on both synthetic multiple-subspace data and the real-world motion data (Hopkins155 motion segmentation database [17]). [sent-192, score-0.088]
79 In particular, for LADM the penalty is fixed at β = 2. [sent-197, score-0.072]
80 As the code of ADM was downloaded, its stopping criteria, ∥XZk + Ek − X∥/∥X∥ ≤ ε1 and max(∥Ek − Ek−1∥/∥X∥, ∥Zk − Zk−1∥/∥X∥) ≤ ε2, are used in all our experiments. [sent-202, score-0.032]
81 So each subspace has a rank of r̃ and the data has an ambient dimension of d. [sent-206, score-0.048]
82 However, this does not harm the convergence of LADMAP because (18) is always checked when updating βk+1 (see (11)). [sent-222, score-0.064]
83 Moreover, the advantage of LADMAP(A) is even greater when the ratio r̃/p, which is roughly the ratio of the rank of Z0 to the size of Z0, is smaller, which testifies to the complexity estimations on LADMAP and LADMAP(A) for LRR. [sent-224, score-0.058]
84 5 Conclusions: In this paper, we propose a linearized alternating direction method with adaptive penalty, so that the subproblems in ADM can be solved conveniently. [sent-371, score-0.313]
85 With LADMAP, no auxiliary variables are required and the convergence is also much faster. [sent-372, score-0.059]
86 We further apply it to solve the LRR problem and combine it with an acceleration trick so that the computation complexity is reduced from O(n3 ) to O(rn2 ), which is highly advantageous over the existing LRR solvers. [sent-373, score-0.064]
87 A Proof of Theorem 3: Proof. By Proposition 2 (1), {(xk, yk, λk)} is bounded, hence has an accumulation point, say (xkj, ykj, λkj) → (x∞, y∞, λ∞). [sent-382, score-0.292]
88 This shows that any accumulation point of {(xk , yk )} is a feasible solution. [sent-387, score-0.228]
89 By letting k = kj − 1 in Proposition 1 and the definition of subgradient, we have
f(xkj) + g(ykj) ≤ f(x∗) + g(y∗) + ⟨xkj − x∗, −βkj−1 ηA(xkj − xkj−1) − A∗(λ̃kj)⟩ + ⟨ykj − y∗, −βkj−1 ηB(ykj − ykj−1) − B∗(λ̂kj)⟩. [sent-388, score-0.201]
90 Again, let k = kj − 1 in Proposition 1 and by the definition of subgradient, we have
f(x) ≥ f(xk) + ⟨x − xk, −βk−1 ηA(xk − xk−1) − A∗(λ̃k)⟩, ∀x. [sent-391, score-0.496]
91 We next prove that the whole sequence {(xk , yk , λk )} converges to (x∞ , y∞ , λ∞ ). [sent-397, score-0.235]
92 As (x∞ , y∞ , λ∞ ) can be an arbitrary accumulation point of {(xk , yk , λk )}, we may conclude that {(xk , yk , λk )} converges to a KKT point of problem (1). [sent-401, score-0.45]
93 Distributed optimization and statistical learning via the alternating direction method of multipliers. [sent-408, score-0.074]
94 A closed form solution to robust subspace estimation and clustering. [sent-439, score-0.053]
95 Alternating direction method with Gaussian back substitution for separable convex programming. [sent-445, score-0.037]
96 Alternating direction method with self-adaptive penalty parameters for monotone variational inequality. [sent-451, score-0.097]
97 The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. [sent-467, score-0.06]
98 An accelerated proximal gradient algorithm for nuclear norm regularized least sequares problems. [sent-499, score-0.108]
99 Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. [sent-526, score-0.149]
100 Alternating direction algorithms for l1 problems in compressive sensing. [sent-531, score-0.048]
wordName wordTfidf (topN-words)
[('ladmap', 0.677), ('adm', 0.408), ('ladm', 0.279), ('lrr', 0.279), ('xk', 0.217), ('yk', 0.207), ('apg', 0.13), ('nk', 0.097), ('xkj', 0.075), ('linearized', 0.073), ('kkt', 0.073), ('penalty', 0.072), ('xzk', 0.064), ('ykj', 0.064), ('kj', 0.062), ('propack', 0.061), ('svd', 0.06), ('subproblems', 0.055), ('alternating', 0.049), ('ek', 0.049), ('zk', 0.046), ('skinny', 0.043), ('xuk', 0.043), ('multiplications', 0.037), ('updating', 0.037), ('vk', 0.037), ('nuclear', 0.036), ('proposition', 0.033), ('sized', 0.033), ('stopping', 0.032), ('auxiliary', 0.032), ('acceleration', 0.028), ('singular', 0.028), ('technique', 0.028), ('convergence', 0.027), ('proximal', 0.027), ('update', 0.027), ('linearizing', 0.026), ('multiplier', 0.026), ('direction', 0.025), ('accelerated', 0.025), ('subspace', 0.024), ('rank', 0.024), ('subproblem', 0.024), ('lagrange', 0.024), ('segmentation', 0.023), ('yang', 0.023), ('lanczos', 0.023), ('compressive', 0.023), ('accumulation', 0.021), ('complexity', 0.021), ('tao', 0.021), ('cand', 0.02), ('motion', 0.02), ('arg', 0.02), ('norm', 0.02), ('solving', 0.02), ('linearization', 0.02), ('min', 0.02), ('partial', 0.019), ('lagrangian', 0.019), ('adaptive', 0.019), ('rule', 0.019), ('inversions', 0.019), ('augmented', 0.019), ('clustering', 0.017), ('ur', 0.016), ('successive', 0.016), ('max', 0.016), ('triple', 0.015), ('corrupted', 0.015), ('forming', 0.015), ('solve', 0.015), ('bregman', 0.015), ('programs', 0.015), ('closed', 0.015), ('converges', 0.015), ('completion', 0.015), ('tricks', 0.015), ('vr', 0.015), ('liu', 0.014), ('robust', 0.014), ('matrix', 0.014), ('criteria', 0.014), ('proved', 0.014), ('database', 0.013), ('prove', 0.013), ('faster', 0.013), ('uk', 0.013), ('accelerate', 0.013), ('fund', 0.013), ('greater', 0.013), ('adding', 0.013), ('overcome', 0.012), ('convex', 0.012), ('criterion', 0.012), ('siam', 0.012), ('xt', 0.012), ('matlab', 0.012), ('synthetic', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
2 0.14743017 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.071995996 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
4 0.059782036 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
5 0.053676702 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
Author: Fatma K. Karzan, Arkadi S. Nemirovski, Boris T. Polyak, Anatoli Juditsky
Abstract: We discuss new methods for the recovery of signals with block-sparse structure, based on 1 -minimization. Our emphasis is on the efficiently computable error bounds for the recovery routines. We optimize these bounds with respect to the method parameters to construct the estimators with improved statistical properties. We justify the proposed approach with an oracle inequality which links the properties of the recovery algorithms and the best estimation performance. 1
6 0.053319711 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
7 0.050506763 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
8 0.043151818 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
9 0.036487222 165 nips-2011-Matrix Completion for Multi-label Image Classification
10 0.036293626 73 nips-2011-Divide-and-Conquer Matrix Factorization
11 0.034844343 14 nips-2011-A concave regularization technique for sparse mixture models
12 0.034461211 78 nips-2011-Efficient Methods for Overlapping Group Lasso
13 0.034157719 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
14 0.033502173 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
15 0.031630538 259 nips-2011-Sparse Estimation with Structured Dictionaries
16 0.03078231 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
17 0.029455433 268 nips-2011-Speedy Q-Learning
18 0.028760005 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
19 0.028382612 199 nips-2011-On fast approximate submodular minimization
20 0.028157471 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
topicId topicWeight
[(0, 0.078), (1, 0.003), (2, -0.036), (3, -0.058), (4, -0.038), (5, 0.036), (6, -0.032), (7, 0.05), (8, 0.012), (9, 0.04), (10, 0.05), (11, -0.012), (12, -0.006), (13, -0.017), (14, -0.023), (15, -0.038), (16, -0.018), (17, 0.012), (18, -0.017), (19, 0.027), (20, -0.037), (21, 0.03), (22, 0.008), (23, -0.077), (24, -0.066), (25, 0.03), (26, 0.029), (27, -0.084), (28, 0.031), (29, -0.055), (30, -0.034), (31, 0.046), (32, -0.103), (33, 0.045), (34, -0.017), (35, 0.013), (36, -0.046), (37, -0.157), (38, -0.009), (39, -0.17), (40, 0.033), (41, 0.004), (42, -0.024), (43, -0.182), (44, 0.121), (45, -0.069), (46, 0.062), (47, 0.005), (48, 0.009), (49, 0.134)]
simIndex simValue paperId paperTitle
same-paper 1 0.93224072 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
2 0.76578087 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.52108443 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
4 0.49500811 78 nips-2011-Efficient Methods for Overlapping Group Lasso
Author: Lei Yuan, Jun Liu, Jieping Ye
Abstract: The group Lasso is an extension of the Lasso for feature selection on (predefined) non-overlapping groups of features. The non-overlapping group structure limits its applicability in practice. There have been several recent attempts to study a more general formulation, where groups of features are given, potentially with overlaps between the groups. The resulting optimization is, however, much more challenging to solve due to the group overlaps. In this paper, we consider the efficient optimization of the overlapping group Lasso penalized problem. We reveal several key properties of the proximal operator associated with the overlapping group Lasso, and compute the proximal operator by solving the smooth and convex dual problem, which allows the use of the gradient descent type of algorithms for the optimization. We have performed empirical evaluations using both synthetic and the breast cancer gene expression data set, which consists of 8,141 genes organized into (overlapping) gene sets. Experimental results show that the proposed algorithm is more efficient than existing state-of-the-art algorithms. 1
5 0.45446017 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
6 0.44577426 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
7 0.42640987 73 nips-2011-Divide-and-Conquer Matrix Factorization
8 0.37163642 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
9 0.34189582 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
10 0.33786368 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
11 0.3264986 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.32596582 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
13 0.31892386 14 nips-2011-A concave regularization technique for sparse mixture models
14 0.31014639 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
15 0.30786145 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
16 0.29560551 268 nips-2011-Speedy Q-Learning
17 0.29505676 5 nips-2011-A Denoising View of Matrix Completion
18 0.29320747 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
19 0.29209235 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
20 0.28584769 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
topicId topicWeight
[(0, 0.039), (4, 0.033), (20, 0.025), (26, 0.026), (31, 0.047), (33, 0.025), (43, 0.047), (45, 0.168), (57, 0.018), (62, 0.313), (74, 0.05), (83, 0.041), (99, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.77699375 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
Author: Zhouchen Lin, Risheng Liu, Zhixun Su
Abstract: Many machine learning and signal processing problems can be formulated as linearly constrained convex programs, which could be efficiently solved by the alternating direction method (ADM). However, usually the subproblems in ADM are easily solvable only when the linear mappings in the constraints are identities. To address this issue, we propose a linearized ADM (LADM) method by linearizing the quadratic penalty term and adding a proximal term when solving the subproblems. For fast convergence, we also allow the penalty to change adaptively according a novel update rule. We prove the global convergence of LADM with adaptive penalty (LADMAP). As an example, we apply LADMAP to solve lowrank representation (LRR), which is an important subspace clustering technique yet suffers from high computation cost. By combining LADMAP with a skinny SVD representation technique, we are able to reduce the complexity O(n3 ) of the original ADM based method to O(rn2 ), where r and n are the rank and size of the representation matrix, respectively, hence making LRR possible for large scale applications. Numerical experiments verify that for LRR our LADMAP based methods are much faster than state-of-the-art algorithms. 1
2 0.73668391 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
Author: Fahad S. Khan, Joost Weijer, Andrew D. Bagdanov, Maria Vanrell
Abstract: We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use Information theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-theart results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
3 0.66830546 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
Author: Rémi Munos
Abstract: We consider a global optimization problem of a deterministic function f in a semimetric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric ℓ. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of ℓ. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semimetric ℓ under which f is smooth, and whose performance is almost as good as DOO optimally-fitted.
4 0.66032213 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
Author: Gilles Blanchard, Gyemin Lee, Clayton Scott
Abstract: We consider the problem of assigning class labels to an unlabeled test data set, given several labeled training data sets drawn from similar distributions. This problem arises in several applications where data distributions fluctuate because of biological, technical, or other sources of variation. We develop a distributionfree, kernel-based approach to the problem. This approach involves identifying an appropriate reproducing kernel Hilbert space and optimizing a regularized empirical risk over the space. We present generalization error analysis, describe universal kernels, and establish universal consistency of the proposed methodology. Experimental results on flow cytometry data are presented. 1
5 0.54586446 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
Author: Elad Hazan, Tomer Koren, Nati Srebro
Abstract: We present an optimization approach for linear SVMs based on a stochastic primal-dual approach, where the primal step is akin to an importance-weighted SGD, and the dual step is a stochastic update on the importance weights. This yields an optimization method with a sublinear dependence on the training set size, and the first method for learning linear SVMs with runtime less then the size of the training set required for learning! 1
6 0.54483455 202 nips-2011-On the Universality of Online Mirror Descent
7 0.5437336 220 nips-2011-Prediction strategies without loss
8 0.54284805 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
9 0.54214537 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
10 0.5420692 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
11 0.5413835 80 nips-2011-Efficient Online Learning via Randomized Rounding
12 0.54086787 251 nips-2011-Shaping Level Sets with Submodular Functions
13 0.54073083 271 nips-2011-Statistical Tests for Optimization Efficiency
14 0.54060584 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
15 0.54054368 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
16 0.54026091 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
17 0.53956825 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
18 0.53901517 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
19 0.53892392 78 nips-2011-Efficient Methods for Overlapping Group Lasso
20 0.53551382 277 nips-2011-Submodular Multi-Label Learning