jmlr jmlr2012 jmlr2012-112 jmlr2012-112-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhiwei Qin, Donald Goldfarb
Abstract: We consider a class of sparse learning problems in high dimensional feature space regularized by a structured sparsity-inducing norm that incorporates prior knowledge of the group structure of the features. Such problems often pose a considerable challenge to optimization algorithms due to the non-smoothness and non-separability of the regularization term. In this paper, we focus on two commonly adopted sparsity-inducing regularization terms, the overlapping Group Lasso penalty (the l1/l2-norm) and the l1/l∞-norm. We propose a unified framework based on the augmented Lagrangian method, under which problems with both types of regularization and their variants can be efficiently solved. As one of the core building blocks of this framework, we develop new algorithms using a partial-linearization/splitting technique and prove that the accelerated versions of these algorithms require O(1/√ε) iterations to obtain an ε-optimal solution. We compare the performance of these algorithms against that of the alternating direction augmented Lagrangian and FISTA methods on a collection of data sets and apply them to two real-world problems to compare the relative merits of the two norms. Keywords: structured sparsity, overlapping Group Lasso, alternating direction methods, variable splitting, augmented Lagrangian
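As a rough illustration of the kind of subproblem that the variable-splitting step reduces the structured penalty to, the sketch below implements group-wise soft-thresholding, the proximal operator of a non-overlapping l1/l2 group penalty; once each overlapping group is given its own copy of the variables, this is the operation applied group by group. This is a minimal sketch assuming NumPy; the function name prox_group_l2 and the group representation (a list of index arrays) are illustrative choices, not the authors' code.

import numpy as np

def prox_group_l2(v, groups, threshold):
    # Group-wise soft-thresholding:
    #   argmin_x 0.5 * ||x - v||_2^2 + threshold * sum_g ||x_g||_2
    # assuming the groups do not overlap (after splitting, each group owns its copy of the variables).
    x = v.copy()
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        if norm_g <= threshold:
            x[g] = 0.0                                 # the whole group is zeroed out
        else:
            x[g] = (1.0 - threshold / norm_g) * v[g]   # shrink the group toward the origin
    return x

# Toy usage with two hypothetical groups of three features each.
v = np.array([0.5, -1.2, 0.3, 2.0, 1.5, -0.7])
groups = [np.arange(0, 3), np.arange(3, 6)]
print(prox_group_l2(v, groups, threshold=1.0))

The closed form per group is what makes splitting-based subproblems cheap; the l1/l∞ case admits an analogous proximal step based on projection onto an l1-ball (cf. Duchi et al., 2008, in the references below).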
M.V. Afonso, J.M. Bioucas-Dias, and M.A.T. Figueiredo. An augmented Lagrangian approach to the constrained optimization formulation of imaging inverse problems. IEEE Transactions on Image Processing, (99):1–1, 2009. ISSN 1057-7149.

Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-5000-100-10-3     ADAL      1.70e+000  61     1.00e+000      1.9482e+005
                      APLM-S    1.71e+000  8      4.88e+000      1.9482e+005
                      FISTA-p   9.08e-001  8      4.38e+000      1.9482e+005
                      FISTA     2.74e+000  10     7.30e+000      1.9482e+005
                      ProxGrad  7.92e+001  3858   -              -
ogl-5000-600-10-3     ADAL      6.75e+001  105    1.00e+000      1.4603e+006
                      APLM-S    1.79e+002  9      1.74e+001      1.4603e+006
                      FISTA-p   4.77e+001  9      8.56e+000      1.4603e+006
                      FISTA     3.28e+001  12     1.36e+001      1.4603e+006
                      ProxGrad  7.96e+002  5608   -              -
ogl-5000-1000-10-3    ADAL      2.83e+002  151    1.00e+000      2.6746e+006
                      APLM-S    8.06e+002  10     2.76e+001      2.6746e+006
                      FISTA-p   2.49e+002  10     1.28e+001      2.6746e+006
                      FISTA     5.21e+001  13     1.55e+001      2.6746e+006
                      ProxGrad  1.64e+003  6471   -              -

Table 4: Numerical results for ogl set 1. For ProxGrad, the Avg Sub-iters and F(x) fields are not applicable since the algorithm is not based on an outer-inner iteration scheme, and the objective function that it minimizes is different from ours. We tested ten problems with J = 100, · · · , 1000, but show only the results for three of them to save space.

Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-1000-200-10-3     ADAL      4.18e+000  77     1.00e+000      9.6155e+004
                      APLM-S    1.64e+001  9      2.32e+001      9.6156e+004
                      FISTA-p   3.85e+000  9      1.02e+001      9.6156e+004
                      FISTA     2.92e+000  11     1.44e+001      9.6158e+004
                      ProxGrad  1.16e+002  4137   -              -
ogl-5000-200-10-3     ADAL      5.04e+000  63     1.00e+000      4.1573e+005
                      APLM-S    8.42e+000  8      8.38e+000      4.1576e+005
                      FISTA-p   3.96e+000  9      6.56e+000      4.1572e+005
                      FISTA     6.54e+000  10     9.70e+000      4.1573e+005
                      ProxGrad  1.68e+002  4345   -              -
ogl-10000-200-10-3    ADAL      6.41e+000  44     1.00e+000      1.0026e+006
                      APLM-S    1.46e+001  10     7.60e+000      1.0026e+006
                      FISTA-p   5.60e+000  10     5.50e+000      1.0026e+006
                      FISTA     1.09e+001  10     8.50e+000      1.0027e+006
                      ProxGrad  3.31e+002  6186   -              -

Table 5: Numerical results for ogl set 2. We ran the test for ten problems with n = 1000, · · · , 10000, but show only the results for three of them to save space.

Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-dct-1000-5000-1   ADAL      1.14e+001  194    1.00e+000      8.4892e+002
                      FISTA-p   1.21e+001  20     1.11e+001      8.4892e+002
                      FISTA     2.49e+001  24     2.51e+001      8.4893e+002
ogl-dct-1000-10000-1  ADAL      3.31e+001  398    1.00e+000      1.4887e+003
                      FISTA-p   2.54e+001  41     5.61e+000      1.4887e+003
                      FISTA     6.33e+001  44     1.74e+001      1.4887e+003
ogl-dct-1000-15000-1  ADAL      6.09e+001  515    1.00e+000      2.7506e+003
                      FISTA-p   3.95e+001  52     4.44e+000      2.7506e+003
                      FISTA     9.73e+001  54     1.32e+001      2.7506e+003
ogl-dct-1000-20000-1  ADAL      9.52e+001  626    1.00e+000      3.3415e+003
                      FISTA-p   6.66e+001  63     6.10e+000      3.3415e+003
                      FISTA     1.81e+002  64     1.61e+001      3.3415e+003
ogl-dct-1000-25000-1  ADAL      1.54e+002  882    1.00e+000      4.1987e+003
                      FISTA-p   7.50e+001  88     3.20e+000      4.1987e+003
                      FISTA     1.76e+002  89     8.64e+000      4.1987e+003
ogl-dct-1000-30000-1  ADAL      1.87e+002  957    1.00e+000      4.6111e+003
                      FISTA-p   8.79e+001  96     2.86e+000      4.6111e+003
                      FISTA     2.24e+002  94     8.54e+000      4.6111e+003

Table 6: Numerical results for dct set 2 (scalability test) with l1/l2-regularization. All three algorithms were run in factorization mode with a fixed µ = µ0.

F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
F. Bach. Structured sparsity-inducing norms through submodular functions. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 118–126, 2010.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
D.P. Bertsekas. Multiplier methods: a survey. Automatica, 12(2):133–145, 1976.
D.P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1999. ISBN 1886529000.
D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., 1989.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010.
P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-dct-1000-5000-1   ADAL      1.53e+001  266    1.00e+000      7.3218e+002
                      FISTA-p   1.61e+001  10     3.05e+001      7.3219e+002
                      FISTA     3.02e+001  16     4.09e+001      7.3233e+002
                      ProxFlow  1.97e+001  -      -              7.3236e+002
ogl-dct-1000-10000-1  ADAL      3.30e+001  330    1.00e+000      1.2707e+003
                      FISTA-p   3.16e+001  10     3.10e+001      1.2708e+003
                      FISTA     7.27e+001  24     3.25e+001      1.2708e+003
                      ProxFlow  3.67e+001  -      -              1.2709e+003
ogl-dct-1000-15000-1  ADAL      4.83e+001  328    1.00e+000      2.2444e+003
                      FISTA-p   5.40e+001  15     2.52e+001      2.2444e+003
                      FISTA     8.64e+001  23     2.66e+001      2.2449e+003
                      ProxFlow  9.91e+001  -      -              2.2467e+003
ogl-dct-1000-20000-1  ADAL      8.09e+001  463    1.00e+000      2.6340e+003
                      FISTA-p   8.09e+001  16     2.88e+001      2.6340e+003
                      FISTA     1.48e+002  26     2.93e+001      2.6342e+003
                      ProxFlow  2.55e+002  -      -              2.6357e+003
ogl-dct-1000-25000-1  ADAL      7.48e+001  309    1.00e+000      3.5566e+003
                      FISTA-p   1.15e+002  30     1.83e+001      3.5566e+003
                      FISTA     2.09e+002  38     2.30e+001      3.5568e+003
                      ProxFlow  1.38e+002  -      -              3.5571e+003
ogl-dct-1000-30000-1  ADAL      9.99e+001  359    1.00e+000      3.7057e+003
                      FISTA-p   1.55e+002  29     2.17e+001      3.7057e+003
                      FISTA     2.60e+002  39     2.25e+001      3.7060e+003
                      ProxFlow  1.07e+002  -      -              3.7063e+003

Table 7: Numerical results for dct set 2 (scalability test) with l1/l∞-regularization. The algorithm configurations are exactly the same as in Table 6. For ProxFlow, the Iters and Avg Sub-iters fields are not applicable.

S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999. ISSN 1064-8275.
X. Chen, Q. Lin, S. Kim, J. Peña, J.G. Carbonell, and E.P. Xing. An efficient proximal-gradient method for single and multi-task regression with structured sparsity. arXiv preprint arXiv:1005.4717, 2010.
P.L. Combettes and J.C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212, 2011.
P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168–1200, 2006. ISSN 1540-3459.
Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-dct-1000-5000-1   FISTA-p   1.83e+001  12     2.34e+001      8.4892e+002
                      FISTA     2.49e+001  24     2.51e+001      8.4893e+002
                      ADAL      1.35e+001  181    1.00e+000      8.4892e+002
ogl-dct-1000-10000-1  FISTA-p   3.16e+001  14     1.73e+001      1.4887e+003
                      FISTA     6.33e+001  44     1.74e+001      1.4887e+003
                      ADAL      4.43e+001  270    1.00e+000      1.4887e+003
ogl-dct-1000-15000-1  FISTA-p   4.29e+001  14     1.51e+001      2.7506e+003
                      FISTA     9.73e+001  54     1.32e+001      2.7506e+003
                      ADAL      5.37e+001  216    1.00e+000      2.7506e+003
ogl-dct-1000-20000-1  FISTA-p   7.53e+001  13     2.06e+001      3.3416e+003
                      FISTA     1.81e+002  64     1.61e+001      3.3415e+003
                      ADAL      1.57e+002  390    1.00e+000      3.3415e+003
ogl-dct-1000-25000-1  FISTA-p   7.41e+001  15     1.47e+001      4.1987e+003
                      FISTA     1.76e+002  89     8.64e+000      4.1987e+003
                      ADAL      8.79e+001  231    1.00e+000      4.1987e+003
ogl-dct-1000-30000-1  FISTA-p   8.95e+001  14     1.58e+001      4.6111e+003
                      FISTA     2.24e+002  94     8.54e+000      4.6111e+003
                      ADAL      1.12e+002  249    1.00e+000      4.6111e+003

Table 8: Numerical results for the DCT set with l1/l2-regularization. FISTA-p and ADAL were run in PCG mode with the dynamic scheme for updating µ; µ was fixed at µ0 for FISTA.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.
J. Eckstein and D.P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293–318, 1992. ISSN 0025-5610.
J. Eckstein and P.J.S. Silva. A practical relative error criterion for augmented Lagrangians. Mathematical Programming, pages 1–30, 2012.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976. ISSN 0898-1221.
R. Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Rev. Française d'Automat. Inf. Recherche Opérationnelle, (9):41–76, 1975.
D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, Series A, 2011.
Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
ogl-dct-1000-5000-1   FISTA-p   2.30e+001  11     2.93e+001      7.3219e+002
                      ADAL      1.89e+001  265    1.00e+000      7.3218e+002
                      FISTA     3.02e+001  16     4.09e+001      7.3233e+002
                      ProxFlow  1.97e+001  -      -              7.3236e+002
ogl-dct-1000-10000-1  FISTA-p   5.09e+001  11     3.16e+001      1.2708e+003
                      ADAL      4.77e+001  323    1.00e+000      1.2708e+003
                      FISTA     7.27e+001  24     3.25e+001      1.2708e+003
                      ProxFlow  3.67e+001  -      -              1.2709e+003
ogl-dct-1000-15000-1  FISTA-p   6.33e+001  12     2.48e+001      2.2445e+003
                      ADAL      9.41e+001  333    1.00e+000      2.2444e+003
                      FISTA     8.64e+001  23     2.66e+001      2.2449e+003
                      ProxFlow  9.91e+001  -      -              2.2467e+003
ogl-dct-1000-20000-1  FISTA-p   8.21e+001  12     2.42e+001      2.6341e+003
                      ADAL      1.59e+002  415    1.00e+000      2.6340e+003
                      FISTA     1.48e+002  26     2.93e+001      2.6342e+003
                      ProxFlow  2.55e+002  -      -              2.6357e+003
ogl-dct-1000-25000-1  FISTA-p   1.43e+002  13     2.98e+001      3.5567e+003
                      ADAL      1.20e+002  310    1.00e+000      3.5566e+003
                      FISTA     2.09e+002  38     2.30e+001      3.5568e+003
                      ProxFlow  1.38e+002  -      -              3.5571e+003
ogl-dct-1000-30000-1  FISTA-p   1.75e+002  13     3.18e+001      3.7057e+003
                      ADAL      2.01e+002  361    1.00e+000      3.7057e+003
                      FISTA     2.60e+002  39     2.25e+001      3.7060e+003
                      ProxFlow  1.07e+002  -      -              3.7063e+003

Table 9: Numerical results for the DCT set with l1/l∞-regularization. FISTA-p and ADAL were run in PCG mode. The dynamic updating scheme for µ was applied to FISTA-p, while µ was fixed at µ0 for ADAL and FISTA.

Data Sets             Algs      CPU (s)    Iters  Avg Sub-iters  F(x)
BreastCancerData      ADAL      6.24e+000  136    1.00e+000      2.9331e+003
                      APLM-S    4.02e+001  12     4.55e+001      2.9331e+003
                      FISTA-p   6.86e+000  12     1.48e+001      2.9331e+003
                      FISTA     5.11e+001  75     1.29e+001      2.9340e+003
                      ProxGrad  7.76e+002  6605   1.00e+000      -

Table 10: Numerical results for Breast Cancer Data using l1/l2-regularization. In this experiment, we kept µ constant at 0.01 for ADAL. The CPU time is for a single run on the entire data set with the value of λ selected to minimize the RMSE in Figure 4.

T. Goldstein and S. Osher. The split Bregman method for L1-regularized problems. SIAM Journal on Imaging Sciences, 2:323, 2009.
G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996. ISBN 0801854148.
M.R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 417–424. ACM, 2009.
L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.
R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In AISTATS, 2010.
R. Jenatton, J.Y. Audibert, F. Bach, et al. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.
S. Kim and E.P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proceedings of the 27th Annual International Conference on Machine Learning, 2010.
Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
J. Liu and J. Ye. Fast overlapping group lasso. arXiv preprint arXiv:1009.0306, 2010.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1558–1566, 2010.
J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12:2681–2720, 2011.
S. Mosci, S. Villa, A. Verri, and L. Rosasco. A primal-dual algorithm for group sparse regularization with overlapping groups. In Neural Information Processing Systems, 2010.
Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005. ISSN 0025-5610.
J. Nocedal and S.J. Wright. Numerical Optimization. Springer Verlag, 1999.
G. Obozinski, L. Jacob, and J.P. Vert. Group lasso with overlaps: the latent group lasso approach. arXiv preprint arXiv:1110.0413, 2011.
J.C. Pesquet and N. Pustelnik. A parallel inertial proximal optimization method. Preprint, 2010.
G. Peyre and J. Fadili. Group sparsity with overlapping partition functions. In EUSIPCO 2011, 2011.
G. Peyre, J. Fadili, and C. Chesneau. Adaptive structured block sparsity via dyadic partitioning. In EUSIPCO 2011, 2011.
M.J.D. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization. Academic Press, New York, 1972.
Z. Qin, K. Scheinberg, and D. Goldfarb. Efficient block-coordinate descent algorithms for the group lasso. Technical report, Department of Industrial Engineering and Operations Research, Columbia University, 2010.
R.T. Rockafellar. The multiplier method of Hestenes and Powell applied to convex programming. Journal of Optimization Theory and Applications, 12(6):555–562, 1973.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pages 848–855. ACM, 2008.
S. Setzer, G. Steidl, and T. Teuber. Deblurring Poissonian images by split Bregman techniques. Journal of Visual Communication and Image Representation, 21(3):193–199, 2010.
J.E. Spingarn. Partial inverse of a monotone operator. Applied Mathematics & Optimization, 10(1):247–265, 1983.
A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545, 2005.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: principles and practice of background maintenance. In ICCV, page 255. IEEE Computer Society, 1999.
M.J. Van De Vijver, Y.D. He, L.J. van't Veer, H. Dai, A.A.M. Hart, D.W. Voskuil, G.J. Schreiber, J.L. Peterse, C. Roberts, M.J. Marton, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999, 2002.
S.J. Wright, R.D. Nowak, and M.A.T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. ISSN 1467-9868.