
79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding


Source: pdf

Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach

Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1-norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images; then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in the reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.

Keywords: proximal methods, dictionary learning, structured sparsity, matrix factorization
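The key computational result stated in the abstract (the proximal operator of the tree-structured norm is the composition of elementary group proximal operators, applied from the leaves of the tree to its root) can be sketched in a few lines. The NumPy sketch below is illustrative, not the authors' released code: the function name prox_tree_norm, the (indices, weight) group interface, and the toy tree are assumptions, and it covers only the case of ℓ2-norms on the groups.

    import numpy as np

    def prox_tree_norm(v, groups, lam):
        # Proximal operator of the tree-structured norm
        # Omega(w) = sum_g weight_g * ||w_[g]||_2, computed as a
        # composition of elementary group soft-thresholdings.
        # `groups` is a list of (indices, weight) pairs ordered from the
        # leaves of the tree to its root, so that every group is processed
        # before any group that contains it.
        w = v.astype(float).copy()
        for idx, weight in groups:
            norm = np.linalg.norm(w[idx])
            thresh = lam * weight
            if norm <= thresh:
                w[idx] = 0.0                    # the whole subtree is zeroed out
            else:
                w[idx] *= 1.0 - thresh / norm   # group soft-thresholding
        return w

    # Toy tree over three variables: leaf groups {1} and {2}, root group {0, 1, 2}.
    groups = [(np.array([1]), 1.0), (np.array([2]), 1.0),
              (np.array([0, 1, 2]), 1.0)]
    print(prox_tree_norm(np.array([3.0, 0.5, -2.0]), groups, lam=0.4))

The leaf-to-root ordering is what makes this single pass exact; applying the same thresholdings in a different order would not, in general, yield the proximal operator.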


reference text

M. Aharon, M. Elad, and A. M. Bruckstein. The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

R. K. Ahuja, T. L. Magnanti, and J. Orlin. Network Flows. Prentice Hall, 1993.

F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, 2008.

F. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, 2010.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011. To appear.

R. Baraniuk. Optimal tree approximation with wavelets. Wavelet Applications in Signal and Image Processing VII, 3813:206–214, 1999.

R. G. Baraniuk, R. A. DeVore, G. Kyriazis, and X. M. Yu. Near best tree approximation. Advances in Computational Mathematics, 16(4):357–373, 2002.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982–2001, 2010.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

S. Becker, J. Bobin, and E. Candès. NESTA: A fast and accurate first-order method for sparse recovery. Technical report, preprint arXiv:0904.3367, 2009.

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 2009.

D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.

P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

D. Blei and J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems, 2008.

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

D. Blei, T. L. Griffiths, and M. I. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30, 2010.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer, 2006.

S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

D. M. Bradley and J. A. Bagnell. Differentiable sparse coding. In Advances in Neural Information Processing Systems, 2009.

P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3:163–166, 1984.

W. L. Buntine. Variational extensions to EM and multinomial PCA. In Proceedings of the European Conference on Machine Learning (ECML), 2002.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

X. Chen, Q. Lin, S. Kim, J. Peña, J. G. Carbonell, and E. P. Xing. An efficient proximal-gradient method for single and multi-task regression with structured sparsity. Technical report, preprint arXiv:1005.4717, 2010.

R. R. Coifman and D. L. Donoho. Translation-invariant de-noising. Lecture Notes in Statistics, pages 125–150, 1995.

P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2010.

M. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, 46(4):886–902, 1998.

D. L. Donoho. CART and best-ortho-basis: a connection. Annals of Statistics, pages 1870–1911, 1997.

D. L. Donoho and I. M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432), 1995.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, December 2006.

J. Friedman, T. Hastie, and R. Tibshirani. A note on the group lasso and a sparse group lasso. Technical report, preprint arXiv:1001.0736, 2010.

T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228, 2004.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.

L. He and L. Carin. Exploiting structure in wavelet-based Bayesian compressive sensing. IEEE Transactions on Signal Processing, 57:3488–3497, 2009.

C. Hu, J. T. Kwok, and W. Pan. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, 2009.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, preprint arXiv:0904.3523, 2009.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.

S. Kim and E. P. Xing. Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, 2008.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, 2007.

N. Maculan and J. R. G. Galdino de Paula. A linear-time median-finding algorithm for projecting a vector on the simplex of R^n. Operations Research Letters, 8(4):219–222, 1989.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems, 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2009b.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1):19–60, 2010a.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b.

S. G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2001.

C. A. Micchelli, J. M. Morales, and M. Pontil. A family of penalty functions for structured sparsity. In Advances in Neural Information Processing Systems, 2010.

J. J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris Sér. A Math., 255:2897–2899, 1962.

D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

M. Schmidt and K. Murphy. Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

J. M. Shapiro. Embedded image coding using zerotrees of wavelet coefficients. IEEE Transactions on Signal Processing, 41(12):3445–3462, 1993.

P. Sprechmann, I. Ramirez, G. Sapiro, and Y. C. Eldar. Collaborative hierarchical sparse modeling. Technical report, preprint arXiv:1003.0400, 2010.

G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing, 57(8):3075–3085, 2009.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, pages 267–288, 1996.

J. A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3), 2006.

M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202, 2009.

S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems, 2009.

G. X. Yuan, K. W. Chang, C. J. Hsieh, and C. J. Lin. Comparison of optimization methods and software for large-scale ℓ1-regularized linear classification. Journal of Machine Learning Research, 11:3183–3234, 2010.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37(6A):3468–3497, 2009.

J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In Proceedings of the International Conference on Machine Learning (ICML), 2009.

D. Zoran and Y. Weiss. The "tree-dependent components" of natural scenes are edge filters. In Advances in Neural Information Processing Systems, 2009.