nips nips2007 nips2007-8 nips2007-8-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: David P. Wipf, Srikantan S. Nagarajan
Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted ℓ1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation.
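The abstract's central computational claim is that the ARD cost can be minimized by solving a sequence of re-weighted ℓ1 (weighted Lasso) problems. The Python sketch below illustrates only the generic outer loop under simplifying assumptions: the inner weighted Lasso is solved by plain proximal gradient descent (ISTA), and the weight update is the generic 1/(|x_i| + eps) rule, used purely for illustration. The paper's actual ARD-derived weights are feature- and noise-dependent and are not reproduced here.

```python
# Illustrative sketch (not the paper's exact algorithm): iterative re-weighted
# L1 minimization for a sparse linear model y ~ Phi @ x. Each outer iteration
# solves a weighted Lasso via ISTA (proximal gradient with weighted
# soft-thresholding). The 1/(|x_i| + eps) re-weighting rule is a simplified
# placeholder; the ARD-derived weights in the paper differ.
import numpy as np

def weighted_lasso_ista(Phi, y, w, lam, n_iter=500):
    """Minimize ||y - Phi x||^2 + lam * sum_i w_i |x_i| by proximal gradient."""
    L = np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the smooth part
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ x - y)          # gradient of (1/2)||y - Phi x||^2
        z = x - grad / L
        thresh = lam * w / (2 * L)            # per-coordinate threshold
        x = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)  # weighted soft-threshold
    return x

def reweighted_l1(Phi, y, lam=0.1, eps=1e-3, n_outer=10):
    w = np.ones(Phi.shape[1])                 # first pass is an ordinary Lasso
    for _ in range(n_outer):
        x = weighted_lasso_ista(Phi, y, w, lam)
        w = 1.0 / (np.abs(x) + eps)           # placeholder re-weighting rule
    return x

# Tiny usage example with a synthetic sparse signal.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[[5, 17, 42]] = [1.0, -2.0, 1.5]
y = Phi @ x_true + 0.01 * rng.standard_normal(50)
x_hat = reweighted_l1(Phi, y)
print("recovered support:", np.flatnonzero(np.abs(x_hat) > 1e-2))
```

Because each inner problem is a convex weighted Lasso, the same outer loop could equally be driven by a standard convex programming toolbox, which is the extensibility advantage the abstract emphasizes.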
[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[2] S.F. Cotter, B.D. Rao, K. Engan, and K. Kreutz-Delgado, “Sparse solutions to linear inverse problems with multiple measurement vectors,” IEEE Trans. Signal Processing, vol. 53, no. 7, pp. 2477–2488, April 2005.
[3] M. Fazel, H. Hindi, and S. Boyd, “Log-Det Heuristic for Matrix Rank Minimization with Applications to Hankel and Euclidean Distance Matrices,” Proc. American Control Conf., vol. 3, pp. 2156–2162, June 2003.
[4] M.A.T. Figueiredo, “Adaptive sparseness using Jeffreys prior,” Advances in Neural Information Processing Systems 14, pp. 697–704, 2002.
[5] H. Gao, “Wavelet shrinkage denoising using the nonnegative garrote,” Journal of Computational and Graphical Statistics, vol. 7, no. 4, pp. 469–488, 1998.
[6] D.G. Luenberger, Linear and Nonlinear Programming, Addison–Wesley, Reading, Massachusetts, 2nd ed., 1984.
[7] D.J.C. MacKay, “Bayesian interpolation,” Neural Comp., vol. 4, no. 3, pp. 415–447, 1992.
[8] D.J.C. MacKay, “Comparison of approximate methods for handling hyperparameters,” Neural Comp., vol. 11, no. 5, pp. 1035–1068, 1999.
[9] D.M. Malioutov, M. Çetin, and A.S. Willsky, “Sparse signal reconstruction perspective for source localization with sensor arrays,” IEEE Trans. Signal Processing, vol. 53, no. 8, pp. 3010–3022, August 2005.
[10] R.M. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.
[11] R. Pique-Regi, E.S. Tsau, A. Ortega, R.C. Seeger, and S. Asgharzadeh, “Wavelet footprints and sparse Bayesian learning for DNA copy number change analysis,” Int. Conf. Acoustics Speech and Signal Processing, April 2007.
[12] R.R. Ramírez, Neuromagnetic Source Imaging of Spontaneous and Evoked Human Brain Dynamics, PhD Thesis, New York University, 2005.
[13] J.G. Silva, J.S. Marques, and J.M. Lemos, “Selecting landmark points for sparse manifold learning,” Advances in Neural Information Processing Systems 18, pp. 1241–1248, 2006.
[14] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” Journal of the Royal Statistical Society, Series B, vol. 58, no. 1, pp. 267–288, 1996.
[15] M.E. Tipping, “Sparse Bayesian learning and the relevance vector machine,” Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.
[16] M.E. Tipping and A.C. Faul, “Fast marginal likelihood maximisation for sparse Bayesian models,” Ninth Int. Workshop Artificial Intelligence and Statistics, Jan. 2003.
[17] M.B. Wakin, M.F. Duarte, S. Sarvotham, D. Baron, and R.G. Baraniuk, “Recovery of jointly sparse signals from a few random projections,” Advances in Neural Information Processing Systems 18, pp. 1433–1440, 2006.
[18] D.P. Wipf, Bayesian Methods for Finding Sparse Representations, PhD Thesis, UC San Diego, 2006.
[19] C.F. Wu, “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, pp. 95–103, 1983.