Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model, but expectation propagation offers an approximate alternative. Because estimating feature selection dependencies may suffer from over-fitting in the proposed model, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: the different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data.
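For reference, the building block of the model is the horseshoe prior of Carvalho et al. [3], which places a heavy-tailed half-Cauchy scale on each coefficient. The sketch below shows only the standard single-task form of that prior; the dependency structure between features that the paper adds on top of it is not reproduced here:

    w_j \mid \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2), \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad j = 1, \dots, d,

where C^+(0,1) denotes the standard half-Cauchy distribution on the positive reals. Its heavy tails leave genuinely relevant coefficients essentially unshrunk, while the large mass near zero shrinks irrelevant coefficients strongly towards zero, which is what makes this prior attractive for sparse feature selection.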
[1] I. M. Johnstone and D. M. Titterington. Statistical challenges of high-dimensional data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4237, 2009.
[2] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
[3] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the horseshoe. Journal of Machine Learning Research W&CP, 5:73–80, 2009.
[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
[5] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
[6] J. M. Hernández-Lobato, D. Hernández-Lobato, and A. Suárez. Network-based sparse Bayesian classification. Pattern Recognition, 44:886–900, 2011.
[7] M. Van Gerven, B. Cseke, R. Oostenveld, and T. Heskes. Bayesian source localization with the multivariate Laplace prior. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1901–1909, 2009.
[8] Julia E. Vogt and Volker Roth. The group-lasso: ℓ1,∞ regularization versus ℓ1,2 regularization. In Goesele et al., editors, 32nd Annual Symposium of the German Association for Pattern Recognition, volume 6376, pages 252–261. Springer, 2010.
[9] Y. Kim, J. Kim, and Y. Kim. Blockwise sparse regression. Statistica Sinica, 16(2):375, 2006.
[10] D. Hernández-Lobato, J. M. Hernández-Lobato, T. Helleputte, and P. Dupont. Expectation propagation for Bayesian multi-task feature selection. In José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, Proceedings of the European Conference on Machine Learning, volume 6321, pages 522–537. Springer, 2010.
[11] G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1–22, 2009.
[12] T. Xiong, J. Bi, B. Rao, and V. Cherkassky. Probabilistic joint feature selection for multi-task learning. In Proceedings of the Seventh SIAM International Conference on Data Mining, pages 332–342. SIAM, 2007.
[13] T. Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 55–62. ACM, 2004.
[14] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 41–48. MIT Press, Cambridge, MA, 2007.
[15] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 964–972. 2010.
[16] P. Garrigues and B. Olshausen. Learning horizontal connections in a sparse coding model of natural images. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 505–512. MIT Press, Cambridge, MA, 2008.
[17] T. Peleg, Y. C. Eldar, and M. Elad. Exploiting statistical dependencies in sparse representations for signal recovery. IEEE Transactions on Signal Processing, 60(5):2286–2303, 2012.
[18] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1984.
[19] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.
[20] T. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
[21] M. W. Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.
[22] T. Minka. Power EP. Technical report, Carnegie Mellon University, Department of Statistics, 2004.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[24] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.