jmlr2011-94: Theoretical Analysis of Bayesian Matrix Factorization (JMLR, 2011)
Source: pdf
Authors: Shinichi Nakajima, Masashi Sugiyama
Abstract: Recently, variational Bayesian (VB) techniques have been applied to probabilistic matrix factorization and shown to perform very well in experiments. In this paper, we theoretically elucidate properties of the VB matrix factorization (VBMF) method. Through a finite-sample analysis of the VBMF estimator, we show that it involves two types of shrinkage: positive-part James-Stein (PJS) shrinkage and trace-norm shrinkage, both acting on each singular component separately to produce low-rank solutions. The trace-norm shrinkage is simply induced by non-flat prior information, similarly to the maximum a posteriori (MAP) approach; hence no trace-norm shrinkage remains when priors are non-informative. On the other hand, we show the counter-intuitive fact that the PJS shrinkage remains active even with flat priors. This effect is induced by the non-identifiability of the matrix factorization model, that is, the mapping between the target matrix and the factorized matrices is not one-to-one. We call this model-induced regularization. We further extend our analysis to empirical Bayes scenarios where hyperparameters are also learned by minimizing the VB free energy. Throughout the paper, we assume that the observed matrix has no missing entries; collaborative filtering is therefore out of scope. Keywords: matrix factorization, variational Bayes, empirical Bayes, positive-part James-Stein shrinkage, non-identifiable model, model-induced regularization
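To make the abstract's two shrinkage mechanisms concrete, here is a minimal NumPy sketch, not the paper's derived VBMF solution: trace_norm_shrinkage soft-thresholds each singular value (the prior-induced effect, which vanishes under flat priors), while pjs_shrinkage rescales each singular value by a positive-part James-Stein-type factor (the model-induced effect, which persists under flat priors). The threshold lam and the constant c are illustrative placeholders, not the closed-form factors derived in the paper.

```python
import numpy as np

def trace_norm_shrinkage(V, lam):
    """Soft-threshold each singular value of V (trace-norm shrinkage).

    lam is an illustrative threshold, not the paper's prior-derived value."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def pjs_shrinkage(V, sigma2, c):
    """Rescale each singular value gamma of V by the positive-part
    James-Stein-type factor max(0, 1 - c * sigma2 / gamma^2).

    The constant c is a placeholder, not the factor derived in the paper."""
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    factors = np.maximum(0.0, 1.0 - c * sigma2 / np.maximum(s**2, 1e-12))
    return U @ np.diag(factors * s) @ Vt

# Toy usage: observe a rank-3 matrix plus noise. Small (noise-driven)
# singular components are shrunk to zero, yielding a low-rank estimate;
# for this seed and these constants the printed rank is typically 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 20))
V = A + 0.5 * rng.standard_normal(A.shape)
V_hat = pjs_shrinkage(V, sigma2=0.25, c=120.0)
print(np.linalg.matrix_rank(V_hat, tol=1e-8))
```

Note that in the paper both factors are derived analytically from the VB posterior rather than applied as post-hoc operations on an SVD; the sketch only mimics their qualitative per-component action.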
References:
T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, New York, second edition, 1984.
H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 21–30, San Francisco, CA, 1999. Morgan Kaufmann.
P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, USA, 1998.
P. F. Baldi and K. Hornik. Learning in linear neural networks: A survey. IEEE Transactions on Neural Networks, 6(4):837–858, 1995.
A. J. Baranchik. Multiple regression and estimation of the mean of a multivariate normal distribution. Technical Report 51, Department of Statistics, Stanford University, Stanford, CA, USA, 1964.
J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society B, 48:259–302, 1986.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA, 2006.
J. F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
O. Chapelle and Z. Harchaoui. A machine learning approach to conjoint analysis. In Advances in Neural Information Processing Systems, volume 17, pages 257–264, 2005.
W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
A. Cichocki, R. Zdunek, A. H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, West Sussex, UK, 2009.
M. J. Daniels and R. E. Kass. Shrinkage estimators for covariance matrices. Biometrics, 57(4):1173–1184, 2001.
B. Efron and C. Morris. Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68:117–130, 1973.
S. Funk. Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.
A. Gelman. Parameterization and Bayesian modeling. Journal of the American Statistical Association, 99:537–545, 2004.
Y. Y. Guo and N. Pal. A sequence of improvements over the James-Stein estimator. Journal of Multivariate Analysis, 42(2):302–317, 1992.
K. Hayashi, J. Hirayama, and S. Ishii. Dynamic exponential family matrix factorization. In T. Theeramunkong, B. Kijsirikul, N. Cercone, and T.-B. Ho, editors, Advances in Knowledge Discovery and Data Mining, volume 5476 of Lecture Notes in Computer Science, pages 452–462, Berlin, 2009. Springer.
H. Hotelling. Relations between two sets of variates. Biometrika, 28(3–4):321–377, 1936.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, New York, 2001.
W. James and C. Stein. Estimation with quadratic loss. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, Berkeley, CA, USA, 1961. University of California Press.
H. Jeffreys. An invariant form for the prior probability in estimation problems. In Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, volume 186, pages 453–461, 1946.
J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Communications of the ACM, 40(3):77–87, 1997.
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
O. Ledoit and M. Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365–411, 2004.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
E. L. Lehmann. Theory of Point Estimation. Wiley, New York, 1983.
Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007.
D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(2):415–447, 1992.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, UK, 2003.
A. W. Marshall, I. Olkin, and B. C. Arnold. Inequalities: Theory of Majorization and Its Applications, second edition. Springer, 2009.
S. Nakajima and M. Sugiyama. Implicit regularization in variational Bayesian matrix factorization. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 2010.
S. Nakajima and S. Watanabe. Variational Bayes solution of linear neural networks and its generalization performance. Neural Computation, 19(4):1112–1153, 2007.
R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.
A. Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and Workshop, 2007.
T. Raiko, A. Ilin, and J. Karhunen. Principal component analysis for large scale problems with lots of missing values. In J. Kok, J. Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenic, and A. Skowron, editors, Proceedings of the 18th European Conference on Machine Learning, volume 4701 of Lecture Notes in Computer Science, pages 691–698, Berlin, 2007. Springer-Verlag.
G. R. Reinsel and R. P. Velu. Multivariate Reduced-Rank Regression: Theory and Applications. Springer, New York, 1998.
J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713–719, 2005.
J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14(3):1080–1100, 1986.
R. Rosipal and N. Krämer. Overview and recent advances in partial least squares. In C. Saunders, M. Grobelnik, S. Gunn, and J. Shawe-Taylor, editors, Subspace, Latent Structure and Feature Selection Techniques, volume 3940 of Lecture Notes in Computer Science, pages 34–51, Berlin, 2006. Springer.
R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1257–1264, Cambridge, MA, 2008. MIT Press.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
P. Y.-S. Shao and W. E. Strawderman. Improving on the James-Stein positive-part estimator. The Annals of Statistics, 22:1517–1538, 1994.
N. Srebro and T. Jaakkola. Weighted low rank approximation. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning. AAAI Press, 2003.
N. Srebro, J. Rennie, and T. Jaakkola. Maximum margin matrix factorization. In Advances in Neural Information Processing Systems, volume 17, 2005.
C. Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability, pages 197–206, 1956.
C. Stein. Estimation of a covariance matrix. Rietz Lecture, 39th Annual Meeting of the IMS, 1975.
W. E. Strawderman. Proper Bayes minimax estimators of the multivariate normal mean. Annals of Mathematical Statistics, 42:385–388, 1971.
D. Tao, M. Song, X. Li, J. Shen, J. Sun, X. Wu, C. Faloutsos, and S. J. Maybank. Tensor approach for 3-D face modeling. IEEE Transactions on Circuits and Systems for Video Technology, 18(10):1397–1410, 2008.
K. Watanabe and S. Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.
S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001.
S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge, UK, 2009.
D. Wipf and S. Nagarajan. A new view of automatic relevance determination. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1625–1632, Cambridge, MA, 2008. MIT Press.
H. Wold. Estimation of principal components and related models by iterative least squares. In P. R. Krishnaiah, editor, Multivariate Analysis, pages 391–420. Academic Press, New York, NY, USA, 1966.
K. J. Worsley, J.-B. Poline, K. J. Friston, and A. C. Evans. Characterizing the response of PET and fMRI data using multivariate linear models. NeuroImage, 6(4):305–319, 1997.
K. Yamazaki and S. Watanabe. Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16(7):1029–1038, 2003.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian processes from multiple tasks. In Proceedings of the Twenty-Second International Conference on Machine Learning, pages 1012–1019, 2005.
S. Yu, J. Bi, and J. Ye. Probabilistic interpretations and extensions for a family of 2D PCA-style algorithms. In KDD Workshop on Data Mining using Matrices and Tensors, 2008.