nips2004-98-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nyström method, which results in a complex, data-driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance.
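The second step described in the abstract extrapolates the learned covariance matrix, which is defined only on the fixed set of input points, to arbitrary new inputs. Below is a minimal sketch of one Nyström-style way to do this, assuming a base smoothing kernel k0; the RBF choice and all function names here are illustrative assumptions, not the exact construction from the paper.

```python
import numpy as np

def rbf(X, Z, lengthscale=1.0):
    """Base smoothing kernel k0 (an assumed choice, not from the paper)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def extrapolate_kernel(X1, X2, X_base, K_learned, lengthscale=1.0, jitter=1e-8):
    """Nystroem-style extrapolation of a learned covariance matrix K_learned
    (defined on the fixed points X_base) to new inputs X1, X2, following the
    pattern k(x, z) ~ k0(x)^T K0^{-1} K_learned K0^{-1} k0(z)."""
    K0 = rbf(X_base, X_base, lengthscale) + jitter * np.eye(len(X_base))
    A = np.linalg.solve(K0, rbf(X_base, X1, lengthscale))  # K0^{-1} k0(X_base, X1)
    B = np.linalg.solve(K0, rbf(X_base, X2, lengthscale))  # K0^{-1} k0(X_base, X2)
    return A.T @ K_learned @ B
```

In this reading, the base kernel only smooths the learned matrix between the fixed points; evaluated at the base points themselves, the extrapolated kernel approximately recovers K_learned (up to the jitter term).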
[1] Bakker, B. and Heskes, T. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83–99, 2003.
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] Breese, J. S., Heckerman, D., and Kadie, C. Empirical analysis of predictive algorithms for collaborative filtering. Tech. Rep. MSR-TR-98-12, Microsoft Research, 1998.
[4] Caruana, R. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[5] Chapelle, O. and Harchaoui, Z. A machine learning approach to conjoint analysis. In L. Saul, Y. Weiss, and L. Bottou, eds., Advances in Neural Information Processing Systems 17. MIT Press, 2005.
[6] Gelman, A., Carlin, J., Stern, H., and Rubin, D. Bayesian Data Analysis. CRC Press, 1995.
[7] Lawrence, N. D. and Platt, J. C. Learning to learn with the informative vector machine. In R. Greiner and D. Schuurmans, eds., Proceedings of ICML04. Morgan Kaufmann, 2004.
[8] Minka, T. P. and Picard, R. W. Learning how to learn is learning with point sets, 1999. Unpublished manuscript. Revised 1999.
[9] Schafer, J. L. Analysis of Incomplete Multivariate Data. Chapman & Hall, 1997.
[10] Vishwanathan, S., Guttman, O., Borgwardt, K. M., and Smola, A. Kernel extrapolation, 2005. Unpublished manuscript.
[11] Williams, C. K. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
[12] Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, eds., Advances in Neural Information Processing Systems 13, pp. 682–688. MIT Press, 2001.
[13] Yu, K., Schwaighofer, A., Tresp, V., Ma, W.-Y., and Zhang, H. Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. In C. Meek and U. Kjærulff, eds., Proceedings of UAI 2003, pp. 616–623, 2003.
[14] Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of ICML03. Morgan Kaufmann, 2003.

Appendix

To derive an EM algorithm for Eq. (2), we treat the functional values $f^i$ in each scenario $i$ as the unknown variables. In each EM iteration $t$, the parameters to be estimated are $\theta^{(t)} = \{m^{(t)}, K^{(t)}, \sigma^{2(t)}\}$. In the E-step, the sufficient statistics are computed,

$$E\Big[\sum_{i=1}^{M} f^i \,\Big|\, y^i, \theta^{(t)}\Big] = \sum_{i=1}^{M} \tilde f^{i,(t)} \qquad (10)$$

$$E\Big[\sum_{i=1}^{M} f^i (f^i)^\top \,\Big|\, y^i, \theta^{(t)}\Big] = \sum_{i=1}^{M} \Big( \tilde f^{i,(t)} (\tilde f^{i,(t)})^\top + \tilde C^i \Big) \qquad (11)$$

with $\tilde f^i$ and $\tilde C^i$ defined in Eq. (4) and (5). In the M-step, the parameters $\theta$ are re-estimated as $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$, with

$$Q(\theta \mid \theta^{(t)}) = E\big[\, l_p(\theta \mid f, y) \mid y, \theta^{(t)} \,\big], \qquad (12)$$

where $l_p$ stands for the penalized log-likelihood of the complete data,

$$l_p(\theta \mid f, y) = \log \mathcal{W}^{-1}(K \mid \alpha, \beta) + \log N(m \mid \nu, \eta^{-1} K) + \sum_{i=1}^{M} \log N(f^i \mid m, K) + \sum_{i=1}^{M} \log N\big(y^i \mid f^i_{I(i)}, \sigma^2 \mathbf{1}\big). \qquad (13)$$

Updated parameters are obtained by setting the partial derivatives of $Q(\theta \mid \theta^{(t)})$ to zero.
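To make the E- and M-steps concrete, the following is a minimal NumPy sketch of one EM iteration under the simplifying assumption that the priors on K and m are dropped, i.e., only the two likelihood terms of Eq. (13) are maximized; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def em_step(Y, idx_sets, m, K, sigma2):
    """One EM iteration for the hierarchical GP model, as a rough illustration of
    the appendix. Y[i] holds the observations y^i of scenario i at the index set
    idx_sets[i] (the role of I(i)). The E-step computes the posterior mean and
    covariance of f^i given y^i (analogues of Eqs. (4) and (5)); the M-step shown
    here is the plain maximum-likelihood update from the sufficient statistics,
    omitting the inverse-Wishart and normal penalty terms of Eq. (13)."""
    n, M = K.shape[0], len(Y)
    f_sum, ff_sum = np.zeros(n), np.zeros((n, n))
    resid, n_obs = 0.0, 0
    for y, idx in zip(Y, idx_sets):
        # E-step for scenario i: condition the GP on the observed entries.
        K_oo = K[np.ix_(idx, idx)] + sigma2 * np.eye(len(idx))
        K_ao = K[:, idx]
        gain = np.linalg.solve(K_oo, K_ao.T).T           # K_ao K_oo^{-1}
        f_t = m + gain @ (y - m[idx])                    # E[f^i | y^i]
        C_t = K - gain @ K_ao.T                          # Cov[f^i | y^i]
        # Accumulate the sufficient statistics of Eqs. (10) and (11).
        f_sum += f_t
        ff_sum += np.outer(f_t, f_t) + C_t
        resid += np.sum((y - f_t[idx]) ** 2) + np.trace(C_t[np.ix_(idx, idx)])
        n_obs += len(idx)
    # M-step (unpenalized): re-estimate mean, kernel matrix, and noise variance.
    m_new = f_sum / M
    K_new = ff_sum / M - np.outer(m_new, m_new)
    sigma2_new = resid / n_obs
    return m_new, K_new, sigma2_new
```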