
222 nips-2012-Multi-Task Averaging


Source: pdf

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient.
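The abstract's central claim is that the MTA estimate is a convex combination of the single-task sample averages, obtained by regularizing the means toward one another. The Python sketch below illustrates one way such an estimator can be realized: smoothing the sample means with a graph Laplacian built from pairwise task similarities. The uniform similarity matrix, the `gamma` knob, and the exact form of the weight matrix are illustrative assumptions, not the paper's derived optimal regularization.

```python
import numpy as np

def mta_estimate(samples, gamma=1.0):
    """Illustrative multi-task averaging (a sketch, not the paper's
    exact estimator): returns a convex combination of the single-task
    sample means.

    samples : list of 1-D arrays, one per task
    gamma   : regularization strength (hypothetical knob; the paper
              estimates the optimal amount from the data)
    """
    T = len(samples)
    ybar = np.array([s.mean() for s in samples])                  # single-task averages
    sigma2 = np.array([s.var(ddof=1) / len(s) for s in samples])  # variance of each average

    # Assumed uniform pairwise task similarity, for illustration only.
    A = np.ones((T, T)) - np.eye(T)
    L = np.diag(A.sum(axis=1)) - A                                # graph Laplacian of A

    # Tikhonov-style smoothing of the sample means across tasks.
    W = np.linalg.inv(np.eye(T) + (gamma / T) * np.diag(sigma2) @ L)
    return W @ ybar

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks = [rng.normal(mu, 1.0, size=20) for mu in (0.0, 0.5, 1.0)]
    print(mta_estimate(tasks))  # each estimate is pulled toward the others
```

Because the Laplacian annihilates the all-ones vector, the rows of W sum to one, and a standard M-matrix argument (cf. [17]) shows they are nonnegative for nonnegative similarities; each returned estimate is therefore a convex combination of the single-task averages, matching the structure the abstract describes.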


reference text

[1] C. Stein, “Inadmissibility of the usual estimator for the mean of a multivariate normal distribution,” Proc. Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 197–206, 1956.

[2] B. Efron and C. N. Morris, “Stein’s paradox in statistics,” Scientific American, vol. 236, no. 5, pp. 119–127, 1977.

[3] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” Journal of Machine Learning Research, vol. 6, pp. 1705–1749, December 2005.

[4] C. A. Micchelli and M. Pontil, “Kernels for multi-task learning,” in Advances in Neural Information Processing Systems (NIPS), 2004.

[5] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, “Multi-task Gaussian process prediction,” in Advances in Neural Information Processing Systems (NIPS). MIT Press, 2008.

[6] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task feature learning,” Machine Learning, vol. 73, no. 3, pp. 243–272, 2008.

[7] W. James and C. Stein, “Estimation with quadratic loss,” Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 361–379, 1961.

[8] M. E. Bock, “Minimax estimators of the mean of a multivariate normal distribution,” The Annals of Statistics, vol. 3, no. 1, 1975.

[9] G. Casella, “An introduction to empirical Bayes data analysis,” The American Statistician, pp. 83–87, 1985.

[10] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: Springer, 1998.

[11] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications, ser. Monographs on Statistics and Applied Probability. London: Chapman & Hall, 2005, vol. 104.

[12] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, “A spectral regularization framework for multi-task structure learning,” in Advances in Neural Information Processing Systems (NIPS), 2007.

[13] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task learning for classification with Dirichlet process priors,” Journal of Machine Learning Research, vol. 8, pp. 35–63, 2007.

[14] L. Jacob, F. Bach, and J.-P. Vert, “Clustered multi-task learning: A convex formulation,” in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 745–752.

[15] Y. Zhang and D.-Y. Yeung, “A convex formulation for learning task relationships,” in Proc. of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.

[16] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990, corrected reprint of the 1985 original.

[17] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical Sciences. Academic Press, 1979.

[18] B. W. Silverman, Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall, 1986.

[19] D. Brown, J. Dalton, and H. Hoyle, “Spatial forecast methods for terrorist events in urban environments,” Lecture Notes in Computer Science, vol. 3073, pp. 426–435, 2004.