
67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination


Source: pdf

Author: Tony Jebara

Abstract: A multitask learning framework is developed for discriminative classification and regression where multiple large-margin linear classifiers are estimated for different prediction problems. These classifiers operate in a common input space but are coupled as they recover an unknown shared representation. A maximum entropy discrimination (MED) framework is used to derive the multitask algorithm, which involves only convex optimization problems that are straightforward to implement. Three multitask scenarios are described. The first multitask method produces multiple support vector machines that learn a shared sparse feature selection over the input space. The second multitask method produces multiple support vector machines that learn a shared conic kernel combination. The third multitask method produces a pooled classifier as well as adaptively specialized individual classifiers. Furthermore, extensions to regression, graphical model structure estimation and other sparse methods are discussed. The maximum entropy optimization problems are implemented via a sequential quadratic programming method which leverages recent progress in fast SVM solvers. Fast monotonic convergence bounds are provided by bounding the MED sparsifying cost function with a quadratic function and ensuring only a constant-factor runtime increase above standard independent SVM solvers. Results are shown on multitask data sets and favor multitask learning over single-task or tabula rasa methods.

Keywords: meta-learning, support vector machines, feature selection, kernel selection, maximum entropy, large margin, Bayesian methods, variational bounds, classification, regression, Lasso, graphical model structure estimation, quadratic programming, convex programming
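To make the bound-and-solve strategy described in the abstract concrete, the sketch below is a minimal illustration, not the paper's MED derivation or code: it substitutes a squared loss (solved by ridge regression) for the large-margin SVM objective and a generic concave log penalty for the MED sparsifying cost. Each iteration replaces the shared sparsity penalty with its quadratic tangent bound and then solves every task in closed form under that bound. The function name, variable names, and the specific penalty h(s) = (1/alpha) * log(1 + alpha * s) are illustrative assumptions.

import numpy as np

def multitask_feature_selection(X_list, y_list, alpha=2.0, lam=1.0,
                                n_iters=50, eps=1e-8):
    """Fit one linear model per task with a shared sparse feature selection.

    Illustrative stand-in for the bound-and-solve idea: the concave
    per-feature penalty h(s_d) = (1/alpha) * log(1 + alpha * s_d), with
    s_d = sum_t w_t[d]^2, acts like an l2 penalty for small alpha and
    saturates (sparsifies) for large alpha. Each iteration upper-bounds
    h by its tangent, which is quadratic in the weights, then solves a
    per-task ridge problem in closed form.
    """
    D = X_list[0].shape[1]          # shared input dimensionality
    T = len(X_list)                 # number of tasks
    W = np.zeros((T, D))            # one weight vector per task
    theta = np.ones(D)              # current per-feature bound weights h'(s_d)
    for _ in range(n_iters):
        # (1) Per-task solve under the shared quadratic bound:
        #     ridge regression with feature-specific regularization lam * theta_d.
        for t, (X, y) in enumerate(zip(X_list, y_list)):
            A = X.T @ X + lam * np.diag(theta)
            W[t] = np.linalg.solve(A, X.T @ y)
        # (2) Re-tighten the bound at the new iterate: theta_d = h'(s_d).
        s = (W ** 2).sum(axis=0)
        theta = 1.0 / (1.0 + alpha * s + eps)
    return W

Because h is concave in s_d, its tangent is a tight upper bound at the current iterate, so each pass cannot increase the surrogate objective; this mirrors, in simplified form, the majorization-style argument behind the monotonic convergence the abstract describes. In a large-margin variant, the inner ridge solve would instead be a standard SVM solver run with the reweighted quadratic regularizer.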


reference text

R.K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, December 2008.
F. Bach, G.R.G. Lanckriet, and M.I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In Proceedings of the International Conference on Machine Learning, 2004.
F.R. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

[Figure 6: Various penalties on the classifier weights. In (a), the penalties are shown as one-dimensional functions h(w(d)) as α varies from α = 0 (which mimics an ℓ2 norm) to large α (which resembles an ℓ1 norm). In (b), a two-dimensional contour plot shows the shape of the ℓ1 penalty as a dashed line, the ℓ2 penalty as a dotted line, and the ℓMED penalty with α = 2 as a solid line.]

J. Baxter. Learning internal representations. In Proceedings of the International ACM Workshop on Computational Learning Theory, 1995.
J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. In Proceedings of the Conference on Learning Theory and Kernel Machines, 2003.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel machines. Journal of Machine Learning Research, 2:265–292, 2001.
N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On kernel target alignment. In Advances in Neural Information Processing Systems, volume 14, 2001.
O. Dekel, Y. Singer, and P. Long. Online multitask learning. In Proceedings of the Conference on Learning Theory, 2006.
T.G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
M. Dudik, D.M. Blei, and R.E. Schapire. Hierarchical maximum entropy density estimation. In Proceedings of the International Conference on Machine Learning, 2007.
T. Evgeniou, C.A. Micchelli, and M. Pontil. Learning multiple tasks with kernels. Journal of Machine Learning Research, 6:615–637, 2005.
T. Heskes. Solving a huge number of similar tasks: A combination of multi-task learning and a hierarchical Bayesian approach. In Proceedings of the International Conference on Machine Learning, 1998.
T. Heskes. Empirical Bayes for learning to learn. In Proceedings of the International Conference on Machine Learning, 2004.
T. Jaakkola and M. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.
T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems, 1999.
T. Jebara. Machine Learning: Discriminative and Generative. Kluwer Academic, 2003.
T. Jebara. Multi-task feature and kernel selection for SVMs. In Proceedings of the International Conference on Machine Learning, 2004.
T. Jebara and T. Jaakkola. Feature selection and dualities in maximum entropy discrimination. In Uncertainty in Artificial Intelligence, 2000.
T. Joachims. Training linear SVMs in linear time. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 217–226, 2006.
K. Koh, S.J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, July 2007.
G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. Jordan. Learning the kernel matrix with semi-definite programming. In Proceedings of the International Conference on Machine Learning, 2002.
J. Langford and J. Shawe-Taylor. PAC-Bayes and margins. In Advances in Neural Information Processing Systems, pages 423–430, 2002.
P. Long and X. Wu. Mistake bounds for maximum entropy discrimination. In Advances in Neural Information Processing Systems, 2004.
A. Maurer. Bounds for linear multi-task learning. Journal of Machine Learning Research, 7:117–139, December 2006.
A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327–350, June 2009.
D. McAllester. PAC-Bayesian model averaging. In Proceedings of the Workshop on Computational Learning Theory, 1999.
G. Obozinski, B. Taskar, and M.I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010.
B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2001.
B. Schölkopf, A.J. Smola, and K.R. Müller. Advances in Kernel Methods - Support Vector Learning, chapter Kernel principal component analysis, pages 327–352. MIT Press, 1999.
S. Shalev-Shwartz and N. Srebro. SVM optimization: Inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning, 2008.
S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007.
B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, 2004.
S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1995.
S. Thrun and L.Y. Pratt. Learning to Learn. Kluwer Academic, 1997.
R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
J.A. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on Information Theory, 51(3):1030–1051, 2006.
B.A. Turlach, W.N. Venables, and S.J. Wright. Simultaneous variable selection. Technometrics, pages 349–363, 2005.
M. Wainwright, P. Ravikumar, and J. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. In Advances in Neural Information Processing Systems, 2007.
J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems, 2000.
Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35–63, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society B, 68:49–67, 2006.
J. Zhu, E.P. Xing, and B. Zhang. Laplace maximum margin Markov networks. In Proceedings of the International Conference on Machine Learning, 2008.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2):301–320, 2005.