
252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing


Source: pdf

Author: Shai Shalev-Shwartz, Yonatan Wexler, Amnon Shashua

Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only a few features, and in particular one in which the number of features used grows sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost to learn a non-linear predictor that has a fast evaluation time. In a series of experiments on natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relative to other state-of-the-art approaches.
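This page does not reproduce the algorithm itself, so the following is only a rough, hypothetical NumPy sketch of the feature-sharing idea the abstract describes: greedily pick one feature per round that is shared by all classes, then refit the class weights restricted to the selected features. The selection criterion (ℓ1 norm of the loss gradient over classes), the learning rate, and the refitting scheme are simplifying assumptions, not the paper's actual procedure.

```python
# Hedged illustration only: a simplified greedy shared-feature selector,
# not the ShareBoost algorithm as specified in the paper.
import numpy as np

def shareboost_sketch(X, y, n_classes, n_rounds=3, lr=0.5, inner_steps=200):
    """Greedy selection of features shared across all classes.

    Each round adds the feature whose multiclass logistic-loss gradient has
    the largest l1 norm over classes, then refits the weights restricted to
    the selected features by plain gradient descent (a toy stand-in).
    """
    n, d = X.shape
    W = np.zeros((n_classes, d))       # class-by-feature weight matrix
    selected = []                      # indices of the shared features
    Y = np.eye(n_classes)[y]           # one-hot labels, shape (n, k)

    def grad(W):
        Z = X @ W.T
        Z -= Z.max(axis=1, keepdims=True)        # stabilized softmax
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        return (P - Y).T @ X / n                 # shape (k, d)

    for _ in range(n_rounds):
        score = np.abs(grad(W)).sum(axis=0)      # l1 over classes per feature
        score[selected] = -np.inf                # never re-select a feature
        selected.append(int(score.argmax()))
        for _ in range(inner_steps):             # refit on selected features
            W[:, selected] -= lr * grad(W)[:, selected]
    return W, selected

# Toy usage: 3 classes, each signaled by one of the first 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = np.argmax(X[:, :3], axis=1)
W, selected = shareboost_sketch(X, y, n_classes=3)
acc = np.mean(np.argmax(X @ W.T, axis=1) == y)
```

On this synthetic data the sketch recovers the three informative features and classifies well, since the label rule (argmax of the first three coordinates) is exactly a linear multiclass rule over those features.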


reference text

[1] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In International Conference on Machine Learning, 2007.

[2] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In NIPS, pages 41–48, 2006.

[3] F.R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

[4] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2010.

[5] E.J. Candes and T. Tao. Decoding by linear programming. IEEE Trans. on Information Theory, 51:4203–4215, 2005.

[6] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.

[7] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003.

[8] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz. Multiclass learnability and the ERM principle. In COLT, 2011.

[9] G. Davis, S. Mallat, and M. Avellaneda. Greedy adaptive approximation. Journal of Constructive Approximation, 13:57–98, 1997.

[10] D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.

[11] D.L. Donoho. Compressed sensing. Technical report, Stanford University, 2006.

[12] J. Duchi and Y. Singer. Boosting with structural sparsity. In Proc. ICML, pages 297–304, 2009.

[13] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[14] M. Fink, S. Shalev-Shwartz, Y. Singer, and S. Ullman. Online multiclass learning by interclass hypothesis sharing. In International Conference on Machine Learning, 2006.

[15] Y. Freund and R. E. Schapire. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, pages 771–780, 1999.

[16] Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, pages 119–139, 1997.

[17] T.J. Hastie and R.J. Tibshirani. Generalized additive models. Chapman & Hall, 1995.

[18] D. Hsu, S.M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2010.

[19] J. Huang and T. Zhang. The benefit of group sparsity. Annals of Statistics, 38(4), 2010.

[20] J. Huang, T. Zhang, and D.N. Metaxas. Learning with structured sparsity. In ICML, 2009.

[21] G.R.G. Lanckriet, N. Cristianini, P.L. Bartlett, L. El Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

[22] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[23] A. Majumdar and R.K. Ward. Fast group sparse classification. Canadian Journal of Electrical and Computer Engineering, 34(4):136–144, 2009.

[24] B. Natarajan. Sparse approximate solutions to linear systems. SIAM J. Computing, pages 227–234, 1995.

[25] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Netherlands, 2004.

[26] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for ℓ1,∞ regularization. In ICML, page 108, 2009.

[27] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.

[28] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.

[29] S. Shalev-Shwartz, T. Zhang, and N. Srebro. Trading accuracy for sparsity in optimization problems with sparsity constraints. Siam Journal on Optimization, 20:2807–2832, 2010.

[30] P. Y. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recognition (ICDAR), 2003.

[31] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.

[32] S. Thrun. Learning to learn: Introduction. Kluwer Academic Publishers, 1996.

[33] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[34] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 854–869, 2007.

[35] J.A. Tropp and A.C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.

[36] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.

[37] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[38] E. Xing, A.Y. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, 2003.

[39] T. Zhang. Class-size independent generalization analysis of some discriminative multi-category classification. In NIPS, 2004.

[40] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In Computer Vision – ECCV 2010, pages 141–154, 2010.