
135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks


Source: pdf

Author: Samy Bengio, Jason Weston, David Grangier

Abstract: Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low-dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally, we combine the two ideas, resulting in the label embedding tree, which outperforms alternative methods, including One-vs-Rest, while being orders of magnitude faster.
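
To make the two mechanisms in the abstract concrete, here is a minimal NumPy sketch of the prediction side only: (a) label-embedding prediction, where labels are ranked by inner product in a low-dimensional space, and (b) greedy descent of a label tree, where only about depth-many node classifiers are evaluated instead of one per class. All names, shapes, and the random parameters (W, V, the toy tree) are illustrative assumptions; in the paper these parameters are learned by optimizing the tree loss, which this sketch does not show.

    import numpy as np

    rng = np.random.default_rng(0)
    d, e, K = 1000, 50, 10000  # input dim, embedding dim, number of classes

    # Placeholder parameters; in the paper both maps are learned, not random.
    W = rng.standard_normal((d, e))  # projects an input into the embedding space
    V = rng.standard_normal((K, e))  # one low-dimensional embedding per label

    def predict_embedding(x):
        # Rank all K labels in e dimensions: O(d*e + K*e) work versus
        # O(K*d) for scoring K independent full-dimensional classifiers.
        z = x @ W
        return int(np.argmax(V @ z))

    class Node:
        # One node of a label tree: internal nodes carry a classifier (w),
        # leaves carry a single class label.
        def __init__(self, w=None, label=None, children=()):
            self.w, self.label, self.children = w, label, list(children)

    def predict_tree(x, root):
        # Greedy descent: keep only the best-scoring child at each level,
        # so roughly tree-depth many classifiers are evaluated, not K.
        node = root
        while node.children:
            node = max(node.children, key=lambda c: float(x @ c.w))
        return node.label

    # Toy usage: a depth-2 tree over four classes.
    leaves = [Node(w=rng.standard_normal(d), label=k) for k in range(4)]
    root = Node(children=[Node(w=rng.standard_normal(d), children=leaves[:2]),
                          Node(w=rng.standard_normal(d), children=leaves[2:])])
    x = rng.standard_normal(d)
    print(predict_embedding(x), predict_tree(x, root))

The label embedding tree of the paper combines the two: the node classifiers live in the e-dimensional embedding space, so after a single projection of the input each tree step costs only O(e).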


reference text

[1] E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research (JMLR), 1:113–141, 2001.

[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.

[3] B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes, and M. Mohri. Half transductive ranking. In Artificial Intelligence and Statistics (AISTATS), 2010.

[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 1:585–592, 2002.

[5] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.

[6] A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. Conditional probability tree estimation analysis and algorithm. In Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

[7] A. Beygelzimer, J. Langford, and P. Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT), pages 247–262, 2009.

[8] L. Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.

[9] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

[10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2002.

[11] O. Dekel and O. Shamir. Multiclass-Multilabel Learning when the Label Set Grows with the Number of Examples. In Artificial Intelligence and Statistics (AISTATS), 2010.

[12] J. Deng, W. Dong, R. Socher, Li-Jia Li, K. Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[13] T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263–286, 1995.

[14] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.

[15] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1371–1384, 2008.

[16] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(2):451–471, 1998.

[17] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Neural Information Processing Systems (NIPS), 2009.

[18] M.I. Jordan and R.A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[19] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Conference on Learning Theory (COLT), pages 158–172, 2005.

[20] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[21] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.

[22] T. Oates and D. Jensen. The effects of training set size on decision tree complexity. In International Conference on Machine Learning (ICML), pages 254–262, 1997.

[23] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In NIPS, pages 547–553, 2000.

[24] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[25] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5:101–141, 2004.

[26] K. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In NIPS, pages 1737–1744, 2009.

[27] P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 311–321. Society for Industrial and Applied Mathematics, 1993.