jmlr jmlr2008 jmlr2008-79 jmlr2008-79-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Authors: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on that feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, which estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimate of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor based on a single categorical feature. We show that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets.
Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
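To illustrate the failure mode described in the abstract (ranking by the training error of a single-feature predictor), the following Python sketch compares the empirical Gini index of a low-cardinality and a high-cardinality categorical feature when the labels are pure noise. This is only an informal demonstration of the bias, not the paper's Ginger estimator; the function name empirical_gini and the synthetic data are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_gini(x, y):
    """Empirical Gini index of a binary label y given a categorical
    feature x: sum over values v of p(v) * 2 * p(y=1|v) * (1 - p(y=1|v)).
    It reflects the training error of the predictor associated with the
    Gini criterion; lower looks 'better' on the training set."""
    score = 0.0
    for v in np.unique(x):
        mask = (x == v)
        p_v = mask.mean()          # empirical probability of the value v
        p1 = y[mask].mean()        # empirical P(y = 1 | x = v)
        score += p_v * 2.0 * p1 * (1.0 - p1)
    return score

# Labels are fair coin flips, so every feature is truly uninformative:
# any single-feature predictor has generalization error 0.5.
n = 200
y = rng.integers(0, 2, size=n)

few_values = rng.integers(0, 3, size=n)      # 3 possible categories
many_values = rng.integers(0, 150, size=n)   # ~150 possible categories

print("3-valued feature, training Gini:   %.3f" % empirical_gini(few_values, y))
print("150-valued feature, training Gini: %.3f" % empirical_gini(many_values, y))
# The many-valued feature receives a much lower (seemingly better) training
# score, only because the tiny per-category samples make the empirical
# conditional probabilities extreme. Ranking criteria based on training
# error therefore favor high-cardinality categorical features, which is the
# bias the Ginger criterion corrects by estimating generalization error.
```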
References:
A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.
A. Antos, L. Devroye, and L. Gyorfi. Lower bounds for Bayes error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(7):643–645, 1999.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks, 1984.
R. Lopez de Mantaras. A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1):81–92, 1991.
E. Drukh and Y. Mansour. Concentration bounds for unigram language models. Journal of Machine Learning Research, 6:1231–1264, 2005.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40:237–264, 1953.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, pages 459–468, 1996.
S. Kutin. Extensions to McDiarmid's inequality when differences are bounded with high probability. Technical Report TR-2002-04, University of Chicago, 2002.
D. A. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.
C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, pages 148–188, 1989.
J. Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319–342, 1989.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
K. Torkkola. Information theoretic methods. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction, Foundations and Applications. Springer, 2006.
L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.