nips nips2002 nips2002-72 nips2002-72-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Clayton Scott, Robert Nowak
Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), is near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with growing and pruning algorithms whose complexity is linear in the number of training data. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
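The abstract describes DCTs only at a high level: trees defined over recursive dyadic partitions of the feature space, grown to a fixed depth and then pruned by a penalized empirical risk criterion motivated by structural risk minimization. The Python sketch below is a minimal, hypothetical illustration of that idea, not the implementation analyzed in [5]; the maximum depth, the constant per-leaf penalty, and all class and function names are assumptions made for illustration only.

# Illustrative sketch of a dyadic classification tree (DCT): the cell [0,1]^d
# is split recursively at coordinate midpoints (cycling through coordinates),
# leaves are labeled by majority vote, and the tree is pruned bottom-up by a
# penalized empirical risk.  The depth limit and per-leaf penalty are
# placeholders, not the rule derived in the paper.

import numpy as np


class Node:
    def __init__(self, depth, dim, lo, hi):
        self.depth = depth          # depth in the tree
        self.dim = dim              # coordinate split at this node
        self.lo, self.hi = lo, hi   # bounds of the dyadic cell
        self.left = self.right = None
        self.label = 0              # majority label of training points in cell
        self.errors = 0             # training errors if this node is a leaf


def grow(X, y, depth, max_depth, lo, hi):
    """Grow a complete dyadic tree over the cell [lo, hi] (labels in {0, 1})."""
    d = X.shape[1]
    node = Node(depth, depth % d, lo.copy(), hi.copy())
    if len(y) > 0:
        node.label = int(np.round(y.mean()))        # majority vote
        node.errors = int(np.sum(y != node.label))
    if depth == max_depth or len(y) <= 1:
        return node
    j = node.dim
    mid = 0.5 * (lo[j] + hi[j])                     # dyadic midpoint split
    mask = X[:, j] < mid
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[j], lo_r[j] = mid, mid
    node.left = grow(X[mask], y[mask], depth + 1, max_depth, lo, hi_l)
    node.right = grow(X[~mask], y[~mask], depth + 1, max_depth, lo_r, hi)
    return node


def prune(node, penalty):
    """Bottom-up pruning: keep a split only if the penalized error of its two
    subtrees beats the penalized error of collapsing the node to a leaf.
    Returns (pruned subtree, its penalized cost)."""
    leaf_cost = node.errors + penalty
    if node.left is None:
        return node, leaf_cost
    node.left, cost_l = prune(node.left, penalty)
    node.right, cost_r = prune(node.right, penalty)
    if cost_l + cost_r < leaf_cost:
        return node, cost_l + cost_r
    node.left = node.right = None                   # collapse to a leaf
    return node, leaf_cost


def predict(node, x):
    while node.left is not None:
        j = node.dim
        mid = 0.5 * (node.lo[j] + node.hi[j])
        node = node.left if x[j] < mid else node.right
    return node.label


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((200, 2))
    y = (X[:, 0] + X[:, 1] > 1).astype(int)         # toy problem
    root = grow(X, y, 0, max_depth=6, lo=np.zeros(2), hi=np.ones(2))
    root, _ = prune(root, penalty=2.0)              # placeholder penalty weight
    print(predict(root, np.array([0.8, 0.7])))

In practice the depth limit and penalty weight would be chosen from the structural risk minimization bound developed in [5], rather than the placeholder values used in this sketch.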
[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.
[2] V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag, New York, 1982.
[3] G. Lugosi and K. Zeger, “Concept learning using complexity regularization,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 48–54, 1996.
[4] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.
[5] C. Scott and R. Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. Rep. TREE0201, Rice University, 2002, available at http://www.dsp.rice.edu/~cscott.
[6] A. Nobel, “Analysis of a complexity based pruning scheme for classification trees,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2362–2368, 2002.
[7] A. B. Tsybakov, “Optimal aggregation of classifiers in statistical learning,” preprint, 2001, available at http://www.proba.jussieu.fr/mathdoc/preprints/.
[8] K. Falconer, Fractal Geometry: Mathematical Foundations and Applications, Wiley, West Sussex, England, 1990.
[9] J. S. Marron, “Optimal rates of convergence to Bayes risk in nonparametric discrimination,” Annals of Statistics, vol. 11, no. 4, pp. 1142–1155, 1983.
[10] Y. Yang, “Minimax nonparametric classification–Part I: Rates of convergence,” IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2271–2284, 1999.
[11] E. Mammen and A. B. Tsybakov, “Smooth discrimination analysis,” Annals of Statistics, vol. 27, pp. 1808–1829, 1999.
[12] P. Chou, T. Lookabaugh, and R. Gray, “Optimal pruning with applications to tree-structured source coding and modeling,” IEEE Transactions on Information Theory, vol. 35, no. 2, pp. 299–315, 1989.
[13] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[14] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.