
72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization


Source: pdf

Author: Clayton Scott, Robert Nowak

Abstract: Classification trees are one of the most popular types of classifiers, with ease of implementation and interpretation being among their attractive features. Despite the widespread use of classification trees, theoretical analysis of their performance is scarce. In this paper, we show that a new family of classification trees, called dyadic classification trees (DCTs), are near optimal (in a minimax sense) for a very broad range of classification problems. This demonstrates that other schemes (e.g., neural networks, support vector machines) cannot perform significantly better than DCTs in many cases. We also show that this near optimal performance is attained with linear (in the number of training data) complexity growing and pruning algorithms. Moreover, the performance of DCTs on benchmark datasets compares favorably to that of standard CART, which is generally more computationally intensive and which does not possess similar near optimality properties. Our analysis stems from theoretical results on structural risk minimization, on which the pruning rule for DCTs is based.
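To make the abstract's construction concrete, below is a minimal, illustrative Python sketch of a dyadic classification tree: the unit cube is split recursively at cell midpoints (cycling through coordinates), and the tree is then pruned bottom-up by comparing empirical error against a complexity-penalized alternative in the spirit of structural risk minimization. The penalty form (sqrt(leaves · log n / n)), the depth limit, and the toy data are assumptions for illustration only; they are not the exact penalty or pruning rule analyzed in the paper.

```python
import numpy as np

# Illustrative dyadic classification tree (DCT) sketch on [0, 1]^d.
# Cells split at midpoints, cycling through coordinates; pruning trades
# empirical error against an SRM-flavored penalty.  The penalty below is
# a stand-in, not the paper's exact rule.

class Node:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth  # cell bounds and depth
        self.left = self.right = None
        self.label = 0        # majority label of training points in the cell
        self.errors = 0       # training errors if this node is a leaf

def grow(X, y, lo, hi, depth, max_depth):
    """Grow a complete dyadic tree down to max_depth."""
    node = Node(lo, hi, depth)
    if len(y):
        node.label = int(np.round(y.mean()))          # majority vote
        node.errors = int(np.sum(y != node.label))
    if depth == max_depth or len(y) == 0:
        return node
    dim = depth % len(lo)                             # cycle through coordinates
    mid = 0.5 * (lo[dim] + hi[dim])
    mask = X[:, dim] <= mid
    hi_l, lo_r = hi.copy(), lo.copy()
    hi_l[dim], lo_r[dim] = mid, mid
    node.left = grow(X[mask], y[mask], lo, hi_l, depth + 1, max_depth)
    node.right = grow(X[~mask], y[~mask], lo_r, hi, depth + 1, max_depth)
    return node

def prune(node, n, c=1.0):
    """Bottom-up pruning: keep a subtree only if its penalized empirical
    error beats collapsing the node to a single leaf."""
    if node.left is None:                              # already a leaf
        return node.errors, 1
    err_l, leaves_l = prune(node.left, n, c)
    err_r, leaves_r = prune(node.right, n, c)
    err_sub, leaves_sub = err_l + err_r, leaves_l + leaves_r
    # SRM-flavored comparison (illustrative penalty, not the paper's):
    pen_leaf = c * np.sqrt(np.log(n) / n)
    pen_sub = c * np.sqrt(leaves_sub * np.log(n) / n)
    if node.errors / n + pen_leaf <= err_sub / n + pen_sub:
        node.left = node.right = None                  # collapse to a leaf
        return node.errors, 1
    return err_sub, leaves_sub

def predict(node, x):
    while node.left is not None:
        dim = node.depth % len(x)
        mid = 0.5 * (node.lo[dim] + node.hi[dim])
        node = node.left if x[dim] <= mid else node.right
    return node.label

# Toy usage: two noisy classes separated by a vertical boundary.
rng = np.random.default_rng(0)
X = rng.random((500, 2))
y = (X[:, 0] > 0.5).astype(int) ^ (rng.random(500) < 0.1)
root = grow(X, y, np.zeros(2), np.ones(2), 0, max_depth=6)
prune(root, len(y))
acc = np.mean([predict(root, x) == t for x, t in zip(X, y)])
print(f"training accuracy after pruning: {acc:.2f}")
```

Both growing and pruning here visit each node a constant number of times, which mirrors (in spirit) the linear-complexity growing and pruning claimed in the abstract.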


reference text

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984.

[2] V. Vapnik, Estimation of Dependencies Based on Empirical Data, Springer-Verlag, New York, 1982.

[3] G. Lugosi and K. Zeger, “Concept learning using complexity regularization,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 48–54, 1996.

[4] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, New York, 1996.

[5] C. Scott and R. Nowak, “Complexity-regularized dyadic classification trees: Efficient pruning and rates of convergence,” Tech. Rep. TREE0201, Rice University, 2002, available at http://www.dsp.rice.edu/~cscott.

[6] A. Nobel, “Analysis of a complexity based pruning scheme for classification trees,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2362–2368, 2002.

[7] A. B. Tsybakov, “Optimal aggregation of classifiers in statistical learning,” preprint, 2001, available at http://www.proba.jussieu.fr/mathdoc/preprints/.

[8] K. Falconer, Fractal Geometry: Mathematical Foundations and Applications, Wiley, West Sussex, England, 1990.

[9] J. S. Marron, “Optimal rates of convergence to Bayes risk in nonparametric discrimination,” Annals of Statistics, vol. 11, no. 4, pp. 1142–1155, 1983.

[10] Y. Yang, “Minimax nonparametric classification–Part I: Rates of convergence,” IEEE Transactions on Information Theory, vol. 45, no. 7, pp. 2271–2284, 1999.

[11] E. Mammen and A. B. Tsybakov, “Smooth discrimination analysis,” Annals of Statistics, vol. 27, pp. 1808–1829, 1999.

[12] P. Chou, T. Lookabaugh, and R. Gray, “Optimal pruning with applications to tree-structured source coding and modeling,” IEEE Transactions on Information Theory, vol. 35, no. 2, pp. 299–315, 1989.

[13] B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.

[14] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993.