jmlr jmlr2008 jmlr2008-77 jmlr2008-77-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amit Dhurandhar, Alin Dobra
Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees
A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Computational Learing Theory, 1999. S. Boucheron, O. Bousquet, and G. Lugosi. Introduction to statistical learning theory. http://www.kyb.mpg.de/publications/pdfs/pdf2819.pdf, 2005. 2247 D HURANDHAR AND D OBRA L. Breiman. Random forests. http://oz.berkeley.edu/users/breiman/randomforest2001.pdf, 2001. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984. S. Buttrey and I. Kobayashi. On strength and correlation in random forests. In Proceedings of the 2003 Joint Statistical Meetings, Section on Statistical Computing, 2003. J. Connor-Linton. Chi square tutorial. web chi tut.html, 2003. http://www.georgetown.edu/faculty/ballc/webtools/ A. Dhurandhar and A. Dobra. Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Transactions on Knowledge Discovery and Data Mining, 2009. P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006. ISSN 0885-6125. doi: http://dx.doi.org/10.1007/s10994-006-6226-1. M. Hall. Correlation-based feature selection for machine learning. Ph.D diss. Hamilton, NZ: Waikato University, Department of Computer Science, 1998. M. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE TRANSACTIONS ON KDE, 2003. T. Hastie and J. Friedman R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In In Proceedings of the Fourteenth IJCAI., 1995. F. Liu, K. Ting, and W. Fan. Maximizing tree diversity by building complete-random decision trees. In PAKDD, pages 605–610, 2005. J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986. J. Shao. Linear model selection by cross validation. JASA, 88, 1993. J. Shao. Mathematical Statistics. Springer-Verlag, 2003. L. Smith. A tutorial on principal components analysis. www.csnet.otago.ac.nz/cosc453/ student tutorials/principal components.pdf, 2002. V. Vapnik. Statistical Learning Theory. Wiley & Sons, 1998. R. Williamson. Srm and vc theory (statistical learning theory). / williams/papers/P151.pdf, 2001. http://axiom.anu.edu.au K. Zhang, W. Fan, B. Buckles, X. Yuan, and Z. Xu. Discovering unrevealed properties of probability estimation trees: On algorithm selection and performance explanation. ICDM, 0:741–752, 2006. ISSN 1550-4786. doi: http://doi.ieeecomputersociety.org/10.1109/ICDM.2006.58. 2248