jmlr jmlr2006 jmlr2006-85 jmlr2006-85-reference knowledge-graph by maker-knowledge-mining

85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets


Source: pdf

Author: Janez Demšar

Abstract: While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.

Keywords: comparative studies, statistical methods, Wilcoxon signed ranks test, Friedman test, multiple comparisons tests
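The tests the abstract recommends are available off the shelf in SciPy. The sketch below is an illustration only, not code from the paper: the accuracy arrays are made-up numbers standing in for per-data-set scores of hypothetical classifiers A, B, and C.

```python
# Minimal sketch of the recommended non-parametric tests using SciPy.
# The accuracy values are invented for illustration, not real results.
import numpy as np
from scipy import stats

# Accuracies of two classifiers on ten data sets (hypothetical numbers).
acc_a = np.array([0.81, 0.74, 0.90, 0.66, 0.78, 0.85, 0.70, 0.92, 0.61, 0.88])
acc_b = np.array([0.79, 0.70, 0.91, 0.60, 0.75, 0.80, 0.73, 0.90, 0.58, 0.84])

# Two classifiers over multiple data sets: Wilcoxon signed-ranks test
# on the paired per-data-set differences.
w_stat, w_p = stats.wilcoxon(acc_a, acc_b)

# More than two classifiers: Friedman test on the per-data-set ranks
# (a third hypothetical classifier is added to have k = 3).
acc_c = np.array([0.76, 0.72, 0.88, 0.64, 0.77, 0.82, 0.69, 0.89, 0.60, 0.85])
f_stat, f_p = stats.friedmanchisquare(acc_a, acc_b, acc_c)

print(f"Wilcoxon: W = {w_stat:.1f}, p = {w_p:.3f}")
print(f"Friedman: chi2 = {f_stat:.2f}, p = {f_p:.3f}")
```

If the Friedman test rejects the null hypothesis, the paper's recommended post-hoc step (e.g. the Nemenyi test behind the CD diagrams) compares average ranks pairwise; SciPy does not ship that step, so it is omitted here.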


reference text

E. Alpaydın. Combined 5 × 2 F test for comparing supervised classification learning algorithms. Neural Computation, 11:1885–1892, 1999.

J. R. Beck and E. K. Schultz. The use of ROC curves in test performance evaluation. Arch Pathol Lab Med, 110:13–20, 1986.

R. Bellazzi and B. Zupan. Intelligent data analysis in medicine and pharmacology: a position statement. In IDAMAP Workshop Notes at the 13th European Conference on Artificial Intelligence, ECAI-98, Brighton, UK, 1998.

Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2004.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/∼mlearn/MLRepository.html.

R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. In T. Fawcett and N. Mishra, editors, Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21–24, 2003, Washington, DC, USA. AAAI Press, 2003.

R. R. Bouckaert. Estimating replicability of classifier learning experiments. In C. Brodley, editor, Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004). AAAI Press, 2004.

R. R. Bouckaert and E. Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In D. Honghua, R. Srikant, and C. Zhang, editors, Advances in Knowledge Discovery and Data Mining, 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings. Springer, 2004.

P. B. Brazdil and C. Soares. A comparison of ranking methods for classification algorithm selection. In Proceedings of the 11th European Conference on Machine Learning. Springer Verlag, 2000.

W. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:329–336, 1979.

J. Cohen. The earth is round (p < .05). American Psychologist, 49:997–1003, 1994.

J. Demšar and B. Zupan. Orange: From Experimental Machine Learning to Interactive Data Mining, A White Paper. Faculty of Computer and Information Science, Ljubljana, Slovenia, 2004.

T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1924, 1998.

O. J. Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56:52–64, 1961.

C. W. Dunnett. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50:1096–1121, 1980.

U. M. Fayyad and K. B. Irani. Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1029, Chambéry, France, 1993. Morgan Kaufmann.

R. A. Fisher. Statistical Methods and Scientific Inference (2nd edition). Hafner Publishing Co., New York, 1959.

M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.

M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.

L. C. Hamilton. Modern Data Analysis: A First Course in Applied Statistics. Wadsworth, Belmont, California, 1990.

L. L. Harlow and S. A. Mulaik, editors. What If There Were No Significance Tests? Lawrence Erlbaum Associates, July 1997.

Y. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75:800–803, 1988.

B. Holland. On the application of three modified Bonferroni procedures to pairwise multiple comparisons in balanced repeated measures designs. Computational Statistics Quarterly, 6:219–231, 1991.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

G. Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika, 75:383–386, 1988.

D. A. Hull. Information Retrieval Using Statistical Classification. PhD thesis, Stanford University, November 1994.

R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.

P. Langley. Crafting papers on machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), 2000.

D. Mladenić and M. Grobelnik. Feature selection for unbalanced class distribution and naive Bayes. In I. Bratko and S. Džeroski, editors, Machine Learning, Proceedings of the Sixteenth International Conference (ICML 1999), June 27–30, 1999, Bled, Slovenia, pages 258–267. Morgan Kaufmann, 1999.

C. Nadeau and Y. Bengio. Inference for the generalization error. Advances in Neural Information Processing Systems, 12:239–281, 2000.

P. B. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.

J. Pizarro, E. Guerrero, and P. L. Galindo. Multiple comparison procedures applied to model selection. Neurocomputing, 48:155–173, 2002.

F. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing induction algorithms. In J. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), pages 445–453, San Francisco, CA, 1998. Morgan Kaufmann Publishers.

J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 725–730, Portland, OR, 1996. AAAI Press.

S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.

F. L. Schmidt. Statistical significance testing and cumulative knowledge in psychology. Psychological Methods, 1:115–129, 1996.

H. Schütze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In E. A. Fox, P. Ingwersen, and R. Fidel, editors, SIGIR '95, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 229–237. ACM Press, 1995.

J. P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.

D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 2000.

J. W. Tukey. Comparing individual means in the analysis of variance. Biometrics, 5:99–114, 1949.

E. G. Vázquez, A. Y. Escolano, P. G. Riaño, and J. P. Junquera. Repeated measures multiple comparison procedures applied to model selection in neural networks. In Proceedings of the 6th International Conference on Artificial and Natural Neural Networks (IWANN 2001), pages 88–95, 2001.

G. I. Webb. MultiBoosting: A technique for combining boosting and wagging. Machine Learning, 40:159–197, 2000.

F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83, 1945.

J. H. Zar. Biostatistical Analysis (4th Edition). Prentice Hall, Englewood Cliffs, New Jersey, 1998.