
37 nips-2005-Benchmarking Non-Parametric Statistical Tests


Source: pdf

Author: Mikaela Keller, Samy Bengio, Siew Y. Wong

Abstract: Statistical significance tests for non-standard measures (i.e., measures other than the classification error) are rarely used in the literature, although non-parametric tests have already been proposed for that purpose. This paper is an attempt at empirically verifying how these tests compare with more classical tests under various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests while varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.
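Two of the tests benchmarked in the paper lend themselves to a short illustration. The Python sketch below shows a bootstrap test on the difference in F1 (dF1) between two classifiers evaluated on the same set, and McNemar's test on their paired errors. The function and variable names (f1_score, bootstrap_dF1_test, mcnemar_test, y_true, pred_a, pred_b) are illustrative, and the resampling-based p-value is only one common variant, not necessarily the exact procedure used in the paper.

    import math
    import random

    def f1_score(y_true, y_pred, positive=1):
        """F1 = 2PR / (P + R) computed from raw counts (binary labels)."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        if tp == 0:
            return 0.0
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    def bootstrap_dF1_test(y_true, pred_a, pred_b, n_boot=1000, seed=0):
        """Two-sided bootstrap test on dF1 = F1(A) - F1(B): resample examples
        with replacement and count how often the resampled difference falls on
        the other side of zero from the observed one (a crude p-value)."""
        rng = random.Random(seed)
        n = len(y_true)
        observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
        opposite = 0
        for _ in range(n_boot):
            idx = [rng.randrange(n) for _ in range(n)]
            yt = [y_true[i] for i in idx]
            d = (f1_score(yt, [pred_a[i] for i in idx])
                 - f1_score(yt, [pred_b[i] for i in idx]))
            if d * observed <= 0:   # replicate contradicts the observed sign
                opposite += 1
        return observed, min(1.0, 2.0 * opposite / n_boot)

    def mcnemar_test(y_true, pred_a, pred_b):
        """McNemar's test on the two discordant cells of the paired error table,
        with continuity correction; p-value from the chi-square(1) tail."""
        n01 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
        n10 = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
        if n01 + n10 == 0:
            return 0.0, 1.0
        stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
        p_value = math.erfc(math.sqrt(stat / 2.0))  # chi2(1) survival function
        return stat, p_value

On a given evaluation set one would call both functions with the gold labels and the two models' predictions, and compare the resulting p-values to the chosen significance level.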


reference text

[1] M. Bisani and H. Ney. Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proceedings of ICASSP, 2004.

[2] R. M. Bolle, N. K. Ratha, and S. Pankanti. Error analysis of pattern recognition systems - the subsets bootstrap. Computer Vision and Image Understanding, 93:1–33, 2004.

[Figure 2: Power of several statistical tests (bootstrap test on dF1, McNemar test, proportion test, bootstrap test on dCerr) as a function of evaluation set size, for (a) Linear SVM vs MLP on balanced data, (b) Linear SVM vs MLP on unbalanced data, (c) Linear vs RBF SVMs on balanced data, and (d) Linear vs RBF SVMs on unbalanced data. The power equals -1 in Figures 2(c) and 2(d) when there was no data to compute the proportion (i.e., H1 was never true).]
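As a rough illustration of how such power curves can be obtained, the sketch below repeatedly draws evaluation sets of a fixed size from a pool standing in for the “population”, runs a significance test on each (here the mcnemar_test helper from the sketch above), and reports the proportion of rejections. This is only a plausible reconstruction under these assumptions, not the paper's exact protocol; names such as estimate_power, pop_true, pop_pred_a, and pop_pred_b are hypothetical.

    import random

    def estimate_power(pop_true, pop_pred_a, pop_pred_b, set_size,
                       n_sets=500, alpha=0.05, seed=0):
        """Empirical power: fraction of sampled evaluation sets on which the
        test rejects H0 at level alpha (meaningful only when the two models
        truly differ on the population, i.e., when H1 holds)."""
        rng = random.Random(seed)
        rejections = 0
        for _ in range(n_sets):
            idx = rng.sample(range(len(pop_true)), set_size)  # one evaluation set
            _, p = mcnemar_test([pop_true[i] for i in idx],
                                [pop_pred_a[i] for i in idx],
                                [pop_pred_b[i] for i in idx])
            if p < alpha:
                rejections += 1
        return rejections / n_sets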

[3] A. C. Davison and D. V. Hinkley. Bootstrap methods and their application. Cambridge University Press, 1997.

[4] T.G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895–1924, 1998.

[5] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[6] B. S. Everitt. The analysis of contingency tables. Chapman and Hall, 1977.

[7] C. Goutte and E. Gaussier. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of ECIR, pages 345–359, 2005.

[8] M. Keller, S. Bengio, and S. Y. Wong. Surprising Outcome While Benchmarking Statistical Tests. IDIAP-RR 38, IDIAP, 2005.

[9] C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.

[10] T.G. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume 1 - from yesterday’s news to tomorrow’s language resources. In Proceedings of the 3rd Int. Conf. on Language Resources and Evaluation, 2002.

[11] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

[12] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, UK, 1975.