Title: On the Necessity of Irrelevant Variables (JMLR 2012, paper jmlr2012-82)
Authors: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominantly on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.

Keywords: feature selection, generalization, learning theory
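To make the source model in the abstract concrete, the following Python sketch simulates one such distribution: a uniform ±1 class label, k weakly relevant boolean variables that each agree with the label with probability 1/2 + gamma, and n - k irrelevant variables that are fair coin flips, all conditionally independent given the label. The parameter names (k, n, gamma) and the majority-vote classifier are illustrative assumptions for this sketch, not the paper's exact algorithms; it only demonstrates the headline phenomenon that a vote over all variables, most of them irrelevant, can be far more accurate than any single variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m, k, n, gamma):
    """Draw m examples: labels y in {-1,+1} and n boolean variables
    in {-1,+1}, conditionally independent given y."""
    y = rng.choice([-1, 1], size=m)
    # k relevant variables: each agrees with y with probability 1/2 + gamma.
    rel = y[:, None] * rng.choice([-1, 1], size=(m, k),
                                  p=[0.5 - gamma, 0.5 + gamma])
    # n - k irrelevant variables: fair coins, independent of y.
    irr = rng.choice([-1, 1], size=(m, n - k))
    return np.hstack([rel, irr]), y

def vote_error(k, n, gamma, m=20000):
    """Test error of a classifier that takes a majority vote over ALL n
    variables, relevant and irrelevant alike."""
    X, y = sample(m, k, n, gamma)
    pred = np.sign(X.sum(axis=1) + 0.5)  # +0.5 breaks ties toward +1
    return float(np.mean(pred != y))

# 1000 relevant variables with a 0.1 advantage, buried among 9000
# irrelevant ones: the vote over all 10000 variables errs only a
# few percent.
print(vote_error(k=1000, n=10000, gamma=0.1))  # roughly 0.02
```

Under these settings any individual variable predicts the label with error at best 1/2 - gamma = 0.4, so the accuracy of the vote comes entirely from aggregating many weak signals, and the 9000 irrelevant variables merely dilute rather than destroy it.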
References:

F. Abramovich, Y. Benjamini, D. Donoho, and I. M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Annals of Statistics, 34:584–653, 2006.
L. Addario-Berry, N. Broutin, L. Devroye, and G. Lugosi. On combinatorial testing problems. Annals of Statistics, 38(5):3063–3092, 2010.
N. Alon, J. H. Spencer, and P. Erdős. The Probabilistic Method. Wiley, New York, 1992.
D. Angluin and L. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. J. Comp. Sys. Sci., 18(2):155–193, 1979.
M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999.
P. Bickel and E. Levina. Some theory of Fisher's linear discriminant function, 'Naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010, 2004.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Berlin, 2006.
L. Breiman. Arcing classifiers. Annals of Statistics, 26(3):801–824, 1998.
P. Bühlmann and B. Yu. Analyzing bagging. Annals of Statistics, 30:927–961, 2002.
A. DasGupta. Asymptotic Theory of Statistics and Probability. Springer, Berlin, 2008.
S. Dasgupta and P. M. Long. Boosting with diverse base classifiers. In COLT, pages 273–287, Washington, D.C., 2003.
D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, 32:952–994, 2004.
D. Donoho and J. Jin. Asymptotic minimaxity of false discovery rate thresholding for sparse exponential data. Annals of Statistics, 34, 2006.
D. Donoho and J. Jin. Higher criticism thresholding: optimal feature selection when useful features are rare and weak. PNAS, 105(39):14790–14795, 2008.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.). Wiley, New York, 2000.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–251, 1989.
J. Fan and Y. Fan. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605–2637, 2008.
W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, New York, 1968.
J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.
D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0, 1}-functions on randomly drawn points. Information and Computation, 115(2):129–161, 1994.
D. P. Helmbold and P. M. Long. On the necessity of irrelevant variables. In ICML, pages 281–288, Bellevue, WA, 2011.
J. Jin. Impossibility of successful classification when useful features are rare and weak. PNAS, 106(22):8859–8864, 2009.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 2002.
J. Matoušek and J. Vondrák. The probabilistic method, 2011. Lecture notes.
N. Meinshausen and J. Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Annals of Statistics, 34(1):373–393, 2006.
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, Cambridge, 1995.
A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In NIPS, pages 841–848, Vancouver, B.C., 2001.
S. Pemmaraju. Equitable coloring extends Chernoff-Hoeffding bounds. In RANDOM, pages 285–296, Berkeley, CA, 2001.
D. Pollard. Convergence of Stochastic Processes. Springer Verlag, Berlin, 1984.
J. Quinlan. Bagging, boosting and C4.5. In AAAI, pages 725–730, Portland, OR, 1996.
R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.
E. Slud. Distribution inequalities for the binomial law. Annals of Probability, 5:404–412, 1977.
R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99(10):6567–6572, 2002.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
L. Wang, M. Sugiyama, C. Yang, Z. Zhou, and J. Feng. On the margin explanation of boosting algorithms. In COLT, pages 479–490, Helsinki, Finland, 2008.