
88 jmlr-2006-Streamwise Feature Selection


Source: pdf

Author: Jing Zhou, Dean P. Foster, Robert A. Stine, Lyle H. Ungar

Abstract: In streamwise feature selection, new features are sequentially considered for addition to a predictive model. When the space of potential features is large, streamwise feature selection offers many advantages over traditional feature selection methods, which assume that all features are known in advance. Features can be generated dynamically, focusing the search for new features on promising subspaces, and overfitting can be controlled by dynamically adjusting the threshold for adding features to the model. In contrast to traditional forward feature selection algorithms such as stepwise regression, in which at each step all possible features are evaluated and the best one is selected, streamwise feature selection only evaluates each feature once, when it is generated. We describe information-investing and α-investing, two adaptive complexity penalty methods for streamwise feature selection which dynamically adjust the threshold on the error reduction required for adding a new feature. These two methods give false discovery rate style guarantees against overfitting. They differ from standard penalty methods such as AIC, BIC and RIC, which always drastically over- or under-fit in the limit of infinite numbers of non-predictive features. Empirical results show that streamwise regression is competitive with (on small data sets) and superior to (on large data sets) much more compute-intensive feature selection methods such as stepwise regression, and allows feature selection on problems with millions of potential features.

Keywords: classification, stepwise regression, multiple regression, feature selection, false discovery rate
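The α-investing loop summarized in the abstract can be sketched compactly. The following Python sketch is a minimal illustration only, not the authors' implementation: the initial wealth w0, the payout alpha_delta, and the bid schedule wealth/(2(i+1)) are assumed example values, and an ordinary-least-squares p-value stands in for the paper's error-reduction test.

    # Minimal sketch of an alpha-investing streamwise selection loop
    # (assumed parameter choices; not the reference implementation).
    import numpy as np
    import statsmodels.api as sm

    def alpha_investing(X, y, w0=0.5, alpha_delta=0.5):
        """Consider the columns of X one at a time, in stream order.

        w0          -- initial alpha-wealth (assumed value)
        alpha_delta -- wealth earned when a feature is accepted (assumed value)
        """
        n, p = X.shape
        selected = []                   # indices of accepted features
        wealth = w0
        for i in range(p):
            alpha_i = wealth / (2 * (i + 1))           # bid a fraction of current wealth
            design = sm.add_constant(X[:, selected + [i]])
            pval = sm.OLS(y, design).fit().pvalues[-1]  # p-value of the candidate feature
            if pval < alpha_i:
                selected.append(i)                      # candidate passes the threshold: keep it
                wealth += alpha_delta - alpha_i         # recover wealth on a discovery
            else:
                wealth -= alpha_i                       # pay for the unsuccessful test
        return selected

    # Example usage on synthetic data: features 3 and 7 are truly predictive.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))
    y = X[:, 3] + 0.5 * X[:, 7] + rng.normal(size=200)
    print(alpha_investing(X, y))

Because each candidate is tested exactly once against the current wealth, the total expected number of spuriously added features stays bounded even as the stream of non-predictive features grows, which is the false-discovery-rate style control the abstract refers to.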


reference text

F. Abramovich, Y. Benjamini, D. Donoho, and I. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. Technical Report 2000-19, Dept. of Statistics, Stanford University, Stanford, CA, 2000.

H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csáki, editors, 2nd International Symposium on Information Theory, pages 261–281, Budapest, 1973. Akadémiai Kiadó.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.

P. Bickel and K. Doksum. Mathematical Statistics. Prentice Hall, 2001.

A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, 1997.

D. L. Donoho and I. M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1994.

S. Dzeroski and N. Lavrac. Relational Data Mining. Springer-Verlag, 2001. ISBN 3540422897.

S. Dzeroski, L. D. Raedt, and S. Wrobel. Multi-relational data mining workshop. In KDD-2003, 2003.

F. Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.

D. P. Foster and E. I. George. The risk inflation criterion for multiple regression. Annals of Statistics, 22:1947–1975, 1994.

D. P. Foster and R. A. Stine. Local asymptotic coding. IEEE Transactions on Information Theory, 45:1289–1293, 1999.

D. P. Foster and R. A. Stine. Adaptive variable selection competes with Bayes experts. Submitted for publication, 2004a.

D. P. Foster and R. A. Stine. Variable selection in data mining: Building a predictive model for bankruptcy. Journal of the American Statistical Association, 99:303–313, 2004b.

E. I. George. The variable selection problem. Journal of the American Statistical Association, 95:1304–1308, 2000.

E. I. George and D. P. Foster. Calibration and empirical Bayes variable selection. Biometrika, 87:731–747, 2000.

R. Gilad-Bachrach, A. Navot, and N. Tishby. Margin based feature selection - theory and algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

I. Guyon. NIPS 2003 workshop on feature extraction and feature selection, 2003. URL http://clopinet.com/isabelle/Projects/NIPS2003.

I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature Extraction, Foundations and Applications. Springer, 2006.

D. Jensen and L. Getoor. IJCAI Workshop on Learning Statistical Models from Relational Data, 2003.

JMLR. Special issue on variable selection. Journal of Machine Learning Research (JMLR), 2003. URL http://jmlr.csail.mit.edu/.

I. M. Johnstone and B. W. Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Annals of Statistics, 32:1594–1649, 2004.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

R. J. Larsen and M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, 2001.

J. Li and H. Liu. Bio-medical data analysis, 2002. URL http://sdmc.lit.org.sg/GEDatasets.

R. M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer-Verlag, 1996.

R. M. Neal. Defining priors for distributions using Dirichlet diffusion trees. Technical Report 0104, Dept. of Statistics, University of Toronto, 2001.

R. M. Neal and J. Zhang. Classification for high dimensional problems using Bayesian neural networks and Dirichlet diffusion trees. In NIPS 2003 Workshop on Feature Extraction and Feature Selection, 2003. URL http://www.cs.toronto.edu/~radford/slides.html.

NIPS'03 Challenge results, 2003. URL http://www.nipsfsc.ecs.soton.ac.uk/results.

A. Popescul and L. H. Ungar. Structural logistic regression for link analysis. In KDD Workshop on Multi-Relational Data Mining, 2003.

A. Popescul and L. H. Ungar. Cluster-based concept invention for statistical relational learning. In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD-2004), 2004.

J. Rissanen. Hypothesis selection and testing by the MDL principle. The Computer Journal, 42:260–269, 1999.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

R. A. Stine. Model selection using information theory and the MDL principle. Sociological Methods & Research, 33:230–260, 2004.

L. H. Ungar, J. Zhou, D. P. Foster, and R. A. Stine. Streaming feature selection using IIC. In AI & Statistics 2005, 2005.

J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar. Streaming feature selection using alpha-investing. In ACM SIGKDD 2005, 2005.