
44 jmlr-2012-Feature Selection via Dependence Maximization


Source: pdf

Author: Le Song, Alex Smola, Arthur Gretton, Justin Bedo, Karsten Borgwardt

Abstract: We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.

Keywords: kernel methods, feature selection, independence measure, Hilbert-Schmidt independence criterion, Hilbert space embedding of distributions
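The criterion behind this abstract is compact: given kernel matrices K on the selected features and L on the labels, dependence is scored by the biased empirical HSIC estimate tr(KHLH)/(n-1)^2, where H = I - (1/n)11^T centres both matrices (Gretton et al., 2005a), and features contributing least to this score are greedily removed. The following is a minimal Python sketch of that idea, not the authors' implementation: it assumes a Gaussian RBF kernel on the features and a simple label-agreement kernel on the outputs, the names rbf_kernel, hsic, and backward_elimination are illustrative, and it drops one feature per step where the paper's BAHSIC procedure can discard groups of features at a time.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gaussian RBF kernel matrix computed from pairwise squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(K, L):
    # Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2,
    # where H = I - (1/n) 11^T centres both kernel matrices.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def backward_elimination(X, y, n_features, sigma=1.0):
    # Simplified greedy backward elimination in the spirit of BAHSIC:
    # repeatedly drop the feature whose removal leaves the HSIC between
    # the remaining features and the labels largest.
    L = np.equal.outer(y, y).astype(float)  # label-agreement kernel
    selected = list(range(X.shape[1]))
    while len(selected) > n_features:
        scores = [hsic(rbf_kernel(X[:, [k for k in selected if k != j]], sigma), L)
                  for j in selected]
        selected.pop(int(np.argmax(scores)))  # least useful feature is removed
    return selected

A typical call would be backward_elimination(X, y, 10) to retain ten features; in practice the bandwidth sigma is usually set by a heuristic such as the median pairwise distance rather than fixed at 1.0.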


reference text

A. Alizadeh, M. Eisen, R. Davis, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000.
U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, pages 6745–6750, 1999.
Shun-ichi Amari and S. Wu. An information-geometrical method for improving performance of support vector machine classifiers. In D. Willshaw and A. Murray, editors, Proceedings of ICANN'99, volume 1, pages 85–90. IEE Press, 1999.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
J. Bedo, C. Sanderson, and A. Kowalczyk. An efficient alternative to SVM based recursive feature elimination with applications in natural language processing and bioinformatics. In Artificial Intelligence, 2006.
D. G. Beer, S. L. Kardia, S. L. Huang, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8:816–824, 2002.
A. Berchuck, E. Iversen, and J. Lancaster. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clin. Cancer Res., 11:3686–3696, 2005.
A. Bhattacharjee, W. G. Richards, W. G. Staunton, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci., 98:13790–13795, 2001.
M. Blaschko and A. Gretton. Learning taxonomies by dependence maximization. In Advances in Neural Information Processing Systems 21, pages 153–160. MIT Press, 2009.
P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Proc. Intl. Conf. Machine Learning, pages 82–90, San Francisco, California, 1998. Morgan Kaufmann Publishers. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/9803.ps.Z.
M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci., 97:262–267, 2000.
L. Bullinger, K. Dohner, E. Bair, S. Frohling, R. F. Schlenk, R. Tibshirani, H. Dohner, and J. R. Pollack. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. New England Journal of Medicine, 350(16):1605–1616, Apr 2004.
M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 625–632, Cambridge, MA, 2001. MIT Press.
N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical report, UC Davis Department of Statistics, 2003.
S. M. Dhanasekaran, T. R. Barrette, D. Ghosh, R. Shah, S. Varambally, K. Kurachi, K. J. Pienta, M. A. Rubin, and A. M. Chinnaiyan. Delineation of prognostic biomarkers in prostate cancer. Nature, 412(6849):822–826, Aug 2001.
G. Dornhege, B. Blankertz, G. Curio, and K. Müller. Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms. IEEE Trans. Biomed. Eng., 51:993–1002, 2004.
G. Dornhege, B. Blankertz, M. Krauledat, F. Losch, G. Curio, and K. Müller. Optimizing spatio-temporal filters for improving BCI. In Advances in Neural Information Processing Systems 18, 2006.
L. Ein-Dor, O. Zuk, and E. Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. USA, 103(15):5923–5928, Apr 2006.
Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representation. Technical report, IBM Watson Research Center, New York, 2000.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
G. Fung, O. L. Mangasarian, and A. J. Smola. Minimal kernel classifiers. Journal of Machine Learning Research, 3:303–321, 2002.
T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Schölkopf and M. K. Warmuth, editors, Proc. Annual Conf. Computational Learning Theory, pages 129–143. Springer, 2003.
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, Oct 1999.
A. Gretton, O. Bousquet, A. J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005a.
A. Gretton, A. Smola, O. Bousquet, R. Herbrich, A. Belitski, M. Augath, Y. Murayama, J. Pauls, B. Schölkopf, and N. Logothetis. Kernel constrained covariance for dependence measurement. In AISTATS 10, pages 112–119, 2005b.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007a. MIT Press.
A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel approach to comparing distributions. In Proceedings of the 22nd Conference on Artificial Intelligence (AAAI-07), pages 1637–1641, 2007b.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.
S. Gruvberger, M. Ringner, Y. Chen, S. Panavally, L. H. Saal, A. Borg, M. Ferno, C. Peterson, and P. S. Meltzer. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res., 61(16):5979–5984, Aug 2001.
C. Guestrin, A. Krause, and A. Singh. Near-optimal sensor placements in Gaussian processes. In International Conference on Machine Learning (ICML), 2005.
I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.
I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389–422, 2002.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
N. Iizuka, M. Oka, H. Yamada-Okabe, et al. Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet, 361:923–929, 2003.
K. Kira and L. Rendell. A practical approach to feature selection. In Proc. 9th Intl. Workshop on Machine Learning, pages 249–256, 1992.
S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller. Spatio-spectral filters for improving the classification of single trial EEG. IEEE Trans. Biomed. Eng., 52:1541–1548, 2005.
C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, volume 15, Cambridge, MA, 2002. MIT Press.
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, February 2002.
Ingrid Lönnstedt and Terry Speed. Replicated microarray data. Statistica Sinica, 12:31–46, 2002.
Radford M. Neal. Assessing relevance determination methods using DELVE. In Neural Networks and Machine Learning, pages 97–129. Springer, 1998.
I. Nemenman, F. Shafee, and W. Bialek. Entropy and inference, revisited. In Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. MIT Press.
G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.
J. Neumann, C. Schnörr, and G. Steidl. Combined SVM-based feature selection and classification. Machine Learning, 61:129–150, 2005.
N. Quadrianto, L. Song, and A. Smola. Kernelized sorting. In Advances in Neural Information Processing Systems 22, 2009.
A. Rosenwald, G. Wright, G. Chan, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med., 346:1937–1947, 2002.
B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997. Download: http://www.kernel-machines.org.
B. Schölkopf, P. L. Bartlett, A. J. Smola, and R. C. Williamson. Shrinking the tube: a new support vector regression algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 330–336, Cambridge, MA, 1999. MIT Press.
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.
Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3, 2004.
L. Song, J. Bedo, K. M. Borgwardt, A. Gretton, and A. J. Smola. Gene selection via the BAHSIC family of algorithms. Bioinformatics (ISMB), 23(13):i490–i498, 2007a.
L. Song, A. Smola, A. Gretton, K. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In ICML, pages 823–830. Omnipress, 2007b.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proc. Annual Conf. Computational Learning Theory, pages 111–122, 2008.
B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems 22, Red Hook, NY, 2009. Curran Associates Inc.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci., 99:6567–6572, 2002.
R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat. Sci., 18:104–117, 2003.
P. H. S. Torr. Solving Markov random fields using semidefinite programming. In Proceedings of International Workshop on Artificial Intelligence and Statistics, 2003.
P. J. Valk, R. G. Verhaak, M. A. Beijen, C. A. Erpelinck, S. Barjesteh van Waalwijk van Doorn-Khosrovani, J. M. Boer, H. B. Beverloo, M. J. Moorhouse, P. J. van der Spek, B. Lowenberg, and R. Delwel. Prognostically useful gene-expression profiles in acute myeloid leukemia. New England Journal of Medicine, 350(16):1617–1628, Apr 2004.
M. J. van de Vijver, Y. D. He, L. J. van 't Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347:1999–2009, 2002.
L. J. van 't Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.
S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 569–576. MIT Press, Cambridge, MA, 2003.
S. V. N. Vishwanathan, A. J. Smola, and R. Vidal. Binet-Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. International Journal of Computer Vision, 73(1):95–119, 2007.
Yixin Wang, Jan G. M. Klijn, Yi Zhang, Anieta M. Sieuwerts, Maxime P. Look, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E. Meijer van Gelder, Jack Yu, Tim Jatkoe, Els M. J. J. Berns, David Atkins, and John A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460):671–679, February 2005.
P. Warnat, R. Eils, and B. Brors. Cross-platform analysis of cancer microarray data improves gene expression based classification of phenotypes. BMC Bioinformatics, 6:265, Nov 2005.
J. B. Welsh, L. M. Sapinoso, A. I. Su, S. G. Kern, J. Wang-Rodriguez, C. A. Moskaluk, H. F. Frierson Jr., and G. M. Hampton. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res., 61(16):5974–5978, Aug 2001.
M. West. Bayesian factor regression models in the "large p, small n" paradigm. Bayesian Statistics, 7:723–732, 2003.
J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems 13, pages 668–674, 2000.
J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.
A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
M. Zaffalon and M. Hutter. Robust feature selection using distributions of mutual information. In A. Darwiche and N. Friedman, editors, Proceedings of the 18th International Conference on Uncertainty in Artificial Intelligence (UAI-2002), pages 577–584, San Francisco, CA, 2002. Morgan Kaufmann.