
66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection


Author: Wei Pan, Xiaotong Shen

Abstract: Variable selection in clustering analysis is both challenging and important. In the context of model-based clustering analysis with a common diagonal covariance matrix, which is especially suitable for “high dimension, low sample size” settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method.

Keywords: BIC, EM, mixture model, penalized likelihood, soft-thresholding, shrinkage
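The abstract's combination of EM, an L1 penalty, and soft-thresholding can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the penalty is applied to the cluster means of standardized data (each variable has overall mean 0), so that a variable whose mean is zero in every cluster does not help separate clusters. Under that assumption, setting the derivative of the penalized M-step objective to zero gives a soft-thresholded weighted mean with threshold `lam * sigma2 / n_i`. The names `soft_threshold`, `penalized_em`, and `lam` are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Shrink z toward zero by t; entries with |z| <= t become exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def penalized_em(X, n_clusters, lam, n_iter=100, seed=0):
    """Sketch of L1-penalized EM for a Gaussian mixture with a common
    diagonal covariance. X (n x p) is assumed to be standardized so that
    each column has overall mean 0. Illustrative only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)  # mixing proportions
    mu = X[rng.choice(n, size=n_clusters, replace=False)].copy()  # initial means
    sigma2 = X.var(axis=0) + 1e-8  # common per-variable variances

    for _ in range(n_iter):
        # E-step: posterior cluster probabilities tau (n x K). The Gaussian
        # normalizing constant is shared by all clusters, so it cancels.
        sq = (X[:, None, :] - mu[None, :, :]) ** 2
        log_post = np.log(pi) - 0.5 * (sq / sigma2).sum(axis=2)
        log_post -= log_post.max(axis=1, keepdims=True)
        tau = np.exp(log_post)
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: the usual weighted means, then soft-thresholding.
        n_i = tau.sum(axis=0) + 1e-12  # effective cluster sizes
        pi = n_i / n
        mu_tilde = (tau.T @ X) / n_i[:, None]
        mu = soft_threshold(mu_tilde, lam * sigma2[None, :] / n_i[:, None])

        # Variance update with the thresholded means.
        sq = (X[:, None, :] - mu[None, :, :]) ** 2
        sigma2 = (tau[:, :, None] * sq).sum(axis=(0, 1)) / n + 1e-8

    return pi, mu, sigma2, tau
```

Variables retained by the fit can be read off as `np.any(mu != 0, axis=0)`. Larger values of `lam` zero out more means, and a sufficiently large value removes every variable. Choosing `lam` and the number of clusters would require a model selection criterion such as the modified BIC mentioned in the abstract, which this sketch does not implement.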

reference text

R. Alexandridis, S. Lin, and M. Irwin. Class discovery and classification of tumor samples using mixture modeling of gene expression data. Bioinformatics, 20:2546-2552, 2004.
P. J. Bickel, and E. Levina. Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10:989-1010, 2004.
L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, M. Ares, and D. Haussler. Knowledge-based analysis of microarray gene expression data using support vector machines. Proc Natl Acad Sci USA, 97:262-267, 2000.
W. C. Chang. On using principal components before separating a mixture of two multivariate normal distributions. Applied Statistics, 32:267-275, 1983.
G. Ciuperca, A. Ridolfi, and J. Idier. Penalized maximum likelihood estimator for normal mixtures. Scandinavian Journal of Statistics, 30:45-59, 2003.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). JRSS-B, 39:1-38, 1977.
B. Efron. The estimation of prediction error: covariance penalties and cross-validation. JASA, 99:619-632, 2004.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.
M. Eisen, P. Spellman, P. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863-14868, 1998.
J. Fan, and R. Li. Variable selection via nonconcave penalized likelihood and its Oracle properties. JASA, 96:1348-1360, 2001.
C. Fraley, and A. E. Raftery. How many clusters? Which clustering methods? - Answers via model-based cluster analysis. The Computer Journal, 41:578-588, 1998.
C. Fraley, and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611-631, 2002.
C. Fraley, and A. E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Technical report 486, Dept. of Statistics, University of Washington, 2005.
J. H. Friedman, and J. J. Meulman. Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. Ser. B, 66:815-849, 2004.
D. Ghosh, and A. M. Chinnaiyan. Mixture modeling of gene expression data from microarray experiments. Bioinformatics, 18:275-286, 2002.
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999.
P. J. Green. On use of the EM for penalized likelihood estimation. J. R. Stat. Soc. Ser. B, 52:443-452, 1990.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer, 2001.
P. D. Hoff. Discussion of ‘Clustering objects on subsets of attributes’ by Friedman and Meulman. J. R. Stat. Soc. Ser. B, 66:845-846, 2004.
P. D. Hoff. Subset clustering of binary sequences, with an application to genomic abnormality data. Biometrics, 61:1027-1036, 2005.
P. D. Hoff. Model-based subspace clustering. Bayesian Analysis, 1:321-344, 2006.
T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional Discovery via a Compendium of Expression Profiles. Cell, 102:109-126, 2000.
A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science, 20:50-67, 2005.
S. Kim, M. G. Tadesse, and M. Vannucci. Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93:877-893, 2006.
H. Li, and F. Hong. Cluster-Rasch models for microarray gene expression data. Genome Biology, 2:research0031.1-0031.13, 2001.
J. S. Liu, J. L. Zhang, M. J. Palumbo, and C. E. Lawrence. Bayesian clustering with variable and transformation selection (with discussion). Bayesian Statistics, 7:249-275, 2003.
O. L. Mangasarian, and E. W. Wild. Feature selection in k-median clustering. Proceedings of SIAM International Conference on Data Mining, Workshop on Clustering High Dimensional Data and its Applications, April 24, 2004, Lake Buena Vista, FL, pages 23-28.
G. J. McLachlan, R. W. Bean, and D. Peel. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18:413-422, 2002.
G. J. McLachlan, and D. Peel. Finite Mixture Model. New York, John Wiley & Sons, Inc, 2002.
G. J. McLachlan, D. Peel, and R. W. Bean. Modeling high-dimensional data by mixtures of factor analyzers. Computational Statistics and Data Analysis, 41:379-388, 2003.
H. W. Mewes, C. Amid, R. Arnold, D. Frishman, U. Guldener, G. Mannhaupt, M. Munsterkotter, P. Pagel, N. Strack, V. Stumpflen, J. Warfsmann, and A. Ruepp. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res., 32:D41-D44, 2004.
W. Pan. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics, 12:546-554, 2002.
W. Pan, X. Shen, A. Jiang, and R. P. Hebbel. Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22:2388-2395, 2006.
A. E. Raftery. Discussion of “Bayesian clustering with variable and transformation selection” by Liu et al. Bayesian Statistics, 7:266-271, 2003.
A. E. Raftery, and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101:168-178, 2006.
S. Richardson, and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. JRSS-B, 59:731-758, 1997.
G. Schwarz. Estimating the dimensions of a model. Annals of Statistics, 6:461-464, 1978.
X. Shen, and J. Ye. Adaptive model selection. Journal of the American Statistical Association, 97:210-221, 2002.
M. G. Tadesse, N. Sha, and M. Vannucci. Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100:602-617, 2005.
J. G. Thomas, J. M. Olson, S. J. Tapscott, and L. P. Zhao. An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Research, 11:1227-1236, 2001.
R. Tibshirani. Regression shrinkage and selection via the Lasso. JRSS-B, 58:267-288, 1996.
R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class prediction by nearest shrunken centroids, with application to DNA microarrays. Statistical Science, 18:104-117, 2003.
V. Vapnik. Statistical Learning Theory. Wiley, 1998.
L. F. Wu, T. R. Hughes, A. P. Davierwala, M. D. Robinson, R. Stoughton, and S. J. Altschuler. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nature Genetics, 31:255-265, 2002.
G. Xiao, and W. Pan. Gene function prediction by a combined analysis of gene expression data and protein-protein interaction data. Journal of Bioinformatics and Computational Biology, 3:1371-1389, 2005.
K. Y. Yeung, and W. L. Ruzzo. Principal component analysis for clustering gene expression data. Bioinformatics, 17:763-774, 2001.
K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. Model-based clustering and data transformations for gene expression data. Bioinformatics, 17:977-987, 2001.
X. Zhou, M. C. Kao, and W. H. Wong. Transitive functional annotation by shortest-path analysis of gene expression data. Proc Natl Acad Sci USA, 99:12783-12788, 2002.
H. Zou, T. Hastie, and R. Tibshirani. On the “Degrees of Freedom” of the Lasso. Technical report, Dept. of Statistics, Stanford University, 2004. Available at http://stat.stanford.edu/~hastie/pub.htm.