jmlr jmlr2010 jmlr2010-78 jmlr2010-78-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley
Abstract: The principle of parsimony, also known as “Ockham’s razor”, has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence, in some approaches, of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length
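To make the convergence mentioned in the abstract concrete, here is a minimal worked sketch (the notation f, ℓ, Ω, and λ is chosen here for illustration and is not taken from the editorial itself): the regularization, MAP-Bayesian, and structural risk minimization views all select a model by trading empirical fit against a complexity penalty,

\[
f^{*} \;=\; \arg\min_{f \in \mathcal{F}} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr) \;+\; \lambda\, \Omega[f] .
\]

If the loss is a negative log-likelihood, \( \ell(f(x), y) = -\log p(y \mid x, f) \), and the penalty encodes a prior, \( p(f) \propto \exp\bigl(-n \lambda\, \Omega[f]\bigr) \), then the same \( f^{*} \) is the maximum a posteriori (MAP) estimate,

\[
f^{*} \;=\; \arg\max_{f \in \mathcal{F}} \; p(f) \prod_{i=1}^{n} p(y_i \mid x_i, f),
\]

and increasing \( \lambda \) restricts the effective capacity of \( \mathcal{F} \), trading variance for bias much as stepping down a nested sequence of function classes does in structural risk minimization.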
M. Adankon and M. Cheriet. Unified framework for SVM model selection. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
H. Akaike. Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki, editors, 2nd International Symposium on Information Theory, pages 267–281. Akademiai Kiado, Budapest, 1973.
C. F. Aliferis, I. Tsamardinos, and A. Statnikov. HITON, a novel Markov blanket algorithm for optimal variable selection. In 2003 American Medical Informatics Association (AMIA) Annual Symposium, pages 21–25, 2003.
C.-A. Azencott and P. Baldi. Virtual high-throughput screening with two-dimensional kernels. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
P. L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems, volume 9, page 134, Cambridge, MA, 1997. MIT Press.
A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6–17, 2002.
A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245–271, December 1997.
B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
M. Boullé. Compression-based averaging of selective naive Bayes classifiers. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 8, pages 1659–1685, Jul 2007. URL http://www.jmlr.org/papers/volume8/boulle07a/boulle07a.pdf.
M. Boullé. Data grid models for preparation and modeling in supervised learning. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
G. Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In IJCNN, pages 1661–1668, 2006.
G. Cawley and N. Talbot. Over-fitting in model selection and subsequent selection bias in performance evaluation. JMLR, submitted, 2009.
G. Cawley and N. Talbot. Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 8, pages 841–861, Apr 2007a. URL http://www.jmlr.org/papers/volume8/cawley07a/cawley07a.pdf.
G. C. Cawley and N. L. C. Talbot. Agnostic learning versus prior knowledge in the design of kernel machines. In Proc. IJCNN07, Orlando, Florida, Aug 2007b. INNS/IEEE.
G. C. Cawley, G. J. Janacek, and N. L. C. Talbot. Generalised kernel machines. In International Joint Conference on Neural Networks, pages 1720–1725. IEEE, August 2007.
W. Chu, S. Keerthi, C. J. Ong, and Z. Ghahramani. Bayesian Support Vector Machines for feature ranking and selection. In I. Guyon, et al., editors, Feature Extraction, Foundations and Applications, 2006.
G. Claeskens, C. Croux, and J. Van Kerckhoven. An information criterion for variable selection in Support Vector Machines. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 9, pages 541–558, Mar 2008. URL http://www.jmlr.org/papers/volume9/claeskens08a/claeskens08a.pdf.
Clopinet. Challenges in machine learning, 2004-2009. URL http://clopinet.com/challenges.
C. Dahinden. An improved Random Forests approach with application to the performance prediction challenge datasets. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
M. Debruyne, M. Hubert, and J. Suykens. Model selection in kernel based regression using the influence function. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 9, pages 2377–2400, Oct 2008. URL http://www.jmlr.org/papers/volume9/debruyne08a/debruyne08a.pdf.
Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156. Morgan Kaufmann, 1996.
J. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2000.
J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression, a statistical view of boosting. Annals of Statistics, 28:337–374, 2000.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software (to appear), 2009.
P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In ICML ’09: Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1.
Y. Guermeur. VC theory of large margin multi-category classifiers. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 8, pages 2551–2594, Nov 2007. URL http://www.jmlr.org/papers/volume8/guermeur07a/guermeur07a.pdf.
I. Guyon. A practical guide to model selection. In J. Marie, editor, Machine Learning Summer School. Springer, to appear, 2009.
I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors. Feature Extraction, Foundations and Applications. Studies in Fuzziness and Soft Computing. With data, results and sample code for the NIPS 2003 feature selection challenge. Physica-Verlag, Springer, 2006a. URL http://clopinet.com/fextract-book/.
I. Guyon, A. Saffari, G. Dror, and J. Buhmann. Performance prediction challenge. In IEEE/INNS conference IJCNN 2006, Vancouver, Canada, July 16-21 2006b.
I. Guyon, A. Saffari, G. Dror, and G. Cawley. Agnostic learning vs. prior knowledge challenge. In IEEE/INNS conference IJCNN 2007, Orlando, Florida, August 12-17 2007.
I. Guyon, C. Aliferis, G. Cooper, A. Elisseeff, J.-P. Pellet, P. Spirtes, and A. Statnikov. Design and analysis of the causation and prediction challenge. In JMLR W&CP, volume 3, pages 1–33, WCCI2008 workshop on causality, Hong Kong, June 3-4 2008a. URL http://jmlr.csail.mit.edu/papers/topic/causality.html.
I. Guyon, A. Saffari, G. Dror, and G. Cawley. Analysis of the IJCNN 2007 agnostic learning vs. prior knowledge challenge. In Neural Networks, volume 21, pages 544–550, Orlando, Florida, March 2008b.
I. Guyon, D. Janzing, and B. Schölkopf. Causality: objectives and assessment. In NIPS 2008 workshop on causality, volume 7. JMLR W&CP, in press, 2009a.
I. Guyon, V. Lemaire, M. Boullé, G. Dror, and D. Vogel. Analysis of the KDD cup 2009: Fast scoring on a large orange customer database. In KDD cup 2009, in press, volume 8. JMLR W&CP, 2009b.
H. J. Escalante, M. Montes, and L. E. Sucar. Particle swarm model selection. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 10, pages 405–440, Feb 2009. URL http://www.jmlr.org/papers/volume10/escalante09a/escalante09a.pdf.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Data Mining, Inference and Prediction. Springer Verlag, 2000.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. JMLR, 5:1391–1415, 2004. URL http://jmlr.csail.mit.edu/papers/volume5/hastie04a/hastie04a.pdf.
D. Haussler, M. Kearns, and R. Schapire. Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14(1):83–113, 1994. ISSN 0885-6125.
A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.
C. Hue and M. Boullé. A new probabilistic approach in rank regression with optimal Bayesian partitioning. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 8, pages 2727–2754, Dec 2007. URL http://www.jmlr.org/papers/volume8/hue07a/hue07a.pdf.
IBM team. Winning the KDD cup orange challenge with ensemble selection. In KDD cup 2009, in press, volume 8. JMLR W&CP, 2009.
R. Kohavi and G. John. Wrappers for feature selection. Artificial Intelligence, 97(1-2):273–324, December 1997.
I. Koo and R. M. Kil. Model selection for regression with continuous kernel functions using the modulus of continuity. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 9, pages 2607–2633, Nov 2008. URL http://www.jmlr.org/papers/volume9/koo08b/koo08b.pdf.
G. Kunapuli, J.-S. Pang, and K. Bennett. Bilevel cross-validation-based model selection. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
J. Langford. Tutorial on practical prediction theory for classification. JMLR, 6:273–306, Mar 2005. URL http://jmlr.csail.mit.edu/papers/volume6/langford05a/langford05a.pdf.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
R. W. Lutz. Logitboost with trees applied to the WCCI 2006 performance prediction challenge datasets. In Proc. IJCNN06, pages 2966–2969, Vancouver, Canada, July 2006. INNS/IEEE.
D. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4:448–472, 1992.
O. Madani, D. M. Pennock, and G. W. Flake. Co-validation: Using model disagreement to validate classification algorithms. In NIPS, 2005.
M. Momma and K. Bennett. A pattern search method for model selection of Support Vector Regression. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 2002.
R. Neal and J. Zhang. High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In I. Guyon, et al., editors, Feature Extraction, Foundations and Applications, 2006.
V. Nikulin. Classification with random sets, boosting and distance-based clustering. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945):978–982, February 1990.
A. Pozdnoukhov and S. Bengio. Invariances in kernel methods: From samples to objects. Pattern Recogn. Lett., 27(10):1087–1097, 2006. ISSN 0167-8655.
E. Pranckeviciene and R. Somorjai. Liknon feature selection: Behind the scenes. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
J. Reunanen. Model selection and assessment using cross-indexing. In Proc. IJCNN07, Orlando, Florida, Aug 2007. INNS/IEEE.
S. Rosset and J. Zhu. Sparse, flexible and efficient modeling using L1 regularization. In I. Guyon, et al., editors, Feature Extraction, Foundations and Applications, 2006.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
M. Saeed. Hybrid learning using mixture models and artificial neural networks. In I. Guyon, et al., editors, Hands on Pattern Recognition. Microtome, 2009.
A. Saffari and I. Guyon. Quick start guide for CLOP. Technical report, Graz University of Technology and Clopinet, May 2006. URL http://clopinet.com/CLOP/.
D. Schuurmans and F. Southey. Metric-based methods for adaptive model selection and regularization. Machine Learning, Special Issue on New Methods for Model Selection and Model Combination, 48:51–84, 2001.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. JMLR, 3:233–269, 2003. URL http://jmlr.csail.mit.edu/papers/volume3/seeger02a/seeger02a.pdf.
M. Seeger. Bayesian inference and optimal design for the sparse linear model. JMLR, 9:759–813, 2008. ISSN 1533-7928.
P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 50–58, San Mateo, CA, 1993. Morgan Kaufmann.
A. Singh, R. Nowak, and X. Zhu. Unlabeled data: Now it helps, now it doesn’t. In NIPS, 2008.
A. Smola, S. Mika, B. Schölkopf, and R. Williamson. Regularized principal manifolds. JMLR, 1:179–209, 2001. URL http://jmlr.csail.mit.edu/papers/volume1/smola01a/smola01a.pdf.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
E. Tuv, A. Borisov, G. Runger, and K. Torkkola. Feature selection with ensembles, artificial variables, and redundancy elimination. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 10, pages 1341–1366, Jul 2009. URL http://www.jmlr.org/papers/volume10/tuv09a/tuv09a.pdf.
L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, N.Y., 1998.
V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264–280, 1971.
S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In Advances in Neural Information Processing Systems 15, pages 569–576. MIT Press, 2003. URL http://books.nips.cc/papers/files/nips15/AA11.pdf.
G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990.
A. Waibel. Consonant recognition by modular construction of large phonemic time-delay neural networks. In NIPS, pages 215–223, 1988.
C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press. URL http://www.cs.rhul.ac.uk/home/chrisw/dynk.ps.gz.
P. Werbos. Backpropagation: Past and future. In International Conference on Neural Networks, pages 343–353. IEEE, IEEE press, 1988.
J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm with linear models and kernel methods. JMLR, 3:1439–1461, 2003.
J. Wichard. Agnostic learning with ensembles of classifiers. In Proc. IJCNN07, Orlando, Florida, Aug 2007. INNS/IEEE.
J. Ye, S. Ji, and J. Chen. Multi-class discriminant kernel learning via convex programming. In I. Guyon and A. Saffari, editors, JMLR, Special Topic on Model Selection, volume 9, pages 719–758, Apr 2008. URL http://www.jmlr.org/papers/volume9/ye08b/ye08b.pdf.
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In NIPS, 2003.