
jmlr-2007-25: Covariate Shift Adaptation by Importance Weighted Cross Validation


Source: pdf

Author: Masashi Sugiyama, Matthias Krauledat, Klaus-Robert Müller

Abstract: A common assumption in supervised learning is that the input points in the training set follow the same probability distribution as the input points that will be given in the future test phase. However, this assumption is not satisfied, for example, when predictions are extrapolated outside the training region. The situation where the training input points and test input points follow different distributions, while the conditional distribution of output values given input points is unchanged, is called covariate shift. Under covariate shift, standard model selection techniques such as cross validation do not work as desired since their unbiasedness is no longer maintained. In this paper, we propose a new method called importance weighted cross validation (IWCV) and prove its unbiasedness even under covariate shift. Whereas alternatives to IWCV exist for regression, the IWCV procedure is the only one that can be applied for unbiased classification under covariate shift. The usefulness of the proposed method is illustrated by simulations and further demonstrated on a brain-computer interface, where strong non-stationarity effects can be seen between training and test sessions.

Keywords: covariate shift, cross validation, importance sampling, extrapolation, brain-computer interface
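The abstract only names the idea, so a minimal sketch may help make it concrete. The Python code below is not from the paper; it illustrates the core reweighting step of importance weighted cross validation: ordinary k-fold cross validation averages each held-out loss directly, while IWCV first multiplies each held-out loss by the importance weight w(x) = p_test(x) / p_train(x). The learner interface (`fit`, `predict`, `loss`) and the assumption that the weights `w` have already been estimated (e.g., by a separate density-ratio estimator) are placeholders, not part of the paper's text.

```python
# Minimal sketch of importance weighted cross validation (IWCV).
# Assumptions (not from the paper): importance weights
# w_i = p_test(x_i) / p_train(x_i) are given, and `fit`, `predict`,
# `loss` are hypothetical stand-ins for any learner and loss function.

import numpy as np

def iwcv_score(X, y, w, fit, predict, loss, n_folds=5, seed=0):
    """k-fold IWCV estimate of the test loss under covariate shift.

    X, y : training inputs and outputs (NumPy arrays)
    w    : importance weights p_test(x_i) / p_train(x_i), one per sample
    fit, predict, loss : learner interface (placeholders)
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[trn], y[trn])
        # Ordinary CV would average loss(...) directly; IWCV reweights each
        # held-out loss by w so the average reflects the test distribution.
        scores.append(np.mean(w[val] * loss(y[val], predict(model, X[val]))))
    return float(np.mean(scores))
```

To select among candidate models or hyper-parameter settings, one would compute `iwcv_score` for each candidate and keep the minimizer, exactly as with ordinary cross validation.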


Reference text

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6):716–723, 1974.
N. Altman and C. Léger. On the optimality of prediction-based selection criteria and the convergence rates of estimators. Journal of the Royal Statistical Society, Series B, 59(1):205–216, 1997.
F. Babiloni, F. Cincotti, L. Lazzarini, J. d. R. Millán, J. Mouriño, M. Varsta, J. Heikkonen, L. Bianchi, and M. G. Marciani. Linear classification of low-resolution EEG patterns produced by imagined hand movements. IEEE Transactions on Rehabilitation Engineering, 8(2):186–188, June 2000.
F. Bach. Active learning for misspecified generalized linear models. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
P. Baldi, S. Brunak, and G. A. Stolovitzky. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, 1998.
M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1–3):209–239, 2004.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
S. Bickel. ECML2006 workshop on discovery challenge, 2006. URL http://www.ecmlpkdd2006.org/challenge.html.
S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
B. Blankertz, G. Dornhege, M. Krauledat, and K.-R. Müller. The Berlin brain-computer interface: EEG-based communication without subject training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2):147–152, 2006.
B. Blankertz, G. Dornhege, M. Krauledat, K.-R. Müller, and G. Curio. The non-invasive Berlin brain-computer interface: Fast acquisition of effective performance in untrained subjects. NeuroImage, 2007.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
J. Q. Candela, N. Lawrence, and A. Schwaighofer. Learning when test and training inputs have different distributions challenge, 2005. URL http://www.pascal-network.org/Challenges/LETTIDD/.
J. Q. Candela, N. Lawrence, A. Schwaighofer, and M. Sugiyama. NIPS2006 workshop on learning when test and training inputs have different distributions, 2006. URL http://ida.first.fraunhofer.de/projects/different06/.
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, 2006.
N. Chawla, N. Japkowicz, and A. Kolcz. ICML2003 workshop on learning from imbalanced data sets, 2003. URL http://www.site.uottawa.ca/~nat/Workshop2003/workshop2003.html.
D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
H. Daumé III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.
G. Dornhege, B. Blankertz, G. Curio, and K.-R. Müller. Boosting bit rates in non-invasive EEG single-trial classifications by feature combination and multi-class paradigms. IEEE Transactions on Biomedical Engineering, 51(6):993–1002, June 2004.
G. Dornhege, J. d. R. Millán, T. Hinterberger, D. McFarland, and K.-R. Müller, editors. Towards Brain Computer Interfacing. MIT Press, 2007. In preparation.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2001.
B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.
B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, 1993.
W. Fan, I. Davidson, B. Zadrozny, and P. S. Yu. An improved categorization of classifier's sensitivity on sample selection bias. In Proceedings of the Fifth IEEE International Conference on Data Mining, 2005.
V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936.
G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, Berlin, 1996.
K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000.
W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, Berlin, 2004.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.
X. He and P. Niyogi. Locality preserving projections. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153–162, 1979.
M. Hein. Uniform convergence of adaptive graph-based regularization. In Proceedings of the 19th Annual Conference on Learning Theory, pages 50–64, 2006.
J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
N. Japkowicz. AAAI2000 workshop on learning from imbalanced data sets, 2000. URL http://www.site.uottawa.ca/~nat/Workshop2000/workshop2000.html.
T. Kanamori and H. Shimodaira. Active learning algorithm using the maximum weighted log-likelihood estimator. Journal of Statistical Planning and Inference, 116(1):149–162, 2003.
J. Kiefer. Optimum experimental designs. Journal of the Royal Statistical Society, Series B, 21:272–304, 1959.
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
L. Kuncheva. Classifier ensembles for changing environments, 2006. URL http://gow.epsrc.ac.uk/ViewGrant.aspx?GrantRef=EP/D04040X/1.
Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, number 1524 in Lecture Notes in Computer Science, pages 299–314. Springer, Berlin, 1998.
S. Lemm, B. Blankertz, G. Curio, and K.-R. Müller. Spatio-spectral filters for improved classification of single trial EEG. IEEE Transactions on Biomedical Engineering, 52(9):1541–1548, Sept. 2005.
Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6(3):259–275, 2002.
Y. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1/3):191–202, 2002.
A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969.
D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590–604, 1992.
R. Meir and G. Rätsch. An introduction to boosting and leveraging. In S. Mendelson and A. J. Smola, editors, Advanced Lectures on Machine Learning, pages 119–184. Springer, Berlin, 2003.
J. d. R. Millán. On the need for on-line learning in brain-computer interfaces. In Proceedings of the International Joint Conference on Neural Networks, volume 4, pages 2877–2882, Budapest, Hungary, July 2004.
N. Murata, M. Kawanabe, A. Ziehe, K.-R. Müller, and S. Amari. On-line learning in changing environments with applications in supervised and unsupervised learning. Neural Networks, 15(4–6):743–760, 2002.
G. Pfurtscheller and F. H. Lopes da Silva. Event-related EEG/MEG synchronization and desynchronization: basic principles. Clinical Neurophysiology, 110(11):1842–1857, Nov. 1999.
F. Pukelsheim. Optimal Design of Experiments. Wiley, 1993.
H. Ramoser, J. Müller-Gerking, and G. Pfurtscheller. Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering, 8(4):441–446, 2000.
H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
D. Saad, editor. On-Line Learning in Neural Networks. Cambridge University Press, Cambridge, 1998.
R. E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification, pages 149–172. Springer, New York, 2003.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
C. R. Shelton. Importance Sampling for Reinforcement Learning with Multiple Objectives. PhD thesis, Massachusetts Institute of Technology, 2001.
P. Shenoy, M. Krauledat, B. Blankertz, R. P. N. Rao, and K.-R. Müller. Towards adaptive classification for BCI. Journal of Neural Engineering, 3:R13–R23, 2006.
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London, 1986.
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.
M. Stone. Asymptotics for and against cross-validation. Biometrika, 64(1):29–35, 1977.
A. Storkey and M. Sugiyama. Mixture regression for covariate shift. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.
M. Sugiyama. Active learning in approximately linear regression based on conditional expectation of generalization error. Journal of Machine Learning Research, 7:141–166, Jan. 2006a.
M. Sugiyama. Local Fisher discriminant analysis for supervised dimensionality reduction. In W. Cohen and A. Moore, editors, Proceedings of the 23rd International Conference on Machine Learning, pages 905–912, Pittsburgh, Pennsylvania, USA, June 25–29, 2006b.
M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Technical Report TR06-0007, Department of Computer Science, Tokyo Institute of Technology, Sep. 2006. URL http://www.cs.titech.ac.jp/.
M. Sugiyama and K.-R. Müller. Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4):249–279, 2005.
M. Sugiyama and H. Ogawa. Subspace information criterion for model selection. Neural Computation, 13(8):1863–1889, 2001.
M. Sugiyama and H. Ogawa. Active learning with model selection—Simultaneous optimization of sample points and models for trigonometric polynomial models. IEICE Transactions on Information and Systems, E86-D(12):2753–2763, 2003.
M. Sugiyama and N. Rubens. A batch ensemble approach to active learning with model selection, 2007. (Submitted).
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
C. Vidaurre, A. Schlögl, R. Cabeza, and G. Pfurtscheller. About adaptive classifiers for brain computer interfaces. Biomedizinische Technik, 49(1):85–86, 2004.
G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1990.
S. Watanabe. Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899–933, 2001.
D. P. Wiens. Robust weights and designs for biased regression models: Least squares and generalized M-estimation. Journal of Statistical Planning and Inference, 83(2):395–412, 2000.
J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767–791, 2002.
K. Yamazaki, M. Kawanabe, S. Watanabe, M. Sugiyama, and K.-R. Müller. Asymptotic Bayesian generalization error when training and test distributions are different, 2007. (Submitted).
B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, 2004. ACM Press.