jmlr jmlr2009 jmlr2009-1 jmlr2009-1-reference knowledge-graph by maker-knowledge-mining

1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation


Source: pdf

Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama

Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
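The abstract describes the estimator only at a high level. Below is a minimal sketch of a least-squares density-ratio fit with a closed-form, ridge-regularized solution in the spirit of the abstract, assuming Gaussian kernel basis functions centred on the numerator samples; the function name estimate_importance, the kernel width sigma, and the regularization parameter lam are illustrative choices rather than the paper's notation.

```python
# Sketch: fit w(x) = sum_l alpha_l * k(x, c_l) so that w approximates p_nu / p_de,
# by minimizing a regularized squared error under the denominator distribution.
import numpy as np

def estimate_importance(x_nu, x_de, sigma=1.0, lam=0.1):
    """x_nu: (n_nu, d) samples from the numerator density p_nu.
    x_de: (n_de, d) samples from the denominator density p_de.
    Returns a function w(x) giving estimated importance values."""
    centers = x_nu  # kernel centres placed on numerator samples

    def kernel(x, c):
        # Gaussian kernel matrix between rows of x and rows of c
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    Phi_de = kernel(x_de, centers)            # (n_de, b) basis values under p_de
    Phi_nu = kernel(x_nu, centers)            # (n_nu, b) basis values under p_nu

    H = Phi_de.T @ Phi_de / x_de.shape[0]     # empirical second-order term under p_de
    h = Phi_nu.mean(axis=0)                   # empirical first-order term under p_nu

    # Closed-form ridge-regularized solution: alpha = (H + lam * I)^{-1} h
    alpha = np.linalg.solve(H + lam * np.eye(H.shape[0]), h)

    def w(x):
        # estimated importance values, clipped to be non-negative
        return np.maximum(kernel(x, centers) @ alpha, 0.0)

    return w
```

In practice sigma and lam would be chosen by cross-validation, which, as the abstract notes, the proposed method can evaluate analytically in leave-one-out form.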


reference text

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6):716–723, 1974.
S. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Computation, 4(4):605–618, 1992.
P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, 1998.
D. Bertsekas, A. Nedic, and A. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Belmont, MA, 2003.
M. J. Best. An algorithm for the solution of the parametric quadratic programming problem. CORR Report 82-24, Faculty of Mathematics, University of Waterloo, 1982.
S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training and test distributions. In Proceedings of the 24th International Conference on Machine Learning, 2007.
K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000.
G. C. Cawley and N. L. C. Talbot. Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10):1467–1475, 2004.
S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
K. F. Cheng and C. K. Chu. Semiparametric density estimation under a two-sample density ratio model. Bernoulli, 10(4):583–604, 2004.
A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. Springer-Verlag, New York, 1998.
B. Efron, T. Hastie, R. Tibshirani, and I. Johnstone. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
E. A. Fernandez. The dprep Package, 2005. URL http://math.uprm.edu/~edgar/dprep.pdf.
G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, Berlin, 1996.
G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 1996.
H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling with automatic model selection in value function approximation. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI2008), pages 1351–1356, Chicago, USA, Jul. 13–17 2008. The AAAI Press.
L. K. Hansen and J. Larsen. Linear unlearning for cross-validation. Advances in Computational Mathematics, 5:269–280, 1996.
W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric Models. Springer Series in Statistics. Springer, Berlin, 2004.
T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153–162, 1979.
S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Inlier-based outlier detection via direct density ratio estimation. In Proceedings of the IEEE International Conference on Data Mining (ICDM2008), Pisa, Italy, Dec. 15–19 2008.
V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3):55–67, 1970.
J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 601–608. MIT Press, Cambridge, MA, 2007.
A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. An S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.
S. Konishi and G. Kitagawa. Generalized information criteria in model selection. Biometrika, 83:875–890, 1996.
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. Technical report, Department of Computer Science, National Taiwan University, 2007. URL http://www.csie.ntu.edu.tw/~cjlin/liblinear/.
A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3, 1969. In Russian.
T. P. Minka. A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research, 2007. URL http://research.microsoft.com/~minka/papers/logreg/minka-logreg.pdf.
J. E. Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 847–854. Morgan Kaufmann Publishers, Inc., 1992.
X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.
K. B. Petersen and M. S. Pedersen. The matrix cookbook. Technical report, Technical University of Denmark, 2007. URL http://www2.imm.dtu.dk/pubdb/p.php?3274.
J. Qin. Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85(3):619–639, 1998.
J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, editors. Dataset Shift in Machine Learning. MIT Press, Cambridge, MA, 2008.
C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. URL http://www.cs.toronto.edu/~delve/.
G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
K. Scheinberg. An efficient implementation of an active set method for SVMs. Journal of Machine Learning Research, 7:2237–2257, 2006.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
L. Song, A. Smola, A. Gretton, K. M. Borgwardt, and J. Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 823–830, New York, NY, USA, 2007. ACM.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36(2):111–147, 1974.
M. Sugiyama and K.-R. Müller. Input-dependent estimation of generalization error under covariate shift. Statistics & Decisions, 23(4):249–279, 2005.
M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, May 2007.
M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1433–1440, Cambridge, MA, 2008a. MIT Press.
M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008b.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45–66, 2004.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. In Proceedings of the 2008 SIAM International Conference on Data Mining (SDM2008), Atlanta, Georgia, USA, 2008.
V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1990.
P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117–143, 1995.
J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan. Brain-computer interfaces for communication and control. Clinical Neurophysiology, 113(6):767–791, 2002.
B. Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, 2004. ACM Press.