
Relative Density-Ratio Estimation for Robust Distribution Comparison (NIPS 2011, paper 238)


Source: pdf

Author: Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, Masashi Sugiyama

Abstract: Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.
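
As a reading aid, the following is a minimal illustrative sketch (Python/NumPy) of how a relative density-ratio of the form p(x) / (alpha*p(x) + (1-alpha)*q(x)) might be fit with Gaussian kernel basis functions and a regularized least-squares objective. The choice of kernel centers, the mixing parameter alpha, the kernel width sigma, and the regularizer lam are illustrative assumptions here, not the paper's exact formulation or the authors' reference implementation.

import numpy as np

def fit_relative_density_ratio(x_p, x_q, alpha=0.5, sigma=1.0, lam=0.1):
    """Fit r_alpha(x) ~ p(x) / (alpha*p(x) + (1-alpha)*q(x)) from samples.

    x_p: (n_p, d) array of samples from the numerator density p.
    x_q: (n_q, d) array of samples from the denominator density q.
    Returns a callable that evaluates the fitted ratio at new points.
    Illustrative sketch; in practice alpha, sigma, and lam would be
    chosen by cross-validation.
    """
    def gaussian_kernel(x, centers):
        # Pairwise squared Euclidean distances between rows of x and centers.
        d2 = (np.sum(x ** 2, axis=1)[:, None]
              + np.sum(centers ** 2, axis=1)[None, :]
              - 2.0 * x @ centers.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    centers = x_p                          # kernel centers on numerator samples (assumption)
    phi_p = gaussian_kernel(x_p, centers)  # (n_p, b) design matrix for p-samples
    phi_q = gaussian_kernel(x_q, centers)  # (n_q, b) design matrix for q-samples

    # Regularized least-squares fit: solve (H + lam*I) theta = h, where H mixes
    # the two empirical second-moment matrices with weight alpha.
    H = (alpha * (phi_p.T @ phi_p) / len(x_p)
         + (1.0 - alpha) * (phi_q.T @ phi_q) / len(x_q))
    h = phi_p.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)

    return lambda x: gaussian_kernel(np.atleast_2d(x), centers) @ theta

# Example use on toy data: small fitted ratio values at q-samples flag points that
# are unlikely under p relative to the alpha-mixture (e.g. for outlier scoring).
rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(200, 1))
x_q = rng.normal(0.5, 1.0, size=(200, 1))
ratio = fit_relative_density_ratio(x_p, x_q, alpha=0.5)
scores = ratio(x_q)
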


References

[1] A. J. Smola, L. Song, and C. H. Teo. Relative novelty detection. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS2009), pages 536–543, 2009.

[2] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2):309–336, 2011.

[3] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, Cambridge, MA, 2007.

[4] M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample test. Neural Networks, 24(7):735–751, 2011.

[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[6] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, May 2007.

[7] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.

[8] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

[9] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.

[10] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[11] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.

[12] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.

[13] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.

[14] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.

[15] T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010), pages 804–811, 2010.

[16] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 442–450. 2010.

[17] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.

[18] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

[19] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.

[20] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, NY, 1993.

[21] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.

[22] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

[23] B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750–1758. 2009.

[24] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.

[25] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[26] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.