nips nips2012 nips2012-57 nips2012-57-reference knowledge-graph by maker-knowledge-mining

57 nips-2012-Bayesian estimation of discrete entropy with mixtures of stick-breaking priors


Source: pdf

Author: Evan Archer, Il M. Park, Jonathan W. Pillow

Abstract: We consider the problem of estimating Shannon’s entropy H in the under-sampled regime, where the number of possible symbols may be unknown or countably infinite. Dirichlet and Pitman-Yor processes provide tractable prior distributions over the space of countably infinite discrete distributions, and have found major applications in Bayesian non-parametric statistics and machine learning. Here we show that they provide natural priors for Bayesian entropy estimation, due to the analytic tractability of the moments of the induced posterior distribution over entropy H. We derive formulas for the posterior mean and variance of H given data. However, we show that a fixed Dirichlet or Pitman-Yor process prior implies a narrow prior on H, meaning the prior strongly determines the estimate in the under-sampled regime. We therefore define a family of continuous mixing measures such that the resulting mixture of Dirichlet or Pitman-Yor processes produces an approximately flat prior over H. We explore the theoretical properties of the resulting estimators and show that they perform well on data sampled from both exponential and power-law tailed distributions.
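The following is a minimal illustrative sketch (Python with NumPy; not code from the paper) of the abstract's central observation: distributions drawn from a truncated stick-breaking construction with a fixed concentration parameter have entropies that cluster tightly, whereas mixing over the concentration parameter spreads the induced prior on H. The truncation level, sample counts, mixing range, and function names are illustrative assumptions.

import numpy as np

def stick_breaking_weights(alpha, d=0.0, trunc=5000, rng=None):
    """Draw weights from a truncated stick-breaking construction.

    alpha : concentration parameter (Dirichlet process when d = 0,
            Pitman-Yor process when 0 < d < 1).
    trunc : truncation level for the infinite sequence of sticks.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = np.arange(1, trunc + 1)
    # Stick proportions V_i ~ Beta(1 - d, alpha + i*d); d = 0 recovers the DP case.
    betas = rng.beta(1.0 - d, alpha + idx * d)
    # Remaining stick length before each break: prod_{j<i} (1 - V_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    w = betas * remaining
    return w / w.sum()  # renormalize the truncated weights

def entropy_nats(p):
    """Shannon entropy H = -sum_i p_i log p_i, in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha = 10.0
    # A fixed DP(alpha) prior concentrates the induced prior on H tightly ...
    H_fixed = [entropy_nats(stick_breaking_weights(alpha, rng=rng))
               for _ in range(200)]
    print(f"fixed alpha={alpha}: mean H = {np.mean(H_fixed):.2f} nats, "
          f"sd = {np.std(H_fixed):.2f}")
    # ... whereas mixing over alpha (here, an illustrative log-uniform range)
    # spreads the induced prior on H over a much wider range.
    H_mixed = []
    for _ in range(200):
        a = np.exp(rng.uniform(np.log(0.1), np.log(1000.0)))
        H_mixed.append(entropy_nats(stick_breaking_weights(a, rng=rng)))
    print(f"mixed alpha: mean H = {np.mean(H_mixed):.2f} nats, "
          f"sd = {np.std(H_mixed):.2f}")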


reference text

[1] G. Miller. Note on the bias of information estimates. Information theory in psychology: Problems and methods, 2:95–100, 1955.

[2] S. Panzeri and A. Treves. Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7:87–107, 1996.

[3] S. Strong, R. Koberle, R. de Ruyter van Steveninck, and W. Bialek. Entropy and information in neural spike trains. Physical Review Letters, 80:197–202, 1998.

[4] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15:1191–1253, 2003.

[5] P. Grassberger. Entropy estimates from insufficient samplings. arXiv preprint arXiv:physics/0307138, January 2008.

[6] I. Nemenman, F. Shafee, and W. Bialek. Entropy and inference, revisited. Adv. Neur. Inf. Proc. Sys., 14, 2002.

[7] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. The Annals of Probability, 25(2):855–900, 1997.

[8] H. Ishwaran and L. James. Generalized weighted Chinese restaurant processes for species sampling mixture models. Statistica Sinica, 13(4):1211–1236, 2003.

[9] S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. Adv. Neur. Inf. Proc. Sys., 18:459, 2006.

[10] G. Zipf. Human behavior and the principle of least effort. Addison-Wesley Press, 1949.

[11] T. Dudok de Wit. When do finite sample effects significantly affect entropy estimates? Eur. Phys. J. B Cond. Matter and Complex Sys., 11(3):513–516, October 1999.

[12] M. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, 2005.

[13] M. Hutter. Distribution of mutual information. Adv. Neur. Inf. Proc. Sys., 14:399, 2002.

[14] J. Hausser and K. Strimmer. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. The Journal of Machine Learning Research, 10:1469–1484, 2009.

[15] I. Nemenman, W. Bialek, and R. de Ruyter van Steveninck. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111, 2004.

[16] I. Nemenman. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy, 13(12):2013–2023, 2011.

[17] V. Q. Vu, B. Yu, and R. E. Kass. Coverage-adjusted entropy estimation. Statistics in Medicine, 26(21):4039–4060, 2007.

[18] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, and E. P. Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454:995–999, 2008.

[19] H. Ishwaran and M. Zarepour. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269–283, 2002.

[20] J. Kingman. Random discrete distributions. Journal of the Royal Statistical Society. Series B (Methodological), 37(1):1–22, 1975.

[21] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, March 2001.

[22] Y. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, 2006.

[23] M. Perman, J. Pitman, and M. Yor. Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39, March 1992.

[24] J. Pitman. Random discrete distributions invariant under size-biased permutation. Advances in Applied Probability, pages 525–539, 1996.

[25] A. Chao and T. Shen. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10(4):429–443, 2003.

[26] A. Antos and I. Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.

[27] D. Wolpert and D. Wolf. Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52(6):6841–6854, 1995.