nips2012-343: reference knowledge graph (NIPS 2012, paper 343)
Source: pdf
Author: Jayadev Acharya, Hirakendu Das, Alon Orlitsky
Abstract: The minimax KL-divergence of any distribution from all distributions in a collection P has several practical implications. In compression, it is called redundancy and represents the least additional number of bits over the entropy needed to encode the output of any distribution in P. In online estimation and learning, it is the lowest expected log-loss regret when guessing a sequence of random values generated by a distribution in P. In hypothesis testing, it upper bounds the largest number of distinguishable distributions in P. Motivated by problems ranging from population estimation to text classification and speech recognition, several machine-learning and information-theory researchers have recently considered label-invariant observations and properties induced by i.i.d. distributions. A sufficient statistic for all these properties is the data’s profile, the multiset of the number of times each data element appears. Improving on a sequence of previous works, we show that the redundancy of the collection of distributions induced over profiles by length-n i.i.d. sequences is between 0.3·n^{1/3} and n^{1/3} log^2 n, in particular, establishing its exact growth power.
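The profile mentioned in the abstract is straightforward to compute. The short Python sketch below is illustrative only (it is not code from the paper, and the helper name `profile` is ours); it builds the profile of a sample sequence and checks that relabeling the symbols leaves it unchanged, which is why the profile suffices for label-invariant properties.

```python
from collections import Counter

def profile(seq):
    """Return the profile of a sequence: the multiset (here, a sorted tuple)
    of the number of times each distinct element appears."""
    counts = Counter(seq)  # e.g. 'abracadabra' -> {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
    return tuple(sorted(counts.values(), reverse=True))

# The profile ignores symbol identities: relabeling the alphabet leaves it unchanged.
x = "abracadabra"
y = x.translate(str.maketrans("abrcd", "12345"))  # relabeled copy of x
assert profile(x) == profile(y) == (5, 2, 2, 1, 1)
print(profile(x))  # (5, 2, 2, 1, 1)
```

The minimax KL-divergence in the opening sentence can likewise be illustrated numerically for a small, finite collection P. The sketch below is an assumption-laden illustration rather than the paper's method: it relies on the standard redundancy-capacity correspondence (for a finite collection over a finite alphabet, the minimax KL-divergence equals the capacity of the channel whose rows are the distributions in P) and approximates it with Blahut-Arimoto iterations; the function names are ours.

```python
import numpy as np

def kl_bits(P, q):
    # Row-wise KL divergence D(p_i || q) in bits; zero-probability entries contribute 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / q), 0.0)
    return terms.sum(axis=1)

def minimax_redundancy(P, iters=5000):
    """Approximate the minimax KL-divergence (redundancy, in bits) of a finite
    collection of distributions over a finite alphabet, via the standard
    redundancy-capacity correspondence and Blahut-Arimoto iterations.
    P: (k, m) array whose rows are the k distributions."""
    P = np.asarray(P, dtype=float)
    w = np.full(P.shape[0], 1.0 / P.shape[0])  # prior over the k distributions
    for _ in range(iters):
        q = w @ P                              # mixture induced by the prior
        w = w * np.exp2(kl_bits(P, q))         # Blahut-Arimoto reweighting
        w /= w.sum()
    q = w @ P
    return kl_bits(P, q).max()                 # converges to min_q max_i D(p_i || q)

# Toy check: two coins with biases 0.2 and 0.8, observed once.
print(minimax_redundancy([[0.2, 0.8], [0.8, 0.2]]))  # ~0.278 bits
```

For the two symmetric coins in this toy example, the optimal mixture is uniform and the redundancy is about 0.278 bits. The collection studied in the paper, distributions induced over profiles of length-n i.i.d. sequences, is of course far larger; the result quoted in the abstract bounds exactly this quantity as a function of n.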
[1] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, and S. Pan. Competitive closeness testing. Journal of Machine Learning Research - Proceedings Track, 19:47–68, 2011.
[2] J. Acharya, H. Das, A. Jafarpour, A. Orlitsky, S. Pan, and A. T. Suresh. Competitive classification and closeness testing. Journal of Machine Learning Research - Proceedings Track, 23:22.1–22.18, 2012.
[3] A. R. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
[4] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In Annual Symposium on Foundations of Computer Science, page 259, 2000.
[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[6] K. Chaudhuri and A. McGregor. Finding metric structure in information theoretic clustering. In Conference on Learning Theory, pages 391–402, 2008.
[7] T. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, January 1991.
[8] T. Cover and J. Thomas. Elements of Information Theory, 2nd Ed. Wiley Interscience, 2006.
[9] L. Davisson. Universal noiseless coding. IEEE Transactions on Information Theory, 19(6):783–795, November 1973.
[10] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace. Efficient universal noiseless source codes. IEEE Transactions on Information Theory, 27(3):269–279, 1981.
[11] M. Drmota and W. Szpankowski. Precise minimax redundancy and regret. IEEE Transactions on Information Theory, 50(11):2686–2707, 2004.
[12] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, March 1975.
[13] B. M. Fitingof. Optimal coding in the case of unknown and changing message statistics. Probl. Inform. Transm., 2(2):1–7, 1966.
[14] A. Garivier. A lower-bound for the maximin redundancy in pattern coding. Entropy, 11(4):634–642, 2009.
[15] G. M. Gemelos and T. Weissman. On the entropy rate of pattern processes. IEEE Transactions on Information Theory, 52(9):3994–4007, 2006.
[16] P. Grünwald. A tutorial introduction to the minimum description length principle. CoRR, math.ST/0406077, 2004.
[17] P. Grünwald, J. S. Jones, J. de Winter, and E. Smith. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. Journal of Machine Learning Research - Proceedings Track, 19:397–420, 2011.
[18] P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.
[19] G. Hardy and S. Ramanujan. Asymptotic formulae in combinatory analysis. Proceedings of London Mathematics Society, 17(2):75–115, 1918.
[20] J. Kelly. A new interpretation of information rate. IEEE Transactions on Information Theory, 2(3):185–189, 1956.
[21] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, October 1998.
[22] M. Mitzenmacher and E. Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
[23] A. Orlitsky, S. Pan, Sajama, N. Santhanam, and K. Viswanathan. Pattern maximum likelihood: computation and experiments. In preparation, 2012.
[24] A. Orlitsky, N. Santhanam, K. Viswanathan, and J. Zhang. On modeling profiles instead of values. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.
[25] A. Orlitsky, N. Santhanam, and J. Zhang. Universal compression of memoryless sources over unknown alphabets. IEEE Transactions on Information Theory, 50(7):1469–1481, July 2004.
[26] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang. Limit results on pattern entropy. IEEE Transactions on Information Theory, 52(7):2954–2964, 2006.
[27] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, July 1984.
[28] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, January 1996.
[29] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.
[30] R. M. Roth. Introduction to coding theory. Cambridge University Press, 2006.
[31] G. Shamir. A new upper bound on the redundancy of unknown alphabets. In Proceedings of the 38th Annual Conference on Information Sciences and Systems, Princeton, New Jersey, 2004.
[32] G. Shamir. Universal lossless compression with unknown alphabets—the average case. IEEE Transactions on Information Theory, 52(11):4915–4944, November 2006.
[33] W. Szpankowski. On asymptotics of certain recurrences arising in universal coding. Problems of Information Transmission, 34(2):142–146, 1998.
[34] P. Valiant. Testing symmetric properties of distributions. PhD thesis, Cambridge, MA, USA, 2008. AAI0821026.
[35] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
[36] Q. Xie and A. Barron. Asymptotic minimax regret for data compression, gambling and prediction. IEEE Transactions on Information Theory, 46(2):431–445, March 2000.
[37] B. Yu and T. P. Speed. A rate of convergence result for a universal d-semifaithful code. IEEE Transactions on Information Theory, 39(3):813–820, 1993.
[38] J. Zhang. Universal Compression and Probability Estimation with Unknown Alphabets. PhD thesis, UCSD, 2005.