jmlr jmlr2009 jmlr2009-61 jmlr2009-61-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo
Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO
S. Abe. Foundations of nonextensive statistical mechanics. In Chaos, Nonlinearity, Complexity. Springer, 2006. S. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2001. R. Baeza-Yates, and B. Ribeiro-Neto. Modern information retrieval. ACM Press New York, 1999. 971 M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. J. of Mach. Learn. Res., 6:1705–1749, 2005. A. Ben-Hamza. A nonextensive information-theoretic measure for image edge detection. J. of Electronic Imaging, 15-1:13011.1–13011.8, 2006. A. Ben-Hamza and H. Krim. Image registration and segmentation by maximizing the jensen-r´ nyi e divergence. In Proc. of Int. Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 147–163. Springer, 2003. C. Berg, J. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, Berlin, 1984. J. Burbea and C. Rao. On the convexity of some divergence measures based on entropy functions. IEEE Trans. on Information Theory, 28(3):489–495, 1982. C. Cortes, P. Haffner and M. Mohri. Rational Kernels: Theory and Algorithms. J. of Mach. Learn. Res., 5:1035–1062, 2004. T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991. I. Csiszar. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975. M. Cuturi and J.-P. Vert. Semigroup kernels on finite sets. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 329–336. MIT Press, Cambridge, MA, 2005. M. Cuturi, K. Fukumizu, and J.-P. Vert. Semigroup kernels on measures. J. of Mach. Learn. Res., 6:1169–1198, 2005. Z. Dar´ czy. Generalized information functions. Information and Control, 16(1):36–51, 1970. o F. Desobry, M. Davy, and W. Fitzgerald. Density kernels on unordered sets for kernel-based signal processing. In Proc. of the IEEE Int. Conf. on Acoustics, Speech, and Signal Proc., 2007. R. El-Yaniv, S. Fine, and N. Tishby. Agnostic classification of markovian sequences. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 465–471. MIT Press, Cambridge, MA, 1998. D. Endres and J. Schindelin. A new metric for probability distributions. IEEE Trans. on Information Theory, 49(7):1858–1860, 2003. B. Fuglede. Spirals in Hilbert space, with an application in information theory. Expositiones Mathematicae, 25(1):23–46, 2005. S. Furuichi. Information theoretical properties of Tsallis entropies. J. of Mathematical Physics, 47 (2), 2006. M. Gell-Mann and C. Tsallis. Nonextensive Entropy: Interdisciplinary Applications. Oxford University Press, 2004. 972 N ONEXTENSIVE I NFORMATION T HEORETIC K ERNELS ON M EASURES I. Grosse, P. Bernaola-Galvan, P. Carpena, R. Roman-Roldan J. Oliver, and H. E. Stanley. Analysis of symbolic sequences using the Jensen-Shannon divergence. Phys. Review E, 65, 2002. D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997. D. Haussler. Convolution kernels on discrete structures. In Technical Report UCS-CRL-99-10, 1999. M. Havrda and F. Charv´ t. Quantification method of classification processes: concept of structural a α-entropy. Kybernetika, 3:30–35, 1967. Y. He, A. Ben-Hamza, and H. Krim. A generalized divergence measure for robust image registration. IEEE Trans. on Signal Proc., 51(5):1211–1220, 2003. M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Z. Ghahramani and R. Cowell, editors, Proc. of the 10th Int. Workshop on Artificial Intell. and Stat., 2005. M. Hein, T. Lal, and O. Bousquet. Hilbertian metrics on probability measures and their application in SVMs. In Proc. of the 26th DAGM Symposium, pages 270–277. Springer, 2004. T. Jebara, R. Kondor, and A. Howard. Probability product kernels. J. of Mach. Learn. Res., 5: 819–844, 2004. J. Jensen. Sur les fonctions convexes et les in´ galit´ s entre les valeurs moyennes. Acta Mathematica, e e 30:175–193, 1906. T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002. D. Karakos, S. Khudanpur, J. Eisner, and C. Priebe. Iterative denoising using Jensen-R´ nyi die vergences with an application to unsupervised document categorization. In Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Proc., Baltimore, MD, 2007. A. Khinchin. Mathematical Foundations of Information Theory. Dover, New York, 1957. S. Kullback and R.A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 79–86, 1951. J. Lafferty and G. Lebanon. Diffusion kernels on statistical manifolds. J. of Mach. Lear. Res., 6: 129–163, 2005. P. Lamberti and A. Majtey. Non-logarithmic Jensen-Shannon divergence. Physica A: Statistical Mechanics and its Applications, 329:81–90, 2003. C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: A string kernel for svm protein classification. In Proc. of the Pacific Symposium on Biocomputing, pages 564–575, 2002. Y. Li, X. Fan, and G. Li. Image segmentation based on Tsallis-entropy and Renyi-entropy and their comparison. In IEEE Int. Conf. on Industrial Informatics, pages 943–948, 2006. 973 M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO J. Lin. Divergence measures based on the Shannon entropy. IEEE Trans. on Information Theory, 37(1):145–151, 1991. J. Lin and S. Wong. A new directed divergence measure and its characterization. Int. J. of General Systems, 17:73–81, 1990. J. Lindhard. On the Theory of Measurement and Its Consequences in Statistical Dynamics. Munksgaard, Copenhagen, 1974. J. Lindhard and V. Nielsen. Studies in Statistical Dynamics. Munksgaard, Copenhagen, 1971. H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. of Mach. Learn. Res., 2:419–444, 2002. A. F. T. Martins, P. M. Q. Aguiar, and M. A. T. Figueiredo. Tsallis kernels on measures. In Proc. of the IEEE Information Theory Workshop, Porto, Portugal, 2008a. A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. Nonextensive entropic kernels. In Proc. of the Int. Conf. on Machine Learning – ICML’08, Helsinki, Finland, 2008b. P. Moreno, P. Ho, and N. Vasconcelos. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In S. Thrun, L. Saul, and B. Sch¨ lkopf, editors, Advances o in Neural Information Processing Systems 16. MIT Press, 2004. A. R´ nyi. On measures of entropy and information. In Proc. of the 4th Berkeley Symposium on e Mathematics, Statistics, and Probability, volume 1, pages 547–561, Berkeley, 1961. University of California Press. B. Sch¨ lkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, 2002. o C. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949. C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27 (3):379–423, 1948. J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. J. Steele. The Cauchy-Schwarz Master Class. Cambridge University Press, Cambridge, 2006. H. Suyari. Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. IEEE Trans. on Information Theory, 50(8):1783–1787, 2004. F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. on Information Theory, 46(4):1602–1609, 2000. C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. of Statistical Physics, 52: 479–487, 1988. 974 N ONEXTENSIVE I NFORMATION T HEORETIC K ERNELS ON M EASURES S. Vishwanathan and A. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Sch¨ lkopf, o and J.P. Vert, editors, Kernels and Bioinformatics, Cambridge, MA, 2003. MIT Press. D. Zhang, X. Chen, and W. Lee. Text classification with kernels on the multinomial manifold. In Proc. of the 28th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 266–273, New York, NY, 2005. 975