jmlr jmlr2011 jmlr2011-43 jmlr2011-43-reference knowledge-graph by maker-knowledge-mining

43 jmlr-2011-Information, Divergence and Risk for Binary Experiments


Source: pdf

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f-divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC curves and statistical information. We do this by systematically studying integral and variational representations of these objects, and in so doing identify their representation primitives, all of which are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f-divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants.

Keywords: classification, loss functions, divergence, statistical information, regret bounds
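
For orientation, the central objects named in the abstract can be written out using their standard definitions. This is a minimal sketch only; the paper's own normalisations may differ. In LaTeX:

% Sketch using the standard definitions (Ali and Silvey, 1966; Csiszár, 1967);
% the paper's normalisations may differ. P and Q are distributions with
% densities p and q, and f is convex with f(1) = 0.
\[
  I_f(P,Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx
  \qquad \text{(the f-divergence of $P$ from $Q$)}
\]
% The variational divergence is the f-divergence generated by f(t) = |t - 1|:
\[
  V(P,Q) = \int |p(x) - q(x)|\, dx \;\in\; [0,2].
\]
% The classical Pinsker inequality (f(t) = t log t gives the KL divergence):
\[
  \mathrm{KL}(P\|Q) \;\ge\; \frac{1}{2}\, V(P,Q)^2,
\]
% which the paper's generalised Pinsker inequalities extend to other f.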


reference text

H. Abelson, G.J. Sussman, and J. Sussman. Structure and Interpretation of Computer Programs. MIT Press, 1996.
J. Aczél. Measuring information beyond communication theory — why some generalized information measures may be useful, others not. Aequationes Mathematicae, 27:1–19, 1984.
M.A. Aizerman, E.M. Braverman, and L.I. Rozonoér. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.
S.M. Ali and S.D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), 28(1):131–142, 1966.
Y. Altun and A. Smola. Unifying divergence minimization and statistical inference via convex duality. In Proceedings of the Annual Conference on Learning Theory (COLT), pages 139–153, 2006.
P. Antosik, J. Mikusinski, and R. Sikorski. Theory of Distributions: The Sequential Approach. American Elsevier, 1973.
W.R. Ashby. An Introduction to Cybernetics. Chapman and Hall, 1956.
G.A. Babich and O.I. Camps. Weighted Parzen windows for pattern classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(5):567–570, 1996.
F.R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–1741, 2006.
C.Y. Baldwin and K.B. Clark. Modularity in the design of complex engineering systems. In Dan Braha, Ali Minai, and Yaneer Bar-Yam, editors, Complex Engineered Systems: Science Meets Technology. Springer, 2006a.
C.Y. Baldwin and K.B. Clark. Between 'knowledge' and 'the economy': Notes on the scientific study of designs. In D. Foray and B. Kahin, editors, Advancing Knowledge and the Knowledge Economy, pages 299–328, Cambridge, Mass., 2006b. MIT Press.
A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005a.
A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005b.
V. Barnett. Comparative Statistical Inference. John Wiley and Sons, Chichester, 3rd edition, 1999.
P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3):434–452, 1996.
P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385–2404, 2000.
H.H. Bauschke, R. Goebel, Y. Lucet, and X. Wang. The proximal average: Basic theory. SIAM Journal on Optimization, 19:766–785, 2008.
M.J. Bayarri and J.O. Berger. The interplay of Bayesian and frequentist analysis. Statistical Science, 19(1):58–80, 2004.
M. Ben-Bassat. ε-equivalence of feature selection rules. IEEE Transactions on Information Theory, 24(6):769–772, 1978.
S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman-Vaughan. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, New York, 1985.
J.O. Berger. Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1):1–32, 2003.
I. Berlin. The Hedgehog and the Fox: An Essay on Tolstoy's View of History. Weidenfeld and Nicolson, London, 1953.
A. Beygelzimer, V. Dani, T. Hayes, J. Langford, and B. Zadrozny. Error limiting reductions between classification tasks. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, 2005.
A. Beygelzimer, J. Langford, and P. Ravikumar. Multiclass classification with filter trees. Preprint, June 2007. URL http://hunch.net/~jl/projects/reductions/mc_to_b/invertedTree.pdf.
A. Beygelzimer, J. Langford, and B. Zadrozny. Machine learning techniques — reductions between prediction quality metrics. In Zhen Liu and Cathy H. Xia, editors, Performance Modeling and Engineering, pages 3–28. Springer US, April 2008. URL http://hunch.net/~jl/projects/reductions/tutorial/paper/chapter.pdf.
P. Bickel and K. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, volume 1. Prentice-Hall, 2nd edition, 2001.
A. Birnbaum. On the foundations of statistical inference: Binary experiments. The Annals of Mathematical Statistics, 32(2):414–435, June 1961.
D. Blackwell. Comparison of experiments. In J. Neyman, editor, Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950, pages 93–102, Berkeley and Los Angeles, 31 July – 12 August 1951. University of California Press.
D. Blackwell. Equivalent comparisons of experiments. The Annals of Mathematical Statistics, 24(2):265–272, 1953.
A. Blass and Y. Gurevich. Algorithms: A quest for absolute definitions. Bulletin of the European Association for Theoretical Computer Science, 81:195–225, 2003.
A. Blass, N. Dershowitz, and Y. Gurevich. When are two algorithms the same? Bulletin of Symbolic Logic, 15(2):145–168, 2009.
F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, 14(3):331–352, 2005.
K.M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Schölkopf, and A.J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14), 2006.
J.M. Borwein and A.S. Lewis. Duality relationships for entropy-like minimization problems. SIAM Journal on Control and Optimization, 29(2):325–338, 1991.
O. Bousquet. Making machine learning more scientific. Blog post, 2006. URL http://ml.typepad.com/machine_learning_thoughts/2006/06/making_machine_.html.
G.C. Bowker. Memory Practices in the Sciences. MIT Press, 2005.
G.C. Bowker and S.L. Star. Sorting Things Out: Classification and its Consequences. MIT Press, Cambridge, Mass., 1999.
L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
M. Broniatowski. Minimum divergence in inference and testing. In Atti della XLII Riunione Scientifica, Università di Bari, 2004. Società Italiana di Statistica.
M. Broniatowski and A. Keziou. Parametric estimation and tests through divergences and the duality technique. Journal of Multivariate Analysis, 100(1):16–36, 2009.
L.D. Brown and M.G. Low. Asymptotic equivalence of nonparametric regression and white noise. Annals of Statistics, 24(6):2384–2398, 1996.
L.D. Brown and L. Zhao. Direct asymptotic equivalence of nonparametric regression and the infinite dimensional location problem. Preprint, 2003. URL http://www-stat.wharton.upenn.edu/~lzhao/papers/.
L.D. Brown, T.T. Cai, M.G. Low, and C.H. Zhang. Asymptotic equivalence theory for nonparametric regression with random design. Annals of Statistics, 30(3):688–707, 2002.
H.D. Brunk, G.M. Ewing, and W.R. Utz. Minimizing integrals in certain classes of monotone functions. Pacific Journal of Mathematics, 7(1):833–847, 1957.
A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005.
S. Canu and A. Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, 2006.
V.J. Carey, J. Mar, and R. Gentleman. MLInterfaces: Towards uniform behaviour of machine learning tools in R. Technical report, Harvard University, 2007.
A.V. Carter. Deficiency distance between multinomial and multivariate normal experiments. Annals of Statistics, 30(3):708–730, 2002.
Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
M. Cepedello-Boiso. On regularization in superreflexive Banach spaces by infimal convolution formulas. Studia Mathematica, 129(3):265–284, 1998.
P. Chaudhuri and W.Y. Loh. Nonparametric estimation of conditional quantiles using quantile regression trees. Bernoulli, 8(5):561–576, 2002.
G. Choquet. Theory of capacities. Annales de l'institut Fourier, 5(54):131–295, 1953.
W.J. Conover and R.L. Iman. Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35(3):124–129, 1981.
C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems, 16, 2004.
C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
D. Cossock and T. Zhang. Subset ranking using regression. In G. Lugosi and H.U. Simon, editors, Proceedings of the Conference on Learning Theory (COLT) 2006, LNAI 4005, pages 605–619, 2006.
D. Crisan and A. Doucet. A survey of convergence results on particle filtering methods for practitioners. IEEE Transactions on Signal Processing, 50(3):736–746, 2002.
D.J. Crisp and C.J.C. Burges. A geometric interpretation of ν-SVM classifiers. In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 244–250. MIT Press, 2000.
I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2:299–318, 1967.
I. Csiszár. Information measures: A critical survey. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 73–86, Dordrecht, 1978. D. Reidel.
I. Csiszár. Generalized projections for non-negative functions. Acta Mathematica Hungarica, 68(1):161–186, March 1995.
A. Cuevas and R. Fraiman. A plug-in approach to support estimation. The Annals of Statistics, 25(6):2300–2312, 1997.
A.P. Dawid. The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59(1):77–93, March 2007.
M.H. DeGroot. Uncertainty, information, and sequential experiments. The Annals of Mathematical Statistics, 33(2):404–419, 1962.
M.H. DeGroot. Optimal Statistical Decisions. McGraw-Hill Book Company, New York, 1970.
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.
P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, New York, NY, USA, 1999. ACM Press.
P. Domingos and M. Richardson. Markov logic: A unifying framework for statistical relational learning. In Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields, pages 49–54, 2004.
S.S. Dragomir, V. Gluščević, and C.E.M. Pearce. Csiszár f-divergence, Ostrowski's inequality and mutual information. Nonlinear Analysis, 47:2375–2386, 2001.
C. Drummond. Discriminative vs. generative classifiers for cost sensitive learning. In Proceedings of the Nineteenth Canadian Conference on Artificial Intelligence, LNAI 4013, pages 479–490, Quebec, Canada, 2006.
C. Drummond and R.C. Holte. Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1):95–130, 2006.
R.M. Dudley. Real Analysis and Probability. Cambridge University Press, 2nd edition, 2003.
S. Eguchi. Information geometry and statistical pattern recognition. To appear in Sugaku Expositions, American Mathematical Society, 2005.
S. Eguchi and J. Copas. Recent developments in discriminant analysis from an information geometric point of view. Journal of the Korean Statistical Society, 30:247–264, 2001.
T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31, 2004.
T. Fawcett. ROC graphs with instance-varying costs. Pattern Recognition Letters, 27(8):882–891, 2006.
U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Knowledge discovery and data mining: Towards a unifying framework. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 82–88. AAAI Press, 1996.
A.A. Fedotov, P. Harremoës, and F. Topsøe. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491–1498, June 2003.
D. Feldman and F. Österreicher. A note on f-divergences. Studia Scientiarum Mathematicarum Hungarica, 24:191–200, 1989.
P.A. Flach. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Proceedings of the 20th International Conference on Machine Learning (ICML'03), pages 194–201, 2003.
P.A. Flach and S. Wu. Repairing concavities in ROC curves. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-2005), page 702, 2005.
L. Floridi. Open problems in the philosophy of information. Metaphilosophy, 35(4):554–582, 2004.
F.G. Friedlander. Introduction to the Theory of Distributions. Cambridge University Press, 1982.
M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman & Co., New York, NY, USA, 1979.
J.K. Gershenson, G.J. Prasad, and Y. Zhang. Product modularity: Definitions and benefits. Journal of Engineering Design, 14:295–313, 2003.
M. Ghallab, A. Howe, C. Knoblock, D. McDermott, A. Ram, M. Veloso, D. Weld, and D. Wilkins. PDDL specification, 1998. AIPS-98 Planning Committee.
A.L. Gibbs and F.E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002.
G.L. Gilardoni. On Pinsker's type inequalities and Csiszár's f-divergences. Part I: Second and fourth-order inequalities. arXiv:cs/0603097v2, April 2006a.
G.L. Gilardoni. On the minimum f-divergence for a given total variation. Comptes Rendus Académie des Sciences, Paris, Series 1, 343, 2006b.
G.L. Gilardoni. On the relationship between symmetric f-divergence and total variation and an improved Vajda's inequality. Preprint, Departamento de Estatística, Universidade de Brasília, April 2006c. URL http://www.unb.br/ie/est/docentes/curriculo/gustavo_homepage/artigos/improved.pdf.
T. Gneiting and A.E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, March 2007.
P.K. Goel and M.H. DeGroot. Comparison of experiments and information measures. The Annals of Statistics, 7(5):1066–1077, 1979.
P.K. Goel and J. Ginebra. When is one experiment 'always better than' another? Journal of the Royal Statistical Society Series D (The Statistician), 52(4):515–537, 2003.
P.W. Goldberg. When can two unsupervised learners achieve PAC separation? In Proceedings of the 14th Annual Conference on Computational Learning Theory (COLT), pages 303–319, 2001.
S.A. Goldman, R.L. Rivest, and R.E. Schapire. Learning binary relations and total orders. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 46–51, 1989.
J.D.J. Golic. On the relationship between the information measures and the Bayes probability of error. IEEE Transactions on Information Theory, 33(5):681–693, 1987.
A. Gretton, K.M. Borgwardt, M. Rasch, B. Schölkopf, and A.J. Smola. A kernel method for the two-sample problem. Journal of Machine Learning Research, 2008. Submitted.
R.L. Grossman, M.F. Hornick, and G. Meyer. Data mining standards initiatives. Communications of the ACM, 45(8):59–61, 2002.
P.D. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.
P.D. Grünwald and A.P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004.
C. Gutenbrunner. On applications of the representation of f-divergences as averaged minimal Bayesian risk. In Transactions of the 11th Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, pages 449–456, Dordrecht; Boston, 1990. Kluwer Academic Publishers.
D.J. Hand. Deconstructing statistical questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3):317–356, 1994.
D.J. Hand. Comment on "The skill-plot: A graphical technique for evaluating continuous diagnostic tests". Biometrics, 63:259, 2008.
D.J. Hand and R.J. Till. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2):171–186, 2001.
D.J. Hand and V. Vinciotti. Local versus global models for classification problems: Fitting models where it matters. The American Statistician, 57(2):124–131, 2003.
J.A. Hanley and B.J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.
P. Harremoës. Time and Conditional Independence. PhD thesis, Roskilde University, 1993. Original in Danish, entitled Tid og Betinget Uafhængighed; English translation partially available at http://www.math.ku.dk/~moes/index.html.
R. Herbrich. Learning Kernel Classifiers. MIT Press, Cambridge, MA, 2002.
H. Heyer. Theory of Statistical Experiments. Springer, 1982.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms: Part II: Advanced Theory and Bundle Methods. Springer, Berlin, 1993a.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms: Part I: Fundamentals. Springer, Berlin, 1993b.
J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, Berlin, 2001.
J.-B. Hiriart-Urruty and J.-E. Martínez-Legaz. Convex solutions of a functional equation arising in information theory. Journal of Mathematical Analysis and Applications, 328:1309–1320, 2007.
W. James. A Pluralistic Universe. Longmans, Green, and Co., London, 1909. Hibbert Lectures at Manchester College on the Present Situation in Philosophy.
R. Jenssen. An Information Theoretic Approach to Machine Learning. Doctor Scientiarum thesis, Department of Physics, Faculty of Science, University of Tromsø, 2005a.
R. Jenssen. An Information Theoretic Approach to Machine Learning. PhD thesis, Department of Physics, University of Tromsø, Norway, May 2005b.
R. Jenssen, D. Erdogmus, J.C. Príncipe, and T. Eltoft. Towards a unification of information theoretic learning and kernel methods. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (MLSP2004), pages 93–102, 2004.
R. Jenssen, D. Erdogmus, J.C. Príncipe, and T. Eltoft. Some equivalences between kernel methods and information theoretic methods. Journal of VLSI Signal Processing, 45(1):49–65, 2006. (Paper 3 of Jenssen (2005a).)
D.S. Johnson. NP-Completeness Columns. Journal of Algorithms; ACM Transactions on Algorithms, 2–13; 1–3, 1982–1992; 2005–2007. URL http://www.research.att.com/~dsj/columns/.
M.I. Jordan. Learning in Graphical Models. MIT Press, 1999.
T. Kailath. The divergence and Bhattacharyya distance measures in signal selection. IEEE Transactions on Communications, 15(1):52–60, 1967.
A. Keziou. Dual representations of φ-divergences and applications. Comptes Rendus Académie des Sciences, Paris, Series 1, 336:857–862, 2003a.
A. Keziou. Utilisation des Divergences entre Mesures en Statistique Inférentielle [On the use of divergences between measures in statistical inference]. PhD thesis, Université Paris 6, 2003b.
M. Khosravifard, D. Fooladivanda, and T.A. Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E90-A(9):1848–1853, 2007.
J. Kiefer. The foundations of statistics — are there any? Synthese, 36:161–176, 1977.
J.C. Kiefer. Introduction to Statistical Inference. Springer-Verlag, New York, 1987.
D.E. Knuth. Two notes on notation. American Mathematical Monthly, pages 403–422, 1992.
F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
S. Kullback. Lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, 13:126–127, 1967. Correction: volume 16, p. 652, September 1970.
P. Kumar and S. Chhina. A symmetric information divergence measure of the Csiszár's f-divergence class and its bounds. Computers and Mathematics with Applications, 49:575–588, 2005.
N. Lambert, J. Langford, J. Wortman, Y. Chen, D. Reeves, Y. Shoham, and D. Pennock. Self-financed wagering mechanisms for forecasting. In Lance Fortnow, John Riedl, and Tuomas Sandholm, editors, Proceedings of the ACM Conference on Electronic Commerce, pages 170–179, 2008.
J. Langford. Machine learning reductions tutorial. Slides presented at the Machine Learning Summer School, July 2006.
J. Langford. Alternative machine learning reductions definitions. Machine Learning (Theory), March 2007. URL http://hunch.net/?p=255.
J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Proceedings of the 18th Annual Conference on Learning Theory (COLT-2005), pages 158–172. Springer, 2005.
J. Langford and B. Zadrozny. Estimating class membership probabilities using classifier learners. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS'05), 2005.
J. Langford, R. Oliveira, and B. Zadrozny. Predicting conditional quantiles via reduction to classification. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.
J.A. Lasserre, C.M. Bishop, and T.P. Minka. Principled hybrids of generative and discriminative models. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Volume 1, pages 87–94. IEEE Computer Society, Washington, DC, USA, 2006.
L. Le Cam. Sufficiency and approximate sufficiency. The Annals of Mathematical Statistics, 35(4):1419–1455, 1964.
L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.
L. Li and H.-T. Lin. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems 19, pages 865–873, 2007.
F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, New York, 2008.
F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394–4412, 2006.
D.V. Lindley. On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, 27(4):986–1005, 1956.
J. Lindström. On the origin and early history of functional analysis. Technical Report U.U.D.M. Project Report 2008:1, Uppsala Universitet, January 2008.
J. Locke. An Essay Concerning Human Understanding. Thomas Basset, London, 1690.
P.M. Long and R.A. Servedio. Discriminative learning can succeed where generative learning fails. In G. Lugosi and H.U. Simon, editors, Proceedings of the Conference on Learning Theory (COLT), LNAI 4005, pages 319–334, Berlin, 2006. Springer-Verlag.
P.M. Long, R.A. Servedio, and H.U. Simon. Discriminative learning can succeed where generative learning fails. Preprint (correction to Long and Servedio (2006)), November 2006.
R. Lutz and M. Musio. An alternative mathematical foundation for statistics. Acta Applicandae Mathematicae: An International Survey Journal on Applying Mathematics and Mathematical Applications, 89(1):217–249, 2005.
C. Lyell. Principles of Geology. John Murray, London, 1830.
R.A. Maxion and R.R. Roberts. Proper use of ROC curves in intrusion/anomaly detection. Technical Report CS-TR-871, School of Computing Science, University of Newcastle upon Tyne, November 2004.
E. McDermott and S. Katagiri. A Parzen window based derivation of minimum classification error from the theoretical Bayes classification risk. In Proceedings of the International Conference on Spoken Language Processing, pages 2465–2468, 2002.
E. McDermott and S. Katagiri. A new formalization of minimum classification error using a Parzen estimate of classification chance. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2003. Paper 4406.
T. Minka. Discriminative models, not discriminative training. Technical Report MSR-TR-2005-144, Microsoft Research, Cambridge, October 2005.
J. Mokyr. The Lever of Riches: Technological Creativity and Economic Progress. Oxford University Press, USA, 1992.
N. Morse and R. Sacksteder. Statistical isomorphism. The Annals of Mathematical Statistics, 37(1):203–214, 1966.
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443, 1997a.
A. Müller. Stochastic orders generated by integrals: A unified study. Advances in Applied Probability, 29:414–428, 1997b.
E. Nelson. Radically Elementary Probability Theory. Princeton University Press, 1987.
J. Neyman and E.S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character (1896-1934), 231:289–337, 1933. URL http://dx.doi.org/10.1098/rsta.1933.0009.
A.Y. Ng and M.I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naïve Bayes. In Advances in Neural Information Processing Systems (NIPS), volume 15, 2002.
X. Nguyen, M.J. Wainwright, and M.I. Jordan. On distance measures, surrogate loss functions, and distributed detection. Technical Report 695, Department of Statistics, University of California, Berkeley, October 2005.
X. Nguyen, M.J. Wainwright, and M.I. Jordan. On surrogate loss functions and f-divergences. Annals of Statistics, 37:876–904, 2009.
D. Noble. The Music of Life: Biology Beyond the Genome. Oxford University Press, 2006.
F. Österreicher. f-divergences — representation theorem and metrizability. Technical report, Institute of Mathematics, University of Salzburg, September 2003.
F. Österreicher and D. Feldman. Divergenzen von Wahrscheinlichkeitsverteilungen — Integralgeometrisch Betrachtet [Divergences of probability distributions — an integral-geometric view]. Acta Mathematica Academiae Scientiarum Hungarica, 37(4):329–337, 1981.
F. Österreicher and I. Vajda. Statistical information and discrimination. IEEE Transactions on Information Theory, 39(3):1036–1039, 1993.
N. Palmer and P.W. Goldberg. PAC classification based on PAC estimates of label class distributions. arXiv preprint cs.LG/0607047, 2006.
A.R.C. Paiva, J.-W. Xu, and J.C. Príncipe. Kernel principal components are maximum entropy projections. In ICA 2006, Springer Lecture Notes in Computer Science 3889, pages 846–853, 2006.
R.R. Phelps. Lectures on Choquet's Theorem, volume 1757 of Lecture Notes in Mathematics. Springer, 2nd edition, 2001.
M.S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.
D. Pollard. Some thoughts on Le Cam's statistical decision theory. Preprint, Statistics Department, Yale University, May 2000. URL http://www.stat.yale.edu/~pollard/Papers/thoughts.pdf.
H.V. Poor and J.B. Thomas. Applications of Ali-Silvey distance measures in the design of generalized quantizers for binary decision systems. IEEE Transactions on Communications, 25(9):893–900, September 1977.
J.C. Principe, D. Xu, and J. Fisher. Information theoretic learning. In Simon Haykin, editor, Unsupervised Adaptive Filtering: Volume 1, Blind Source Separation, pages 265–319, New York, 2000a. Wiley.
J.C. Principe, D. Xu, Q. Zhao, and J.W. Fisher III. Learning from examples with information theoretic criteria. The Journal of VLSI Signal Processing, 26(1–2):61–77, 2000b.
S.T. Rachev. Probability Metrics and the Stability of Stochastic Models. Wiley Series in Probability and Mathematical Statistics. Wiley, Chichester, 1991.
S. Raudys. Statistical and Neural Classifiers: An Integrated Approach to Design, chapter Taxonomy of Pattern Classification Algorithms. Springer, London, 2001.
M.D. Reid and R.C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the International Conference on Machine Learning, pages 897–904, 2009.
M.D. Reid and R.C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, 2010.
M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107–136, 2006.
J. Rissanen. Information and Complexity in Statistical Modelling. Springer, New York, 2007.
C.P. Robert. The Bayesian Choice: A Decision-Theoretic Motivation. Springer, New York, 1994.
R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.
R.T. Rockafellar and R.J.-B. Wets. Variational Analysis. Springer-Verlag, Berlin, 2004.
Y.D. Rubinstein and T. Hastie. Discriminative vs informative learning. In Knowledge Discovery and Data Mining, pages 49–53. AAAI Press, 1997.
R. Sacksteder. A note on statistical equivalence. The Annals of Mathematical Statistics, 38(3):787–794, 1967.
L.J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
A.H. Sayed and T. Kailath. A survey of spectral factorization methods. Numerical Linear Algebra with Applications, 8(6-7):467–496, 2001.
M.J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 17(4):1856–1879, 1989.
B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.
C. Scott. Calibrated surrogate losses for classification with label-dependent costs. arXiv:1009.2718v1, September 2010.
C. Scott and M. Davenport. Regression level set estimation via cost-sensitive classification. IEEE Transactions on Signal Processing, 55(6, Part 1):2752–2757, 2007.
G. Shafer and V. Vovk. Probability and Finance: It's Only a Game! J. Wiley & Sons, 2001.
C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423; 623–656, 1948.
Y. Shen. Loss Functions for Binary Classification and Class Probability Estimation. PhD thesis, Department of Statistics, University of Pennsylvania, October 2005.
E. Shuford, A. Albert, and H.E. Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125–145, June 1966.
L.N. Soldatova and R.D. King. An ontology of scientific experiments. Journal of the Royal Society Interface, 3(11):795–803, 2006.
L. Song, M.D. Reid, A.J. Smola, and R.C. Williamson. Discriminative estimation of f-divergence. Submitted to AISTATS 2009, October 2008.
B.K. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. On integral probability metrics, φ-divergences and binary classification. arXiv preprint arXiv:0901.2698v4, October 2009.
I. Steinwart. How to compare different loss functions and their risks. Preprint, Modeling, Algorithms and Informatics Group, CCS-3, Los Alamos National Laboratory, September 2006.
I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, August 2007.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.
I. Steinwart, D. Hush, and C. Scovel. Density level detection is classification. Advances in Neural Information Processing Systems, 17, 2005.
H. Strasser. Mathematical Theory of Statistics. Walter de Gruyter, Berlin, 1985.
H. Strasser. Reduction of complexity. In Josef A. Mazanec and Helmut Strasser, editors, A Nonparametric Approach to Perceptions-Based Market Segmentation: Foundations, pages 99–140, Wien, 2000. Springer.
I.J. Taneja. Refinement inequalities among symmetric divergence measures. arXiv:math/0501303v2, April 2005a.
I.J. Taneja. Bounds on non-symmetric divergence measures in terms of symmetric divergence measures. arXiv:math.PR/0506256v1, 2005b.
D. Tasche. Conditional expectation as quantile derivative. arXiv preprint math.PR/0104190, 2001.
W.B. Temple. Stieltjes integral representation of convex functions. Duke Mathematical Journal, 21:527–531, 1954.
N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
S. Tong and D. Koller. Restricted Bayes optimal classifiers. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), pages 658–664, 2000.
F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.
F. Topsøe. Bounds for entropy and divergence for distributions over a two-element set. Journal of Inequalities in Pure and Applied Mathematics, 2(2), 2001.
F. Topsøe. Between truth and description. Presented at Prague Stochastics 2006, June 2006.
E.N. Torgersen. Measures of information based on comparison with total information and with total ignorance. The Annals of Statistics, 9(3):638–657, 1981.
E.N. Torgersen. Comparison of Statistical Experiments. Cambridge University Press, 1991.
G.T. Toussaint. Probability of error, expected divergence and the affinity of several distributions. IEEE Transactions on Systems, Man and Cybernetics, 8:482–485, 1978.
S. Turkle and S. Papert. Epistemological pluralism and the revaluation of the concrete. Journal of Mathematical Behavior, 11(1):3–33, March 1992.
P. Turney. Types of cost in inductive concept learning. In Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, pages 15–21, 2000.
A. Unterreiter, A. Arnold, P. Markowich, and G. Toscani. On generalized Csiszár-Kullback inequalities. Monatshefte für Mathematik, 131:235–253, 2000.
I. Vajda. Note on discrimination and variation. IEEE Transactions on Information Theory, 16:771–773, 1970.
A. van der Vaart. The statistical work of Lucien Le Cam. Annals of Statistics, 30(3):631–682, 2002.
V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 2nd edition, 2006.
V.N. Vapnik. Inductive principles of the search for empirical dependences (methods based on weak convergence of probability measures). In COLT '89: Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 3–21, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.
H.R. Varian. Innovation, components and complements. University of California, Berkeley, October 2003. URL http://www.almaden.ibm.com/coevolution/pdf/varian_paper.pdf.
V. Vovk, A. Gammerman, and G. Shafer. Algorithmic Learning in a Random World. Springer, 2005.
A. Wald. Statistical decision functions. The Annals of Mathematical Statistics, 20(2):165–205, June 1949.
A. Wald. Statistical Decision Functions. John Wiley & Sons, New York, 1950.
M.L. Weitzman. Recombinant growth. Quarterly Journal of Economics, 113(2):331–360, 1998.
D. Wettschereck and S. Muller. Exchanging data mining models with the predictive modelling markup language. In International Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning, 2001.
Wikipedia. Grothendieck's relative point of view. Wikipedia, September 2007. URL http://en.wikipedia.org/wiki/Grothendieck's_relative_point_of_view. Accessed 17/09/2007.
R. Winkler, J. Muñoz, J. Cervera, J. Bernardo, G. Blattenberger, J. Kadane, D. Lindley, A. Murphy, R. Oliver, and D. Ríos-Insua. Scoring rules and the evaluation of probabilities. TEST, 5(1):1–60, March 1996.
L. Withers. Some inequalities relating different measures of divergence between two probability distributions. IEEE Transactions on Information Theory, 45(5):1728–1735, 1999.
A.P. Worthen and W.E. Stark. Unified design of iterative receivers using factor graphs. IEEE Transactions on Information Theory, 47(2):843–849, 2001.
J. Xie and C.E. Priebe. A weighted generalization of the Mann-Whitney-Wilcoxon statistic. Journal of Statistical Planning and Inference, 102(2):441–466, 2002.
J.-W. Xu, D. Erdogmus, R. Jenssen, and J.C. Príncipe. An information-theoretic perspective to kernel independent component analysis. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP'05), volume 5, pages 249–252, 2005.
G.L. Yang and L. Le Cam. A conversation with Lucien Le Cam. Statistical Science, 14(2):223–241, 1999.
B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, pages 435–442, 2003.
J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004a.
J. Zhang and H. Matsuzoe. Dualistic Riemannian manifold structure induced from convex functions. Advances in Mechanics and Mathematics, 17:437–464, 2009.
T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56–134, 2004b.
V.M. Zolotarev. Probability metrics. Theory of Probability and its Applications, 28(2):278–302, 1984.