
25 jmlr-2010-Composite Binary Losses


Source: pdf

Author: Mark D. Reid, Robert C. Williamson

Abstract: We study losses for binary classification and class probability estimation and extend the understanding of them from margin losses to general composite losses which are the composition of a proper loss with a link function. We characterise when margin losses can be proper composite losses, explicitly show how to determine a symmetric loss in full from half of one of its partial losses, introduce an intrinsic parametrisation of composite binary losses and give a complete characterisation of the relationship between proper losses and “classification calibrated” losses. We also consider the question of the “best” surrogate binary loss. We introduce a precise notion of “best” and show there exist situations where two convex surrogate losses are incommensurable. We provide a complete explicit characterisation of the convexity of composite binary losses in terms of the link function and the weight function associated with the proper loss which make up the composite loss. This characterisation suggests new ways of “surrogate tuning” as well as providing an explicit characterisation of when Bregman divergences on the unit interval are convex in their second argument. Finally, in an appendix we present some new algorithm-independent results on the relationship between properness, convexity and robustness to misclassification noise for binary losses and show that all convex proper losses are non-robust to misclassification noise.

Keywords: surrogate loss, convexity, probability estimation, classification, Fisher consistency, classification-calibrated, regret bound, proper scoring rule, Bregman divergence, robustness, misclassification noise
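
To make the abstract's central construction concrete, here is a minimal worked illustration; the notation (\lambda for the proper loss, \psi for the link) is assumed for the sketch rather than copied from the paper. A composite loss scores a real-valued prediction v by mapping it back to a probability through the inverse link and then applying the proper loss:

  \ell(y, v) = \lambda\bigl(y, \psi^{-1}(v)\bigr), \qquad \hat{p} := \psi^{-1}(v) \in [0, 1].

For example, composing the (proper) log loss \lambda(+1, \hat{p}) = -\log \hat{p}, \ \lambda(-1, \hat{p}) = -\log(1 - \hat{p}) with the logit link \psi(\hat{p}) = \log\bigl(\hat{p}/(1 - \hat{p})\bigr), so that \psi^{-1}(v) = 1/(1 + e^{-v}), recovers the familiar logistic margin loss \ell(y, v) = \log(1 + e^{-yv}) for y \in \{-1, +1\}.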


reference text

J.D. Abernethy, A. Agarwal, P.L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. March 2009. URL http://arxiv.org/abs/0903.5328.

J. Aczél and J. Pfanzagl. Remarks on the measurement of subjective probability and information. Metrika, 11(1):91–105, December 1967.

F.R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–1741, 2006.

D. Bainov and P. Simeonov. Integral Inequalities and Applications. Kluwer, Dordrecht, 1992.

A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705–1749, 2005.

P.L. Bartlett, B. Schölkopf, D. Schuurmans, and A.J. Smola, editors. Advances in Large Margin Classifiers. MIT Press, 2000.

P.L. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. The Journal of Machine Learning Research, 8:775–790, 2007.

P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.

H.H. Bauschke and J.M. Borwein. Joint and separate convexity of the Bregman distance. In Dan Butnariu, Yair Censor, and Simeon Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, volume 8 of Studies in Computational Mathematics, pages 23–36. North-Holland, 2001.

A. Beygelzimer, J. Langford, and B. Zadrozny. Machine learning techniques — reductions between prediction quality metrics. In Z. Liu and C.H. Xia, editors, Performance Modeling and Engineering, pages 3–28. Springer US, April 2008. URL http://hunch.net/~jl/projects/reductions/tutorial/paper/chapter.pdf.

A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, November 2005.

P.F. Christoffersen and F.X. Diebold. Optimal prediction under asymmetric loss. Econometric Theory, 13(6):808–817, 1997.

I. Cohen and M. Goldszmidt. Properties and benefits of calibrated classifiers. Technical Report HPL-2004-22(R.1), HP Laboratories, Palo Alto, July 2004.

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, volume 17, pages 973–978, 2001.

S. Fidler, D. Skocaj, and A. Leonardis. Combining reconstructive and discriminative subspace methods for robust classification and regression by subsampling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):337–350, 2006.

Y. Freund. A more robust boosting algorithm. arXiv:0905.2138v1 [stat.ML], May 2009. URL http://arxiv.org/abs/0905.2138.

T. Gneiting. Evaluating point forecasts. arXiv:0912.0902v1, December 2009.

T. Gneiting and A.E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, March 2007.

C.W.J. Granger and M.J. Machina. Forecasting and decision theory. In G. Elliott, C.W.J. Granger, and A. Timmermann, editors, Handbook of Economic Forecasting, volume 1, pages 82–98. North-Holland, Amsterdam, 2006.

P.D. Grünwald and A.P. Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367–1433, 2004.

D.J. Hand. Deconstructing statistical questions. Journal of the Royal Statistical Society. Series A (Statistics in Society), 157(3):317–356, 1994.

D.J. Hand and V. Vinciotti. Local versus global models for classification problems: Fitting models where it matters. The American Statistician, 57(2):124–131, 2003.

D.P. Helmbold, J. Kivinen, and M.K. Warmuth. Relative loss bounds for single neurons. IEEE Transactions on Neural Networks, 10:1291–1304, 1999.

J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, Berlin, 2001.

P.J. Huber. Robust Statistics. Wiley, New York, 1981.

Y. Kalnishkan, V. Vovk, and M.V. Vyugin. Loss functions, complexities, and the Legendre transformation. Theoretical Computer Science, 313(2):195–207, 2004.

Y. Kalnishkan, V. Vovk, and M.V. Vyugin. Generalised entropy and asymptotic complexities of languages. In Learning Theory, volume 4539 of Lecture Notes in Computer Science, pages 293–307. Springer, 2007.

M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, November 1998.

J. Kivinen and M.K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45:301–329, 2001.

D.E. Knuth. Two notes on notation. American Mathematical Monthly, pages 403–422, 1992.

N. Lambert, D. Pennock, and Y. Shoham. Eliciting properties of probability distributions. In Proceedings of the ACM Conference on Electronic Commerce, pages 129–138, 2008.

J. Langford and B. Zadrozny. Estimating class membership probabilities using classifier learners. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTAT’05), 2005.

Y. Lin. A note on margin-based loss functions in classification. Technical Report 1044, Department of Statistics, University of Wisconsin, Madison, February 2002.

P.M. Long and R.A. Servedio. Random classification noise defeats all convex potential boosters. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, ICML, pages 608–615, 2008. doi: 10.1145/1390156.1390233.

H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: Theory, robustness to outliers, and SavageBoost. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1049–1056. 2009.

P. McCullagh and J.A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.

R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009a. To appear.

R. Nock and F. Nielsen. On the efficient minimization of classification calibrated surrogates. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1201–1208. MIT Press, 2009b.

J. Platt. Probabilities for SV machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–71. MIT Press, 2000.

F. Provost and T. Fawcett. Robust classification for imprecise environments. Machine Learning, 42(3):203–231, 2001.

M.D. Reid and R.C. Williamson. Surrogate regret bounds for proper losses. In Proceedings of the International Conference on Machine Learning, pages 897–904, 2009a.

M.D. Reid and R.C. Williamson. Information, divergence and risk for binary experiments. arXiv preprint arXiv:0901.0356v1, January 2009b.

R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

L.J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.

M.J. Schervish. A general method for comparing probability assessors. The Annals of Statistics, 17(4):1856–1879, 1989.

B. Schölkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.

Y. Shen. Loss Functions for Binary Classification and Class Probability Estimation. PhD thesis, Department of Statistics, University of Pennsylvania, October 2005.

E. Shuford, A. Albert, and H.E. Massengill. Admissible probability measurement procedures. Psychometrika, 31(2):125–145, June 1966.

C.-A. S. Staël von Holstein. Assessment and evaluation of subjective probability distributions. Economic Research Institute, Stockholm School of Economics, Stockholm, 1970.

I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, August 2007.

I. Steinwart. Two oracle inequalities for regularized boosting classifiers. Statistics and Its Interface, 2:271–284, 2009.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

T.B. Trafalis and R.C. Gilbert. Robust classification and regression using support vector machines. European Journal of Operational Research, 173(3):893–909, 2006.

A. Zellner. Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association, 81(394):446–451, June 1986.

J. Zhang. Divergence function, duality, and convex analysis. Neural Computation, 16(1):159–195, 2004a.

T. Zhang. Statistical behaviour and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–134, 2004b.

Z. Zhang, M.I. Jordan, W.J. Li, and D.Y. Yeung. Coherence functions for multicategory margin-based classification methods. In Proceedings of the Twelfth Conference on Artificial Intelligence and Statistics (AISTATS), 2009.