
42 jmlr-2007-Infinitely Imbalanced Logistic Regression


Author: Art B. Owen

Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a nontrivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems.

Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
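The behavior described in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's code: it assumes a one-dimensional predictor, a common class whose predictor distribution is standard normal, and four made-up rare-class points. The function name `fit_imbalanced_logistic`, the quadrature grid, and the Newton solver are all illustrative choices. The N common-class observations are replaced by their expectation under N(0, 1), approximated on a grid, so the example is deterministic. Because exponentially tilting N(0, 1) by exp(beta * x) gives N(beta, 1), one would expect the slope under these assumptions to settle near the rare-class sample mean (1.25 here) while the intercept keeps decreasing as N grows.

```python
import numpy as np

def fit_imbalanced_logistic(x_rare, N, iters=50):
    """Fit (alpha, beta) for p(x) = 1 / (1 + exp(-(alpha + beta * x))).

    Maximizes sum_i log p(x_i) + N * E[log(1 - p(Z))], Z ~ N(0, 1),
    i.e. the rare points are y = 1 and the common class (y = 0) is
    represented by its assumed distribution with total weight N.
    """
    x_rare = np.asarray(x_rare, dtype=float)
    n = x_rare.size

    # Quadrature nodes and weights standing in for N draws from N(0, 1).
    z = np.linspace(-10.0, 10.0, 4001)
    dz = z[1] - z[0]
    w0 = N * np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi) * dz

    x = np.concatenate([x_rare, z])
    y = np.concatenate([np.ones(n), np.zeros(z.size)])
    w = np.concatenate([np.ones(n), w0])
    X = np.column_stack([np.ones_like(x), x])

    # Start from the intercept that matches the class ratio, zero slope.
    theta = np.array([np.log(n / N), 0.0])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (w * (y - p))                      # weighted score
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X  # weighted information
        step = np.linalg.solve(hess, grad)               # Newton step
        theta = theta + step
        if np.max(np.abs(step)) < 1e-12:
            break
    return theta[0], theta[1]

# Hypothetical rare-class sample; its mean is 1.25.
x_rare = [0.5, 1.0, 1.5, 2.0]
for N in (1e3, 1e4, 1e5, 1e6):
    alpha, beta = fit_imbalanced_logistic(x_rare, N)
    print(f"N={N:>9.0f}  intercept={alpha:8.3f}  slope={beta:.4f}")
```

Using the expectation instead of simulated draws keeps the sketch free of sampling noise; with real data the common-class term would be a sum over observed points, and the limiting slope depends on the actual common-class distribution rather than the normal assumed here.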

reference text

R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255, 2002.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

N. V. Chawla, N. Japkowicz, and A. Kolcz. Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets. 2003.

N. V. Chawla, N. Japkowicz, and A. Kolcz. Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

N. Japkowicz. Learning from Imbalanced Data Sets: Papers from the AAAI Workshop. AAAI, 2000. Technical Report WS-00-05.

G. King and L. Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137–163, 2001.

M. J. Silvapulle. On the existence of maximum likelihood estimates for the binomial response models. Journal of the Royal Statistical Society, Series B, 43:310–313, 1981.

S. Tong. Active learning: Theory and applications. PhD thesis, Stanford University, 2001. URL http://ai.stanford.edu/~stong/research.html/tong thesis.pdf.

M. Zhu, W. Su, and H. A. Chipman. LAGO: A computationally efficient approach for statistical detection. Technometrics, 48:193–205, 2005.