jmlr jmlr2007 jmlr2007-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Art B. Owen
Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, Stanford University, Stanford, CA 94305, USA. Editor: Yi Lin. In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. [sent-4, score-0.171]
2 In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. [sent-5, score-0.411]
3 For logistic regression, the infinitely imbalanced case often has a useful solution. [sent-6, score-0.739]
4 Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. [sent-7, score-0.259]
5 That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. [sent-8, score-0.168]
6 The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. [sent-9, score-0.76]
7 Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data [sent-10, score-0.45]
8 1. Introduction. In many applications of logistic regression one of the two classes is extremely rare. [sent-11, score-0.439]
9 In political science, the occurrence of wars, coups, vetoes and the decisions of citizens to run for office have been modelled as rare events; see King and Zeng (2001). [sent-12, score-0.209]
10 Bolton and Hand (2002) consider fraud detection, and Zhu et al. [sent-13, score-0.189]
11 In other examples the rare event might correspond to people with a rare disease, customer conversions at an e-commerce web site, or false positives among a set of emails marked as spam. [sent-15, score-0.342]
12 We will let Y ∈ {0, 1} denote a random response with the observed value of Y being y = 1 in the rare case and y = 0 in the common case. [sent-16, score-0.2]
13 We will suppose that the number of observations with y = 0 is so large that we have a satisfactory representation of the distribution of predictors in that setting. [sent-17, score-0.143]
14 It is no surprise that the intercept term in the logistic regression typically tends to −∞ in this limit. [sent-19, score-0.651]
15 The main result (Theorem 8 below) is that under reasonable conditions, the intercept term tends to −∞ like −log(N) plus a constant, while the limiting logistic regression coefficient β = β(N) satisfies lim_{N→∞} ∫ x e^{x'β(N)} dF0(x) / ∫ e^{x'β(N)} dF0(x) = x̄. (2) [sent-21, score-0.731]
Equation (2) reminds one of a well-known derivation for logistic regression. [sent-26, score-0.328]
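Equation (2) also suggests a direct way to compute the limiting coefficient when a large sample approximating F0 is available. The sketch below is not code from the paper; the function name and the synthetic data are illustrative. It minimizes the convex function log ∫ e^{(x−x̄)'β} dF0(x), whose gradient is the tilted mean of x − x̄ and therefore vanishes exactly at a solution of (2).

```python
import numpy as np
from scipy.optimize import minimize

def limiting_beta(X0, x_bar):
    """X0: (N, d) sample approximating F0; x_bar: average predictor of the y = 1 cases."""
    Z = X0 - x_bar                     # centered predictors

    def objective(beta):
        # log mean exp((x - x_bar)' beta), computed stably; convex in beta
        t = Z @ beta
        m = t.max()
        return m + np.log(np.mean(np.exp(t - m)))

    def gradient(beta):
        t = Z @ beta
        w = np.exp(t - t.max())
        w /= w.sum()                   # exponential tilting weights
        return Z.T @ w                 # tilted mean of (x - x_bar); zero at a solution of (2)

    return minimize(objective, np.zeros(X0.shape[1]), jac=gradient, method="BFGS").x

# Synthetic check: for F0 = N(0, I) the tilted mean is beta itself, so the limit equals x_bar.
rng = np.random.default_rng(0)
print(limiting_beta(rng.standard_normal((100_000, 2)), np.array([1.0, 0.5])))
```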
17 If the conditional distribution of X given that Y = y is N(µy, Σ) then the coefficient of X in logistic regression is Σ^{-1}(µ1 − µ0). [sent-27, score-0.439]
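As a check of (2) against this classical formula: if F0 = N(µ0, Σ), then tilting by e^{x'β} turns F0 into N(µ0 + Σβ, Σ), so the tilted mean equals x̄ exactly when β = Σ^{-1}(x̄ − µ0); with x̄ standing in for µ1, this matches the coefficient Σ^{-1}(µ1 − µ0) above.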
18 It outlines the results of Silvapulle (1981) who completely characterizes the conditions under which unique logistic regression estimates exist in the finite sample case. [sent-36, score-0.439]
19 (2005), we can replace all data points with y = 1 by a single one at (x̄, 1) with minimal effect on the estimated coefficient, apart from the intercept term. [sent-43, score-0.241]
20 Section 6 discusses how these results can be used in deciding which unlabelled data points to label, and it shows how the infinitely imbalanced setting may lead to computational savings. [sent-44, score-0.458]
21 We conclude this introduction by relating the present work to the literature on imbalanced data. [sent-45, score-0.411]
22 To study this case we use logistic regression on (xi, yi) = (Φ^{-1}((i − 1/2)/N), 0) for i = 1, . . . , N. [sent-56, score-0.439]
23 As N increases the problem becomes more imbalanced and the N points used produce an ever better approximation to the normal distribution. [sent-61, score-0.411]
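The sketch below reproduces an experiment of this kind; placing the single y = 1 observation at x = 1 is an assumption made here, chosen so that Equation (2) with F0 = N(0, 1) gives a limiting coefficient of 1, in line with the behaviour reported in Table 1. It prints α̂ + log(N) and β̂ as N grows.

```python
import numpy as np
from scipy.stats import norm
import statsmodels.api as sm

for N in [100, 1_000, 10_000, 100_000]:
    x0 = norm.ppf((np.arange(1, N + 1) - 0.5) / N)   # y = 0 points at normal quantiles
    x = np.concatenate([x0, [1.0]])                  # one rare case, assumed at x = 1
    y = np.concatenate([np.zeros(N), [1.0]])
    fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0, maxiter=100)
    alpha_hat, beta_hat = fit.params
    print(f"N={N:>6d}  alpha+log(N)={alpha_hat + np.log(N):+.4f}  beta={beta_hat:.4f}")
```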
Table 1: Logistic regression intercept α and coefficient β for imbalanced data described in the text. [sent-77, score-0.734]
25 From this table it seems clear that as N → ∞, the intercept term is diverging like − log(N) while the coefficient of X is approaching the value 1 that we would get from Equation (2). [sent-81, score-0.254]
26 The Cauchy distribution has tails far heavier than the logistic distribution. [sent-87, score-0.423]
27 The heavy tails of the Cauchy distribution make it fail a condition of Theorem 8. [sent-91, score-0.171]
28 We need the point x̄ to be surrounded by the distribution of X given Y = 0 as defined in Section 3. [sent-102, score-0.141]
Table 2: Logistic regression intercept α and coefficient β for imbalanced data described in the text. [sent-123, score-0.734]
Table 3: Logistic regression intercept α and coefficient β for imbalanced data described in the text. [sent-145, score-0.734]
31 The logistic regression model is Pr(Y = 1 | X = x) = e^{α+x'β}/(1 + e^{α+x'β}) for α ∈ R and β ∈ R^d. [sent-163, score-0.439]
The log-likelihood in logistic regression is ∑_{i=1}^{n} [α + x1i'β − log(1 + e^{α+x1i'β})] − ∑_{i=1}^{N} log(1 + e^{α+x0i'β}). [sent-164, score-0.439]
33 With a bit of foresight we also center the logistic regression around the average x̄ = ∑_{i=1}^{n} xi/n of the predictor values for cases with Y = 1. [sent-168, score-0.587]
When the centered log likelihood has an MLE (α̂0, β̂) we can recover the MLE of the uncentered log likelihood easily: β̂ remains unchanged while α̂ in the uncentered version is α̂0 − x̄'β̂. [sent-170, score-0.17]
The numerical examples in Section 2 used uncentered logistic regression. [sent-171, score-0.498]
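A minimal sketch of the centered log-likelihood and of recovering the uncentered intercept; the function names and array conventions here are illustrative, not from the paper.

```python
import numpy as np

def centered_loglik(alpha0, beta, X1, X0):
    """Centered log-likelihood described above; X1: (n, d) rare-class predictors, X0: (N, d) common-class predictors."""
    x_bar = X1.mean(axis=0)
    eta1 = alpha0 + (X1 - x_bar) @ beta
    eta0 = alpha0 + (X0 - x_bar) @ beta
    # log(1 + e^eta) written as logaddexp(0, eta) for numerical stability
    return np.sum(eta1 - np.logaddexp(0.0, eta1)) - np.sum(np.logaddexp(0.0, eta0))

def uncenter(alpha0_hat, beta_hat, X1):
    """beta_hat is unchanged; the uncentered intercept is alpha0_hat - x_bar' beta_hat."""
    return alpha0_hat - X1.mean(axis=0) @ beta_hat, beta_hat
```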
1 Silvapulle's Results. It is well known that the MLE in the usual logistic regression setting can fail to be finite when the x values where y = 1 are linearly separable from those where y = 0. [sent-181, score-0.468]
The existence and uniqueness of MLEs for linear logistic regression have been completely characterized by Silvapulle (1981). [sent-182, score-0.439]
Then the logistic regression model has Pr(Y = 1 | X = x) = exp(z'θ)/(1 + exp(z'θ)), where of course z = z(x) = (1, x')'. [sent-193, score-0.439]
Silvapulle (1981) employs two convex cones: Cj = { ∑_{i=1}^{nj} kji zji : kji > 0 }, j ∈ {0, 1}. [sent-194, score-0.596]
Theorem 1 For data as described above, assume that the (n0 + n1) by (d + 1) matrix with rows taken from the zji for j = 0, 1 and i = 1, . . . , nj has rank d + 1. [sent-195, score-0.173]
If C0 ∩ C1 ≠ ∅ then a unique finite logistic regression MLE θ̂ = (α̂, β̂')' exists. [sent-199, score-0.328]
42 Theorem 1 also holds when the logistic CDF G(t) = exp(t)/(1 + exp(t)) is replaced by the standard normal one (for probit analysis) or by the U(0, 1) CDF. [sent-203, score-0.328]
43 A more readily interpretable condition is that the relative interior (as explained below) of the convex hull of the x’s for y = 0 intersects that for y = 1. [sent-207, score-0.211]
That is, H0 ∩ H1 ≠ ∅, where Hj = { ∑_{i=1}^{nj} λji xji : λji > 0, ∑_{i=1}^{nj} λji = 1 }. [sent-208, score-0.75]
When the xji span R^d then Hj is the interior of the convex hull of the xji. [sent-209, score-0.515]
When the xji lie in a lower dimensional affine subspace of R^d then the interior of their convex hull is the empty set. [sent-210, score-0.342]
47 , n j , then the desired relative interior of the convex hull of x j1 , . [sent-215, score-0.169]
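The overlap condition can also be checked numerically. The routine below is an illustrative sketch, not from the paper: it tests whether x̄ is a convex combination of the y = 0 predictors by solving a linear feasibility problem; the surrounded condition of Section 3 additionally requires x̄ to lie strictly inside, not on the boundary of, that hull.

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(X0, x_bar):
    """Is x_bar a convex combination of the rows of X0?  X0: (N, d), x_bar: (d,)."""
    N = X0.shape[0]
    # Feasibility LP: find lambda >= 0 with X0' lambda = x_bar and sum(lambda) = 1.
    A_eq = np.vstack([X0.T, np.ones((1, N))])
    b_eq = np.concatenate([x_bar, [1.0]])
    res = linprog(c=np.zeros(N), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * N, method="highs")
    return bool(res.success)

rng = np.random.default_rng(3)
X0 = rng.standard_normal((1_000, 2))              # y = 0 predictors
print(in_convex_hull(X0, np.array([0.5, 0.5])))   # well inside: True
print(in_convex_hull(X0, np.array([50.0, 0.0])))  # far outside: False
```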
Then we may write z0 = ∑_{i=1}^{n0} k0i (1, x0i')' = ∑_{i=1}^{n1} k1i (1, x1i')', where each kji > 0. [sent-223, score-0.173]
Let K denote that value, and put λji = kji/K for j = 0, 1 and i = 1, . . . , nj. [sent-225, score-0.346]
Definition 3 The distribution F on R^d has the point x* surrounded if ∫_{(x−x*)'ω > ε} dF(x) > δ (5) holds for some ε > 0, some δ > 0 and all ω ∈ Ω. [sent-237, score-0.141]
51 This implies that having at least one point surrounded by F0 will be enough to avoid rank deficiency. [sent-242, score-0.18]
In Theorem 1 it follows from Lemma 2 that we only need there to be some point x* that is surrounded by both F̂0 and F̂1, where F̂j is the empirical distribution of xj1, . . . , xjnj. [sent-243, score-0.141]
In the infinitely imbalanced setting we expect that F0 will ordinarily surround every single one of x1, . . . , xn. [sent-249, score-0.584]
We do not need F0 to surround them all, but it is not enough to just have some point x* exist that is surrounded by both F0 and F̂1. [sent-253, score-0.258]
We do not need to assume that F̂1 surrounds x̄, a condition that fails when the xi are confined to an affine subset of R^d as they necessarily are for n < d. [sent-255, score-0.19]
There is an interesting case in which F0 can fail to surround x̄. [sent-256, score-0.146]
The predictor X may contain a component that is itself an imbalanced binary variable, and that component might never take the value 1 in the y = 1 sample. [sent-257, score-0.489]
Then x̄ is right on the boundary of the support of F0 and we cannot be sure of a finite β̂ in either the finite sample case or the infinitely imbalanced case. [sent-258, score-0.411]
Let x1, . . . , xn ∈ R^d be given, and assume that the distribution F0 surrounds x̄ = ∑_{i=1}^{n} xi/n and that 0 < N < ∞. [sent-269, score-0.19]
Then the log likelihood of (α, β) given by (4) has a unique finite maximizer (α̂, β̂). [sent-270, score-0.185]
61 Proof: The log likelihood is strictly concave in (α, β). [sent-271, score-0.142]
62 As illustrated in Section 2, infinitely imbalanced logistic regression will be degenerate if F0 has tails that are too heavy. [sent-303, score-0.945]
Let x1, . . . , xn ∈ R^d be fixed and suppose that F0 satisfies the tail condition (11) and surrounds x̄ = ∑_{i=1}^{n} xi/n as described at (5). [sent-309, score-0.268]
Table 4: This table shows logistic regression coefficients for the chemical compound data set described in the text. [sent-340, score-0.472]
The fourth row shows standard errors for the ordinary logistic regression coefficients in the top row. [sent-344, score-0.439]
Therefore the MLEs satisfy lim_{N→∞} ∫ x e^{x'β̂} [1 + e^{α̂+(x−x̄)'β̂}]^{-1} dF0(x) / ∫ e^{x'β̂} [1 + e^{α̂+(x−x̄)'β̂}]^{-1} dF0(x) = x̄. (13) [sent-346, score-0.25]
The denominator of (13) is at most ∫ e^{x'β̂} dF0(x) and is at least ∫ e^{x'β̂} (1 − e^{α̂+(x−x̄)'β̂}) dF0(x) → ∫ e^{x'β̂} dF0(x) as N → ∞, because α̂ → −∞ and ∫ e^{2x'β̂} dF0(x) < ∞ by the tail condition (11). [sent-347, score-0.316]
Therefore the denominator of (13) has the same limit as ∫ e^{x'β̂} dF0(x) as N → ∞. [sent-348, score-0.143]
Similarly the numerator has the same limit as ∫ x e^{x'β̂} dF0(x). [sent-349, score-0.143]
Illustration. It is perhaps surprising that in the N → ∞ limit, the logistic regression depends on x1, . . . , xn only through their average x̄. [sent-352, score-0.47]
71 They study a data set with 29,812 chemical compounds on which 6 predictor variables were measured. [sent-360, score-0.174]
Table 4 shows the logistic regression coefficients for this data, as well as what happens to them when we replace the 608 data points (x, y) with y = 1 by a single point at (x̄, 1), or by 608 points equal to (x̄, 1). [sent-362, score-0.468]
In a centered logistic regression the point (x̄, 1) becomes (x̄ − x̄, 1) = (0, 1). [sent-363, score-0.439]
74 The intercept changes a lot when we reduce the rare cases from 608 to 1 but otherwise the coefficients do not change importantly. [sent-367, score-0.414]
Interestingly the single point version has a β vector closer to the original logistic regression than has the version with 608 points at (x̄, 1). [sent-368, score-0.439]
The correlation between the linear predictor from logistic regression to that fit with all xi = x̄ is 0. [sent-372, score-0.556]
The correlation between the linear predictor from logistic regression to that fit with just one (x̄, 1) data point is still higher, at 0. [sent-374, score-0.517]
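A sketch of this comparison on synthetic stand-in data; the chemical compound data set itself is not used, and the sample sizes and distributions below are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X0 = rng.standard_normal((20_000, 3))            # common class (y = 0)
X1 = 0.5 * rng.standard_normal((200, 3)) + 1.0   # rare class (y = 1)

def fit_logit(X_neg, X_pos):
    X = sm.add_constant(np.vstack([X_neg, X_pos]))
    y = np.concatenate([np.zeros(len(X_neg)), np.ones(len(X_pos))])
    return sm.Logit(y, X).fit(disp=0, maxiter=100)

full = fit_logit(X0, X1)
single = fit_logit(X0, X1.mean(axis=0, keepdims=True))   # all rare cases replaced by one point at x_bar

print("beta, all rare cases:    ", np.round(full.params[1:], 3))
print("beta, one case at x_bar: ", np.round(single.params[1:], 3))
Xall = sm.add_constant(np.vstack([X0, X1]))
print("corr of linear predictors:", round(np.corrcoef(Xall @ full.params, Xall @ single.params)[0, 1], 4))
```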
78 1 shows these findings lead to greater understanding of how logistic regression works or fails and how to improve it. [sent-384, score-0.479]
79 4 describes how using infinitely imbalanced logistic regression may lead to cost savings in fraud detection settings. [sent-390, score-1.124]
1 Insight Into Logistic Regression. In the infinitely imbalanced limit, logistic regression only uses the y = 1 data points through their average feature vector x̄. [sent-392, score-0.85]
This limiting behavior is a property of logistic regression, not of any particular data set. [sent-393, score-0.408]
82 It holds equally well in those problems for which logistic regression works badly as it does in problems where the Bayes rule is a logistic regression. [sent-394, score-0.767]
In the illustrative example we got almost the same logistic regression after replacing all the rare cases by a single point at x̄. [sent-395, score-0.641]
But knowing that those parameters are very strongly tied to the d components of x̄ gives us insight into how logistic regression works on imbalanced problems. [sent-400, score-0.879]
It is reasonable to expect better results from logistic regression when the x1i are in a single tight cluster near x̄ than when there are outliers, or when the x1i points are in two well separated clusters in different directions from the bulk of F0. [sent-401, score-0.469]
When we detect sharp clusters among the x1i then we might fit one logistic regression per cluster, separating that cluster from the x0i's, and predict for new points by pooling the cluster-specific results. [sent-404, score-0.499]
2 Nontrivial Limiting Predictions. In the infinitely imbalanced limit with N → ∞ we often find that β̂ converges to a finite limit while α̂ → −∞. [sent-407, score-0.511]
For example if we are presented with a number of cases of potential fraud to investigate and have limited resources then we can rank them by x'β and investigate as many of the most likely ones as time or other costs allow. [sent-410, score-0.362]
If the values of uncovering fraud in the two cases, with predictors x̃ and x, are ṽ and v respectively, then we might prefer to investigate the former when ṽ e^{x̃'β} > v e^{x'β}. [sent-412, score-0.335]
If the costs of investigation are c̃ and c then we might prefer the former when ṽ e^{x̃'β}/c̃ > v e^{x'β}/c. [sent-413, score-0.205]
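A small sketch of this ranking rule with hypothetical numbers: order the candidate cases by v e^{x'β}/c, the estimated value per unit of investigation cost, and work down the list until the budget runs out.

```python
import numpy as np

def investigation_order(X, values, costs, beta):
    """Rank candidate cases by v * exp(x' beta) / c; X: (m, d), values and costs: (m,)."""
    scores = values * np.exp(X @ beta) / costs
    return np.argsort(-scores)                 # indices of the cases, best first

# Hypothetical example: five unlabelled cases and a fitted (limiting) beta.
beta = np.array([0.8, -0.3])
X = np.array([[1.2, 0.1], [0.3, 0.9], [2.0, -0.5], [0.0, 0.0], [1.0, 1.0]])
v = np.array([100.0, 500.0, 100.0, 250.0, 100.0])   # value if fraud is found
c = np.array([10.0, 50.0, 10.0, 10.0, 20.0])        # cost of investigating
print(investigation_order(X, v, c, beta))
```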
91 In the selective setting, the investigator has a mix of labelled cases (both x and y known) and unlabelled cases (x known but y unknown), and must choose which of the unlabelled cases to get a label for. [sent-419, score-0.187]
In a rare event setting, finding the cases most likely to have y = 1 is a reasonable proxy for finding the most informative cases, and one could then allocate a large part of the labelling budget to cases with high values of x'β. [sent-421, score-0.233]
93 The effective sample size of an imbalanced data set is often considered to be simply the number of rare outcomes. [sent-424, score-0.582]
If at least one of the Σk with λk > 0 has full rank then F0 will surround the point x̄. [sent-432, score-0.156]
By contrast, each step in iteratively reweighted least squares fitting of logistic regression takes O((n + N)d^2) work. [sent-444, score-0.439]
96 But after the first iteration there can be substantial computational savings for solving (15) instead of doing logistic regression, when n/d is large. [sent-447, score-0.361]
97 When there is one common class and there are numerous rare classes, such as types of fraud or different targets against which a drug might be active, then the cost of approximating F0 can be shared over the set of uncommon classes. [sent-448, score-0.423]
98 In fraud detection problems we might expect that the distribution F0 for legitimate data points is slowly changing while the patterns in the fraudulent points change rapidly in response to improved detection. [sent-449, score-0.27]
99 These vectors can be for different known types of fraud, for fraud over shorter time intervals, or even individual fraud cases. [sent-451, score-0.378]
100 Editorial: special issue on learning from imbalanced data sets. [sent-485, score-0.411]
wordName wordTfidf (topN-words)
[('imbalanced', 0.411), ('logistic', 0.328), ('intercept', 0.212), ('silvapulle', 0.194), ('fraud', 0.189), ('owen', 0.188), ('ji', 0.173), ('rare', 0.171), ('nfinitely', 0.167), ('cauchy', 0.154), ('mle', 0.147), ('rd', 0.144), ('surrounded', 0.141), ('mbalanced', 0.141), ('cdf', 0.125), ('surround', 0.117), ('surrounds', 0.111), ('regression', 0.111), ('egression', 0.107), ('ogistic', 0.107), ('nitely', 0.107), ('tails', 0.095), ('ex', 0.093), ('coef', 0.086), ('vex', 0.083), ('limiting', 0.08), ('predictor', 0.078), ('log', 0.075), ('maximizer', 0.071), ('df', 0.07), ('tilting', 0.07), ('interior', 0.069), ('chawla', 0.067), ('strati', 0.067), ('lim', 0.064), ('drug', 0.063), ('japkowicz', 0.063), ('compounds', 0.063), ('lemma', 0.06), ('nontrivial', 0.06), ('bolton', 0.056), ('ordinarily', 0.056), ('uncentered', 0.056), ('detection', 0.052), ('predictors', 0.052), ('hull', 0.052), ('cients', 0.051), ('active', 0.051), ('observations', 0.05), ('limit', 0.05), ('overlap', 0.05), ('pr', 0.05), ('convex', 0.048), ('heavy', 0.047), ('unlabelled', 0.047), ('zh', 0.047), ('diverges', 0.047), ('supn', 0.047), ('ze', 0.047), ('stanford', 0.043), ('nonsingular', 0.042), ('tilt', 0.042), ('intersects', 0.042), ('approaching', 0.042), ('suppose', 0.041), ('xn', 0.04), ('fails', 0.04), ('zhu', 0.04), ('likelihood', 0.039), ('xi', 0.039), ('rank', 0.039), ('costs', 0.039), ('political', 0.038), ('jn', 0.038), ('tail', 0.037), ('nite', 0.034), ('chemical', 0.033), ('ray', 0.033), ('savings', 0.033), ('cohn', 0.033), ('king', 0.033), ('investigate', 0.032), ('events', 0.032), ('af', 0.031), ('surprising', 0.031), ('cases', 0.031), ('cluster', 0.03), ('mixture', 0.03), ('replace', 0.029), ('response', 0.029), ('insight', 0.029), ('nj', 0.029), ('fail', 0.029), ('concave', 0.028), ('gaussians', 0.028), ('stronger', 0.028), ('establishing', 0.027), ('unbalanced', 0.027), ('theorem', 0.027), ('aaai', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
Author: Art B. Owen
Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
2 0.13924295 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For1 dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one can not hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k , using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O log n suffices with the constants explicitly ε2 given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution of by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
3 0.10593387 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.
Author: Carine Hue, Marc Boullé
Abstract: In this paper, we consider the supervised learning task which consists in predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally to the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing numerical variables and grouping categorical ones and we derive an analytical criterion to select the partition with the highest posterior probability. We show how these partitions can be used to build univariate predictors and multivariate ones under a naive Bayes assumption. We also propose a new evaluation criterion for probabilistic rank estimators. Based on the logarithmic score, we show that such criterion presents the advantage to be minored, which is not the case of the logarithmic score computed for probabilistic value estimator. A first set of experimentations on synthetic data shows the good properties of the proposed criterion and of our partitioning approach. A second set of experimentations on real data shows competitive performance of the univariate and selective naive Bayes rank estimators projected on the value range compared to methods submitted to a recent challenge on probabilistic metric regression tasks. Our approach is applicable for all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors as it automatically detects the variables which contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the criteria selection is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting. Keywords: rank regression, probabilistic approach, 2D partitioning, non parametric estimation, Bayesian model selection
5 0.062225182 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
6 0.048543632 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
7 0.039742243 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
8 0.038269389 90 jmlr-2007-Value Regularization and Fenchel Duality
9 0.037738729 15 jmlr-2007-Bilinear Discriminant Component Analysis
10 0.037145335 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
12 0.036125384 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
13 0.035752892 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
14 0.034845993 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
15 0.034647718 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
16 0.03278726 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions
17 0.03247859 9 jmlr-2007-AdaBoost is Consistent
18 0.032156743 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
19 0.031839054 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
20 0.031797666 71 jmlr-2007-Refinable Kernels
topicId topicWeight
[(0, 0.211), (1, -0.024), (2, 0.021), (3, 0.033), (4, 0.078), (5, -0.041), (6, -0.084), (7, -0.004), (8, 0.092), (9, -0.103), (10, -0.073), (11, -0.142), (12, -0.044), (13, -0.093), (14, 0.336), (15, 0.334), (16, 0.2), (17, -0.002), (18, -0.017), (19, -0.177), (20, -0.126), (21, -0.038), (22, 0.071), (23, 0.036), (24, -0.121), (25, 0.104), (26, 0.211), (27, 0.033), (28, -0.045), (29, 0.045), (30, 0.018), (31, -0.152), (32, -0.034), (33, 0.004), (34, 0.076), (35, 0.035), (36, -0.008), (37, 0.044), (38, 0.127), (39, 0.035), (40, -0.037), (41, 0.099), (42, 0.105), (43, -0.117), (44, 0.096), (45, -0.043), (46, 0.098), (47, -0.025), (48, -0.034), (49, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.96775538 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
Author: Art B. Owen
Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
2 0.67107916 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
Author: Ping Li, Trevor J. Hastie, Kenneth W. Church
Abstract: For1 dimension reduction in the l1 norm, the method of Cauchy random projections multiplies the original data matrix A ∈ Rn×D with a random matrix R ∈ RD×k (k D) whose entries are i.i.d. samples of the standard Cauchy C(0, 1). Because of the impossibility result, one can not hope to recover the pairwise l1 distances in A from B = A × R ∈ Rn×k , using linear estimators without incurring large errors. However, nonlinear estimators are still useful for certain applications in data stream computations, information retrieval, learning, and data mining. We study three types of nonlinear estimators: the sample median estimators, the geometric mean estimators, and the maximum likelihood estimators (MLE). We derive tail bounds for the geometric mean estimators and establish that k = O log n suffices with the constants explicitly ε2 given. Asymptotically (as k → ∞), both the sample median and the geometric mean estimators are about 80% efficient compared to the MLE. We analyze the moments of the MLE and propose approximating its distribution of by an inverse Gaussian. Keywords: dimension reduction, l1 norm, Johnson-Lindenstrauss (JL) lemma, Cauchy random projections
3 0.5262785 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.
Author: Carine Hue, Marc Boullé
Abstract: In this paper, we consider the supervised learning task which consists in predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally to the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing numerical variables and grouping categorical ones and we derive an analytical criterion to select the partition with the highest posterior probability. We show how these partitions can be used to build univariate predictors and multivariate ones under a naive Bayes assumption. We also propose a new evaluation criterion for probabilistic rank estimators. Based on the logarithmic score, we show that such criterion presents the advantage to be minored, which is not the case of the logarithmic score computed for probabilistic value estimator. A first set of experimentations on synthetic data shows the good properties of the proposed criterion and of our partitioning approach. A second set of experimentations on real data shows competitive performance of the univariate and selective naive Bayes rank estimators projected on the value range compared to methods submitted to a recent challenge on probabilistic metric regression tasks. Our approach is applicable for all regression problems with categorical or numerical predictors. It is particularly interesting for those with a high number of predictors as it automatically detects the variables which contain predictive information. It builds pertinent predictors of the normalized rank of the numerical target from one or several predictors. As the criteria selection is regularized by the presence of a prior and a posterior term, it does not suffer from overfitting. Keywords: rank regression, probabilistic approach, 2D partitioning, non parametric estimation, Bayesian model selection
5 0.27923504 16 jmlr-2007-Boosted Classification Trees and Class Probability Quantile Estimation
Author: David Mease, Abraham J. Wyner, Andreas Buja
Abstract: The standard by which binary classifiers are usually judged, misclassification error, assumes equal costs of misclassifying the two classes or, equivalently, classifying at the 1/2 quantile of the conditional class probability function P[y = 1|x]. Boosted classification trees are known to perform quite well for such problems. In this article we consider the use of standard, off-the-shelf boosting for two more general problems: 1) classification with unequal costs or, equivalently, classification at quantiles other than 1/2, and 2) estimation of the conditional class probability function P[y = 1|x]. We first examine whether the latter problem, estimation of P[y = 1|x], can be solved with LogitBoost, and with AdaBoost when combined with a natural link function. The answer is negative: both approaches are often ineffective because they overfit P[y = 1|x] even though they perform well as classifiers. A major negative point of the present article is the disconnect between class probability estimation and classification. Next we consider the practice of over/under-sampling of the two classes. We present an algorithm that uses AdaBoost in conjunction with Over/Under-Sampling and Jittering of the data (“JOUS-Boost”). This algorithm is simple, yet successful, and it preserves the advantage of relative protection against overfitting, but for arbitrary misclassification costs and, equivalently, arbitrary quantile boundaries. We then use collections of classifiers obtained from a grid of quantiles to form estimators of class probabilities. The estimates of the class probabilities compare favorably to those obtained by a variety of methods across both simulated and real data sets. Keywords: boosting algorithms, LogitBoost, AdaBoost, class probability estimation, over-sampling, under-sampling, stratification, data jittering
6 0.21424599 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians
8 0.18731755 15 jmlr-2007-Bilinear Discriminant Component Analysis
10 0.17866597 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
11 0.17012581 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
13 0.1638588 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
14 0.16290167 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
15 0.15726949 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
16 0.15405822 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results
17 0.15191162 90 jmlr-2007-Value Regularization and Fenchel Duality
18 0.149114 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
19 0.14389245 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
20 0.14269908 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
topicId topicWeight
[(4, 0.015), (8, 0.018), (10, 0.011), (12, 0.012), (28, 0.04), (40, 0.667), (48, 0.028), (60, 0.023), (80, 0.01), (85, 0.027), (98, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.9596417 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
Author: Art B. Owen
Abstract: In binary classification problems it is common for the two classes to be imbalanced: one case is very rare compared to the other. In this paper we consider the infinitely imbalanced case where one class has a finite sample size and the other class’s sample size grows without bound. For logistic regression, the infinitely imbalanced case often has a useful solution. Under mild conditions, the intercept diverges as expected, but the rest of the coefficient vector approaches a non trivial and useful limit. That limit can be expressed in terms of exponential tilting and is the minimum of a convex objective function. The limiting form of logistic regression suggests a computational shortcut for fraud detection problems. Keywords: classification, drug discovery, fraud detection, rare events, unbalanced data
2 0.88355714 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
4 0.45618618 10 jmlr-2007-An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression
Author: Kwangmoo Koh, Seung-Jean Kim, Stephen Boyd
Abstract: Logistic regression with 1 regularization has been proposed as a promising method for feature selection in classification problems. In this paper we describe an efficient interior-point method for solving large-scale 1 -regularized logistic regression problems. Small problems with up to a thousand or so features and examples can be solved in seconds on a PC; medium sized problems, with tens of thousands of features and examples, can be solved in tens of seconds (assuming some sparsity in the data). A variation on the basic method, that uses a preconditioned conjugate gradient method to compute the search step, can solve very large problems, with a million features and examples (e.g., the 20 Newsgroups data set), in a few minutes, on a PC. Using warm-start techniques, a good approximation of the entire regularization path can be computed much more efficiently than by solving a family of problems independently. Keywords: logistic regression, feature selection, 1 regularization, regularization path, interiorpoint methods.
5 0.41034806 77 jmlr-2007-Stagewise Lasso
Author: Peng Zhao, Bin Yu
Abstract: Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulted sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences
6 0.39593127 71 jmlr-2007-Refinable Kernels
7 0.38555765 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
8 0.36308566 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
9 0.36174899 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
10 0.35986716 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
12 0.35627088 39 jmlr-2007-Handling Missing Values when Applying Classification Models
13 0.35172749 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
14 0.35050189 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
15 0.34779599 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
16 0.33859801 13 jmlr-2007-Bayesian Quadratic Discriminant Analysis
17 0.33775368 15 jmlr-2007-Bilinear Discriminant Component Analysis
18 0.33247498 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
19 0.33016261 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction in l1 Using Cauchy Random Projections
20 0.32389513 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels