jmlr jmlr2010 jmlr2010-73 knowledge-graph by maker-knowledge-mining

73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds


Source: pdf

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. [sent-11, score-0.404]

2 On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. [sent-12, score-0.298]

3 Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets 1. [sent-15, score-0.452]

4 The risk is defined as: risk = p(+1)c(+1)p(error|+1) + p(−1)c(−1)p(error|−1), (1) where c(+1) and c(−1) denote the costs of a false negative and false positive, respectively, p(+1) and p(−1) are the prior probabilities for classes y = +1 and y = −1, and p(error|+1) is the false negative rate. [sent-20, score-0.625]

5 Thus, the risk minimization problem is uniquely defined by the ratio p(+1)c(+1)/[p(−1)c(−1)]; that is, even though the priors and costs may vary, as long as this ratio stays constant, the optimization problem is unchanged. [sent-26, score-0.573]

6 Minimizing risk is synonymous with optimally trading off the false negative and false positive rates. [sent-34, score-0.481]

7 However, “reading off” the expected loss from an ROC graph is not straightforward, and Drummond and Holte (2000) proposed cost curves as an explicit visualization of a classifier’s risk for varying misclassification costs and class priors. [sent-37, score-0.579]

8 Since the ratio p(+1)c(+1)/[p(−1)c(−1)] is unbounded, the curves instead show the risk as a function of the probability cost function (pcf): pcf = p(+1)c(+1) / [p(+1)c(+1) + p(−1)c(−1)]. [sent-38, score-0.929]
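To make the role of the pcf concrete, here is a minimal sketch (not from the paper; function and argument names are illustrative) that computes it from assumed class priors and misclassification costs:

```python
def pcf(p_pos, c_fn, c_fp):
    """Probability cost function: p(+1)c(+1) / [p(+1)c(+1) + p(-1)c(-1)].

    p_pos: prior p(+1); c_fn: cost of a false negative, c(+1);
    c_fp: cost of a false positive, c(-1).
    """
    num = p_pos * c_fn
    return num / (num + (1.0 - p_pos) * c_fp)

# Equal priors, false negatives four times as costly as false positives:
print(pcf(0.5, 4.0, 1.0))  # 0.8
```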

9 If the classifier is based on the log of the ratio of true class posterior probabilities, the threshold should be modified by a value equal to the log of the ratio of misclassification costs (Duda et al. [sent-42, score-0.296]

10 Elkan (2001) proposes handling asymmetric misclassification costs by retraining the classifier on a training set in which the proportion of positive and negative examples is matched to the ratio of misclassification costs. [sent-47, score-0.235]

11 The posteriors for the labeled examples are estimated via bagging and then used in conjunction with the Bayesian minimum risk criterion to assign new labels to the supervised data. [sent-52, score-0.405]

12 It is also important to point out that risk minimization is a diverse problem spanning multiple research communities; in particular, significant contributions to the problem have been made in the econometrics literature. [sent-61, score-0.367]

13 It is noted therein that to construct the minimum risk decision, the model density need not match the true density; rather, it is only required that the classifier output from the model density falls on the same side of the threshold as the classifier output using the true density. [sent-63, score-0.431]

14 In the former case, shifting the threshold of the maximum likelihood (ML) solution is an asymptotically optimal solution to the risk minimization problem, and in the following we provide a proof of this important point. [sent-67, score-0.475]

15 However, the choice to employ a simple classifier brings many advantages: ease of implementation, a lesser number of parameters to estimate, a reduced risk of over-fitting, and consequently simplified regularization procedures. [sent-70, score-0.334]

16 In this case, we demonstrate that thresholded ML is suboptimal, and that the minimum risk solution varies with the ratio of misclassification costs. [sent-72, score-0.664]

17 In this paper, we employ a sigmoidal approximation of the empirical risk to yield a novel minimizer of the loss under asymmetric misclassification cost values. [sent-74, score-0.755]

18 We show analytically that the negative weighted log likelihood serves as an upper bound to the sigmoidal empirical risk. [sent-76, score-0.431]

19 Based on the convexity of the negative weighted log likelihood and forthcoming numerical results, we will argue that weighted ML is generally the preferred technique. [sent-77, score-0.218]

20 If there is also no cost for a correct decision, then c(y) is simply the cost of a false positive (y = −1) or false negative (y = +1). [sent-82, score-0.242]

21 The optimal decision rule may be written as: ŷ(x) = sgn[f(x) − ln c_o(x)], (2) where ŷ(x) is the predicted class given feature vector x, sgn is the signum function (sgn(x) = 1 for x > 0 and −1 for x ≤ 0), and f(x) is the discriminant function f(x) = ln [p(+1|x)/p(−1|x)]. [sent-85, score-0.282]

22 The model discriminant is written as f(x, θ) = ln [p(+1|x; θ)/p(−1|x; θ)], and the classifier takes the form: ŷ(x, θ) = sgn[f(x, θ) − ln c_o(x)]. [sent-90, score-0.208]
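A small sketch of this plug-in decision rule, assuming x-independent costs so that c_o = c(−1)/c(+1) (the helper name and variables are ours, not the paper's):

```python
import numpy as np

def cost_sensitive_decision(f, c_fn, c_fp):
    """y_hat = sgn[f(x, theta) - ln c_o], with c_o = c(-1)/c(+1).

    f: array of discriminant values ln p(+1|x; theta) - ln p(-1|x; theta).
    """
    return np.where(f - np.log(c_fp / c_fn) > 0, 1, -1)
```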

23 In order to treat these estimation methods, we briefly outline the risk minimization framework which allows for the subsequent analysis of the various cost-sensitive loss functions. [sent-92, score-0.399]

24 The problems of regression, density estimation, and pattern recognition may all be formulated within the context of risk minimization, simply by altering the loss function L, as outlined in Vapnik (1998, 1999). [sent-97, score-0.366]

25 Assuming continuity of the log likelihood (in θ), we have that lim_{N→∞} ŷ(x, θ̂_ML) = ŷ(x, θ*), and thus the thresholded ML estimate yields the minimum risk decision rule for all cost ratios. [sent-109, score-0.75]
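As an illustration of the thresholded-ML baseline (a sketch under our own assumptions: a logistic posterior model, labels in {−1, +1}, and x-independent costs), one can fit by plain maximum likelihood and only shift the decision threshold:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def thresholded_ml(X_train, y_train, X_test, c_fn, c_fp):
    # Unweighted ML fit of the posterior model p(y|x; theta).
    clf = LogisticRegression().fit(X_train, y_train)
    # decision_function returns ln p(+1|x) - ln p(-1|x) for the logistic model.
    f = clf.decision_function(X_test)
    # Shift the threshold by ln c_o = ln[c(-1)/c(+1)] instead of retraining.
    return np.where(f - np.log(c_fp / c_fn) > 0, 1, -1)
```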

26 It is well known that the risk function is convex, and an iteratively reweighted least squares (IRLS) algorithm locates the optimal linear classifier often within a few iterations. [sent-113, score-0.36]
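For completeness, a compact Newton/IRLS sketch for the logistic log-likelihood (assuming labels recoded to {0, 1} and a bias column appended to X; this is a generic textbook update, not code from the paper):

```python
import numpy as np

def irls_logistic(X, y01, n_iter=25, tol=1e-8):
    """Newton (IRLS) updates for logistic regression; y01 in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y01 - p)                    # gradient of the log-likelihood
        H = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(H + 1e-10 * np.eye(X.shape[1]), grad)
        w += step
        if np.linalg.norm(step) < tol:
            break
    return w
```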

27 In such cases, the classifier in the model set which minimizes risk will vary with the pcf. [sent-117, score-0.334]

28 The purpose of this exercise is not to argue for a simple Gaussian model or a linear classifier but rather to demonstrate in an analytically tractable case the problem that arises with thresholded ML when the model is misspecified. [sent-121, score-0.238]

29 Figure 1(a) displays the minimum risk quadrics for various values of the pcf, where it is assumed that p(+1) = p(−1). [sent-135, score-0.42]

30 On the other hand, Figure 1(b) depicts the minimum risk planes for the same data, which were computed by minimizing (7) using numerical optimization techniques. [sent-137, score-0.427]

31 It is clear that the direction of the optimal plane is a function of the pcf, and a threshold shift of the minimum error plane is not an optimal solution to the risk minimization problem. [sent-138, score-0.59]

32 The risks obtained by applying the ML and minimum risk planes to the given data are shown in Figure 1(d). [sent-143, score-0.427]

33 In the figure, we normalize the raw risk of (1) by the “trivial risk”, which is defined as the risk achieved by the scheme: ŷ_trivial = sgn[p(+1)c(+1) − p(−1)c(−1)]. [sent-144, score-0.71]
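A sketch of this normalization (helper names are ours; class-dependent costs only): compute the empirical cost of a classifier and divide by the risk of the a priori rule.

```python
import numpy as np

def empirical_cost(y_true, y_pred, c_fn, c_fp):
    """Average incurred cost: c(+1) per false negative, c(-1) per false positive."""
    fn = (y_true == 1) & (y_pred == -1)
    fp = (y_true == -1) & (y_pred == 1)
    return np.mean(c_fn * fn + c_fp * fp)

def trivial_risk(p_pos, c_fn, c_fp):
    """Risk of y_trivial = sgn[p(+1)c(+1) - p(-1)c(-1)]: always predict the cheaper class."""
    if p_pos * c_fn > (1.0 - p_pos) * c_fp:
        return (1.0 - p_pos) * c_fp  # always +1: every negative becomes a false positive
    return p_pos * c_fn              # always -1: every positive becomes a false negative

# normalized_risk = empirical_cost(...) / trivial_risk(...)
```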

34 A normalized risk less than 1 indicates that the classification scheme yields a “savings” over the a priori decision rule. [sent-146, score-0.366]

35 The ML classifier was trained on each ensemble and the resulting risk computed by substituting the solution into (7). [sent-149, score-0.334]

36 The risk margin between the threshold-shifted ML solution and that of the minimum risk plane is what is “available” for cost-sensitive learning algorithms to improve upon. [sent-150, score-0.764]

37 These methods attempt to learn, for each pcf, the minimum risk plane shown in Figure 1(b), to achieve the dashed cost curve in Figure 1(d). [sent-151, score-0.49]

38 The difference between the threshold-shifted ML and cost-sensitive paradigms may be understood in terms of ROC analysis—Figure 1 (e) depicts the ROC curves for the thresholded ML and minimum risk classifiers. [sent-152, score-0.654]

42 [Figure 1 panels: (a) minimum risk quadrics, (e) ROC curves; axes: x1, x2, Cost / Trivial Cost.] Figure 1: Minimum risk classification of Gaussian data with unequal covariance matrices. [sent-190, score-0.4]

43 The empirical risk follows as: R_emp^rel(θ) = −(1/N) Σ_n ln p[g(x_n)|x_n; θ]. [sent-196, score-0.45]

44 As the re-labeling of (8) varies with the ratio of misclassification costs, the resulting cost-sensitive classifier is a function of the pcf and thus has the ability to yield the minimum risk estimator. [sent-197, score-0.922]

45 The success of the method hinges on the estimation of the posterior probabilities; in the best case scenario, the re-labeling results in the cost-sensitive boundary approaching the minimum risk boundary. [sent-198, score-0.425]
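The following is a MetaCost-flavoured sketch (not Domingos' exact algorithm; estimator choices, bag count, and helper name are our own assumptions): posteriors are estimated by bagging, each example is relabeled with the Bayes minimum-risk label g(x), and a classifier is retrained on the new labels.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

def relabel_and_retrain(X, y, c_fn, c_fp, n_bags=10, seed=0):
    bag = BaggingClassifier(LogisticRegression(), n_estimators=n_bags,
                            random_state=seed).fit(X, y)
    p_pos = bag.predict_proba(X)[:, list(bag.classes_).index(1)]
    # g(x) = +1 iff the expected cost of predicting +1 is lower:
    # p(-1|x) c(-1) < p(+1|x) c(+1).
    g = np.where((1.0 - p_pos) * c_fp < p_pos * c_fn, 1, -1)
    return LogisticRegression().fit(X, g)  # retrain on the relabeled data
```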

46 In contrast to the forthcoming methods, MetaCost does not reweight examples, and thus the risk function will not be dominated by the examples of a rare but costly class. [sent-199, score-0.378]

47 The weighted log likelihood loss takes the form: L(y, x, θ) = −c(y, x) ln p(y|x; θ), such that the corresponding empirical risk is: R_emp^wml(θ) = −(1/N) Σ_n c(y_n, x_n) ln p(y_n|x_n; θ). [sent-205, score-0.562]
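In practice, weighted ML for a logistic model amounts to passing the costs as per-example weights; a minimal sketch (class-dependent costs only; the helper name is ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_ml_fit(X, y, c_fn, c_fp):
    # Each example's log-likelihood term is weighted by its misclassification cost c(y_n).
    w = np.where(y == 1, c_fn, c_fp)
    return LogisticRegression().fit(X, y, sample_weight=w)
```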

48 In principle, this technique allows the classifier to choose the model in Θ which minimizes risk for the specified cost matrix. [sent-209, score-0.394]

49 Notice that if the misclassification costs are highly asymmetric, the more “costly” examples will be heavily emphasized in the empirical risk function. [sent-211, score-0.483]

50 Furthermore, if there are only a few such examples, the classifier is at an increased risk of overfitting, since the effective number of examples is much less than N. [sent-212, score-0.334]

51 5 Sigmoidal Empirical Risk In order to relate the risk of (1) with the empirical risk (4), the appropriate loss function is found to be: L(y, x, θ) = c(y, x) · 1[ŷ ≠ y], where 1[·] denotes the indicator function. [sent-215, score-0.729]

52 Generally, the empirical risk follows as: R_emp(θ) = (1/N) Σ_n c(y_n, x_n) · 1[ŷ_n ≠ y_n]. [sent-217, score-0.538]

53 Since our goal is to minimize (1), the optimization of the empirical risk under the direct loss of (10) is of great importance. [sent-220, score-0.421]

54 R̃_emp(θ) = (1/N) Σ_n c(y_n, x_n) / (1 + e^{y_n f(x_n, θ)}). The classifier θ̃ which minimizes the empirical risk follows as: θ̃ = arg min_θ R̃_emp(θ). [sent-225, score-0.421]
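A sketch of this objective and its gradient for a linear discriminant f(x, θ) = xᵀθ (bias column assumed; implementation details are our own):

```python
import numpy as np

def sigmoidal_risk(theta, X, y, costs):
    """(1/N) sum_n c_n / (1 + exp(y_n * x_n @ theta))."""
    margin = y * (X @ theta)
    return np.mean(costs / (1.0 + np.exp(margin)))

def sigmoidal_risk_grad(theta, X, y, costs):
    margin = y * (X @ theta)
    s = 1.0 / (1.0 + np.exp(margin))   # per-example sigmoidal loss divided by c_n
    w = costs * s * (1.0 - s)
    return -(X * (w * y)[:, None]).mean(axis=0)
```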

55 On the other hand, the risk function is non-convex, complicating the minimization of (13). [sent-227, score-0.367]

56 In this subsection, we show that the negative weighted log likelihood, a convex function, provides a tight upper bound of the sigmoidal empirical risk. [sent-231, score-0.394]

57 Substituting z = 1/(1 + e^{y_n f(x_n, θ)}) into (14) results in: 1/(1 + e^{y_n f(x_n, θ)}) ≤ −ln[1/(1 + e^{−y_n f(x_n, θ)})]. [sent-233, score-0.346]

58 Combining these inequalities over the training examples and assuming strict positivity of the weights c(y_n, x_n), we obtain: Σ_n c(y_n, x_n)/(1 + e^{y_n f(x_n, θ)}) ≤ −Σ_n c(y_n, x_n) ln[1/(1 + e^{−y_n f(x_n, θ)})]. [sent-234, score-0.366]
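The bound is easy to check numerically; the snippet below (our own synthetic data and random parameter vector) verifies that the weighted negative log-likelihood dominates the weighted sigmoidal sum:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=(200, 2)), np.ones(200)])
y = rng.choice([-1, 1], size=200)
c = rng.uniform(0.5, 5.0, size=200)   # strictly positive weights
theta = rng.normal(size=3)

m = y * (X @ theta)
sigmoidal = np.sum(c / (1.0 + np.exp(m)))                # left-hand side
neg_wll = -np.sum(c * np.log(1.0 / (1.0 + np.exp(-m))))  # right-hand side
assert sigmoidal <= neg_wll
```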

59 Since the negative weighted log-likelihood is convex, to circumvent the non-convexity of the sigmoidal empirical risk, one option is to employ the weighted likelihood loss function. [sent-240, score-0.531]

60 The relationship between the misclassification cost ratio and the pcf is given by: c(+1)/c(−1) = [p(−1) · pcf] / [p(+1) · (1 − pcf)]. [sent-254, score-0.562]
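Since the experiments sweep the pcf, the inverse mapping is handy; a small sketch under the same assumptions as the pcf helper above:

```python
def cost_ratio_from_pcf(pcf_value, p_pos):
    """c(+1)/c(-1) implied by a target pcf and the prior p(+1)."""
    return ((1.0 - p_pos) * pcf_value) / (p_pos * (1.0 - pcf_value))
```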

61 In order to solve the minimum risk optimization of (13), the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Fletcher, 2000) was employed in conjunction with multiple random restarts: a random starting vector is chosen, the BFGS algorithm is run, and the training risk evaluated. [sent-264, score-0.738]

62 This process is repeated 100 times, and the final solution is selected as the classifier which yields the lowest training risk among the runs. [sent-265, score-0.355]
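A sketch of that procedure with SciPy's BFGS, reusing the sigmoidal_risk and sigmoidal_risk_grad helpers sketched earlier (the restart count and initialization scale are assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoidal(X, y, costs, n_restarts=100, seed=0):
    # Depends on sigmoidal_risk / sigmoidal_risk_grad defined above.
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = rng.normal(size=X.shape[1])        # random starting vector
        res = minimize(sigmoidal_risk, theta0, args=(X, y, costs),
                       jac=sigmoidal_risk_grad, method="BFGS")
        if best is None or res.fun < best.fun:      # keep the lowest training risk
            best = res
    return best.x
```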

63 At N = 10, there is a substantial loss margin between the minimum risk plane and that which is achieved by the thresholded ML technique. [sent-270, score-0.7]

64 However, it is clear that the cost-sensitive techniques are not able to provide reliable estimates of the minimum risk plane direction with such limited data. [sent-271, score-0.43]

65 As the size of the training set increases, the sigmoidal risk estimator converges to the minimum risk plane. [sent-272, score-1.036]

66 Notice, however, that with such large sample sizes, the thresholded ML technique is relatively adept at yielding risk values comparable to the true minimum. [sent-273, score-0.572]

67 The reason for this is that the examples which are misclassified by thresholded ML and classified correctly by the other techniques are mostly the low-cost examples (compare Fig. [sent-274, score-0.238]

68 Also shown in all plots is the Bayes risk, which is the risk attained by the minimum risk quadric. [sent-277, score-0.717]

69 In Fig. 3 (a) and (b), it is once again apparent that given a modest value of N, the benefits provided by cost-sensitive learners over the thresholded ML approach are not substantial. [sent-283, score-0.238]

70 In Fig. 3 (c) and (d), one observes a tangible loss reduction of the sigmoidal risk estimator over thresholded ML. [sent-285, score-0.664]

71 Since only one “ensemble” is available in the experiments with real data, we treat the cost (at test time) of each example as an iid realization of the risk and report standard errors of the mean across examples. [sent-288, score-0.394]
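A small sketch of that reporting convention (names are ours):

```python
import numpy as np

def mean_cost_and_sem(example_costs):
    """Mean incurred cost and its standard error, treating each example as an iid draw."""
    c = np.asarray(example_costs, dtype=float)
    return c.mean(), c.std(ddof=1) / np.sqrt(len(c))
```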

76 [Figure 2 legend: Thresholded ML, MetaCost, Weighted ML, Sigmoidal, Minimum risk hyperplane, Bayes Risk; panels: (c) N = 1000, (d) N = 10000; x-axis: pcf.] Figure 2: Cost curves for Gaussian data with varying training set sizes. [sent-327, score-0.513]

77 In (d), the thresholded ML and MetaCost curves are nearly equivalent. [sent-328, score-0.271]

78 Fig. 3 (e) demonstrates the near-optimality of weighted ML and its close approximation of the sigmoidal risk minimizing solution. [sent-340, score-0.675]

80 [Figure 3 legend: Thresholded ML, MetaCost, Weighted ML, Sigmoidal; panels include (c) Magic, (d) Adult, (e) EEG; x-axis: pcf.] Figure 3: Cost curves (with standard errors of the means) for various real data sets with synthetic costs. [sent-394, score-0.492]

81 3 German Banking Data Finally, we evaluated the risk minimizing classifiers on a publicly available data set collected by a German bank: http://www. [sent-396, score-0.334]

82 We conducted a leave-one-out cross-validation of the thresholded ML, weighted ML, and sigmoidal risk estimators on this N = 1000 example data set (MetaCost is not applicable to problems with example-dependent costs). [sent-414, score-0.913]

83 On average, the means obtained by the thresholded ML, weighted ML, and sigmoidal risk estimators are DM 3. [sent-416, score-0.913]

84 A statistically significant improvement in NPV is achieved by both WML and sigmoidal risk estimators over the thresholded ML solution (p < 0. [sent-425, score-0.845]

85 On the other hand, statistical significance cannot be established between WML and sigmoidal risk (p ≈ 0. [sent-427, score-0.607]

86 Initially, we were motivated by the observation that with a misspecified model, the direction of the minimum risk hyperplane is a function of the ratio of misclassification costs and the class priors. [sent-432, score-0.605]

87 Since the ML approach inherently only shifts the minimum error hyperplane, we sought to develop an estimator which, given a ratio of misclassification costs, will find the direction required to minimize risk rather than maximize likelihood. [sent-433, score-0.509]

88 This led us to the development of the sigmoidal empirical risk minimizer. [sent-435, score-0.636]

89 Firstly, the search for the minimum risk hyperplane is non-trivial: regularization techniques proved to be necessary, particularly in the case of a limited training set. [sent-437, score-0.463]

90 Moreover, both the existing and proposed cost-sensitive learning techniques yield the greatest benefits over thresholded ML when presented with large amounts of training data. [sent-438, score-0.259]

91 When abundant data is available, the sigmoidal risk estimator typically outperforms all other methods, but weighted ML yields quite comparable values. [sent-439, score-0.7]

92 From the so-called structural risk minimization principle, it is well-known that a simpler model set yields empirical risks that are closer to the true risk (Vapnik, 1998). [sent-444, score-0.73]

93 The results with Gaussian data presented above appear to indicate that the sigmoidal risk minimizer tends to the true minimum risk model given enough data. [sent-453, score-0.99]

94 However, the weighted ML estimator provides a tight upper bound on the sigmoidal empirical risk and thus this solution is not far from optimal. [sent-454, score-0.729]

95 Lastly, as seen in Figure 2, both the opportunity and the challenge in cost-sensitive learning lies in the ability to estimate the minimum risk model with limited data. [sent-456, score-0.383]

96 , with few examples relative to the dimensionality of the feature space) inference of the minimum risk classifier. [sent-459, score-0.383]

97 On the other hand, with a misspecified model, the risk minimizing solution is a function of the misclassification cost ratios, and thresholding the ML estimate is sub-optimal. [sent-464, score-0.394]

98 A novel estimator based on a sigmoidal estimation of the empirical risk was presented and shown to outperform conventional techniques provided enough data; however, the negative weighted log likelihood was analytically and empirically shown to tightly upper bound the sigmoidal loss. [sent-465, score-1.063]

99 For a more detailed version as well as implementations of thresholded ML and sigmoidal risk minimization, please refer to http://bme. [sent-475, score-0.845]

100 Learning when data sets are imbalanced and when costs are unequal and unknown. [sent-591, score-0.21]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pcf', 0.459), ('ml', 0.438), ('risk', 0.334), ('sigmoidal', 0.273), ('thresholded', 0.238), ('metacost', 0.186), ('parra', 0.162), ('mochowski', 0.124), ('costs', 0.12), ('yn', 0.117), ('ensitive', 0.112), ('npv', 0.112), ('misclassi', 0.111), ('lieli', 0.099), ('ajda', 0.095), ('misspeci', 0.088), ('aximum', 0.086), ('ikelihood', 0.079), ('lambda', 0.074), ('weighted', 0.068), ('er', 0.064), ('elkan', 0.062), ('classi', 0.061), ('roc', 0.06), ('cost', 0.06), ('hyperplane', 0.059), ('xn', 0.058), ('co', 0.058), ('imbalanced', 0.057), ('loan', 0.056), ('ln', 0.054), ('dm', 0.054), ('zadrozny', 0.053), ('remp', 0.053), ('applicant', 0.05), ('elliott', 0.05), ('minimum', 0.049), ('false', 0.049), ('threshold', 0.048), ('plane', 0.047), ('planes', 0.044), ('ratio', 0.043), ('mccullagh', 0.042), ('posterior', 0.042), ('sgn', 0.042), ('trivial', 0.041), ('duda', 0.039), ('earning', 0.038), ('costsensitive', 0.037), ('dmochowski', 0.037), ('irls', 0.037), ('quadrics', 0.037), ('wml', 0.037), ('likelihood', 0.037), ('asymmetry', 0.035), ('minimization', 0.033), ('unequal', 0.033), ('emp', 0.033), ('curves', 0.033), ('loss', 0.032), ('shift', 0.032), ('decision', 0.032), ('jacek', 0.032), ('domingos', 0.032), ('erm', 0.031), ('empirical', 0.029), ('nelder', 0.029), ('eeg', 0.029), ('city', 0.028), ('asymmetric', 0.027), ('reweighted', 0.026), ('minimize', 0.026), ('estimator', 0.025), ('white', 0.025), ('biographical', 0.025), ('drummond', 0.025), ('dudik', 0.025), ('endogenous', 0.025), ('nonstandard', 0.025), ('prevalences', 0.025), ('relabeled', 0.025), ('repayment', 0.025), ('rwml', 0.025), ('synonymous', 0.025), ('transfusion', 0.025), ('vold', 0.025), ('lucas', 0.025), ('negative', 0.024), ('credit', 0.023), ('shifting', 0.023), ('costly', 0.023), ('biomedical', 0.022), ('posteriors', 0.022), ('magic', 0.021), ('prevalent', 0.021), ('chawla', 0.021), ('forthcoming', 0.021), ('erf', 0.021), ('haberman', 0.021), ('training', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

2 0.19195464 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

3 0.11553397 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

4 0.10164212 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

5 0.094194256 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley

Abstract: The principle of parsimony also known as “Ockham’s razor” has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length

6 0.090828449 10 jmlr-2010-An Exponential Model for Infinite Rankings

7 0.086133383 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

8 0.082236975 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

9 0.070739642 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

10 0.060029268 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

11 0.057860665 25 jmlr-2010-Composite Binary Losses

12 0.053647339 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

13 0.053242356 22 jmlr-2010-Classification Using Geometric Level Sets

14 0.049663398 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization

15 0.049403388 109 jmlr-2010-Stochastic Composite Likelihood

16 0.047209922 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

17 0.046254203 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

18 0.04411044 60 jmlr-2010-Learnability, Stability and Uniform Convergence

19 0.043657396 61 jmlr-2010-Learning From Crowds

20 0.039138649 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.196), (1, -0.021), (2, 0.04), (3, 0.071), (4, -0.052), (5, 0.142), (6, 0.208), (7, 0.189), (8, -0.174), (9, 0.044), (10, -0.102), (11, -0.123), (12, -0.074), (13, -0.227), (14, -0.133), (15, -0.189), (16, 0.219), (17, 0.115), (18, -0.132), (19, -0.059), (20, -0.08), (21, -0.032), (22, -0.048), (23, 0.086), (24, 0.051), (25, 0.119), (26, 0.031), (27, 0.007), (28, 0.13), (29, -0.117), (30, 0.042), (31, 0.151), (32, -0.019), (33, 0.103), (34, -0.005), (35, 0.048), (36, -0.045), (37, -0.074), (38, -0.122), (39, -0.018), (40, 0.03), (41, -0.015), (42, -0.02), (43, 0.038), (44, 0.035), (45, 0.05), (46, 0.03), (47, 0.051), (48, -0.021), (49, 0.089)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93311936 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

2 0.6038034 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

Author: Jitkomut Songsiri, Lieven Vandenberghe

Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization

3 0.50630808 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

Author: Franz Pernkopf, Jeff A. Bilmes

Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature

4 0.43805227 10 jmlr-2010-An Exponential Model for Infinite Rankings

Author: Marina Meilă, Le Bao

Abstract: This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical. Keywords: permutations, partial orderings, Mallows model, distance based ranking model, exponential family, non-parametric clustering, branch-and-bound

5 0.42116591 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

6 0.42088959 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

7 0.34519312 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

8 0.32238492 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.295573 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

10 0.29164487 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

11 0.27781299 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

12 0.26987013 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

13 0.2695975 25 jmlr-2010-Composite Binary Losses

14 0.26688746 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

15 0.23503932 22 jmlr-2010-Classification Using Geometric Level Sets

16 0.2307312 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

17 0.21666823 63 jmlr-2010-Learning Instance-Specific Predictive Models

18 0.20545459 109 jmlr-2010-Stochastic Composite Likelihood

19 0.18631233 102 jmlr-2010-Semi-Supervised Novelty Detection

20 0.18031488 40 jmlr-2010-Fast and Scalable Local Kernel Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.011), (5, 0.01), (8, 0.016), (21, 0.015), (32, 0.611), (36, 0.024), (37, 0.028), (75, 0.104), (81, 0.011), (85, 0.06), (96, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92250085 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds

Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra

Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets

2 0.8926875 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

Author: Ran El-Yaniv, Yair Wiener

Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off

3 0.86682928 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models

Author: Jörg Lücke, Julian Eggert

Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models

4 0.85791922 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond

Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan

Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition

5 0.54306 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

Author: Ming Yuan, Marten Wegkamp

Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option

6 0.52551532 60 jmlr-2010-Learnability, Stability and Uniform Convergence

7 0.51960433 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

8 0.49815035 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

9 0.48616508 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

10 0.47203094 102 jmlr-2010-Semi-Supervised Novelty Detection

11 0.47010761 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

12 0.46619642 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials

13 0.46578458 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

14 0.45901603 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

15 0.45898023 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

16 0.45621285 109 jmlr-2010-Stochastic Composite Likelihood

17 0.45378983 22 jmlr-2010-Classification Using Geometric Level Sets

18 0.45299321 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data

19 0.45271662 25 jmlr-2010-Composite Binary Losses

20 0.45257282 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm