jmlr jmlr2010 jmlr2010-73 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
Reference: text
sentIndex sentText sentNum sentScore
1 In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. [sent-11, score-0.404]
2 On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. [sent-12, score-0.298]
3 Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets. [sent-15, score-0.452]
4 , 2001): risk = p(+1)c(+1)p(error|+1) + p(−1)c(−1)p(error|−1), (1) where c(+1) and c(−1) denote the costs of a false negative and false positive, respectively, p(+1) and p(−1) are the prior probabilities for classes y = +1 and y = −1, and p(error|+1) is the false negative rate. [sent-20, score-0.625]
5 Thus, the risk minimization problem is uniquely defined by the ratio p(+1)c(+1) / [p(−1)c(−1)]; that is, even though the priors and costs may vary, as long as this ratio stays constant, the optimization problem is unchanged. [sent-26, score-0.573]
6 Minimizing risk is synonymous with optimally trading off the false negative and false positive rates. [sent-34, score-0.481]
7 However, “reading off” the expected loss from an ROC graph is not straightforward, and Drummond and Holte (2000) proposed cost curves as an explicit visualization of a classifier’s risk for varying misclassification costs and class priors. [sent-37, score-0.579]
8 Since the ratio p(+1)c(+1) / [p(−1)c(−1)] is unbounded, the curves instead show the risk as a function of the probability cost function (pcf): pcf = p(+1)c(+1) / [p(+1)c(+1) + p(−1)c(−1)]. [sent-38, score-0.929]
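As a small illustration of the quantity just defined, the sketch below computes the pcf from priors and costs; the function and argument names are hypothetical, not taken from the paper.

```python
def pcf(prior_pos, prior_neg, cost_fn, cost_fp):
    """Probability cost function: p(+1)c(+1) / [p(+1)c(+1) + p(-1)c(-1)].

    cost_fn is the cost of a false negative (an error on a true +1 example),
    cost_fp the cost of a false positive.
    """
    num = prior_pos * cost_fn
    return num / (num + prior_neg * cost_fp)

# Example: equal priors, false negatives four times as costly as false positives.
print(pcf(0.5, 0.5, 4.0, 1.0))  # 0.8
```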
9 If the classifier is based on the log of the ratio of true class posterior probabilities, the threshold should be modified by a value equal to the log of the ratio of misclassification costs (Duda et al. [sent-42, score-0.296]
10 Elkan (2001) proposes handling asymmetric misclassification costs by retraining the classifier on a training set in which the proportion of positive and negative examples is matched to the ratio of misclassification costs. [sent-47, score-0.235]
11 The posteriors for the labeled examples are estimated via bagging and then used in conjunction with the Bayesian minimum risk criterion to assign new labels to the supervised data. [sent-52, score-0.405]
12 It is also important to point out that risk minimization is a diverse problem spanning multiple research communities; in particular, significant contributions to the problem have been made in the econometrics literature. [sent-61, score-0.367]
13 It is noted therein that to construct the minimum risk decision, the model density need not match the true density; rather, it is only required that the classifier output from the model density falls on the same side of the threshold as the classifier output using the true density. [sent-63, score-0.431]
14 In the former case, shifting the threshold of the maximum likelihood (ML) solution is an asymptotically optimal solution to the risk minimization problem, and in the following we provide a proof of this important point. [sent-67, score-0.475]
15 However, the choice to employ a simple classifier brings many advantages: ease of implementation, fewer parameters to estimate, a reduced risk of over-fitting, and consequently simplified regularization procedures. [sent-70, score-0.334]
16 In this case, we demonstrate that thresholded ML is suboptimal, and that the minimum risk solution varies with the ratio of misclassification costs. [sent-72, score-0.664]
17 In this paper, we employ a sigmoidal approximation of the empirical risk to yield a novel minimizer of the loss under asymmetric misclassification cost values. [sent-74, score-0.755]
18 We show analytically that the negative weighted log likelihood serves as an upper bound to the sigmoidal empirical risk. [sent-76, score-0.431]
19 Based on the convexity of the negative weighted log likelihood and forthcoming numerical results, we will argue that weighted ML is generally the preferred technique. [sent-77, score-0.218]
20 If there is also no cost for a correct decision, then c(y) is simply the cost of a false positive (y = −1) or false negative (y = +1). [sent-82, score-0.242]
21 The optimal decision rule may be written as: ŷ(x) = sgn[ f(x) − ln c_o(x) ], (2) where ŷ(x) is the predicted class given feature vector x, sgn is the signum function (sgn(x) = 1 for x > 0 and −1 for x ≤ 0), and f(x) is the discriminant function f(x) = ln[ p(+1|x) / p(−1|x) ]. [sent-85, score-0.282]
22 The model discriminant is written as f(x, θ) = ln[ p(+1|x; θ) / p(−1|x; θ) ], and the classifier takes the form: ŷ(x, θ) = sgn[ f(x, θ) − ln c_o(x) ]. [sent-90, score-0.208]
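A minimal sketch of this thresholded decision rule, assuming a model posterior p(+1|x; θ) is available and taking c_o(x) to be the cost ratio c(−1, x)/c(+1, x), the usual Bayes threshold; the function and variable names are illustrative assumptions.

```python
import numpy as np

def cost_sensitive_predict(post_pos, cost_fn, cost_fp):
    """y_hat = sgn[f(x) - ln c_o], with f(x) = ln p(+1|x) / p(-1|x).

    post_pos: array of model posteriors p(+1|x; theta).
    cost_fn = c(+1) (false negative cost), cost_fp = c(-1) (false positive cost).
    """
    f = np.log(post_pos) - np.log(1.0 - post_pos)  # discriminant: log posterior ratio
    shift = np.log(cost_fp / cost_fn)              # ln of the cost ratio c(-1)/c(+1)
    return np.where(f > shift, 1, -1)

# With costlier false negatives, the threshold drops and more points are labeled +1.
print(cost_sensitive_predict(np.array([0.1, 0.3, 0.6]), cost_fn=4.0, cost_fp=1.0))
```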
23 In order to treat these estimation methods, we briefly outline the risk minimization framework which allows for the subsequent analysis of the various cost-sensitive loss functions. [sent-92, score-0.399]
24 The problems of regression, density estimation, and pattern recognition may all be formulated within the context of risk minimization, simply by altering the loss function L, as outlined in Vapnik (1998, 1999). [sent-97, score-0.366]
25 Assuming continuity of the log likelihood (in θ), we have that lim_{N→∞} ŷ(x, θ_ML) = ŷ(x, θ*), and thus the thresholded ML estimate yields the minimum risk decision rule for all cost ratios. [sent-109, score-0.75]
26 , 2005) that the risk function is convex, and an iteratively reweighted least squares (IRLS) algorithm locates the optimal linear classifier often within a few iterations. [sent-113, score-0.36]
27 In such cases, the classifier in the model set which minimizes risk will vary with the pcf. [sent-117, score-0.334]
28 The purpose of this exercise is not to argue for a simple Gaussian model or a linear classifier but rather to demonstrate in an analytically tractable case the problem that arises with thresholded ML when the model is misspecified. [sent-121, score-0.238]
29 Figure 1(a) displays the minimum risk quadrics for various values of the pcf, where it is assumed that p(+1) = p(−1). [sent-135, score-0.42]
30 On the other hand, Figure 1(b) depicts the minimum risk planes for the same data, which were computed by minimizing (7) using numerical optimization techniques. [sent-137, score-0.427]
31 It is clear that the direction of the optimal plane is a function of the pcf, and a threshold shift of the minimum error plane is not an optimal solution to the risk minimization problem. [sent-138, score-0.59]
32 The risks obtained by applying the ML and minimum risk planes to the given data are shown in Figure 1(d). [sent-143, score-0.427]
33 In the figure, we normalize the raw risk of (1) by the “trivial risk”, which is defined as the risk achieved by the scheme: y_trivial = sgn[ p(+1)c(+1) − p(−1)c(−1) ]. [sent-144, score-0.710]
34 A normalized risk less than 1 indicates that the classification scheme yields a “savings” over the a priori decision rule. [sent-146, score-0.366]
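A sketch of this normalization, using empirical error rates in place of the true conditional error probabilities; the function and variable names are assumptions made for illustration.

```python
import numpy as np

def empirical_risk(y_true, y_pred, prior_pos, prior_neg, cost_fn, cost_fp):
    """Plug empirical error rates into (1)."""
    pos, neg = (y_true == 1), (y_true == -1)
    fnr = np.mean(y_pred[pos] == -1) if pos.any() else 0.0  # p(error|+1)
    fpr = np.mean(y_pred[neg] == 1) if neg.any() else 0.0   # p(error|-1)
    return prior_pos * cost_fn * fnr + prior_neg * cost_fp * fpr

def trivial_risk(prior_pos, prior_neg, cost_fn, cost_fp):
    """Risk of always predicting sgn[p(+1)c(+1) - p(-1)c(-1)]."""
    return min(prior_pos * cost_fn, prior_neg * cost_fp)

# Normalized risk = empirical_risk(...) / trivial_risk(...); values below 1 indicate a "savings".
```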
35 The ML classifier was trained on each ensemble and the resulting risk computed by substituting the solution into (7). [sent-149, score-0.334]
36 The risk margin between the threshold-shifted ML solution and that of the minimum risk plane is what is “available” for cost-sensitive learning algorithms to improve upon. [sent-150, score-0.764]
37 These methods attempt to learn, for each pcf, the minimum risk plane shown in Figure 1(b), to achieve the dashed cost curve in Figure 1(d). [sent-151, score-0.49]
38 The difference between the threshold-shifted ML and cost-sensitive paradigms may be understood in terms of ROC analysis—Figure 1 (e) depicts the ROC curves for the thresholded ML and minimum risk classifiers. [sent-152, score-0.654]
[Figure 1 plot residue: axis ticks and labels (x1, x2, Cost / Trivial Cost) and pcf legend entries spanning panels (a)-(e), including (a) Minimum risk quadrics and (e) ROC curves.]
42 Figure 1: Minimum risk classification of Gaussian data with unequal covariance matrices. [sent-190, score-0.400]
43 The empirical risk follows as: R_emp^rel(θ) = −(1/N) Σ_n ln p[ g(x_n) | x_n; θ ]. [sent-196, score-0.450]
44 As the re-labeling of (8) varies with the ratio of misclassification costs, the resulting cost-sensitive classifier is a function of the pcf and thus has the ability to yield the minimum risk estimator. [sent-197, score-0.922]
45 The success of the method hinges on the estimation of the posterior probabilities; in the best case scenario, the re-labeling results in the cost-sensitive boundary approaching the minimum risk boundary. [sent-198, score-0.425]
46 In contrast to the forthcoming methods, MetaCost does not reweight examples, and thus the risk function will not be dominated by the examples of a rare but costly class. [sent-199, score-0.378]
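A rough sketch of the MetaCost-style relabeling step summarized above, assuming bagged posterior estimates for the training points are already in hand; it omits the bagging itself and uses hypothetical names, so it is not the full MetaCost procedure.

```python
import numpy as np

def metacost_relabel(post_pos, cost_fn, cost_fp):
    """Assign each training point the label with the lower conditional risk.

    post_pos: bagged estimates of p(+1|x_n) for each training example.
    Predicting -1 risks post_pos * c(+1); predicting +1 risks (1 - post_pos) * c(-1).
    """
    risk_if_neg = post_pos * cost_fn
    risk_if_pos = (1.0 - post_pos) * cost_fp
    return np.where(risk_if_pos < risk_if_neg, 1, -1)

# The base classifier is then retrained by ordinary ML on the relabeled targets g(x_n).
```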
47 , 2003): L(y, x, θ) = −c(y, x) ln p(y|x; θ), such that the corresponding empirical risk takes the form: R_emp^wml(θ) = −(1/N) Σ_n c(y_n, x_n) ln p(y_n | x_n; θ). [sent-205, score-0.562]
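A minimal sketch of this weighted log-likelihood objective for a linear logistic model p(y|x; θ) = 1/(1 + exp(−y x·θ)); the names and the specific model choice are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def weighted_nll(theta, X, y, c):
    """Negative weighted log likelihood: -(1/N) sum_n c_n ln p(y_n | x_n; theta)."""
    margins = y * (X @ theta)
    # -ln p(y|x;theta) = ln(1 + exp(-margin)); logaddexp keeps this numerically stable.
    return np.mean(c * np.logaddexp(0.0, -margins))

# Convex in theta, so any standard solver (e.g., IRLS or a quasi-Newton method) can minimize it.
```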
48 In principle, this technique allows the classification scheme to choose the model in Θ which minimizes risk for the specified cost matrix. [sent-209, score-0.394]
49 Notice that if the misclassification costs are highly asymmetric, the more “costly” examples will be heavily emphasized in the empirical risk function. [sent-211, score-0.483]
50 Furthermore, if there are only a few such examples, the classifier is at an increased risk of overfitting, since the effective number of examples is much less than N. [sent-212, score-0.334]
51 5 Sigmoidal Empirical Risk In order to relate the risk of (1) with the empirical risk (4), the appropriate loss function is found to be: L(y, x, θ) = c(y, x) 1[ ŷ(x, θ) ≠ y ]. [sent-215, score-0.729]
52 Generally, the empirical risk follows as: R_emp(θ) = (1/N) Σ_n c(y_n, x_n) 1[ ŷ(x_n, θ) ≠ y_n ], which equals a constant minus the corresponding weighted sum over correctly classified examples. [sent-217, score-0.538]
53 Since our goal is to minimize (1), the optimization of the empirical risk under the direct loss of (10) is of great importance. [sent-220, score-0.421]
54 The sigmoidal approximation of the empirical risk is: R~_emp(θ) = (1/N) Σ_n c(y_n, x_n) / (1 + e^{y_n f(x_n, θ)}). The classifier which minimizes this empirical risk follows as: θ~ = arg min_θ R~_emp(θ). [sent-225, score-0.421]
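A sketch of this sigmoidal empirical risk under the same linear logistic discriminant assumption; names are illustrative.

```python
import numpy as np
from scipy.special import expit

def sigmoidal_risk(theta, X, y, c):
    """Sigmoidal empirical risk: (1/N) sum_n c_n / (1 + exp(y_n f(x_n, theta)))."""
    margins = y * (X @ theta)
    # expit(-m) = 1/(1 + exp(m)): near 0 for confidently correct points, near 1 for errors.
    return np.mean(c * expit(-margins))
```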
55 On the other hand, the risk function is non-convex, complicating the minimization of (13). [sent-227, score-0.367]
56 In this subsection, we show that the negative weighted log likelihood, a convex function, provides a tight upper bound of the sigmoidal empirical risk. [sent-231, score-0.394]
57 The bound rests on the elementary inequality z ≤ −ln(1 − z) for z ∈ [0, 1). (14) Substituting z = 1/(1 + e^{y_n f(x_n, θ)}) into (14) results in: 1/(1 + e^{y_n f(x_n, θ)}) ≤ −ln[ 1/(1 + e^{−y_n f(x_n, θ)}) ]. [sent-233, score-0.346]
58 Combining these inequalities over the training examples and assuming strict positivity of the weights c(y_n, x_n), we obtain: Σ_n c(y_n, x_n) / (1 + e^{y_n f(x_n, θ)}) ≤ −Σ_n c(y_n, x_n) ln[ 1/(1 + e^{−y_n f(x_n, θ)}) ]. [sent-234, score-0.366]
59 Since the negative weighted log-likelihood is convex, to circumvent the non-convexity of the sigmoidal empirical risk, one option is to employ the weighted likelihood loss function. [sent-240, score-0.531]
60 The relationship between the misclassification cost ratio and the pcf is given by: c(−1)/c(+1) = [ p(+1)(1 − pcf) ] / [ p(−1) pcf ]. [sent-254, score-0.562]
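For completeness, a small helper inverting this relation to recover the cost ratio from a pcf value under assumed priors; the names are hypothetical.

```python
def cost_ratio_from_pcf(pcf_value, prior_pos, prior_neg):
    """Return c(-1)/c(+1) = p(+1)(1 - pcf) / [p(-1) pcf]."""
    return prior_pos * (1.0 - pcf_value) / (prior_neg * pcf_value)

# With equal priors, pcf = 0.8 corresponds to c(-1)/c(+1) = 0.25.
print(cost_ratio_from_pcf(0.8, 0.5, 0.5))  # 0.25
```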
61 In order to solve the minimum risk optimization of (13), the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Fletcher, 2000) was employed in conjunction with multiple random restarts: a random starting vector is chosen, the BFGS algorithm is run, and the training risk evaluated. [sent-264, score-0.738]
62 This process is repeated 100 times, and the final solution is selected as the classifier which yields the lowest training risk among the runs. [sent-265, score-0.355]
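A sketch of the restart procedure just described, using SciPy's BFGS with numerical gradients; the restart count, the sigmoidal_risk helper from the earlier sketch, and all names are assumptions rather than the authors' code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def sigmoidal_risk(theta, X, y, c):
    return np.mean(c * expit(-y * (X @ theta)))

def fit_sigmoidal(X, y, c, n_restarts=100, seed=0):
    """BFGS from random starting points; keep the solution with the lowest training risk."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        theta0 = rng.normal(size=X.shape[1])
        res = minimize(sigmoidal_risk, theta0, args=(X, y, c), method="BFGS")
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```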
63 At N = 10, there is a substantial loss margin between the minimum risk plane and that which is achieved by the thresholded ML technique. [sent-270, score-0.7]
64 However, it is clear that the cost-sensitive techniques are not able to provide reliable estimates of the minimum risk plane direction with such limited data. [sent-271, score-0.43]
65 As the size of the training set increases, the sigmoidal risk estimator converges to the minimum risk plane. [sent-272, score-1.036]
66 Notice, however, that with such large sample sizes, the thresholded ML technique is relatively adept at yielding risk values comparable to the true minimum. [sent-273, score-0.572]
67 The reason for this is that the examples which are misclassified by thresholded ML and classified correctly by the other techniques are mostly the low-cost examples (compare Fig. [sent-274, score-0.238]
68 Also shown in all plots is the Bayes risk, which is the risk attained by the minimum risk quadric. [sent-277, score-0.717]
69 In Fig. 3(a) and (b), it is once again apparent that given a modest value of N, the benefits provided by cost-sensitive learners over the thresholded ML approach are not substantial. [sent-283, score-0.238]
70 In Fig. 3(c) and (d), one observes a tangible loss reduction of the sigmoidal risk estimator over 1. [sent-285, score-0.664]
71 Since only one “ensemble” is available in the experiments with real data, we treat the cost (at test time) of each example as an iid realization of the risk and report standard errors of the mean across examples. [sent-288, score-0.394]
[Figure 2 legend and axis residue: Thresholded ML, MetaCost, Weighted ML, Sigmoidal, Minimum risk hyperplane, Bayes Risk; pcf axis; panel labels include (c) N = 1000 and (d) N = 10000.]
76 Figure 2: Cost curves for Gaussian data with varying training set sizes. [sent-327, score-0.513]
77 In (d), the thresholded ML and MetaCost curves are nearly equivalent. [sent-328, score-0.271]
78 Fig. 3(e) demonstrates the near-optimality of weighted ML and its close approximation of the sigmoidal risk minimizing solution. [sent-340, score-0.675]
[Figure 3 legend and axis residue: pcf axis; Thresholded ML, MetaCost, Weighted ML, Sigmoidal; panel labels include (c) Magic, (d) Adult, and (e) EEG.]
80 Figure 3: Cost curves (with standard errors of the means) for various real data sets with synthetic costs. [sent-394, score-0.492]
81 3 German Banking Data Finally, we evaluated the risk minimizing classifiers on a publicly available data set collected by a German bank: http://www. [sent-396, score-0.334]
82 We conducted a leave-one-out cross-validation of the thresholded ML, weighted ML, and sigmoidal risk estimators on this N = 1000 example data set (MetaCost is not applicable to problems with example-dependent costs). [sent-414, score-0.913]
83 On average, the means obtained by the thresholded ML, weighted ML, and sigmoidal risk estimators are DM 3. [sent-416, score-0.913]
84 A statistically significant improvement in NPV is achieved by both WML and sigmoidal risk estimators over the thresholded ML solution (p < 0. [sent-425, score-0.845]
85 On the other hand, statistical significance cannot be established between WML and sigmoidal risk (p ≈ 0. [sent-427, score-0.607]
86 Initially, we were motivated by the observation that with a misspecified model, the direction of the minimum risk hyperplane is a function of the ratio of misclassification costs and the class priors. [sent-432, score-0.605]
87 Since the ML approach inherently shifts the threshold of the minimum error hyperplane, we sought to develop an estimator which, given a ratio of misclassification costs, will find the direction required to minimize risk rather than maximize likelihood. [sent-433, score-0.509]
88 This led us to the development of the sigmoidal empirical risk minimizer. [sent-435, score-0.636]
89 Firstly, the search for the minimum risk hyperplane is non-trivial: regularization techniques proved to be necessary, particularly in the case of a limited training set. [sent-437, score-0.463]
90 Moreover, both the existing and proposed cost-sensitive learning techniques yield the greatest benefits over thresholded ML when presented with large amounts of training data. [sent-438, score-0.259]
91 When abundant data is available, the sigmoidal risk estimator typically outperforms all other methods, but weighted ML yields quite comparable values. [sent-439, score-0.7]
92 From the so-called structural risk minimization principle, it is well-known that a simpler model set yields empirical risks that are closer to the true risk (Vapnik, 1998). [sent-444, score-0.73]
93 The results with Gaussian data presented above appear to indicate that the sigmoidal risk minimizer tends to the true minimum risk model given enough data. [sent-453, score-0.99]
94 However, the weighted ML estimator provides a tight upper bound on the sigmoidal empirical risk and thus this solution is not far from optimal. [sent-454, score-0.729]
95 Lastly, as seen in Figure 2, both the opportunity and the challenge in cost-sensitive learning lie in the ability to estimate the minimum risk model with limited data. [sent-456, score-0.383]
96 , with few examples relative to the dimensionality of the feature space) inference of the minimum risk classifier. [sent-459, score-0.383]
97 On the other hand, with a misspecified model, the risk minimizing solution is a function of the misclassification cost ratios, and thresholding the ML estimate is sub-optimal. [sent-464, score-0.394]
98 A novel estimator based on a sigmoidal estimation of the empirical risk was presented and shown to outperform conventional techniques provided enough data; however, the negative weighted log likelihood was analytically and empirically shown to tightly upper bound the sigmoidal loss. [sent-465, score-1.063]
99 For a more detailed version as well as implementations of thresholded ML and sigmoidal risk minimization, please refer to http://bme. [sent-475, score-0.845]
100 Learning when data sets are imbalanced and when costs are unequal and unknown. [sent-591, score-0.21]
wordName wordTfidf (topN-words)
[('pcf', 0.459), ('ml', 0.438), ('risk', 0.334), ('sigmoidal', 0.273), ('thresholded', 0.238), ('metacost', 0.186), ('parra', 0.162), ('mochowski', 0.124), ('costs', 0.12), ('yn', 0.117), ('ensitive', 0.112), ('npv', 0.112), ('misclassi', 0.111), ('lieli', 0.099), ('ajda', 0.095), ('misspeci', 0.088), ('aximum', 0.086), ('ikelihood', 0.079), ('lambda', 0.074), ('weighted', 0.068), ('er', 0.064), ('elkan', 0.062), ('classi', 0.061), ('roc', 0.06), ('cost', 0.06), ('hyperplane', 0.059), ('xn', 0.058), ('co', 0.058), ('imbalanced', 0.057), ('loan', 0.056), ('ln', 0.054), ('dm', 0.054), ('zadrozny', 0.053), ('remp', 0.053), ('applicant', 0.05), ('elliott', 0.05), ('minimum', 0.049), ('false', 0.049), ('threshold', 0.048), ('plane', 0.047), ('planes', 0.044), ('ratio', 0.043), ('mccullagh', 0.042), ('posterior', 0.042), ('sgn', 0.042), ('trivial', 0.041), ('duda', 0.039), ('earning', 0.038), ('costsensitive', 0.037), ('dmochowski', 0.037), ('irls', 0.037), ('quadrics', 0.037), ('wml', 0.037), ('likelihood', 0.037), ('asymmetry', 0.035), ('minimization', 0.033), ('unequal', 0.033), ('emp', 0.033), ('curves', 0.033), ('loss', 0.032), ('shift', 0.032), ('decision', 0.032), ('jacek', 0.032), ('domingos', 0.032), ('erm', 0.031), ('empirical', 0.029), ('nelder', 0.029), ('eeg', 0.029), ('city', 0.028), ('asymmetric', 0.027), ('reweighted', 0.026), ('minimize', 0.026), ('estimator', 0.025), ('white', 0.025), ('biographical', 0.025), ('drummond', 0.025), ('dudik', 0.025), ('endogenous', 0.025), ('nonstandard', 0.025), ('prevalences', 0.025), ('relabeled', 0.025), ('repayment', 0.025), ('rwml', 0.025), ('synonymous', 0.025), ('transfusion', 0.025), ('vold', 0.025), ('lucas', 0.025), ('negative', 0.024), ('credit', 0.023), ('shifting', 0.023), ('costly', 0.023), ('biomedical', 0.022), ('posteriors', 0.022), ('magic', 0.021), ('prevalent', 0.021), ('chawla', 0.021), ('forthcoming', 0.021), ('erf', 0.021), ('haberman', 0.021), ('training', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
2 0.19195464 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
3 0.11553397 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
4 0.10164212 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
5 0.094194256 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
Author: Isabelle Guyon, Amir Saffari, Gideon Dror, Gavin Cawley
Abstract: The principle of parsimony also known as “Ockham’s razor” has inspired many theories of model selection. Yet such theories, all making arguments in favor of parsimony, are based on very different premises and have developed distinct methodologies to derive algorithms. We have organized challenges and edited a special issue of JMLR and several conference proceedings around the theme of model selection. In this editorial, we revisit the problem of avoiding overfitting in light of the latest results. We note the remarkable convergence of theories as different as Bayesian theory, Minimum Description Length, bias/variance tradeoff, Structural Risk Minimization, and regularization, in some approaches. We also present new and interesting examples of the complementarity of theories leading to hybrid algorithms, neither frequentist, nor Bayesian, or perhaps both frequentist and Bayesian! Keywords: model selection, ensemble methods, multilevel inference, multilevel optimization, performance prediction, bias-variance tradeoff, Bayesian priors, structural risk minimization, guaranteed risk minimization, over-fitting, regularization, minimum description length
6 0.090828449 10 jmlr-2010-An Exponential Model for Infinite Rankings
7 0.086133383 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
8 0.082236975 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
9 0.070739642 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
10 0.060029268 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
11 0.057860665 25 jmlr-2010-Composite Binary Losses
12 0.053647339 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
13 0.053242356 22 jmlr-2010-Classification Using Geometric Level Sets
14 0.049663398 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
15 0.049403388 109 jmlr-2010-Stochastic Composite Likelihood
16 0.047209922 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
17 0.046254203 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
18 0.04411044 60 jmlr-2010-Learnability, Stability and Uniform Convergence
19 0.043657396 61 jmlr-2010-Learning From Crowds
20 0.039138649 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
topicId topicWeight
[(0, -0.196), (1, -0.021), (2, 0.04), (3, 0.071), (4, -0.052), (5, 0.142), (6, 0.208), (7, 0.189), (8, -0.174), (9, 0.044), (10, -0.102), (11, -0.123), (12, -0.074), (13, -0.227), (14, -0.133), (15, -0.189), (16, 0.219), (17, 0.115), (18, -0.132), (19, -0.059), (20, -0.08), (21, -0.032), (22, -0.048), (23, 0.086), (24, 0.051), (25, 0.119), (26, 0.031), (27, 0.007), (28, 0.13), (29, -0.117), (30, 0.042), (31, 0.151), (32, -0.019), (33, 0.103), (34, -0.005), (35, 0.048), (36, -0.045), (37, -0.074), (38, -0.122), (39, -0.018), (40, 0.03), (41, -0.015), (42, -0.02), (43, 0.038), (44, 0.035), (45, 0.05), (46, 0.03), (47, 0.051), (48, -0.021), (49, 0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.93311936 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
2 0.6038034 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
3 0.50630808 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
Author: Franz Pernkopf, Jeff A. Bilmes
Abstract: We introduce a simple order-based greedy heuristic for learning discriminative structure within generative Bayesian network classifiers. We propose two methods for establishing an order of N features. They are based on the conditional mutual information and classification rate (i.e., risk), respectively. Given an ordering, we can find a discriminative structure with O N k+1 score evaluations (where constant k is the tree-width of the sub-graph over the attributes). We present results on 25 data sets from the UCI repository, for phonetic classification using the TIMIT database, for a visual surface inspection task, and for two handwritten digit recognition tasks. We provide classification performance for both discriminative and generative parameter learning on both discriminatively and generatively structured networks. The discriminative structure found by our new procedures significantly outperforms generatively produced structures, and achieves a classification accuracy on par with the best discriminative (greedy) Bayesian network learning approach, but does so with a factor of ∼10-40 speedup. We also show that the advantages of generative discriminatively structured Bayesian network classifiers still hold in the case of missing features, a case where generative classifiers have an advantage over discriminative classifiers. Keywords: Bayesian networks, classification, discriminative learning, structure learning, graphical model, missing feature
4 0.43805227 10 jmlr-2010-An Exponential Model for Infinite Rankings
Author: Marina Meilă, Le Bao
Abstract: This paper presents a statistical model for expressing preferences through rankings, when the number of alternatives (items to rank) is large. A human ranker will then typically rank only the most preferred items, and may not even examine the whole set of items, or know how many they are. Similarly, a user presented with the ranked output of a search engine, will only consider the highest ranked items. We model such situations by introducing a stagewise ranking model that operates with finite ordered lists called top-t orderings over an infinite space of items. We give algorithms to estimate this model from data, and demonstrate that it has sufficient statistics, being thus an exponential family model with continuous and discrete parameters. We describe its conjugate prior and other statistical properties. Then, we extend the estimation problem to multimodal data by introducing an Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of our model and demonstrate that infinite models over permutations can be simple, elegant and practical. Keywords: permutations, partial orderings, Mallows model, distance based ranking model, exponential family, non-parametric clustering, branch-and-bound
5 0.42116591 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
6 0.42088959 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
7 0.34519312 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.32238492 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.295573 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
10 0.29164487 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
11 0.27781299 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
12 0.26987013 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
13 0.2695975 25 jmlr-2010-Composite Binary Losses
14 0.26688746 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
15 0.23503932 22 jmlr-2010-Classification Using Geometric Level Sets
16 0.2307312 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
17 0.21666823 63 jmlr-2010-Learning Instance-Specific Predictive Models
18 0.20545459 109 jmlr-2010-Stochastic Composite Likelihood
19 0.18631233 102 jmlr-2010-Semi-Supervised Novelty Detection
20 0.18031488 40 jmlr-2010-Fast and Scalable Local Kernel Machines
topicId topicWeight
[(4, 0.011), (5, 0.01), (8, 0.016), (21, 0.015), (32, 0.611), (36, 0.024), (37, 0.028), (75, 0.104), (81, 0.011), (85, 0.06), (96, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.92250085 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
2 0.8926875 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
3 0.86682928 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
4 0.85791922 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
5 0.54306 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
6 0.52551532 60 jmlr-2010-Learnability, Stability and Uniform Convergence
7 0.51960433 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.49815035 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.48616508 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
10 0.47203094 102 jmlr-2010-Semi-Supervised Novelty Detection
11 0.47010761 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
12 0.46619642 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
13 0.46578458 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
14 0.45901603 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
15 0.45898023 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
16 0.45621285 109 jmlr-2010-Stochastic Composite Likelihood
17 0.45378983 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.45299321 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
19 0.45271662 25 jmlr-2010-Composite Binary Losses
20 0.45257282 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm