nips nips2008 nips2008-228 knowledge-graph by maker-knowledge-mining

228 nips-2008-Support Vector Machines with a Reject Option


Source: pdf

Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu

Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. [sent-2, score-0.37]

2 From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. [sent-3, score-1.174]

3 We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. [sent-5, score-0.998]

4 We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. [sent-6, score-0.22]

5 1 Introduction In decision problems where errors incur a severe loss, one may have to build classifiers that abstain from classifying ambiguous examples. [sent-7, score-0.32]

6 In particular, Chow (1970) analyses how the error rate may be decreased thanks to the reject option. [sent-9, score-0.297]

7 There have been several attempts to integrate a reject option in Support Vector Machines (SVMs), using strategies based on the thresholding of SVM scores (Kwok, 1999) or on a new training criterion (Fumera & Roli, 2002). [sent-10, score-0.489]

8 We introduce a piecewise linear and convex training criterion dedicated to the problem of classification with the reject option. [sent-12, score-0.438]

9 , 2006), is a double hinge loss, reflecting the two thresholds in Chow’s rule. [sent-14, score-0.725]

10 Hence, we generalize the loss suggested by Bartlett and Wegkamp (2008) to arbitrary asymmetric misclassification and rejection costs. [sent-15, score-0.515]

11 For the symmetric case, our probabilistic viewpoint motivates another decision rule. [sent-16, score-0.317]

12 We then propose the first algorithm specifically dedicated to training SVMs with a double hinge loss. [sent-17, score-0.718]

13 Its implementation shows that our decision rule is at least on par with that of Bartlett and Wegkamp (2008). [sent-18, score-0.299]

14 Section 2 defines the problem and recalls Bayes rule for binary classification with a reject option. [sent-20, score-0.416]

15 The proposed double hinge loss is derived in Section 3, together with the decision rule associated with SVM scores. [sent-21, score-1.141]

16 Section 4 addresses implementation issues: it formalizes the SVM training problem and details an active set algorithm specifically designed for training with the double hinge loss. [sent-22, score-0.795]

17 For this purpose, we construct a decision rule d : X → A, where A is a set of actions that typically consists in assigning a label to x ∈ X . [sent-26, score-0.299]

18 However, patterns close to the decision boundary are misclassified with high probability. [sent-29, score-0.18]

19 This problem becomes especially acute in cases where the costs, c− or c+, are high, such as in medical decision making. [sent-30, score-0.217]

20 In these processes, it might be better to alert the user and abstain from prediction. [sent-31, score-0.106]

21 This motivates the introduction of a reject option for classifiers that cannot predict a pattern with enough confidence. [sent-32, score-0.44]

22 This decision to abstain, which is denoted by 0, incurs a cost, r− and r+ for examples labeled −1 and +1, respectively. [sent-33, score-0.258]

23 The costs pertaining to each possible decision are recapped below. [sent-34, score-0.246]

24 In what follows, we assume that all costs are strictly positive. Furthermore, it should be possible to incur a lower expected loss by choosing the reject option instead of either prediction, that is, c− r+ + c+ r− < c− c+. [sent-35, score-0.67]

25 The costs are: deciding +1 costs 0 if y = +1 and c− if y = −1; rejecting (deciding 0) costs r+ if y = +1 and r− if y = −1; deciding −1 costs c+ if y = +1 and 0 if y = −1. Bayes' decision theory is the paramount framework in statistical decision theory, where decisions are taken to minimize expected losses. [sent-37, score-0.36]

26 For classification with a reject option, the overall risk is R(d) = c+ E_XY[1{Y = 1, d(X) = −1}] + c− E_XY[1{Y = −1, d(X) = +1}] + r+ E_XY[1{Y = 1, d(X) = 0}] + r− E_XY[1{Y = −1, d(X) = 0}], (3) where X and Y denote the random variables describing patterns and labels. [sent-38, score-0.361]

27 Since the seminal paper of Chow (1970), this rule is sometimes referred to as Chow's rule: d*(x) = +1 if P(Y = 1|X = x) > p+, −1 if P(Y = 1|X = x) < p−, and 0 otherwise (4), where p+ = (c− − r−)/(c− − r− + r+) and p− = r−/(c+ − r+ + r−). [sent-40, score-0.119]
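
A minimal Python sketch of this plug-in rule may help fix ideas. The closed-form expressions for p+ and p− below follow from comparing the expected costs of the three actions under the cost structure recapped above (they reduce to p− = r and p+ = 1 − r in the symmetric case discussed later); the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def chow_thresholds(c_plus, c_minus, r_plus, r_minus):
    """Thresholds p- and p+ of Chow's rule (4), assuming the reject option is
    worthwhile, i.e. c- r+ + c+ r- < c- c+."""
    p_plus = (c_minus - r_minus) / (c_minus - r_minus + r_plus)
    p_minus = r_minus / (c_plus - r_plus + r_minus)
    return p_minus, p_plus

def chow_rule(eta_hat, p_minus, p_plus):
    """Plug-in decision from an estimated posterior eta_hat = P(Y=1|X=x):
    +1 above p+, -1 below p-, 0 (reject) in between."""
    eta_hat = np.asarray(eta_hat, dtype=float)
    d = np.zeros(eta_hat.shape, dtype=int)      # default: reject
    d[eta_hat > p_plus] = 1
    d[eta_hat < p_minus] = -1
    return d

p_minus, p_plus = chow_thresholds(1.0, 1.0, 0.45, 0.45)   # symmetric costs
print(p_minus, p_plus)                                    # 0.45 0.55
print(chow_rule([0.9, 0.5, 0.1], p_minus, p_plus))        # [ 1  0 -1]
```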

28 One of the major inductive principles is empirical risk minimization, where one minimizes the empirical counterpart of the risk (3). [sent-42, score-0.128]

29 For example, Vapnik (1995) motivated the hinge loss as a “computationally simple” (i.e., convex) surrogate. [sent-44, score-0.559]

30 The following section is dedicated to the construction of such a surrogate for classification with a reject option. [sent-47, score-0.406]

31 3 Training Criterion One method to get around the hardness of learning decision functions is to replace the conditional probability P(Y = 1|X = x) with an estimate P̂(Y = 1|X = x), and then plug this estimate back into (4) to build a classification rule (Herbei & Wegkamp, 2006). [sent-48, score-0.332]

32 Figure 1: Double hinge loss function ℓ_{p−,p+} for positive (left) and negative (right) examples, plotted against f(x) with the thresholds f− and f+ marked. [sent-53, score-0.683]

33 Note that the decision thresholds f+ and f− are not symmetric around zero. [sent-56, score-0.298]

34 In the standard logistic regression procedure, the loss ℓ is the negative log-likelihood ℓ(y, f(x)) = log(1 + exp(−y f(x))). [sent-59, score-0.276]

35 This loss function is convex and decision-calibrated (Bartlett & Tewari, 2007), but it lacks an appealing feature of the hinge loss used in SVMs, that is, it does not lead to sparse solutions. [sent-60, score-0.787]

36 However, the definition of the Bayes’ rule (4) clearly shows that the estimation of P(Y = 1|X = x) does not have to be accurate everywhere, but only in the vicinity of p+ and p− . [sent-62, score-0.173]

37 This motivates the construction of a training criterion that focuses on this goal, without estimating P(Y = 1|X = x) on the whole range as an intermediate step. [sent-63, score-0.135]

38 Our purpose is to derive such a loss function, without sacrificing sparsity to the consistency of the decision rule. [sent-64, score-0.397]

39 Though not a proper negative log-likelihood, the hinge loss can be interpreted in a maximum a posteriori framework: it can be derived as a relaxed minimization of the negative log-likelihood (Grandvalet et al. [sent-65, score-1.21]

40 According to this viewpoint, minimizing the hinge loss aims at deriving a loose approximation to the logistic regression model (5) that is accurate only at f(x) = 0, thus allowing one to estimate whether P(Y = 1|X = x) > 1/2 or not. [sent-67, score-0.672]

41 More generally, one can show that, in order to have a precise estimate of P(Y = 1|X = x) = p, the surrogate loss should be tangent to the neg-log-likelihood at f = log(p/(1 − p)). [sent-68, score-0.233]

42 Following this simple constructive principle, we derive the double hinge loss, which aims at reliably estimating P(Y = 1|X = x) at the threshold points p+ and p− . [sent-69, score-0.719]

43 Furthermore, to encourage sparsity, we set the loss to zero for all points classified with high confidence. [sent-70, score-0.188]
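
The tangent construction can be made concrete with a short sketch. Assuming H(p) = −p log p − (1 − p) log(1 − p) is the intercept of the tangent to the negative log-likelihood (so that, for a positive example, the tangent at the score where P(Y = 1|x) = p is −(1 − p) f + H(p), and for a negative example it is p f + H(p)), the double hinge loss is the positive part of the upper envelope of the two tangents taken at p− and p+. This is an illustration consistent with the hinge locations f1, f2, f3 quoted below, not a verbatim copy of equations (7–8).

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p): intercept of the tangent lines."""
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def tangent(y, f, p):
    """Tangent to log(1 + exp(-y f)) at the score where P(Y=1|x) = p:
    -(1-p) f + H(p) for y = +1, and p f + H(p) for y = -1."""
    slope = np.where(y > 0, -(1.0 - p), p)
    return slope * f + binary_entropy(p)

def double_hinge(y, f, p_minus, p_plus):
    """Double hinge loss: zero for confidently well-classified points,
    piecewise linear (with two slopes) otherwise."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    return np.maximum(0.0, np.maximum(tangent(y, f, p_minus),
                                      tangent(y, f, p_plus)))

# For a positive example the loss vanishes beyond f3 = H(p+)/(1 - p+).
print(double_hinge(+1.0, np.array([-2.0, 0.0, 2.0]), 0.45, 0.55))
```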

44 After training, the decision rule is defined as the plug-in estimation of (4) using the logistic regression probability estimation. [sent-74, score-0.341]

45 Let f+ = log(p+/(1 − p+)) and f− = log(p−/(1 − p−)); the decision rule can then be expressed in terms of the function f as d_{p−,p+}(x; f) = +1 if f(x) > f+, −1 if f(x) < f−, and 0 otherwise. (9) [sent-75, score-0.299]
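
Rule (9) is a direct thresholding of the learned score; a small sketch (names illustrative):

```python
import numpy as np

def reject_decision(f, p_minus, p_plus):
    """Decision rule (9): compare f(x) to f- and f+, the logits of p- and p+."""
    f_plus = np.log(p_plus / (1.0 - p_plus))
    f_minus = np.log(p_minus / (1.0 - p_minus))
    f = np.asarray(f, dtype=float)
    d = np.zeros(f.shape, dtype=int)            # 0 encodes "reject"
    d[f > f_plus] = 1
    d[f < f_minus] = -1
    return d
```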

46 The following result shows that the rule d_{p−,p+}(·; f) is universally consistent when f is learned by minimizing the empirical risk based on ℓ_{p−,p+}. [sent-76, score-0.344]

47 Hence, in the limit, learning with the double hinge loss is optimal in the sense that the risk for the learned decision rule converges to the Bayes’ risk. [sent-77, score-1.205]

48 Then, lim_{n→∞} R(d_{p−,p+}(X; f_n)) = R(d*(X)) holds almost surely, that is, the classifier d_{p−,p+}(·; f_n) is strongly universally consistent. [sent-82, score-0.189]

49 Besides mild regularity conditions that hold for ℓ_{p−,p+}, a loss function is said to be regular if, for every α ∈ [0, 1] and every t_α such that t_α = arg min_t [α ℓ_{p−,p+}(+1, t) + (1 − α) ℓ_{p−,p+}(−1, t)], we have that d_{p−,p+}(t_α, x) agrees with d*(x) almost everywhere. [sent-87, score-0.188]

50 Let f1 = −H(p−)/p−, f2 = −(H(p+) − H(p−))/(p+ − p−) and f3 = H(p+)/(1 − p+) denote the hinge locations in ℓ_{p−,p+}(±1, f(x)). [sent-88, score-0.371]

51 Note on a Close Relative A double hinge loss function has been proposed recently with a different perspective by Bartlett and Wegkamp (2008). [sent-92, score-0.842]

52 In this situation, rejection may occur only if 0 ≤ r < 1/2, and the thresholds on the conditional probabilities in Bayes’ rule (4) are p− = 1 − p+ = r. [sent-94, score-0.541]

53 For symmetric classification, the loss function of Bartlett and Wegkamp (2008) is a scaled version of our proposal that leads to equivalent solutions for f , but our decision rule differs. [sent-95, score-0.534]

54 While our probabilistic derivation of the double hinge loss motivates the decision function (9), the decision rule of Bartlett and Wegkamp (2008) has a free parameter (corresponding to the threshold f+ = −f− ) whose value is set by optimizing a generalization bound. [sent-96, score-1.379]

55 Our decision rule rejects more examples when the loss incurred by rejection is small and fewer examples otherwise. [sent-97, score-0.925]

56 4 SVMs with Double Hinge In this section, we show how the standard SVM optimization problem is modified when the hinge loss is replaced by the double hinge loss. [sent-101, score-1.248]

57 1 Optimization Problem Minimizing the regularized empirical risk (6) with the double hinge loss (7–8) is an optimization problem akin to the standard SVM problem. [sent-104, score-0.973]

58 for i = 1, …, n, where, for positive examples, t_i = H(p+)/(1 − p+) and τ_i = −(H(p−) − H(p+))/(p− − p+), while, for negative examples, t_i = H(p−)/p− and τ_i = (H(p−) − H(p+))/(p− − p+). [sent-117, score-0.314]
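
These per-example constants are easy to precompute. The sketch below assumes H is the binary entropy used in the tangent construction sketched earlier; the formulas themselves are taken from the sentence above, while the vectorized form and names are mine.

```python
import numpy as np

def binary_entropy(p):
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def qp_targets(y, p_minus, p_plus):
    """Per-example constants t_i and tau_i of the double-hinge quadratic program."""
    y = np.asarray(y)
    H_m, H_p = binary_entropy(p_minus), binary_entropy(p_plus)
    t = np.where(y > 0, H_p / (1.0 - p_plus), H_m / p_minus)
    tau = np.where(y > 0,
                   -(H_m - H_p) / (p_minus - p_plus),
                   (H_m - H_p) / (p_minus - p_plus))
    return t, tau
```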

59 t = (t_1, …, t_n)^T and τ = (τ_1, …, τ_n)^T are vectors of R^n, and G is the n × n Gram matrix with general entry G_ij = y_i y_j k(x_i, x_j). [sent-136, score-0.109]

60 For f, we have f(·) = Σ_{i=1}^n γ_i y_i k(·, x_i), and b is obtained in the optimization process described below. [sent-140, score-0.195]
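
Once the dual coefficients γ_i and the offset b are available, scoring new points is a standard kernel expansion. A minimal sketch, with a Gaussian kernel chosen purely for illustration and b simply added to the expansion (the extract does not spell out how the offset enters f):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def decision_scores(X_test, X_train, y_train, gamma, b, sigma=1.0):
    """Scores f(x) = sum_i gamma_i y_i k(x, x_i) + b on a test set."""
    K = gaussian_kernel(X_test, X_train, sigma)
    return K @ (gamma * y_train) + b
```

The scores returned here are the ones thresholded by the rule (9) sketched above.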

61 First, the partition of training examples into support and non-support vectors is assumed to be known, and the training criterion is optimized with this partition held fixed. [sent-145, score-0.332]

62 Then, this optimization results in an updated partition of examples in support and non-support vectors. [sent-146, score-0.192]

63 Partitioning the Training Set The training set is partitioned into five subsets defined by the activity of the box constraints of Problem (11). [sent-148, score-0.128]

64 Hence, provided that the partition of examples into the subsets I0, It, IC, Iτ and ID is known, we only have to consider a problem in γ. [sent-151, score-0.131]

65 The equality constraint reads Σ_{i∈IT} y_i γ_i + Σ_{i∈IC} C_i y_i + Σ_{i∈ID} (C_i + D) y_i = 0, where s_i = t_i − Σ_{j∈IC} C_j G_ji − Σ_{j∈ID} (C_j + D) G_ji for i ∈ It, and s_i = τ_i − Σ_{j∈IC} C_j G_ji − Σ_{j∈ID} (C_j + D) G_ji for i ∈ Iτ. [sent-155, score-0.498]

66 Note that the box constraints of Problem (11) do not appear here, because we assumed the partition to be correct. [sent-156, score-0.118]

67 Note that the |IT | equations of the linear system given on the first line of (13) express that, for i ∈ It , yi (f (xi ) + λ) = ti and for i ∈ Iτ , yi (f (xi ) + λ) = τi . [sent-158, score-0.313]
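
For a fixed partition, the reduced problem is thus a small linear system in (γ_i)_{i∈IT} and λ. The assembly below is a sketch of how such a system can be solved with numpy; the box-constraint bounds C_i and D for the clipped sets IC and ID are taken as inputs because their exact definition is not reproduced in this summary, and the index bookkeeping is my own.

```python
import numpy as np

def solve_reduced_system(G, y, target, C, D, I_T, I_C, I_D):
    """Solve for gamma on the active set I_T and the offset lambda, given that
    gamma_j = C_j on I_C, gamma_j = C_j + D on I_D, and gamma_j = 0 on I_0.

    G is the Gram matrix with G_ij = y_i y_j k(x_i, x_j); target holds t_i
    for i in I_t and tau_i for i in I_tau."""
    I_T, I_C, I_D = (np.asarray(I, dtype=int) for I in (I_T, I_C, I_D))
    gamma_fixed = np.zeros(len(y))
    gamma_fixed[I_C] = C[I_C]
    gamma_fixed[I_D] = C[I_D] + D
    # s_i = target_i minus the contribution of the clipped coefficients.
    s = (target[I_T]
         - G[np.ix_(I_T, I_C)] @ gamma_fixed[I_C]
         - G[np.ix_(I_T, I_D)] @ gamma_fixed[I_D])
    # Equality constraint: sum_i y_i gamma_i = 0 over all examples.
    rhs_eq = -(y[I_C] @ gamma_fixed[I_C] + y[I_D] @ gamma_fixed[I_D])
    m = len(I_T)
    A = np.zeros((m + 1, m + 1))
    A[:m, :m] = G[np.ix_(I_T, I_T)]
    A[:m, m] = y[I_T]
    A[m, :m] = y[I_T]
    rhs = np.concatenate([s, [rhs_eq]])
    sol = np.linalg.solve(A, rhs)
    return sol[:m], sol[m]          # (gamma_i)_{i in I_T} and lambda (= b)
```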

68 Algorithm The algorithm, described in Algorithm 1, simply alternates between updates of the partition of examples into {I0, It, IC, Iτ, ID} and updates of the coefficients γi for the current active set IT. [sent-160, score-0.205]

69 Algorithm 1 (SVM Training with a Reject Option): input {(x_i, y_i)}_{1≤i≤n} and hyper-parameters C, p+, p−; initialize γ^old, the active set IT = It ∪ Iτ and its complement ĪT = I0 ∪ IC ∪ ID; repeat: solve linear system (13) → (γ_i)_{i∈IT}, b = λ. [sent-165, score-0.178]

70 if any (γ_i)_{i∈IT} violates the box constraints (11), then compute the largest ρ such that γ^new = γ^old + ρ (γ − γ^old) remains feasible. [sent-166, score-0.259]

71 Exact convergence is obtained when all constraints are fulfilled, that is, when all examples belong to the same subset at the beginning and the end of the main loop. [sent-169, score-0.108]

72 However, it is possible to relax the convergence criterion while keeping good control of the precision of the solution by monitoring the duality gap, that is, the difference between the primal and dual objectives, provided in the definitions of Problems (10) and (11), respectively. [sent-170, score-0.123]

73 Table 1: Performances in terms of average test loss, rejection rate and misclassification rate (rejection is not an error), with r+ = r− = 0.45, for the three rejection methods over four different datasets. [sent-171, score-0.282]

75 5 Experiments We compare the performances of three different rejection schemes based on SVMs. [sent-227, score-0.335]

76 For this purpose, we selected the datasets from the UCI repository related to medical problems, as medical decision making is an application domain for which rejection is of primary importance. [sent-228, score-0.536]

77 Each trial consists in randomly splitting the examples into a training set containing 80% of the examples and an independent test set. [sent-230, score-0.201]

78 Note that the training examples were normalized to zero-mean and unit variance before cross-validation (test sets were of course rescaled accordingly). [sent-231, score-0.123]
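
The evaluation protocol is straightforward to reproduce; a sketch of the split-and-standardize step, with statistics computed on the training part only (the helper name is hypothetical):

```python
import numpy as np

def split_and_standardize(X, y, train_fraction=0.8, seed=0):
    """Random 80/20 split; normalize with training-set mean and variance,
    then rescale the test set with the same statistics."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_train = int(train_fraction * len(X))
    tr, te = perm[:n_train], perm[n_train:]
    mean, std = X[tr].mean(axis=0), X[tr].std(axis=0)
    std[std == 0.0] = 1.0                       # guard against constant features
    X_std = (X - mean) / std
    return X_std[tr], y[tr], X_std[te], y[te]
```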

79 In a first series of experiments, to compare our decision rule with the one proposed by Bartlett and Wegkamp (2008) (B&W's), we used symmetric costs: c+ = c− = 1 and r+ = r− = r. [sent-232, score-0.346]

80 We chose r = 0.45, which corresponds to rather low rejection rates, in order to favour different behaviors between these two decision rules (recall that they are identical for r 0. [sent-234, score-0.509]

81 Besides the double hinge loss, we also implemented a “naive” method that consists in running the standard SVM algorithm (using the hinge loss) and selecting a symmetric rejection region around zero by cross-validation. [sent-236, score-1.354]

82 This includes the selection of the kernel widths and the regularization parameter C for all methods, and additionally of the rejection thresholds for the naive method. [sent-239, score-0.445]
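
The naive baseline can be sketched in a few lines: given scores produced by a standard hinge-loss SVM, pick the half-width θ of a symmetric rejection band [−θ, θ] that minimizes the average cost on held-out data (a single validation fold here for brevity; the paper uses cross-validation). Names and the candidate grid are illustrative.

```python
import numpy as np

def rejection_cost(y, d, c=1.0, r=0.45):
    """Average cost with symmetric costs: c per error, r per rejected example."""
    rejected = (d == 0)
    wrong = (~rejected) & (d != y)
    return c * wrong.mean() + r * rejected.mean()

def select_symmetric_threshold(scores, y, thetas=np.linspace(0.0, 1.0, 101),
                               c=1.0, r=0.45):
    """Pick the rejection half-width theta minimizing the validation cost."""
    best_theta, best_cost = 0.0, np.inf
    for theta in thetas:
        d = np.where(scores > theta, 1, np.where(scores < -theta, -1, 0))
        cost = rejection_cost(y, d, c, r)
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta
```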

83 Note that B&W's and our decision rules are based on learning with the double hinge loss. [sent-240, score-0.227]

84 Hence, the results displayed in Table 1 only differ due to the size of the rejection region, and to disparities in the choice of hyper-parameters that may arise in the cross-validation process (since the decision rules differ, the cross-validation scores also differ). [sent-241, score-0.539]

85 Overall, all methods lead to equivalent average test losses, with an insignificant but consistent advantage for our decision rule. [sent-243, score-0.21]

86 We also see that the naive method tends to reject fewer test examples than the consistent methods. [sent-244, score-0.497]

87 This means that, for comparable average losses, the decision rules based on scores learned by minimizing the double hinge loss tend to classify the examples that are not rejected more accurately, as seen in the last column of the table. [sent-245, score-1.215]

88 For noisy problems such as Liver and Pima, we observed that reducing rejection costs considerably decreases the error rate on classified examples (not shown in the table). [sent-246, score-0.426]

89 The performances of the two learning methods based on the double hinge loss get closer, and there is still no significant gain compared to the naive approach. [sent-247, score-0.145]

90 Note however that the symmetric setting is favourable to the naive approach, since we only have to estimate a single decision threshold. [sent-248, score-0.319]

91 We are experimenting to see whether the double-hinge loss shows more substantial improvements for asymmetric losses and for larger training sets. [sent-249, score-0.328]

92 6 Conclusion In this paper we proposed a new solution to the general problem of classification with a reject option. [sent-250, score-0.297]

93 The double hinge loss was derived from the simple desiderata to obtain accurate estimates of posterior probabilities only in the vicinity of the decision boundaries. [sent-251, score-1.154]

94 Our formulation handles asymmetric misclassification and rejection costs and compares favorably to the one of Bartlett and Wegkamp (2008). [sent-252, score-0.393]

95 We showed that for suitable kernels, including usual ones such as the Gaussian kernel, training a kernel machine with the double hinge loss provides a universally consistent classifier with reject option. [sent-253, score-1.348]

96 Furthermore, the loss provides sparse solutions, with a limited number of support vectors, similarly to the standard L1-SVM classifier. [sent-254, score-0.232]

97 We presented what we believe to be the first principled and efficient implementation of SVMs for classification with a reject option. [sent-255, score-0.297]

98 The only computational overhead is brought by monitoring five categories of examples, instead of the three considered in standard SVMs (support vector, support at bound, inactive example). [sent-258, score-0.119]

99 Our approach for deriving the double hinge loss can be used for other decision problems relying on conditional probabilities at specific values or in a limited range or values. [sent-259, score-1.091]

100 Classification with a reject option using a hinge loss. [sent-274, score-0.753]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hinge', 0.371), ('reject', 0.297), ('double', 0.283), ('rejection', 0.282), ('wegkamp', 0.232), ('loss', 0.188), ('decision', 0.18), ('ic', 0.154), ('id', 0.149), ('bartlett', 0.14), ('ci', 0.137), ('chow', 0.127), ('rule', 0.119), ('svms', 0.114), ('yi', 0.109), ('classi', 0.108), ('abstain', 0.106), ('gji', 0.106), ('ti', 0.095), ('universally', 0.093), ('grandvalet', 0.093), ('exy', 0.093), ('naive', 0.092), ('svm', 0.085), ('option', 0.085), ('tewari', 0.079), ('examples', 0.078), ('thresholds', 0.071), ('gij', 0.069), ('misclassi', 0.069), ('old', 0.069), ('dp', 0.069), ('costs', 0.066), ('risk', 0.064), ('dedicated', 0.064), ('bayes', 0.059), ('motivates', 0.058), ('vicinity', 0.054), ('steinwart', 0.054), ('box', 0.053), ('compi', 0.053), ('fumera', 0.053), ('gne', 0.053), ('herbei', 0.053), ('repartition', 0.053), ('roli', 0.053), ('rouen', 0.053), ('simplesvm', 0.053), ('performances', 0.053), ('cj', 0.052), ('active', 0.051), ('xi', 0.051), ('losses', 0.05), ('vishwanathan', 0.05), ('cation', 0.048), ('fn', 0.048), ('rules', 0.047), ('symmetric', 0.047), ('primal', 0.047), ('kwok', 0.046), ('negative', 0.046), ('training', 0.045), ('asymmetric', 0.045), ('surrogate', 0.045), ('support', 0.044), ('liver', 0.042), ('pima', 0.042), ('desiderata', 0.042), ('dual', 0.042), ('logistic', 0.042), ('machines', 0.041), ('ones', 0.041), ('er', 0.041), ('incurring', 0.04), ('cp', 0.04), ('lacks', 0.04), ('vapnik', 0.04), ('si', 0.038), ('minimizing', 0.038), ('violates', 0.038), ('medical', 0.037), ('probabilities', 0.036), ('optimization', 0.035), ('partition', 0.035), ('yf', 0.034), ('monitoring', 0.034), ('incur', 0.034), ('kkt', 0.033), ('conditional', 0.033), ('aims', 0.033), ('criterion', 0.032), ('akin', 0.032), ('constructive', 0.032), ('viewpoint', 0.032), ('consistent', 0.03), ('constraints', 0.03), ('scores', 0.03), ('universit', 0.029), ('sparsity', 0.029), ('sparseness', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 228 nips-2008-Support Vector Machines with a Reject Option

Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu

Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1

2 0.11642931 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1

3 0.10484795 239 nips-2008-Tighter Bounds for Structured Estimation

Author: Olivier Chapelle, Chuong B. Do, Choon H. Teo, Quoc V. Le, Alex J. Smola

Abstract: Large-margin structured estimation methods minimize a convex upper bound of loss functions. While they allow for efficient optimization algorithms, these convex formulations are not tight and sacrifice the ability to accurately model the true loss. We present tighter non-convex bounds based on generalizing the notion of a ramp loss from binary classification to structured estimation. We show that a small modification of existing optimization algorithms suffices to solve this modified problem. On structured prediction tasks such as protein sequence alignment and web page ranking, our algorithm leads to improved accuracy. 1

4 0.10273271 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

5 0.09804631 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising

Author: Steffen Bickel, Christoph Sawade, Tobias Scheffer

Abstract: We address the problem of learning classifiers for several related tasks that may differ in their joint distribution of input and output variables. For each task, small – possibly even empty – labeled samples and large unlabeled samples are available. While the unlabeled samples reflect the target distribution, the labeled samples may be biased. This setting is motivated by the problem of predicting sociodemographic features for users of web portals, based on the content which they have accessed. Here, questionnaires offered to a portion of each portal’s users produce biased samples. We derive a transfer learning procedure that produces resampling weights which match the pool of all examples to the target distribution of any given task. Transfer learning enables us to make predictions even for new portals with few or no training data and improves the overall prediction accuracy. 1

6 0.086126156 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

7 0.085327275 78 nips-2008-Exact Convex Confidence-Weighted Learning

8 0.084838495 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

9 0.083436511 226 nips-2008-Supervised Dictionary Learning

10 0.083121181 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

11 0.081212468 85 nips-2008-Fast Rates for Regularized Objectives

12 0.078300513 233 nips-2008-The Gaussian Process Density Sampler

13 0.077070937 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions

14 0.076174319 62 nips-2008-Differentiable Sparse Coding

15 0.075918972 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

16 0.074648641 178 nips-2008-Performance analysis for L\ 2 kernel classification

17 0.074278735 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

18 0.073363438 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition

19 0.072634131 65 nips-2008-Domain Adaptation with Multiple Sources

20 0.072072819 202 nips-2008-Robust Regression and Lasso


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.228), (1, -0.057), (2, -0.118), (3, 0.025), (4, -0.042), (5, 0.045), (6, 0.014), (7, -0.043), (8, -0.018), (9, 0.086), (10, 0.111), (11, 0.092), (12, -0.026), (13, -0.082), (14, -0.005), (15, 0.035), (16, -0.01), (17, -0.043), (18, 0.022), (19, 0.116), (20, 0.004), (21, 0.036), (22, -0.058), (23, -0.026), (24, -0.012), (25, 0.067), (26, 0.017), (27, -0.094), (28, 0.008), (29, 0.011), (30, 0.022), (31, -0.046), (32, 0.086), (33, -0.04), (34, 0.049), (35, 0.081), (36, -0.133), (37, 0.022), (38, 0.016), (39, 0.06), (40, -0.012), (41, -0.06), (42, 0.088), (43, -0.051), (44, 0.089), (45, 0.095), (46, 0.103), (47, 0.059), (48, -0.133), (49, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95635682 228 nips-2008-Support Vector Machines with a Reject Option

Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu

Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1

2 0.77176106 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The machine learning problem of classifier design is studied from the perspective of probability elicitation, in statistics. This shows that the standard approach of proceeding from the specification of a loss, to the minimization of conditional risk is overly restrictive. It is shown that a better alternative is to start from the specification of a functional form for the minimum conditional risk, and derive the loss function. This has various consequences of practical interest, such as showing that 1) the widely adopted practice of relying on convex loss functions is unnecessary, and 2) many new losses can be derived for classification problems. These points are illustrated by the derivation of a new loss which is not convex, but does not compromise the computational tractability of classifier design, and is robust to the contamination of data with outliers. A new boosting algorithm, SavageBoost, is derived for the minimization of this loss. Experimental results show that it is indeed less sensitive to outliers than conventional methods, such as Ada, Real, or LogitBoost, and converges in fewer iterations. 1

3 0.73969352 163 nips-2008-On the Efficient Minimization of Classification Calibrated Surrogates

Author: Richard Nock, Frank Nielsen

Abstract: Bartlett et al (2006) recently proved that a ground condition for convex surrogates, classification calibration, ties up the minimization of the surrogates and classification risks, and left as an important problem the algorithmic questions about the minimization of these surrogates. In this paper, we propose an algorithm which provably minimizes any classification calibrated surrogate strictly convex and differentiable — a set whose losses span the exponential, logistic and squared losses —, with boosting-type guaranteed convergence rates under a weak learning assumption. A particular subclass of these surrogates, that we call balanced convex surrogates, has a key rationale that ties it to maximum likelihood estimation, zerosum games and the set of losses that satisfy some of the most common requirements for losses in supervised learning. We report experiments on more than 50 readily available domains of 11 flavors of the algorithm, that shed light on new surrogates, and the potential of data dependent strategies to tune surrogates. 1

4 0.71293271 185 nips-2008-Privacy-preserving logistic regression

Author: Kamalika Chaudhuri, Claire Monteleoni

Abstract: This paper addresses the important tradeoff between privacy and learnability, when designing algorithms for learning from private databases. We focus on privacy-preserving logistic regression. First we apply an idea of Dwork et al. [6] to design a privacy-preserving logistic regression algorithm. This involves bounding the sensitivity of regularized logistic regression, and perturbing the learned classifier with noise proportional to the sensitivity. We then provide a privacy-preserving regularized logistic regression algorithm based on a new privacy-preserving technique: solving a perturbed optimization problem. We prove that our algorithm preserves privacy in the model due to [6]. We provide learning guarantees for both algorithms, which are tighter for our new algorithm, in cases in which one would typically apply logistic regression. Experiments demonstrate improved learning performance of our method, versus the sensitivity method. Our privacy-preserving technique does not depend on the sensitivity of the function, and extends easily to a class of convex loss functions. Our work also reveals an interesting connection between regularization and privacy. 1

5 0.64425194 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

6 0.59164476 178 nips-2008-Performance analysis for L\ 2 kernel classification

7 0.58250105 78 nips-2008-Exact Convex Confidence-Weighted Learning

8 0.57207775 239 nips-2008-Tighter Bounds for Structured Estimation

9 0.5696975 85 nips-2008-Fast Rates for Regularized Objectives

10 0.5605306 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring

11 0.54460055 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning

12 0.53875583 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising

13 0.53568244 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

14 0.53274548 226 nips-2008-Supervised Dictionary Learning

15 0.5148285 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

16 0.49749282 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

17 0.49570569 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

18 0.47721657 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

19 0.46672091 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation

20 0.46437636 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.268), (6, 0.109), (7, 0.079), (12, 0.047), (15, 0.013), (28, 0.153), (57, 0.04), (59, 0.024), (63, 0.042), (71, 0.036), (77, 0.05), (78, 0.011), (83, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78973895 228 nips-2008-Support Vector Machines with a Reject Option

Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu

Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1

2 0.72992122 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

3 0.70013076 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

Author: Benjamin Yackley, Eduardo Corona, Terran Lane

Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1

4 0.64175504 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

5 0.63359952 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

6 0.63271493 75 nips-2008-Estimating vector fields using sparse basis field expansions

7 0.6306051 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

8 0.63052797 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

9 0.62674779 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

10 0.62661123 196 nips-2008-Relative Margin Machines

11 0.6256696 143 nips-2008-Multi-label Multiple Kernel Learning

12 0.62489438 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

13 0.62427258 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

14 0.62394655 85 nips-2008-Fast Rates for Regularized Objectives

15 0.6233027 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias

16 0.62277627 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost

17 0.61923718 179 nips-2008-Phase transitions for high-dimensional joint support recovery

18 0.61820066 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization

19 0.61750686 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

20 0.61713785 239 nips-2008-Tighter Bounds for Structured Estimation