jmlr jmlr2010 jmlr2010-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
Reference: text
sentIndex sentText sentNum sentScore
1 Computer Science Department, Technion – Israel Institute of Technology, Haifa 32000, Israel. Editor: Gabor Lugosi. Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option. [sent-7, score-0.493]
2 ’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. [sent-8, score-0.663]
3 For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. [sent-11, score-0.375]
4 Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off 1. [sent-12, score-0.635]
5 Introduction In this paper we study the trade-off between coverage and accuracy of classifiers with a reject option, a trade-off we refer to as the risk-coverage (RC) trade-off. [sent-13, score-0.406]
6 Throughout the paper we use the term selective classification to refer to ‘classification with a reject option. [sent-15, score-0.493]
7 Through the years, selective classification continued to draw attention and numerous papers have been published. [sent-17, score-0.375]
8 The attraction of effective selective classification is rather obvious in applications where one is not concerned with, or can afford, partial coverage of the domain, and/or in cases where extremely low risk is a must but is not achievable in standard classification frameworks. [sent-18, score-0.805]
9 Despite the relatively large number of research publications on selective classification, the vast majority of these works have been concerned with implementing a reject option within specific learning schemes, by endowing a learning scheme (e. [sent-21, score-0.519]
10 While there are many convincing accounts of the potential effectiveness of selective classification in reducing the risk, we are not familiar with any thorough or conclusive discussion of the relative power of the numerous rejection mechanisms that have been considered so far. [sent-25, score-0.481]
11 Some existing works on selective classification (see Section 10) do provide some risk or coverage bounds for specific schemes (e. [sent-27, score-0.397]
12 A thorough understanding and effective use of selective classification requires characterization of the theoretical and practical boundaries of RC trade-offs, which are essential elements in any discussion of optimality in selective classification. [sent-33, score-0.75]
13 These missing elements in the current literature are critical when constructing and exploring selective classification schemes and selective classification algorithms that aim at achieving optimality in controlling the RC trade-off. [sent-34, score-0.75]
14 One of our longer term goals is to provide such characterizations and introduce a notion of optimality for selective classification in the most general agnostic model. [sent-35, score-0.408]
15 In selective classification the learner should output a binary selective classifier defined to be a pair ( f , g), with f being a standard binary classifier, and g : X → [0, 1] a selection function whose meaning is as follows. [sent-50, score-0.783]
16 When applying the selective classifier to a sample x, its output is (f, g)(x) = f(x) with probability g(x), and reject with probability 1 − g(x). (1) [sent-51, score-0.394]
17 Thus, in its most general form, the selective classifier is randomized. [sent-56, score-0.375]
18 Whenever the selection function is a zero-one rule, g : X → {0, 1}, we say that the selective classifier is deterministic. [sent-57, score-0.375]
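As a concrete illustration of the randomized output rule (1) and its deterministic special case, the following Python sketch (a hypothetical wrapper of our own; the names are not from the paper) applies a pair (f, g) to a sample x:

import random

REJECT = "reject"  # sentinel value standing for the reject decision

def selective_classify(f, g, x, rng=random):
    # f: callable returning a label; g: callable returning an acceptance
    # probability in [0, 1]. With probability g(x) return f(x), otherwise reject.
    if rng.random() < g(x):
        return f(x)
    return REJECT

# Deterministic special case: g maps into {0, 1} (abstain near the decision boundary).
f = lambda x: 1 if x >= 0 else -1
g = lambda x: 1.0 if abs(x) > 0.5 else 0.0
print(selective_classify(f, g, 0.8), selective_classify(f, g, 0.1))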
19 Standard classification (i.e., no rejection is allowed) is the special case of selective classification where g(x) selects all points (i.e., g ≡ 1). [sent-60, score-0.501]
20 The two main characteristics of a selective classifier are its coverage and its risk (or “true error”). [sent-63, score-0.753]
21 Definition 1 (coverage) The coverage of a selective classifier (f, g) is the mean value of the selection function g(X) taken over the underlying distribution P: Φ(f, g) ≜ E_P[g(X)]. [sent-64, score-0.663]
22 Definition 2 (risk) For a bounded loss function ℓ : Y × Y → [0, 1], we define the risk of a selective classifier (f, g) as the average loss on the accepted samples: R(f, g) ≜ E[ℓ(f(X), Y) · g(X)] / Φ(f, g). [sent-65, score-0.465]
23 Note that (at the outset) both the coverage and risk are unknown quantities because they are defined in terms of the unknown underlying distribution P. [sent-67, score-0.378]
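They can, however, be estimated on a held-out labeled set; a minimal sketch (our own, assuming a deterministic selection function and reading the risk as the average loss over accepted samples, as in Definition 2):

def empirical_coverage_and_risk(f, g, labeled_samples, loss=lambda yhat, y: float(yhat != y)):
    # labeled_samples: iterable of (x, y) pairs; g(x) is assumed to be 0 or 1.
    accepted = [(x, y) for (x, y) in labeled_samples if g(x) == 1]
    coverage = len(accepted) / len(labeled_samples)
    if not accepted:
        return coverage, 0.0  # risk is undefined at zero coverage; report 0 by convention
    risk = sum(loss(f(x), y) for (x, y) in accepted) / len(accepted)
    return coverage, risk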
24 We define a learning algorithm ALG to be a (random) function that, given a sample Sm , chooses a selective classifier ( f , g). [sent-68, score-0.394]
25 We evaluate learners with respect to their coverage and risk and derive both positive and negative results on achievable risk and coverage. [sent-69, score-0.52]
26 ALG is applied on Sm and outputs a selective classifier ( f , g). [sent-85, score-0.375]
27 The result of the game is evaluated in terms of the risk and coverage obtained by the chosen selective classifier; clearly, these are random quantities that trade off against each other. [sent-86, score-0.773]
28 For a selective classifier ( f , g) with coverage Φ( f , g) we can specify a Risk-Coverage (RC) trade-off as a bound on the risk R( f , g), expressed in terms of Φ( f , g). [sent-104, score-0.772]
29 Using a training sample Sm , the goal in selective classification is to output a selective classifier ( f , g) that has sufficiently low risk with sufficiently high coverage. [sent-116, score-0.879]
30 We call the trade-off between risk and coverage the risk-coverage (RC) trade-off. [sent-118, score-0.378]
31 The best way to benefit from selective classification is to control the creation of the classifier so as to meet a prescribed error/coverage specification along the RC trade-off. [sent-119, score-0.375]
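One common practical way to expose this trade-off (a generic thresholding device, not the CSS strategy analyzed in this paper) is to sweep a confidence threshold over held-out predictions and record the resulting empirical (coverage, risk) pairs:

def empirical_rc_curve(predictions, labels, thresholds):
    # predictions: list of (predicted_label, confidence) pairs; labels: true labels.
    # A sample is accepted when its confidence exceeds the threshold.
    curve = []
    for t in thresholds:
        accepted = [(p, y) for (p, c), y in zip(predictions, labels) if c > t]
        coverage = len(accepted) / len(labels)
        risk = sum(p != y for p, y in accepted) / len(accepted) if accepted else 0.0
        curve.append((coverage, risk))
    return curve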
32 The entire region depicted, called the RC plane, consists of all (r, c) points in the rectangle of interest, where r is a risk (error) coordinate and c is a coverage coordinate. [sent-127, score-0.425]
33 ” We say that (r, c) is (efficiently) achievable if there is an (efficient) learning algorithm that will output a selective classifier ( f , g) such that with probability of at least 1 − δ, its coverage is at least c and its risk is at most r. [sent-133, score-0.843]
34 ” At this point we require full coverage with certainty and the achievable risk represents the lowest possible risk in our fixed setting (which should be achievable with probability of at least 1 − δ). [sent-135, score-0.633]
35 We call point c∗ perfect learning because achievable perfect learning means that we can generate a classifier that, with certainty, never errs on the problem at hand. [sent-139, score-0.378]
36 This curve passes somewhere in the zone labeled with a question mark and represents optimal selective classification. [sent-142, score-0.45]
37 Given the training set Sm, we are required to generate a “perfect” selective classifier (f, g) for which it is known with certainty that R(f, g) = 0. [sent-159, score-0.437]
38 Our first observation is Theorem 8, stating that for any finite hypothesis class F , perfect learning with guaranteed coverage is achievable by a particular selective classification strategy. [sent-162, score-0.998]
39 For any tolerance δ, with probability of at least 1 − δ, it is guaranteed that the coverage achieved by this strategy will be at least 1 − O((|F| + ln(1/δ))/m). (2) [sent-163, score-0.384]
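Treating the hidden constant in the O(·) as a free parameter, the guarantee can be tabulated directly; the sketch below (our illustration, with the constant set to 1) shows how the guaranteed coverage improves with the sample size m:

import math

def css_coverage_lower_bound(m, size_F, delta, c=1.0):
    # Schematic form of the finite-class guarantee: 1 - c * (|F| + ln(1/delta)) / m,
    # where c stands in for the unspecified constant hidden in the O(.) notation.
    return max(0.0, 1.0 - c * (size_F + math.log(1.0 / delta)) / m)

for m in (200, 1000, 10000):
    print(m, round(css_coverage_lower_bound(m, size_F=100, delta=0.05), 3))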
40 We show in Theorem 7 that any other strategy that achieves perfect learning cannot have larger coverage than CSS. [sent-166, score-0.46]
41 This distribution-free coverage guarantee (2) is proven to be nearly tight for CSS and therefore, it is the best possible bound for any selective learner. [sent-171, score-0.71]
42 Specifically, as shown in Theorem 11, there exist a particular finite hypothesis class and a particular underlying distribution for which a matching negative result (up to multiplicative constants) holds for any consistent selective learner. [sent-172, score-0.515]
43 This result is readily extended to any selective learner by the CSS coverage optimality of Theorem 7. [sent-173, score-0.696]
44 We show in Theorem 14 that it is impossible to provide any coverage guarantees for perfect learning, in the general case. [sent-175, score-0.43]
45 Specifically, for linear classifiers, we show a bad distribution for which any selective learner ensuring zero risk will be forced to reject the entire volume of X , thus failing to guarantee more than zero coverage. [sent-176, score-0.644]
46 So the bad news is that perfect learning with guaranteed coverage cannot in general be achieved if the hypothesis space is infinite. [sent-179, score-0.571]
47 For any selective hypothesis (f, g) that is consistent with a sample Sm, Theorem 21 ensures perfect learning with a high-probability coverage guarantee of the following form: Φ(f, g) ≥ 1 − (1/m) · O( γ(F, n̂) · ln(m/γ(F, n̂)) + ln(m/δ) ). (3) [sent-182, score-1.17]
48 The requirement that in perfect learning the risk is zero with certainty is dual to the requirement that the coverage is 100% with certainty in standard learning. [sent-183, score-0.604]
49 This bound immediately yields a coverage guarantee for perfect learning of linear classifiers, as stated in Corollary 33. [sent-191, score-0.477]
50 This is a powerful result, providing a strong indication of the potential effectiveness of perfect learning with guaranteed coverage in a variety of applications. [sent-192, score-0.458]
51 We generalize the CSS strategy and define a “controllable selective strategy” (Definition 34). [sent-195, score-0.405]
52 The upper envelope of the RC curve is then derived in Theorem 37 for any selective classifier (f, g) by constructing a particular bad distribution for which R(f, g) ≥ (1/(4Φ)) · min{ 2Φ − 1, 2Φ − 2 + (1/(4m)) · ( (16/3) · VCdim(F) − ln(1/(1 − 2δ)) ) }. [sent-199, score-0.539]
53 We show that perfect selective classification with guaranteed coverage is achievable (from a learning-theoretic perspective) by a learning strategy termed consistent selective strategy (CSS). [sent-211, score-1.347]
54 Definition 6 (consistent selective strategy (CSS)) Given Sm, a consistent selective strategy (CSS) is a selective classification strategy that takes f to be any hypothesis in V S_{F,Sm} (i.e., any hypothesis consistent with Sm) and takes a selection function g that accepts a point x if and only if all hypotheses in V S_{F,Sm} agree on x (the maximal agreement set of the version space). [sent-220, score-1.355]
55 An immediate consequence is that any CSS selective hypothesis ( f , g) always satisfies R( f , g) = 0. [sent-225, score-0.488]
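For an enumerable (finite) hypothesis class, Definition 6 can be implemented by brute force: keep the version space and accept exactly the points on which all surviving hypotheses agree. The sketch below is our own illustration and assumes the realizable setting (so the version space is non-empty):

def css(hypotheses, labeled_sample):
    # hypotheses: list of callables h(x) -> label; labeled_sample: list of (x, y) pairs (Sm).
    version_space = [h for h in hypotheses
                     if all(h(x) == y for (x, y) in labeled_sample)]
    f = version_space[0]  # any consistent hypothesis
    def g(x):
        # accept x iff the whole version space agrees on x (the maximal agreement set)
        first = f(x)
        return 1 if all(h(x) == first for h in version_space) else 0
    return f, g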
56 The main concern, however, is whether its coverage Φ( f , g) can be bounded from below and whether any other strategy that achieves perfect learning with certainty can achieve better coverage. [sent-226, score-0.502]
57 Theorem 7 (CSS coverage optimality) Given Sm , let ( f , g) be a selective classifier chosen by any strategy that ensures zero risk with certainty for any unknown distribution P and any target concept f ∗ ∈ F . [sent-228, score-0.847]
58 Let ( fc , gc ) be a selective classifier selected by CSS using Sm . [sent-229, score-0.412]
59 Given a hypothetical sample S̃m of size m, let (f̃c, g̃c) be the selective classifier chosen by CSS and let (f̃, g̃) be the selective classifier chosen by any competing strategy. [sent-233, score-0.85]
60 The next result establishes the existence of perfect learning with guaranteed coverage in the finite case. [sent-247, score-0.458]
61 Theorem 8 (guaranteed coverage) Assume a finite F and let ( f , g) be a selective classifier selected by CSS. [sent-248, score-0.397]
62 There exist a distribution P (that depends on m and n) and a finite hypothesis class F of size n, such that for any selective classifier (f, g) chosen from F by CSS (so R(f, g) = 0) using a training sample Sm drawn i. [sent-277, score-0.527]
63 Since the coverage Φ(f, g) is the volume of the maximal agreement set with respect to the version space V S_{F,Sm}, it follows that Φ(f, g) = 1 − (|V S_{F,Sm}|/|F|) · Bin(m, n, 2δ) ≤ 1 − (1/2) · Bin(m, n, 2δ). [sent-293, score-0.435]
64 There exist a distribution P (that depends on m and n) and a finite hypothesis class F of size n, such that for any selective classifier (f, g) chosen from F by CSS (so R(f, g) = 0) using a training sample Sm drawn i. [sent-298, score-0.527]
65 We show that in the general case, perfect selective classification with guaranteed (non-zero) coverage is not achievable even when F has a finite VC-dimension. [sent-306, score-0.885]
66 There exist a distribution P, an infinite hypothesis class F with a finite VC-dimension d, and a target hypothesis in F , such that Φ( f , g) = 0 for any selective classifier ( f , g), chosen from F by CSS using a training sample Sm drawn i. [sent-311, score-0.64]
67 A direct corollary of Theorem 14 is that, in the general case, perfect selective classification with distribution-free guaranteed coverage is not achievable for infinite hypothesis spaces. [sent-324, score-1.018]
68 Since a maximal agreement set is a region in X, rather than a hypothesis, we formally define the dual hypothesis that matches every maximal agreement set. [sent-344, score-0.434]
69 For any probability distribution P on X × {±1}, with probability of at least 1 − δ over the choice of Sm from P^m, any hypothesis f ∈ F consistent with Sm satisfies R(f) ≤ ε(h, m, δ) = (2/m) · ( h ln(2em/h) + ln(2/δ) ), (5) where R(f) ≜ E[I(f(x) ≠ f*(x))] is the risk of f. [sent-358, score-0.427]
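Equation (5) is the familiar consistency-based generalization bound (with h playing the role of the VC-dimension) and is easy to evaluate numerically:

import math

def consistent_risk_bound(h, m, delta):
    # epsilon(h, m, delta) = (2/m) * (h * ln(2*e*m/h) + ln(2/delta))
    return (2.0 / m) * (h * math.log(2.0 * math.e * m / h) + math.log(2.0 / delta))

print(round(consistent_risk_bound(h=10, m=10000, delta=0.05), 4))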
70 Combining this bound with Theorem 21, we immediately obtain a data-dependent compression coverage guarantee, as stated in Corollary 28. [sent-387, score-0.378]
71 This powerful result, which is stated in Corollary 33, indicates that consistent selective classification might be relevant in various applications of interest. [sent-389, score-0.402]
72 As long as the empirical version space compression set size n̂ is sufficiently small compared to m, Corollary 28 provides a meaningful coverage guarantee. [sent-457, score-0.359]
73 Is it possible to learn a selective classifier with full control over this trade-off? [sent-502, score-0.375]
74 1 Lower Envelope: Controlling the Coverage-Risk Trade-off. Our lower RC envelope is facilitated by the following strategy, which is a generalization of the consistent selective classification strategy (CSS) of Definition 6. [sent-509, score-0.507]
75 Clearly, CSS is a special case of the controllable selective strategy obtained with α = 0. [sent-511, score-0.465]
76 The following result provides a distribution independent upper bound on the risk of the controllable selective strategy as a function of its coverage. [sent-517, score-0.574]
77 Let ( f , g) be a selective classifier chosen by a controllable selective learner after observing a training sample Sm . [sent-519, score-0.882]
78 Proof For any controllable selective learner with a mixing parameter α we have Φ(f, g) = E[g(X)] = E[I(g(X) = 1)] + α · E[I(g(X) ≠ 1)]. [sent-521, score-0.468]
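A minimal sketch of this reading of the controllable strategy: keep the CSS decision on the agreement region and accept with probability α elsewhere, so that the identity above becomes E[g(X)] = P(agreement region) + α · P(its complement). The helper below is ours, not the paper's code:

def controllable_selection(g_css, alpha):
    # g_css: deterministic CSS selection function with values in {0, 1};
    # alpha: mixing parameter in [0, 1]; alpha = 0 recovers plain CSS.
    return lambda x: 1.0 if g_css(x) == 1 else alpha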
79 The statement is a probabilistic lower bound on the risk of any selective classifier expressed as a function of the coverage. [sent-530, score-0.484]
80 There exists a distribution P (that depends on F ), such that for any selective classifier ( f , g), chosen using a training sample Sm drawn i. [sent-533, score-0.414]
81 There exist a distribution P (that depends on m and n) and a finite hypothesis class F of size n, such that for any selective classifier (f, g) chosen using a training sample Sm drawn i. [sent-556, score-0.527]
82 The following method, which we term lazy CSS, is very similar to the implicit selective sampling algorithm of Cohn et al. [sent-572, score-0.403]
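The lazy idea can be made concrete for linear classifiers: to decide whether to accept a test point x, check whether any hypothesis consistent with Sm can label x opposite to f(x), e.g., via a linear-program feasibility test. This is our own hedged sketch of that check (the LP solver and the unit-margin normalization are our choices, not the paper's):

import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility of a homogeneous linear separator w with y_i * <w, x_i> >= 1
    # (the unit margin is a standard normalization; boundary cases are glossed over).
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1], method="highs")
    return res.success

def lazy_css_accept(X_train, y_train, x, f_label):
    # Accept x iff no consistent linear hypothesis labels x as -f_label,
    # i.e., the version space agrees with f on x.
    X_aug = np.vstack([X_train, x])
    y_aug = np.append(y_train, -f_label)
    return not separable(X_aug, y_aug)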
83 Remark 40 For the realizable case we can modify any rejection mechanism by restricting rejection only to the region chosen for rejection by CSS. [sent-592, score-0.414]
84 The question we discuss in this section is: what would be an appropriate optimization criterion for selective classification? [sent-601, score-0.375]
85 Specifically, by bounding the coverage and the risk separately (as we do in this paper) we can in principle optimize any generalized rejective risk function according to any desired rejection model including the cost, the bounded-improvement and bounded-abstention models. [sent-631, score-0.6]
86 Herbei and Wegkamp (2006) developed excess risk bounds for the classification with a reject option setting where the loss function is the 0-1 loss, extended such that the cost of each reject point is 0 ≤ d ≤ 1/2 (cost model; see Section 9). [sent-655, score-0.391]
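In that cost model the risk-optimal decision given the true posterior η(x) = P(Y = 1 | x) is the classical Chow rule: predict the more likely class unless min(η, 1 − η) > d, in which case reject. A small sketch, assuming access to (an estimate of) η:

def chow_rule(eta, d):
    # Bayes-optimal decision for the cost model with reject cost 0 <= d <= 1/2.
    if min(eta, 1.0 - eta) > d:
        return "reject"
    return 1 if eta >= 0.5 else -1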
87 Selective classification is related to selective sampling (Atlas et al. [sent-684, score-0.375]
88 In selective sampling the learner sequentially processes unlabeled examples, and for each one determines whether or not to request a label. [sent-686, score-0.408]
89 While coverage bounds and label complexity bounds cannot be directly compared, we conjecture that formal connections between these two settings exist because the disagreement region plays a key role in both. [sent-695, score-0.378]
90 Nevertheless, not enough is known about selective classification in order to harness its power in a controlled, optimal way, or to avoid its use in cases where it cannot sufficiently help. [sent-700, score-0.375]
91 In this work we made a first step toward a rigorous analysis of selective classification by revealing properties of the risk-coverage trade-off, which represents optimal selective classification. [sent-701, score-0.75]
92 What is the precise relation between selective classification and selective sampling? [sent-706, score-0.75]
93 Can selective classification be rigorously analyzed in transductive, semisupervised or active settings? [sent-708, score-0.375]
94 Here we could employ a selective strategy aiming at achieving the error rate of the best hypothesis in the class precisely (and perhaps with certainty). [sent-710, score-0.518]
95 Noticing that x′1² cos α − x′1 · x′2 · sin α is continuous in α, and applying the intermediate value theorem, we know that there is some α, 0 < α < π, for which x′1² cos α − x′1 · x′2 · sin α = 0. [sent-841, score-0.556]
96 Therefore, for all x̄′ ∈ S̄m, (Rα w′)T x̄′ = x̄′1 · cos α − x̄′2 · sin α = w′T x̄ · cos α − v′T x̄ · sin α = 0, and all points in S̄m reside on the boundary of f_{Rα w′, 0}. [sent-844, score-0.607]
97 Proof If (Rα w′)T x̄′ ≥ 0 and (R−β w′)T x̄′ ≥ 0, then x̄′1 · cos α − x̄′2 · sin α ≥ 0 and x̄′1 · cos β + x̄′2 · sin β ≥ 0. [sent-848, score-0.556]
98 Using the trigonometric identities cos(α − β) = cos α cos β + sin α sin β; sin(α − β) = sin α cos β − cos α sin β, we get that cos β = cos(β + α − α) = cos(α + β) · cos α + sin(α + β) · sin α = − cos α, and sin β = sin(β + α − α) = sin(α + β) · cos α − cos(α + β) · sin α = sin α. [sent-852, score-2.224]
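The step above implicitly uses cos(α + β) = −1 and sin(α + β) = 0, i.e., α + β = π (a relation established earlier in the proof and taken here as an assumption). A quick numeric sanity check:

import math

alpha = 0.7               # any 0 < alpha < pi
beta = math.pi - alpha    # the assumed relation alpha + beta = pi
assert math.isclose(math.cos(beta), -math.cos(alpha), abs_tol=1e-12)
assert math.isclose(math.sin(beta), math.sin(alpha), abs_tol=1e-12)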
99 Proof By definition we get that for all samples in Sm, x′1² cos α − x′1 · x′2 · sin α ≥ 0 and x′1² cos β + x′1 · x′2 · sin β ≥ 0. [sent-857, score-0.576]
100 Multiplying the first inequality by sin β > 0 (0 < β < π), the second inequality by sin α > 0, adding the two, and using the trigonometric identity sin(α + β) = sin α cos β + cos α sin β, we have sin(α + β) · x′1² ≥ 0. [sent-858, score-0.882]
wordName wordTfidf (topN-words)
[('sm', 0.461), ('selective', 0.375), ('coverage', 0.288), ('css', 0.281), ('sf', 0.239), ('rc', 0.194), ('sin', 0.163), ('perfect', 0.142), ('elective', 0.135), ('iener', 0.135), ('oise', 0.135), ('oundations', 0.135), ('wt', 0.131), ('reject', 0.118), ('cos', 0.115), ('hypothesis', 0.113), ('fw', 0.106), ('rejection', 0.106), ('agreement', 0.092), ('classi', 0.091), ('risk', 0.09), ('ln', 0.089), ('bin', 0.082), ('characterizing', 0.08), ('envelop', 0.075), ('compression', 0.071), ('realizable', 0.069), ('lassification', 0.065), ('rd', 0.064), ('hn', 0.062), ('sn', 0.061), ('er', 0.061), ('mp', 0.061), ('controllable', 0.06), ('pr', 0.059), ('hull', 0.058), ('maximal', 0.055), ('envelopes', 0.052), ('achievable', 0.052), ('zone', 0.052), ('vcdim', 0.045), ('fv', 0.045), ('hypotheses', 0.043), ('certainty', 0.042), ('chow', 0.04), ('fr', 0.039), ('fumera', 0.037), ('gc', 0.037), ('lemma', 0.037), ('intersection', 0.036), ('ei', 0.035), ('vc', 0.033), ('learner', 0.033), ('agnostic', 0.033), ('facets', 0.032), ('sliced', 0.032), ('boundary', 0.031), ('ers', 0.031), ('strategy', 0.03), ('pietraszek', 0.03), ('wegkamp', 0.029), ('rejects', 0.029), ('guarantee', 0.028), ('lazy', 0.028), ('guaranteed', 0.028), ('consistent', 0.027), ('region', 0.027), ('free', 0.026), ('option', 0.026), ('vertices', 0.026), ('proof', 0.026), ('herbei', 0.026), ('rejective', 0.026), ('theorem', 0.026), ('disagreement', 0.025), ('plane', 0.025), ('labeled', 0.023), ('alg', 0.022), ('elevation', 0.022), ('klee', 0.022), ('tortorella', 0.022), ('binomial', 0.022), ('let', 0.022), ('rotated', 0.021), ('geometry', 0.021), ('depicted', 0.021), ('cost', 0.02), ('points', 0.02), ('samples', 0.02), ('corollary', 0.02), ('training', 0.02), ('game', 0.02), ('least', 0.019), ('bound', 0.019), ('bounds', 0.019), ('br', 0.019), ('technion', 0.019), ('atlas', 0.019), ('sample', 0.019), ('separability', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
2 0.11510136 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
3 0.0922556 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
Author: Choon Hui Teo, S.V.N. Vishwanthan, Alex J. Smola, Quoc V. Le
Abstract: A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data-locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available data sets. Keywords: optimization, subgradient methods, cutting plane method, bundle methods, regularized risk minimization, parallel optimization ∗. Also at Canberra Research Laboratory, NICTA. c 2010 Choon Hui Teo, S.V. N. Vishwanthan, Alex J. Smola and Quoc V. Le. T EO , V ISHWANATHAN , S MOLA AND L E
4 0.08406204 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
5 0.081598245 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
6 0.072751984 80 jmlr-2010-On-Line Sequential Bin Packing
7 0.068397343 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
8 0.060029268 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
9 0.058215879 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
10 0.057200044 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
11 0.054865614 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
12 0.054178957 22 jmlr-2010-Classification Using Geometric Level Sets
13 0.053807184 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
14 0.052177005 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
15 0.048812505 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
16 0.047662705 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
17 0.04564745 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
18 0.044182945 61 jmlr-2010-Learning From Crowds
19 0.041818872 102 jmlr-2010-Semi-Supervised Novelty Detection
20 0.039168611 63 jmlr-2010-Learning Instance-Specific Predictive Models
topicId topicWeight
[(0, -0.195), (1, -0.072), (2, 0.098), (3, 0.054), (4, 0.08), (5, 0.093), (6, -0.021), (7, 0.171), (8, -0.052), (9, 0.046), (10, -0.003), (11, -0.046), (12, -0.129), (13, 0.0), (14, -0.109), (15, -0.012), (16, -0.161), (17, 0.026), (18, -0.049), (19, 0.036), (20, -0.175), (21, 0.038), (22, -0.11), (23, 0.091), (24, 0.035), (25, 0.0), (26, 0.066), (27, -0.139), (28, 0.03), (29, 0.137), (30, 0.185), (31, -0.267), (32, -0.084), (33, 0.034), (34, 0.062), (35, -0.122), (36, -0.008), (37, -0.104), (38, 0.006), (39, 0.11), (40, -0.124), (41, -0.076), (42, -0.116), (43, 0.017), (44, 0.142), (45, 0.008), (46, -0.008), (47, -0.116), (48, -0.021), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.91343212 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
2 0.54299587 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
3 0.44326663 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
4 0.41696781 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
5 0.37598029 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
Author: Markus Ojala, Gemma C. Garriga
Abstract: We explore the framework of permutation-based p-values for assessing the performance of classifiers. In this paper we study two simple permutation tests. The first test assess whether the classifier has found a real class structure in the data; the corresponding null distribution is estimated by permuting the labels in the data. This test has been used extensively in classification problems in computational biology. The second test studies whether the classifier is exploiting the dependency between the features in classification; the corresponding null distribution is estimated by permuting the features within classes, inspired by restricted randomization techniques traditionally used in statistics. This new test can serve to identify descriptive features which can be valuable information in improving the classifier performance. We study the properties of these tests and present an extensive empirical evaluation on real and synthetic data. Our analysis shows that studying the classifier performance via permutation tests is effective. In particular, the restricted permutation test clearly reveals whether the classifier exploits the interdependency between the features in the data. Keywords: classification, labeled data, permutation tests, restricted randomization, significance testing
6 0.36349195 22 jmlr-2010-Classification Using Geometric Level Sets
7 0.34686577 61 jmlr-2010-Learning From Crowds
8 0.33043182 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
9 0.3288354 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
10 0.30016723 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
11 0.29101813 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
12 0.27617395 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
13 0.26113063 102 jmlr-2010-Semi-Supervised Novelty Detection
14 0.2606672 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
15 0.25937313 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
16 0.25476804 60 jmlr-2010-Learnability, Stability and Uniform Convergence
17 0.24280174 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
18 0.23354827 77 jmlr-2010-Model-based Boosting 2.0
19 0.22696912 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
20 0.22391503 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
topicId topicWeight
[(1, 0.019), (3, 0.031), (8, 0.014), (21, 0.015), (32, 0.552), (36, 0.027), (37, 0.039), (75, 0.097), (85, 0.069), (96, 0.012)]
simIndex simValue paperId paperTitle
1 0.93004835 73 jmlr-2010-Maximum Likelihood in Cost-Sensitive Learning: Model Specification, Approximations, and Upper Bounds
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
same-paper 2 0.90752614 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
3 0.8755033 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
4 0.86703217 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
5 0.5824675 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
6 0.55096537 60 jmlr-2010-Learnability, Stability and Uniform Convergence
7 0.54692018 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.51971996 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.50965267 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
10 0.49870417 102 jmlr-2010-Semi-Supervised Novelty Detection
11 0.49852693 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
12 0.49537185 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
13 0.48643878 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
14 0.47965899 25 jmlr-2010-Composite Binary Losses
15 0.47775394 109 jmlr-2010-Stochastic Composite Likelihood
16 0.47703686 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
17 0.47609764 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.4741973 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
19 0.47274062 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
20 0.47101337 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data