nips nips2011 nips2011-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to diminish at rate O(Bθ(√(1/m))^β), where θ is Hanneke's disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary.
Indeed, consider a game where our classifier is allowed to abstain from prediction, without penalty, in some region of our choice. Perhaps surprisingly, it was shown that the volume of this rejection region is bounded and, in fact, diminishes with increasing training set size (under certain conditions).
The setting under consideration, where classifiers can abstain from prediction, is called classification with a reject option [2, 3], or selective classification [1]. We first show that the concept of "perfect classification," introduced for the realizable case in [1], can be extended to the agnostic setting. While pure perfection is impossible to accomplish in a noisy environment, a more realistic objective is to perform as well as the best hypothesis in the class within a region of our choice. We call this type of learning "weakly optimal" selective classification and show that a novel strategy accomplishes this type of learning with diminishing rejection rate under certain Bernstein type conditions (a stronger notion of optimality is mentioned later as well). This strategy relies on empirical risk minimization, which is computationally difficult. We conclude with numerical examples that examine the empirical performance of the new algorithm and compare it with that of the widely used selective classification method for rejection, based on distance from decision boundary.
2 Selective classification and other definitions

Consider a standard agnostic binary classification setting where X is some feature space, and H is our hypothesis class of binary classifiers, h : X → {±1}. Given a labeled training sample S_m = {(x_i, y_i)}_{i=1}^m drawn i.i.d. from an unknown distribution P over X × Y, define the true and empirical risks

R(h) ≜ Pr_{(X,Y)∼P} {h(X) ≠ Y},    R̂(h) ≜ (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i).

Let ĥ ≜ arg inf_{h∈H} R̂(h) be the empirical risk minimizer (ERM), and h* ≜ arg inf_{h∈H} R(h) the true risk minimizer. In selective classification [1], given S_m we need to select a binary selective classifier, defined to be a pair (h, g), with h ∈ H a standard binary classifier and g : X → {0, 1} a selection function defining the sub-region of activity of h in X.
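The empirical risk and ERM defined above can be made concrete with a toy example. The sketch below uses a finite class of 1-D threshold classifiers h_t(x) = sign(x − t); the class, thresholds, and data are illustrative, not from the paper.

```python
def h(t):
    """Threshold classifier h_t(x) = +1 if x >= t else -1."""
    return lambda x: 1 if x >= t else -1

def empirical_risk(clf, sample):
    """R_hat(h): average 0/1 loss of clf on the labeled sample."""
    return sum(1 for x, y in sample if clf(x) != y) / len(sample)

def erm(thresholds, sample):
    """Empirical risk minimizer over the finite class {h_t}."""
    return min(thresholds, key=lambda t: empirical_risk(h(t), sample))

sample = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.5, 1), (2.5, 1)]
thresholds = [-3.0, -1.5, 0.0, 1.0, 3.0]
t_hat = erm(thresholds, sample)  # threshold 0.0 separates this sample perfectly
```

For a finite class, ERM is just an argmin over the candidates; the computational difficulty the paper refers to arises for rich classes in the agnostic setting.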
For a bounded loss function ℓ : Y × Y → [0, 1], the risk of (h, g) is defined as the average loss on the accepted samples,

R(h, g) ≜ E[ℓ(h(X), Y) · g(X)] / E[g(X)],

and the coverage is Φ(h, g) ≜ E[g(X)]. As pointed out in [1], the trade-off between risk and coverage is the main characteristic of a selective classifier.
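The empirical analogs of selective risk and coverage are straightforward: risk is averaged only over accepted points, and coverage is the accepted fraction. A minimal sketch, with illustrative classifier and selection function:

```python
def selective_risk_and_coverage(clf, g, sample):
    """Empirical selective risk and coverage of the pair (clf, g)."""
    accepted = [(x, y) for x, y in sample if g(x) == 1]
    coverage = len(accepted) / len(sample)
    if not accepted:  # risk is undefined at zero coverage; report 0 here
        return 0.0, 0.0
    risk = sum(1 for x, y in accepted if clf(x) != y) / len(accepted)
    return risk, coverage

clf = lambda x: 1 if x >= 0 else -1
g = lambda x: 1 if abs(x) > 0.5 else 0   # abstain near the decision boundary
sample = [(-2.0, -1), (-0.2, 1), (0.3, -1), (1.0, 1)]
risk, cov = selective_risk_and_coverage(clf, g, sample)  # -> (0.0, 0.5)
```

Here abstaining on the two points near the boundary halves the coverage but removes both potential errors, which is exactly the trade-off the RC curve traces.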
For any hypothesis class H, target hypothesis h ∈ H, distribution P, sample S_m, and real r > 0, define

V(h, r) ≜ {h′ ∈ H : R(h′) ≤ R(h) + r}   and   V̂(h, r) ≜ {h′ ∈ H : R̂(h′) ≤ R̂(h) + r}.
3 Perfect and weakly optimal selective classifiers

The concept of perfect classification was introduced in [1] within a realizable selective classification setting. Perfect classification is an extreme case of selective classification in which a selective classifier (h, g) achieves R(h, g) = 0 with certainty; that is, the classifier never errs on its region of activity.
In some cases (e.g., a finite hypothesis class) the rejected region diminishes at rate Ω(1/m), where m is the size of the training set. In agnostic environments, as we consider here, such perfect classification appears to be out of reach. In general, in the worst case, no hypothesis can achieve zero error over any nonempty subset of the domain. (Footnote 1: Some authors refer to an equivalent variant of this curve as the "Accuracy-Rejection Curve," or ARC.)
We consider here the following weaker, but still extremely desirable, behavior, which we call "weakly optimal selective classification." Let h* ∈ H be the true risk minimizer of our problem, and let (h, g) be a selective classifier selected after observing the training set S_m. We say that (h, g) is a weakly optimal selective classifier if, for any 0 < δ < 1, with probability at least 1 − δ over random choices of S_m, R(h, g) ≤ R(h*, g). That is, with high probability our classifier is at least as good as the true risk minimizer over its region of activity. We call this classifier "weakly optimal" because a stronger requirement would be that the classifier achieve the best possible error among all hypotheses in H restricted to the region of activity defined by g.
4 A learning strategy

We now present a strategy that will later be shown to achieve non-trivial weakly optimal selective classification under certain conditions. Using standard concentration inequalities, one can show that the training error of the true risk minimizer h* cannot be "too far" from the training error of the empirical risk minimizer ĥ. Therefore, we can guarantee, with high probability, that the class of all hypotheses with "sufficiently low" empirical error includes the true risk minimizer h*. Selecting only the subset of the domain on which all hypotheses in that class agree is then sufficient to guarantee weak optimality.
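The "sufficiently low" slack is a standard VC-type term that shrinks roughly like √(d log m / m). The sketch below computes one such term; the exact constants are a reconstruction of the paper's σ(m, δ, d) and should be treated as illustrative.

```python
import math

def sigma(m, delta, d):
    """A VC-type slack term: with high probability, R_hat(h*) is within
    sigma of the minimal empirical risk. Constants are a reconstruction."""
    return math.sqrt((8.0 / m) * (d * math.log(2 * math.e * m / d)
                                  + math.log(8.0 / delta)))

# The slack shrinks roughly like sqrt(d log m / m) as the sample grows:
s_small_m = sigma(100, 0.05, 3)      # loose at m = 100
s_large_m = sigma(10000, 0.05, 3)    # much tighter at m = 10000
```

The key qualitative point is monotonicity: more data shrinks the slack, so the low-empirical-error class tightens around h* and the agreement region grows.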
Strategy 1 (learning strategy for weakly optimal selective classifiers).
Input: S_m, m, δ, d.
Output: a selective classifier (h, g) such that R(h, g) = R(h*, g) with probability at least 1 − δ.
1. Set ĥ = ERM(H, S_m); i.e., ĥ is any empirical risk minimizer over H.
2. Set Ĝ = V̂(ĥ, σ(m, δ, d)), where σ(m, δ, d) ≜ √((8/m)(d ln(2me/d) + ln(8/δ))) (see Eq. (1)).
3. Construct g such that g(x) = 1 if and only if all hypotheses in Ĝ agree on the label of x.
4. Output (ĥ, g).
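For a finite hypothesis class the strategy can be run directly. The sketch below implements it for 1-D threshold classifiers, with a fixed slack parameter standing in for the VC-based σ(m, δ, d); all names and data are illustrative.

```python
def h(t):
    """Threshold classifier h_t(x) = +1 if x >= t else -1."""
    return lambda x: 1 if x >= t else -1

def empirical_risk(clf, sample):
    return sum(1 for x, y in sample if clf(x) != y) / len(sample)

def strategy1(thresholds, sample, slack):
    """Sketch of Strategy 1 over a finite class of thresholds."""
    risks = {t: empirical_risk(h(t), sample) for t in thresholds}
    t_hat = min(risks, key=risks.get)          # step 1: ERM
    r_hat = risks[t_hat]
    # Step 2: empirical low-error class, the analog of V_hat(h_hat, slack).
    G = [t for t in thresholds if risks[t] <= r_hat + slack]
    # Step 3: accept x only where every surviving hypothesis agrees.
    def g(x):
        return 1 if len({h(t)(x) for t in G}) == 1 else 0
    return h(t_hat), g                          # step 4

sample = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
clf, g = strategy1([-1.5, 0.0, 1.5], sample, slack=0.3)
# points far from every surviving threshold are accepted (e.g. x = -3),
# while points the low-error hypotheses disagree on are rejected
```

With this slack all three thresholds survive, so the rejected region is exactly the interval on which they disagree, mirroring the agreement-region construction in step 3.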
Consider an instance of a binary learning problem with hypothesis class H, an underlying distribution P over X × Y, and a loss function ℓ : Y × Y → [0, 1]. Let h* = arg inf_{h∈H} E[ℓ(h(X), Y)] be the true risk minimizer. In the following sequence of lemmas and theorems we assume a binary hypothesis class H with VC-dimension d, an underlying distribution P over X × {±1}, and the 0/1 loss function ℓ.
Hanneke introduced a complexity measure for active learning problems termed the disagreement coefficient [5]. The disagreement coefficient of h with respect to H under distribution P is

θ_h ≜ sup_{r>ϵ} ΔB(h, r) / r,    (4)

where ϵ = 0. The disagreement coefficient of the hypothesis class H with respect to P is defined as θ ≜ lim sup_{k→∞} θ_{h(k)}, where h(k) is any sequence of hypotheses in H with R(h(k)) monotonically decreasing.
Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class w.r.t. P.
By the definition of the disagreement coefficient, for any r′ > 0, ΔB(h*, r′) ≤ θr′.
Assume that H has disagreement coefficient θ and that F is a (β, B)-Bernstein class w.r.t. P. Let (h, g) be the selective classifier chosen by Strategy 1.
Hanneke introduced, in his original work [5], an alternative definition of the disagreement coefficient θ, in which the supremum in (4) is taken with respect to any fixed ϵ > 0. Using this alternative definition it is possible to show that fast coverage rates are achievable not only for finite disagreement coefficients (Theorem 5.5), but also when the disagreement coefficient grows slowly with respect to 1/ϵ (as shown by Wang [12], under sufficient smoothness conditions).
6 A disbelief principle and the risk-coverage trade-off

Theorem 5.5 tells us that the strategy presented in Section 4 not only outputs a weakly optimal selective classifier, but that this classifier also has guaranteed coverage (under some conditions).
The result is an alternative characterization of the selection function g of the weakly optimal selective classifier chosen by Strategy 1. Let (h, g) be a selective classifier chosen by Strategy 1 after observing the training sample S_m.
Let x be any point in X and let

ĥ_x ≜ argmin_{h∈H} { R̂(h) | h(x) = −sign(ĥ(x)) }

be an empirical risk minimizer forced to label x opposite to ĥ(x). The lemma tells us that in order to decide whether a point x should be rejected, we need to measure the empirical error R̂(ĥ_x) of this special empirical risk minimizer, which is constrained to label x opposite to ĥ(x). If this error is sufficiently close to R̂(ĥ), our classifier cannot be too sure about the label of x and we must reject it.
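The disbelief index can be written D(x) = R̂(ĥ_x) − R̂(ĥ). For a finite class the constrained ERM is just a restricted argmin, so the index can be computed exactly. The sketch below does this for 1-D thresholds; all names and data are illustrative.

```python
def h(t):
    """Threshold classifier h_t(x) = +1 if x >= t else -1."""
    return lambda x: 1 if x >= t else -1

def empirical_risk(clf, sample):
    return sum(1 for x, y in sample if clf(x) != y) / len(sample)

def disbelief_index(x, thresholds, sample):
    """D(x) = R_hat(h_x) - R_hat(h_hat), with h_x the best hypothesis
    forced to label x opposite to the ERM's label."""
    risks = {t: empirical_risk(h(t), sample) for t in thresholds}
    t_hat = min(risks, key=risks.get)
    flipped = -h(t_hat)(x)                      # the opposite label
    # Constrained ERM: minimize empirical risk subject to h(x) == flipped.
    feasible = [t for t in thresholds if h(t)(x) == flipped]
    if not feasible:                            # no hypothesis can flip x
        return float('inf')
    return min(risks[t] for t in feasible) - risks[t_hat]

sample = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
thresholds = [-1.5, 0.0, 1.5, 3.5]
big = disbelief_index(3.0, thresholds, sample)   # deep in the positive region
small = disbelief_index(0.1, thresholds, sample) # near the decision boundary
```

Flipping a point deep inside the positive region forces a badly fitting hypothesis (large D, high confidence), while flipping a point near the boundary is cheap (small D, reject first), which is exactly the disbelief principle.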
Observe that D(x) is large whenever our model is sensitive to the label of x, in the sense that when we are forced to bend our best model to fit the opposite label of x, the model substantially deteriorates, giving rise to a large disbelief index. A large D(x) can be interpreted as our disbelief in the possibility that x could be labeled so differently. Given a pool of test points, we can rank them according to their disbelief index; points with a low index should be rejected first. As in our disbelief index, the difference between the empirical risk (or importance-weighted empirical risk [14]) of two ERM oracles (with different constraints) is used to estimate prediction confidence.
7 Implementation

At this point in the paper we switch from theory to practice, aiming to implement rejection methods inspired by the disbelief principle and to see how well they work on real-world data. Attempting to implement a learning algorithm driven by the disbelief index, we face a major bottleneck: the calculation of the index requires the identification of ERM hypotheses. Another problem is that the disbelief index is a noisy statistic that depends strongly on the sample S_m. For each sample we calculate the disbelief index for all test points, and for each point we take the median of these measurements as the final index. We note that for any finite training sample the disbelief index is a discrete variable, and it is often the case that several test points share the same disbelief index.
8 Empirical results

Focusing on SVMs with a linear kernel, we compared the RC (risk-coverage) curves achieved by the proposed method with those achieved by SVM with rejection based on distance from decision boundary. This latter approach is very common in practical applications of selective classification. Before presenting these results, we wish to emphasize that the proposed method leads to rejection regions fundamentally different from those obtained by the traditional distance-based technique.
Figure 1: Confidence height map using (a) the disbelief index; (b) distance from decision boundary.
An SVM was trained on the entire training set, and test samples were sorted according to confidence (either distance from decision boundary or disbelief index). Figure 2 depicts the RC curves of our technique (red solid line) and rejection based on distance from decision boundary (green dashed line) for a linear kernel on all six datasets.
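An RC curve can be traced from any confidence ranking: sort test points by confidence and, for each coverage level, reject the least-confident suffix. A minimal sketch with illustrative data:

```python
def rc_curve(predictions):
    """predictions: list of (confidence, correct) pairs.
    Returns [(coverage, risk)], rejecting low-confidence points first."""
    ranked = sorted(predictions, key=lambda p: -p[0])  # most confident first
    curve, errors = [], 0
    for k, (_, correct) in enumerate(ranked, start=1):
        errors += 0 if correct else 1
        curve.append((k / len(ranked), errors / k))    # (coverage, risk)
    return curve

preds = [(0.9, True), (0.8, True), (0.4, False), (0.2, True)]
curve = rc_curve(preds)
# at full coverage the risk is 1/4; rejecting the two least-confident
# points drops the risk to 0
```

The same routine produces both curves in Figure 2; only the confidence score (disbelief index versus boundary distance) changes.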
Our method is shown in solid red, and rejection based on distance from decision boundary in dashed green.
While the traditional approach cannot achieve an error of less than 8% for any rejection rate, in our approach the test error decreases monotonically to zero with the rejection rate. Furthermore, a clear advantage for our method over a large range of rejection rates is evident in the Haberman dataset.
9 Related work

The literature on theoretical studies of selective classification is rather sparse. El-Yaniv and Wiener [1] studied the performance of a simple selective learning strategy for the realizable case. Given a hypothesis class H and a sample S_m, their method abstains from prediction if the hypotheses in the version space do not all agree on the target sample. They were able to show that their selective classifier achieves perfect classification with meaningful coverage under some conditions.
Given a hypothesis class H, the method outputs a weighted average of all the hypotheses in H, where the weight of each hypothesis depends exponentially on its individual training error. They were able to bound the probability of misclassification by 2R(h*) + ϵ(m) and, under some conditions, proved a bound of 5R(h*) + ϵ(F, m) on the rejection rate. We include in our "ensemble" only hypotheses with sufficiently low empirical error, and we abstain if the weighted average of all predictions is not definitive (≠ ±1).
Excess risk bounds were developed by Herbei and Wegkamp [19] for a model in which each rejection incurs a cost 0 ≤ d ≤ 1/2. Their bound applies to any empirical risk minimizer over a hypothesis class of ternary hypotheses (whose output is in {±1, reject}).
A rejection mechanism for SVMs based on distance from decision boundary is perhaps the most widely known and used rejection technique. A few papers have proposed alternative rejection techniques for SVMs, including taking the reject area into account during optimization [25], training two SVM classifiers with asymmetric costs [26], and using a hinge loss [20].
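The distance-from-boundary baseline is simple to state for a linear classifier w·x + b: confidence is the unsigned distance from the separating hyperplane, and low-distance points are rejected first. A minimal sketch with illustrative weights (not from the paper's experiments):

```python
import math

def margin_distance(w, b, x):
    """Distance of point x from the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

w, b = [3.0, 4.0], -5.0                     # illustrative linear classifier
near = margin_distance(w, b, [1.0, 0.5])    # on the boundary: rejected first
far = margin_distance(w, b, [3.0, 3.0])     # far away: accepted first
```

This is the confidence score the paper's RC-curve comparisons use as the traditional baseline; the disbelief index replaces it with a model-sensitivity measure rather than a geometric one.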
Grandvalet et al. [16] proposed an efficient implementation of SVM with a reject option using a double hinge loss. They empirically compared their results with two other selective classifiers: the one proposed by Bartlett and Wegkamp [20] and the traditional rejection based on distance from decision boundary. In their experiments there was no statistically significant advantage to either method over the traditional approach at high rejection rates.
10 Conclusion

We presented and analyzed a learning strategy for selective classification that achieves weak optimality. We showed that the coverage rate depends directly on the disagreement coefficient, thus linking active learning and selective classification. Recently it has been shown that, for the noise-free case, active learning can be reduced to selective classification [27]. Exact implementation of our strategy, or exact computation of the disbelief index, may be too difficult to achieve, or even to approximate with guarantees. Our empirical examination of the proposed algorithm indicates that it can provide a significant and consistent advantage over the traditional rejection technique with SVMs.
References (titles recovered from the bibliography):
A bound on the label complexity of agnostic active learning.
2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization.
Discussion of "2004 IMS medallion lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii.
Smoothness, disagreement coefficient, and the label complexity of agnostic active learning.
Classification with a reject option using a hinge loss.
An ordinal data method for the classification with reject option.
wordName wordTfidf (topN-words)
[('disbelief', 0.438), ('selective', 0.409), ('rejection', 0.291), ('disagreement', 0.201), ('classi', 0.197), ('sm', 0.197), ('reject', 0.179), ('risk', 0.179), ('agnostic', 0.135), ('coverage', 0.135), ('hx', 0.134), ('abstain', 0.128), ('er', 0.124), ('hypothesis', 0.111), ('weakly', 0.104), ('minimizer', 0.099), ('erm', 0.096), ('strategy', 0.084), ('perfect', 0.081), ('dis', 0.077), ('perfection', 0.073), ('hypotheses', 0.071), ('active', 0.071), ('index', 0.068), ('excess', 0.067), ('realizable', 0.064), ('breast', 0.064), ('haberman', 0.064), ('traditional', 0.059), ('region', 0.059), ('coef', 0.059), ('rc', 0.058), ('decision', 0.057), ('cancer', 0.054), ('rejected', 0.054), ('hanneke', 0.052), ('ers', 0.052), ('cation', 0.051), ('empirical', 0.05), ('curve', 0.049), ('grandvalet', 0.049), ('herbei', 0.049), ('ims', 0.049), ('medallion', 0.049), ('wegkamp', 0.049), ('class', 0.048), ('boundary', 0.046), ('label', 0.046), ('training', 0.045), ('bernstein', 0.044), ('diminishes', 0.043), ('hepatitis', 0.043), ('svms', 0.04), ('svm', 0.04), ('bartlett', 0.039), ('pima', 0.039), ('opposite', 0.038), ('br', 0.038), ('error', 0.038), ('distance', 0.037), ('complexities', 0.036), ('agree', 0.036), ('focusing', 0.035), ('dence', 0.034), ('facilitates', 0.033), ('heuristically', 0.033), ('microarray', 0.033), ('wiener', 0.033), ('medical', 0.033), ('option', 0.033), ('beygelzimer', 0.032), ('certainty', 0.032), ('freund', 0.032), ('hinge', 0.032), ('test', 0.031), ('optimizer', 0.03), ('libsvm', 0.03), ('weighted', 0.03), ('ln', 0.029), ('oracle', 0.029), ('diagnosis', 0.029), ('achieve', 0.029), ('least', 0.029), ('lecture', 0.028), ('loss', 0.028), ('inf', 0.028), ('observing', 0.027), ('diminishing', 0.027), ('accomplish', 0.027), ('inequalities', 0.027), ('pr', 0.027), ('union', 0.027), ('termed', 0.026), ('prediction', 0.026), ('dasgupta', 0.026), ('lemma', 0.025), ('support', 0.025), ('nition', 0.025), ('forced', 0.025), ('height', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
2 0.34572953 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a Ä?Ĺš nancial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The Ä?Ĺš rst is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identiÄ?Ĺš es low quality HMM states and abstain from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade-off prediction coverage to gain better accuracy in a controlled manner. We compare performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
3 0.15039597 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
4 0.1065411 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
5 0.10112198 198 nips-2011-On U-processes and clustering performance
Author: Stéphan J. Clémençcon
Abstract: Many clustering techniques aim at optimizing empirical criteria that are of the form of a U -statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U -processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the √ excess of clustering risk is proved to be of the order OP (1/ n). Based on recent results related to the tail behavior of degenerate U -processes, it is also shown how to establish tighter rate bounds. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered. 1
6 0.088405296 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning
7 0.088252418 178 nips-2011-Multiclass Boosting: Theory and Algorithms
8 0.085478686 21 nips-2011-Active Learning with a Drifting Distribution
9 0.085205995 19 nips-2011-Active Classification based on Value of Classifier
10 0.078377128 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
11 0.076365039 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
12 0.072056577 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
13 0.067119405 180 nips-2011-Multiple Instance Filtering
14 0.066667706 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
15 0.062886685 287 nips-2011-The Manifold Tangent Classifier
16 0.06281881 53 nips-2011-Co-Training for Domain Adaptation
17 0.061466742 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
18 0.06022625 29 nips-2011-Algorithms and hardness results for parallel large margin learning
19 0.059840698 165 nips-2011-Matrix Completion for Multi-label Image Classification
20 0.059827056 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
topicId topicWeight
[(0, 0.191), (1, -0.028), (2, -0.074), (3, -0.02), (4, 0.03), (5, 0.047), (6, 0.035), (7, -0.139), (8, -0.148), (9, -0.006), (10, -0.113), (11, 0.031), (12, 0.115), (13, 0.055), (14, -0.023), (15, -0.11), (16, -0.039), (17, -0.09), (18, 0.108), (19, 0.034), (20, 0.123), (21, 0.005), (22, 0.01), (23, 0.068), (24, 0.014), (25, 0.167), (26, -0.153), (27, -0.028), (28, 0.18), (29, -0.046), (30, 0.214), (31, -0.017), (32, 0.089), (33, 0.052), (34, -0.116), (35, -0.158), (36, -0.169), (37, -0.085), (38, -0.036), (39, 0.053), (40, 0.107), (41, 0.025), (42, -0.017), (43, 0.046), (44, -0.05), (45, -0.125), (46, 0.025), (47, -0.012), (48, 0.043), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.94095653 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
2 0.84100831 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a Ä?Ĺš nancial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The Ä?Ĺš rst is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identiÄ?Ĺš es low quality HMM states and abstain from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade-off prediction coverage to gain better accuracy in a controlled manner. We compare performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
3 0.63046616 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
4 0.59663081 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
Author: Christoph H. Lampert
Abstract: We study multi-label prediction for structured output sets, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multilabel classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label set, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds. 1
5 0.52892107 162 nips-2011-Lower Bounds for Passive and Active Learning
Author: Maxim Raginsky, Alexander Rakhlin
Abstract: We develop unified information-theoretic machinery for deriving lower bounds for passive and active learning schemes. Our bounds involve the so-called Alexander’s capacity function. The supremum of this function has been recently rediscovered by Hanneke in the context of active learning under the name of “disagreement coefficient.” For passive learning, our lower bounds match the upper bounds of Gin´ and Koltchinskii up to constants and generalize analogous results of Mase sart and N´ d´ lec. For active learning, we provide first known lower bounds based e e on the capacity function rather than the disagreement coefficient. 1
6 0.50103462 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
7 0.48805577 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
8 0.48101446 33 nips-2011-An Exact Algorithm for F-Measure Maximization
9 0.47683308 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
10 0.46123835 178 nips-2011-Multiclass Boosting: Theory and Algorithms
11 0.46029961 59 nips-2011-Composite Multiclass Losses
12 0.45608884 21 nips-2011-Active Learning with a Drifting Distribution
13 0.45286119 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
14 0.44938189 4 nips-2011-A Convergence Analysis of Log-Linear Training
15 0.44859117 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
16 0.41624504 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
17 0.40147313 198 nips-2011-On U-processes and clustering performance
18 0.39264837 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
19 0.3911663 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
20 0.3898935 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
topicId topicWeight
[(0, 0.028), (4, 0.046), (20, 0.018), (26, 0.037), (31, 0.052), (33, 0.019), (43, 0.045), (45, 0.589), (57, 0.019), (74, 0.037), (83, 0.017), (99, 0.019)]
simIndex simValue paperId paperTitle
1 0.99686193 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent El Ghaoui
Abstract: Sparse PCA provides a linear combination of a small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
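Both ingredients of the abstract above, pre-elimination of low-variance features and the extraction of a sparse component, can be sketched in a few lines. The variance threshold test and the soft-thresholded power iteration below are deliberate simplifications: the paper's exact elimination criterion and its block coordinate ascent algorithm differ.

```python
def soft(x, t):
    """Soft-thresholding operator: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def eliminate(cov, rho):
    """Keep only features whose variance exceeds the sparsity penalty
    rho; low-variance features cannot enter the penalized component.
    (A simplified stand-in for the paper's elimination test.)"""
    return [i for i in range(len(cov)) if cov[i][i] > rho]

def sparse_pc(cov, rho, iters=100):
    """One sparse principal component of covariance matrix `cov` via
    soft-thresholded power iteration, a common heuristic (the paper
    instead uses a faster block coordinate ascent)."""
    d = len(cov)
    v = [1.0 / d ** 0.5] * d           # uniform start on the sphere
    for _ in range(iters):
        w = [soft(sum(cov[i][j] * v[j] for j in range(d)), rho)
             for i in range(d)]
        n = sum(x * x for x in w) ** 0.5
        if n == 0.0:                   # penalty killed every coordinate
            break
        v = [x / n for x in w]
    return v
```

On data whose feature variances decay quickly, as the abstract notes is typical, `eliminate` discards most coordinates before the iteration ever runs.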
same-paper 2 0.99487609 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(√(1/m))^β), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
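The distance-from-boundary baseline that this abstract compares against is simple to sketch: predict only when the classifier's margin is large, abstain otherwise, and measure the resulting risk–coverage trade-off. The threshold rule below is that traditional baseline, not the paper's constrained-SVM method; scores are assumed to be signed margins of a binary classifier.

```python
def selective_predict(score, threshold):
    """Distance-from-boundary rejection: return a label in {-1, +1}
    when the margin |score| reaches `threshold`, else None (abstain)."""
    if abs(score) < threshold:
        return None
    return 1 if score > 0 else -1

def risk_coverage(scores, labels, threshold):
    """Empirical risk on accepted points, and coverage: the fraction
    of points on which the selective classifier dares to predict."""
    accepted = [(s, y) for s, y in zip(scores, labels)
                if abs(s) >= threshold]
    if not accepted:
        return 0.0, 0.0
    errors = sum(1 for s, y in accepted
                 if (1 if s > 0 else -1) != y)
    return errors / len(accepted), len(accepted) / len(scores)
```

Sweeping `threshold` from 0 upward traces the full risk–coverage curve; the paper's result is that coverage can approach 1 while the risk tracks that of the best hypothesis in the class.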
3 0.99484539 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
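The basic proximal-gradient iteration the abstract analyzes can be sketched for the lasso, where the smooth term is 0.5·||Ax − b||² and the proximity operator of the l1 term is soft-thresholding; the paper's results cover inexact versions of exactly these two steps. The version below computes both steps exactly (the error-free case), and all names are illustrative.

```python
def soft_threshold(x, t):
    """Proximity operator of t * |.| (componentwise soft-threshold)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0

def ista(A, b, lam, step, iters=200):
    """Basic proximal-gradient method (ISTA) for
    min_x 0.5 * ||A x - b||^2 + lam * ||x||_1.
    `step` should be at most 1 / ||A^T A|| for convergence."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # gradient of the smooth term: A^T (A x - b)
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
             for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # gradient step followed by the prox of the non-smooth term
        x = [soft_threshold(x[j] - step * g[j], step * lam)
             for j in range(n)]
    return x
```

Replacing the exact `g` or the exact `soft_threshold` with approximations is precisely the setting in which the paper shows the error-free rates survive, provided the errors decay fast enough.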
4 0.99209744 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate a generalization of the ellipsoid algorithm that incurs O(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T . 1
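A one-dimensional toy version of this feedback model shows the core idea of optimizing from noisy function values alone: average repeated queries at two interior points to beat the noise, then discard a third of the interval. This trisection is only a stand-in for, not an implementation of, the paper's ellipsoid-based method, and all parameters below are illustrative.

```python
import random

def noisy_trisection(f, lo, hi, noise, rounds=30, reps=200, seed=1):
    """Minimize a convex f on [lo, hi] from noisy evaluations
    f(x) + N(0, noise^2): each round, average `reps` queries at two
    interior points and keep the two-thirds of the interval on the
    side of the smaller estimate."""
    rng = random.Random(seed)

    def query(x):
        return sum(f(x) + rng.gauss(0.0, noise)
                   for _ in range(reps)) / reps

    for _ in range(rounds):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if query(a) < query(b):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```

Averaging shrinks the noise of each comparison by a factor of √reps, which is what lets the interval keep contracting; the paper's contribution is achieving the analogous guarantee in d dimensions with near-optimal √T regret.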
5 0.9869675 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.98431033 293 nips-2011-Understanding the Intrinsic Memorability of Images
7 0.98324752 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
8 0.94692171 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
9 0.91567129 19 nips-2011-Active Classification based on Value of Classifier
10 0.90237719 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
11 0.9020434 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
12 0.89601052 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
13 0.89380753 220 nips-2011-Prediction strategies without loss
14 0.89300728 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
15 0.88872874 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
16 0.88809854 78 nips-2011-Efficient Methods for Overlapping Group Lasso
17 0.88732243 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
18 0.87688065 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
19 0.87615323 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
20 0.87466139 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model