nips nips2012 nips2012-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1
Reference: text
sentIndex sentText sentNum sentScore
1 de Abstract We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. [sent-3, score-0.765]
2 In this case, new test instances have to be drawn and labeled at a cost. [sent-5, score-0.263]
3 We devise an active comparison method that selects instances according to an instrumental sampling distribution. [sent-6, score-0.89]
4 We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. [sent-7, score-0.386]
5 In practice, it is not always possible to compare the models’ risks on held-out training data. [sent-10, score-0.32]
6 The suppliers of the models could provide risk estimates based on held-out training data; however, such estimates might be biased because the training data would not necessarily reflect the distribution of images the deployed models will be exposed to. [sent-13, score-0.398]
7 By the time a new predictive model is considered, a previous risk estimate of the baseline model may no longer be accurate. [sent-17, score-0.292]
8 The standard approach to comparing models would be to draw n test instances according to the test distribution which the model is exposed to in practice, label these data, and calculate the difference of the empirical √ ˆn 2 ˆ risks ∆n and the sample variance Sn . [sent-19, score-0.894]
9 In many application scenarios, unlabeled test instances are readily available whereas the process of labeling data is costly. [sent-21, score-0.501]
10 We study an active model comparison process that, in analogy to active learning, selects instances from a pool of unlabeled test data and queries their labels. [sent-22, score-1.27]
11 Instances are selected according to an instrumental sampling distribution q. [sent-23, score-0.324]
12 The empirical difference of the models’ risks is weighted appropriately to compensate for the discrepancy between instrumental and test distributions which leads to consistent—that is, asymptotically unbiased—risk estimates. [sent-24, score-0.835]
13 1 The principal theoretical contribution of this paper is the derivation of a sampling distribution q that allows us to make the decision to prefer the superior model as confidently as possible given a fixed labeling budget n, if one of the models is in fact superior. [sent-25, score-0.599]
14 Equivalently, one may use q to minimize the labeling costs n required to reach a correct decision at a prescribed level of confidence. [sent-26, score-0.48]
15 The active comparison problem that we study can be seen as an extreme case of active learning, in which the model space contains only two (or, more generally, a small number of) models. [sent-27, score-0.945]
16 For the special case of classification with zero-one loss and two models under study, a simplified version of the sampling distribution we derive coincides with the sampling distribution used in the A 2 and IWAL active learning algorithms proposed by Balcan et al. [sent-28, score-0.863]
17 For A 2 and IWAL , the derivation of this distribution is based on finite-sample complexity bounds, while in our approach, it is based on maximizing the power of a statistical test comparing the models under study. [sent-31, score-0.357]
18 A further difference to active learning is that our goal is not only to choose the best model, but also to obtain a well-calibrated p-value indicating the confidence with which this decision can be made. [sent-33, score-0.473]
19 Our method is also related to recent work on active data acquisition strategies for the evaluation of a single predictive model, in terms of standard risks [8] or generalized risks that subsume precision, recall, and f-measure [9]. [sent-34, score-1.178]
20 The problem addressed in this paper is different in that we seek to assess the relative performance of two models, without necessarily determining absolute risks precisely. [sent-35, score-0.347]
21 have studied active model selection, where the goal is also to identify a model with lowest risk [5]. [sent-37, score-0.629]
22 However, in their setting costs are associated with obtaining predictions y = f (x), while in our setting costs are associated with obtaining labels y ∼ p(y|x). [sent-38, score-0.372]
23 Hoeffding ˆ races [6] and sequential sampling algorithms [10] perform efficient model selection by keeping track of risk bounds for candidate models and removing models that are clearly outperformed from consideration. [sent-39, score-0.459]
24 The goal of these methods is to reduce computational complexity, not labeling effort. [sent-40, score-0.258]
25 Section 3 derives the instrumental distribution and details our theoretical findings. [sent-43, score-0.237]
26 Let p(y|x; θ1 ) and p(y|x; θ2 ) be given θ-parameterized models of p(y|x) and let fj : X → Y with fj (x) = arg maxy p(y|x; θj ) be the corresponding predictive functions. [sent-47, score-0.404]
27 The risks of f1 , f2 are given by R[fj ] = (fj (x), y)p(x, y)dy dx (1) for a loss function : Y × Y → R. [sent-48, score-0.395]
28 The standard approach to comparing models is to compare empirical risk estimates 1 ˆ Rn [fj ] = n n (fj (xi ), yi ), (2) i=1 where n test instances (xi , yi ) are drawn from p(x, y) = p(x)p(y|x). [sent-50, score-0.61]
29 We will focus on a data labeling process that draws test instances according to an instrumental distribution q(x) rather than p(x). [sent-53, score-0.668]
30 Let q(x) denote an instrumental distribution with the property that p(x) > 0 implies q(x) > 0 for all x ∈ X . [sent-55, score-0.234]
31 Weighting factors p(xi ) compensate for the q(x q(x discrepancy between test and instrumental distribution, and the normalizer is the sum of weights. [sent-57, score-0.316]
32 In preferring one model over the on which model is preferable; a positive ∆ ˆ other, one rejects the null hypothesis that the observed difference ∆n,q is only a random effect, and ˆ R[f1 ] = R[f2 ] holds. [sent-61, score-0.361]
33 The null hypothesis implies that the mean of ∆n,q is asymptotically zero. [sent-62, score-0.456]
34 , [3]), it further implies that the statistic Because ∆ ˆ √ ∆n,q n ∼ N (0, 1) σn,q 1 2 ˆ is asymptotically standard-normally distributed, where n σn,q = Var[∆n,q ] denotes the variance ˆ n,q . [sent-65, score-0.311]
35 Because Sn,q consistently estimates σn,q , the null hypothesis also implies that the observable statistic is asymptotically standard normally distributed, ˆ √ ∆n,q n ∼ N (0, 1). [sent-69, score-0.639]
36 The p-value quantifies the likelihood of observing the given absolute value of the test statistic, or a higher value, by chance under the null hypothesis. [sent-74, score-0.304]
37 Student’s t-distribution can serve as a more popular approximation of the distribution of a test statistic under the null hypothesis, resulting in the common t-test. [sent-75, score-0.431]
38 Note, however, that Sn,q would have to be a sum of squared, normally distributed random variables for the test statistic to be asymptotically governed by the t-distribution. [sent-76, score-0.378]
39 If the null hypothesis does not hold and the two models incur different risks, the distribution of the test statistic depends on the chosen sampling distribution q(x). [sent-78, score-0.805]
40 Our goal is to find a distribution q(x) that allows us to tell the risks of f1 and f2 apart with high confidence. [sent-79, score-0.371]
41 More formally, the power of a test when sampling from q(x) is the likelihood that the null hypothesis can be rejected, that is, the likelihood that the p-value falls below a pre-specified confidence threshold α. [sent-80, score-0.548]
42 Our goal is to find the sampling distribution q that maximizes test power: q ∗ = arg max p 2 1 − Φ q 3 ˆ √ |∆n,q | n Sn,q ≤α . [sent-81, score-0.304]
43 2 discusses the sampling distribution in a pool-based setting and presents the active comparison algorithm. [sent-86, score-0.671]
44 σn,q ˆ |∆n,q | Sn,q (8) of the test statistic follows a folded normal dis- n∆ tribution with location parameter σn,q and scale parameter one. [sent-92, score-0.283]
45 The distribution q ∗ (x) ∝ p(x) 2 ( (f1 (x), y) − (f2 (x), y) − ∆) p(y|x)dy asymptotically maximizes βn,q ; that is, for any other sampling distribution q = q ∗ it holds that βn,q < βn,q∗ for sufficiently large n. [sent-98, score-0.404]
46 Before we prove Theorem 1, we show that a sampling distribution asymptotically maximizes βn,q if ˆ and only if it minimizes the asymptotic variance of the estimator ∆n,q . [sent-99, score-0.445]
47 Lemma 2 shows that in order to solve the optimization problem given by Equation 6, we need to find the sampling distribution minimizing the asymptotic ˆ variance of the estimator ∆n,q . [sent-104, score-0.261]
48 2 (12) Empirical Sampling Distribution The distribution q ∗ also depends on the true conditional p(y|x) and the true difference in risks ∆. [sent-117, score-0.471]
49 Note that as long as p(x) > 0 implies q(x) > 0, any choice of q will yield consistent risk estimates because weighting factors account for the discrepancy between sampling and test distribution (Equation 3). [sent-119, score-0.538]
50 To approximate the true conditional p(y|x), we use the given predictive models p(y|x; θ1 ) and p(y|x; θ2 ), and assume a mixture distribution giving equal weight to both models: 1 1 (13) p(y|x) ≈ p(y|x; θ1 ) + p(y|x; θ2 ). [sent-121, score-0.232]
51 2 2 The risk difference ∆ is replaced by a difference ∆θ of introspective risks calculated from Equa1 tion 1, where the integral over X is replaced by a sum over the pool, p(x) = m , and p(y|x) is approximated by Equation 13. [sent-122, score-0.557]
52 We will now derive the empirical sampling distribution for two standard loss functions. [sent-123, score-0.231]
53 This baseline coincides with the A 2 as well as the IWAL active learning algorithms, applied to the model space {f1 , f2 }, as can be seen from inspection of Algorithm 1 in [1] and Algorithms 1 and 2 in [2]. [sent-132, score-0.49]
54 If only predictions fj (x) but 2 no predictive distribution is available, we can assume peaked distributions with τx → 0, leading to q ∗ (x) ∝ (f1 (x) − f2 (x))2 , 2 or we can assume infinitely broad predictive distributions with τx → ∞, leading to ∗ q (x) ∝ |f1 (x) − f2 (x)|. [sent-147, score-0.37]
55 It samples n instances with replacement from the pool according to the distribution prescribed by Derivations 4 (for zero-one loss) or 5 (for squared loss) and queries their label. [sent-150, score-0.312]
56 Note that instances can be drawn more than once; in the special case that the labeling process is deterministic, the actual labeling costs may thus stay below the sample size. [sent-151, score-0.848]
57 In this case, the loop is continued until the labeling budget is exhausted. [sent-152, score-0.303]
58 We have so far focused on the problem of comparing the risks of two prediction models, such as a baseline and a challenger. [sent-153, score-0.412]
59 We might also consider several alternative models; the objective of an evaluation could be to rank the models according to the risk incurred or to identify the model with lowest risk. [sent-154, score-0.237]
60 Standard generalizations of the Wald test that compare multiple alternatives—for instance, within-subject ANOVA [11]—try to reject the null hypothesis that the means of all considered alternatives are equal. [sent-155, score-0.472]
61 Choosing a sampling distribution q that maximizes the power of such a test would thus in general not reflect the objectives of the empirical evaluation. [sent-157, score-0.386]
62 Accordingly, we derive a heuristic sampling distribution for the comparison of multiple models θ1 , . [sent-159, score-0.279]
63 , θk as a mixture of pairwise-optimal sampling distributions, 1 ∗ q ∗ (x) = qi,j (x), (15) k(k − 1) i=j ∗ qi,j where denotes the optimal distribution for comparing the models θi and θj given by Theorem 1. [sent-162, score-0.265]
64 4 Empirical Results We study the empirical behavior of active comparison (Algorithm 1, labeled active in all diagrams) relative to a risk comparison based on a test sample drawn uniformly from the pool (labeled passive) 6 active active≠ passive 0. [sent-167, score-2.155]
65 5 1 50 ARE A2 IWAL 100 150 labeling costs n 0. [sent-168, score-0.444]
66 95 200 400 600 labeling costs n ARE passive Abalone (Regression, 5 Models) 0. [sent-180, score-0.741]
67 4 model selection accuracy Spam Filtering (Classification, 2 Models) model selection accuracy model selection accuracy 0. [sent-196, score-0.24]
68 9 200 400 600 labeling costs n ARE passive 800 Inverse Dynamics (Regression, 5 Models) 0. [sent-197, score-0.741]
69 4 200 400 600 labeling costs n ARE passive 800 Figure 1: Model selection accuracy over labeling costs for comparison of two prediction models (top) and multiple prediction models (bottom). [sent-201, score-1.426]
70 We also include the active ∗ ∗ risk estimator presented in [8] in our study, which infers optimal sampling distributions q1 and q2 for individually estimating the risks of the models θ1 and θ2 . [sent-205, score-1.082]
71 Each comparison method returns the model 2 2 with lower empirical risk and the p-value of a paired two-sided test. [sent-207, score-0.24]
72 When studying classification, we also include the active learning algorithms A 2 [1] and IWAL [2] as baselines by using them to sample test instances. [sent-208, score-0.563]
73 Models are trained on part of the available data; the rest of the data serve as the pool of unlabeled test instances for which labels can be queried. [sent-215, score-0.325]
74 The true risk is taken to be the risk over all test instances in the pool. [sent-220, score-0.539]
75 Figure 1 (top) shows that for the comparison of two models active results in significantly higher model selection accuracy than passive, or, equivalently, saves between 70% and 90% of labeling effort. [sent-221, score-0.957]
76 Differences between active and the simplified variants active0 , active∞ , and active= are marginal. [sent-222, score-0.443]
77 In the object recognition domain, active saves approximately 70% of labeling effort compared to passive. [sent-228, score-0.843]
78 A 2 and IWAL outperform passive but are less accurate than active. [sent-229, score-0.297]
79 For the regression domains, active saves between 60% and 85% of labeling effort compared to passive. [sent-230, score-0.9]
80 2 Significance Testing: Type I and Type II Errors We now study how often a comparison method is able to reject the null hypothesis that two predictive models incur identical risks, and the calibration of the resulting p-values. [sent-232, score-0.606]
81 1 passive active 200 400 600 labeling costs n Average p−value Abalone (Regression) average p−value True Positive Significance Abalone (Regression, n=800) 1 frequency frequency True Positive Significance Inverse Dynamics (Regression, n=800) 1 passive active 0. [sent-251, score-2.02]
82 passive active 200 400 600 800 labeling costs n 0. [sent-278, score-1.184]
83 02 0 passive active 200 400 600 800 labeling costs n Figure 3: False-positive significance rate over test level α (left, left-center). [sent-280, score-1.273]
84 False-positive significance rate over labeling costs n (right-center, right). [sent-281, score-0.444]
85 method active= is equivalent to passive applied to D= = {x ∈ D|f1 (x) = f2 (x)} (see Section 3. [sent-283, score-0.297]
86 Figure 2 (left, left-center) shows how often the active and passive comparison methods are able to reject the null hypothesis that the two models incur identical risk. [sent-288, score-1.251]
87 The true risks incurred are never equal in these experiments. [sent-289, score-0.355]
88 We observe that active is able to reject the null hypothesis more often and with a higher confidence. [sent-290, score-0.792]
89 In the Abalone domain, active rejects the null hypothesis at α = 0. [sent-291, score-0.774]
90 001 more often than passive is able to reject it at α = 0. [sent-292, score-0.356]
91 Figure 2 (right-center, right) shows that active comparison also results in lower average p-values, in particular for large n. [sent-294, score-0.502]
92 5; this ensures that the true risks of f1 and f2 coincide. [sent-297, score-0.355]
93 We finally study a protocol in which test instances are drawn and labeled until the null hypothesis can be rejected or the labeling budget is exhausted. [sent-300, score-0.888]
94 Results (included in the online appendix) indicate that active incurs the lowest average labeling costs, obtains significance results most often, and has the lowest likelihood of incorrectly choosing the model with higher true risk. [sent-301, score-0.835]
95 5 Conclusion We have derived the sampling distribution that asymptotically maximizes the power of a statistical test that compares the risk of two predictive models. [sent-302, score-0.738]
96 The sampling distribution intuitively gives preference to test instances on which the models disagree strongly. [sent-303, score-0.424]
97 Empirically, we observed that the resulting active comparison method consistently outperforms a traditional comparison based on a uniform sample of test instances. [sent-304, score-0.65]
98 Active comparison identifies the model with lower true risk more often, and is able to detect significant differences between the risks of two given models more quickly. [sent-305, score-0.615]
99 In the four experimental domains that we studied, performing active comparison resulted in a saved labeling effort of between 60% and over 90%. [sent-306, score-0.846]
100 We also performed experiments under the null hypothesis that both models incur identical risks, and verified that active comparison does not lead to increased false-positive significance results. [sent-307, score-0.895]
wordName wordTfidf (topN-words)
[('active', 0.443), ('risks', 0.32), ('passive', 0.297), ('labeling', 0.258), ('iwal', 0.2), ('null', 0.188), ('costs', 0.186), ('instrumental', 0.155), ('risk', 0.15), ('significance', 0.15), ('abalone', 0.146), ('asymptotically', 0.138), ('fj', 0.129), ('sampling', 0.118), ('instances', 0.115), ('statistic', 0.103), ('hypothesis', 0.102), ('landwehr', 0.1), ('sawade', 0.1), ('predictive', 0.095), ('regression', 0.089), ('test', 0.089), ('pool', 0.082), ('spam', 0.077), ('recognizers', 0.075), ('saves', 0.066), ('equation', 0.063), ('dently', 0.061), ('folded', 0.061), ('cance', 0.06), ('comparison', 0.059), ('reject', 0.059), ('scheffer', 0.057), ('dynamics', 0.052), ('incur', 0.052), ('var', 0.052), ('distribution', 0.051), ('power', 0.051), ('models', 0.051), ('challenger', 0.05), ('madani', 0.05), ('potsdam', 0.05), ('asymptotic', 0.05), ('selection', 0.048), ('frequency', 0.048), ('normally', 0.048), ('baseline', 0.047), ('beygelzimer', 0.046), ('maximizes', 0.046), ('budget', 0.045), ('comparing', 0.045), ('dx', 0.044), ('effort', 0.044), ('derivation', 0.043), ('variance', 0.042), ('domains', 0.042), ('races', 0.041), ('rejects', 0.041), ('discrepancy', 0.041), ('unlabeled', 0.039), ('false', 0.037), ('prescribed', 0.036), ('wald', 0.036), ('dy', 0.036), ('inverse', 0.036), ('xi', 0.036), ('included', 0.036), ('lowest', 0.036), ('true', 0.035), ('resolves', 0.035), ('alternatives', 0.034), ('classification', 0.034), ('prefer', 0.033), ('bars', 0.033), ('yi', 0.033), ('object', 0.032), ('accuracy', 0.032), ('balcan', 0.032), ('rejected', 0.032), ('estimates', 0.032), ('loss', 0.031), ('exposed', 0.031), ('empirical', 0.031), ('baselines', 0.031), ('drawn', 0.031), ('derives', 0.031), ('compensate', 0.031), ('normal', 0.03), ('con', 0.03), ('difference', 0.03), ('weighting', 0.029), ('labeled', 0.028), ('squared', 0.028), ('hoeffding', 0.028), ('implies', 0.028), ('online', 0.027), ('ect', 0.027), ('maximizing', 0.027), ('absolute', 0.027), ('approximated', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 32 nips-2012-Active Comparison of Prediction Models
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1
2 0.20850493 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
3 0.15045576 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
Author: Mijung Park, Jonathan W. Pillow
Abstract: Active learning methods can dramatically improve the yield of neurophysiology experiments by adaptively selecting stimuli to probe a neuron’s receptive field (RF). Bayesian active learning methods specify a posterior distribution over the RF given the data collected so far in the experiment, and select a stimulus on each time step that maximally reduces posterior uncertainty. However, existing methods tend to employ simple Gaussian priors over the RF and do not exploit uncertainty at the level of hyperparameters. Incorporating this uncertainty can substantially speed up active learning, particularly when RFs are smooth, sparse, or local in space and time. Here we describe a novel framework for active learning under hierarchical, conditionally Gaussian priors. Our algorithm uses sequential Markov Chain Monte Carlo sampling (“particle filtering” with MCMC) to construct a mixture-of-Gaussians representation of the RF posterior, and selects optimal stimuli using an approximate infomax criterion. The core elements of this algorithm are parallelizable, making it computationally efficient for real-time experiments. We apply our algorithm to simulated and real neural data, and show that it can provide highly accurate receptive field estimates from very limited data, even with a small number of hyperparameter samples. 1
4 0.12122462 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
5 0.12053008 271 nips-2012-Pointwise Tracking the Optimal Regression Function
Author: Yair Wiener, Ran El-Yaniv
Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.
6 0.11072936 305 nips-2012-Selective Labeling via Error Bound Minimization
7 0.098598294 34 nips-2012-Active Learning of Multi-Index Function Models
8 0.093618125 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
9 0.083236821 33 nips-2012-Active Learning of Model Evidence Using Bayesian Quadrature
10 0.081664667 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
11 0.079279862 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
12 0.077047296 200 nips-2012-Local Supervised Learning through Space Partitioning
13 0.075216487 74 nips-2012-Collaborative Gaussian Processes for Preference Learning
14 0.072996035 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
15 0.072534218 247 nips-2012-Nonparametric Reduced Rank Regression
16 0.071298137 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction
17 0.068963788 16 nips-2012-A Polynomial-time Form of Robust Regression
18 0.065830134 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
19 0.064461589 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries
20 0.064084865 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
topicId topicWeight
[(0, 0.191), (1, -0.001), (2, 0.014), (3, 0.007), (4, 0.013), (5, -0.004), (6, 0.023), (7, 0.131), (8, -0.04), (9, -0.075), (10, -0.055), (11, 0.016), (12, 0.007), (13, -0.053), (14, -0.012), (15, -0.075), (16, -0.004), (17, -0.007), (18, -0.081), (19, 0.04), (20, -0.028), (21, 0.077), (22, 0.011), (23, 0.106), (24, 0.075), (25, -0.069), (26, 0.062), (27, 0.045), (28, 0.079), (29, -0.106), (30, -0.152), (31, -0.046), (32, -0.118), (33, 0.05), (34, -0.126), (35, -0.009), (36, -0.108), (37, 0.033), (38, 0.117), (39, -0.022), (40, 0.031), (41, 0.014), (42, -0.135), (43, 0.088), (44, 0.057), (45, -0.044), (46, 0.096), (47, 0.048), (48, -0.099), (49, -0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.960177 32 nips-2012-Active Comparison of Prediction Models
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1
2 0.66513014 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain
Abstract: In this paper, we present a Bayesian framework for multilabel classiďŹ cation using compressed sensing. The key idea in compressed sensing for multilabel classiďŹ cation is to ďŹ rst project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efďŹ cient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key beneďŹ ts of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show signiďŹ cant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1
3 0.64464885 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray
Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1
4 0.6437223 271 nips-2012-Pointwise Tracking the Optimal Regression Function
Author: Yair Wiener, Ran El-Yaniv
Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.
5 0.55811691 34 nips-2012-Active Learning of Multi-Index Function Models
Author: Tyagi Hemant, Volkan Cevher
Abstract: We consider the problem of actively learning multi-index functions of the form k f (x) = g(Ax) = i=1 gi (aT x) from point evaluations of f . We assume that i the function f is defined on an 2 -ball in Rd , g is twice continuously differentiable almost everywhere, and A ∈ Rk×d is a rank k matrix, where k d. We propose a randomized, active sampling scheme for estimating such functions with uniform approximation guarantees. Our theoretical developments leverage recent techniques from low rank matrix recovery, which enables us to derive an estimator of the function f along with sample complexity bounds. We also characterize the noise robustness of the scheme, and provide empirical evidence that the highdimensional scaling of our sample complexity bounds are quite accurate. 1
6 0.55215937 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
7 0.54568928 305 nips-2012-Selective Labeling via Error Bound Minimization
8 0.52678651 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task
9 0.5263015 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
10 0.48744136 222 nips-2012-Multi-Task Averaging
11 0.48298866 279 nips-2012-Projection Retrieval for Classification
12 0.46541342 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
13 0.4566054 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support
14 0.44948277 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
15 0.44683257 288 nips-2012-Rational inference of relative preferences
16 0.44414264 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction
17 0.43928343 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
18 0.43310031 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
19 0.43299478 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds
20 0.42999241 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
topicId topicWeight
[(0, 0.019), (17, 0.013), (21, 0.021), (27, 0.23), (38, 0.157), (39, 0.013), (42, 0.017), (54, 0.016), (55, 0.024), (74, 0.046), (76, 0.192), (80, 0.135), (92, 0.044)]
simIndex simValue paperId paperTitle
1 0.96649039 39 nips-2012-Analog readout for optical reservoir computers
Author: Anteo Smerieri, François Duport, Yvon Paquot, Benjamin Schrauwen, Marc Haelterman, Serge Massar
Abstract: Reservoir computing is a new, powerful and flexible machine learning technique that is easily implemented in hardware. Recently, by using a timemultiplexed architecture, hardware reservoir computers have reached performance comparable to digital implementations. Operating speeds allowing for real time information operation have been reached using optoelectronic systems. At present the main performance bottleneck is the readout layer which uses slow, digital postprocessing. We have designed an analog readout suitable for time-multiplexed optoelectronic reservoir computers, capable of working in real time. The readout has been built and tested experimentally on a standard benchmark task. Its performance is better than non-reservoir methods, with ample room for further improvement. The present work thereby overcomes one of the major limitations for the future development of hardware reservoir computers. 1
2 0.9365012 109 nips-2012-Efficient Monte Carlo Counterfactual Regret Minimization in Games with Many Player Actions
Author: Neil Burch, Marc Lanctot, Duane Szafron, Richard G. Gibson
Abstract: Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing strategies in extensive-form games. The Monte Carlo CFR (MCCFR) variants reduce the per iteration time cost of CFR by traversing a smaller, sampled portion of the tree. The previous most effective instances of MCCFR can still be very slow in games with many player actions since they sample every action for a given player. In this paper, we present a new MCCFR algorithm, Average Strategy Sampling (AS), that samples a subset of the player’s actions according to the player’s average strategy. Our new algorithm is inspired by a new, tighter bound on the number of iterations required by CFR to converge to a given solution quality. In addition, we prove a similar, tighter bound for AS and other popular MCCFR variants. Finally, we validate our work by demonstrating that AS converges faster than previous MCCFR algorithms in both no-limit poker and Bluff. 1
3 0.89665002 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme
Author: Liva Ralaivola
Abstract: This paper provides the first —to the best of our knowledge— analysis of online learning algorithms for multiclass problems when the confusion matrix is taken as a performance measure. The work builds upon recent and elegant results on noncommutative concentration inequalities, i.e. concentration inequalities that apply to matrices, and, more precisely, to matrix martingales. We do establish generalization bounds for online learning algorithms and show how the theoretical study motivates the proposition of a new confusion-friendly learning procedure. This learning algorithm, called COPA (for COnfusion Passive-Aggressive) is a passive-aggressive learning algorithm; it is shown that the update equations for COPA can be computed analytically and, henceforth, there is no need to recourse to any optimization package to implement it. 1
same-paper 4 0.86244226 32 nips-2012-Active Comparison of Prediction Models
Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer
Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1
5 0.85794616 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
Author: Xianghang Liu, James Petterson, Tibério S. Caetano
Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1
6 0.79976231 348 nips-2012-Tractable Objectives for Robust Policy Optimization
7 0.77988327 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization
8 0.77684605 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
9 0.77553713 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
10 0.77527225 279 nips-2012-Projection Retrieval for Classification
11 0.77521586 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
12 0.77513319 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
13 0.77501255 280 nips-2012-Proper losses for learning from partial labels
14 0.77461463 200 nips-2012-Local Supervised Learning through Space Partitioning
15 0.77389657 197 nips-2012-Learning with Recursive Perceptual Representations
16 0.77372313 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
17 0.77344316 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
18 0.77292526 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
19 0.77221471 272 nips-2012-Practical Bayesian Optimization of Machine Learning Algorithms
20 0.77210808 74 nips-2012-Collaborative Gaussian Processes for Preference Learning