nips nips2010 nips2010-180 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Golovin, Andreas Krause, Debajyoti Ray
Abstract: We tackle the fundamental problem of Bayesian active learning with noise, where we need to adaptively select from a number of expensive tests in order to identify an unknown hypothesis sampled from a known prior distribution. In the case of noise-free observations, a greedy algorithm called generalized binary search (GBS) is known to perform near-optimally. We show that if the observations are noisy, perhaps surprisingly, GBS can perform very poorly. We develop EC2, a novel greedy active learning algorithm, and prove that it is competitive with the optimal policy, thus obtaining the first competitiveness guarantees for Bayesian active learning with noisy observations. Our bounds rely on a recently discovered diminishing-returns property called adaptive submodularity, which generalizes the classical notion of submodular set functions to adaptive policies. Our results hold even if the tests have non-uniform cost and their noise is correlated. We also propose EffECXtive, a particularly fast approximation of EC2, and evaluate it on a Bayesian experimental design problem involving human subjects, intended to tease apart competing economic theories of how people make decisions under uncertainty.
In all these applications, we have to sequentially select among a set of noisy, expensive observations (outcomes of experiments, medical tests, expert labels) in order to determine which hypothesis (theory, diagnosis, classifier) is most accurate. One way to formalize such active learning problems is Bayesian experimental design [6], where one assumes a prior on the hypotheses, as well as probabilistic assumptions on the outcomes of tests. The goal then is to determine the correct hypothesis while minimizing the cost of the experimentation. Unfortunately, finding this optimal policy is not just NP-hard, but also hard to approximate [5]. While there are some recent positive results in understanding the label complexity of noisy active learning [19, 1], to our knowledge there are so far no algorithms that are provably competitive with the optimal sequential policy, except in very restricted settings [16]. GBS greedily selects tests to maximize, in expectation over the test outcomes, the prior probability mass of eliminated hypotheses (i.e., of hypotheses inconsistent with the observed outcomes).
We introduce a general formulation of Bayesian active learning with noisy observations that we call the Equivalence Class Determination problem. We show that, perhaps surprisingly, generalized binary search performs poorly in this setting, as does greedily (myopically) maximizing the information gain (measured w.r.t. the distribution on equivalence classes) or the decision-theoretic value of information. This motivates us to introduce a novel active learning criterion, and to use it to develop a greedy active learning algorithm called the Equivalence Class Edge Cutting algorithm (EC2), whose expected cost is competitive with that of the optimal policy. Our key insight is that our new objective function satisfies adaptive submodularity [9], a natural diminishing-returns property that generalizes the classical notion of submodularity to adaptive policies. Our results also allow us to relax the common assumption that the outcomes of the tests are conditionally independent given the true hypothesis. Our results from human-subject experiments further reveal that EffECXtive can be used as a real-time tool to classify people according to the economic theory that best describes their behaviour in financial decision-making, and reveal some interesting heterogeneity in the population.
2 Bayesian Active Learning in the Noiseless Case

In the Bayesian active learning problem, we would like to distinguish among a given set of hypotheses H = {h1, ..., hn} by performing tests from a set T of N tests. Running test t incurs a cost of c(t) and produces an outcome from a finite set of outcomes X. We let H denote the random variable that equals the true hypothesis, and model the outcome of each test t by a random variable Xt taking values in X. In the noiseless case, we assume that the outcome of each test is deterministic given the true hypothesis; thus, each hypothesis h is associated with a particular vector of test outcomes. We assume these vectors are distinct, i.e., that no two hypotheses lead to the same outcomes for all tests.
Our goal is to find an adaptive policy for running tests that allows us to determine the value of H while minimizing the cost of the tests performed. Formally, a policy π (also called a conditional plan) is a partial mapping from partial observation vectors xA to tests, specifying which test to run next (or whether we should stop testing) for any observation vector xA. Hereby, xA ∈ X^A is a vector of outcomes indexed by the set of tests A ⊆ T that we have performed so far (e.g., the set of labeled examples in active learning, or the outcomes of a set of medical tests that we ran).
We denote the set of hypotheses consistent with an event Λ (often called the version space associated with Λ) by V(Λ) := {h ∈ H : P(h | Λ) > 0}. We call a policy feasible if it is guaranteed to uniquely determine the correct hypothesis. We define the expected cost of a policy π as c(π) := Σ_h P(h) c(T(π, h)), where T(π, h) ⊆ T is the set of tests run by policy π in case H = h. Our goal is to find a feasible policy π* of minimum expected cost, i.e., π* ∈ argmin_π c(π). A policy π can be naturally represented as a decision tree T, and thus the problem is equivalent to the Optimal Decision Tree (ODT) problem. Unfortunately, obtaining an approximate policy π for which c(π) ≤ c(π*) · o(log(n)) is NP-hard [5].
Two of the most popular heuristics are to greedily select tests so as to maximize the information gain (IG) or the version space mass eliminated (GBS). (Formally, we also require that (xt)t∈B ∈ dom(π) and A ⊆ B implies (xt)t∈A ∈ dom(π).) Both heuristics are greedy: after having made observations xA, they select t* = argmax_{t∈T} ΔAlg(t | xA) / c(t), where Alg ∈ {IG, GBS}. Thus, both heuristics greedily choose the test that maximizes the benefit-cost ratio, measured with respect to their particular benefit functions. They stop after running a set of tests A such that |V(xA)| = 1, i.e., once the true hypothesis is uniquely determined.
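The two heuristics differ only in their benefit function ΔAlg. A minimal sketch of the shared benefit-cost greedy rule, with both benefit functions plugged in; the four-hypothesis layout and test names are illustrative assumptions of our own:

```python
import math

def posterior(prior, hyps, observed):
    """Renormalized prior over the version space V(x_A)."""
    consistent = {h: p for h, p in prior.items()
                  if all(hyps[h][t] == x for t, x in observed.items())}
    z = sum(consistent.values())
    return {h: p / z for h, p in consistent.items()}

def delta_gbs(t, post, hyps):
    """Expected version-space mass eliminated by running test t."""
    gain = 0.0
    for x in {hyps[h][t] for h in post}:
        p_x = sum(p for h, p in post.items() if hyps[h][t] == x)
        gain += p_x * (1.0 - p_x)  # mass inconsistent with outcome x
    return gain

def delta_ig(t, post, hyps):
    """Expected reduction in Shannon entropy from running test t."""
    def entropy(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)
    gain = entropy(post)
    for x in {hyps[h][t] for h in post}:
        p_x = sum(p for h, p in post.items() if hyps[h][t] == x)
        cond = {h: p / p_x for h, p in post.items() if hyps[h][t] == x}
        gain -= p_x * entropy(cond)
    return gain

def greedy_select(tests, post, hyps, cost, benefit):
    """t* = argmax over t of Delta_Alg(t | x_A) / c(t)."""
    return max(tests, key=lambda t: benefit(t, post, hyps) / cost[t])

# Illustrative instance: tA splits the hypotheses 2/2, tB splits them 1/3.
hyps = {"h0": {"tA": 0, "tB": 1}, "h1": {"tA": 0, "tB": 0},
        "h2": {"tA": 1, "tB": 0}, "h3": {"tA": 1, "tB": 0}}
prior = {h: 0.25 for h in hyps}
cost = {"tA": 1.0, "tB": 1.0}
post = posterior(prior, hyps, {})
print(greedy_select(list(cost), post, hyps, cost, delta_gbs))  # -> tA, the balanced split
```

In the noiseless case both criteria favor the balanced test here; the point of the paper is that such myopic criteria can fail badly once equivalence classes enter the picture.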
The result above is proved by exploiting the fact that fGBS(S, h) := 1 − P(V(xS(h))) + P(h) is adaptive submodular and strongly adaptively monotone [9]. Informally, a function f : 2^T × H → R is adaptive submodular w.r.t. a distribution P if the expected marginal benefit Δ(t | xA) of adding a new test t can only decrease as we gather more observations. The performance guarantee for GBS follows from the following general result about the greedy algorithm for adaptive submodular functions (applied with Q = 1 and η = pmin):

Theorem 1 (Theorem 10 of [9] with α = 1). Suppose f : 2^T × H → R≥0 is adaptive submodular and strongly adaptively monotone with respect to P, and there exists Q such that f(T, h) = Q for all h. Let η be any value such that f(S, h) > Q − η implies f(S, h) = Q for all sets S and hypotheses h. Then, for self-certifying instances, the adaptive greedy policy π satisfies c(π) ≤ c(π*) (ln(Q/η) + 1).

The technical requirement that instances be self-certifying means that the policy will have proof that it has obtained the maximum possible objective value, Q, immediately upon doing so. In the following sections, we use the concept of adaptive submodularity to provide the first approximation guarantees for Bayesian active learning with noisy observations.
3 The Equivalence Class Determination Problem and the EC2 Algorithm

We now consider the Bayesian active learning problem where tests can have noisy outcomes. Our general strategy is to reduce the problem of noisy observations to the noiseless setting. To gain intuition, consider a simple model where tests have binary outcomes, and we know that the outcome of exactly one test, chosen uniformly at random unbeknownst to us, is flipped. If every pair of hypotheses differs in the outcome of at least three tests, we can still uniquely determine the correct hypothesis after running all tests. In this case we can reduce the noisy active learning problem to the noiseless setting by, for each hypothesis, creating N "noisy" copies, each obtained by flipping the outcome of one of the N tests. Thus, each hypothesis hi in the original problem is now associated with a set Hi of hypotheses in the modified problem instance. However, instead of selecting tests to determine which noisy copy has been realized, we only care which set Hi is realized.
More generally, we introduce the Equivalence Class Determination (ECD) problem, where our set of hypotheses H is partitioned into m equivalence classes {H1, ..., Hm}, so that H is the union of the Hi, and the goal is to determine which class Hi the true hypothesis lies in. As with the ODT problem, the goal is to minimize the expected cost of the tests, where the expectation is taken over the true hypothesis sampled from P.
GBS can perform very poorly on this problem. Consider an instance with hypotheses h1, ..., hn and two equivalence classes H1 := {hi : 1 ≤ i < n} and H2 := {hn}, under a uniform prior. There are n tests t ∈ {1, ..., n} such that hi(t) = 1[i = t], all of unit cost. In this case, the optimal policy only needs to select test n; however, GBS may select tests 1, 2, 3, ... in order until a test outcome determines the equivalence class of the true hypothesis. Given our uniform prior, it takes about n/2 tests in expectation until this happens, so that GBS pays, in expectation, roughly n/2 times the optimal expected cost on this instance. The poor performance of GBS in this instance may be attributed to its lack of consideration for the equivalence classes.
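The bad instance above is easy to check empirically. The simulation below is our own sketch; it assumes GBS happens to probe the (to it, indistinguishable) tests in the order 1, 2, ..., and counts how many tests a class-oblivious search needs before the equivalence class is determined:

```python
def gbs_tests_to_identify_class(n, true_h, order):
    """Number of tests a class-oblivious search needs on the bad instance.

    Hypotheses h_1..h_n with h_i(t) = 1[i == t]; classes H1 = {h_1..h_{n-1}}
    and H2 = {h_n}. Every test splits the version space identically, so GBS
    may probe them in an arbitrary (here: given) order.
    """
    alive = set(range(1, n + 1))
    for k, t in enumerate(order, start=1):
        if t == true_h:
            return k                      # outcome 1: hypothesis identified
        alive.discard(t)                  # outcome 0: eliminate h_t
        classes = {1 if h < n else 2 for h in alive}
        if len(classes) == 1:
            return k                      # equivalence class determined
    return len(order)

n = 100
order = list(range(1, n + 1))             # GBS probes tests 1, 2, ...
avg = sum(gbs_tests_to_identify_class(n, h, order) for h in range(1, n + 1)) / n
print(avg)  # about n/2, versus the single test (test n) of the optimal policy
```

Averaging over the uniform prior, the class-oblivious search pays on the order of n/2 tests, matching the analysis in the text.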
Another natural heuristic would be to run the greedy information gain policy πIG, only with the entropy measured with respect to the probability distribution on equivalence classes rather than on individual hypotheses. This policy is clearly aware of the equivalence classes, as it adaptively and myopically selects tests to reduce the uncertainty of the realized class, measured w.r.t. this class distribution. Nevertheless, πIG can also perform poorly. The reason why πIG fails is that there are complementarities among tests: a set of tests can be far better than the sum of its parts.
We adopt a very elegant idea from Dasgupta [8] and define weighted edges between hypotheses that we aim to distinguish. However, instead of introducing edges between arbitrary pairs of hypotheses (as done in [8]), we only introduce edges between hypotheses in different classes. Tests allow us to cut edges inconsistent with their outcomes, and we aim to eliminate all inconsistent edges while minimizing the expected cost incurred.
If all tests have unit cost, then by using a modified prior [15] the approximation factor can be improved to O(log |H| + log |supp(Θ)|), as in the case of Theorem 3. In this setting, reducing Bayesian active learning with noise to Equivalence Class Determination results in instances with exponentially large equivalence classes. This makes running EC2 on them challenging, since explicitly keeping track of the equivalence classes is impractical. Fortunately, the relevant quantities factorize. Identifying each hypothesis with its outcome vector xT, the weight of the edges between distinct equivalence classes Hi and Hj is w(Hi × Hj) = Σ_{xT ∈ Hi} Σ_{x'T ∈ Hj} P(xT) P(x'T) = P(XT ∈ Hi) P(XT ∈ Hj). Here we focus on the case where, upon observing all tests, the hypothesis is uniquely determined; in this case it holds that P(XT ∈ Hi) = P(H = hi).
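The edge-cutting bookkeeping behind EC2 can be sketched directly. The toy instance below (three hypotheses, two classes, and two hypothetical tests "t" and "u" of our own invention) builds the cross-class edge weights w({h, h'}) = P(h)P(h') and computes the expected weight of edges a test cuts, which serves as the EC2 greedy benefit:

```python
from itertools import combinations

def cross_class_edges(classes, prior):
    """Weighted edges between hypotheses in different equivalence classes."""
    edges = {}
    for (_, Hi), (_, Hj) in combinations(classes.items(), 2):
        for h in Hi:
            for h2 in Hj:
                edges[frozenset((h, h2))] = prior[h] * prior[h2]
    return edges

def ec2_gain(t, outcome_of, edges, post):
    """Expected weight of edges cut by test t: an observed outcome x cuts
    every edge incident to a hypothesis inconsistent with x."""
    gain = 0.0
    for x in {outcome_of[h][t] for h in post}:
        p_x = sum(p for h, p in post.items() if outcome_of[h][t] == x)
        cut = sum(w for e, w in edges.items()
                  if any(outcome_of[h][t] != x for h in e))
        gain += p_x * cut
    return gain

classes = {"H1": ["a", "b"], "H2": ["c"]}
prior = {"a": 1/3, "b": 1/3, "c": 1/3}
# Test "t" separates the classes; test "u" cuts across them.
outcome_of = {"a": {"t": 0, "u": 0}, "b": {"t": 0, "u": 1}, "c": {"t": 1, "u": 0}}
post = dict(prior)
edges = cross_class_edges(classes, prior)
print(ec2_gain("t", outcome_of, edges, post))  # cuts all cross-class weight in expectation
```

Test "t" cuts the entire cross-class edge weight (2/9) in expectation, so the greedy rule prefers it to "u"; in the large-class regime the text describes, these sums would be replaced by the product form P(XT ∈ Hi) P(XT ∈ Hj).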
Input: set of hypotheses H; set of tests T; prior distribution P; function f.
5 Experiments

Several economic theories claim to explain how people make decisions when the payoffs are uncertain. Here we use human-subject experiments to compare four key theories proposed in the literature. The uncertainty of the payoff in a given situation is represented by a lottery L, which is simply a random variable over a range of payoffs {ℓ1, ..., ℓk}, taking value ℓi with probability pi. The four theories posit distinct utility functions, with agents preferring lotteries of larger utility. In Prospect Theory, the parameters ΘPT = {ρ, λ, α} represent risk aversion, loss aversion, and the probability weighting factor, respectively. In Constant Relative Risk Aversion theory [20], there is a parameter ΘCRRA = a representing the level of risk aversion, and the posited utility is UCRRA(L) = Σ_i pi ℓi^(1−a)/(1 − a) if a ≠ 1, and UCRRA(L) = Σ_i pi log(ℓi) if a = 1.
The goal is to adaptively select a sequence of tests to present to a human subject in order to distinguish which of the four theories best explains the subject's responses. Each test asks the subject to choose between two lotteries L1^t and L2^t. Based on the theory that represents the subject's behaviour, one of the two lotteries would be preferred to the other, denoted by a binary response xt ∈ {1, 2}. The possible payoffs were fixed to L = {−10, 0, 10} (in dollars), and the distribution (p1, p2, p3) over the payoffs was varied over a discrete grid.
We compare six algorithms: EffECXtive, greedily maximizing Information Gain (IG), Value of Information (VOI), Uncertainty Sampling (US), Generalized Binary Search (GBS), and tests selected at Random. (Uncertainty Sampling greedily selects the test whose outcome distribution has maximum Shannon entropy.) We chose parameter values for the theories such that they made distinct predictions and were consistent with the values proposed in the literature [14]. Responses were generated using a softmax function, with the probability of response xt = 1 given by P(xt = 1) = 1/(1 + e^(U(L2^t) − U(L1^t))).
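The softmax response model is straightforward to implement. In the sketch below, the lottery probabilities and the plain expected-value utility are illustrative stand-ins of our own; the four theories in the text each substitute their own U:

```python
import math

def p_choose_first(u1, u2):
    """Softmax response model: P(x_t = 1) = 1 / (1 + exp(U(L2) - U(L1)))."""
    return 1.0 / (1.0 + math.exp(u2 - u1))

def expected_value_utility(lottery):
    """One illustrative utility: the expected value of a lottery given as
    (payoff, probability) pairs."""
    return sum(p * payoff for payoff, p in lottery)

# Payoffs fixed to {-10, 0, 10} dollars as in the text; the probability
# vectors here are arbitrary illustrative choices.
L1 = [(-10, 0.1), (0, 0.3), (10, 0.6)]
L2 = [(-10, 0.4), (0, 0.2), (10, 0.4)]
p1 = p_choose_first(expected_value_utility(L1), expected_value_utility(L2))
print(round(p1, 3))  # -> 0.993: the higher-utility lottery is chosen almost surely
```

The larger the utility gap U(L1) − U(L2), the more deterministic the simulated response, which is exactly the noise model the algorithms above must cope with.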
88 2 0 5 10 15 20 Number of tests 25 (a) Fixed parameters 30 0. [sent-256, score-0.295]
After obtaining informed consent according to a protocol approved by the Institutional Review Board of Caltech, we tested 11 human subjects to determine which model fit their behaviour best. Subjects were presented 30 tests selected by EffECXtive. To incentivise the subjects, one of these tests was picked at random, and subjects received payment based on the outcome of their chosen lottery. Two subjects were best described by Prospect Theory, as they exhibited a high degree of loss aversion and risk aversion. Although we need a larger sample to make significant claims about the validity of different economic theories, our preliminary results indicate that subject types can be identified and that there is heterogeneity in the population.
6 Conclusions

In this paper, we considered the problem of adaptively selecting which noisy tests to perform in order to identify an unknown hypothesis sampled from a known prior distribution. We studied the Equivalence Class Determination problem as a means to reduce the case of noisy observations to the classic, noiseless case. We introduced EC2, an adaptive greedy algorithm that is guaranteed to choose the same hypothesis as if it had observed the outcome of all tests, and that incurs near-minimal expected cost among all policies with this guarantee. EC2 works by greedily optimizing an objective tailored to differentiate between sets of observations that lead to different decisions. We believe that our results provide an interesting direction towards a theoretical foundation for practical active learning and experimental design problems.
References (partial list):
Golovin and Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization.
Zheng, Rish, and Beygelzimer. Efficient test selection in active diagnosis via entropy approximation.
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
2 0.31223473 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
Author: Gowtham Bellala, Suresh Bhavnani, Clayton Scott
Abstract: Generalized Binary Search (GBS) is a well known greedy algorithm for identifying an unknown object while minimizing the number of “yes” or “no” questions posed about that object, and arises in problems such as active learning and active diagnosis. Here, we provide a coding-theoretic interpretation for GBS and show that GBS can be viewed as a top-down algorithm that greedily minimizes the expected number of queries required to identify an object. This interpretation is then used to extend GBS in two ways. First, we consider the case where the objects are partitioned into groups, and the objective is to identify only the group to which the object belongs. Then, we consider the case where the cost of identifying an object grows exponentially in the number of queries. In each case, we present an exact formula for the objective function involving Shannon or R´ nyi entropy, and e develop a greedy algorithm for minimizing it. 1
3 0.14223392 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lov´ sz extension, a common tool in submodua lar analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
4 0.11735377 261 nips-2010-Supervised Clustering
Author: Pranjal Awasthi, Reza B. Zadeh
Abstract: Despite the ubiquity of clustering as a tool in unsupervised learning, there is not yet a consensus on a formal theory, and the vast majority of work in this direction has focused on unsupervised clustering. We study a recently proposed framework for supervised clustering where there is access to a teacher. We give an improved generic algorithm to cluster any concept class in that model. Our algorithm is query-efficient in the sense that it involves only a small amount of interaction with the teacher. We also present and study two natural generalizations of the model. The model assumes that the teacher response to the algorithm is perfect. We eliminate this limitation by proposing a noisy model and give an algorithm for clustering the class of intervals in this noisy model. We also propose a dynamic model where the teacher sees a random subset of the points. Finally, for datasets satisfying a spectrum of weak to strong properties, we give query bounds, and show that a class of clustering functions containing Single-Linkage will find the target clustering under the strongest property.
5 0.10619117 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
Author: Tang Jie, Pieter Abbeel
Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1
6 0.1038917 27 nips-2010-Agnostic Active Learning Without Constraints
7 0.10093101 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
8 0.098462716 152 nips-2010-Learning from Logged Implicit Exploration Data
9 0.098403633 4 nips-2010-A Computational Decision Theory for Interactive Assistants
10 0.09755259 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
11 0.094517961 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
12 0.093434945 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
13 0.086736232 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
14 0.083705239 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
15 0.083236255 212 nips-2010-Predictive State Temporal Difference Learning
16 0.081718855 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
17 0.081451893 235 nips-2010-Self-Paced Learning for Latent Variable Models
18 0.081288263 208 nips-2010-Policy gradients in linearly-solvable MDPs
19 0.081216969 134 nips-2010-LSTD with Random Projections
20 0.080219753 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
simIndex simValue paperId paperTitle
same-paper 1 0.94368851 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
2 0.73921698 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
3 0.41098112 219 nips-2010-Random Conic Pursuit for Semidefinite Programming
Author: Ariel Kleiner, Ali Rahimi, Michael I. Jordan
Abstract: We present a novel algorithm, Random Conic Pursuit, that solves semidefinite programs (SDPs) via repeated optimization over randomly selected two-dimensional subcones of the PSD cone. This scheme is simple, easily implemented, applicable to very general SDPs, scalable, and theoretically interesting. Its advantages are realized at the expense of an ability to readily compute highly exact solutions, though useful approximate solutions are easily obtained. This property renders Random Conic Pursuit of particular interest for machine learning applications, in which the relevant SDPs are generally based upon random data and so exact minima are often not a priority. Indeed, we present empirical results to this effect for various SDPs encountered in machine learning; these experiments demonstrate the potential practical usefulness of Random Conic Pursuit. We also provide a preliminary analysis that yields insight into the theoretical properties and convergence of the algorithm. 1
4 0.36771718 27 nips-2010-Agnostic Active Learning Without Constraints
Author: Alina Beygelzimer, John Langford, Zhang Tong, Daniel J. Hsu
Abstract: We present and analyze an agnostic active learning algorithm that works without keeping a version space. This is unlike all previous approaches where a restricted set of candidate hypotheses is maintained throughout learning, and only hypotheses from this set are ever returned. By avoiding this version space approach, our algorithm sheds the computational burden and brittleness associated with maintaining version spaces, yet still allows for substantial improvements over supervised learning for classification.
5 0.36535114 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
Author: Wei Wang, Zhi-hua Zhou
Abstract: The sample complexity of active learning under the realizability assumption has been well studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under the multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be O(log(1/ǫ)), contrasting with the single-view setting, where a polynomial improvement is the best possible achievement. We also prove that in the general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is O(1/ǫ), where the order of 1/ǫ is independent of the parameter in Tsybakov noise, contrasting with previous polynomial bounds, where the order of 1/ǫ is related to the parameter in Tsybakov noise.
6 0.35960376 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
7 0.34066215 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning
8 0.33853298 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
9 0.33234292 19 nips-2010-A rational decision making framework for inhibitory control
10 0.33122644 258 nips-2010-Structured sparsity-inducing norms through submodular functions
11 0.32570726 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
12 0.31936368 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
13 0.30810815 4 nips-2010-A Computational Decision Theory for Interactive Assistants
14 0.30360037 22 nips-2010-Active Estimation of F-Measures
15 0.30176935 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
16 0.30084419 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
17 0.29793659 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
18 0.2973299 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
19 0.2933082 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
20 0.29275119 212 nips-2010-Predictive State Temporal Difference Learning
topicId topicWeight
[(13, 0.043), (17, 0.011), (26, 0.267), (27, 0.07), (30, 0.053), (35, 0.012), (45, 0.201), (50, 0.044), (52, 0.052), (53, 0.024), (60, 0.043), (77, 0.042), (78, 0.027), (90, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.79229742 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
Author: Daniel Golovin, Andreas Krause, Debajyoti Ray
Abstract: We tackle the fundamental problem of Bayesian active learning with noise, where we need to adaptively select from a number of expensive tests in order to identify an unknown hypothesis sampled from a known prior distribution. In the case of noise–free observations, a greedy algorithm called generalized binary search (GBS) is known to perform near–optimally. We show that if the observations are noisy, perhaps surprisingly, GBS can perform very poorly. We develop EC2, a novel, greedy active learning algorithm and prove that it is competitive with the optimal policy, thus obtaining the first competitiveness guarantees for Bayesian active learning with noisy observations. Our bounds rely on a recently discovered diminishing returns property called adaptive submodularity, generalizing the classical notion of submodular set functions to adaptive policies. Our results hold even if the tests have non–uniform cost and their noise is correlated. We also propose EffECXtive, a particularly fast approximation of EC2, and evaluate it on a Bayesian experimental design problem involving human subjects, intended to tease apart competing economic theories of how people make decisions under uncertainty.
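The edge-cutting idea behind EC2 can be sketched for the noise-free case: hypotheses are grouped into equivalence classes (e.g., by the decision they induce), edges connect hypotheses in different classes with weight equal to the product of their priors, and the greedy rule picks the test with the largest expected weight of edges cut. A minimal illustrative sketch (not the authors' code, and omitting noise and test costs):

```python
from itertools import combinations

def ec2_select(prior, classes, tests):
    """One greedy step of (noise-free) EC2: pick the test with the largest
    expected weight of cut edges. Names and setup are illustrative.

    prior:   dict hypothesis -> probability
    classes: dict hypothesis -> equivalence-class label
    tests:   dict name -> function(hypothesis) -> outcome (hashable)
    """
    hyps = list(prior)
    # Edges connect hypotheses in *different* classes; weight = product of priors.
    edges = [(h, g) for h, g in combinations(hyps, 2) if classes[h] != classes[g]]

    def expected_cut(t):
        # Outcome x rules out every h with t(h) != x; an edge is cut as soon
        # as either endpoint is ruled out, i.e. unless both endpoints agree with x.
        total = 0.0
        for x in {t(h) for h in hyps}:
            px = sum(prior[h] for h in hyps if t(h) == x)
            cut = sum(prior[h] * prior[g] for h, g in edges
                      if not (t(h) == x and t(g) == x))
            total += px * cut
        return total

    return max(tests, key=lambda name: expected_cut(tests[name]))
```

On a toy instance with class {a} versus class {b, c}, a test that directly separates the two classes cuts more edge weight in expectation than one that splits within a class, so the greedy rule prefers it.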
2 0.74588257 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
Author: Andrew Goldberg, Ben Recht, Junming Xu, Robert Nowak, Xiaojin Zhu
Abstract: We pose transductive classification as a matrix completion problem. By assuming the underlying matrix has a low rank, our formulation is able to handle three problems simultaneously: i) multi-label learning, where each item has more than one label, ii) transduction, where most of these labels are unspecified, and iii) missing data, where a large number of features are missing. We obtained satisfactory results on several real-world tasks, suggesting that the low rank assumption may not be as restrictive as it seems. Our method allows for different loss functions to apply on the feature and label entries of the matrix. The resulting nuclear norm minimization problem is solved with a modified fixed-point continuation method that is guaranteed to find the global optimum. 1
3 0.72054511 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and low-rank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods.
4 0.67114222 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1-norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
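The Lovász extension mentioned in the abstract has a simple closed form: sort the coordinates in decreasing order and charge each coordinate the marginal value of adding its index to the set of indices with larger coordinates. A minimal sketch for nonnegative w (function and variable names are illustrative):

```python
def lovasz_extension(F, w):
    """Lovász extension of a set function F at a point w with w >= 0.

    With coordinates sorted so that w[j_1] >= w[j_2] >= ...:
        f(w) = sum_k w[j_k] * (F({j_1..j_k}) - F({j_1..j_{k-1}}))
    Assumes F(empty set) = 0. For a nondecreasing submodular F this is the
    convex-envelope construction the abstract describes (illustrative sketch).
    """
    order = sorted(range(len(w)), key=lambda i: -w[i])
    value, prev, S = 0.0, 0.0, set()
    for i in order:
        S.add(i)
        FS = F(S)
        value += w[i] * (FS - prev)   # charge the marginal gain of index i
        prev = FS
    return value
```

For F(S) = |S| this recovers the ℓ1-norm on the positive orthant, and for F(S) = 1{S ≠ ∅} it recovers max_i w_i, matching the norms-from-submodular-functions view.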
5 0.65773511 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.
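The update the abstract refers to can be sketched for a scalar consensus problem: each node mixes its dual variable with its neighbours via a doubly stochastic matrix P, accumulates its local subgradient, and maps back to the primal with a decaying step size. This is an illustrative scalar version under assumed parameter choices, not the paper's general implementation.

```python
import numpy as np

def distributed_dual_averaging(a, P, iters=3000):
    """Sketch of distributed dual averaging for the scalar consensus problem
        minimize f(x) = sum_i (x - a_i)^2,
    whose minimizer is mean(a). Node i holds f_i(x) = (x - a_i)^2, mixes its
    dual variable z_i with neighbours via the doubly stochastic matrix P, and
    takes an unconstrained Euclidean prox step with alpha_t ~ 1/sqrt(t).
    """
    n = len(a)
    z = np.zeros(n)                # dual variables, one per node
    x = np.zeros(n)                # primal iterates, one per node
    x_avg = np.zeros(n)            # running averages of the iterates
    for t in range(1, iters + 1):
        g = 2.0 * (x - a)          # local gradients at the local iterates
        z = P @ z + g              # mix dual variables, then accumulate
        alpha = 1.0 / np.sqrt(t)
        x = -alpha * z             # unconstrained Euclidean prox step
        x_avg += (x - x_avg) / t   # incremental running average
    return x_avg
```

With full (complete-graph) mixing the averaged iterates of every node should settle near the global minimizer mean(a), and the abstract's point is that a poorly connected P (small spectral gap) slows exactly this agreement.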
6 0.65767711 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
7 0.65640599 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
8 0.65622592 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.65437716 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
10 0.65347558 155 nips-2010-Learning the context of a category
11 0.65337217 287 nips-2010-Worst-Case Linear Discriminant Analysis
12 0.65328246 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
13 0.65305066 158 nips-2010-Learning via Gaussian Herding
14 0.65272319 280 nips-2010-Unsupervised Kernel Dimension Reduction
15 0.65206063 148 nips-2010-Learning Networks of Stochastic Differential Equations
16 0.65202892 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
17 0.65194273 265 nips-2010-The LASSO risk: asymptotic results and real world examples
18 0.65162635 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
19 0.65147257 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
20 0.65140301 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models