nips nips2011 nips2011-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
Reference: text
sentIndex sentText sentNum sentScore
1 These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). [sent-7, score-0.695]
2 The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. [sent-8, score-0.465]
3 We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. [sent-9, score-0.233]
4 We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. [sent-10, score-0.294]
5 We further provide generic tools for deriving regret bounds when using such composite kernel functions. [sent-11, score-0.545]
6 Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. [sent-12, score-0.335]
7 All these problems can be phrased as a contextual bandit problem (cf. [sent-17, score-0.436]
8 [1, 2]; we review related work in Section 7), where in each round, we receive context (about the experimental conditions, the query, or the task), and have to choose an action (system parameters, document to retrieve). [sent-19, score-0.259]
9 The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. [sent-21, score-0.498]
10 Without making any assumptions about the class of payoff functions under consideration, we cannot expect to do well. [sent-22, score-0.298]
11 A natural approach is to choose a regularizer, encoding assumptions about smoothness of the payoff function. [sent-23, score-0.302]
12 In this paper, we take a nonparametric approach, and model the payoff function as a sample from a Gaussian process defined over the joint context-action space (or having low norm in the associated RKHS). [sent-24, score-0.233]
13 This approach allows us to estimate the predictive uncertainty in the payoff function estimated from previous experiments, guiding the tradeoff between exploration and exploitation. [sent-25, score-0.292]
14 By constructing a composite kernel function for the regularizer from kernels defined over the action and context spaces (e.g., [sent-28, score-0.521]
15 a linear kernel on the actions, and a Gaussian kernel on the contexts), we can capture several natural contextual bandit problem formulations. [sent-30, score-0.728]
16 We prove that CGP-UCB incurs sublinear contextual regret (i.e., [sent-31, score-0.639]
17 we prove that it competes with the optimal mapping from context to actions) for a large class of composite kernel functions constructed in this manner. [sent-33, score-0.378]
18 Lastly, we evaluate our algorithm on two real-world case studies in the context of automated vaccine design, and management of sensor networks. [sent-34, score-0.308]
19 We show that in both these problems, properly taking into account contextual information outperforms ignoring or naively using context. [sent-35, score-0.34]
20 In each round, we receive a context zt ∈ Z from a (not necessarily finite) set Z of contexts, and have to choose an action st ∈ S from a (not necessarily finite) set S of actions. [sent-38, score-0.38]
21 We then receive a payoff yt = f(st, zt) + εt, where f : S × Z → R is an (unknown) function, and εt is zero-mean random noise (independent across the rounds). [sent-39, score-0.4]
22 The addition of (externally chosen) contextual information captures a critical component in many applications, and generalizes the k-armed bandit setting. [sent-40, score-0.436]
23 Since f is unknown, we will not generally be able to choose the optimal action, and thus incur regret rt = sup_{s'∈S} f(s', zt) − f(st, zt). [sent-41, score-0.491]
24 After T rounds, our cumulative regret is RT = Σ_{t=1}^{T} rt. [sent-42, score-0.329]
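For concreteness, the following minimal sketch evaluates the cumulative contextual regret on a finite candidate action set; the function and argument names are ours (not the paper's), and f is assumed to be directly evaluable purely for illustration.

```python
# Sketch: cumulative contextual regret R_T = sum_t [ sup_s f(s, z_t) - f(s_t, z_t) ],
# with the supremum replaced by a max over a finite candidate action set.
def cumulative_regret(f, actions, contexts, chosen):
    """f(s, z) -> payoff; contexts: observed z_t; chosen: actions s_t picked by the algorithm."""
    return sum(max(f(s, z) for s in actions) - f(s_t, z)
               for z, s_t in zip(contexts, chosen))
```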
25 The context-specific best action is a more demanding benchmark than the best action used in the (context-free) definition of regret. [sent-43, score-0.21]
26 Our goal will be to develop an algorithm which achieves sublinear contextual regret, i. [sent-44, score-0.376]
27 Note that achieving sublinear contextual regret requires learning (and competing with) the optimal mapping from contexts to actions. [sent-47, score-0.87]
28 Regularity assumptions are required, since without any there could be a single action s∗ ∈ S that obtains payoff of 1, and all other actions obtain payoff 0. [sent-48, score-0.73]
29 Since the random variables are action-context pairs, often there is a natural decomposition of the covariance function k into the corresponding covariance functions on actions and contexts (Section 5). [sent-57, score-0.449]
30 The choice of the kernel function turns out to be crucial in regularizing the function class to achieve sublinear regret (Section 4). [sent-72, score-0.49]
31 [Section 3: The Contextual Upper Confidence Bound Algorithm] In the context-free case Z = ∅, the problem of trading off exploration and exploitation with payoff functions sampled from a Gaussian process is studied by [3]. [sent-74, score-0.363]
32 At round t, GP-UCB picks action st = xt such that st = argmax_{s∈S} [µt−1(s) + βt^{1/2} σt−1(s)] (1), where βt are appropriate constants. [sent-76, score-0.218]
33 Thus, when presented with context zt, this algorithm uses posterior inference to predict mean and variance for each possible decision s, conditioned on all past observations (involving the chosen actions, the observed contexts, and the noisy payoffs). [sent-90, score-0.425]
34 We call the greedy algorithm implementing rule 2 the contextual Gaussian process UCB algorithm (CGP-UCB). [sent-91, score-0.295]
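As a rough illustration of this selection rule, the sketch below conditions a GP on past (action, context, payoff) triples and picks the UCB-maximizing action for the current context. It assumes scikit-learn, a finite candidate action set, and a single RBF kernel over the concatenated action-context vector; it is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def cgp_ucb_choose(history, actions, z_t, beta_t, noise=0.1):
    """history: list of (s, z, y) triples; actions: list of 1-D candidate action vectors."""
    # Hyperparameters are held fixed (optimizer=None); a single RBF over the joint
    # (action, context) vector stands in for the composite kernels of Section 5.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=noise**2, optimizer=None)
    if history:
        X = np.array([np.concatenate([s, z]) for s, z, _ in history])
        y = np.array([y for _, _, y in history])
        gp.fit(X, y)
    # Posterior mean and standard deviation for each candidate action paired with z_t.
    X_query = np.array([np.concatenate([s, z_t]) for s in actions])
    mu, sigma = gp.predict(X_query, return_std=True)
    return actions[int(np.argmax(mu + np.sqrt(beta_t) * sigma))]
```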
35 As we will show in Section 5, this algorithm allows us to incorporate various assumptions about the dependencies of the payoff function on the chosen actions and observed contexts. [sent-92, score-0.392]
36 In the following, we will prove that in many practical applications, CGP-UCB attains sublinear contextual regret (i. [sent-94, score-0.639]
37 , is able to compete with the optimal mapping from contexts to actions). [sent-96, score-0.231]
38 [Section 4: Bounds on the Contextual Regret] Bounding the contextual regret of CGP-UCB is a challenging problem, since the regret is measured with respect to the best action for each context. [sent-97, score-0.926]
39 Intuitively, the amount of regret we incur should depend on how quickly we can gather information about the payoff function, which now jointly depends on context and actions. [sent-98, score-0.609]
40 In the following, we show that the contextual regret of CGP-UCB is bounded by an intuitive information-theoretic quantity, which quantifies the mutual information between the observed context-action pairs and the estimated payoff function f . [sent-99, score-0.791]
41 Using this notion of information gain γT , we lift the results of [3] to the much more general contextual bandit setting, shedding further light on the connection between bandit optimization and information gain. [sent-105, score-0.612]
42 In Section 5, we show how to bound γT for composite kernels, combining possibly different assumptions about the regularity of f in the action space S and context space Z. [sent-106, score-0.397]
43 Then the contextual regret of CGP-UCB is bounded by O*(√(T γT βT)) w.h.p. [sent-127, score-0.558]
44 Theorem 1 (proof given in the supplemental material) shows that, in cases (1) and (2), with high probability over samples from the GP, the cumulative contextual regret is bounded in terms of the maximum information gain with respect to the GP defined over S × Z. [sent-132, score-0.593]
45 In case of assumption (3), a regret bound is obtained in a more agnostic setting, where no prior on f is assumed, and much weaker assumptions are made about the noise process. [sent-133, score-0.303]
46 A natural approach is to start with kernel functions kZ : Z × Z → R and kS : S × S → R on the space of contexts and actions, and use them to derive the kernel on the product space. [sent-137, score-0.548]
47 The intuition behind this product kernel is a conjunction of the notions of similarities induced by the kernels over context and action spaces: Two context-action pairs are similar (large correlation) if the contexts are similar and actions are similar (Figure 1(a)). [sent-140, score-0.777]
48 For example, if kZ and kS are squared exponential kernels (or Matérn kernels with smoothness parameter ν), then the product k = kZ ⊗ kS is a squared exponential kernel (or a Matérn kernel with smoothness parameter ν). [sent-142, score-0.437]
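A small sketch of these composite constructions (function names are ours; action and context vectors are assumed to be 1-D numpy arrays), with a squared exponential kernel on actions and a linear kernel on contexts:

```python
import numpy as np

def k_se(x, y, length=1.0):
    """Squared exponential kernel."""
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * length ** 2)))

def k_lin(x, y):
    """Linear kernel."""
    return float(np.dot(x, y))

def k_product(s1, z1, s2, z2):
    """(kS ⊗ kZ)((s,z),(s',z')) = kS(s,s') * kZ(z,z')."""
    return k_se(s1, s2) * k_lin(z1, z2)

def k_additive(s1, z1, s2, z2):
    """(kS ⊕ kZ)((s,z),(s',z')) = kS(s,s') + kZ(z,z')."""
    return k_se(s1, s2) + k_lin(z1, z2)
```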
49 Figure 1: Illustrations of composite kernel functions that can be incorporated into CGP-UCB. [sent-151, score-0.265]
50 (a) Product of squared exponential kernel and linear kernel; (b) additive combination of a payoff function that smoothly depends on context, and exhibits clusters of actions. [sent-152, score-0.43]
51 In general, context and action spaces are higher dimensional. [sent-153, score-0.218]
52 Since the key quantity governing the regret is the information gain γT , we would like to find a convenient way of bounding γT for composite kernels (kS ⊗ kZ and kS ⊕ kZ ), plugging in different regularity assumptions for the contexts (via kZ ) and actions (via kS ). [sent-167, score-0.935]
53 More formally, let us define γ(T; k; V) = max_{A⊆V, |A|≤T} (1/2) log |I + σ^{−2} [k(v, v′)]_{v,v′∈A}|, which quantifies the maximum possible information gain achievable by sampling T points in a GP defined over set V with kernel function k. [sent-168, score-0.232]
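A sketch of this quantity for a fixed candidate set A with kernel (Gram) matrix K_A; the maximization over subsets of size at most T is omitted, and the inputs are hypothetical.

```python
import numpy as np

def information_gain(K_A, sigma):
    """0.5 * log det(I + sigma^{-2} K_A); slogdet is used for numerical stability."""
    n = K_A.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_A / sigma**2)
    return 0.5 * logdet
```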
54 For example, γT for the product of a d1-dimensional linear kernel and a d2-dimensional Gaussian kernel is O(d1 (log T)^{d2+1}). [sent-182, score-0.292]
55 [9] model the expected payoff for each action as an (unknown) linear function µ(s, z) = z^T θ*_s. [sent-195, score-0.338]
56 Hereby, θ*_s models the dependence of action s on the context z. [sent-196, score-0.218]
57 Besides online advertising, a similar model has been proposed and experimentally studied by [6] for the problem of contextual news recommendation (see Section 7 for a discussion). [sent-197, score-0.318]
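A sketch of this per-action linear payoff model (the special case recovered by a linear kernel on contexts and a diagonal kernel on actions); the weight vectors are given here rather than learned, and the names are ours.

```python
import numpy as np

def expected_payoff(theta, s, z):
    """mu(s, z) = z^T theta*_s; theta maps each action id to its weight vector."""
    return float(np.dot(z, theta[s]))
```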
58 Figure 2: (a) Average regret; (b) Maximum regret; (c) Context similarity. CGP-UCB applied to the average (a) and maximum regret over all molecules (b) for three methods on MHC benchmark. [sent-208, score-0.941]
59 In this application, additive kernel combinations may be useful to model temporal dependencies of the overall click probabilities. [sent-211, score-0.219]
60 In addition to controller parameters s ∈ S ⊆ R^{dS}, the system may be exposed to environmental conditions that change in an uncontrollable manner, which are provided as context z ∈ Z ⊆ R^{dZ}. [sent-217, score-0.214]
61 In this case, we may consider using a linear kernel kZ(z, z′) = z^T z′ to model the dependence of the performance on environmental features, and a squared exponential kernel kS(s, s′) to model the smooth but nonlinear response of the system to the chosen control parameters. [sent-219, score-0.353]
62 Additive kernel combinations may allow us to model the fact that control in some contexts (environments) is inherently more difficult (or noisy). [sent-221, score-0.398]
63 In this case, we may consider using a finite inter-task covariance kernel KZ with rank mZ to model the similarity of different experiments, and a Gaussian kernel kS(s, s′) to model the smooth but nonlinear dependency of the stimulus response on the experimental parameters. [sent-230, score-0.379]
64 We can cast this problem in the contextual bandit setting, where time of day is considered as the context z ∈ Z, and each action s ∈ S corresponds to picking a sensor. [sent-235, score-0.717]
65 We compare three methods: Ignoring (correlation between) contexts by running a separate instance of GP-UCB for every context (i.e., [sent-243, score-0.344]
66 [Figure legend: GP-UCB ignore context; GP-UCB merge context; Temperature (C)] [sent-249, score-0.287]
67 ignoring the molecule or time information); and running CGP-UCB, conditioning on measurements made at different contexts (MHC molecules considered / times of day) using the product kernel. [sent-258, score-0.462]
68 [Section 6.1: Multi-task Bayesian Optimization of MHC class-I binding affinity] We perform experiments in the multi-task vaccine design problem introduced in Section 5. [sent-260, score-0.229]
69 In our experiments, we focus on a subset of MHC class I molecules that have affinity binding scores available. [sent-262, score-0.212]
70 In total, we consider identifying peptides for seven different MHC molecules. [sent-270, score-0.213]
71 The context similarity was obtained using the Hamming distance between amino acids in the binding pocket [11] (see Figure 2(c)), and we used the Gaussian kernel on the extracted features. [sent-273, score-0.369]
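A sketch of one way such a context kernel could be formed; the length scale and the exact featurization are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two equal-length amino-acid strings."""
    return sum(x != y for x, y in zip(a, b))

def context_kernel(pocket_a, pocket_b, length=2.0):
    """Gaussian kernel on the Hamming distance between binding-pocket strings."""
    d = hamming(pocket_a, pocket_b)
    return float(np.exp(-d**2 / (2.0 * length**2)))
```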
72 From Figure 2(a) we see that for the first three molecules (up to trial 150), which are strongly correlated, merging contexts and CGP-UCB perform similarly, and both perform better than ignoring observations from other MHC molecules previously considered. [sent-276, score-0.568]
73 However, the fourth molecule (A 0201) has little correlation with the earlier ones, and hence simply merging contexts performs poorly. [sent-277, score-0.328]
74 Merging contexts (single instance of context-free GP-UCB) performs best for the first few timesteps (since temperature is very similar, and the highest temperature sensor does not change). [sent-290, score-0.509]
75 Once contexts reoccur, it outperforms the other methods, since it is able to learn to query the maximum temperature sensors as a function of the time of day. [sent-293, score-0.294]
76 The approach for the classical k-armed bandit setting [17] has been generalized to more complex settings, such as infinite action sets and linear payoff functions [14, 18], Lipschitz continuous payoff functions [15] and locally-Lipschitz functions [19]. [sent-295, score-0.787]
77 However, there is a strong tradeoff between strength of the assumptions and achievable regret bounds. [sent-296, score-0.328]
78 For example, while O(d√(T log T)) can be achieved in the linear setting [14], if only Lipschitz continuity is assumed, regret bounds scale as Ω(T^{(d+1)/(d+2)}) [15]. [sent-297, score-0.331]
79 Srinivas et al. [3] analyze the case where the payoff function is sampled from a GP, which encodes configurable assumptions. [sent-298, score-0.233]
80 The ability to incorporate contextual information, however, significantly expands the class of applications of GP-UCB. [sent-301, score-0.295]
81 Besides handling context and bounding the stronger notion of contextual regret, in this paper we provide generic techniques for obtaining regret bounds for composite kernels. [sent-302, score-0.829]
82 An alternative rule (in the context free setting) is the Expected Improvement algorithm [20], for which no bounds on the cumulative regret are known. [sent-303, score-0.418]
83 For contextual bandit problems, work has focused on the case of finitely many actions, where the goal is to obtain sublinear contextual regret against classes of functions mapping context to actions [1]. [sent-304, score-1.332]
84 This setting resembles (multi-class) classification problems, and regret bounds can be given in terms of the VC dimension of the hypothesis space [2]. [sent-305, score-0.305]
85 [6] present an approach, LinUCB, that assumes that payoffs for each action are linear combinations (with unknown coefficients) of context features. [sent-306, score-0.287]
86 In [5], it is proven that a modified variant of LinUCB achieves sublinear contextual regret. [sent-307, score-0.376]
87 Theirs is a special case of our setting (assuming a linear kernel for the contexts and diagonal kernel for the actions). [sent-308, score-0.523]
88 Another related approach is taken by Slivkins [21], who presents several algorithms with sublinear contextual regret for the case of infinite actions and contexts, assuming Lipschitz continuity of the payoff function in the context-action space. [sent-309, score-0.991]
89 However, in contrast to CGP-UCB, this approach does not enable stronger guarantees for smoother or more structured payoff functions. [sent-311, score-0.233]
90 The construction of composite kernels is common in the context of multitask learning with GPs [23, 24, 25]. [sent-312, score-0.27]
91 Theorems 2 and 3 provide convenient ways for deriving regret bounds for such problems. [sent-315, score-0.305]
92 [Section 8: Conclusions] We have described an algorithm, CGP-UCB, which addresses the exploration–exploitation tradeoff in a large class of contextual bandit problems, where the regularity of the payoff function defined over the action–context space is expressed in terms of a GP prior. [sent-319, score-0.714]
93 As we discuss in Section 5, by considering various kernel functions on actions and contexts, this approach allows us to handle a variety of applications. [sent-320, score-0.521]
94 We show that, as in the context-free case studied by [3], the key quantity governing the regret is the mutual information between experiments performed by CGP-UCB and the GP prior (Theorem 1). [sent-321, score-0.399]
95 In contrast to prior work, however, our approach bounds the much stronger notion of contextual regret (competing with the optimal mapping from contexts to actions). [sent-322, score-0.831]
96 We prove that in many practical settings, as discussed in Section 5, the contextual regret is sublinear. [sent-323, score-0.558]
97 We also demonstrate the effectiveness of CGP-UCB on two applications: computational vaccine design and sensor network management. [sent-325, score-0.2]
98 In both applications, we show that utilizing context information in the joint covariance function reduces regret in comparison to ignoring or naively using the context. [sent-326, score-0.458]
99 Gaussian process optimization in the bandit setting: No regret and experimental design. [sent-340, score-0.404]
100 Regret bounds for Gaussian process bandit problems. [sent-448, score-0.220]
wordName wordTfidf (topN-words)
[('kz', 0.412), ('ks', 0.342), ('contextual', 0.295), ('regret', 0.263), ('payoff', 0.233), ('contexts', 0.231), ('gp', 0.224), ('mhc', 0.217), ('kernel', 0.146), ('bandit', 0.141), ('molecules', 0.127), ('ucb', 0.126), ('actions', 0.119), ('vaccine', 0.117), ('context', 0.113), ('temperature', 0.111), ('action', 0.105), ('composite', 0.094), ('binding', 0.085), ('zt', 0.081), ('sublinear', 0.081), ('sensors', 0.07), ('mat', 0.068), ('cgp', 0.067), ('rt', 0.066), ('mz', 0.063), ('kernels', 0.063), ('kt', 0.062), ('molecule', 0.059), ('exploration', 0.059), ('sensor', 0.056), ('peptide', 0.05), ('peptides', 0.05), ('payoffs', 0.048), ('exploitation', 0.046), ('yt', 0.045), ('regularity', 0.045), ('ignoring', 0.045), ('brochu', 0.044), ('aleksandrs', 0.044), ('slivkins', 0.043), ('bounds', 0.042), ('trade', 0.041), ('receive', 0.041), ('fz', 0.041), ('af', 0.04), ('day', 0.04), ('st', 0.04), ('assumptions', 0.04), ('environmental', 0.039), ('merging', 0.038), ('nity', 0.038), ('gaussian', 0.037), ('covariance', 0.037), ('seven', 0.036), ('bandits', 0.035), ('gain', 0.035), ('contextaction', 0.033), ('widmer', 0.033), ('fs', 0.033), ('controller', 0.033), ('merge', 0.033), ('round', 0.033), ('ads', 0.031), ('theorems', 0.03), ('gps', 0.03), ('uncontrollable', 0.029), ('smoothness', 0.029), ('additive', 0.029), ('rkhs', 0.028), ('ignore', 0.028), ('dence', 0.028), ('nite', 0.027), ('gathering', 0.027), ('linucb', 0.027), ('hereby', 0.027), ('animation', 0.027), ('multioutput', 0.027), ('quanti', 0.027), ('design', 0.027), ('log', 0.026), ('functions', 0.025), ('achievable', 0.025), ('similarity', 0.025), ('rank', 0.025), ('lihong', 0.024), ('activate', 0.024), ('picking', 0.023), ('srinivas', 0.023), ('click', 0.023), ('governing', 0.023), ('hayes', 0.023), ('zurich', 0.023), ('news', 0.023), ('squared', 0.022), ('bounding', 0.022), ('deployed', 0.022), ('dani', 0.022), ('automated', 0.022), ('combinations', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
2 0.25858468 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
3 0.24193618 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
4 0.20362526 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_ρt(S; A)), where I_ρt(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_ρt(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_ρt(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
5 0.18262039 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T^{2/3} where T is the horizon time. 1
6 0.17848459 26 nips-2011-Additive Gaussian Processes
7 0.1682979 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
8 0.16420226 32 nips-2011-An Empirical Evaluation of Thompson Sampling
9 0.14302708 100 nips-2011-Gaussian Process Training with Input Noise
10 0.11919926 80 nips-2011-Efficient Online Learning via Randomized Rounding
11 0.1152861 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
12 0.10967711 177 nips-2011-Multi-armed bandits on implicit metric spaces
13 0.10913555 272 nips-2011-Stochastic convex optimization with bandit feedback
14 0.10857867 56 nips-2011-Committing Bandits
15 0.10717384 145 nips-2011-Learning Eigenvectors for Free
16 0.10700877 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
17 0.10626417 30 nips-2011-Algorithms for Hyper-Parameter Optimization
18 0.10211843 25 nips-2011-Adaptive Hedge
19 0.1002882 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
20 0.098082684 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
topicId topicWeight
[(0, 0.21), (1, -0.269), (2, 0.037), (3, 0.041), (4, 0.18), (5, -0.051), (6, 0.075), (7, 0.075), (8, 0.176), (9, 0.124), (10, -0.167), (11, 0.058), (12, -0.012), (13, 0.063), (14, -0.022), (15, 0.128), (16, 0.007), (17, 0.058), (18, -0.134), (19, -0.092), (20, -0.028), (21, 0.033), (22, -0.096), (23, -0.079), (24, 0.044), (25, 0.024), (26, 0.015), (27, -0.005), (28, -0.029), (29, -0.07), (30, -0.014), (31, -0.013), (32, 0.02), (33, -0.017), (34, 0.009), (35, -0.008), (36, -0.032), (37, -0.111), (38, -0.076), (39, 0.012), (40, 0.033), (41, 0.034), (42, 0.042), (43, 0.011), (44, -0.056), (45, 0.076), (46, -0.035), (47, -0.047), (48, 0.086), (49, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.95889628 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
2 0.69073677 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
3 0.67952579 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of the oldest heuristics to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
4 0.66052932 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f(x) at any query point x ∈ X. We demonstrate a generalization of the ellipsoid algorithm that incurs O(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T. 1
5 0.6328544 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing neither which of the models is the correct one, nor what are the probabilistic characteristics of the resulting MDP, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves that, with a regret of order T^{2/3} where T is the horizon time. 1
6 0.60563004 26 nips-2011-Additive Gaussian Processes
7 0.5989626 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
8 0.5715245 220 nips-2011-Prediction strategies without loss
9 0.56561327 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
10 0.54652607 30 nips-2011-Algorithms for Hyper-Parameter Optimization
11 0.51577419 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
12 0.50437009 25 nips-2011-Adaptive Hedge
13 0.49634412 100 nips-2011-Gaussian Process Training with Input Noise
14 0.4870742 56 nips-2011-Committing Bandits
15 0.44695216 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
16 0.43513381 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
17 0.42382371 80 nips-2011-Efficient Online Learning via Randomized Rounding
18 0.4210659 139 nips-2011-Kernel Bayes' Rule
19 0.4043211 177 nips-2011-Multi-armed bandits on implicit metric spaces
20 0.39515904 145 nips-2011-Learning Eigenvectors for Free
topicId topicWeight
[(0, 0.028), (4, 0.042), (14, 0.22), (20, 0.034), (26, 0.019), (31, 0.09), (33, 0.022), (43, 0.099), (45, 0.121), (47, 0.012), (57, 0.044), (74, 0.048), (79, 0.031), (83, 0.039), (99, 0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.8185268 61 nips-2011-Contextual Gaussian Process Bandit Optimization
Author: Andreas Krause, Cheng S. Ong
Abstract: How should we design experiments to maximize performance of a complex system, taking into account uncontrollable environmental conditions? How should we select relevant documents (ads) to display, given information about the user? These tasks can be formalized as contextual bandit problems, where at each round, we receive context (about the experimental conditions, the query), and have to choose an action (parameters, documents). The key challenge is to trade off exploration by gathering data for estimating the mean payoff function over the context-action space, and to exploit by choosing an action deemed optimal based on the gathered data. We model the payoff function as a sample from a Gaussian process defined over the joint context-action space, and develop CGP-UCB, an intuitive upper-confidence style algorithm. We show that by mixing and matching kernels for contexts and actions, CGP-UCB can handle a variety of practical applications. We further provide generic tools for deriving regret bounds when using such composite kernel functions. Lastly, we evaluate our algorithm on two case studies, in the context of automated vaccine design and sensor management. We show that context-sensitive optimization outperforms no or naive use of context. 1
2 0.74763644 202 nips-2011-On the Universality of Online Mirror Descent
Author: Nati Srebro, Karthik Sridharan, Ambuj Tewari
Abstract: We show that for a general class of convex online learning problems, Mirror Descent can always achieve a (nearly) optimal regret guarantee. 1
3 0.73998094 168 nips-2011-Maximum Margin Multi-Instance Learning
Author: Hua Wang, Heng Huang, Farhad Kamangar, Feiping Nie, Chris H. Ding
Abstract: Multi-instance learning (MIL) considers input as bags of instances, in which labels are assigned to the bags. MIL is useful in many real-world applications. For example, in image categorization semantic meanings (labels) of an image mostly arise from its regions (instances) instead of the entire image (bag). Existing MIL methods typically build their models using the Bag-to-Bag (B2B) distance, which are often computationally expensive and may not truly reflect the semantic similarities. To tackle this, in this paper we approach MIL problems from a new perspective using the Class-to-Bag (C2B) distance, which directly assesses the relationships between the classes and the bags. Taking into account the two major challenges in MIL, high heterogeneity on data and weak label association, we propose a novel Maximum Margin Multi-Instance Learning (M3 I) approach to parameterize the C2B distance by introducing the class specific distance metrics and the locally adaptive significance coefficients. We apply our new approach to the automatic image categorization tasks on three (one single-label and two multilabel) benchmark data sets. Extensive experiments have demonstrated promising results that validate the proposed method.
4 0.69706309 283 nips-2011-The Fixed Points of Off-Policy TD
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Different (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1
5 0.67993206 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
6 0.67511111 24 nips-2011-Active learning of neural response functions with Gaussian processes
7 0.66984785 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
8 0.66946852 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.66815531 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
10 0.66717672 186 nips-2011-Noise Thresholds for Spectral Clustering
11 0.66614455 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
12 0.66573775 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
13 0.66510642 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
14 0.66486466 276 nips-2011-Structured sparse coding via lateral inhibition
15 0.66440445 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
16 0.66438425 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
17 0.66346723 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
18 0.66346538 180 nips-2011-Multiple Instance Filtering
19 0.66305059 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
20 0.6627143 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)