nips nips2010 nips2010-203 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. [sent-5, score-0.337]
2 We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. [sent-7, score-0.493]
3 The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. [sent-8, score-0.56]
4 Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. [sent-9, score-0.452]
5 Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. [sent-11, score-0.33]
6 1 Introduction In the classical K-armed bandit problem, an agent selects at each time step one of the K arms and receives a reward that depends on the chosen action. [sent-12, score-0.885]
7 The aim of the agent is to choose the sequence of arms to be played so as to maximize the cumulated reward. [sent-13, score-0.542]
8 There is a fundamental trade-off between gathering experimental data about the reward distribution (exploration) and exploiting the arm which seems to be the most promising. [sent-14, score-0.462]
9 In the basic multi-armed bandit problem, also called the independent bandits problem, the rewards are assumed to be random and distributed independently according to a probability distribution that is specific to each arm; see [1, 2, 3, 4] and references therein. [sent-15, score-0.936]
10 Recently, structured bandit problems in which the distributions of the rewards pertaining to each arm are connected by a common unknown parameter have received much attention [5, 6, 7, 8, 9]. [sent-16, score-0.83]
11 This model is motivated by the many practical applications where the number of arms is large, but the payoffs are interrelated. [sent-17, score-0.382]
12 In one model, in each time step, some side information, or context, is given to the agent first. [sent-19, score-0.089]
13 The payoffs of the arms depend both on this side information and the index of the arm. [sent-20, score-0.382]
14 Thus the optimal arm changes with the context [5, 6, 9]. [sent-21, score-0.319]
15 In particular, in “linear bandits” [10, 8, 11, 12], each arm a ∈ A is associated with some d-dimensional vector m_a ∈ R^d known to the agent. [sent-23, score-0.406]
16 The expected payoffs of the arms are given by the inner product of their associated vector and some fixed, but initially unknown parameter vector θ∗ . [sent-24, score-0.419]
17 Thus, the expected payoff of arm a is m_a^⊤θ∗, which is linear in θ∗. [sent-25, score-0.421]
18 In this article, we study a richer generalized linear model (GLM) in which the expectation of the reward conditionally on the action a is given by µ(m_a^⊤θ∗), where µ is a real-valued, non-linear function called the (inverse) link function. [sent-26, score-0.339]
19 This generalization allows us to consider a wider class of problems, and in particular cases where the rewards are counts or binary variables using, respectively, Poisson or logistic regression. [sent-27, score-0.146]
20 Interestingly, the GLM-UCB approach takes advantage of the particular structure of the parameter estimate of generalized linear models and operates only in the reward space. [sent-34, score-0.262]
21 Our second contribution is a tuning method based on asymptotic arguments. [sent-36, score-0.091]
22 The generalized linear bandit model is presented in Section 2, together with a brief survey of needed statistical results. [sent-39, score-0.433]
23 Section 4 presents our regret bounds, as well as a discussion, based on asymptotic arguments, on the optimal tuning of the method. [sent-41, score-0.325]
24 2 Generalized Linear Bandits, Generalized Linear Models We consider a structured bandit model with a finite, but possibly very large, number of arms. [sent-43, score-0.337]
25 At each time t, the agent chooses an arm At from the set A (we shall denote the cardinality of A by K). [sent-44, score-0.408]
26 The prior knowledge available to the agent consists of a collection of vectors {m_a}_{a∈A} of features which are specific to each arm and a so-called (inverse) link function µ : R → R. [sent-45, score-0.498]
27 The generalized linear bandit model investigated in this work is based on the assumption that the payoff R_t received at time t is conditionally independent of the past payoffs and choices and satisfies E[R_t | A_t] = µ(m_{A_t}^⊤θ∗) (1) for some unknown parameter vector θ∗ ∈ R^d. [sent-46, score-0.575]
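As a minimal illustration of this reward model, the sketch below simulates rewards from (1) with a logistic link; the feature matrix, parameter, and problem sizes are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 5, 20                      # dimension and number of arms (illustrative)
M = rng.normal(size=(K, d))       # one feature vector m_a per arm
theta_star = rng.normal(size=d)   # unknown parameter theta*

def mu(x):
    """Logistic (inverse) link: suitable when rewards are binary."""
    return 1.0 / (1.0 + np.exp(-x))

def pull(arm):
    """Sample a Bernoulli reward with mean mu(m_a^T theta*), as in (1)."""
    p = mu(M[arm] @ theta_star)
    return rng.binomial(1, p)

# Expected payoff of every arm and the optimal arm
expected_rewards = mu(M @ theta_star)
best_arm = int(np.argmax(expected_rewards))
```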
28 This framework generalizes the linear bandit model considered by [10, 8, 12]. [sent-47, score-0.397]
29 Just like the linear bandit model builds on linear regression, our model capitalizes on the well-known statistical framework of Generalized Linear Models (GLMs). [sent-48, score-0.407]
30 The advantage of this framework is that it allows us to address various specific reward structures widely found in applications. [sent-49, score-0.143]
31 For example, when rewards are binary-valued, a suitable choice of µ is µ(x) = exp(x)/(1 + exp(x)), leading to the logistic regression model. [sent-50, score-0.182]
32 This can be easily extended to the case of multinomial (or polytomic) logistic regression, which is appropriate to model situations in which the rewards are associated with categorical variables. [sent-52, score-0.172]
33 Now, assume that, in addition to the response variable R, we have at hand a vector of covariates X ∈ Rd . [sent-60, score-0.139]
34 As we will see, this matrix also plays a crucial role in the algorithm that we propose for bandit optimization in the generalized linear bandit model. [sent-73, score-0.801]
35 3 The GLM-UCB Algorithm According to (1), the agent receives, upon playing arm a, a random reward whose expected value is µ(m_a^⊤θ∗), where θ∗ ∈ Θ is the unknown parameter. [sent-74, score-0.676]
36 Any arm with largest expected reward is called optimal. [sent-76, score-0.499]
37 The aim of the agent is to quickly find an optimal arm in order to maximize the received rewards. [sent-77, score-0.454]
38 As described by [8, 12] in the linear case, an optimistic algorithm consists in selecting, at time t, the arm Â_t = argmax_{a∈A} max_θ E_θ[R_t | A_t = a] subject to ‖θ − θ̂_t‖_{M_t} ≤ ρ(t). [sent-80, score-0.407]
39 Generalizing this approach beyond the case of linear link functions looks challenging. [sent-84, score-0.095]
40 An alternative approach consists in directly determining an upper confidence bound for the expected reward of each arm, thus choosing the action a that maximizes E_{θ̂_t}[R_t | A_t = a] + ρ(t)‖m_a‖_{M_t^{-1}}. [sent-90, score-0.333]
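A sketch of this upper-confidence index is given below; it assumes the current estimate θ̂_t, the design matrix M_t, and the arm features are already available, and all variable names are illustrative.

```python
import numpy as np

def glm_ucb_index(M, theta_hat, Mt, rho_t, mu):
    """Index E_{theta_hat}[R | a] + rho(t) * ||m_a||_{Mt^{-1}} for every arm.

    M        : (K, d) array of arm feature vectors m_a
    theta_hat: (d,) current (projected) quasi-likelihood estimate
    Mt       : (d, d) design matrix sum_k m_{A_k} m_{A_k}^T
    rho_t    : exploration factor rho(t)
    mu       : inverse link function
    """
    Mt_inv = np.linalg.inv(Mt)
    means = mu(M @ theta_hat)                                 # estimated expected rewards
    widths = np.sqrt(np.einsum("kd,de,ke->k", M, Mt_inv, M))  # ||m_a||_{Mt^{-1}} for each a
    return means + rho_t * widths

# The arm played at time t maximizes this index:
# A_t = int(np.argmax(glm_ucb_index(M, theta_hat, Mt, rho_t, mu)))
```

In practice M_t^{-1} would be maintained incrementally through the rank-one update discussed later rather than recomputed at every step.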
41 In the rest of this section, we apply this second approach to the GLM bandit model defined in (1). [sent-93, score-0.337]
42 Let g_t(θ) = Σ_{k=1}^{t-1} µ(m_{A_k}^⊤θ) m_{A_k} be the invertible function such that the estimated parameter θ̂_t satisfies g_t(θ̂_t) = Σ_{k=1}^{t-1} R_k m_{A_k}. [sent-101, score-0.12]
43 Since θ̂_t might be outside of the set of admissible parameters Θ, we “project it” to Θ to obtain θ̃_t: θ̃_t = argmin_{θ∈Θ} ‖g_t(θ) − g_t(θ̂_t)‖_{M_t^{-1}} = argmin_{θ∈Θ} ‖g_t(θ) − Σ_{k=1}^{t-1} R_k m_{A_k}‖_{M_t^{-1}}. [sent-102, score-0.18]
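The estimating equation g_t(θ̂_t) = Σ_k R_k m_{A_k} can be solved numerically; the sketch below uses a few Newton steps under a logistic link and omits the projection step onto Θ (all names are illustrative assumptions, not the paper's implementation).

```python
import numpy as np

def mu(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_theta(M_hist, R_hist, d, n_iter=25):
    """Solve sum_k mu(m_{A_k}^T theta) m_{A_k} = sum_k R_k m_{A_k} by Newton steps.

    M_hist: (t-1, d) features of the arms played so far
    R_hist: (t-1,) observed rewards
    """
    theta = np.zeros(d)
    target = R_hist @ M_hist                    # right-hand side sum_k R_k m_{A_k}
    for _ in range(n_iter):
        z = M_hist @ theta
        g = mu(z) @ M_hist                      # g_t(theta)
        # Jacobian of g_t: sum_k mu'(m_k^T theta) m_k m_k^T, with mu' = mu(1 - mu) for logistic
        W = mu(z) * (1.0 - mu(z))
        J = (M_hist * W[:, None]).T @ M_hist + 1e-8 * np.eye(d)
        theta = theta + np.linalg.solve(J, target - g)
    return theta
```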
44 The quantity ρ(t) is a slowly increasing function; we prove in Section 4 that ρ(t) can be set to guarantee high-probability bounds on the expected regret (for the actual form used, see (8)). [sent-113, score-0.34]
45 As we are mostly interested in the case when the number of arms K is much larger than the dimension d, the algorithm is simply initialized by playing the actions a_1, …, [sent-115, score-0.451]
46 a_d in the first d steps; this ensures that M_t is invertible for all t. [sent-126, score-0.132]
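A possible sketch of this initialization phase is shown below; it assumes the available arm features contain d linearly independent vectors and that a `pull` callable returns a reward (both are illustrative assumptions, not the paper's exact procedure).

```python
import numpy as np

def initialize(M, d, pull):
    """Play d arms whose features span R^d so that M_t is invertible afterwards."""
    # Greedily pick d arms with linearly independent feature vectors.
    idx = []
    for a in range(M.shape[0]):
        if np.linalg.matrix_rank(M[idx + [a]]) > len(idx):
            idx.append(a)
        if len(idx) == d:
            break
    Mt, history = np.zeros((d, d)), []
    for a in idx:
        r = pull(a)                        # observe a reward for this arm
        Mt += np.outer(M[a], M[a])
        history.append((a, r))
    assert np.linalg.det(Mt) != 0.0        # M_t is now invertible
    return Mt, history
```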
47 3.1 Discussion The purpose of this section is to discuss some properties of Algorithm 1, and in particular the interpretation of the role played by ‖m_a‖_{M_t^{-1}}. [sent-129, score-0.169]
48 Generalizing UCB The standard UCB algorithm for K arms [2] can be seen as a special case of GLM-UCB where the vectors of covariates associated with the arms form an orthogonal system and µ(x) = x. [sent-130, score-0.778]
49 Taking A = {1, …, K}, define the vectors {m_a}_{a∈A} as the canonical basis {e_a}_{a∈A} of R^d, and take θ ∈ R^d to be the vector whose component θ_a is the expected reward for arm a. [sent-134, score-0.566]
50 Then, M_t is a diagonal matrix whose a-th diagonal element is the number N_t(a) of times the a-th arm has been played up to time t. [sent-135, score-0.401]
51 Therefore, the exploration bonus in GLM-UCB is given by β_t^a = ρ(t)/√(N_t(a)). [sent-136, score-0.133]
52 Moreover, the maximum quasi-likelihood estimator θ̂_t satisfies θ̂_t(a) = R̄_t^a for all a ∈ A, where R̄_t^a = (1/N_t(a)) Σ_{k=1}^{t-1} I{A_k = a} R_k is the empirical mean of the rewards received while playing arm a. [sent-137, score-0.604]
53 In this case, it is known that the expected cumulated regret can be controlled upon setting the slowly varying function ρ to ρ(t) = √(2 log(t)), assuming that the range of the rewards is bounded by one [2]. [sent-139, score-0.451]
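A small numerical check of this reduction, with orthonormal arm features and purely illustrative play counts:

```python
import numpy as np

K = 4
M = np.eye(K)                               # m_a = e_a, so d = K
counts = np.array([5, 2, 9, 1])             # N_t(a): times each arm was played

# With canonical-basis features, M_t is diagonal with entries N_t(a)
Mt = sum(counts[a] * np.outer(M[a], M[a]) for a in range(K))
assert np.allclose(Mt, np.diag(counts))

rho_t = np.sqrt(2 * np.log(counts.sum()))   # the UCB-style choice discussed above
bonus_glm = rho_t * np.sqrt(np.diag(np.linalg.inv(Mt)))  # rho(t) * ||e_a||_{Mt^{-1}}
bonus_ucb = rho_t / np.sqrt(counts)
assert np.allclose(bonus_glm, bonus_ucb)    # the two exploration bonuses coincide
```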
54 Generalizing linear bandits Obviously, setting µ(x) = x, we obtain a linear bandit model. [sent-140, score-0.592]
55 In particular, the maximum quasi-likelihood estimator becomes the least-squares estimator and, as noted earlier, the algorithm behaves identically to one which chooses the parameter optimistically within the confidence ellipsoid {θ : ‖θ − θ̂_t‖_{M_t} ≤ ρ(t)}. [sent-142, score-0.143]
56 Dependence in the Number of Arms In contrast to an algorithm such as UCB, Algorithm 1 does not require that all arms be played even once. [sent-143, score-0.398]
57 To understand this phenomenon, observe that, as M_{t+1} = M_t + m_{A_t} m_{A_t}^⊤, one has ‖m_a‖²_{M_{t+1}^{-1}} = ‖m_a‖²_{M_t^{-1}} − (m_a^⊤ M_t^{-1} m_{A_t})² / (1 + ‖m_{A_t}‖²_{M_t^{-1}}) for any arm a. [sent-144, score-0.493]
58 Thus the exploration bonus β_{t+1}^a decreases for all arms, except those which are exactly orthogonal to m_{A_t} (in the M_t^{-1} metric). [sent-145, score-0.133]
59 The decrease is most significant for arms that are collinear to m_{A_t}. [sent-146, score-0.316]
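This rank-one (Sherman-Morrison) update can be verified numerically; the quantities below are randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
Mt = A @ A.T + np.eye(d)          # a positive-definite design matrix M_t
m_at = rng.normal(size=d)         # feature of the arm played at time t, m_{A_t}
m_a = rng.normal(size=d)          # feature of an arbitrary arm a

Mt_next = Mt + np.outer(m_at, m_at)
lhs = m_a @ np.linalg.inv(Mt_next) @ m_a
Mt_inv = np.linalg.inv(Mt)
rhs = m_a @ Mt_inv @ m_a - (m_a @ Mt_inv @ m_at) ** 2 / (1.0 + m_at @ Mt_inv @ m_at)
assert np.isclose(lhs, rhs)       # the squared exploration bonus can only decrease
```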
60 This explains why the regret bounds obtained in Theorems 1 and 2 below depend on d but not on K. [sent-147, score-0.273]
61 4 Theoretical analysis In this section we first give our finite sample regret bounds and then show how the algorithm can be tuned based on asymptotic arguments. [sent-148, score-0.365]
62 The link function µ : R → R is continuously differentiable, Lipschitz with constant k_µ and such that c_µ = inf_{θ∈Θ, a∈A} µ̇(m_a^⊤θ) > 0. [sent-152, score-0.091]
63 The norm of the covariates in {m_a : a ∈ A} is bounded: there exists c_m < ∞ such that for all a ∈ A, ‖m_a‖_2 ≤ c_m. [sent-155, score-0.253]
64 As for the standard UCB algorithm, the regret can be analyzed in terms of the difference between the expected reward received when playing an optimal arm a∗ and that of the best sub-optimal arm: ∆(θ∗) = min_{a : µ(m_a^⊤θ∗) < µ(m_{a∗}^⊤θ∗)} ( µ(m_{a∗}^⊤θ∗) − µ(m_a^⊤θ∗) ). [sent-165, score-0.867]
65 Theorem 1 establishes a high-probability bound on the regret of GLM-UCB when using ρ(t) = (2 k_µ κ R_max / c_µ) √(2 d log(t) log(2 d T / δ)) (8). (Footnote 4: Of course, the linear bandit algorithms also share this property with our algorithm.) [sent-166, score-0.632]
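As a sketch, the exploration factor of (8) as reconstructed above can be computed as follows; the constants k_µ, c_µ, κ, R_max and δ are problem-dependent and the values shown are purely illustrative.

```python
import numpy as np

def rho(t, d, T, k_mu, c_mu, kappa, R_max, delta):
    """Exploration factor of equation (8), as reconstructed above."""
    return (2.0 * k_mu * kappa * R_max / c_mu) * np.sqrt(
        2.0 * d * np.log(t) * np.log(2.0 * d * T / delta)
    )

# Illustrative values only
print(rho(t=100, d=5, T=1000, k_mu=0.25, c_mu=0.05, kappa=1.0, R_max=1.0, delta=0.05))
```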
66 Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies P( Regret_T ≤ (d + 1) R_max + (C d² / ∆(θ∗)) log²[s T] log(2 d T / δ) ) ≥ 1 − δ, with C = 32 κ² R_max² k_µ² / c_µ². [sent-170, score-0.234]
67 Note that the above regret bound depends on the true value of θ∗ through ∆(θ∗). [sent-171, score-0.26]
68 The following theorem provides an upper bound on the regret that holds independently of θ∗. [sent-172, score-0.234]
69 Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies P( Regret_T ≤ (d + 1) R_max + C d log[s T] √(T log(2 d T / δ)) ) ≥ 1 − δ, with C = 8 R_max k_µ κ. [sent-175, score-0.234]
70 A look into the proof of the regret bound readily explains this observation: the technical nature of the arguments makes some approximations unavoidable, in particular several applications of the Cauchy-Schwarz inequality, which lead to pessimistic confidence bounds. [sent-183, score-0.342]
71 We provide here some asymptotic arguments that suggest choosing significantly smaller exploration bonuses, which will in turn be validated by the numerical experiments presented in Section 5. [sent-184, score-0.18]
72 Consider the canonical GLM associated with an inverse link function µ and assume that the vectors of covariates X are drawn independently under a fixed distribution. [sent-185, score-0.243]
73 This random design model would for instance describe the situation when the arms are drawn randomly from a fixed distribution. [sent-186, score-0.316]
74 That is, inflating the exploration bonus by a factor of k_µ/c_µ compared to the usual UCB setting is sufficient. [sent-196, score-0.133]
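For intuition, the sketch below computes the inflation factor k_µ/c_µ for the logistic link when the linear predictor m_a^⊤θ is assumed to lie in a bounded range (the range [-3, 3] is an illustrative assumption).

```python
import numpy as np

def mu_dot(x):
    """Derivative of the logistic link mu(x) = 1 / (1 + exp(-x))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Suppose m_a^T theta is known to lie in [-3, 3] for all admissible (a, theta).
grid = np.linspace(-3.0, 3.0, 1001)
k_mu = mu_dot(grid).max()          # Lipschitz constant of mu (attained at 0: 1/4)
c_mu = mu_dot(grid).min()          # attained at the boundary of the range
print("inflation factor k_mu / c_mu =", k_mu / c_mu)
```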
75 5 Experiments To the best of our knowledge, there is currently no public benchmark available to test bandit methods on real world data. [sent-198, score-0.337]
76 On simulated data, the proposed method unsurprisingly outperforms its competitors when the data is indeed simulated from a well-specified generalized linear model. [sent-199, score-0.127]
77 The values of the response variable for the data points assigned to each cluster are viewed as the outcomes of an arm while the centroid of the cluster is taken as the 11-dimensional vector of covariates characteristic of the arm. [sent-204, score-0.552]
78 To cast the problem into the logistic regression framework, each response variable is binarized by associating the first class (“Spruce/Fir”) with a response R = 1 and the other six classes with R = 0. [sent-205, score-0.133]
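A sketch of this preprocessing is given below, using scikit-learn's KMeans as a stand-in; the exact clustering procedure, number of clusters and column layout used in the paper's experiment are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_arms(X, y, n_arms):
    """Cluster the quantitative covariates and turn each cluster into an arm.

    X: (n, d) quantitative covariates, y: (n,) cover-type labels (1 = "Spruce/Fir")
    Returns per-arm centroids (feature vectors) and per-arm binarized outcome pools.
    """
    labels = KMeans(n_clusters=n_arms, n_init=10, random_state=0).fit_predict(X)
    centroids = np.vstack([X[labels == a].mean(axis=0) for a in range(n_arms)])
    rewards = [(y[labels == a] == 1).astype(int) for a in range(n_arms)]
    return centroids, rewards
```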
79 The proportion of responses equal to 1 in each cluster (in other words, the expected reward associated with each arm) ranges from 0. [sent-206, score-0.214]
80 Obviously, the logistic regression model is not satisfied, although we do expect some regularity with respect to the position of the cluster’s centroid as the logistic regression trained on all data reaches a 0. [sent-212, score-0.2]
81 Figure 1: Top: Regret of the UCB, GLM-UCB and the ε-greedy algorithms. [sent-214, score-0.319]
82 Bottom: Frequencies with which the 20 best arms were drawn by the UCB and GLM-UCB algorithms. [sent-215, score-0.316]
83 Third, an ε-greedy algorithm that performs logistic regression and plays the best estimated action, Â_t = argmax_a µ(m_a^⊤θ̂_t), with probability 1 − ε (with ε = 0. [sent-220, score-0.167]
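A sketch of this ε-greedy comparator, assuming the current estimate θ̂_t and the arm features are available (names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(M, theta_hat, mu, eps, rng):
    """Play the estimated best arm with probability 1 - eps, else a uniform random arm."""
    if rng.random() < eps:
        return int(rng.integers(M.shape[0]))
    return int(np.argmax(mu(M @ theta_hat)))
```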
84 We observe in the top graph of Figure 1 that the GLM-UCB algorithm achieves the smallest average regret by a large margin. [sent-222, score-0.234]
85 When the parameter is well estimated, the greedy algorithm may find the best arm quickly and then incurs only a small regret. [sent-223, score-0.346]
86 The lower plot of Figure 1 shows the number of times each of the 20 best arms has been played by the UCB and GLM-UCB algorithms. [sent-225, score-0.398]
87 The arms are sorted in decreasing order of expected reward. [sent-226, score-0.353]
88 This behavior is made possible by the predictive power of the covariates: by sharing information between arms, it is possible to obtain sufficiently accurate predictions of the expected rewards of all actions, even for those that have never (or rarely) been played. [sent-228, score-0.132]
89 The dataset also contains a record of the users' clicks on the ads that were presented on these pages. [sent-233, score-0.117]
90 The pages (ads) were partitioned into 10 (respectively, 8) categories using Latent Dirichlet Allocation [15] applied to their respective textual content (in the case of ads, the textual content was that of the page pointed to by the ad's link). [sent-236, score-0.113]
91 The action space is composed of the 80 pairs of page and ad categories: when a pair is chosen, it is presented to a group of 50 users, randomly selected from the database, and the reward is the number of recorded clicks. [sent-238, score-0.27]
92 The vector of covariates for each pair is of dimension 19: it is composed of an intercept followed by the concatenation of two vectors of dimension 10 and 8 representing, respectively, the categories of the pages and the ads. [sent-241, score-0.215]
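A sketch of this covariate construction is shown below; encoding the categories as one-hot indicators is an assumption consistent with the description, not necessarily the paper's exact representation.

```python
import numpy as np

N_PAGE_CATS, N_AD_CATS = 10, 8

def pair_covariate(page_cat, ad_cat):
    """Intercept followed by page- and ad-category indicators (dimension 19)."""
    x = np.zeros(1 + N_PAGE_CATS + N_AD_CATS)
    x[0] = 1.0                                  # intercept
    x[1 + page_cat] = 1.0                       # page category in {0, ..., 9}
    x[1 + N_PAGE_CATS + ad_cat] = 1.0           # ad category in {0, ..., 7}
    return x

# The 80 actions are all (page category, ad category) pairs:
actions = [(p, a) for p in range(N_PAGE_CATS) for a in range(N_AD_CATS)]
M = np.vstack([pair_covariate(p, a) for p, a in actions])   # shape (80, 19)
```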
93 Given the rather limited predictive power of the covariates in this example, this is an encouraging illustration of the potential of techniques which use vectors of covariates in real-life applications. [sent-246, score-0.262]
94 Figure 2: Comparison of the regret of the UCB, GLM-UCB and the ε-greedy (ε = 0. [sent-247, score-0.234]
95 6 Conclusions We have introduced an approach that generalizes the linear regression model studied by [10, 8, 12]. [sent-249, score-0.096]
96 As in the original UCB algorithm, the proposed GLM-UCB method operates directly in the reward space. [sent-250, score-0.166]
97 An interesting open problem (already challenging in the linear case) consists in tightening the theoretical results obtained so far in order to bridge the gap between the existing (pessimistic) confidence bounds and those suggested by the asymptotic arguments presented in Section 4. [sent-253, score-0.19]
98 The epoch-greedy algorithm for multi-armed bandits with side information. [sent-291, score-0.185]
99 Forced-exploration based algorithms for playing in stochastic linear bandits. [sent-323, score-0.123]
100 Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. [sent-341, score-0.096]
wordName wordTfidf (topN-words)
[('ucb', 0.423), ('bandit', 0.337), ('arm', 0.319), ('arms', 0.316), ('mt', 0.258), ('regret', 0.234), ('bandits', 0.185), ('reward', 0.143), ('glm', 0.136), ('covariates', 0.116), ('rt', 0.099), ('rewards', 0.095), ('mat', 0.094), ('mak', 0.092), ('agent', 0.089), ('playing', 0.088), ('ma', 0.087), ('ads', 0.087), ('dence', 0.086), ('played', 0.082), ('regrett', 0.08), ('rmax', 0.077), ('bonus', 0.069), ('rk', 0.068), ('payoffs', 0.066), ('exploration', 0.064), ('asymptotic', 0.062), ('generalized', 0.061), ('xk', 0.061), ('telecom', 0.06), ('paristech', 0.06), ('ltci', 0.06), ('gt', 0.06), ('link', 0.06), ('estimator', 0.056), ('cumulated', 0.055), ('glms', 0.055), ('arguments', 0.054), ('poisson', 0.053), ('optimistic', 0.053), ('cnrs', 0.052), ('logistic', 0.051), ('rd', 0.05), ('argmaxa', 0.049), ('received', 0.046), ('garivier', 0.046), ('rusmevichientong', 0.046), ('textual', 0.045), ('generalizing', 0.045), ('ad', 0.043), ('action', 0.04), ('mai', 0.04), ('filippi', 0.04), ('paris', 0.04), ('bounds', 0.039), ('nt', 0.039), ('szepesv', 0.039), ('expected', 0.037), ('canonical', 0.037), ('regression', 0.036), ('linear', 0.035), ('advertisement', 0.035), ('cluster', 0.034), ('france', 0.033), ('pertaining', 0.033), ('competitors', 0.031), ('ellipsoid', 0.031), ('continuously', 0.031), ('plays', 0.031), ('con', 0.031), ('users', 0.03), ('slowly', 0.03), ('payoff', 0.03), ('vectors', 0.03), ('tuned', 0.03), ('strictly', 0.029), ('tuning', 0.029), ('carried', 0.028), ('pessimistic', 0.028), ('obviously', 0.028), ('greedy', 0.027), ('centroid', 0.026), ('forest', 0.026), ('bound', 0.026), ('categorical', 0.026), ('internet', 0.025), ('reference', 0.025), ('generalizes', 0.025), ('cm', 0.025), ('covariate', 0.025), ('actions', 0.024), ('article', 0.024), ('xx', 0.024), ('poor', 0.024), ('operates', 0.023), ('response', 0.023), ('moderate', 0.023), ('dimension', 0.023), ('categories', 0.023), ('receive', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
2 0.27088594 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
3 0.18869936 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
4 0.16603726 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
5 0.16354913 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
6 0.14728637 192 nips-2010-Online Classification with Specificity Constraints
7 0.13533247 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.12164579 229 nips-2010-Reward Design via Online Gradient Ascent
9 0.10497639 108 nips-2010-Graph-Valued Regression
10 0.10061206 4 nips-2010-A Computational Decision Theory for Interactive Assistants
11 0.10054265 138 nips-2010-Large Margin Multi-Task Metric Learning
12 0.080158696 182 nips-2010-New Adaptive Algorithms for Online Classification
13 0.075676225 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
14 0.073710479 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
15 0.072605304 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
16 0.070774801 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
17 0.070593826 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
18 0.067819513 282 nips-2010-Variable margin losses for classifier design
19 0.066571563 66 nips-2010-Double Q-learning
20 0.064737596 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
topicId topicWeight
[(0, 0.191), (1, -0.15), (2, 0.051), (3, 0.018), (4, 0.012), (5, 0.08), (6, -0.101), (7, -0.046), (8, -0.071), (9, 0.267), (10, 0.001), (11, 0.056), (12, -0.011), (13, -0.053), (14, 0.045), (15, 0.017), (16, -0.105), (17, 0.005), (18, 0.063), (19, 0.014), (20, -0.026), (21, -0.101), (22, -0.073), (23, 0.086), (24, -0.012), (25, 0.052), (26, 0.024), (27, -0.062), (28, 0.099), (29, 0.005), (30, 0.013), (31, -0.071), (32, -0.114), (33, -0.015), (34, -0.064), (35, -0.016), (36, 0.087), (37, -0.176), (38, -0.022), (39, 0.041), (40, -0.022), (41, 0.043), (42, -0.211), (43, 0.079), (44, 0.073), (45, 0.077), (46, 0.095), (47, -0.047), (48, -0.024), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.92484671 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
2 0.75179398 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
3 0.6999523 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
4 0.55800283 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
5 0.55105335 4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution. 1
6 0.54318976 229 nips-2010-Reward Design via Online Gradient Ascent
7 0.53037888 192 nips-2010-Online Classification with Specificity Constraints
8 0.50016165 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.43543378 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
10 0.42916796 66 nips-2010-Double Q-learning
11 0.41330072 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
12 0.35823759 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
13 0.34875652 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
14 0.3305479 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
15 0.33039922 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
16 0.32814345 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
17 0.32774973 108 nips-2010-Graph-Valued Regression
18 0.32218921 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
19 0.32121569 152 nips-2010-Learning from Logged Implicit Exploration Data
20 0.31606025 202 nips-2010-Parallelized Stochastic Gradient Descent
topicId topicWeight
[(13, 0.064), (17, 0.01), (27, 0.062), (30, 0.067), (35, 0.012), (39, 0.016), (45, 0.211), (50, 0.047), (52, 0.034), (60, 0.046), (71, 0.252), (77, 0.054), (78, 0.02), (90, 0.037)]
simIndex simValue paperId paperTitle
1 0.90155452 247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies
Author: Paul Mckeigue, Jon Krohn, Amos J. Storkey, Felix V. Agakov
Abstract: This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as “instruments” to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with ∼ 105 genetic instruments and ∼ 47 × 103 gene transcripts. 1
same-paper 2 0.8182292 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
3 0.7438252 217 nips-2010-Probabilistic Multi-Task Feature Selection
Author: Yu Zhang, Dit-Yan Yeung, Qian Xu
Abstract: Recently, some variants of the đ?‘™1 norm, particularly matrix norms such as the đ?‘™1,2 and đ?‘™1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the đ?‘™1,2 and đ?‘™1,∞ norms by considering a family of đ?‘™1,đ?‘ž norms for 1 < đ?‘ž ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the đ?‘™1,đ?‘ž norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including đ?‘ž, automatically. Experiments have been conducted on two cancer classiďŹ cation applications using microarray gene expression data. 1
4 0.70343542 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
5 0.70218217 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
Author: Seunghak Lee, Jun Zhu, Eric P. Xing
Abstract: To understand the relationship between genomic variations among population and complex diseases, it is essential to detect eQTLs which are associated with phenotypic effects. However, detecting eQTLs remains a challenge due to complex underlying mechanisms and the very large number of genetic loci involved compared to the number of samples. Thus, to address the problem, it is desirable to take advantage of the structure of the data and prior information about genomic locations such as conservation scores and transcription factor binding sites. In this paper, we propose a novel regularized regression approach for detecting eQTLs which takes into account related traits simultaneously while incorporating many regulatory features. We first present a Bayesian network for a multi-task learning problem that includes priors on SNPs, making it possible to estimate the significance of each covariate adaptively. Then we find the maximum a posteriori (MAP) estimation of regression coefficients and estimate weights of covariates jointly. This optimization procedure is efficient since it can be achieved by using a projected gradient descent and a coordinate descent procedure iteratively. Experimental results on simulated and real yeast datasets confirm that our model outperforms previous methods for finding eQTLs.
6 0.70143944 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
7 0.70142114 117 nips-2010-Identifying graph-structured activation patterns in networks
8 0.70125991 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.7006579 182 nips-2010-New Adaptive Algorithms for Online Classification
10 0.7003333 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.70011067 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
12 0.69952542 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
13 0.69903529 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
14 0.69864726 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.69862324 148 nips-2010-Learning Networks of Stochastic Differential Equations
16 0.69778466 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
17 0.69721055 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
18 0.69663769 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
19 0.69659734 155 nips-2010-Learning the context of a category
20 0.69647521 280 nips-2010-Unsupervised Kernel Dimension Reduction