nips nips2010 nips2010-203 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. [sent-5, score-0.337]
2 We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. [sent-7, score-0.493]
3 The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. [sent-8, score-0.56]
4 Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. [sent-9, score-0.452]
5 Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. [sent-11, score-0.33]
6 1 Introduction In the classical K-armed bandit problem, an agent selects at each time step one of the K arms and receives a reward that depends on the chosen action. [sent-12, score-0.885]
7 The aim of the agent is to choose the sequence of arms to be played so as to maximize the cumulated reward. [sent-13, score-0.542]
8 There is a fundamental trade-off between gathering experimental data about the reward distribution (exploration) and exploiting the arm which seems to be the most promising. [sent-14, score-0.462]
9 In the basic multi-armed bandit problem, also called the independent bandits problem, the rewards are assumed to be random and distributed independently according to a probability distribution that is specific to each arm; see [1, 2, 3, 4] and references therein. [sent-15, score-0.936]
10 Recently, structured bandit problems in which the distributions of the rewards pertaining to each arm are connected by a common unknown parameter have received much attention [5, 6, 7, 8, 9]. [sent-16, score-0.83]
11 This model is motivated by the many practical applications where the number of arms is large, but the payoffs are interrelated. [sent-17, score-0.382]
12 In one model, in each time step, some side information, or context, is given to the agent first. [sent-19, score-0.089]
13 The payoffs of the arms depend both on this side information and the index of the arm. [sent-20, score-0.382]
14 Thus the optimal arm changes with the context [5, 6, 9]. [sent-21, score-0.319]
15 In particular, in “linear bandits” [10, 8, 11, 12], each arm a ∈ A is associated with some d-dimensional vector m_a ∈ R^d known to the agent. [sent-23, score-0.406]
16 The expected payoffs of the arms are given by the inner product of their associated vector and some fixed, but initially unknown parameter vector θ∗ . [sent-24, score-0.419]
17 Thus, the expected payoff of arm a is m_a^⊤θ∗, which is linear in θ∗. [sent-25, score-0.421]
18 In this article, we study a richer generalized linear model (GLM) in which the expectation of the reward conditionally on the action a is given by µ(m_a^⊤θ∗), where µ is a real-valued, non-linear function called the (inverse) link function. [sent-26, score-0.339]
19 This generalization allows us to consider a wider class of problems, and in particular cases where the rewards are counts or binary variables using, respectively, Poisson or logistic regression. [sent-27, score-0.146]
20 Interestingly, the GLM-UCB approach takes advantage of the particular structure of the parameter estimate of generalized linear models and operates only in the reward space. [sent-34, score-0.262]
21 Our second contribution is a tuning method based on asymptotic arguments. [sent-36, score-0.091]
22 The generalized linear bandit model is presented in Section 2, together with a brief survey of needed statistical results. [sent-39, score-0.433]
23 Section 4 presents our regret bounds, as well as a discussion, based on asymptotic arguments, on the optimal tuning of the method. [sent-41, score-0.325]
24 2 Generalized Linear Bandits, Generalized Linear Models We consider a structured bandit model with a finite, but possibly very large, number of arms. [sent-43, score-0.337]
25 At each time t, the agent chooses an arm At from the set A (we shall denote the cardinality of A by K). [sent-44, score-0.408]
26 The prior knowledge available to the agent consists of a collection of vectors {m_a}_{a∈A} of features which are specific to each arm and a so-called (inverse) link function µ : R → R. [sent-45, score-0.498]
27 The generalized linear bandit model investigated in this work is based on the assumption that the payoff R_t received at time t is conditionally independent of the past payoffs and choices and satisfies E[R_t | A_t] = µ(m_{A_t}^⊤θ∗) (1) for some unknown parameter vector θ∗ ∈ R^d. [sent-46, score-0.575]
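As a minimal illustration of this reward model, the sketch below simulates rewards from (1) with a logistic link; the feature matrix, parameter, and problem sizes are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 5, 20                      # dimension and number of arms (illustrative)
M = rng.normal(size=(K, d))       # one feature vector m_a per arm
theta_star = rng.normal(size=d)   # unknown parameter theta*

def mu(x):
    """Logistic (inverse) link: suitable when rewards are binary."""
    return 1.0 / (1.0 + np.exp(-x))

def pull(arm):
    """Sample a Bernoulli reward with mean mu(m_a^T theta*), as in (1)."""
    p = mu(M[arm] @ theta_star)
    return rng.binomial(1, p)

# Expected payoff of every arm and the optimal arm
expected_rewards = mu(M @ theta_star)
best_arm = int(np.argmax(expected_rewards))
```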
28 This framework generalizes the linear bandit model considered by [10, 8, 12]. [sent-47, score-0.397]
29 Just like the linear bandit model builds on linear regression, our model capitalizes on the well-known statistical framework of Generalized Linear Models (GLMs). [sent-48, score-0.407]
30 The advantage of this framework is that it allows us to address various specific reward structures widely found in applications. [sent-49, score-0.143]
31 For example, when rewards are binary-valued, a suitable choice of µ is µ(x) = exp(x)/(1 + exp(x)), leading to the logistic regression model. [sent-50, score-0.182]
32 This can be easily extended to the case of multinomial (or polytomic) logistic regression, which is appropriate to model situations in which the rewards are associated with categorical variables. [sent-52, score-0.172]
33 Now, assume that, in addition to the response variable R, we have at hand a vector of covariates X ∈ Rd . [sent-60, score-0.139]
34 As we will see, this matrix also plays a crucial role in the algorithm that we propose for bandit optimization in the generalized linear bandit model. [sent-73, score-0.801]
35 3 The GLM-UCB Algorithm According to (1), the agent receives, upon playing arm a, a random reward whose expected value is µ(m_a^⊤θ∗), where θ∗ ∈ Θ is the unknown parameter. [sent-74, score-0.676]
36 Any arm with largest expected reward is called optimal. [sent-76, score-0.499]
37 The aim of the agent is to quickly find an optimal arm in order to maximize the received rewards. [sent-77, score-0.454]
38 As described by [8, 12] in the linear case, an optimistic algorithm consists in selecting, at time t, the arm Â_t = argmax_{a∈A} max_θ E_θ[R_t | A_t = a] subject to ‖θ − θ̂_t‖_{M_t} ≤ ρ(t). [sent-80, score-0.407]
39 Generalizing this approach beyond the case of linear link functions looks challenging. [sent-84, score-0.095]
40 An alternative approach consists in directly determining an upper confidence bound for the expected reward of each arm, thus choosing the action a that maximizes E_{θ̂_t}[R_t | A_t = a] + ρ(t)‖m_a‖_{M_t^{-1}}. [sent-90, score-0.333]
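A sketch of this upper-confidence index is given below; it assumes the current estimate θ̂_t, the design matrix M_t, and the arm features are already available, and all variable names are illustrative.

```python
import numpy as np

def glm_ucb_index(M, theta_hat, Mt, rho_t, mu):
    """Index E_{theta_hat}[R | a] + rho(t) * ||m_a||_{Mt^{-1}} for every arm.

    M        : (K, d) array of arm feature vectors m_a
    theta_hat: (d,) current (projected) quasi-likelihood estimate
    Mt       : (d, d) design matrix sum_k m_{A_k} m_{A_k}^T
    rho_t    : exploration factor rho(t)
    mu       : inverse link function
    """
    Mt_inv = np.linalg.inv(Mt)
    means = mu(M @ theta_hat)                                 # estimated expected rewards
    widths = np.sqrt(np.einsum("kd,de,ke->k", M, Mt_inv, M))  # ||m_a||_{Mt^{-1}} for each a
    return means + rho_t * widths

# The arm played at time t maximizes this index:
# A_t = int(np.argmax(glm_ucb_index(M, theta_hat, Mt, rho_t, mu)))
```

In practice M_t^{-1} would be maintained incrementally through the rank-one update discussed later rather than recomputed at every step.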
41 In the rest of this section, we apply this second approach to the GLM bandit model defined in (1). [sent-93, score-0.337]
42 Let g_t(θ) = Σ_{k=1}^{t-1} µ(m_{A_k}^⊤θ) m_{A_k} be the invertible function such that the estimated parameter θ̂_t satisfies g_t(θ̂_t) = Σ_{k=1}^{t-1} R_k m_{A_k}. [sent-101, score-0.12]
43 Since θ̂_t might be outside of the set of admissible parameters Θ, we “project it” to Θ to obtain θ̃_t: θ̃_t = argmin_{θ∈Θ} ‖g_t(θ) − g_t(θ̂_t)‖_{M_t^{-1}} = argmin_{θ∈Θ} ‖g_t(θ) − Σ_{k=1}^{t-1} R_k m_{A_k}‖_{M_t^{-1}}. [sent-102, score-0.18]
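The estimating equation g_t(θ̂_t) = Σ_k R_k m_{A_k} can be solved numerically; the sketch below uses a few Newton steps under a logistic link and omits the projection step onto Θ (all names are illustrative assumptions, not the paper's implementation).

```python
import numpy as np

def mu(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_theta(M_hist, R_hist, d, n_iter=25):
    """Solve sum_k mu(m_{A_k}^T theta) m_{A_k} = sum_k R_k m_{A_k} by Newton steps.

    M_hist: (t-1, d) features of the arms played so far
    R_hist: (t-1,) observed rewards
    """
    theta = np.zeros(d)
    target = R_hist @ M_hist                    # right-hand side sum_k R_k m_{A_k}
    for _ in range(n_iter):
        z = M_hist @ theta
        g = mu(z) @ M_hist                      # g_t(theta)
        # Jacobian of g_t: sum_k mu'(m_k^T theta) m_k m_k^T, with mu' = mu(1 - mu) for logistic
        W = mu(z) * (1.0 - mu(z))
        J = (M_hist * W[:, None]).T @ M_hist + 1e-8 * np.eye(d)
        theta = theta + np.linalg.solve(J, target - g)
    return theta
```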
44 The quantity ρ(t) is a slowly increasing function; we prove in Section 4 that ρ(t) can be set to guarantee high-probability bounds on the expected regret (for the actual form used, see (8)). [sent-113, score-0.34]
45 As we are mostly interested in the case when the number of arms K is much larger than the dimension d, the algorithm is simply initialized by playing the actions a_1, …, [sent-115, score-0.451]
46 a_d in the first d steps; this ensures that M_t is invertible for all t. [sent-126, score-0.132]
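A possible sketch of this initialization phase is shown below; it assumes the available arm features contain d linearly independent vectors and that a `pull` callable returns a reward (both are illustrative assumptions, not the paper's exact procedure).

```python
import numpy as np

def initialize(M, d, pull):
    """Play d arms whose features span R^d so that M_t is invertible afterwards."""
    # Greedily pick d arms with linearly independent feature vectors.
    idx = []
    for a in range(M.shape[0]):
        if np.linalg.matrix_rank(M[idx + [a]]) > len(idx):
            idx.append(a)
        if len(idx) == d:
            break
    Mt, history = np.zeros((d, d)), []
    for a in idx:
        r = pull(a)                        # observe a reward for this arm
        Mt += np.outer(M[a], M[a])
        history.append((a, r))
    assert np.linalg.det(Mt) != 0.0        # M_t is now invertible
    return Mt, history
```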
47 3.1 Discussion The purpose of this section is to discuss some properties of Algorithm 1, and in particular the interpretation of the role played by ‖m_a‖_{M_t^{-1}}. [sent-129, score-0.169]
48 Generalizing UCB The standard UCB algorithm for K arms [2] can be seen as a special case of GLM-UCB where the vectors of covariates associated with the arms form an orthogonal system and µ(x) = x. [sent-130, score-0.778]
49 Taking A = {1, …, K}, define the vectors {m_a}_{a∈A} as the canonical basis {e_a}_{a∈A} of R^d, and take θ ∈ R^d to be the vector whose component θ_a is the expected reward for arm a. [sent-134, score-0.566]
50 Then, M_t is a diagonal matrix whose a-th diagonal element is the number N_t(a) of times the a-th arm has been played up to time t. [sent-135, score-0.401]
51 Therefore, the exploration bonus in GLM-UCB is given by β_t^a = ρ(t)/√(N_t(a)). [sent-136, score-0.133]
52 Moreover, the maximum quasi-likelihood estimator θ̂_t satisfies θ̂_t(a) = R̄_t^a for all a ∈ A, where R̄_t^a = (1/N_t(a)) Σ_{k=1}^{t-1} I{A_k = a} R_k is the empirical mean of the rewards received while playing arm a. [sent-137, score-0.604]
53 In this case, it is known that the expected cumulated regret can be controlled upon setting the slowly varying function ρ to ρ(t) = √(2 log(t)), assuming that the range of the rewards is bounded by one [2]. [sent-139, score-0.451]
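A small numerical check of this reduction, with orthonormal arm features and purely illustrative play counts:

```python
import numpy as np

K = 4
M = np.eye(K)                               # m_a = e_a, so d = K
counts = np.array([5, 2, 9, 1])             # N_t(a): times each arm was played

# With canonical-basis features, M_t is diagonal with entries N_t(a)
Mt = sum(counts[a] * np.outer(M[a], M[a]) for a in range(K))
assert np.allclose(Mt, np.diag(counts))

rho_t = np.sqrt(2 * np.log(counts.sum()))   # the UCB-style choice discussed above
bonus_glm = rho_t * np.sqrt(np.diag(np.linalg.inv(Mt)))  # rho(t) * ||e_a||_{Mt^{-1}}
bonus_ucb = rho_t / np.sqrt(counts)
assert np.allclose(bonus_glm, bonus_ucb)    # the two exploration bonuses coincide
```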
54 Generalizing linear bandits Obviously, setting µ(x) = x, we obtain a linear bandit model. [sent-140, score-0.592]
55 In particular, the maximum quasi-likelihood estimator becomes the least-squares estimator and, as noted earlier, the algorithm behaves identically to one which chooses the parameter optimistically within the confidence ellipsoid {θ : ‖θ − θ̂_t‖_{M_t} ≤ ρ(t)}. [sent-142, score-0.143]
56 Dependence in the Number of Arms In contrast to an algorithm such as UCB, Algorithm 1 does not require that all arms be played even once. [sent-143, score-0.398]
57 To understand this phenomenon, observe that, as M_{t+1} = M_t + m_{A_t} m_{A_t}^⊤, one has ‖m_a‖²_{M_{t+1}^{-1}} = ‖m_a‖²_{M_t^{-1}} − (m_a^⊤ M_t^{-1} m_{A_t})² / (1 + ‖m_{A_t}‖²_{M_t^{-1}}) for any arm a. [sent-144, score-0.493]
58 Thus the exploration bonus β_{t+1}^a decreases for all arms, except those which are exactly orthogonal to m_{A_t} (in the M_t^{-1} metric). [sent-145, score-0.133]
59 The decrease is most significant for arms that are collinear to m_{A_t}. [sent-146, score-0.316]
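This rank-one (Sherman-Morrison) update can be verified numerically; the quantities below are randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
Mt = A @ A.T + np.eye(d)          # a positive-definite design matrix M_t
m_at = rng.normal(size=d)         # feature of the arm played at time t, m_{A_t}
m_a = rng.normal(size=d)          # feature of an arbitrary arm a

Mt_next = Mt + np.outer(m_at, m_at)
lhs = m_a @ np.linalg.inv(Mt_next) @ m_a
Mt_inv = np.linalg.inv(Mt)
rhs = m_a @ Mt_inv @ m_a - (m_a @ Mt_inv @ m_at) ** 2 / (1.0 + m_at @ Mt_inv @ m_at)
assert np.isclose(lhs, rhs)       # the squared exploration bonus can only decrease
```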
60 This explains why the regret bounds obtained in Theorems 1 and 2 below depend on d but not on K. [sent-147, score-0.273]
61 4 Theoretical analysis In this section we first give our finite sample regret bounds and then show how the algorithm can be tuned based on asymptotic arguments. [sent-148, score-0.365]
62 The link function µ : R → R is continuously differentiable, Lipschitz with constant k_µ and such that c_µ = inf_{θ∈Θ, a∈A} µ̇(m_a^⊤θ) > 0. [sent-152, score-0.091]
63 The norm of the covariates in {m_a : a ∈ A} is bounded: there exists c_m < ∞ such that for all a ∈ A, ‖m_a‖_2 ≤ c_m. [sent-155, score-0.253]
64 As for the standard UCB algorithm, the regret can be analyzed in terms of the difference between the expected reward received when playing an optimal arm a∗ and that of the best sub-optimal arm: ∆(θ∗) = min_{a : µ(m_a^⊤θ∗) < µ(m_{a∗}^⊤θ∗)} ( µ(m_{a∗}^⊤θ∗) − µ(m_a^⊤θ∗) ). [sent-165, score-0.867]
65 Theorem 1 establishes a high-probability bound on the regret of GLM-UCB when using ρ(t) = (2 k_µ κ R_max / c_µ) √(2 d log(t) log(2 d T / δ)) (8). (Footnote 4: Of course, the linear bandit algorithms also share this property with our algorithm.) [sent-166, score-0.632]
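As a sketch, the exploration factor of (8) as reconstructed above can be computed as follows; the constants k_µ, c_µ, κ, R_max and δ are problem-dependent and the values shown are purely illustrative.

```python
import numpy as np

def rho(t, d, T, k_mu, c_mu, kappa, R_max, delta):
    """Exploration factor of equation (8), as reconstructed above."""
    return (2.0 * k_mu * kappa * R_max / c_mu) * np.sqrt(
        2.0 * d * np.log(t) * np.log(2.0 * d * T / delta)
    )

# Illustrative values only
print(rho(t=100, d=5, T=1000, k_mu=0.25, c_mu=0.05, kappa=1.0, R_max=1.0, delta=0.05))
```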
66 Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies P( Regret_T ≤ (d + 1) R_max + (C d² / ∆(θ∗)) log²[s T] log(2 d T / δ) ) ≥ 1 − δ, with C = 32 κ² R_max² k_µ² / c_µ². [sent-170, score-0.234]
67 Note that the above regret bound depends on the true value of θ∗ through ∆(θ∗). [sent-171, score-0.26]
68 The following theorem provides an upper bound on the regret that holds independently of θ∗. [sent-172, score-0.234]
69 Then, under Assumptions 1–3, for all T ≥ 1, the regret satisfies P( Regret_T ≤ (d + 1) R_max + C d log[s T] √(T log(2 d T / δ)) ) ≥ 1 − δ, with C = 8 R_max k_µ κ. [sent-175, score-0.234]
70 A look into the proof of the regret bound readily explains this observation: the technical nature of the arguments makes some approximations unavoidable, in particular several applications of the Cauchy-Schwarz inequality, which lead to pessimistic confidence bounds. [sent-183, score-0.342]
71 We provide here some asymptotic arguments that suggest choosing significantly smaller exploration bonuses, which will in turn be validated by the numerical experiments presented in Section 5. [sent-184, score-0.18]
72 Consider the canonical GLM associated with an inverse link function µ and assume that the vectors of covariates X are drawn independently under a fixed distribution. [sent-185, score-0.243]
73 This random design model would for instance describe the situation when the arms are drawn randomly from a fixed distribution. [sent-186, score-0.316]
74 That is, inflating the exploration bonus by a factor of k_µ/c_µ compared to the usual UCB setting is sufficient. [sent-196, score-0.133]
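For intuition, the sketch below computes the inflation factor k_µ/c_µ for the logistic link when the linear predictor m_a^⊤θ is assumed to lie in a bounded range (the range [-3, 3] is an illustrative assumption).

```python
import numpy as np

def mu_dot(x):
    """Derivative of the logistic link mu(x) = 1 / (1 + exp(-x))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Suppose m_a^T theta is known to lie in [-3, 3] for all admissible (a, theta).
grid = np.linspace(-3.0, 3.0, 1001)
k_mu = mu_dot(grid).max()          # Lipschitz constant of mu (attained at 0: 1/4)
c_mu = mu_dot(grid).min()          # attained at the boundary of the range
print("inflation factor k_mu / c_mu =", k_mu / c_mu)
```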
75 5 Experiments To the best of our knowledge, there is currently no public benchmark available to test bandit methods on real world data. [sent-198, score-0.337]
76 On simulated data, the proposed method unsurprisingly outperforms its competitors when the data is indeed simulated from a well-specified generalized linear model. [sent-199, score-0.127]
77 The values of the response variable for the data points assigned to each cluster are viewed as the outcomes of an arm while the centroid of the cluster is taken as the 11-dimensional vector of covariates characteristic of the arm. [sent-204, score-0.552]
78 To cast the problem into the logistic regression framework, each response variable is binarized by associating the first class (“Spruce/Fir”) with a response R = 1 and the other six classes with R = 0. [sent-205, score-0.133]
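A sketch of this preprocessing is given below, using scikit-learn's KMeans as a stand-in; the exact clustering procedure, number of clusters and column layout used in the paper's experiment are assumptions here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_arms(X, y, n_arms):
    """Cluster the quantitative covariates and turn each cluster into an arm.

    X: (n, d) quantitative covariates, y: (n,) cover-type labels (1 = "Spruce/Fir")
    Returns per-arm centroids (feature vectors) and per-arm binarized outcome pools.
    """
    labels = KMeans(n_clusters=n_arms, n_init=10, random_state=0).fit_predict(X)
    centroids = np.vstack([X[labels == a].mean(axis=0) for a in range(n_arms)])
    rewards = [(y[labels == a] == 1).astype(int) for a in range(n_arms)]
    return centroids, rewards
```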
79 The proportion of responses equal to 1 in each cluster (in other words, the expected reward associated with each arm) ranges from 0. [sent-206, score-0.214]
80 Obviously, the logistic regression model is not satisfied, although we do expect some regularity with respect to the position of the cluster’s centroid as the logistic regression trained on all data reaches a 0. [sent-212, score-0.2]
81 Figure 1: Top: Regret of the UCB, GLM-UCB and the ε-greedy algorithms. [sent-214, score-0.319]
82 Bottom: Frequencies with which the 20 best arms were drawn by the UCB and GLM-UCB algorithms. [sent-215, score-0.316]
83 Third, an ε-greedy algorithm that performs logistic regression and plays the best estimated action, Â_t = argmax_a µ(m_a^⊤θ̂_t), with probability 1 − ε (with ε = 0. [sent-220, score-0.167]
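A sketch of this ε-greedy comparator, assuming the current estimate θ̂_t and the arm features are available (names are illustrative):

```python
import numpy as np

def epsilon_greedy_action(M, theta_hat, mu, eps, rng):
    """Play the estimated best arm with probability 1 - eps, else a uniform random arm."""
    if rng.random() < eps:
        return int(rng.integers(M.shape[0]))
    return int(np.argmax(mu(M @ theta_hat)))
```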
84 We observe in the top graph of Figure 1 that the GLM-UCB algorithm achieves the smallest average regret by a large margin. [sent-222, score-0.234]
85 When the parameter is well estimated, the greedy algorithm may find the best arm quickly and then incurs only a small regret. [sent-223, score-0.346]
86 The lower plot of Figure 1 shows the number of times each of the 20 best arms has been played by the UCB and GLM-UCB algorithms. [sent-225, score-0.398]
87 The arms are sorted in decreasing order of expected reward. [sent-226, score-0.353]
88 This behavior is made possible by the predictive power of the covariates: by sharing information between arms, it is possible to obtain sufficiently accurate predictions of the expected rewards of all actions, even for those that have never (or rarely) been played. [sent-228, score-0.132]
89 The dataset also contains a record of the users' clicks on the ads that were presented on these pages. [sent-233, score-0.117]
90 The pages (ads) were partitioned into 10 (respectively, 8) categories using Latent Dirichlet Allocation [15] applied to their respective textual content (in the case of ads, the textual content was that of the page pointed to by the ad's link). [sent-236, score-0.113]
91 The action space is composed of the 80 pairs of page and ad categories: when a pair is chosen, it is presented to a group of 50 users, randomly selected from the database, and the reward is the number of recorded clicks. [sent-238, score-0.27]
92 The vector of covariates for each pair is of dimension 19: it is composed of an intercept followed by the concatenation of two vectors of dimension 10 and 8 representing, respectively, the categories of the pages and the ads. [sent-241, score-0.215]
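A sketch of this covariate construction is shown below; encoding the categories as one-hot indicators is an assumption consistent with the description, not necessarily the paper's exact representation.

```python
import numpy as np

N_PAGE_CATS, N_AD_CATS = 10, 8

def pair_covariate(page_cat, ad_cat):
    """Intercept followed by page- and ad-category indicators (dimension 19)."""
    x = np.zeros(1 + N_PAGE_CATS + N_AD_CATS)
    x[0] = 1.0                                  # intercept
    x[1 + page_cat] = 1.0                       # page category in {0, ..., 9}
    x[1 + N_PAGE_CATS + ad_cat] = 1.0           # ad category in {0, ..., 7}
    return x

# The 80 actions are all (page category, ad category) pairs:
actions = [(p, a) for p in range(N_PAGE_CATS) for a in range(N_AD_CATS)]
M = np.vstack([pair_covariate(p, a) for p, a in actions])   # shape (80, 19)
```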
93 Given the rather limited predictive power of the covariates in this example, this is an encouraging illustration of the potential of techniques which use vectors of covariates in real-life applications. [sent-246, score-0.262]
94 Figure 2: Comparison of the regret of the UCB, GLM-UCB and the ε-greedy (ε = 0. [sent-247, score-0.234]
95 6 Conclusions We have introduced an approach that generalizes the linear regression model studied by [10, 8, 12]. [sent-249, score-0.096]
96 As in the original UCB algorithm, the proposed GLM-UCB method operates directly in the reward space. [sent-250, score-0.166]
97 An interesting open problem (already challenging in the linear case) consists in tightening the theoretical results obtained so far in order to bridge the gap between the existing (pessimistic) confidence bounds and those suggested by the asymptotic arguments presented in Section 4. [sent-253, score-0.19]
98 The epoch-greedy algorithm for multi-armed bandits with side information. [sent-291, score-0.185]
99 Forced-exploration based algorithms for playing in stochastic linear bandits. [sent-323, score-0.123]
100 Strong consistency of maximum quasi-likelihood estimators in generalized linear models with fixed and adaptive designs. [sent-341, score-0.096]
wordName wordTfidf (topN-words)
[('ucb', 0.423), ('bandit', 0.337), ('arm', 0.319), ('arms', 0.316), ('mt', 0.258), ('regret', 0.234), ('bandits', 0.185), ('reward', 0.143), ('glm', 0.136), ('covariates', 0.116), ('rt', 0.099), ('rewards', 0.095), ('mat', 0.094), ('mak', 0.092), ('agent', 0.089), ('playing', 0.088), ('ma', 0.087), ('ads', 0.087), ('dence', 0.086), ('played', 0.082), ('regrett', 0.08), ('rmax', 0.077), ('bonus', 0.069), ('rk', 0.068), ('payoffs', 0.066), ('exploration', 0.064), ('asymptotic', 0.062), ('generalized', 0.061), ('xk', 0.061), ('telecom', 0.06), ('paristech', 0.06), ('ltci', 0.06), ('gt', 0.06), ('link', 0.06), ('estimator', 0.056), ('cumulated', 0.055), ('glms', 0.055), ('arguments', 0.054), ('poisson', 0.053), ('optimistic', 0.053), ('cnrs', 0.052), ('logistic', 0.051), ('rd', 0.05), ('argmaxa', 0.049), ('received', 0.046), ('garivier', 0.046), ('rusmevichientong', 0.046), ('textual', 0.045), ('generalizing', 0.045), ('ad', 0.043), ('action', 0.04), ('mai', 0.04), ('filippi', 0.04), ('paris', 0.04), ('bounds', 0.039), ('nt', 0.039), ('szepesv', 0.039), ('expected', 0.037), ('canonical', 0.037), ('regression', 0.036), ('linear', 0.035), ('advertisement', 0.035), ('cluster', 0.034), ('france', 0.033), ('pertaining', 0.033), ('competitors', 0.031), ('ellipsoid', 0.031), ('continuously', 0.031), ('plays', 0.031), ('con', 0.031), ('users', 0.03), ('slowly', 0.03), ('payoff', 0.03), ('vectors', 0.03), ('tuned', 0.03), ('strictly', 0.029), ('tuning', 0.029), ('carried', 0.028), ('pessimistic', 0.028), ('obviously', 0.028), ('greedy', 0.027), ('centroid', 0.026), ('forest', 0.026), ('bound', 0.026), ('categorical', 0.026), ('internet', 0.025), ('reference', 0.025), ('generalizes', 0.025), ('cm', 0.025), ('covariate', 0.025), ('actions', 0.024), ('article', 0.024), ('xx', 0.024), ('poor', 0.024), ('operates', 0.023), ('response', 0.023), ('moderate', 0.023), ('dimension', 0.023), ('categories', 0.023), ('receive', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
2 0.27088594 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
3 0.18869936 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
4 0.16603726 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
5 0.16354913 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
6 0.14728637 192 nips-2010-Online Classification with Specificity Constraints
7 0.13533247 222 nips-2010-Random Walk Approach to Regret Minimization
8 0.12164579 229 nips-2010-Reward Design via Online Gradient Ascent
9 0.10497639 108 nips-2010-Graph-Valued Regression
10 0.10061206 4 nips-2010-A Computational Decision Theory for Interactive Assistants
11 0.10054265 138 nips-2010-Large Margin Multi-Task Metric Learning
12 0.080158696 182 nips-2010-New Adaptive Algorithms for Online Classification
13 0.075676225 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
14 0.073710479 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
15 0.072605304 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
16 0.070774801 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
17 0.070593826 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
18 0.067819513 282 nips-2010-Variable margin losses for classifier design
19 0.066571563 66 nips-2010-Double Q-learning
20 0.064737596 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
topicId topicWeight
[(0, 0.191), (1, -0.15), (2, 0.051), (3, 0.018), (4, 0.012), (5, 0.08), (6, -0.101), (7, -0.046), (8, -0.071), (9, 0.267), (10, 0.001), (11, 0.056), (12, -0.011), (13, -0.053), (14, 0.045), (15, 0.017), (16, -0.105), (17, 0.005), (18, 0.063), (19, 0.014), (20, -0.026), (21, -0.101), (22, -0.073), (23, 0.086), (24, -0.012), (25, 0.052), (26, 0.024), (27, -0.062), (28, 0.099), (29, 0.005), (30, 0.013), (31, -0.071), (32, -0.114), (33, -0.015), (34, -0.064), (35, -0.016), (36, 0.087), (37, -0.176), (38, -0.022), (39, 0.041), (40, -0.022), (41, 0.043), (42, -0.211), (43, 0.079), (44, 0.073), (45, 0.077), (46, 0.095), (47, -0.047), (48, -0.024), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.92484671 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
2 0.75179398 183 nips-2010-Non-Stochastic Bandit Slate Problems
Author: Satyen Kale, Lev Reyzin, Robert E. Schapire
Abstract: We consider bandit problems, motivated by applications in online advertising and news story selection, in which the learner must repeatedly select a slate, that is, a subset of size s from K possible actions, and then receives rewards for just the selected actions. The goal is to minimize the regret with respect to total reward of the best slate computed in hindsight. We consider unordered and ordered versions √ of the problem, and give efficient algorithms which have regret O( T ), where the constant depends on the specific nature of the problem. We also consider versions of the problem where we have access to a number of policies which make recom√ mendations for slates in every round, and give algorithms with O( T ) regret for competing with the best such policy as well. We make use of the technique of relative entropy projections combined with the usual multiplicative weight update algorithm to obtain our algorithms. 1
3 0.6999523 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
4 0.55800283 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
Author: Martha White, Adam White
Abstract: The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent’s confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agentenvironment interaction rarely satisfies traditional assumptions. Samples of valueestimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking. 1
5 0.55105335 4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution. 1
6 0.54318976 229 nips-2010-Reward Design via Online Gradient Ascent
7 0.53037888 192 nips-2010-Online Classification with Specificity Constraints
8 0.50016165 222 nips-2010-Random Walk Approach to Regret Minimization
9 0.43543378 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
10 0.42916796 66 nips-2010-Double Q-learning
11 0.41330072 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
12 0.35823759 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
13 0.34875652 121 nips-2010-Improving Human Judgments by Decontaminating Sequential Dependencies
14 0.3305479 178 nips-2010-Multivariate Dyadic Regression Trees for Sparse Learning Problems
15 0.33039922 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
16 0.32814345 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
17 0.32774973 108 nips-2010-Graph-Valued Regression
18 0.32218921 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
19 0.32121569 152 nips-2010-Learning from Logged Implicit Exploration Data
20 0.31606025 202 nips-2010-Parallelized Stochastic Gradient Descent
topicId topicWeight
[(13, 0.064), (17, 0.01), (27, 0.062), (30, 0.067), (35, 0.012), (39, 0.016), (45, 0.211), (50, 0.047), (52, 0.034), (60, 0.046), (71, 0.252), (77, 0.054), (78, 0.02), (90, 0.037)]
simIndex simValue paperId paperTitle
1 0.90155452 247 nips-2010-Sparse Instrumental Variables (SPIV) for Genome-Wide Studies
Author: Paul Mckeigue, Jon Krohn, Amos J. Storkey, Felix V. Agakov
Abstract: This paper describes a probabilistic framework for studying associations between multiple genotypes, biomarkers, and phenotypic traits in the presence of noise and unobserved confounders for large genetic studies. The framework builds on sparse linear methods developed for regression and modified here for inferring causal structures of richer networks with latent variables. The method is motivated by the use of genotypes as “instruments” to infer causal associations between phenotypic biomarkers and outcomes, without making the common restrictive assumptions of instrumental variable methods. The method may be used for an effective screening of potentially interesting genotype-phenotype and biomarker-phenotype associations in genome-wide studies, which may have important implications for validating biomarkers as possible proxy endpoints for early-stage clinical trials. Where the biomarkers are gene transcripts, the method can be used for fine mapping of quantitative trait loci (QTLs) detected in genetic linkage studies. The method is applied for examining effects of gene transcript levels in the liver on plasma HDL cholesterol levels for a sample of sequenced mice from a heterogeneous stock, with ∼ 105 genetic instruments and ∼ 47 × 103 gene transcripts. 1
same-paper 2 0.8182292 203 nips-2010-Parametric Bandits: The Generalized Linear Case
Author: Sarah Filippi, Olivier Cappe, Aurélien Garivier, Csaba Szepesvári
Abstract: We consider structured multi-armed bandit problems based on the Generalized Linear Model (GLM) framework of statistics. For these bandits, we propose a new algorithm, called GLM-UCB. We derive finite time, high probability bounds on the regret of the algorithm, extending previous analyses developed for the linear bandits to the non-linear case. The analysis highlights a key difficulty in generalizing linear bandit algorithms to the non-linear case, which is solved in GLM-UCB by focusing on the reward space rather than on the parameter space. Moreover, as the actual effectiveness of current parameterized bandit algorithms is often poor in practice, we provide a tuning method based on asymptotic arguments, which leads to significantly better practical performance. We present two numerical experiments on real-world data that illustrate the potential of the GLM-UCB approach. Keywords: multi-armed bandit, parametric bandits, generalized linear models, UCB, regret minimization. 1
3 0.7438252 217 nips-2010-Probabilistic Multi-Task Feature Selection
Author: Yu Zhang, Dit-Yan Yeung, Qian Xu
Abstract: Recently, some variants of the đ?‘™1 norm, particularly matrix norms such as the đ?‘™1,2 and đ?‘™1,∞ norms, have been widely used in multi-task learning, compressed sensing and other related areas to enforce sparsity via joint regularization. In this paper, we unify the đ?‘™1,2 and đ?‘™1,∞ norms by considering a family of đ?‘™1,đ?‘ž norms for 1 < đ?‘ž ≤ ∞ and study the problem of determining the most appropriate sparsity enforcing norm to use in the context of multi-task feature selection. Using the generalized normal distribution, we provide a probabilistic interpretation of the general multi-task feature selection problem using the đ?‘™1,đ?‘ž norm. Based on this probabilistic interpretation, we develop a probabilistic model using the noninformative Jeffreys prior. We also extend the model to learn and exploit more general types of pairwise relationships between tasks. For both versions of the model, we devise expectation-maximization (EM) algorithms to learn all model parameters, including đ?‘ž, automatically. Experiments have been conducted on two cancer classiďŹ cation applications using microarray gene expression data. 1
4 0.70343542 63 nips-2010-Distributed Dual Averaging In Networks
Author: Alekh Agarwal, Martin J. Wainwright, John C. Duchi
Abstract: The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks. 1
5 0.70218217 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
Author: Seunghak Lee, Jun Zhu, Eric P. Xing
Abstract: To understand the relationship between genomic variations among population and complex diseases, it is essential to detect eQTLs which are associated with phenotypic effects. However, detecting eQTLs remains a challenge due to complex underlying mechanisms and the very large number of genetic loci involved compared to the number of samples. Thus, to address the problem, it is desirable to take advantage of the structure of the data and prior information about genomic locations such as conservation scores and transcription factor binding sites. In this paper, we propose a novel regularized regression approach for detecting eQTLs which takes into account related traits simultaneously while incorporating many regulatory features. We first present a Bayesian network for a multi-task learning problem that includes priors on SNPs, making it possible to estimate the significance of each covariate adaptively. Then we find the maximum a posteriori (MAP) estimation of regression coefficients and estimate weights of covariates jointly. This optimization procedure is efficient since it can be achieved by using a projected gradient descent and a coordinate descent procedure iteratively. Experimental results on simulated and real yeast datasets confirm that our model outperforms previous methods for finding eQTLs.
6 0.70143944 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
7 0.70142114 117 nips-2010-Identifying graph-structured activation patterns in networks
8 0.70125991 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
9 0.7006579 182 nips-2010-New Adaptive Algorithms for Online Classification
10 0.7003333 265 nips-2010-The LASSO risk: asymptotic results and real world examples
11 0.70011067 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
12 0.69952542 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
13 0.69903529 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
14 0.69864726 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
15 0.69862324 148 nips-2010-Learning Networks of Stochastic Differential Equations
16 0.69778466 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
17 0.69721055 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
18 0.69663769 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
19 0.69659734 155 nips-2010-Learning the context of a category
20 0.69647521 280 nips-2010-Unsupervised Kernel Dimension Reduction