nips nips2011 nips2011-245 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The problem of selecting the right state-representation in a reinforcement learning problem is considered. [sent-6, score-0.164]
2 Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). [sent-8, score-0.511]
3 We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon. [sent-9, score-0.48]
4 1 Introduction We consider the problem of selecting the right state-representation in an average-reward reinforcement learning problem. [sent-10, score-0.164]
5 Each state-representation is defined by a model φ_j (to which corresponds a state space S_{φ_j}), and we assume that the number J of available models is finite and that (at least) one model is a weakly-communicating Markov decision process (MDP). [sent-11, score-0.128]
6 This problem is considered in the general reinforcement learning setting, where an agent interacts with an unknown environment in a single stream of repeated observations, actions and rewards. [sent-13, score-0.181]
7 Our goal is to construct an algorithm that performs almost as well as the algorithm that knows both which model is an MDP (knows the “true” model) and the characteristics of this MDP (the transition probabilities and rewards). [sent-15, score-0.173]
8 Given a potentially large number of feature combinations of this kind, we want to find a policy whose average reward is as good as that of the best policy for the right combination of features. [sent-21, score-0.45]
9 Again, we would like to find a policy that is as good as the optimal policy for the right discretization. [sent-28, score-0.37]
10 In fact, the constructed algorithm never “knows” which model is the right one; it is “only” able to get the same average level of reward as if it knew. [sent-33, score-0.161]
11 Such a problem was pioneered in the reinforcement learning literature by [7] and then improved in various ways by [4, 11, 12, 6, 3]; UCRL2 achieves a regret of order DT^{1/2} in any weakly-communicating MDP with diameter D, with respect to the best policy for this MDP. [sent-37, score-0.861]
12 The diameter D of an MDP is defined in [6] as the expected minimum time required to reach any state starting from any other state. [sent-38, score-0.188]
13 A similar approach has been considered in [10]; the difference is that in that work the probabilistic characteristics of each model are completely known, but the models are not assumed to be Markovian, and belong to a countably infinite (rather than finite) set. [sent-40, score-0.206]
14 In the usual bandit setting, the rewards are assumed to be i.i.d.; [sent-44, score-0.304]
15 thus one can estimate the mean value of the arms while switching arbitrarily from one arm to the next (the quality of the estimate only depends on the number of pulls of each arm). [sent-47, score-0.127]
16 However, in our setting, estimating the average-reward of a policy requires playing it many times consecutively. [sent-48, score-0.24]
17 We show that despite the fact that the true Markov model of states is unknown and that nothing is assumed about the wrong representations, it is still possible to derive a finite-time analysis of the regret for this problem. [sent-51, score-0.605]
18 This is stated in Theorem 1; the bound on the regret that we obtain is of order T^{2/3}. [sent-52, score-0.486]
19 The intuition is that if the “true” model φ∗ is known, but its probabilistic properties are not, then we still know that there exists an optimal control policy that depends on the observed state s_{j∗,t} only. [sent-53, score-0.318]
20 Therefore, the optimal rate of rewards can be obtained by a clever exploration/exploitation strategy, such as the UCRL2 algorithm [6]. [sent-54, score-0.221]
21 Since we do not know in advance which model is an MDP, we need to explore them all, for a sufficiently long time, in order to estimate the rate of rewards that one can get using a good policy in that model. [sent-55, score-0.461]
22 2 Notation and definitions We consider a space of observations O, a space of actions A, and a space of rewards R (all assumed to be Polish). [sent-61, score-0.533]
23 Definition 1 We say that an environment e with a state representation function φ is Markov, or, for short, that φ is a Markov model (of e), if the process (s_{t,φ}, a_t, r_t), t ∈ N, is a (weakly communicating) Markov decision process. [sent-69, score-0.227]
24 For that purpose we define the regret of any strategy at time T, like in [6, 3], as ∆(T) := T ρ∗ − Σ_{t=1}^{T} r_t, where r_t are the rewards received when following the proposed strategy and ρ∗ is the average optimal value in the best Markov model, i.e. [sent-74, score-1.171]
25 ρ∗ := lim_{T→∞} (1/T) E( Σ_{t=1}^{T} r_t(π∗) ), where r_t(π∗) are the rewards received when following the optimal policy π∗ for the best Markov model. [sent-76, score-0.602]
26 Note that this definition makes sense since when the MDP is weakly communicating, the average optimal value of reward does not depend on the initial state. [sent-77, score-0.166]
27 Also, one could replace T ρ∗ with the expected sum of rewards obtained in T steps (following the optimal policy) at the price of an additional O(√T) term. [sent-78, score-0.253]
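To make the regret definition concrete, here is a minimal sketch of how ∆(T) would be computed from a trace of received rewards; the names rho_star and rewards are illustrative and not from the paper.

```python
def regret(rho_star, rewards):
    """Compute Delta(T) = T * rho_star - sum_{t=1}^T r_t for a finished trace.

    rho_star -- average optimal reward of the best Markov model (used here only
                to measure regret; the algorithm itself never knows it).
    rewards  -- list of rewards r_1, ..., r_T received by the strategy.
    """
    T = len(rewards)
    return T * rho_star - sum(rewards)

# Example: a strategy earning 0.4 per step against an optimal average of 0.5
print(regret(0.5, [0.4] * 100))  # ~10.0 (up to floating-point rounding)
```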
28 In the next subsection, we describe an algorithm that achieves a sub-linear regret of order T^{2/3}. [sent-79, score-0.418]
29 Each stage consists of two phases: an exploration phase and an exploitation phase. [sent-83, score-0.77]
30 In the exploration phase, BLB plays the UCRL2 algorithm on each model (φ_j)_{1 ≤ j ≤ J} successively, as if each model φ_j was a Markov model, for a fixed number τ_{i,1,J} of rounds. [sent-84, score-0.242]
31 The exploitation part consists of first selecting the model with the highest lower bound, according to the empirical rewards obtained in the previous exploration phase. [sent-85, score-0.788]
32 This model is initially selected for the same time as in the exploration phase, and then a test decides to either continue playing this model (if its performance during exploitation is still above the corresponding lower bound, i.e. [sent-86, score-0.712]
33 if the rewards obtained are still at least as good as if it was playing the best model). [sent-88, score-0.276]
34 This continues until the exploitation phase (of fixed length τ_{i,2}) finishes and the next stage starts. [sent-90, score-0.757]
35 Parameters: f, δ. For each stage i ≥ 1 do: set the total length of stage i to be τ_i := 2^i. [sent-91, score-0.649]
36 – Compute the corresponding average empirical reward µ_{i,1}(φ_j) received during this exploration phase. [sent-99, score-0.359]
37 While the current length of the exploitation part is less than τ_{i,2} do – Select j = argmax_{j∈J} { µ_{i,1}(φ_j) − 2B(i, φ_j, δ) } (using (3)). [sent-106, score-0.4]
38 – Run UCRL2 with parameter δ_i(δ) using φ_j: update at each time step t the current average empirical reward µ_{i,2,t}(φ_j) from the beginning of the run. [sent-107, score-0.131]
39 Provided that the length of the current run is larger than τ_{i,1,J}, do the test µ_{i,2,t}(φ_j) ≥ µ_{i,1}(φ_j) − 2B(i, φ_j, δ). [sent-108, score-0.163]
40 The length of stage i is fixed and defined to be τ_i := 2^i. [sent-115, score-0.66]
41 Thus for a total time horizon T, the number of stages I(T) before time T is I(T) := ⌊log_2(T + 1)⌋. [sent-116, score-0.494]
42 Each stage i (of length τ_i) is further decomposed into an exploration phase (of length τ_{i,1}) and an exploitation phase (of length τ_{i,2}). [sent-117, score-0.847]
43 All the models {φ_j}_{j ≤ J} are played one after another for the same amount of time τ_{i,1,J} := τ_{i,1}/J. [sent-119, score-0.391]
44 Each episode 1 ≤ j ≤ J consists in running the UCRL2 algorithm using the model of states and transitions induced by the state-representation function φ_j. [sent-120, score-0.188]
45 Note that UCRL2 does not require the horizon T in advance, but requires a parameter p in order to ensure a near optimal regret bound with probability higher than 1 − p. [sent-121, score-0.646]
46 We define this parameter p to be δ_i(δ) in stage i, where δ_i(δ) := (2^i − (J^{−1} + 1) 2^{2i/3} + 4)^{−1} 2^{−i+1} δ. [sent-122, score-0.583]
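As a small sketch of this per-stage bookkeeping (assuming, as the 2^{2i/3} term in δ_i(δ) suggests, that the exploration length is τ_{i,1} = τ_i^{2/3}; rounding conventions are omitted and the helper name is an illustrative choice, not the paper's):

```python
def stage_parameters(i, J, delta):
    """Per-stage quantities used by BLB (stage index i >= 1); a sketch only.

    Assumes tau_{i,1} = tau_i^{2/3}, which matches the 2^{2i/3} term in delta_i(delta);
    the exact rounding used in the paper is not reproduced here.
    """
    tau_i = 2 ** i                            # total length of stage i
    tau_i1 = tau_i ** (2.0 / 3.0)             # exploration phase length (assumed)
    tau_i1J = tau_i1 / J                      # exploration episode length per model
    tau_i2 = tau_i - tau_i1                   # exploitation phase length
    # confidence parameter handed to UCRL2 during stage i
    delta_i = delta * 2.0 ** (-i + 1) / (2 ** i - (1.0 / J + 1.0) * 2 ** (2.0 * i / 3.0) + 4)
    return tau_i, tau_i1, tau_i1J, tau_i2, delta_i

# Example: stage i = 9 with J = 3 candidate models and overall confidence delta = 0.05
print(stage_parameters(9, 3, 0.05))
```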
47 The average empirical reward received during each episode is written µi,1 (φj ). [sent-123, score-0.306]
48 We use the empirical rewards µi,1 (φj ) received in the previous exploration part of stage i together with a confidence bound in order to select the model to play. [sent-125, score-0.824]
49 Moreover, a model φ is no longer run for a fixed period of time (as in the exploration part of stage i), but for a period τ_{i,2}(φ) that depends on some test; we first initialize J := {1, . . . , J} [sent-126, score-0.523]
50 and then choose j := argmax_{j∈J} { µ_{i,1}(φ_j) − 2B(i, φ_j, δ) }, (2) where we define B(i, φ, δ) := 34 f(τ_{i−1} + τ_{i,1}) |S_φ| √( A log(τ_{i,1,J} / δ_i(δ)) / τ_{i,1,J} ), (3) where δ and the function f are parameters of the BLB algorithm. [sent-129, score-0.624]
51 In parallel we test whether the average empirical reward we receive during this exploitation phase is high enough; at time t, if the length of the current episode is larger than τ_{i,1,J}, we test if µ_{i,2,t}(φ_j) ≥ µ_{i,1}(φ_j) − 2B(i, φ_j, δ). [sent-131, score-0.826]
52 Now, if the test fails, then the model j is discarded (until the end of stage i). [sent-133, score-0.347]
53 We repeat those steps until the total time τ_{i,2} of the exploitation phase of stage i is over. [sent-136, score-0.742]
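The stage just described can be summarized by the following schematic sketch. It treats UCRL2 as a black box, approximates the per-step test (4) by testing at the end of each block of τ_{i,1,J} steps, and the names (run_ucrl2, blb_stage, bounds) are illustrative assumptions, not the paper's code.

```python
def blb_stage(models, run_ucrl2, bounds, tau_i1J, tau_i2, delta_i):
    """One BLB stage (sketch). run_ucrl2(model, horizon, delta) is assumed to return
    the average empirical reward collected while running UCRL2 on `model` for
    `horizon` steps; bounds[j] stands for B(i, phi_j, delta) of equation (3)."""
    # Exploration phase: play UCRL2 on every model for tau_{i,1,J} steps each.
    mu1 = [run_ucrl2(phi, tau_i1J, delta_i) for phi in models]

    # Exploitation phase: repeatedly pick the model with the best lower bound (2)
    # and keep it for as long as it passes the test (4).
    active = set(range(len(models)))
    remaining = tau_i2
    while remaining > 0 and active:
        j = max(active, key=lambda k: mu1[k] - 2 * bounds[k])
        horizon = min(remaining, tau_i1J)   # run at least as long as in exploration
        mu2 = run_ucrl2(models[j], horizon, delta_i)
        remaining -= horizon
        if mu2 < mu1[j] - 2 * bounds[j]:
            active.discard(j)               # test failed: discard j until the stage ends
```

In the actual algorithm the selected model is monitored at every time step once its run exceeds τ_{i,1,J}, and UCRL2 keeps running within a selection rather than being restarted per block; the sketch only reproduces the overall select-then-test structure.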
54 Remark: Note that the model selected for exploitation in (2) is the one that has the best lower bound. [sent-137, score-0.381]
55 We know that if the right model is selected, then with high probability, this model will be kept during the whole exploitation phase. [sent-139, score-0.452]
56 If this is not the right model, then either the policy provides good rewards and we should keep playing it, or it does not, in which case it will not pass the test (4) and will be removed from the set of models that will be exploited in this phase. [sent-140, score-0.561]
57 If there are several such models, let φ∗ be the one with the highest average reward of the optimal policy of the corresponding MDP. [sent-143, score-0.289]
58 Importantly, the algorithm considered here does not know in advance the diameter D of the true model, nor the time horizon T . [sent-148, score-0.34]
59 Instead, the algorithm uses an increasing guess f(t) (e.g. log(t)) on this diameter, which results in the additional regret term c(f, D) and the additional factor f(T); knowing D would enable us to remove both of them, but this is a strong assumption. [sent-151, score-0.418]
60 Choosing f(t) := log_2 t + 1 gives a bound which is of order T^{2/3} in T but is exponential in D; taking f(t) := t^ε we get a bound of order T^{2/3+ε} in T but of polynomial order 1/ε in D. [sent-152, score-0.136]
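To see where this dependence on D comes from, recall that the analysis only controls the true model from the point where the guess exceeds the diameter, i.e. once f(t_i) ≥ D (the condition used later in the proof), and that the earlier stages are charged at full price in the constant c(f, D) = Σ_{i∈I_0} 2^i ρ∗. A rough, hedged calculation: with f(t) = log_2 t + 1 the condition f(t_i) ≥ D is first met once t_i ≥ 2^{D−1}, so the discarded stages can contribute c(f, D) = O(2^D ρ∗), exponential in D; with f(t) = t^ε it is met once t_i ≥ D^{1/ε}, giving c(f, D) = O(D^{1/ε} ρ∗), polynomial of order 1/ε in D.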
61 The “wrong” models are used during exploitation stages only as long as they are giving rewards that are higher than the rewards that could be obtained in the “true” model. [sent-155, score-0.882]
62 All the models are explored sufficiently long so as to be able to estimate the optimal reward level in the true model, and to learn its policy. [sent-156, score-0.21]
63 As long as the rewards are high, we do not care whether the model we are using is correct or not. [sent-163, score-0.256]
64 While passing from a finite to a countably infinite set appears rather straightforward, getting rid of the assumption that this set contains the true model seems more difficult. [sent-167, score-0.128]
65 What one would want to obtain in this setting is sub-linear regret with respect to the performance of the optimal policy in the best model; this, however, seems difficult without additional assumptions on the probabilistic characteristics of the models. [sent-168, score-0.689]
66 The reader familiar with the adversarial bandit literature will notice that our bound of order T^{2/3} is worse than the T^{1/2} bounds that usually appear in this context (see, for example, [2]). [sent-173, score-0.189]
67 The reason is that our notion of regret is different: in the adversarial bandit literature, the regret is measured with respect to the best choice of arm for the given fixed history. [sent-174, score-0.997]
68 In contrast, we measure the regret with respect to the best policy (which knows the correct model and its parameters) that, in general, would obtain completely different (from what our algorithm would get) rewards and observations right from the beginning. [sent-175, score-0.937]
69 As previously mentioned, a possibly large additive constant c(f, D) appears in the regret since we do not know a bound on the diameter of the MDP in the “true” model, and use log T instead. [sent-177, score-0.67]
70 Finding a way to properly address this problem by estimating online the diameter of the MDP is an interesting open question. [sent-178, score-0.151]
71 First, we notice that, as reported in [6], when we compute an optimistic model based on the empirical rewards and transitions of the true model, the span of the corresponding optimistic value function sp(V + ) is always smaller than the diameter D. [sent-180, score-0.681]
72 This span increases as we get more reward and transition samples, which gives a natural empirical lower bound on D. [sent-181, score-0.376]
73 In [3], the authors derive a regret bound that scales with the span of the true value function sp(V ), which is also less than D, and can be significantly smaller in some cases. [sent-183, score-0.595]
74 However, since we do not have the property that sp(V^+) ≤ sp(V), we need to introduce an explicit penalization in order to control the span of the computed optimistic models, and this requires assuming we know an upper bound B on sp(V) in order to guarantee a final regret bound scaling with B. [sent-184, score-0.737]
75 Then we decompose the regret into the sum of the regret in each stage i. [sent-189, score-1.107]
76 After analyzing the contribution to the regret in stage i, we then gather all stages and tune the length of each stage and episode in order to get the final regret bound. [sent-190, score-1.698]
77 What is interesting is that this diameter does not need to be known by the algorithm. [sent-193, score-0.151]
78 Also, by carefully looking at the proof of UCRL, it can be shown that the following bound is also valid with probability higher than 1 − δ': (1/τ) Σ_{t=t_1}^{t_1+τ−1} r_t − ρ∗ ≥ −34 D |S_φ| √( A log(τ/δ') / τ ). [sent-194, score-0.226]
79 We now define the following quantity, for every model φ, episode length τ and δ' ∈ (0, 1): B_D(τ, φ, δ') := 34 D |S_φ| √( A log(τ/δ') / τ ). [sent-195, score-0.545]
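As a small sketch (the function names and argument order are illustrative), the idealized quantity B_D, which requires knowing the diameter D, can be put next to the bound B used by BLB, which substitutes the guess f(·) for D:

```python
import math

def B_D(tau, n_states, n_actions, D, delta_prime):
    """Idealized bound B_D(tau, phi, delta') = 34 D |S_phi| sqrt(A log(tau/delta') / tau);
    it depends on the unknown diameter D and is used only in the analysis."""
    return 34.0 * D * n_states * math.sqrt(n_actions * math.log(tau / delta_prime) / tau)

def B(tau_elapsed, tau_i1J, n_states, n_actions, delta_i, f):
    """The algorithmic bound B(i, phi, delta) of (3), as reconstructed above: the
    diameter D is replaced by the guess f(.) evaluated at the (assumed) elapsed time."""
    return 34.0 * f(tau_elapsed) * n_states * math.sqrt(
        n_actions * math.log(tau_i1J / delta_i) / tau_i1J)

# Once f(.) >= D, B(i, phi, delta) dominates B_D(tau_{i,1,J}, phi, delta_i(delta)),
# which is what lets the true model pass the tests until the end of the stage.
```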
80 2 Regret of stage i. In this section we analyze the regret of stage i, which we denote ∆_i. [sent-197, score-0.96]
81 Note that since each stage i ≤ I is of length τ_i = 2^i, except the last one I that may stop before, we have ∆(T) = Σ_{i=1}^{I(T)} ∆_i, (8) where I(T) = ⌊log_2(T + 1)⌋. [sent-198, score-0.373]
82 We further decompose ∆_i = ∆_{i,1} + ∆_{i,2} into the regret ∆_{i,1} corresponding to the exploration phase and the regret ∆_{i,2} corresponding to the exploitation phase. [sent-199, score-1.877]
83 τ_{i,1} is the total length of the exploration phase of stage i and τ_{i,2} is the total length of the exploitation phase of stage i. [sent-200, score-1.255]
84 For each model φ, we write τ_{i,1,J} := τ_{i,1}/J for the number of consecutive steps during which the UCRL2 algorithm is run with model φ in the exploration phase of stage i, and τ_{i,2}(φ) for the number of consecutive steps during which the UCRL2 algorithm is run with model φ in the exploitation phase of stage i. [sent-201, score-1.678]
85 Let us now introduce the two following sets of models, defined after the end of the exploration phase, i.e. [sent-203, score-0.176]
86 G_i := {φ ∈ Φ : µ_{i,1}(φ) − 2B(i, φ, δ) ≥ µ_{i,1}(φ∗) − 2B(i, φ∗, δ)} \ {φ∗}, and B_i := {φ ∈ Φ : µ_{i,1}(φ) − 2B(i, φ, δ) < µ_{i,1}(φ∗) − 2B(i, φ∗, δ)}. [sent-206, score-0.624]
87 1 Regret in the exploration phase. Since in the exploration phase of stage i each model φ is run for τ_{i,1,J} many steps, the regret for each model φ ≠ φ∗ is bounded by τ_{i,1,J} ρ∗. [sent-210, score-1.272]
88 Now the regret for the true model is τ_{i,1,J}(ρ∗ − µ_{i,1}(φ∗)), thus the total contribution to the regret in the exploration phase of stage i is upper-bounded by ∆_{i,1} ≤ τ_{i,1,J}(ρ∗ − µ_{i,1}(φ∗)) + (J − 1) τ_{i,1,J} ρ∗. [sent-211, score-1.433]
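Summing this over stages shows where the T^{2/3} term of the final bound comes from; a rough, hedged calculation, assuming rewards in [0, 1] (so that ρ∗ − µ_{i,1}(φ∗) ≤ ρ∗) and the exploration length τ_{i,1} = τ_i^{2/3} = 2^{2i/3} suggested by the form of δ_i(δ): Σ_{i=1}^{I(T)} ∆_{i,1} ≤ Σ_{i=1}^{I(T)} J τ_{i,1,J} ρ∗ = ρ∗ Σ_{i=1}^{I(T)} 2^{2i/3} = O( 2^{2I(T)/3} ρ∗ ) = O( ρ∗ T^{2/3} ), since 2^{I(T)} ≤ T + 1.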
89 2 Regret in the exploitation phase. By definition, all models in G_i ∪ {φ∗} are selected before any model in B_i is selected. [sent-214, score-0.492]
90 Let us consider some φ ∈ Gi and an event Ωi under which the exploitation phase does not reset. [sent-216, score-0.492]
91 We deduce that the total contribution to the regret of all the models φ ∈ G_i in the exploitation phases on the event Ω_i is bounded by max{ τ_{i,1,J} ρ∗, (τ_{i,2}(φ) − 1)(ρ∗ − µ_{i,1}(φ∗) + 2B(i, φ∗, δ)) + ρ∗ }. [sent-218, score-1.108]
92 Now we show that, with high probability, the true model φ∗ passes all the tests (equation (4)) until the end of the episode i, and thus, equivalently, with high probability no model φ ∈ B_i is selected, so that τ_{i,2}(φ) = 0 for all φ ∈ B_i. [sent-222, score-0.302]
93 For the true model, after τ(φ∗, t) ≥ τ_{i,1,J}, there remain at most (τ_{i,2} − τ_{i,1,J} + 1) possible timesteps where we do the test for the true model φ∗. [sent-223, score-0.19]
94 Now, provided that f(t_i) ≥ D, then B_D(τ_{i,1,J}, φ∗, δ_i(δ)) ≤ B(i, φ∗, δ), thus the true model passes all tests until the end of the exploitation part of stage i on an event Ω_{3,i} of probability higher than 1 − (τ_{i,2} − τ_{i,1,J} + 2) δ_i(δ). [sent-226, score-1.185]
95 The total length of stage i is by definition τ_i = τ_{i,1} + τ_{i,2} = τ_{i,1,J} J + τ_{i,2}, where τ_i := 2^i and τ_{i,1} := τ_i^{2/3}. [sent-237, score-1.002]
96 We now define δ_i(δ) such that δ_i(δ) := (2^i − (J^{−1} + 1) 2^{2i/3} + 4)^{−1} 2^{−i+1} δ. [sent-240, score-0.345]
97 We conclude by using the fact that I(T) ≤ log_2(T + 1); then, with probability higher than 1 − δ, the following bound on the regret holds: ∆(T) ≤ c f(T) S ( A J log(J δ^{−1}) log_2(T) )^{1/2} T^{2/3} + c' D S ( A log(δ^{−1}) log_2(T) )^{1/2} T^{1/2} + c(f, D), [sent-243, score-0.624]
98 for some constants c, c', and where c(f, D) := Σ_{i∈I_0} 2^i ρ∗. [sent-244, score-0.312]
99 REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. [sent-258, score-0.211]
100 Optimistic linear programming gives logarithmic regret for irreducible MDPs. [sent-292, score-0.418]
wordName wordTfidf (topN-words)
[('regret', 0.418), ('exploitation', 0.323), ('def', 0.312), ('stage', 0.271), ('bd', 0.23), ('rewards', 0.197), ('exploration', 0.176), ('policy', 0.161), ('diameter', 0.151), ('mdp', 0.15), ('blb', 0.133), ('gi', 0.132), ('episode', 0.123), ('reinforcement', 0.108), ('reward', 0.104), ('stages', 0.09), ('phase', 0.086), ('rt', 0.084), ('event', 0.083), ('bandit', 0.083), ('ti', 0.082), ('sp', 0.081), ('daniil', 0.08), ('playing', 0.079), ('knows', 0.078), ('length', 0.077), ('deduce', 0.073), ('environment', 0.073), ('markov', 0.071), ('bi', 0.071), ('bound', 0.068), ('optimistic', 0.066), ('communicating', 0.065), ('horizon', 0.062), ('characteristics', 0.062), ('arms', 0.062), ('nord', 0.057), ('true', 0.057), ('markovian', 0.056), ('played', 0.054), ('ryabko', 0.053), ('histories', 0.053), ('lille', 0.053), ('received', 0.052), ('span', 0.052), ('peter', 0.051), ('europe', 0.051), ('higher', 0.05), ('herbert', 0.047), ('wrong', 0.044), ('test', 0.043), ('run', 0.043), ('inria', 0.041), ('aj', 0.04), ('arm', 0.04), ('know', 0.039), ('countably', 0.038), ('weakly', 0.038), ('consecutive', 0.038), ('adversarial', 0.038), ('state', 0.037), ('nicol', 0.037), ('bounded', 0.036), ('marcus', 0.035), ('ambuj', 0.035), ('subroutine', 0.034), ('model', 0.033), ('reset', 0.033), ('log', 0.033), ('passes', 0.032), ('transitions', 0.032), ('pass', 0.032), ('steps', 0.032), ('selecting', 0.032), ('advance', 0.031), ('cf', 0.031), ('nite', 0.031), ('contribution', 0.03), ('total', 0.03), ('nothing', 0.029), ('fails', 0.028), ('empirical', 0.027), ('correct', 0.026), ('upper', 0.026), ('ds', 0.026), ('stop', 0.025), ('switching', 0.025), ('auer', 0.025), ('models', 0.025), ('selected', 0.025), ('probability', 0.024), ('right', 0.024), ('optimal', 0.024), ('assumed', 0.024), ('nitely', 0.024), ('game', 0.024), ('probabilistic', 0.024), ('virginia', 0.023), ('pioneered', 0.023), ('quantizations', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon. 1
2 0.29095763 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
3 0.26148203 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as √(N I_{ρt}(S; A)), where I_{ρt}(S; A) is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as √(N ln K), where K is the number of actions (arms). However, if the side information I_{ρt}(S; A) is not fully used, the regret bound is significantly tighter. In the extreme case, when I_{ρt}(S; A) = 0, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it is using de facto. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
4 0.22850771 220 nips-2011-Prediction strategies without loss
Author: Michael Kapralov, Rina Panigrahy
Abstract: Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say ‘predict 0’ or ‘predict 1’, and our payoff is +1 if the prediction is correct and −1 otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting 0 or always predicting 1. For a sequence of length T our algorithm has regret 14 T and loss √ 2 2 T e− T in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of N experts, where the related problem of trading off regret to the best expert for regret to the ’special’ expert has been studied by Even-Dar et al. (COLT’07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to k-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O( n(log N + log T )), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property. 1
5 0.20986629 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of the oldest heuristics to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
6 0.20274286 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
7 0.19906305 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
8 0.18987462 56 nips-2011-Committing Bandits
9 0.18262039 61 nips-2011-Contextual Gaussian Process Bandit Optimization
10 0.16505949 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
11 0.1571801 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
12 0.14084999 272 nips-2011-Stochastic convex optimization with bandit feedback
13 0.14078633 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
14 0.14078312 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
15 0.1382079 80 nips-2011-Efficient Online Learning via Randomized Rounding
16 0.13665599 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
17 0.13470536 215 nips-2011-Policy Gradient Coagent Networks
18 0.1270155 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
19 0.12625518 145 nips-2011-Learning Eigenvectors for Free
20 0.1213149 25 nips-2011-Adaptive Hedge
topicId topicWeight
[(0, 0.231), (1, -0.415), (2, 0.084), (3, 0.172), (4, 0.063), (5, 0.009), (6, 0.041), (7, 0.116), (8, 0.045), (9, 0.048), (10, -0.115), (11, 0.041), (12, -0.109), (13, -0.034), (14, 0.036), (15, 0.002), (16, -0.028), (17, -0.051), (18, -0.035), (19, -0.024), (20, -0.002), (21, 0.021), (22, 0.013), (23, -0.069), (24, -0.037), (25, -0.042), (26, -0.055), (27, -0.007), (28, 0.014), (29, 0.061), (30, 0.059), (31, 0.002), (32, -0.088), (33, -0.047), (34, -0.07), (35, 0.018), (36, 0.026), (37, 0.032), (38, 0.037), (39, 0.035), (40, 0.043), (41, -0.001), (42, 0.064), (43, 0.012), (44, -0.005), (45, 0.057), (46, 0.02), (47, -0.004), (48, 0.05), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95651543 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon. 1
2 0.86236346 32 nips-2011-An Empirical Evaluation of Thompson Sampling
Author: Olivier Chapelle, Lihong Li
Abstract: Thompson sampling is one of the oldest heuristics to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1
3 0.80806798 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
Author: Yasin Abbasi-yadkori, Csaba Szepesvári, David Tax
Abstract: We improve the theoretical analysis and empirical performance of algorithms for the stochastic multi-armed bandit problem and the linear stochastic multi-armed bandit problem. In particular, we show that a simple modification of Auer’s UCB algorithm (Auer, 2002) achieves with high probability constant regret. More importantly, we modify and, consequently, improve the analysis of the algorithm for the linear stochastic bandit problem studied by Auer (2002), Dani et al. (2008), Rusmevichientong and Tsitsiklis (2010), Li et al. (2010). Our modification improves the regret bound by a logarithmic factor, though experiments show a vast improvement. In both cases, the improvement stems from the construction of smaller confidence sets. For their construction we use a novel tail inequality for vector-valued martingales. 1
4 0.78880334 56 nips-2011-Committing Bandits
Author: Loc X. Bui, Ramesh Johari, Shie Mannor
Abstract: We consider a multi-armed bandit problem where there are two phases. The first phase is an experimentation phase where the decision maker is free to explore multiple options. In the second phase the decision maker has to commit to one of the arms and stick with it. Cost is incurred during both phases with a higher cost during the experimentation phase. We analyze the regret in this setup, and both propose algorithms and provide upper and lower bounds that depend on the ratio of the duration of the experimentation phase to the duration of the commitment phase. Our analysis reveals that if given the choice, it is optimal to experiment Θ(ln T ) steps and then commit, where T is the time horizon.
5 0.73840892 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
Author: Shie Mannor, Ohad Shamir
Abstract: We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known “experts” setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds. 1
6 0.73054516 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
7 0.72365195 272 nips-2011-Stochastic convex optimization with bandit feedback
8 0.6400823 97 nips-2011-Finite Time Analysis of Stratified Sampling for Monte Carlo
9 0.63563466 61 nips-2011-Contextual Gaussian Process Bandit Optimization
10 0.60486436 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments
11 0.57857251 175 nips-2011-Multi-Bandit Best Arm Identification
12 0.57029915 220 nips-2011-Prediction strategies without loss
13 0.53149503 25 nips-2011-Adaptive Hedge
14 0.523821 177 nips-2011-Multi-armed bandits on implicit metric spaces
15 0.51796663 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
16 0.49405345 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
17 0.4928773 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation
18 0.48027116 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions
19 0.47334811 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
20 0.47057486 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
topicId topicWeight
[(0, 0.016), (4, 0.031), (20, 0.033), (26, 0.029), (31, 0.129), (33, 0.019), (35, 0.191), (43, 0.089), (45, 0.149), (57, 0.013), (65, 0.02), (74, 0.051), (79, 0.018), (83, 0.037), (99, 0.095)]
simIndex simValue paperId paperTitle
1 0.91207623 122 nips-2011-How Do Humans Teach: On Curriculum Learning and Teaching Dimension
Author: Faisal Khan, Bilge Mutlu, Xiaojin Zhu
Abstract: We study the empirical strategies that humans follow as they teach a target concept with a simple 1D threshold to a robot.1 Previous studies of computational teaching, particularly the teaching dimension model and the curriculum learning principle, offer contradictory predictions on what optimal strategy the teacher should follow in this teaching task. We show through behavioral studies that humans employ three distinct teaching strategies, one of which is consistent with the curriculum learning principle, and propose a novel theoretical framework as a potential explanation for this strategy. This framework, which assumes a teaching goal of minimizing the learner’s expected generalization error at each iteration, extends the standard teaching dimension model and offers a theoretical justification for curriculum learning. 1
same-paper 2 0.84012616 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
Author: Odalric-ambrym Maillard, Daniil Ryabko, Rémi Munos
Abstract: The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3}, where T is the time horizon. 1
3 0.77044028 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
Author: Alex K. Susemihl, Ron Meir, Manfred Opper
Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1
4 0.77021289 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
Author: George Konidaris, Scott Niekum, Philip S. Thomas
Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an nstep return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1
5 0.76842463 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.76731819 180 nips-2011-Multiple Instance Filtering
7 0.7649163 258 nips-2011-Sparse Bayesian Multi-Task Learning
8 0.76236379 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
9 0.75975955 271 nips-2011-Statistical Tests for Optimization Efficiency
10 0.75911033 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
11 0.75828773 283 nips-2011-The Fixed Points of Off-Policy TD
12 0.75815755 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
13 0.75519693 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.75483388 32 nips-2011-An Empirical Evaluation of Thompson Sampling
15 0.75443852 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
16 0.75328195 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
17 0.75100583 263 nips-2011-Sparse Manifold Clustering and Embedding
18 0.75039119 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
19 0.75006968 66 nips-2011-Crowdclustering
20 0.74960989 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning