nips nips2001 nips2001-55 knowledge-graph by maker-knowledge-mining

55 nips-2001-Convergence of Optimistic and Incremental Q-Learning


Source: pdf

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. [sent-2, score-0.757]

2 We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. [sent-3, score-0.192]

3 The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. [sent-4, score-0.484]

4 We show that incremental Q-learning converges, in the limit, to the optimal policy. [sent-5, score-0.343]

5 Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning. [sent-6, score-0.355]

6 The environment is modeled by an MDP and we can only observe the trajectory of states, actions and rewards generated by the agent wandering in the MDP. [sent-8, score-0.22]

7 The first is model based, where we first reconstruct a model of the MDP, and then find an optimal policy for the approximate model. [sent-10, score-0.275]

8 The second consists of direct methods, which update their estimated policy after each step. [sent-12, score-0.215]

9 Although it is an off-policy algorithm, in many cases its estimated value function is used to guide the selection of actions. [sent-18, score-0.215]

10 Being always greedy with respect to the value function may result in poor performance, due to the lack of exploration, and often randomization is used to guarantee proper exploration. [sent-19, score-0.208]

11 The first is optimistic Q-learning. (*School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel.) [sent-21, score-0.304]

12 We prove that optimistic Q-learning, with the right setting of initial values, converges to a near-optimal policy. [sent-31, score-0.388]

13 We show the convergence of the widely used optimistic Q-learning, thus explaining and supporting the results observed in practice. [sent-33, score-0.409]

14 Our second result is a new and novel deterministic algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. [sent-34, score-0.525]

15 We show that the frequency of sub-optimal actions vanishes, in the limit, and that the strategy defined by incremental Q-learning converges, in the limit, to the optimal policy (rather than a near-optimal policy). [sent-35, score-0.833]

16 Another view of incremental Q-learning is as a derandomization of the ε-greedy Q-learning. [sent-36, score-0.355]

17 The ε-greedy Q-learning performs a sub-optimal action every 1/ε times in expectation, while the incremental Q-learning performs a sub-optimal action every (Q(s, a(s)) − Q(s, b))/ε times. [sent-37, score-1.088]

18 A Markov Decision Process (MDP) M is a 4-tuple (S, A, P, R), where S is a set of states, A is a set of actions, P_{i,j}(a) is the transition probability from state i to state j when performing action a ∈ A in state i, and RM(s, a) is the reward received when performing action a in state s. [sent-40, score-1.364]
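
To make the notation concrete, here is a minimal sketch of such an MDP in Python (the class name, the deterministic-reward simplification, and the sampling interface are illustrative assumptions, not constructs from the paper); it is reused by the sketches that follow.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP (S, A, P, R): P[s, a, s2] is the probability of moving from
    state s to state s2 when performing action a in state s; R[s, a] is the
    (here deterministic) reward for performing action a in state s."""

    def __init__(self, P, R, gamma):
        self.P = np.asarray(P, dtype=float)   # shape (|S|, |A|, |S|), each row sums to 1
        self.R = np.asarray(R, dtype=float)   # shape (|S|, |A|)
        self.gamma = gamma                    # discount factor in (0, 1)
        self.n_states, self.n_actions, _ = self.P.shape

    def step(self, s, a, rng):
        """Sample the next state; return (next_state, reward) for action a in state s."""
        s_next = int(rng.choice(self.n_states, p=self.P[s, a]))
        return s_next, self.R[s, a]
```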

19 A strategy for an MDP assigns, at each time t, for each state s, a probability for performing action a ∈ A, given a history Ft-1 = {s1, a1, r1, ... [sent-41, score-0.604]

20 ..., st-1, at-1, rt-1}, which includes the states, actions and rewards observed until time t − 1. [sent-44, score-0.226]

21 While executing a strategy π we perform at time t action at in state st and observe a reward rt (distributed according to RM(st, at)), and the next state st+1, distributed according to P_{st, st+1}(at). [sent-45, score-0.825]

22 In this work we focus on discounted return, which has a parameter γ ∈ (0, 1), and the discounted return of policy π is V^π_M = Σ_{t=0}^∞ γ^t rt, where rt is the reward observed at time t. [sent-48, score-0.487]

23 We define a value function for each state s, under policy π, as V^π_M(s) = E[Σ_{i=0}^∞ ri γ^i], where the expectation is over a run of policy π starting at state s, and a state-action value function Q^π_M(s, a) = E[RM(s, a)] + γ Σ_{s'} P_{s,s'}(a) V^π_M(s'). [sent-52, score-0.815]

24 Let π* be an optimal policy which maximizes the return from any start state. [sent-54, score-0.437]

25 This implies that for any policy π and any state s we have V*_M(s) ≥ V^π_M(s), and π*(s) = argmax_a (E[RM(s, a)] + γ Σ_{s'} P_{s,s'}(a) V*_M(s')). [sent-55, score-0.443]

26 We say that a policy π is ε-optimal if V*_M(s) − V^π_M(s) ≤ ε for every state s. Given a trajectory, let Ts,a be the [sent-59, score-0.285]

27 set of times in which we perform action a in state s, Ts = ∪a Ts,a be the times when state s is visited, Ts,not(a) = Ts \ Ts,a be the set of times where in state s an action a' ≠ a is performed, and Tnot(s) = ∪_{s'≠s} Ts' be the set of times in which a state s' ≠ s is visited. [sent-60, score-1.386]

28 Also, [#(s, a, t)] is the number of times action a is performed in state s up to time t, i.e. [sent-61, score-0.614]

29 Finally, throughout the paper we assume that the MDP is a uni-chain (see [9]), namely that from every state we can reach any other state. [sent-63, score-0.246]

30 A learning rate αt is well-behaved if for every state-action pair (s, a): (1) Σ_{t=1}^∞ αt(s, a) = ∞ and (2) Σ_{t=1}^∞ αt^2(s, a) < ∞. [sent-67, score-0.595]

31 If the learning rate is well-behaved and every state-action pair is performed infinitely often, then Q-learning converges to Q* with probability 1 (see [1, 11, 12, 8, 6]). [sent-69, score-1.11]

32 The convergence of Q-learning holds using any exploration policy, and only requires that each state-action pair is executed infinitely often. [sent-70, score-0.959]
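
For reference, a single tabular Q-learning update with a learning rate of the form αt(s, a) = 1/[#(s, a, t)], which satisfies both conditions above, might look as follows (a minimal sketch; the function name and data layout are assumptions, not the paper's notation).

```python
def q_learning_update(Q, counts, s, a, r, s_next, gamma):
    """One tabular Q-learning step with the well-behaved learning rate 1/#(s, a, t).

    Q and counts are 2-D arrays indexed by [state][action]; counts[s][a] holds
    the number of times the pair (s, a) has been performed so far.
    """
    counts[s][a] += 1
    alpha = 1.0 / counts[s][a]            # sums to infinity, squares are summable
    target = r + gamma * max(Q[s_next])   # Vt(s') = max over actions of Qt(s', a')
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]
```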

33 The greedy policy with respect to the Q-values tries to exploit continuously; however, since it does not explore properly, it might result in poor return. [sent-71, score-0.321]

34 At the other extreme, the random policy continuously explores, but its actual return may be very poor. [sent-72, score-0.312]

35 This ε-greedy policy executes the greedy policy with probability 1 − ε and the random policy with probability ε. [sent-74, score-0.791]
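
A minimal sketch of this ε-greedy rule (illustrative only; the helper name is not from the paper):

```python
import random

def epsilon_greedy(q_row, epsilon):
    """Pick a uniformly random action with probability epsilon,
    otherwise act greedily with respect to the Q-values of the current state."""
    n_actions = len(q_row)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])
```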

36 This balance between exploration and exploitation both guarantees convergence and often yields good performance. [sent-75, score-0.181]

37 In this work we explore strategies which perform exploration but avoid randomization, i.e., deterministic strategies. [sent-77, score-0.176]

38 Optimistic Q-learning is a simple greedy algorithm with respect to the Q-values, where the initial Q-values are set to large values, larger than their optimal values. [sent-78, score-0.19]

39 We show that optimistic Q-learning converges to an ε-optimal policy if the initial Q-values are set sufficiently large. [sent-79, score-0.711]
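
A minimal sketch of optimistic Q-learning along these lines, reusing the FiniteMDP class sketched earlier: the Q-values start at a large constant, here a multiple of Vmax = Rmax/(1 − γ), an upper bound on any discounted return (Rmax denotes the largest immediate reward), and the agent then acts purely greedily. The init_scale knob and the exact initialization are illustrative assumptions; the paper only requires the initial values to be set sufficiently large.

```python
import numpy as np

def optimistic_q_learning(mdp, n_steps, init_scale=1.0, seed=0):
    """Greedy tabular Q-learning with large ('optimistic') initial Q-values."""
    rng = np.random.default_rng(seed)
    v_max = mdp.R.max() / (1.0 - mdp.gamma)            # upper bound on the discounted return
    Q = np.full((mdp.n_states, mdp.n_actions), init_scale * v_max)
    counts = np.zeros((mdp.n_states, mdp.n_actions), dtype=int)
    s = 0
    for _ in range(n_steps):
        a = int(Q[s].argmax())                         # always greedy, no randomization
        s_next, r = mdp.step(s, a, rng)
        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]                     # well-behaved learning rate
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + mdp.gamma * Q[s_next].max())
        s = s_next
    return Q
```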

40 Qt+1(s, a) = (1 − αt)Qt(s, a) + αt(rt + γVt(s')) = βT Q0(s, a) + Σ_{i=1}^{T} ηi,T rti(s, a) + γ Σ_{i=1}^{T} ηi,T Vti(si), where βT and ηi,T are the coefficients obtained by unrolling the update, T = [#(s, a, t)], and si is the next state reached at time ti when action a is performed for the i-th time in state s. [sent-85, score-0.849]

41 First we show that as long as [#(s, a, t)] ≤ T, i.e., action a has been performed at most T times in state s, we have Qt(s, a) ≥ Vmax. Later we will use this to show that action a is performed at least T times in state s. [sent-86, score-0.961]

42 In optimistic Q-learning, for any state s, action a and time t such that [#(s, a, t)] ≤ T, we have Qt(s, a) ≥ Vmax ≥ Q*(s, a). [sent-88, score-0.799]

43 A time T(ε, δ) is an initialization time if Pr[∀t ≥ T : Zt ≤ ε] ≥ 1 − δ. [sent-99, score-0.204]

44 The following lemma bounds the initialization time as a function of the parameter w of the learning rate. [sent-100, score-0.356]

45 The following lemma bounds the difference between Q* and Qt. [sent-105, score-0.189]

46 Consider optimistic Q-learning and let T = T(ε, δ) be the initialization time. [sent-107, score-0.421]

47 Consider optimistic Q-learning and let T = T((1 − γ)ε, δ/(|S||A|)) be the initialization time. [sent-118, score-0.421]

48 With probability at least 1 − δ, for any state s and time t, we have V*(s) − Vt(s) ≤ ε. [sent-119, score-0.295]

49 By the lemma above bounding the difference between Q* and Qt, with probability 1 − δ, for every state s, action a and time t we have Q*(s, a) − Qt(s, a) ≤ (1 − γ)ε. [sent-121, score-0.565]

50 We show by induction on t that V*(s) − Vt(s) ≤ ε for every state s. [sent-122, score-0.242]

51 Let (s, a) be the state-action pair executed at time t + 1. [sent-125, score-0.645]

52 Otherwise, let a* be the optimal action at state s; then V*(s) − Vt+1(s) ≤ Q*(s, a*) − Qt+1(s, a*), which we bound using the unrolled form of Qt+1(s, a*). [sent-129, score-0.53]

53 This gives V*(s) − Vt+1(s) ≤ (1 − γ)ε + γ Σ_{i=1}^{T} ηi,T (V*(si) − Vti(si)), where T = [#(s, a, t)], ti is the time at which action a is performed in state s for the i-th time, and si is the next state. [sent-131, score-0.849]

54 Since ti ≤ t, by the inductive hypothesis we have that V*(si) − Vti(si) ≤ ε, and therefore V*(s) − Vt+1(s) ≤ (1 − γ)ε + γε = ε. [sent-132, score-0.177]

55 Consider optimistic Q-learning and let T = T((1 − γ)ε, δ/(|S||A|)) be the initialization time. [sent-137, score-0.421]

56 With probability at least 1 − δ, any state-action pair (s, a) that is executed infinitely often is ε-optimal, i.e. [sent-138, score-0.911]

57 Proof: Given a trajectory, let U' be the set of state-action pairs that are executed infinitely often, and let M' be the original MDP M restricted to U'. [sent-141, score-0.865]

58 For M' we can use the classical convergence proofs, and claim that Vt(s) converges to VM'(s) and Qt(s, a), for (s, a) ∈ U', converges to QM'(s, a), both with probability 1. [sent-142, score-0.523]

59 Since (s, a) ∈ U' is performed infinitely often, it implies that Qt(s, a) converges to Vt(s) = VM'(s), and therefore QM'(s, a) = VM'(s). [sent-143, score-0.645]

60 ... Q*(s, a)}, then optimistic Q-learning converges to the optimal policy. [sent-152, score-0.497]

61 For any constant Δ, with probability at least 1 − δ there is a time TΔ > T such that at any time t > TΔ the strategy defined by the optimistic Q-learning is (ε + Δ)/(1 − γ)-optimal. [sent-155, score-0.547]

62 Consider optimistic Q-learning and let T be the initialization time. In this section we describe a new algorithm that we call incremental Q-learning. [sent-157, score-0.624]

63 Incremental Q-learning is a greedy policy with respect to the estimated Q-values plus a promotion term. [sent-159, score-0.53]

64 The promotion term of a state-action pair (s, a) is promoted each time the action a is not executed in state s, and zeroed each time action a is executed. [sent-160, score-1.171]

65 We show that in incremental Q-learning every state-action pair is taken infinitely often, which implies standard convergence of the estimates. [sent-161, score-0.735]

66 We show that the fraction of time in which sub-optimal actions are executed vanishes in the limit. [sent-162, score-0.312]

67 This implies that the strategy defined by incremental Q-learning converges, in the limit, to the optimal policy. [sent-163, score-0.457]

68 Incremental Q-learning estimates the Q-function as in Q-learning: Qt+1(s, a) = (1 − αt(s, a))Qt(s, a) + αt(s, a)(rt(s, a) + γVt(s')), where s' is the next state reached when performing action a in state s at time t. [sent-164, score-0.71]

69 The promotion term At is defined as follows: At+1(s, a) = 0 if t ∈ Ts,a; At+1(s, a) = At(s, a) + ψ([#(s, a, t)]) if t ∈ Ts,not(a); At+1(s, a) = At(s, a) if t ∈ Tnot(s), where ψ(i) is a promotion function which in our case depends only on the number of times we performed (s, a'), a' ≠ a. [sent-165, score-0.595]

70 We say that a promotion function ψ is well-behaved if: (1) the function ψ converges to zero, i.e. [sent-167, score-0.342]

71 For example, ψ(i) = 1/i is a well behaved promotion function. [sent-169, score-0.355]

72 Incremental Q-learning is a greedy policy with respect to St(s, a) = Qt(s, a) + At(s, a). [sent-170, score-0.321]
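
Putting the pieces together, here is a minimal sketch of incremental Q-learning as described above: the agent is greedy with respect to St(s, a) = Qt(s, a) + At(s, a), the promotion term of the executed pair is zeroed, the promotion terms of the other actions of the visited state are increased by ψ([#(s, a, t)]), and all remaining terms are untouched. The function name, the starting state, and the learning-rate choice are illustrative assumptions; FiniteMDP is the class sketched earlier.

```python
import numpy as np

def incremental_q_learning(mdp, n_steps, psi, seed=0):
    """Deterministic incremental Q-learning: greedy with respect to Qt + At.

    psi(k) is the promotion function, evaluated at k = #(s, a, t), the number
    of times the promoted pair (s, a) has been performed so far.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((mdp.n_states, mdp.n_actions))
    A = np.zeros((mdp.n_states, mdp.n_actions))        # promotion terms At(s, a)
    counts = np.zeros((mdp.n_states, mdp.n_actions), dtype=int)
    s = 0
    for _ in range(n_steps):
        a = int((Q[s] + A[s]).argmax())                # greedy w.r.t. St = Qt + At
        s_next, r = mdp.step(s, a, rng)
        counts[s, a] += 1
        alpha = 1.0 / counts[s, a]
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + mdp.gamma * Q[s_next].max())
        # Promotion update: zero the executed pair, promote the untaken actions
        # of the visited state; promotion terms of other states are unchanged.
        for b in range(mdp.n_actions):
            A[s, b] = 0.0 if b == a else A[s, b] + psi(counts[s, b])
        s = s_next
    return Q, A
```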

73 First we show that Qt, in incremental Q-learning, converges to Q*. [sent-171, score-0.416]

74 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. [sent-173, score-0.556]

75 Proof: Since the learning rate is well-behaved, we need only show that each state-action pair is performed infinitely often. [sent-175, score-0.868]

76 We show that for each state that is visited infinitely often, all of its actions are performed infinitely often. [sent-176, score-0.909]

77 Since the MDP is uni-chain this will imply that with probability 1 we reach all states infinitely often, which completes the proof. [sent-177, score-0.328]

78 Since s is visited infinitely often, there has to be a non-empty subset A' of the actions which are performed infinitely often in s. [sent-179, score-0.77]

79 Let t1 be the last time that an action not in A' is performed in state s. [sent-182, score-0.627]

80 Since ψ is well behaved, ψ([#(s, a, t1)]) is a fixed constant after t1, which implies that At(s, a) diverges for a ∉ A'. [sent-183, score-0.288]

81 Therefore, eventually we reach a time t2 > t1 such that At2(s, a) > Vmax for every a ∉ A'. [sent-185, score-0.184]

82 Since the actions in A' are performed infinitely often, there is a time t3 > t2 such that each action a' ∈ A' is performed at least once in [t2, t3]. [sent-187, score-0.928]

83 Therefore, some action a ∈ A \ A' will be performed after t1, contradicting our assumption. [sent-190, score-0.333]

84 The following lemma shows that the frequency of sub-optimal actions vanishes. [sent-194, score-0.34]

85 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. [sent-196, score-0.848]

86 Let a* be an optimal action in state s and a be a sub-optimal action. [sent-201, score-0.493]

87 The learning rate of both incremental and ε-greedy Q-learning is set to 0. [sent-224, score-0.192]

88 Qt(s, a*) converges to Q*(s, a*) = V*(s) and Qt(s, a) converges to Q*(s, a). [sent-227, score-0.266]

89 Since the promotion function is well behaved, the number of time steps required until At(s, a) changes from 0 to h increases after each time we perform (s,a). [sent-229, score-0.333]

90 Consider incremental Q-learning with learning rate αt(s, a) = 1/[#(s, a, t)] and ψ(k) = 1/e^k. [sent-233, score-0.347]

91 The number of times (s, a) is performed in the first n visits to state s is Θ(ln(n/(V*(s) − Q*(s, a)))) for sufficiently large n. [sent-235, score-0.332]

92 Furthermore, the return obtained by incremental Q-learning converges to the optimal return. [sent-236, score-0.547]
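
For instance, the specific schedules named in this lemma could be plugged into the incremental sketch above as follows; the learning rate 1/[#(s, a, t)] is already what that sketch uses, so only the promotion function needs to be supplied (the usage line is hypothetical).

```python
import math

def psi(k):
    """Promotion function psi(k) = 1/e^k from the lemma above."""
    return math.exp(-k)

# Hypothetical usage with the earlier sketches:
# Q, A = incremental_q_learning(mdp, n_steps=100_000, psi=psi)
```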

93 Consider incremental Q-learning using a well-behaved learning rate and a well-behaved promotion function. [sent-238, score-0.848]

94 The strategy π defined by incremental Q-learning is ε-optimal with probability 1. [sent-240, score-0.337]

95 Incremental Q-learning can be viewed as a derandomization of εt-greedy Q-learning, where the promotion function satisfies ψt = εt. The experiment was made on an MDP which includes 50 states and two actions per state. [sent-243, score-0.444]

96 The immediate reward of each state-action pair is randomly chosen in the interval [0, 10]. [sent-244, score-0.529]

97 For each state and action (s, a) the next-state transition is random, i.e. [sent-245, score-0.611]

98 for every state s' we have a random variable X^{s,a}_{s'} ∈ [0, 1], and the transition probability P_{s,s'}(a) is proportional to X^{s,a}_{s'}. [sent-247, score-0.217]

99 For the εt-greedy Q-learning we have εt = 10000/t at time t, while for the incremental we have ψt = 10000/t. [sent-249, score-0.345]
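
A sketch of this experimental setup, under stated assumptions: the discount factor is not given in the summary, the transition rows are built by normalizing the random variables X (consistent with the description above, but still an assumption), and FiniteMDP, epsilon_greedy and incremental_q_learning are the hypothetical helpers sketched earlier.

```python
import numpy as np

def random_experiment_mdp(n_states=50, n_actions=2, gamma=0.9, seed=0):
    """Random MDP as in the experiment: immediate rewards uniform in [0, 10],
    transition rows obtained by normalizing random weights in [0, 1]."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 10.0, size=(n_states, n_actions))
    X = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
    P = X / X.sum(axis=2, keepdims=True)
    return FiniteMDP(P, R, gamma)   # FiniteMDP: the class sketched earlier

# Schedules used in the experiment: epsilon_t = 10000/t for epsilon-greedy
# Q-learning and psi_t = 10000/t for incremental Q-learning.  Note that here
# psi depends on the global time t rather than on the pair count, unlike the
# definition in the algorithm section; the clamp to 1 for small t is an
# added assumption, since 10000/t exceeds 1 early on.
def epsilon_t(t):
    return min(1.0, 10000.0 / max(t, 1))

def psi_t(t):
    return 10000.0 / max(t, 1)
```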

100 We plan to further investigate the theoretical connection between ε-greedy, Boltzmann machine and incremental Q-learning. [sent-253, score-0.283]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qt', 0.407), ('optimistic', 0.304), ('incremental', 0.283), ('action', 0.255), ('infinitely', 0.234), ('vmax', 0.216), ('policy', 0.215), ('promotion', 0.209), ('lemma', 0.189), ('state', 0.178), ('behaved', 0.146), ('converges', 0.133), ('actions', 0.129), ('vm', 0.125), ('mdp', 0.12), ('si', 0.117), ('vt', 0.111), ('qlearning', 0.096), ('sl', 0.091), ('executed', 0.091), ('rm', 0.084), ('greedy', 0.084), ('initialization', 0.08), ('performed', 0.078), ('vii', 0.073), ('exploration', 0.072), ('derandomization', 0.072), ('reinforcement', 0.071), ('return', 0.071), ('convergence', 0.07), ('corollary', 0.07), ('ai', 0.069), ('randomization', 0.063), ('time', 0.062), ('optimal', 0.06), ('pair', 0.059), ('qm', 0.057), ('visited', 0.056), ('tl', 0.054), ('lsi', 0.053), ('discounted', 0.051), ('rt', 0.051), ('implies', 0.05), ('boltzman', 0.048), ('epsilon', 0.048), ('initializes', 0.048), ('promotes', 0.048), ('tnot', 0.048), ('zt', 0.048), ('claim', 0.045), ('viii', 0.042), ('deterministic', 0.041), ('times', 0.041), ('strategy', 0.041), ('often', 0.039), ('every', 0.039), ('rate', 0.039), ('diverges', 0.038), ('reward', 0.037), ('performing', 0.037), ('let', 0.037), ('proof', 0.036), ('ti', 0.036), ('widely', 0.035), ('rewards', 0.035), ('sufficiently', 0.035), ('states', 0.034), ('trajectory', 0.033), ('limit', 0.032), ('oo', 0.032), ('probability', 0.031), ('ts', 0.03), ('inductive', 0.03), ('vanishes', 0.03), ('st', 0.029), ('reach', 0.029), ('define', 0.029), ('al', 0.029), ('sub', 0.028), ('ta', 0.028), ('continuously', 0.026), ('learning', 0.025), ('follows', 0.025), ('induction', 0.025), ('least', 0.024), ('proofs', 0.024), ('gradually', 0.024), ('initial', 0.024), ('defined', 0.023), ('observe', 0.023), ('stochastic', 0.023), ('frequency', 0.022), ('respect', 0.022), ('lii', 0.021), ('brafman', 0.021), ('contradiction', 0.021), ('extremes', 0.021), ('je', 0.021), ('leaning', 0.021), ('sts', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning.

2 0.20882966 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

3 0.20782547 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

4 0.1920464 40 nips-2001-Batch Value Function Approximation via Support Vectors

Author: Thomas G. Dietterich, Xin Wang

Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1

5 0.18711218 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

6 0.18662383 59 nips-2001-Direct value-approximation for factored MDPs

7 0.16778156 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

8 0.16032638 128 nips-2001-Multiagent Planning with Factored MDPs

9 0.14146817 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

10 0.14052999 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

11 0.11400812 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

12 0.11078092 36 nips-2001-Approximate Dynamic Programming via Linear Programming

13 0.10221072 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

14 0.10077156 148 nips-2001-Predictive Representations of State

15 0.10041298 126 nips-2001-Motivated Reinforcement Learning

16 0.089173086 8 nips-2001-A General Greedy Approximation Algorithm with Applications

17 0.081627823 35 nips-2001-Analysis of Sparse Bayesian Learning

18 0.073137052 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

19 0.072648786 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

20 0.07073994 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.194), (1, -0.11), (2, 0.402), (3, 0.026), (4, 0.047), (5, 0.079), (6, 0.024), (7, 0.008), (8, -0.001), (9, -0.017), (10, 0.027), (11, 0.029), (12, 0.007), (13, 0.034), (14, -0.008), (15, -0.037), (16, 0.047), (17, 0.001), (18, 0.024), (19, -0.057), (20, 0.026), (21, -0.025), (22, 0.011), (23, -0.063), (24, 0.052), (25, -0.007), (26, 0.055), (27, -0.006), (28, 0.012), (29, -0.037), (30, 0.088), (31, 0.051), (32, 0.109), (33, -0.031), (34, 0.024), (35, 0.036), (36, 0.044), (37, 0.03), (38, 0.019), (39, -0.04), (40, 0.068), (41, 0.064), (42, -0.085), (43, -0.027), (44, 0.056), (45, -0.004), (46, -0.115), (47, -0.027), (48, 0.02), (49, -0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97005314 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning.

2 0.88546807 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

Author: Xin Wang, Thomas G. Dietterich

Abstract: We address the problem of non-convergence of online reinforcement learning algorithms (e.g., Q learning and SARSA(A)) by adopting an incremental-batch approach that separates the exploration process from the function fitting process. Our BFBP (Batch Fit to Best Paths) algorithm alternates between an exploration phase (during which trajectories are generated to try to find fragments of the optimal policy) and a function fitting phase (during which a function approximator is fit to the best known paths from start states to terminal states). An advantage of this approach is that batch value-function fitting is a global process, which allows it to address the tradeoffs in function approximation that cannot be handled by local, online algorithms. This approach was pioneered by Boyan and Moore with their GROWSUPPORT and ROUT algorithms. We show how to improve upon their work by applying a better exploration process and by enriching the function fitting procedure to incorporate Bellman error and advantage error measures into the objective function. The results show improved performance on several benchmark problems. 1

3 0.82234639 40 nips-2001-Batch Value Function Approximation via Support Vectors

Author: Thomas G. Dietterich, Xin Wang

Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1

4 0.74767476 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

5 0.73584867 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

6 0.71507913 59 nips-2001-Direct value-approximation for factored MDPs

7 0.65061498 13 nips-2001-A Natural Policy Gradient

8 0.64755625 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

9 0.63060677 126 nips-2001-Motivated Reinforcement Learning

10 0.62677407 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.62157255 36 nips-2001-Approximate Dynamic Programming via Linear Programming

12 0.61631244 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

13 0.61038488 177 nips-2001-Switch Packet Arbitration via Queue-Learning

14 0.59170544 148 nips-2001-Predictive Representations of State

15 0.52263576 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

16 0.46446648 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

17 0.41959703 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

18 0.41523597 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

19 0.38869148 35 nips-2001-Analysis of Sparse Bayesian Learning

20 0.37221262 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.037), (17, 0.1), (19, 0.026), (27, 0.105), (30, 0.025), (36, 0.017), (38, 0.017), (59, 0.054), (72, 0.054), (79, 0.04), (83, 0.027), (91, 0.163), (95, 0.246)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83511752 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of the ε-greedy Q-learning.

2 0.65427053 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

3 0.65222466 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

Author: Andrew Y. Ng, Michael I. Jordan, Yair Weiss

Abstract: Despite many empirical successes of spectral clustering methodsalgorithms that cluster points using eigenvectors of matrices derived from the data- there are several unresolved issues. First, there are a wide variety of algorithms that use the eigenvectors in slightly different ways. Second, many of these algorithms have no proof that they will actually compute a reasonable clustering. In this paper, we present a simple spectral clustering algorithm that can be implemented using a few lines of Matlab. Using tools from matrix perturbation theory, we analyze the algorithm, and give conditions under which it can be expected to do well. We also show surprisingly good experimental results on a number of challenging clustering problems. 1

4 0.64846456 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.64147049 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

Author: Marzia Polito, Pietro Perona

Abstract: Locally Linear Embedding (LLE) is an elegant nonlinear dimensionality-reduction technique recently introduced by Roweis and Saul [2]. It fails when the data is divided into separate groups. We study a variant of LLE that can simultaneously group the data and calculate local embedding of each group. An estimate for the upper bound on the intrinsic dimension of the data set is obtained automatically. 1

6 0.63970721 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

7 0.63847196 59 nips-2001-Direct value-approximation for factored MDPs

8 0.63839912 121 nips-2001-Model-Free Least-Squares Policy Iteration

9 0.6379565 89 nips-2001-Grouping with Bias

10 0.63579714 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

11 0.63392866 36 nips-2001-Approximate Dynamic Programming via Linear Programming

12 0.63045633 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

13 0.63033152 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

14 0.62926221 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

15 0.62724513 24 nips-2001-Active Information Retrieval

16 0.62714922 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources

17 0.62608182 44 nips-2001-Blind Source Separation via Multinode Sparse Representation

18 0.62488574 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM

19 0.62308073 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

20 0.62227958 143 nips-2001-PAC Generalization Bounds for Co-training