nips nips2000 nips2000-112 knowledge-graph by maker-knowledge-mining

112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region


Source: pdf

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many algorithms for approximate reinforcement learning are not known to converge. [sent-4, score-0.087]

2 In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. [sent-5, score-0.191]

3 This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. [sent-6, score-0.17]

4 In fact, there are known counterexamples to many proposed algorithms. [sent-9, score-0.048]

5 Given the similarities between SARSA(0) and Q-learning, and between V(0) and value iteration, one might suppose that their convergence properties would be identical. [sent-11, score-0.081]

6 That is not the case: while Q-learning can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based SARSA(0) converge with probability 1 to a fixed region. [sent-12, score-0.408]

7 Similarly, while value iteration can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based V(0) converge with probability 1 to a fixed region. [sent-13, score-0.439]

8 The question of the convergence behavior of SARSA(λ) is one of the four open theoretical questions of reinforcement learning that Sutton [5] identifies as "particularly important, pressing, or opportune." [sent-14, score-0.135]

9 " This paper covers SARSA(O), and together lIn a ''trajectory-based'' algorithm, the exploration policy may not change within a single episode of learning. [sent-15, score-0.339]

10 The policy may change between episodes, and the value function may change within a single episode. [sent-16, score-0.317]

11 (Episodes end when the agent enters a terminal state. [sent-17, score-0.23]

12 This paper considers only episodic tasks, but since any discounted task can be transformed into an equivalent episodic task, the algorithms apply to non-episodic tasks as well. [sent-18, score-0.111]

13 ) Together with an earlier paper [4], this paper describes its convergence behavior: it is stable in the sense that there exist bounded regions which, with probability 1, it eventually enters and never leaves, but for some Markov decision processes it may not converge to a single point. [sent-19, score-0.338]

14 Unfortunately the bound given here is not of much use as a practical guarantee: it is loose enough that it provides little reason to believe that SARSA(0) and V(0) produce useful approximations to the state- and action-value functions. [sent-21, score-0.077]

15 Second, such a bound is often the first step towards proving stronger results. [sent-24, score-0.033]

16 Finally, in practice it often happens that after some initial exploration period, only a few different policies are ever greedy; if this is the case, the strategy of this paper could be used to prove much tighter bounds. [sent-25, score-0.197]

17 The V(0) algorithm was popularized by its use in the TD-Gammon backgammon-playing program [8]. [sent-28, score-0.106]

18 2 Fix a Markov decision process M, with a finite set S of states, a finite set A of actions, a terminal state T, an initial distribution S_0 over S, a one-step reward function r : S × A → R, and a transition function δ : S × A → S ∪ {T}. [sent-29, score-0.37]

19 (M may also have a discount factor γ specifying how to trade future rewards against present ones. [sent-30, score-0.078]

20 ) Both the transition and reward functions may be stochastic, so long as successive samples are independent (the Markov property) and the reward has bounded expectation and variance. [sent-32, score-0.373]

21 We assume that all states in S are reachable with positive probability. [sent-33, score-0.128]

22 We define a policy π to be a function mapping states to probability distributions over actions. [sent-34, score-0.38]

23 Given a policy we can sample a trajectory (a sequence of states, actions, and one-step rewards) by the following rule: begin by selecting a state s_0 according to S_0. [sent-35, score-0.483]

24 Now choose an action a_0 according to π(s_0). Now choose a one-step reward r_0 according to r(s_0, a_0). [sent-36, score-0.306]

25 Finally choose a new state s_1 according to δ(s_0, a_0). [sent-37, score-0.103]
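
As a concrete illustration of this sampling rule, here is a minimal Python sketch; the toy MDP, its transition and reward functions, and the uniform policy below are hypothetical stand-ins for illustration, not objects from the paper:

```python
import random

# Hypothetical toy MDP: states 0 and 1, actions "a" and "b", terminal marker "T".
TERMINAL = "T"
S0 = {0: 0.5, 1: 0.5}          # initial distribution over states
ACTIONS = ["a", "b"]

def transition(s, a):
    """Stochastic next state delta(s, a); the terminal state is reached with positive probability."""
    return random.choices([0, 1, TERMINAL], weights=[0.4, 0.4, 0.2])[0]

def reward(s, a):
    """Stochastic one-step reward r(s, a) with bounded expectation and variance."""
    return random.gauss(1.0 if a == "a" else 0.0, 0.1)

def sample_trajectory(policy):
    """Follow the rule in the text: s_0 ~ S_0, a_t ~ pi(s_t), r_t ~ r(s_t, a_t), s_{t+1} ~ delta(s_t, a_t)."""
    s = random.choices(list(S0), weights=list(S0.values()))[0]
    trajectory = []
    while s != TERMINAL:
        a = random.choices(ACTIONS, weights=policy[s])[0]   # a_t ~ pi(s_t)
        r = reward(s, a)                                     # r_t ~ r(s_t, a_t)
        s_next = transition(s, a)                            # s_{t+1} ~ delta(s_t, a_t)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

uniform_policy = {0: [0.5, 0.5], 1: [0.5, 0.5]}
print(sample_trajectory(uniform_policy))
```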

26 We assume that all policies are proper, that is, that the agent reaches T with probability 1 no matter what policy it follows. [sent-39, score-0.56]

27 ) The reward for a trajectory is the sum of all of its one-step rewards. [sent-41, score-0.247]

28 Our goal is to find an optimal policy, that is, a policy which on average generates trajectories with the highest possible reward. [sent-42, score-0.257]

29 Define Q*(s, a) to be the best total expected reward that we can achieve by starting in state s, performing action a, and acting optimally afterwards. [sent-43, score-0.351]

30 Knowledge of either Q* or the combination of V*, δ, and r is enough to determine an optimal policy. [sent-45, score-0.077]

31 We will write Q(s, a) for s ∈ S and a ∈ A to refer to this approximation. [sent-47, score-0.116]

32 For convenience of notation, we will write Q(T, a) = 0 for all a ∈ A, and tack an arbitrary action onto the end of all trajectories (which would otherwise end with the terminal state). [sent-49, score-0.379]

33 ²The proof given here does not cover the TD-Gammon program, since TD-Gammon uses a nonlinear function approximator to represent its value function. [sent-50, score-0.126]

34 Interestingly, though, the proof extends easily to cover games such as backgammon in addition to MDPs. [sent-51, score-0.16]

35 It also extends to cover SARSA(λ) and V(λ) for λ > 0. [sent-52, score-0.039]

36 We also make the standard assumption that the sequence of learning rates is fixed before the start of learning and satisfies Σ_t α_t = ∞ and Σ_t α_t² < ∞. [sent-54, score-0.085]
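
A commonly used schedule satisfying both conditions is α_t = 1/(t + 1); the quick numerical check below is only an illustration of the two sums:

```python
# alpha_t = 1 / (t + 1): the sum diverges (like log T) while the sum of squares converges.
T = 10**6
print(sum(1.0 / (t + 1) for t in range(T)))        # ~14.39 for T = 10^6; grows without bound
print(sum(1.0 / (t + 1) ** 2 for t in range(T)))   # approaches pi^2 / 6 ~= 1.645
```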

37 At the beginning of each trajectory, it selects the ε-greedy policy for its current Q function. [sent-57, score-0.288]

38 From state s, the ε-greedy policy chooses the action argmax_a Q(s, a) with probability 1 − ε, and otherwise selects uniformly at random among all actions. [sent-58, score-0.518]
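
A minimal sketch of this selection rule; the dictionary representation of Q and the variable names are just for illustration:

```python
import random

def epsilon_greedy_action(Q, s, actions, epsilon):
    """With probability 1 - epsilon take argmax_a Q(s, a); otherwise pick uniformly at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Example: Q stored as a dict over state-action pairs.
Q = {(0, "a"): 1.2, (0, "b"): 0.7}
print(epsilon_greedy_action(Q, 0, ["a", "b"], epsilon=0.1))
```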

39 This rule ensures that, no matter the sequence of learned Q functions, each state-action pair will be visited infinitely often. [sent-59, score-0.138]

40 We just need to be able to find a region that contains all of the approximate value functions for every policy considered, and a bound on the convergence rate of TD(0). [sent-61, score-0.391]

41 ) We can compare the SARSA(0) update rule to the one for Q-learning: Q(s, a) ← r + max_b Q(s', b). Often a' in the SARSA(0) update rule will be the same as the maximizing b in the Q-learning update rule; the difference only appears when the agent takes an exploring action, i.e. [sent-62, score-0.446]

42 , one which is not greedy for the current Q function. [sent-64, score-0.058]
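
To make the comparison concrete, here is one common incremental form of the two updates with a linear approximator Q(s, a) = wᵀφ(s, a); the explicit step-size form, the feature map phi, and the random features are assumptions for illustration rather than the paper's exact notation (terminal handling, Q(T, ·) = 0, is omitted for brevity):

```python
import numpy as np

def phi(s, a, dim=4):
    """Hypothetical fixed feature vector for the state-action pair (s, a)."""
    rng = np.random.default_rng(hash((s, a)) % (2**32))   # crude deterministic feature table
    return rng.standard_normal(dim)

def sarsa0_update(w, s, a, r, s_next, a_next, alpha):
    """SARSA(0): move Q(s, a) toward r + Q(s', a'), where a' is the action actually taken."""
    td_error = r + w @ phi(s_next, a_next) - w @ phi(s, a)
    return w + alpha * td_error * phi(s, a)

def q_learning_update(w, s, a, r, s_next, actions, alpha):
    """Q-learning: move Q(s, a) toward r + max_b Q(s', b)."""
    td_error = r + max(w @ phi(s_next, b) for b in actions) - w @ phi(s, a)
    return w + alpha * td_error * phi(s, a)
```

The two returned weight vectors coincide whenever the action a' actually taken in s' is also the maximizing action b, which is exactly the case described in the text.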

43 The V(0) algorithm maintains an approximation to V*, which we will write V(s) for all s ∈ S. [sent-65, score-0.149]

44 Again, we will assume V is a full-rank linear function of parameters w, and V(T) is held fixed at 0. [sent-66, score-0.054]

45 After seeing a trajectory fragment s, a, r, s', V(0) sets V(s) ← r + V(s'). This update ignores a. [sent-67, score-0.334]
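
A sketch of this update for a linear V(s) = wᵀx(s) with V(T) held at 0; the explicit step-size form, the one-hot feature map, and the state names below are placeholders for illustration:

```python
import numpy as np

def v0_update(w, x, s, r, s_next, alpha, terminal):
    """V(0): move V(s) toward r + V(s'); the action taken is ignored."""
    v_next = 0.0 if s_next == terminal else w @ x(s_next)   # V(T) is held fixed at 0
    td_error = r + v_next - w @ x(s)
    return w + alpha * td_error * x(s)

# Usage with a hypothetical one-hot feature map over states {0, 1}.
x = lambda s: np.eye(2)[s]
w = np.zeros(2)
w = v0_update(w, x, s=0, r=1.0, s_next="T", alpha=0.1, terminal="T")
print(w)
```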

46 Often a is chosen according to a greedy or ε-greedy policy for a recent V. [sent-68, score-0.315]

47 However, for our analysis we only need to assume that we consider finitely many policies and that the policy remains fixed during each trajectory. [sent-69, score-0.497]

48 We leave open the question of whether updates to w happen immediately after each transition or only at the end of each trajectory. [sent-70, score-0.242]

49 As pointed out in [9], this difference will not affect convergence: the updates within a single trajectory are O(α), so they cause a change in Q(s, a) or V(s) of O(α), which means subsequent updates are affected by at most O(α²). [sent-71, score-0.332]

50 (If we were to change policies during the trajectory, this argument would no longer hold, since small changes in Q or V can cause large changes in the policy. [sent-73, score-0.243]

51 ) 3 The result. Our result is that the weights w in either SARSA(0) or V(0) converge with probability 1 to a fixed region. [sent-74, score-0.195]

52 The proof of the result is based on the following intuition: while SARSA(0) and V(0) might consider many different policies over time, on any given trajectory they always follow the TD(0) update rule for some policy. [sent-75, score-0.489]

53 Crucially, under general conditions, all of these fixed points are within some bounded region. [sent-77, score-0.14]

54 So, we can view the SARSA(0) and V(0) update rules as contraction mappings plus a bounded amount of "slop." [sent-78, score-0.239]

55 " With this observation, standard convergence theorems show that the weight vectors generated by SARSA(O) and V(O) cannot diverge. [sent-79, score-0.048]

56 Theorem 1 For any Markov decision process M satisfying our assumptions, there is a bounded region R such that the SARSA(0) algorithm, when acting on M, produces a series of weight vectors which with probability 1 converges to R. [sent-80, score-0.357]

57 Similarly, there is another bounded region R' such that the V(0) algorithm acting on M produces a series of weight vectors converging with probability 1 to R'. [sent-81, score-0.337]

58 PROOF: Lemma 2, below, shows that both the SARSA(0) and V(0) updates can be written in the form w_{t+1} = w_t − α_t (A_t w_t − r_t + ε_t), where A_t is positive definite, α_t is the current learning rate, E(ε_t) = 0, Var(ε_t) ≤ K(1 + ||w_t||²), and A_t and r_t depend only on the currently greedy policy. [sent-82, score-0.31]

59 (A_t and r_t represent, in a manner described in the lemma, the transition probabilities and one-step costs which result from following the current policy. [sent-83, score-0.142]

60 Of course, w_t, A_t, and r_t will be different depending on whether we are following SARSA(0) or V(0). [sent-84, score-0.065]

61 ) Since A_t is positive definite, the SARSA(0) and V(0) updates are 2-norm contractions for small enough α_t. [sent-85, score-0.199]
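
A quick numerical illustration of this point (the matrix A below is made up: positive definite in the quadratic-form sense but not symmetric): for small enough α, the noise-free map w ↦ w − α(Aw − r) has 2-norm contraction factor below 1 and converges to A⁻¹r.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
S = M.T @ M + np.eye(3)                 # symmetric positive definite part
N = rng.standard_normal((3, 3))
A = S + (N - N.T) / 2                   # positive definite but not symmetric

alpha = 0.01
contraction_factor = np.linalg.norm(np.eye(3) - alpha * A, ord=2)
print(contraction_factor)               # < 1 for small enough alpha

# The noise-free update w <- w - alpha * (A w - r) therefore converges to A^{-1} r.
r = rng.standard_normal(3)
w = np.zeros(3)
for _ in range(5000):
    w = w - alpha * (A @ w - r)
print(np.allclose(w, np.linalg.solve(A, r), atol=1e-6))
```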

62 So, if we kept the policy fixed rather than changing it at the beginning of each trajectory, standard results such as Lemma 1 below would guarantee convergence. [sent-86, score-0.311]

63 The intuition is that we can define a nonnegative potential function J(w) and show that, on average, the updates tend to decrease J(w) as long as α_t is small enough and J(w) starts out large enough compared to α_t. [sent-87, score-0.283]

64 To apply Lemma 1 under the assumption that we keep the policy constant rather than changing it every trajectory, write A_t = A and r_t = r for all t, and write w_π = A⁻¹ r. [sent-88, score-0.633]

65 Let ρ be the smallest eigenvalue of A (which must be real and positive since A is positive definite). [sent-89, score-0.098]

66 Write s_t = A w_t − r + ε_t for the update direction at step t. [sent-90, score-0.148]

67 Then if we take J(w) = ||w − w_π||², we have E(∇J(w_t)ᵀ s_t | w_t) = 2(w_t − w_π)ᵀ(A w_t − r + E(ε_t)) = 2(w_t − w_π)ᵀ(A w_t − A w_π) ≥ 2ρ ||w_t − w_π||² = 2ρ J(w_t), so that −s_t is a descent direction in the sense required by the lemma. [sent-91, score-0.113]

68 It is easy to check the lemma's variance condition. [sent-92, score-0.037]

69 So, Lemma 1 shows that J(w_t) converges with probability 1 to 0, which means w_t must converge with probability 1 to w_π. [sent-93, score-0.231]

70 If we pick an arbitrary vector u and define H(w) = max(0, ||w − u|| − C)² for a sufficiently large constant C, then the same argument reaches the weaker conclusion that w_t must converge with probability 1 to a sphere of radius C centered at u. [sent-94, score-0.427]

71 To see why, note that −s_t is also a descent direction for H(w): inside the sphere, H = 0 and ∇H = 0, so the descent condition is satisfied trivially. [sent-95, score-0.217]

72 Outside the sphere, ∇H(w) = d(w)(w − u) with d(w) = 2(||w − u|| − C)/||w − u||, so ∇H(w_t)ᵀ E(s_t | w_t) = d(w_t)(w_t − u)ᵀ E(s_t | w_t) = d(w_t)(w_t − w_π + w_π − u)ᵀ A(w_t − w_π) ≥ d(w_t)(ρ ||w_t − w_π||² − ||w_π − u|| ||A|| ||w_t − w_π||). The positive term will be larger than the negative one if ||w_t − w_π|| is large enough. [sent-96, score-0.049]

73 So, if we choose C large enough, the descent condition will be satisfied. [sent-97, score-0.103]

74 So, Lemma 1 shows that H(w_t) converges with probability 1 to 0, which means that w_t must converge with probability 1 to the sphere of radius C centered at u. [sent-100, score-0.325]

75 But now we are done: since there are finitely many policies that SARSA(0) or V(0) can consider, we can pick any u and then choose a C large enough that the above argument holds for all policies simultaneously. [sent-101, score-0.536]

76 With this choice of C, the update for any policy decreases H(w_t) on average as long as α_t is small enough, so the update for SARSA(0) or V(0) does too, and Lemma 1 applies. [sent-102, score-0.495]

77 Lemma 1 Let J be a differentiable function, bounded below by J*, and let ∇J be Lipschitz continuous. [sent-106, score-0.086]

78 Suppose −s_t is a descent direction for J in the sense that E(s_t | w_t)ᵀ ∇J(w_t) ≥ δ(ε) > 0 whenever J(w_t) > J* + ε. Suppose also that E(||s_t||² | w_t) ≤ K_1 J(w_t) + K_2 E(s_t | w_t)ᵀ ∇J(w_t) + K_3, and finally that the constants α_t satisfy α_t > 0 and Σ_t α_t = ∞. Then J(w_t) → J* with probability 1. [sent-111, score-0.17]

79 Most of the work in proving the next lemma is already present in [1]. [sent-112, score-0.208]

80 The transformation from an MDP under a fixed policy to a Markov chain is standard. [sent-113, score-0.311]

81 Lemma 2 The update made by SARSA(0) or V(0) during a single trajectory can be written in the form w ← w − α(A_π w − r_π + ε), where the constant matrix A_π and constant vector r_π depend on the currently greedy policy π, α is the current learning rate, and E(ε) = 0. [sent-114, score-0.638]

82 Moreover, A_π is positive definite, and there is a constant K such that Var(ε) ≤ K(1 + ||w||²). PROOF: Consider the following Markov process M_π: M_π has one state for each state-action pair in M. [sent-116, score-0.101]

83 If M has a transition which goes from state s under action a with reward r to state s' with probability p, then M_π has a transition from state (s, a) with reward r to state (s', a') for every a'; the probability of this transition is p·π(a'|s'). [sent-117, score-0.958]
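
A sketch of this construction, assuming the original MDP's transition probabilities are stored as an array P[s, a, s'] and the policy as pi[s, a]; both names and the tiny numbers are hypothetical, with the probability mass missing from each row going to the terminal state:

```python
import numpy as np

def build_T_pi(P, pi):
    """Markov chain over state-action pairs: prob((s,a) -> (s',a')) = P[s,a,s'] * pi[s',a']."""
    n_s, n_a, _ = P.shape
    T = np.zeros((n_s * n_a, n_s * n_a))
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                for a2 in range(n_a):
                    T[s * n_a + a, s2 * n_a + a2] = P[s, a, s2] * pi[s2, a2]
    return T

# Tiny example: 2 states, 2 actions; each row's missing mass (0.2) goes to the terminal state,
# so the resulting chain is substochastic and terminates with probability 1.
P = np.array([[[0.7, 0.1], [0.2, 0.6]],
              [[0.4, 0.4], [0.3, 0.5]]])   # P[s, a, s']
pi = np.full((2, 2), 0.5)                  # pi[s, a]: mixes both actions, so pi(a|s) > 0
print(build_T_pi(P, pi))
```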

84 With these definitions, it is easy to see that TD(0) acting on M_π produces exactly the same sequence of parameter changes as SARSA(0) acting on M under the fixed policy π; and since π(a|s) > 0, every state of M_π will be visited infinitely often. [sent-119, score-0.687]

85 Write T_π for the transition probability matrix of the above Markov process. [sent-121, score-0.182]

86 That is, the entry of T_π in row (s, a) and column (s', a') will be equal to the probability of taking a step to (s', a') given that we start in (s, a). [sent-122, score-0.088]

87 Write s for the vector whose (s, a)th element is S_0(s)·π(a|s), that is, the probability that we start in state s and take action a. [sent-125, score-0.292]

88 As in [11], d_π is the vector of expected visitation frequencies under π; that is, the element of d_π corresponding to state s and action a is the expected number of times that the agent will visit state s and select action a during a single trajectory following policy π. [sent-129, score-1.027]

89 Write D_π for the diagonal matrix with d_π on its diagonal. Write r for the vector of expected rewards; that is, the component of r corresponding to state s and action a is E(r(s, a)). [sent-130, score-0.173]

90 There are two sources of variance in the update direction: variation in the number of times each transition is visited, and variation in the one-step rewards. [sent-137, score-0.219]

91 The visitation frequencies and the one-step rewards both have bounded variance, and are independent of one another. [sent-138, score-0.338]

92 They enter into the overall update in two ways: there is one set of terms which is bilinear in the one-step rewards and the visitation frequencies, and there is another set of terms which is bilinear in the visitation frequencies and the weights w. [sent-139, score-0.559]

93 Because the policy is fixed, w is independent of the visitation frequencies, and so the latter set of terms has variance proportional to ||w||². [sent-141, score-0.414]

94 So, there is a constant K such that the total variance in ε can be bounded by K(1 + ||w||²). [sent-142, score-0.154]

95 In this case we define M_π to have the same states as M, and to have the transition matrix T_π whose element (s, s') is the probability of landing in s' in M on step t + 1, given that we start in s at step t and follow π. [sent-144, score-0.262]

96 Since we have assumed that all policies are proper and that every policy considered has a positive probability of reaching any state, the update matrix A_π = Xᵀ D_π (I − T_π) X is strictly positive definite. [sent-147, score-0.746]
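
A numerical sketch of this claim: for a substochastic T_π (a proper chain), expected visit counts d_π = (I − T_πᵀ)⁻¹ s (one standard formula, with s the positive start vector from sentence 87, written s_bar below), and full-rank features X, the symmetric part of A_π = Xᵀ D_π (I − T_π) X has strictly positive eigenvalues. All the specific matrices below are made-up examples.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                    # number of (s, a) pairs in the induced chain

# Substochastic transition matrix of a proper chain (some mass escapes to the terminal state).
T_pi = rng.random((n, n))
T_pi = 0.9 * T_pi / T_pi.sum(axis=1, keepdims=True)

# Expected visitation frequencies d_pi = (I - T_pi^T)^{-1} s_bar for a positive start distribution.
s_bar = np.full(n, 1.0 / n)
d_pi = np.linalg.solve(np.eye(n) - T_pi.T, s_bar)
D_pi = np.diag(d_pi)

X = rng.standard_normal((n, 3))          # full-rank linear features, one row per (s, a) pair

A_pi = X.T @ D_pi @ (np.eye(n) - T_pi) @ X
sym_eigs = np.linalg.eigvalsh((A_pi + A_pi.T) / 2)
print(sym_eigs)                          # all positive: A_pi is strictly positive definite
```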

97 So, the norm of ∇H is 0 inside the unit sphere and 1 outside. [sent-151, score-0.163]

98 At the boundary of the unit sphere, ∇H is continuous, and its directional derivatives from every direction are bounded by the argument above. [sent-152, score-0.245]

99 On the existence of fixed points for approximate value iteration and temporal-difference learning. [sent-193, score-0.085]

100 On the convergence of stochastic iterative dynamic programming algorithms. [sent-209, score-0.048]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sarsa', 0.6), ('wt', 0.316), ('policy', 0.257), ('lemma', 0.175), ('trajectory', 0.156), ('policies', 0.145), ('stlwt', 0.12), ('visitation', 0.12), ('write', 0.116), ('update', 0.105), ('terminal', 0.104), ('diverge', 0.104), ('action', 0.103), ('sphere', 0.094), ('lipschitz', 0.094), ('td', 0.093), ('reward', 0.091), ('acting', 0.087), ('bounded', 0.086), ('converge', 0.084), ('rewards', 0.078), ('enough', 0.077), ('transition', 0.077), ('updates', 0.073), ('awt', 0.072), ('backgammon', 0.072), ('descent', 0.07), ('state', 0.07), ('visited', 0.069), ('argument', 0.068), ('rt', 0.065), ('markov', 0.064), ('agent', 0.063), ('greedy', 0.058), ('reinforcement', 0.058), ('probability', 0.057), ('fixed', 0.054), ('frequencies', 0.054), ('exploration', 0.052), ('sutton', 0.052), ('proof', 0.049), ('geoffrey', 0.049), ('positive', 0.049), ('convergence', 0.048), ('every', 0.048), ('als', 0.048), ('contraction', 0.048), ('counterexamples', 0.048), ('wold', 0.048), ('ao', 0.046), ('definite', 0.045), ('direction', 0.043), ('ilw', 0.041), ('finitely', 0.041), ('episodic', 0.041), ('converging', 0.041), ('axt', 0.041), ('reachable', 0.041), ('bilinear', 0.041), ('cover', 0.039), ('region', 0.038), ('states', 0.038), ('reaches', 0.038), ('var', 0.038), ('seeing', 0.038), ('episodes', 0.038), ('variance', 0.037), ('proper', 0.036), ('et', 0.035), ('norm', 0.035), ('fragment', 0.035), ('oscillate', 0.035), ('enters', 0.035), ('happen', 0.035), ('infinitely', 0.035), ('rule', 0.034), ('program', 0.034), ('inside', 0.034), ('converges', 0.033), ('suppose', 0.033), ('choose', 0.033), ('maintains', 0.033), ('proving', 0.033), ('start', 0.031), ('element', 0.031), ('selects', 0.031), ('iteration', 0.031), ('constant', 0.031), ('change', 0.03), ('iterates', 0.029), ('open', 0.029), ('algorithms', 0.029), ('decision', 0.028), ('end', 0.028), ('define', 0.028), ('proves', 0.028), ('long', 0.028), ('produces', 0.028), ('notation', 0.028), ('pick', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

2 0.21771783 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Author: Brian Sallans, Geoffrey E. Hinton

Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.

3 0.18983559 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.

4 0.15071057 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

Author: Dirk Ormoneit, Peter W. Glynn

Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1

5 0.114091 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

Author: Anders Jonsson, Andrew G. Barto

Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.

6 0.1094934 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

7 0.1065924 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

8 0.10526069 105 nips-2000-Programmable Reinforcement Learning Agents

9 0.10200752 6 nips-2000-A Neural Probabilistic Language Model

10 0.096891008 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

11 0.096361704 48 nips-2000-Exact Solutions to Time-Dependent MDPs

12 0.086815007 43 nips-2000-Dopamine Bonuses

13 0.082694784 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

14 0.080194011 113 nips-2000-Robust Reinforcement Learning

15 0.071254089 22 nips-2000-Algorithms for Non-negative Matrix Factorization

16 0.064087808 75 nips-2000-Large Scale Bayes Point Machines

17 0.063191928 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

18 0.057592906 137 nips-2000-The Unscented Particle Filter

19 0.057400681 120 nips-2000-Sparse Greedy Gaussian Process Regression

20 0.051428396 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.207), (1, -0.008), (2, 0.073), (3, -0.329), (4, -0.218), (5, 0.003), (6, -0.046), (7, -0.016), (8, -0.08), (9, -0.051), (10, -0.039), (11, -0.004), (12, 0.006), (13, 0.046), (14, -0.083), (15, -0.012), (16, 0.04), (17, 0.037), (18, 0.054), (19, -0.035), (20, 0.103), (21, 0.1), (22, -0.072), (23, 0.069), (24, 0.103), (25, -0.074), (26, -0.065), (27, 0.006), (28, 0.009), (29, 0.111), (30, 0.063), (31, -0.041), (32, 0.025), (33, 0.059), (34, -0.05), (35, -0.004), (36, 0.037), (37, 0.124), (38, 0.094), (39, 0.072), (40, -0.068), (41, 0.132), (42, -0.091), (43, 0.023), (44, -0.124), (45, -0.109), (46, -0.089), (47, -0.025), (48, -0.049), (49, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96710706 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

2 0.67876065 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

Author: Brian Sallans, Geoffrey E. Hinton

Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-learning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.

3 0.66904503 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.

4 0.64565659 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

Author: Dirk Ormoneit, Peter W. Glynn

Abstract: Many approaches to reinforcement learning combine neural networks or other parametric function approximators with a form of temporal-difference learning to estimate the value function of a Markov Decision Process. A significant disadvantage of those procedures is that the resulting learning algorithms are frequently unstable. In this work, we present a new, kernel-based approach to reinforcement learning which overcomes this difficulty and provably converges to a unique solution. By contrast to existing algorithms, our method can also be shown to be consistent in the sense that its costs converge to the optimal costs asymptotically. Our focus is on learning in an average-cost framework and on a practical application to the optimal portfolio choice problem. 1

5 0.61436689 105 nips-2000-Programmable Reinforcement Learning Agents

Author: David Andre, Stuart J. Russell

Abstract: We present an expressive agent design language for reinforcement learning that allows the user to constrain the policies considered by the learning process.The language includes standard features such as parameterized subroutines, temporary interrupts, aborts, and memory variables, but also allows for unspecified choices in the agent program. For learning that which isn't specified, we present provably convergent learning algorithms. We demonstrate by example that agent programs written in the language are concise as well as modular. This facilitates state abstraction and the transferability of learned skills.

6 0.51250434 113 nips-2000-Robust Reinforcement Learning

7 0.51082695 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

8 0.47237816 48 nips-2000-Exact Solutions to Time-Dependent MDPs

9 0.40683481 6 nips-2000-A Neural Probabilistic Language Model

10 0.40448517 43 nips-2000-Dopamine Bonuses

11 0.3683646 22 nips-2000-Algorithms for Non-negative Matrix Factorization

12 0.35297629 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

13 0.34102824 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

14 0.3182652 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

15 0.30450803 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

16 0.2651976 44 nips-2000-Efficient Learning of Linear Perceptrons

17 0.26388457 137 nips-2000-The Unscented Particle Filter

18 0.25578231 120 nips-2000-Sparse Greedy Gaussian Process Regression

19 0.2536431 138 nips-2000-The Use of Classifiers in Sequential Inference

20 0.25184354 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.035), (17, 0.07), (32, 0.017), (33, 0.051), (54, 0.014), (55, 0.011), (62, 0.525), (65, 0.013), (67, 0.059), (75, 0.011), (76, 0.045), (79, 0.012), (81, 0.023), (90, 0.027), (97, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99377942 63 nips-2000-Hierarchical Memory-Based Reinforcement Learning

Author: Natalia Hernandez-Gardiol, Sridhar Mahadevan

Abstract: A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hierarchy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher levels in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task. 1

same-paper 2 0.97738594 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

3 0.92788792 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

Author: Igor V. Cadez, Padhraic Smyth

Abstract: We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifically we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within firstorder as a function of model complexity. This general property of

4 0.77239943 26 nips-2000-Automated State Abstraction for Options using the U-Tree Algorithm

Author: Anders Jonsson, Andrew G. Barto

Abstract: Learning a complex task can be significantly facilitated by defining a hierarchy of subtasks. An agent can learn to choose between various temporally abstract actions, each solving an assigned subtask, to accomplish the overall task. In this paper, we study hierarchical learning using the framework of options. We argue that to take full advantage of hierarchical structure, one should perform option-specific state abstraction, and that if this is to scale to larger tasks, state abstraction should be automated. We adapt McCallum's U-Tree algorithm to automatically build option-specific representations of the state feature space, and we illustrate the resulting algorithm using a simple hierarchical task. Results suggest that automated option-specific state abstraction is an attractive approach to making hierarchical learning systems more effective.

5 0.7427246 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

Author: Christian R. Shelton

Abstract: For many problems which would be natural for reinforcement learning, the reward signal is not a single scalar value but has multiple scalar components. Examples of such problems include agents with multiple goals and agents with multiple users. Creating a single reward value by combining the multiple components can throwaway vital information and can lead to incorrect solutions. We describe the multiple reward source problem and discuss the problems with applying traditional reinforcement learning. We then present an new algorithm for finding a solution and results on simulated environments.

6 0.72559291 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task

7 0.68406999 105 nips-2000-Programmable Reinforcement Learning Agents

8 0.68317908 1 nips-2000-APRICODD: Approximate Policy Construction Using Decision Diagrams

9 0.61029661 48 nips-2000-Exact Solutions to Time-Dependent MDPs

10 0.6053831 80 nips-2000-Learning Switching Linear Models of Human Motion

11 0.57916087 113 nips-2000-Robust Reinforcement Learning

12 0.56479317 92 nips-2000-Occam's Razor

13 0.5522089 43 nips-2000-Dopamine Bonuses

14 0.53628492 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

15 0.53620869 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

16 0.53443712 139 nips-2000-The Use of MDL to Select among Computational Models of Cognition

17 0.52404374 138 nips-2000-The Use of Classifiers in Sequential Inference

18 0.52375138 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

19 0.51138204 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

20 0.5038293 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes