nips nips2002 nips2002-61 knowledge-graph by maker-knowledge-mining

61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation


Source: pdf

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. [sent-5, score-0.51]

2 However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. [sent-6, score-0.313]

3 In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. [sent-7, score-0.375]

4 Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. [sent-8, score-0.522]

5 This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. [sent-9, score-0.459]

6 For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data. [sent-10, score-0.809]

7 In large, possibly continuous, state spaces a tabular representation and adaptation of the value function is not feasible with respect to time and memory considerations. [sent-12, score-0.189]

8 However, it has been shown that synchronous TD(0), i.e. [sent-14, score-0.254]

9 dynamic programming, diverges for general linear function approximation [1]. [sent-16, score-0.166]

10 Convergence with probability one for TD(λ) with general linear function approximation has been proved in [12]. [sent-17, score-0.146]

11 They establish the crucial condition of sampling states according to the steady-state distribution of the Markov chain in order to ensure convergence. [sent-18, score-0.231]

12 This requirement is reasonable for the pure prediction task but may be disadvantageous for policy improvement as shown in [6] because it may lead to bad action choices in rarely visited parts of the state space. [sent-19, score-0.397]

13 When transition data is taken from arbitrary sources, a certain sampling distribution cannot be assured, which may prevent convergence. [sent-20, score-0.481]

14 An alternative to such iterative TD approaches is given by least-squares TD (LSTD) methods [4, 3, 6, 8]. [sent-21, score-0.19]

15 They eliminate the learning rate parameter and carry out a matrix inversion in order to compute the fixed point of the iteration directly (see the LSTD sketch after this sentence list). [sent-22, score-0.462]

16 In [4] a least-squares approach for TD(0) is presented, which is generalised to TD(λ) in [3]. [sent-23, score-0.047]

17 Both approaches still sample the states according to the steady-state distribution. [sent-24, score-0.084]

18 In [6, 8] arbitrary sampling distributions are used such that the transition data could be taken from any source. [sent-25, score-0.316]

19 This may yield solutions that are not achievable by the corresponding iterative approach because this iteration diverges. [sent-26, score-0.317]

20 All the LSTD approaches have the problem that the matrix to be inverted may be singular. [sent-27, score-0.084]

21 This case can occur if the basis functions are not linearly independent or if the Markov chain is not recurrent. [sent-28, score-0.121]

22 In order to apply the LSTD approach the problem would have to be preprocessed by sorting out the linearly dependent basis functions and the transient states of the Markov chain. [sent-29, score-0.293]

23 In practice one would like to avoid this additional work. [sent-30, score-0.044]

24 Thus, the least-squares TD algorithm can fail due to matrix singularity and the iterative TD(0) algorithm can fail if the sampling distribution is different from the steady-state distribution. [sent-31, score-0.566]

25 Hence, there are problems for which neither an iterative nor a least-squares TD solution exist. [sent-32, score-0.19]

26 The actual reason for the failure of the iterative TD(0) approach lies in an incompatible combination of the RL algorithm and the function approximator. [sent-33, score-0.322]

27 Thus, the idea is that either a change in the RL algorithm or a change in the approximator may yield a convergent iteration. [sent-34, score-0.373]

28 Here, a change in the TD(0) algorithm is not meant to completely alter the character of the algorithm. [sent-35, score-0.157]

29 We require that only modifications of the TD(0) algorithm be considered that are consistent according to the definition in the next section. [sent-36, score-0.131]

30 In this paper we propose a unified framework for the analysis of a whole class of synchronous iterative RL algorithms combined with arbitrary linear function approximation. [sent-37, score-0.651]

31 For the sparse iteration matrices that occur in RL such an iterative approach is superior to a method that uses matrix inversion as the LSTD approach does [5]. [sent-38, score-0.442]

32 Our main theorem states sufficient conditions under which combinations of RL algorithms and linear function approximation converge. [sent-39, score-0.452]

33 We hope that these conditions and the convergence analysis, which is based on the eigenvalues of the iteration matrix, bring new insight into the interplay of RL and function approximation. [sent-40, score-0.328]

34 For an arbitrary linear function approximator and for arbitrary fixed transition data the theorem allows one to predict the existence of a constant learning rate such that the synchronous residual gradient algorithm [1] converges (an illustrative sketch of such synchronous updates follows after this sentence list). [sent-41, score-1.141]

35 Moreover, in combination with interpolating grid-based function approximators we are able to specify a formula for a constant learning rate such that the synchronous residual gradient algorithm converges independently of the transition data. [sent-42, score-0.886]

36 This is very useful because otherwise the learning rate would have to be decreased, which slows down convergence. [sent-43, score-0.18]

37 2 A Framework for Synchronous Iterative RL Algorithms. For a Markov decision process (MDP) with $N$ states $S = \{s_1, \ldots, s_N\}$, [sent-44, score-0.084]

38 action space $A$, state transition probabilities $p : S \times S \times A \to [0,1]$ and stochastic reward function $r : S \times A \to \mathbb{R}$, policy evaluation is concerned with solving the Bellman equation $V^\pi = \gamma P^\pi V^\pi + R^\pi$ (1) for a fixed policy $\pi : S \to A$. [sent-47, score-0.781]

39 $V_i^\pi$ denotes the value of state $s_i$, $P_{ij}^\pi = p(s_i, s_j, \pi(s_i))$, $R_i^\pi = E\{r(s_i, \pi(s_i))\}$ and $\gamma$ is the discount factor. [sent-48, score-0.09]

40 As the policy $\pi$ is fixed we will omit it in the following to make notation easier. [sent-49, score-0.281]

41 If the state space S gets too large, the exact solution of equation (1) becomes very costly with respect to both memory and computation time. [sent-50, score-0.168]

42 Therefore, linear feature-based function approximation is often applied. [sent-51, score-0.113]
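
To make the synchronous iteration concrete, the following sketch contrasts a TD(0)-style semi-gradient update with a residual-gradient update over a fixed set of transitions and a linear value function V ≈ Φw. This is only an illustrative approximation of the setting described in the sentences above, not the authors' implementation; the feature matrix, transitions, discount factor, and learning rate are made-up example values.

```python
import numpy as np

# Illustrative sketch only: synchronous policy evaluation on a fixed set of
# transitions (s, r, s') with a linear value function V(s) ~= phi(s)^T w.
# The features, transitions, discount and learning rate are made-up examples.

gamma = 0.9                          # discount factor
Phi = np.array([[1.0, 0.0],          # feature matrix, one row phi(s)^T per state
                [0.5, 0.5],
                [0.0, 1.0]])
transitions = [(0, 1.0, 1),          # (state index, reward, next-state index)
               (1, 0.0, 2),
               (2, 0.5, 0)]

def synchronous_sweep(w, alpha, residual_gradient=False):
    """One synchronous update over the whole fixed transition set."""
    step = np.zeros_like(w)
    for s, r, s_next in transitions:
        phi, phi_next = Phi[s], Phi[s_next]
        delta = r + gamma * phi_next @ w - phi @ w   # Bellman (TD) error
        if residual_gradient:
            # Residual gradient: gradient descent on the squared Bellman
            # residual, so the update direction involves (phi - gamma * phi').
            step += delta * (phi - gamma * phi_next)
        else:
            # TD(0)-style semi-gradient: only phi(s) appears in the update.
            step += delta * phi
    return w + alpha * step

w_td, w_rg = np.zeros(2), np.zeros(2)
for _ in range(500):
    w_td = synchronous_sweep(w_td, alpha=0.1)
    w_rg = synchronous_sweep(w_rg, alpha=0.1, residual_gradient=True)
print("TD(0)-style weights:      ", w_td)
print("residual gradient weights:", w_rg)
```

The residual-gradient variant performs true gradient descent on the squared Bellman residual, which is the property the paper builds on when it derives a universal constant learning rate for grid-based linear interpolation; the TD(0)-style sweep, by contrast, need not converge for arbitrary transition data.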
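
The least-squares alternative mentioned in sentences 14-24 computes the TD fixed point directly by solving a linear system rather than iterating. The sketch below is again purely illustrative (reusing the hypothetical data from the previous snippet) and shows where the singularity problem discussed above can appear.

```python
import numpy as np

# Illustrative LSTD-style sketch: build A w = b from a fixed transition set
# and solve it directly instead of iterating (same made-up data as above).
gamma = 0.9
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
transitions = [(0, 1.0, 1), (1, 0.0, 2), (2, 0.5, 0)]

A = np.zeros((2, 2))
b = np.zeros(2)
for s, r, s_next in transitions:
    phi, phi_next = Phi[s], Phi[s_next]
    A += np.outer(phi, phi - gamma * phi_next)   # sum of phi (phi - gamma phi')^T
    b += r * phi                                 # sum of r * phi

# A can be singular, e.g. for linearly dependent basis functions or a
# non-recurrent chain; here we simply fall back to a pseudo-inverse solution.
try:
    w = np.linalg.solve(A, b)
except np.linalg.LinAlgError:
    w = np.linalg.pinv(A) @ b
print("LSTD weights:", w)
```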


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('td', 0.509), ('rl', 0.395), ('synchronous', 0.254), ('lstd', 0.232), ('iterative', 0.19), ('policy', 0.16), ('approximator', 0.148), ('transition', 0.146), ('reinforcement', 0.14), ('merke', 0.133), ('si', 0.125), ('fixed', 0.121), ('schoknecht', 0.116), ('sampling', 0.102), ('tabular', 0.093), ('iteration', 0.091), ('residual', 0.086), ('inversion', 0.085), ('states', 0.084), ('convergence', 0.08), ('convergent', 0.076), ('concrete', 0.076), ('sufficient', 0.071), ('arbitrary', 0.068), ('decreased', 0.067), ('approximation', 0.066), ('rate', 0.06), ('interpolating', 0.058), ('analyse', 0.058), ('ilkd', 0.058), ('informatik', 0.058), ('combinations', 0.058), ('markov', 0.057), ('fail', 0.056), ('theorem', 0.054), ('germany', 0.054), ('approximators', 0.053), ('unified', 0.053), ('diverges', 0.053), ('singularity', 0.053), ('karlsruhe', 0.053), ('modifications', 0.053), ('slows', 0.053), ('gradient', 0.052), ('sources', 0.05), ('interplay', 0.049), ('infinitely', 0.049), ('sorting', 0.049), ('inverted', 0.049), ('losing', 0.049), ('state', 0.048), ('memory', 0.048), ('combination', 0.048), ('linear', 0.047), ('incompatible', 0.047), ('bellman', 0.047), ('generalised', 0.047), ('converges', 0.046), ('independently', 0.046), ('chain', 0.045), ('save', 0.044), ('alter', 0.044), ('strongest', 0.044), ('assured', 0.044), ('ralf', 0.044), ('discount', 0.042), ('occur', 0.041), ('preprocessed', 0.041), ('bad', 0.041), ('definition', 0.041), ('visited', 0.039), ('algorithms', 0.039), ('strategy', 0.039), ('evaluation', 0.039), ('change', 0.038), ('certain', 0.038), ('universal', 0.038), ('mdp', 0.038), ('meant', 0.038), ('eliminate', 0.038), ('bring', 0.038), ('action', 0.038), ('algorithm', 0.037), ('costly', 0.037), ('rarely', 0.037), ('transient', 0.037), ('hope', 0.037), ('yield', 0.036), ('interpolation', 0.036), ('concerned', 0.036), ('sj', 0.036), ('matrix', 0.035), ('basis', 0.035), ('gets', 0.035), ('pure', 0.034), ('conditions', 0.033), ('reward', 0.033), ('prevent', 0.033), ('proved', 0.033), ('carry', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data.

2 0.58004189 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

3 0.24377383 199 nips-2002-Timing and Partial Observability in the Dopamine System

Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky

Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.

4 0.2215573 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

5 0.10495941 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

Author: Benjamin V. Roy, Daniela D. Farias

Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a two-phase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.

6 0.095901355 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

7 0.089198217 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

8 0.082627848 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

9 0.078945607 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

10 0.071922444 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

11 0.070441015 205 nips-2002-Value-Directed Compression of POMDPs

12 0.068652876 20 nips-2002-Adaptive Caching by Refetching

13 0.057963733 115 nips-2002-Informed Projections

14 0.056697916 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

15 0.052796267 134 nips-2002-Learning to Take Concurrent Actions

16 0.048782628 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

17 0.047340177 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

18 0.047082771 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

19 0.046952371 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

20 0.04579699 78 nips-2002-Efficient Learning Equilibrium


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.155), (1, -0.016), (2, -0.351), (3, -0.124), (4, -0.022), (5, -0.154), (6, 0.033), (7, -0.082), (8, 0.033), (9, -0.021), (10, 0.023), (11, -0.327), (12, -0.131), (13, 0.357), (14, 0.13), (15, -0.074), (16, -0.083), (17, 0.23), (18, -0.174), (19, -0.144), (20, 0.059), (21, 0.121), (22, 0.166), (23, 0.09), (24, 0.105), (25, 0.014), (26, 0.073), (27, -0.009), (28, -0.03), (29, 0.015), (30, -0.04), (31, -0.068), (32, -0.041), (33, 0.021), (34, -0.076), (35, 0.076), (36, 0.021), (37, -0.017), (38, 0.027), (39, 0.045), (40, -0.02), (41, -0.037), (42, -0.041), (43, -0.057), (44, -0.008), (45, 0.063), (46, -0.038), (47, 0.022), (48, -0.027), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97254562 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data.

2 0.93147576 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

3 0.63413024 199 nips-2002-Timing and Partial Observability in the Dopamine System

Author: Nathaniel D. Daw, Aaron C. Courville, David S. Touretzky

Abstract: According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.

4 0.51490086 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

5 0.40050113 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

Author: Benjamin V. Roy, Daniela D. Farias

Abstract: This paper extends our earlier analysis on approximate linear programming as an approach to approximating the cost-to-go function in a discounted-cost dynamic program [6]. In this paper, we consider the average-cost criterion and a version of approximate linear programming that generates approximations to the optimal average cost and differential cost function. We demonstrate that a naive version of approximate linear programming prioritizes approximation of the optimal average cost and that this may not be well-aligned with the objective of deriving a policy with low average cost. For that, the algorithm should aim at producing a good approximation of the differential cost function. We propose a two-phase variant of approximate linear programming that allows for external control of the relative accuracy of the approximation of the differential cost function over different portions of the state space via state-relevance weights. Performance bounds suggest that the new algorithm is compatible with the objective of optimizing performance and provide guidance on appropriate choices for state-relevance weights.

6 0.29729605 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

7 0.27424362 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

8 0.25417584 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

9 0.25374684 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

10 0.24342561 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

11 0.23764747 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

12 0.23762955 205 nips-2002-Value-Directed Compression of POMDPs

13 0.22928934 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

14 0.2232489 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

15 0.21418844 20 nips-2002-Adaptive Caching by Refetching

16 0.21140817 128 nips-2002-Learning a Forward Model of a Reflex

17 0.20256674 115 nips-2002-Informed Projections

18 0.18446004 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution

19 0.18052576 22 nips-2002-Adaptive Nonlinear System Identification with Echo State Networks

20 0.17687804 111 nips-2002-Independent Components Analysis through Product Density Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.026), (23, 0.047), (42, 0.11), (54, 0.111), (55, 0.044), (58, 0.351), (67, 0.015), (74, 0.062), (92, 0.057), (98, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7977463 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

Author: Ralf Schoknecht, Artur Merke

Abstract: Convergence for iterative reinforcement learning algorithms like TD(0) depends on the sampling strategy for the transitions. However, in practical applications it is convenient to take transition data from arbitrary sources without losing convergence. In this paper we investigate the problem of repeated synchronous updates based on a fixed set of transitions. Our main theorem yields sufficient conditions of convergence for combinations of reinforcement learning algorithms and linear function approximation. This allows one to analyse whether a certain reinforcement learning algorithm and a certain function approximator are compatible. For the combination of the residual gradient algorithm with grid-based linear interpolation we show that there exists a universal constant learning rate such that the iteration converges independently of the concrete transition data.

2 0.77964693 186 nips-2002-Spike Timing-Dependent Plasticity in the Address Domain

Author: R. J. Vogelstein, Francesco Tenore, Ralf Philipp, Miriam S. Adlerstein, David H. Goldberg, Gert Cauwenberghs

Abstract: Address-event representation (AER), originally proposed as a means to communicate sparse neural events between neuromorphic chips, has proven efficient in implementing large-scale networks with arbitrary, configurable synaptic connectivity. In this work, we further extend the functionality of AER to implement arbitrary, configurable synaptic plasticity in the address domain. As proof of concept, we implement a biologically inspired form of spike timing-dependent plasticity (STDP) based on relative timing of events in an AER framework. Experimental results from an analog VLSI integrate-and-fire network demonstrate address domain learning in a task that requires neurons to group correlated inputs.

3 0.65998966 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unkown-abstract

4 0.60179257 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

Author: Ralf Schoknecht

Abstract: There are several reinforcement learning algorithms that yield approximate solutions for the problem of policy evaluation when the value function is represented with a linear function approximator. In this paper we show that each of the solutions is optimal with respect to a specific objective function. Moreover, we characterise the different solutions as images of the optimal exact value function under different projection operations. The results presented here will be useful for comparing the algorithms in terms of the error they achieve relative to the error of the optimal approximate solution. 1

5 0.48505968 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces ε-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

6 0.48207584 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

7 0.48061544 46 nips-2002-Boosting Density Estimation

8 0.47862551 169 nips-2002-Real-Time Particle Filters

9 0.47826442 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

10 0.47634023 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

11 0.4730016 138 nips-2002-Manifold Parzen Windows

12 0.4715656 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

13 0.4714973 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

14 0.47067395 199 nips-2002-Timing and Partial Observability in the Dopamine System

15 0.47025359 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

16 0.46925616 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.46762547 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

18 0.46578503 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

19 0.46497369 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

20 0.4646042 143 nips-2002-Mean Field Approach to a Probabilistic Model in Information Retrieval