nips nips2001 nips2001-128 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. [sent-7, score-0.445]
2 A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. [sent-8, score-0.634]
3 We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). [sent-9, score-0.63]
4 The action space of the resulting MDP is the joint action space of the entire set of agents. [sent-10, score-0.522]
5 Our approach is based on the use of factored linear value functions as an approximation to the joint value function. [sent-11, score-0.547]
6 This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. [sent-12, score-0.691]
7 We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. [sent-13, score-0.267]
8 We thereby avoid the exponential blowup in the state and action space. [sent-14, score-0.412]
9 We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case. [sent-16, score-0.42]
10 We want to find a mechanism for coordinating the agents’ actions so as to maximize their joint utility. [sent-18, score-0.24]
11 One obvious approach to this problem is to represent the system as a Markov decision process (MDP), where the “action” is a joint action for all of the agents and the reward is the total reward for all of the agents. [sent-19, score-0.923]
12 Unfortunately, the action space is exponential in the number of agents, rendering this approach impractical in most cases. [sent-20, score-0.279]
13 We present a novel approach based on approximating the joint value function as a linear combination of local value functions, each of which relates only to the parts of the system controlled by a small number of agents. [sent-22, score-0.272]
14 We show how such factored value functions allow the agents to find a globally optimal joint action using a very natural message passing scheme. [sent-23, score-1.199]
15 We provide a very efficient algorithm for computing such a factored approximation to the true value function using a linear programming approach. [sent-24, score-0.397]
16 This approach is of independent interest, as it is significantly faster and compares very favorably to previous approximate algorithms for single agent MDPs. [sent-25, score-0.456]
17 We also compare our multiagent algorithm to the multiagent reward and value sharing algorithms of Schneider et al. [sent-26, score-0.705]
18 2 Cooperative Action Selection We begin by considering a simpler problem of selecting a globally optimal joint action in order to maximize the joint immediate value achieved by a set of agents. [sent-30, score-0.562]
19 Suppose we have a collection of agents, where each agent j chooses an action a_j. [sent-31, score-0.584]
20 Each agent has a local Q function Q_j, which represents its local contribution to the total utility function. [sent-33, score-0.439]
21 An agent’s local function might be influenced by its own action and those of some other agents; we define the scope of Q_j to be the set of agents whose actions influence Q_j. [sent-35, score-1.021]
22 Our task is to select a joint action a that maximizes the sum of the Q_j. [sent-37, score-0.293]
23 The fact that the Q_j depend on the actions of multiple agents forces the agents to coordinate their action choices. [sent-38, score-1.119]
24 We can represent the coordination requirements of the system using a coordination graph, where there is a node for each agent and an edge between two agents if they must directly coordinate their actions to optimize some particular Q_j. [sent-39, score-1.256]
25 A coordination graph for a four-agent example is shown in Fig. 1(a); the graph structure suggests the use of a cost network [5], which can be solved using non-serial dynamic programming [1] or a variable elimination algorithm which is virtually identical to variable elimination in a Bayesian network. [sent-42, score-0.528]
26 Let us begin our optimization with agent 4, collecting all the summands involving its action. [sent-45, score-0.355]
27 In effect, it is computing a conditional strategy, with a (possibly) different action choice for each action choice of agents 2 and 3. [sent-49, score-0.835]
28 Agent 4 can summarize the value that it brings to the system in the different circumstances using a new function e_4(a_2, a_3) whose value at each point (a_2, a_3) is the value of the internal maximization. [sent-50, score-0.249]
29 Note that e_4 introduces a new induced communication dependency between agents 2 and 3, the dashed line in Fig. 1(a). [sent-51, score-0.491]
30 Next, agent 3 makes its decision, giving a new function e_3(a_1, a_2) defined by the analogous maximization over a_3. [sent-54, score-0.355]
31 Agent 2 now makes its decision, giving e_2(a_1), and agent 1 can now simply choose the action a_1 that maximizes e_2(a_1). [sent-55, score-0.584]
32 We can recover the maximizing set of actions by performing the entire process in reverse: the maximizing choice in this last step selects the action for agent 1. [sent-56, score-0.685]
33 To fulfill its commitment to agent 1, agent 2 must choose the value of a_2 which maximizes the expression it optimized, given the chosen a_1. [sent-57, score-0.793]
34 This, in turn, forces agent 3 and then agent 4 to select their actions appropriately. [sent-58, score-0.811]
35 The algorithm then repeats the following steps until all agents are eliminated: (1) select an uneliminated agent; (2) collect all functions whose scope includes that agent's action and maximize their sum over it, producing a new function over the remaining agents in those scopes; the maximizing actions are then recovered in reverse order, as in the example above. [sent-60, score-0.388]
36 The cost of this algorithm is linear in the number of new “function values” introduced, or exponential in the induced width of the coordination graph [5]. [sent-64, score-0.515]
37 Furthermore, each agent does not need to communicate directly with every other agent; instead, the necessary communication bandwidth will also be determined by the induced width of the graph, further simplifying the coordination mechanism. [sent-65, score-0.742]
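The elimination procedure described above can be sketched in a few lines. The following is a minimal illustration under assumed data structures (explicit tables indexed by action tuples, a caller-supplied elimination order); the function name, table layout, and the numbers in the usage example are invented for illustration and are not the paper's code.

```python
# A minimal sketch of joint action selection by variable elimination on a
# coordination graph, assuming each local Q function is given as an explicit
# table over the joint actions of the agents in its scope.
from itertools import product

def coordinate(q_funcs, elim_order, action_domains):
    """q_funcs: list of (scope, table) pairs, where scope is a tuple of agent
    ids and table maps tuples of their actions (in scope order) to values.
    Returns a joint action maximizing the sum of the local Q functions."""
    pool = list(q_funcs)
    strategies = []  # (agent, conditioning scope, best-response table)
    for agent in elim_order:
        involved = [f for f in pool if agent in f[0]]
        pool = [f for f in pool if agent not in f[0]]
        # The new function depends on every other agent mentioned by the
        # functions being maximized out.
        new_scope = tuple(sorted({a for s, _ in involved for a in s if a != agent}))
        new_table, best = {}, {}
        for ctx in product(*(action_domains[a] for a in new_scope)):
            assign = dict(zip(new_scope, ctx))
            def total(act):
                assign[agent] = act
                return sum(t[tuple(assign[a] for a in s)] for s, t in involved)
            best_act = max(action_domains[agent], key=total)
            best[ctx], new_table[ctx] = best_act, total(best_act)
        pool.append((new_scope, new_table))
        strategies.append((agent, new_scope, best))
    # Recover the maximizing joint action by walking back in reverse order.
    choice = {}
    for agent, scope, best in reversed(strategies):
        choice[agent] = best[tuple(choice[a] for a in scope)]
    return choice

# Tiny usage example: four agents with binary actions; the scopes loosely mirror
# the structure in the text (agent 4 is eliminated first), but the values are made up.
doms = {1: [0, 1], 2: [0, 1], 3: [0, 1], 4: [0, 1]}
qs = [((1, 2), {a: a[0] + 2 * a[1] for a in product([0, 1], repeat=2)}),
      ((2, 4), {a: 3 * a[0] * a[1] for a in product([0, 1], repeat=2)}),
      ((3, 4), {a: a[0] - a[1] for a in product([0, 1], repeat=2)})]
print(coordinate(qs, elim_order=[4, 3, 2, 1], action_domains=doms))
```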
38 3 One-Step Lookahead We now consider two elaborations to the action selection problem of the previous section. [sent-68, score-0.37]
39 First, we assume that the agents are acting in a space described by a set of discrete state variables, where each variable takes on values in some finite domain. [sent-69, score-0.509]
40 The scope of the local functions that comprise the value can include both action choices and state variables. [sent-71, score-0.663]
41 We assume that the agents have full observability of the relevant state variables, so by itself, this extension is fairly trivial: The functions define a conditional cost network. [sent-72, score-0.614]
42 Given a particular state, the agents instantiate the state variables and then solve the cost network as in the previous section. [sent-73, score-0.742]
43 However, we note that the agents do not actually need to have access to all of the state variables: each agent only needs to observe the variables that are in the scope of its local function, thereby considerably decreasing the amount of information it needs to observe. [sent-74, score-1.417]
44 The second extension is somewhat more complicated: We assume that the agents are trying to maximize the sum of an immediate reward and a value that they expect to receive one step in the future. [sent-75, score-0.649]
45 Fig. 1(b) shows a DDN (dynamic decision network) for a simple four-agent problem, where the ovals represent the variables and the rectangles the agent actions. [sent-87, score-0.414]
46 For any setting of the state variables, the agents aim to maximize the one-step lookahead value. [sent-89, score-0.505]
47 That is, they maximize the immediate reward plus the expected value of the next state. [sent-91, score-0.229]
48 By replacing the expectation with the backprojection, we can once again generate a set of restricted-scope local Q functions, and apply our coordination graph algorithm unchanged. [sent-106, score-0.298]
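The backprojection step can be sketched directly from the factored transition model. The sketch below assumes the usual factored-MDP setting with no arcs between next-step variables (so they are conditionally independent given the current state and actions); the data structures and names are illustrative, not the paper's notation.

```python
# A minimal sketch of backprojecting a restricted-scope basis function through
# a factored (DBN) transition model.
from itertools import product

def backproject(h_scope, h_table, cpds, domains):
    """h_scope: tuple of next-step variables the basis function h depends on.
    h_table: dict mapping assignments to h_scope (tuples) to values.
    cpds: dict mapping each next-step variable to (parents, table), where
          parents is a tuple of current-step state/action variables and
          table[(parent_values, value)] is the conditional probability.
    Returns (g_scope, g_table): g(x, a) = E[h(x') | x, a], a function whose
    scope is only the union of the parents of the variables in h_scope."""
    g_scope = tuple(sorted({p for v in h_scope for p in cpds[v][0]}))
    g_table = {}
    for parent_vals in product(*(domains[p] for p in g_scope)):
        ctx = dict(zip(g_scope, parent_vals))
        expectation = 0.0
        # Enumerate only the (few) next-step variables in h's restricted scope.
        for next_vals in product(*(domains[v] for v in h_scope)):
            prob = 1.0
            for v, val in zip(h_scope, next_vals):
                parents, table = cpds[v]
                prob *= table[(tuple(ctx[p] for p in parents), val)]
            expectation += prob * h_table[next_vals]
        g_table[parent_vals] = expectation
    return g_scope, g_table
```

Because the backprojection again has restricted scope, the local terms built from rewards and backprojections feed directly into the coordination graph algorithm of Section 2.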
49 An MDP is defined as a 4-tuple (X, A, R, P) where: X is a finite set of states; A is a set of actions; R is a reward function mapping X × A to IR, such that R(x, a) represents the reward obtained in state x after taking action a; and P is a Markovian transition model where P(x' | x, a) represents the probability of going from state x to state x' with action a. [sent-109, score-0.946]
50 The optimal value function V* is defined so that the value of a state must be the maximal value achievable by any action at that state. [sent-111, score-0.597]
51 A stationary policy for an MDP is a mapping pi from states to actions, where pi(x) is the action the agent takes at state x. [sent-114, score-0.793]
52 For any value function V, we can define the policy obtained by acting greedily relative to V: Greedy(V)(x) = argmax_a [R(x, a) + gamma * sum_x' P(x' | x, a) V(x')]. [sent-115, score-0.254]
53 The greedy policy relative to the optimal value function V* is the optimal policy: Greedy(V*) = pi*. [sent-116, score-0.282]
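To pin down the notation above, here is a small, explicit (non-factored) sketch of the greedy policy and of value iteration for V*. It enumerates states and actions, which is exactly the blowup the factored approach is designed to avoid; the table-based representation is an assumption made for illustration.

```python
# Explicit (non-factored) illustration of the definitions above: the greedy
# policy with respect to a value function, and value iteration for V*.
# R[x][a] is the reward and P[x][a][x2] the transition probability.
def greedy(V, states, actions, R, P, gamma):
    def q(x, a):
        return R[x][a] + gamma * sum(P[x][a][x2] * V[x2] for x2 in states)
    return {x: max(actions, key=lambda a: q(x, a)) for x in states}

def value_iteration(states, actions, R, P, gamma, sweeps=200):
    V = {x: 0.0 for x in states}
    for _ in range(sweeps):
        V = {x: max(R[x][a] + gamma * sum(P[x][a][x2] * V[x2] for x2 in states)
                    for a in actions)
             for x in states}
    return V
```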
54 We use the common approach of restricting attention to value functions that are compactly represented as a linear combination of basis functions h_1, ..., h_k, so that V(x) = sum_i w_i h_i(x). [sent-125, score-0.316]
55 Our representation of the one-step transition dynamics in Section 3 is precisely a factored MDP, where we factor not only the states but also the actions. [sent-133, score-0.302]
56 In [8], we proposed the use of factored linear value functions to approximate the value function in a factored MDP. [sent-134, score-0.789]
57 These value functions are a weighted linear combination of basis functions, as above, but each basis function is restricted to depend only on a small subset of state variables. [sent-135, score-0.425]
58 If we had a value function represented in this way, then we could use our algorithm of Section 3 to implement Greedy by having the agents use our message passing coordination algorithm at each step. [sent-138, score-0.815]
59 In previous work [9, 6], we presented algorithms for computing approximate value functions of this form for factored MDPs. [sent-140, score-0.469]
60 These algorithms can circumvent the exponential blowup in the number of state variables, but explicitly enumerate the action space of the MDP, making them unsuitable for the exponentially large action space in multiagent MDPs. [sent-141, score-0.915]
61 Thus, the right side of the constraint can be viewed as the sum of restricted scope functions parameterized by the weights w. [sent-153, score-0.294]
62 Select a set of restricted scope basis functions. [sent-169, score-0.328]
63 Use the one-step lookahead planning algorithm (Section 3) with the weighted combination of basis functions computed by the LP as the value function, to obtain local Q functions for each agent. [sent-173, score-0.314]
64 Each agent instantiates its local Q function with the values of the state variables in its scope. [sent-176, score-0.643]
65 Agents apply the coordination graph algorithm (Section 2) with these local Q functions to coordinate an approximately optimal global action. [sent-177, score-0.409]
66 Figure 2: Algorithm for multiagent planning with factored MDPs. [sent-179, score-0.628]
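The online instantiation step described above is just a partial evaluation of a restricted-scope function. A minimal sketch, assuming a table-based representation and illustrative names:

```python
# Given a local function whose scope mixes state variables and agent actions,
# plug in the observed values of its state variables to obtain a function over
# actions only, which then feeds the coordination graph algorithm of Section 2.
def instantiate(scope, table, state):
    """scope: tuple of variable names (state variables and action variables);
    table: dict mapping assignments to scope (tuples) to values;
    state: dict of observed values for the state variables an agent sees.
    Only the state variables appearing in the function's own scope are needed."""
    action_scope = tuple(v for v in scope if v not in state)
    reduced = {}
    for assignment, value in table.items():
        full = dict(zip(scope, assignment))
        if all(full[v] == state[v] for v in scope if v in state):
            reduced[tuple(full[v] for v in action_scope)] = value
    return action_scope, reduced
```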
67 In the case of cooperative multiagent MDPs, the actions of the individual agents become variables in the cost network, so that the set of cost-network variables is simply the state variables together with the agents' action variables. [sent-185, score-0.882]
68 The functions are simply the local restricted-scope functions corresponding to the rewards, the bases, and their backprojections. [sent-186, score-0.242]
69 This construction lets us enforce the LP constraints over the entire exponential state space and joint action space using a number of constraints which is only exponential in the induced tree width of the cost network, rather than exponential in the number of actions and state variables in the problem. [sent-190, score-0.974]
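To make the structure of the linear program concrete, here is a sketch for a tiny MDP using scipy. It enumerates the constraints explicitly, which is feasible only for toy problems; the point of the construction above is precisely to avoid this enumeration by representing the constraint set through the cost network. The relevance weights alpha and the use of a constant basis function to keep the LP feasible are standard assumptions, not details taken from this text.

```python
# Sketch of the approximate LP for a *tiny*, explicitly enumerated MDP:
#   minimize   sum_x alpha(x) * sum_i w_i h_i(x)
#   subject to sum_i w_i h_i(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) sum_i w_i h_i(x')
# for every state x and action a. The factored construction replaces this
# explicit enumeration with a compact, cost-network-structured constraint set.
import numpy as np
from scipy.optimize import linprog

def approx_lp_weights(states, actions, R, P, basis, gamma, alpha=None):
    """basis: list of functions h_i(x). R[x][a]: reward. P[x][a][x2]: probability."""
    H = np.array([[h(x) for h in basis] for x in states])      # |X| x k matrix
    alpha = np.full(len(states), 1.0 / len(states)) if alpha is None else np.asarray(alpha)
    c = alpha @ H                                               # objective coefficients on w
    A_ub, b_ub = [], []
    for i, x in enumerate(states):
        for a in actions:
            # V_w(x) >= R(x,a) + gamma*E[V_w(x')]  rewritten as  (gamma*E[H] - H[x]) w <= -R(x,a)
            expected_H = sum(P[x][a][x2] * H[j] for j, x2 in enumerate(states))
            A_ub.append(gamma * expected_H - H[i])
            b_ub.append(-R[x][a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * len(basis))
    return res.x
```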
70 A traditional single agent MDP is, of course, a special case of the multiagent case. [sent-191, score-0.612]
71 The approach of [6] (which is substantially more efficient than that of [9]) requires that we solve an LP for each step in policy iteration, and each LP contains constraints corresponding to multiple cost networks (whose number depends on the complexity of the policy representation). [sent-194, score-0.354]
72 Our overall algorithm for multiagent planning with factored MDPs is shown in Fig. 2. [sent-196, score-0.581]
73 6 Experimental results We first validate our approximate LP approach by comparing the quality of the solution to the approximate policy iteration (PI) approach of [6]. [sent-198, score-0.262]
74 As the approximate PI algorithm is not able to deal with the exponentially large action spaces of multiagent problems, we compare these two approaches on the single agent SysAdmin problem presented in [6], on a unidirectional ring network of up to 32 machines (over 4 billion states). [sent-199, score-1.226]
75 As shown in Fig. 3(b), our new approximate LP algorithm for factored MDPs is significantly faster than the approximate PI algorithm. [sent-201, score-0.408]
76 In fact, approximate PI with single-variable basis functions is more costly than the LP approach using basis functions over consecutive triples of variables. [sent-202, score-0.434]
77 As shown in Fig. 3(c), for singleton basis functions, the approximate PI policy obtains slightly better performance for some problem sizes. [sent-204, score-0.266]
78 However, as we increase the number of basis functions for the approximate LP formulation, the value of the resulting policy is much better. [sent-205, score-0.429]
79 Thus, in this problem, our new approximate linear programming formulation allows us to use more basis functions and to obtain a resulting policy of higher value, while still maintaining a faster running time. [sent-206, score-0.436]
80 Graphs: Approximate LP versus approximate PI on single agent SysAdmin on unidirectional ring: (b) running time; (c) estimated value of policy. [sent-208, score-0.638]
81 Each machine is associated with an agent and two variables: Status (good, faulty, dead) and Load (idle, loaded, process successful). [sent-224, score-0.416]
82 Each agent must decide whether its machine should be rebooted, in which case the Status becomes good and any running process is lost. [sent-229, score-0.401]
83 For a network of n machines, the number of states in the MDP is 9^n and the joint action space contains 2^n possible actions. [sent-230, score-0.391]
84 For example, a problem with 30 agents has over 10^28 states and a billion possible actions. [sent-232, score-0.464]
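As a quick sanity check on these sizes (3 Status values times 3 Load values per machine, and one binary reboot decision per agent):

```python
# State and joint-action counts for the SysAdmin domain described above.
for n in (10, 20, 30):
    print(f"n={n}: states = {9 ** n:.2e}, joint actions = {2 ** n:.2e}")
```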
85 We implemented the factored approximate linear programming and the message passing coordination algorithms in Matlab, using CPLEX as the LP solver. [sent-233, score-0.639]
86 We experimented with two types of basis functions: “single”, which contains an indicator basis function for each value of each variable; and “pair”, which, in addition, contains indicators over joint assignments of the Status variables of neighboring agents. [sent-234, score-0.352]
87 Here we used 10000 learning iterations, with learning and exploration rates starting at fixed initial values and following a decaying schedule after 5000 iterations; the observations for each agent were the status and load of its machine. [sent-242, score-0.44]
88 We also computed a utopic upper bound on the value of the optimal policy by removing the (negative) effect of the neighbors on the status of the machines. [sent-245, score-0.387]
89 For both network topologies tested, the estimated value of the approximate LP solution using single basis was significantly higher than that of the DR and DRF algorithms. [sent-247, score-0.356]
90 Note that the single basis solution requires no coordination when acting, so this is a “fair” comparison to DR and DRF which also do not communicate while acting. [sent-248, score-0.334]
91 If we allow for pair bases, which implies agent communication, we achieve a further improvement in terms of estimated value. [sent-249, score-0.355]
92 but for the clients, the agent waits until the process terminates or the machine dies before rebooting. [sent-253, score-0.396]
93 7 Conclusion We have provided a principled and efficient approach to planning in multiagent domains. [sent-254, score-0.311]
94 Rather than placing a priori restrictions on the communication structure between agents, we first choose the form of an approximate value function and derive the optimal communication structure given the value function architecture. [sent-255, score-0.395]
95 This approach provides a unified view of value function approximation and agent communication. [sent-256, score-0.438]
96 The inter-agent communication and the LP avoid the exponential blowup in the state and action spaces, having computational complexity dependent, instead, upon the induced tree width of the coordination graph used by the agents to negotiate their action selection. [sent-258, score-1.441]
97 By exploiting structure in both the state and action spaces, we can deal with considerably larger MDPs than those described in previous work. [sent-259, score-0.314]
98 In a family of multiagent network administration problems with over 10^28 states and over a billion actions, we have demonstrated near-optimal performance which is superior to a priori reward or value sharing schemes. [sent-260, score-0.632]
99 We believe the methods described herein significantly further extend the efficiency, applicability and general usability of factored value functions and models for the control of dynamic systems. [sent-261, score-0.453]
100 Computing factored value functions for policies in structured MDPs. [sent-306, score-0.4]
wordName wordTfidf (topN-words)
[('agents', 0.377), ('agent', 0.355), ('lp', 0.304), ('factored', 0.237), ('action', 0.229), ('multiagent', 0.225), ('coordination', 0.194), ('scope', 0.144), ('xvt', 0.141), ('policy', 0.124), ('mdp', 0.112), ('reward', 0.101), ('actions', 0.101), ('mdps', 0.1), ('star', 0.088), ('planning', 0.086), ('state', 0.085), ('status', 0.085), ('value', 0.083), ('xp', 0.08), ('functions', 0.08), ('ep', 0.074), ('parents', 0.074), ('basis', 0.073), ('cost', 0.072), ('graph', 0.071), ('faulty', 0.07), ('approximate', 0.069), ('ring', 0.064), ('koller', 0.064), ('network', 0.064), ('joint', 0.064), ('communication', 0.063), ('backprojection', 0.061), ('ddn', 0.061), ('dead', 0.061), ('drf', 0.061), ('lpsinglebasis', 0.061), ('numberofagents', 0.061), ('sysadmin', 0.061), ('utopic', 0.061), ('elimination', 0.06), ('variables', 0.059), ('message', 0.054), ('schneider', 0.053), ('billion', 0.053), ('dept', 0.053), ('unidirectional', 0.053), ('dynamic', 0.053), ('pi', 0.052), ('decision', 0.051), ('induced', 0.051), ('ef', 0.05), ('exponential', 0.05), ('exponentially', 0.049), ('rings', 0.048), ('guestrin', 0.048), ('blowup', 0.048), ('cooperative', 0.048), ('acting', 0.047), ('running', 0.046), ('immediate', 0.045), ('dr', 0.045), ('programming', 0.044), ('width', 0.044), ('maximize', 0.043), ('local', 0.042), ('passing', 0.041), ('dean', 0.041), ('dies', 0.041), ('distributedreward', 0.041), ('distributedvaluefunction', 0.041), ('lppairbasis', 0.041), ('numbe', 0.041), ('rebooted', 0.041), ('singlebasis', 0.041), ('yxp', 0.041), ('yxv', 0.041), ('rewards', 0.04), ('constraint', 0.039), ('qh', 0.038), ('sharing', 0.038), ('coordinate', 0.035), ('farias', 0.035), ('topologies', 0.035), ('communicate', 0.035), ('optimal', 0.034), ('states', 0.034), ('constraints', 0.034), ('algorithm', 0.033), ('enforce', 0.032), ('lookahead', 0.032), ('ine', 0.032), ('dbn', 0.032), ('coordinating', 0.032), ('parr', 0.032), ('single', 0.032), ('cients', 0.032), ('restricted', 0.031), ('transition', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
2 0.30881134 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.2521269 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
5 0.17954656 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
Author: Bram Bakker
Abstract: This paper presents reinforcement learning with a Long ShortTerm Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(,x) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1
6 0.17787988 36 nips-2001-Approximate Dynamic Programming via Linear Programming
7 0.1722649 121 nips-2001-Model-Free Least-Squares Policy Iteration
8 0.16524795 13 nips-2001-A Natural Policy Gradient
9 0.16032638 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
10 0.15655145 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
11 0.14502099 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
12 0.1228475 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
13 0.11368722 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
14 0.10216574 84 nips-2001-Global Coordination of Local Linear Models
15 0.099640474 115 nips-2001-Linear-time inference in Hierarchical HMMs
16 0.093223535 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
17 0.09192688 126 nips-2001-Motivated Reinforcement Learning
18 0.091765001 148 nips-2001-Predictive Representations of State
19 0.087718934 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
20 0.076382287 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
topicId topicWeight
[(0, -0.237), (1, -0.145), (2, 0.418), (3, 0.001), (4, -0.004), (5, 0.086), (6, -0.068), (7, -0.069), (8, -0.028), (9, 0.045), (10, -0.022), (11, -0.021), (12, 0.07), (13, -0.056), (14, -0.024), (15, -0.036), (16, -0.036), (17, -0.028), (18, -0.036), (19, -0.008), (20, 0.089), (21, -0.051), (22, 0.008), (23, -0.016), (24, -0.058), (25, 0.061), (26, 0.113), (27, -0.029), (28, 0.103), (29, -0.052), (30, -0.062), (31, 0.014), (32, -0.098), (33, 0.078), (34, -0.038), (35, 0.031), (36, -0.059), (37, -0.104), (38, -0.018), (39, 0.085), (40, -0.045), (41, -0.027), (42, -0.017), (43, 0.024), (44, -0.21), (45, -0.111), (46, -0.023), (47, -0.057), (48, 0.076), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.96833152 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
2 0.84047216 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
3 0.7519716 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
4 0.74607682 36 nips-2001-Approximate Dynamic Programming via Linear Programming
Author: Daniela Farias, Benjamin V. Roy
Abstract: The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large- scale stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The approach
5 0.70738506 121 nips-2001-Model-Free Least-Squares Policy Iteration
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.
7 0.60202289 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm
8 0.60146165 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning
9 0.56599879 177 nips-2001-Switch Packet Arbitration via Queue-Learning
10 0.54880863 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes
11 0.54808539 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
12 0.51168162 13 nips-2001-A Natural Policy Gradient
13 0.50786763 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning
14 0.44999439 148 nips-2001-Predictive Representations of State
15 0.4381862 51 nips-2001-Cobot: A Social Reinforcement Learning Agent
16 0.42612943 126 nips-2001-Motivated Reinforcement Learning
17 0.41480589 115 nips-2001-Linear-time inference in Hierarchical HMMs
18 0.37088573 84 nips-2001-Global Coordination of Local Linear Models
19 0.3603881 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
20 0.36005819 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
topicId topicWeight
[(14, 0.026), (17, 0.047), (19, 0.032), (27, 0.115), (30, 0.06), (36, 0.014), (38, 0.02), (45, 0.019), (59, 0.022), (66, 0.285), (72, 0.089), (79, 0.044), (83, 0.028), (91, 0.127)]
simIndex simValue paperId paperTitle
same-paper 1 0.79434294 128 nips-2001-Multiagent Planning with Factored MDPs
Author: Carlos Guestrin, Daphne Koller, Ronald Parr
Abstract: We present a principled and efficient planning algorithm for cooperative multiagent dynamic systems. A striking feature of our method is that the coordination and communication between the agents is not imposed, but derived directly from the system dynamics and function approximation architecture. We view the entire multiagent system as a single, large Markov decision process (MDP), which we assume can be represented in a factored way using a dynamic Bayesian network (DBN). The action space of the resulting MDP is the joint action space of the entire set of agents. Our approach is based on the use of factored linear value functions as an approximation to the joint value function. This factorization of the value function allows the agents to coordinate their actions at runtime using a natural message passing scheme. We provide a simple and efficient method for computing such an approximate value function by solving a single linear program, whose size is determined by the interaction between the value function structure and the DBN. We thereby avoid the exponential blowup in the state and action space. We show that our approach compares favorably with approaches based on reward sharing. We also show that our algorithm is an efficient alternative to more complicated algorithms even in the single agent case.
2 0.7140826 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
3 0.60243911 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
5 0.60012227 59 nips-2001-Direct value-approximation for factored MDPs
Author: Dale Schuurmans, Relu Patrascu
Abstract: We present a simple approach for computing reasonable policies for factored Markov decision processes (MDPs), when the optimal value function can be approximated by a compact linear form. Our method is based on solving a single linear program that approximates the best linear fit to the optimal value function. By applying an efficient constraint generation procedure we obtain an iterative solution method that tackles concise linear programs. This direct linear programming approach experimentally yields a significant reduction in computation time over approximate value- and policy-iteration methods (sometimes reducing several hours to a few seconds). However, the quality of the solutions produced by linear programming is weaker-usually about twice the approximation error for the same approximating class. Nevertheless, the speed advantage allows one to use larger approximation classes to achieve similar error in reasonable time. 1
6 0.5903132 121 nips-2001-Model-Free Least-Squares Policy Iteration
7 0.58977503 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.58787352 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
9 0.58743054 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
10 0.58729035 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
11 0.58700597 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.58612823 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
13 0.58550978 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
14 0.58544719 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
15 0.58534765 143 nips-2001-PAC Generalization Bounds for Co-training
16 0.58383852 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
17 0.58372587 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
18 0.58249986 8 nips-2001-A General Greedy Approximation Algorithm with Applications
19 0.58144975 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
20 0.57982141 89 nips-2001-Grouping with Bias