nips nips2001 nips2001-187 knowledge-graph by maker-knowledge-mining

187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning


Source: pdf

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. [sent-6, score-0.917]

2 The objective of the learning agent is to have its long-term average reward vector belong to a given target set. [sent-7, score-0.74]

3 Each entry of the reward vector represents a scalar reward (or cost) function which is of interest. [sent-13, score-0.88]

4 Focusing on the long-term average reward, we assume that the desired performance is specified through a given target set, to which the average reward vector should eventually belong. [sent-14, score-0.77]

5 Accordingly, the specified goal of the decision maker is to ensure that the average reward vector will converge to the target set. [sent-15, score-0.722]

6 Following terminology from game theory, we refer to such convergence of the reward vector as approaching the target set. [sent-16, score-1.073]

7 This problem formulation is derived from the so-called theory of approachability that was introduced in [3] in the context of repeated matrix games with vector payoffs. [sent-20, score-0.306]

8 Using a geometric viewpoint, it characterizes the sets in the reward space that a player can guarantee for himself for any possible policy of the other player, and provides appropriate policies for approaching these sets. [sent-21, score-1.071]

9 In this paper we add the learning aspect, and consider the problem of learning such approaching policies on-line, using Reinforcement Learning (RL) or similar algorithms. [sent-23, score-0.445]

10 Their construction relies on a geometric viewpoint, whereby the average reward vector is “steered” in the direction of the target set by the use of direction-dependent (and possibly stationary) control policies. [sent-25, score-0.778]

11 To motivate the steering viewpoint, consider the following one-dimensional example of an automatic temperature-controlling agent. [sent-26, score-0.569]

12 The measured property is the temperature, which should remain in some prescribed range [T_min, T_max]; the agent may activate a cooler or a heater at will. [sent-27, score-0.396]

13 An obvious algorithm that achieves the prescribed temperature range is: when the average temperature is higher than T_max, choose a “policy” that reduces it, namely activate the cooler; and when the average temperature is lower than T_min, use the heater. [sent-28, score-0.881]
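To make this one-dimensional steering rule concrete, here is a minimal Python sketch; the toy dynamics in step_temperature and the thresholds T_MIN, T_MAX are illustrative assumptions, not part of the paper.

```python
import random

T_MIN, T_MAX = 18.0, 22.0  # hypothetical prescribed temperature range

def step_temperature(current, action):
    """Toy dynamics: 'cool' lowers the temperature, 'heat' raises it,
    plus a bounded disturbance standing in for the moves of Nature."""
    drift = -0.5 if action == "cool" else 0.5
    return current + drift + random.uniform(-0.3, 0.3)

def run(n_steps=10_000):
    temp = 20.0
    history_sum, count = 0.0, 0
    for _ in range(n_steps):
        avg = history_sum / count if count else temp
        # Steering rule: drive the *average* temperature back into [T_MIN, T_MAX].
        if avg > T_MAX:
            action = "cool"
        elif avg < T_MIN:
            action = "heat"
        else:
            action = random.choice(["cool", "heat"])  # inside the range, any policy will do
        temp = step_temperature(temp, action)
        history_sum += temp
        count += 1
    return history_sum / count  # long-run average temperature

if __name__ == "__main__":
    print(run())
```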

14 Figure 1: (a) The one-dimensional temperature example, with the heating and cooling policies shown on the temperature axis. (b) The temperature-humidity plane with the target set. [sent-32, score-0.765]

15 If the temperature is higher than T_max the control is to cool, and if the temperature is lower than T_min the control is to heat. [sent-33, score-0.37]

16 This problem is naturally characterized as a multi-objective problem, in which the objective of the controller is to have the average reward in some target set. [sent-39, score-0.655]

17 However, we can reformulate the temperature requirement as an average reward objective by measuring the fraction of times that the temperature is outside the target range, and requiring this fraction to be zero. [sent-41, score-0.973]
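As a sketch of this reformulation, the out-of-range fraction can be encoded as one coordinate of the vector reward (the function name and thresholds are illustrative, not from the paper):

```python
def out_of_range_reward(temperature, t_min=18.0, t_max=22.0):
    """One coordinate of the vector reward: 1 when the temperature is outside
    the prescribed range, 0 otherwise. Requiring the long-term average of this
    coordinate to be zero recovers the original temperature requirement."""
    return 1.0 if (temperature < t_min or temperature > t_max) else 0.0
```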

18 A steering policy for the controller is no longer as simple. [sent-46, score-0.726]

19 In place of the two directions (left/right) of the one-dimensional case, we now face a continuum of possible directions, each associated with a possibly different steering policy. [sent-47, score-0.51]

20 For the purpose of the proposed learning algorithm we shall need to consider only a finite number of steering policies. [sent-48, score-0.501]

21 RL for average reward Markov Decision Processes (MDPs) was suggested in [13, 10] and later analyzed in [1]. [sent-54, score-0.474]

22 Several methods exist for average reward RL, including Q-learning [1], the E^3 algorithm [8], actor-critic schemes [2], and more. [sent-55, score-0.515]

23 The paper is organized as follows: in Section 2 we describe the stochastic game setup, recall approachability theory, and mention a key theorem that allows us to consider only a finite number of directions for approaching a set. [sent-56, score-0.8]

24 We also briefly discuss learning in multi-criteria single controller environments, as this case is a special case of the more general game model. [sent-58, score-0.404]

25 2 Multi-Criteria Stochastic Games In this section we present the multi-criteria stochastic game model. [sent-60, score-0.397]

26 We recall some known results from approachability theory for stochastic games with vector-valued reward, and state a key theorem which decomposes the problem of approaching a target set into a finite number of scalar control problems. [sent-61, score-0.973]

27 We consider a two-person average reward stochastic game model, with a vector-valued reward function. [sent-62, score-1.243]

28 The game is defined by: the state space S; the sets of actions of P1 and P2 in each state s, denoted A and B, respectively; the state transition kernel P = (P(s'|s, a, b)); and a vector-valued reward function m : S × A × B → IR^k. [sent-64, score-0.985]

29 The reward itself is allowed to be random, in which case it is assumed to have a bounded second moment. [sent-65, score-0.372]

30 At each time epoch n ≥ 0, both players observe the current state sn , and then P1 and P2 simultaneously choose actions an and bn , respectively. [sent-66, score-0.359]

31 As a result P1 receives the reward vector mn = m(sn , an , bn ) and the next state is determined according to the transition probability P (·|sn , an , bn ). [sent-67, score-0.755]

32 More generally, we allow the actual reward mn to be random, in which case m(sn , an , bn ) denotes its mean and a bounded second moment is assumed. [sent-68, score-0.573]

33 A policy π ∈ Π for P1 is a mapping which assigns to each possible observed history a mixed action in ∆(A), namely a probability vector over P1’s action set A. [sent-70, score-0.348]

34 A policy of either player is called stationary if the mixed action it prescribes depends only on the current state s_n. [sent-72, score-0.593]

35 Let m̂_n denote the average reward by time n: m̂_n = (1/n) Σ_{t=0}^{n−1} m_t. [sent-73, score-0.774]

36 If the game is finite then this assumption is satisfied if state s∗ is accessible from all other states under any pair of stationary deterministic policies [14]. [sent-78, score-0.609]

37 We often consider the projected game in direction u as the zero-sum stochastic game with the same dynamics as above and scalar rewards r_n := m_n · u. [sent-81, score-1.028]
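A minimal sketch of this projection as an environment wrapper; base_game and its step() interface are hypothetical, not the paper's notation. Any standard scalar average-reward RL algorithm can then be run directly on the wrapper.

```python
import numpy as np

class ProjectedGame:
    """The projected game Gamma(u): same dynamics as the underlying
    vector-reward game, scalar reward m_n · u for a fixed unit direction u."""

    def __init__(self, base_game, u):
        self.base_game = base_game               # assumed vector-reward environment
        self.u = np.asarray(u, dtype=float)

    def step(self, action):
        next_state, m = self.base_game.step(action)   # m is the vector reward
        return next_state, float(np.dot(m, self.u))   # projected scalar reward
```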

38 Denote this game by Γs (u), where s is the initial state. [sent-83, score-0.311]

39 The scalar stochastic game Γ_s(u) has a value, denoted v_{Γ_s}(u), if v_{Γ_s}(u) = sup_π inf_σ liminf_{n→∞} E^s_{πσ}(m̂_n · u) = inf_σ sup_π limsup_{n→∞} E^s_{πσ}(m̂_n · u). [sent-84, score-0.889]

40 We next consider the task of approaching a given target set in the reward space, and introduce approaching policies for the case where the game parameters are fully known to P1. [sent-88, score-1.404]

41 The set T ⊂ IR^k is approachable (from initial state s) if there exists a policy π* of P1 such that d(m̂_n, T) → 0 P^s_{π*,σ}-a.s., for every policy σ of P2. [sent-92, score-0.718]

42 The policy π ∗ in that definition will be called an approaching policy for P1. [sent-95, score-0.809]

43 Noting that approaching a set and its closure are the same, we shall henceforth suppose that the set T is closed. [sent-97, score-0.329]

44 Let φ(π, σ) = E^{s*}_{π,σ}(Σ_{t=0}^{τ−1} m_t) / E^{s*}_{π,σ}(τ) (1) denote the average per-cycle reward vector, which is the expected total reward over the cycle that starts and ends in the reference state, divided by the expected duration of that cycle. [sent-99, score-0.878]
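An empirical counterpart of (1), assuming the sample path has already been cut into cycles at the reference state s*; the data layout here is illustrative rather than the paper's.

```python
import numpy as np

def per_cycle_average(cycle_rewards):
    """Estimate of the per-cycle reward vector phi in (1): total vector reward
    accumulated over completed cycles (each starting and ending in s*),
    divided by the total number of steps in those cycles.
    `cycle_rewards` is a list of cycles, each a list of per-step reward vectors."""
    total = sum(np.sum(np.asarray(cycle), axis=0) for cycle in cycle_rewards)
    duration = sum(len(cycle) for cycle in cycle_rewards)
    return total / duration
```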

45 Assume that for every point x ∉ T there exists a policy π(x) such that (φ(π(x), σ) − C_x) · u_x ≥ 0, ∀σ ∈ Σ (2), where C_x denotes the closest point in T to x and u_x the unit vector pointing from x towards C_x. [sent-103, score-0.377]

46 An approaching policy is: if s_n = s* and m̂_n ∉ T, play π(m̂_n) until the next visit to state s*; otherwise, play arbitrarily. [sent-105, score-1.007]

47 Geometrically, the condition in (2) means that P1 can ensure, irrespective of P2's policy, that the average per-cycle reward will be on the other side (relative to x) of the hyperplane that is perpendicular to the line segment pointing from x to C_x. [sent-108, score-0.474]

48 We shall refer to the direction ux as the steering direction from point x, and to the policy π(x) as the steering policy from x. [sent-109, score-1.602]

49 The approaching policy uses the following rule: between successive visits to the reference state, a fixed (possibly stationary) policy is used. [sent-110, score-0.841]

50 When in the reference state, the current average reward vector m̂_n is inspected. [sent-111, score-0.688]

51 If this vector is not in T, then the steering policy that satisfies (2) with x = m̂_n is selected for the next cycle. [sent-112, score-0.706]

52 Consequently, the average reward is “steered” towards T , and eventually converges to it. [sent-113, score-0.507]
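A sketch of this steering loop for the known-model case; the env, target_set, and steering_policy interfaces are assumptions introduced for illustration, not the paper's notation.

```python
import numpy as np

def approach(env, target_set, steering_policy, n_cycles=1000):
    """Steer the average reward vector towards the target set T.
    Assumed (hypothetical) interfaces:
      env.reward_dim              -- dimension k of the reward vector;
      env.run_cycle(policy)       -- play `policy` from s* until the next visit
                                     to s*, returning the per-step reward vectors;
      target_set.contains(x)      -- membership test for T;
      target_set.closest_point(x) -- the closest point C_x in T to x;
      steering_policy(u)          -- an optimal policy of the projected game Gamma(u)."""
    total, count = np.zeros(env.reward_dim), 0
    policy = steering_policy(np.ones(env.reward_dim) / np.sqrt(env.reward_dim))  # arbitrary start
    for _ in range(n_cycles):
        for m in env.run_cycle(policy):
            total += m
            count += 1
        avg = total / count
        if target_set.contains(avg):
            continue                      # inside T: the next cycle's policy is arbitrary
        u = target_set.closest_point(avg) - avg
        u = u / np.linalg.norm(u)         # steering direction from the current average
        policy = steering_policy(u)
```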

53 Recalling the definition of the projected game in direction u and its value vΓ(u), the condition in (2) may be equivalently stated as vΓ(ux ) ≥ Cx · ux . [sent-114, score-0.464]

54 Furthermore, the policy π(x) can always be chosen as the stationary policy which is optimal for P1 in the game Γ(u x ). [sent-115, score-0.975]

55 In particular, the steering policy π(x) needs to depend only on the corresponding steering direction ux . [sent-116, score-1.211]

56 Standard approachability results, as outlined above, require one to consider an infinite number of steering directions whenever the reward is non-scalar. [sent-118, score-0.996]

57 The corresponding set of steering policies may turn out to be infinite as well. [sent-119, score-0.518]

58 For the purpose of our learning scheme, we shall require an approaching policy which relies on a finite set of steering directions and policies. [sent-120, score-1.118]

59 In the following, let M be an upper bound on the magnitude of the expected one-stage reward vector, so that ||m(s, a, b)|| < M for all (s, a, b) (|| · || denotes the Euclidean norm). [sent-122, score-0.372]

60 Suppose that πi is an optimal strategy in the scalar game Γ(ui ) (1 ≤ i ≤ J). [sent-134, score-0.472]

61 Then the following policy approaches T^ε, the ε-expansion of T: if s_n = s* and m̂_n ∉ T, choose j so that the steering direction u_{m̂_n} is closest to u_j (in Euclidean norm) and play π_j until the next visit to state s*; otherwise, play arbitrarily. [sent-135, score-0.935]
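A sketch of the closest-direction selection used by this policy, reusing the hypothetical target_set helper from the sketch above; the finite set of unit directions is assumed to be chosen in advance.

```python
import numpy as np

def closest_direction_index(avg_reward, target_set, directions):
    """Return the index j of the pre-chosen unit direction u_j closest (in
    Euclidean norm) to the exact steering direction from `avg_reward` towards
    its closest point in the target set T."""
    u = target_set.closest_point(avg_reward) - avg_reward
    u = u / np.linalg.norm(u)
    distances = [np.linalg.norm(u - u_j) for u_j in directions]
    return int(np.argmin(distances))
```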

62 Remark: It follows immediately from the last theorem that the set T itself (rather than its ε-expansion) is approachable with a finite number of steering directions if T^{−ε}, the ε-shrinkage of T, satisfies (2). [sent-140, score-0.649]

63 We consider the controlled Markov model of Section 2, but here we assume that P1, the learning agent, does not know the model parameters, namely the state transition probabilities and reward functions. [sent-143, score-0.538]

64 A policy of P1 that does not rely on knowledge of these parameters will be referred to as a learning policy. [sent-144, score-0.331]

65 P1’s task is to approach a given target set T , namely to ensure convergence of the average reward vector to this set irrespective of P2’s actions. [sent-145, score-0.661]

66 The proposed learning algorithm relies on the previous section's construction of approaching policies that use only a finite number of steering directions. [sent-146, score-0.874]

67 Recall that each such game is a standard zero-sum stochastic game with average reward. [sent-148, score-0.81]

68 The required learning algorithm for game Γ(u) should secure an average reward that is not less than the value vΓ(u) of that game. [sent-149, score-0.867]

69 Consider a zero-sum stochastic game with reward function r(s, a, b), average reward r̂_n, and value v. [sent-150, score-0.932]

70 We say that a learning policy π of P1 is ε-optimal in this game if, for any policy σ of P2, the average reward satisfies liminf_{n→∞} r̂_n ≥ v − ε, P_{πσ}-a.s. [sent-152, score-1.511]

71 , where Pπσ is the probability measure induced by the algorithm π, P2’s policy σ and the game dynamics. [sent-154, score-0.642]

72 Note that P1 may be unable to learn a min-max policy as P2 may play an inferior policy and refrain from playing certain actions, thereby keeping some parts of the game unobserved. [sent-155, score-0.977]

73 Remark: RL for average reward zero-sum stochastic games can be devised in a similar manner to average reward Markov decision processes. [sent-156, score-1.212]

74 The algorithm there is similar to the E^3 algorithm [8], which is based on an explicit exploration-exploitation tradeoff and on estimation of the game's reward and transition structure. [sent-160, score-0.793]

75 Here ε is the approximation level and M is a known bound on the norm of the expected reward per step. [sent-163, score-0.372]

76 When arriving at s*, the decision maker checks whether the average reward vector is outside the set T. [sent-170, score-0.593]

77 In that case, he switches to an appropriate policy that is intended to “steer” the average reward vector towards the target set. [sent-171, score-0.925]

78 The steering policy π_j is chosen as the one whose direction u_j is closest to the actual steering direction dictated by the problem geometry. [sent-172, score-0.832]

79 Recall that each πj is actually a learning policy with respect to a scalar reward function. [sent-173, score-0.807]

80 Note however that some “off-policy” algorithms (such as Q-learning) can learn the optimal policy even while playing a different policy. [sent-175, score-0.372]

81 While s_n ≠ s*, play according to π_i; the reward π_i receives is u_i · m_n. [sent-192, score-0.778]
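A sketch of one MCRL cycle combining the pieces above. The learner, env, target_set, and avg_tracker interfaces, and the closest_direction_index helper from the earlier sketch, are illustrative assumptions rather than the paper's pseudo-code.

```python
import numpy as np

def mcrl_cycle(env, learners, directions, target_set, avg_tracker, reference_state):
    """One cycle of the MCRL scheme: pick a scalar average-reward learner
    according to the current steering direction, then run it (feeding it the
    projected reward u_j · m_n) until the next visit to the reference state s*.
      learners[j]  -- scalar RL algorithm (e.g. Q-learning) for Gamma(u_j);
      avg_tracker  -- maintains the running average reward vector m_hat."""
    m_hat = avg_tracker.value()
    if target_set.contains(m_hat):
        learner, u = learners[0], directions[0]      # inside T: arbitrary choice
    else:
        j = closest_direction_index(m_hat, target_set, directions)  # helper above
        learner, u = learners[j], directions[j]
    state = reference_state
    while True:
        action = learner.act(state)
        next_state, m = env.step(action)             # m is the vector reward
        learner.update(state, action, float(np.dot(m, u)), next_state)
        avg_tracker.add(m)
        state = next_state
        if state == reference_state:
            break
```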

82 If a direction is not played infinitely often it has a negligible effect on the long term average reward vector. [sent-199, score-0.627]

83 Since the learning algorithms are nearly optimal, any policy π_j that is played infinitely often eventually attains a (scalar) average reward of at least v_Γ(u_j) − ε/2. [sent-200, score-0.924]

84 One can then apply the finite-direction approachability theorem for the set T^{ε/2} to verify that the overall policy is an approaching policy for the target set. [sent-202, score-0.938]

85 Note that for convex target sets the algorithm is consistent in the sense that if the set is approachable then the algorithm attains it. [sent-203, score-0.339]

86 Remark: Multi-criteria Markov Decision Process (MDP) models may be regarded as a special case of the stochastic game model that was considered so far, with P2 eliminated from the problem. [sent-204, score-0.397]

87 These are generally simpler than for the game problem. [sent-207, score-0.311]

88 Indeed, any point in the feasible set of state-action frequencies may be achieved by such a stationary policy [5]. [sent-211, score-0.348]

89 Another observation is that all steering policies learned and used within the MCRL may now be deterministic stationary policies, which simplifies the implementation of this algorithm. [sent-213, score-0.576]

90 A sample path of the average reward generated by the MCRL algorithm is shown in Figure 4(b). [sent-220, score-0.551]

91 It can be seen that the learning algorithm pushes towards the target set so that the path is mostly on the edge of the target set. [sent-223, score-0.376]

92 Note that in this example a small number of directions was quite enough for approaching the target set. [sent-224, score-0.452]

93 Figure 4: (a) Problem dynamics for the different strategies. (b) A sample path of the average reward from 'S' to 'E'. [sent-225, score-0.51]

94 5 Conclusion We have presented a learning algorithm that approaches a prescribed target set in a multi-dimensional performance space, provided this set satisfies a certain sufficient condition. [sent-240, score-0.302]

95 Our approach essentially relies on the theory of approachability for stochastic games, which is based on the idea of steering the average reward vector towards the target set. [sent-241, score-1.296]

96 We provided a key result stating that a set can be approached to a given precision using only a finite number of steering policies, which may be learned on-line. [sent-242, score-0.422]

97 An interesting observation regarding the proposed learning algorithm is that the learned optimal policies in each direction are essentially independent of the target set T. [sent-243, score-0.34]

98 Of further interest is the question of reduction of the number of steering directions used in the algorithm. [sent-246, score-0.478]

99 The scaling of the algorithm with the dimension of the reward space is exponential. [sent-249, score-0.413]

100 Average reward reinforcement learning: Foundations, algorithms, and empirical results. [sent-314, score-0.457]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('steering', 0.384), ('reward', 0.372), ('game', 0.311), ('policy', 0.29), ('approaching', 0.229), ('temperature', 0.185), ('mcrl', 0.183), ('mn', 0.15), ('approachability', 0.146), ('policies', 0.134), ('target', 0.129), ('sn', 0.128), ('games', 0.128), ('approachable', 0.128), ('irk', 0.11), ('scalar', 0.104), ('average', 0.102), ('directions', 0.094), ('humidity', 0.092), ('ux', 0.087), ('cx', 0.087), ('stochastic', 0.086), ('reinforcement', 0.085), ('rl', 0.073), ('ui', 0.072), ('nite', 0.072), ('state', 0.071), ('direction', 0.066), ('uj', 0.064), ('agent', 0.064), ('inf', 0.064), ('actions', 0.061), ('played', 0.06), ('satis', 0.06), ('stationary', 0.058), ('play', 0.056), ('cooler', 0.055), ('prescribed', 0.055), ('controller', 0.052), ('bn', 0.051), ('decision', 0.05), ('players', 0.048), ('erent', 0.047), ('player', 0.046), ('nitely', 0.046), ('markov', 0.046), ('relies', 0.045), ('recurrence', 0.043), ('theorem', 0.043), ('modelled', 0.043), ('learning', 0.041), ('algorithm', 0.041), ('lim', 0.041), ('su', 0.04), ('es', 0.039), ('sup', 0.038), ('approached', 0.038), ('recall', 0.037), ('adversary', 0.037), ('blackwell', 0.037), ('greenhouse', 0.037), ('heater', 0.037), ('maker', 0.037), ('mannor', 0.037), ('payo', 0.037), ('polices', 0.037), ('shimkin', 0.037), ('steered', 0.037), ('umn', 0.037), ('suppose', 0.036), ('path', 0.036), ('environment', 0.036), ('assumption', 0.035), ('di', 0.035), ('shall', 0.035), ('markovian', 0.035), ('viewpoint', 0.033), ('remark', 0.033), ('eventually', 0.033), ('reference', 0.032), ('vector', 0.032), ('possibly', 0.032), ('unit', 0.031), ('brie', 0.031), ('strategy', 0.031), ('playing', 0.03), ('expansion', 0.03), ('henceforth', 0.029), ('goto', 0.029), ('ball', 0.029), ('technion', 0.029), ('transition', 0.028), ('illustrative', 0.027), ('negligible', 0.027), ('visit', 0.027), ('algorithms', 0.026), ('closest', 0.026), ('arbitrarily', 0.026), ('namely', 0.026), ('optimal', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

2 0.27806839 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

3 0.22672704 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.

4 0.21477942 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

5 0.20058216 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.       ¤ ¨ ¦ ¢ ©§¥¤£¡ ¦ ¤ ¨ £¡ ¨ ¤¢  ¢

6 0.1914089 59 nips-2001-Direct value-approximation for factored MDPs

7 0.16092041 40 nips-2001-Batch Value Function Approximation via Support Vectors

8 0.15655145 128 nips-2001-Multiagent Planning with Factored MDPs

9 0.15398884 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

10 0.14052999 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

11 0.13610949 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

12 0.11463653 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

13 0.1145839 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

14 0.1138887 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

15 0.10777524 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

16 0.10735416 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

17 0.10124508 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

18 0.09792196 126 nips-2001-Motivated Reinforcement Learning

19 0.091998272 36 nips-2001-Approximate Dynamic Programming via Linear Programming

20 0.080513507 99 nips-2001-Intransitive Likelihood-Ratio Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.21), (1, -0.132), (2, 0.409), (3, 0.04), (4, 0.039), (5, 0.132), (6, -0.041), (7, -0.039), (8, 0.049), (9, 0.019), (10, -0.006), (11, -0.061), (12, -0.028), (13, -0.053), (14, -0.04), (15, 0.054), (16, -0.023), (17, 0.027), (18, -0.082), (19, 0.017), (20, -0.199), (21, 0.082), (22, 0.032), (23, 0.154), (24, -0.0), (25, -0.009), (26, -0.156), (27, 0.024), (28, -0.08), (29, -0.023), (30, -0.034), (31, -0.089), (32, 0.01), (33, -0.031), (34, 0.039), (35, -0.066), (36, 0.019), (37, -0.017), (38, 0.043), (39, -0.003), (40, 0.057), (41, -0.042), (42, 0.049), (43, 0.02), (44, 0.047), (45, 0.052), (46, 0.076), (47, 0.05), (48, -0.015), (49, -0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95888263 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

2 0.84853661 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.

3 0.7426272 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

Author: Michael L. Littman, Michael J. Kearns, Satinder P. Singh

Abstract: We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games. 1

4 0.70335329 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1

5 0.69358999 121 nips-2001-Model-Free Least-Squares Policy Iteration

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We propose a new approach to reinforcement learning which combines least squares function approximation with policy iteration. Our method is model-free and completely off policy. We are motivated by the least squares temporal difference learning algorithm (LSTD), which is known for its efficient use of sample experiences compared to pure temporal difference algorithms. LSTD is ideal for prediction problems, however it heretofore has not had a straightforward application to control problems. Moreover, approximations learned by LSTD are strongly influenced by the visitation distribution over states. Our new algorithm, Least Squares Policy Iteration (LSPI) addresses these issues. The result is an off-policy method which can use (or reuse) data collected from any source. We have tested LSPI on several problems, including a bicycle simulator in which it learns to guide the bicycle to a goal efficiently by merely observing a relatively small number of completely random trials.

6 0.65664154 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.54847318 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

8 0.54264224 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

9 0.50997216 59 nips-2001-Direct value-approximation for factored MDPs

10 0.50119209 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

11 0.48457512 40 nips-2001-Batch Value Function Approximation via Support Vectors

12 0.45962527 128 nips-2001-Multiagent Planning with Factored MDPs

13 0.4379063 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

14 0.43487599 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

15 0.41408202 177 nips-2001-Switch Packet Arbitration via Queue-Learning

16 0.39557612 126 nips-2001-Motivated Reinforcement Learning

17 0.38900125 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning

18 0.38382038 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

19 0.36510831 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

20 0.36202401 161 nips-2001-Reinforcement Learning with Long Short-Term Memory


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.027), (17, 0.033), (19, 0.023), (20, 0.021), (27, 0.079), (30, 0.059), (38, 0.024), (59, 0.027), (72, 0.066), (79, 0.028), (83, 0.425), (91, 0.11)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87015402 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

2 0.83545095 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

3 0.79818654 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty. The classifier is implemented as hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

4 0.52470499 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter

Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E (A − E [A])2 , where a2 denotes a a, and a denotes the transpose of the column vector a. GPOMDP Algorithm The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation def µu(y) η, but the βη = E β η approaches the true gradient µu(y) Jβ (j) . As β → 1, def variance increases. We consider a slight modification [2]: with Jt = def ∆T = 1 T T −1 t=0 2T s=t µUt (Yt ) Jt+1 . µUt (Yt ) β s−t r(Xs ), (1) Throughout this paper the process {Xt , Yt , Ut , Xt+1 } is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X0 ∼π (ie the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution. Dependent Samples Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P ) is defined as def τ = min t > 1 : max dT V i,j Pt i , Pt j ≤ e−1 , where [P t ]i denotes the ith row of P t and dT V is the total variation distance, dT V (P, Q) = i |P (i) − Q(i)|. Theorem 1 Let M = (S, P ) be a finite ergodic Markov chain, with mixing time τ , and 2|S|e and 0 ≤ α < let π be its stationary distribution. There are constants L < exp(−1/(2τ )), which depend only on M , such that, for all f : S → R and all t, Covπ (t) ≤ Lαt Varπ (f), where Varπ (f) is the variance of f under π, and Covπ (t) is f f the auto-covariance of the process {f(Xt )}, where the process {Xt } is generated by M with initial distribution π. Hence, for some constant Ω∗ ≤ 4Lτ , Var 1 T T −1 f(Xt ) t=0 ≤ Ω∗ Varπ (f). T We call (L, τ ) the mixing constants of the Markov chain M (or of the controlled POMDP D; ie the Markov chain (S, P )). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M , Covπ (t) ≤ Rt (M )Varπ (f), f 2 t for some Rt (M ). We then show that Rt (M ) < 2|S| exp(− τ ). In fact, for a reversible chain, we can choose L = 1 and α = |λ2 |, the second largest magnitude eigenvalue of P . 2 Baseline We consider an alteration of (1), def ∆T = 1 T T −1 µUt (Yt ) (Jt+1 − Ar (Yt )) . µUt (Yt ) t=0 (2) For any baseline Ar : Y → R, it is easy to show that E [∆T ] = E [∆T ]. Thus, we select Ar to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with Jt replaced by the exact value function. Theorem 2 Suppose that D = (S, U, Y, P , ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ ), {Xt , Yt , Ut , Xt+1 } is a process generated by D with X0 ∼π ,Ar : Y → R is a baseline that is uniformly bounded by M, and J (j) has the distribution of ∞ β s r(Xt ), where the states Xt are generated by D starting in s=0 X0 = j. Then there are constants C ≤ 5B2 R(R + M) and Ω ≤ 4Lτ ln(eT ) such that Var 1 T T −1 t=0 µUt (Yt ) Ω (Jt+1 −Ar (Yt )) ≤ Varπ µUt (Yt ) T + Ω E T µu (y) (J (j) − Jβ (j)) µu (y) µu (y) (Jβ (j)−Ar (y)) µu (y) 2 + Ω +1 T C βT , (1 − β)2 where, as always, (i, y, u, j) are generated iid with i∼π, y∼ν(i), u∼µ(y) and j∼P i (u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse. 
The second term on the right hand side shows the error associated with replacing an unbiased, uncorrelated estimate of the value function with the true value function. This quantity is not dependent on the baseline. The final term on the right hand side arises from the truncation of the discounted reward— and is exponentially decreasing. We now concentrate on minimizing the variance σ 2 = Varπ r µu (y) (Jβ (j) − Ar (y)) , µu (y) (3) which the following lemma relates to the variance σ 2 without a baseline, µu (y) Jβ (j) . µu (y) σ 2 = Varπ Lemma 3 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline Ar : Y → R, and for i∼π, y∼ν(i), u∼µ(y) and j∼Pi (u), σ 2 = σ 2 + E A2 (y) E r r µu (y) µu (y) 2 y − 2Ar (y)E µu (y) µu (y) 2 Jβ (j) y . From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem. Theorem 4 For the controlled POMDP as in Lemma 3,  2 µu (y) min σ 2 = σ 2 − E  E Jβ (j) y r Ar µu (y) 2 /E µu (y) µu (y) 2 y and this minimum is attained with the baseline 2 µu (y) µu (y) A∗ (y) = E r Jβ (j) , 2 µu (y) µu (y) /E y  y . Furthermore, the optimal constant baseline is µu (y) µu (y) A∗ = E r 2 Jβ (j) /E µu (y) µu (y) 2 . The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function. Theorem 5 If Ar : Y → R is a baseline function, A∗ is the optimal baseline defined in r Theorem 4, and σ 2 is the variance of the corresponding estimate, then r∗ µu (y) µu (y) σ 2 = σ2 + E r r∗ 2 (Ar (y) − A∗ (y)) r 2 , where i∼π, y ∼ν(i), and u∼µ(y). Furthermore, the same result is true for the case of constant baselines, with Ar (y) replaced by an arbitrary constant baseline Ar , and A∗ (y) r replaced by A∗ , the optimum constant baseline defined in Theorem 4. r For the constant baseline Ar = E i∼π [Jβ (i)], Theorem 5 implies that σ 2 is equal to r min Ar ∈R σ2 r + E µu (y) µu (y) 2 E [Jβ (j)] − E µu (y) µu (y) 2 2 /E Jβ (j) µu (y) µu (y) 2 . Thus, its performance depends on the random variables ( µu (y)/µu (y))2 and Jβ (j); if they are nearly independent, E [Jβ ] is a good choice. 3 Fixed Value Functions: Actor-Critic Methods We consider an alteration of (1), ˜ def 1 ∆T = T T −1 t=0 µUt (Yt ) ˜ Jβ (Xt+1 ), µUt (Yt ) (4) ˜ for some fixed value function Jβ : S → R. Define ∞ def β k d(Xk , Xk+1 ) Aβ (j) = E X0 = j , k=0 def ˜ ˜ where d(i, j) = r(i) + β Jβ (j) − Jβ (i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of µu (y) ˜ Aβ (j) . β η − E ∆T = E µu (y) The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1. Theorem 6 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X0∼π, Yt∼ν(Xt ), Ut∼µ(Yt ), Xt+1∼PXt (Ut )}, E ˜ ∆T − βη 2 ≤ Ω∗ Varπ T µu (y) ˜ Jβ (j) + E µu (y) µu (y) Aβ (j) µu (y) 2 , where the second expectation is over i∼π, y∼ν(i), u∼µ(y), and j∼P i (u). ˜ If we write Jβ (j) = Jβ (j) + v(j), then by selecting v = (v(1), . . . , v(|S|)) from the right def null space of the K × |S| matrix G, where G = i,y,u πi νy (i) µu (y)Pi (u), (4) will produce an unbiased estimate of β η. An obvious example of such a v is a constant vector, (c, c, . . . , c) : c ∈ R. 
We can use this to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c, for some 0 < c ≤ R. Then Jβ (j) = c/(1 − β) and β η = 0. If we choose v = (−c/(1 − β), . . . , −c/(1 − β)) and ˜ ˜ Jβ (j) = Jβ (j) + v(j), then µµu(y) Jβ (j) = 0 for all y, u, j, and so (4) gives an unbiased u(y) estimate of β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µu (y) > 0 and µu (y) = 0, choosing ˜β (j) = Jβ (j) gives a variance J greater than zero—there is a non-zero probability event, (Xt = i, Yt = y, Ut = u, Xt+1 = j), such that µµu(y) Jβ (j) = 0. u(y) 4 Algorithms Given a parameterized class of baseline functions Ar (·, θ) : Y → R θ ∈ RL , we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB Algorithm produces an estimate ∆ S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ 2 . We omit the details (see [8]). r GDORB Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized baseline Ar . set z0 = 0 (z0 ∈ RL ), ∆0 = 0 (∆0 ∈ RL ) for all {is , ys , us , is+1 , ys+1 } generated by the POMDP do zs+1 = βzs + ∆s+1 = ∆s + end for Ar (ys ) 1 s+1 µus(ys ) µus(ys ) 2 ((Ar (ys ) − βAr (ys+1 ) − r(xs+1 )) zs+1 − ∆s ) ˜ For a parameterized class of fixed value functions {Jβ (·, θ) : S → R θ ∈ RL }, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the baseline function. The GBTE Algorithm produces an estimate ∆S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient. GBTE Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized fixed value function ˜β . J set z0 = 0 (z0 ∈ RK ), ∆A0 = 0 (∆A0 ∈ R1×L ), ∆B 0 = 0 (∆B 0 ∈ RK ), ∆C 0 = 0 (∆C 0 ∈ RK ) and ∆D0 = 0 (∆D0 ∈ RK×L ) for all {is , ys , us , is+1 , is+2 } generated by the POMDP do µ s(y ) zs+1 = βzs + µuu(yss ) s µus(ys ) ˜ µus(ys ) Jβ (is+1 ) µus(ys ) µus(ys ) ˜ Jβ (is+1 ) ∆As+1 = ∆As + 1 s+1 ∆B s+1 = ∆B s + 1 s+1 µus(ys ) ˜ µus(ys ) Jβ (is+1 ) ∆C s+1 = ∆C s + 1 s+1 ˜ ˜ r(is+1 ) + β Jβ (is+2 ) − Jβ (is+1 ) zs+1 − ∆C s ∆Ds+1 = ∆Ds + 1 s+1 µus(ys ) µus(ys ) ∆s+1 = end for Ω∗ T ∆As+1 − − ∆B s ˜ Jβ (is+1 ) Ω∗ T ∆B s+1 ∆D s+1 − ∆As − ∆D s − ∆C s+1 ∆Ds+1 5 Experiments Experimental results comparing these GPOMDP variants for a simple three state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function Jβ . Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case. 
The trained fixed value function was not optimal though, as Jβ (j) − A∗ is a valid choice of fixed value function. r The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes—given the simplicity of the system—very large. References [1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999. [2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and Systems Sciences, 2002. To appear. [3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983. [4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001. [5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001. [6] M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press, 2000. Exact Value Function—Mean Error Exact Value Function—One Standard Deviation 0.4 0.4 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain Relative Norm Difference Relative Norm Difference 0.25 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain 0.35   0.3 0.2 0.15 0.1 0.05 ¡ 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 1000 T Online—Mean Error 100000 1e+06 1e+07 Online—One Standard Deviation 1 1 GPOMDP BL-online FVF-online 0.8 Relative Norm Difference Relative Norm Difference 10000 T 0.6 0.4 0.2 0 GPOMDP BL-online FVF-online 0.8 0.6 0.4 0.2 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 T 1000 10000 100000 1e+06 1e+07 T Figure 1: Three state experiments—relative norm error ∆ est − η / η . Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of Jβ ) computed by: GPOMDP [GPOMDP-Jβ ]; with baseline Ar = [Jβ ] [BL- [Jβ ]]; with optimum baseline [BL-A∗ (y)]; with optimum constant baseline [BL-A∗ ]; with pre-trained fixed value r r function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVFonline]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω∗ /T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online. ¢ ¢ [7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990. [8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002. [9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), pages 152–160, 1997. [10] V. R. 
Konda and J. N. Tsitsiklis. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000. [11] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT, 1998. [12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Ann. Oper. Res., 27:175–211, 1991. [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge MA, 1998. ISBN 0-262-19398-1. [14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000. [15] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.

5 0.51730889 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.

6 0.49293551 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

7 0.47176838 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

8 0.4682081 56 nips-2001-Convolution Kernels for Natural Language

9 0.46748787 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

10 0.46595114 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

11 0.45423526 121 nips-2001-Model-Free Least-Squares Policy Iteration

12 0.45041174 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

13 0.44614285 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

14 0.44130677 118 nips-2001-Matching Free Trees with Replicator Equations

15 0.4410643 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

16 0.43972939 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

17 0.43843627 13 nips-2001-A Natural Policy Gradient

18 0.43529391 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

19 0.43485025 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

20 0.43392015 169 nips-2001-Small-World Phenomena and the Dynamics of Information