nips nips2008 nips2008-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jonathan Taylor, Doina Precup, Prakash Panagaden
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
Reference: text
sentIndex sentText sentNum sentScore
1 We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. [sent-9, score-0.374]
2 We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). [sent-10, score-0.349]
3 We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). [sent-11, score-0.563]
4 We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. [sent-14, score-0.297]
5 A significant problem is computing the optimal strategy when the state and action space are very large and/or continuous. [sent-17, score-0.136]
6 A popular approach is state abstraction, in which states are grouped together in partitions, or aggregates, and the optimal policy is computed over these. [sent-18, score-0.254]
7 Bisimulation is a well-known, well-studied notion of behavioral equivalence between systems (Larsen & Skou, 1991; Milner, 1995) which has been specialized for MDPs by Givan et al. (2003). [sent-22, score-0.076]
8 One of the disadvantages of bisimulation and the corresponding metrics is that they require that the behavior matches for exactly the same actions. [sent-25, score-0.437]
9 However, in many cases of practical interest, actions with the exact same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between states by matching their behavior with different actions. [sent-26, score-0.212]
10 MDP homomorphisms specify a map matching equivalent states as well as equivalent actions in such states. [sent-28, score-0.287]
11 However, like any equivalence relations in probabilistic systems, MDP homomorphisms are brittle: a small change in the transition probabilities or the rewards can cause two previously equivalent state-action pairs to become distinct. [sent-30, score-0.222]
12 As a solution to this problem, Ravindran & Barto (2004) proposed using approximate homomorphisms, which allow aggregating states that are not exactly equivalent. [sent-32, score-0.085]
13 In this paper, we attempt to construct provably good, approximate MDP homomorphisms from first principles. [sent-38, score-0.135]
14 First, we relate the notion of MDP homomorphisms to the concept of lax bisimulation, explored recently in the process algebra literature (Arun-Kumar, 2006). [sent-39, score-0.372]
15 This allows us to define a metric on states, similarly to existing bisimulation metrics. [sent-40, score-0.522]
16 We show that the difference in the optimal value function of two states is bounded above by this metric. [sent-42, score-0.102]
17 This allows us to provide a state aggregation algorithm with provable approximation guarantees. [sent-43, score-0.128]
18 We illustrate empirically the fact that this approach can provide much better state space compression than the use of existing bisimulation metrics. [sent-44, score-0.447]
19 For the purpose of this paper, the state space S is assumed to be finite, but the action set A could be finite or infinite (as will be detailed later). [sent-46, score-0.119]
20 A deterministic policy π : S → A specifies which action should be taken in every state. [sent-48, score-0.147]
21 By following policy π from state s, an agent can expect a value of $V^\pi(s) = E\left(\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_0 = s, \pi\right)$, where γ ∈ (0, 1) is a discount factor and $r_t$ is the sample reward received at time t. [sent-49, score-0.163]
22 Given the optimal value function, an optimal policy is easily inferred by simply taking at every state the greedy action with respect to the one-step lookahead value. [sent-51, score-0.235]
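The value function and the greedy one-step lookahead described above can be sketched as follows. This is a minimal illustration, assuming a finite MDP stored in nested dictionaries; the names `value_iteration`, `greedy_policy`, `R` and `P` are hypothetical and not taken from the paper.

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-10):
    """Approximate the optimal value function of a finite MDP.

    Assumed layout (illustrative only): R[s][a] is the expected reward and
    P[s][a] is a dict mapping next states to their probabilities.
    """
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        # Stop once successive iterates agree up to the tolerance.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new


def greedy_policy(states, actions, R, P, V, gamma=0.9):
    """Pick, in every state, the action with the best one-step lookahead value."""
    return {
        s: max(
            actions,
            key=lambda a: R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items()),
        )
        for s in states
    }
```

Applying `greedy_policy` to the output of `value_iteration` gives a deterministic policy of the form π : S → A.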
23 Ideally, if the state space is very large, “similar” states should be grouped together in order to speed up this type of computation. [sent-53, score-0.155]
24 A relation E ⊆ S × S is a bisimulation relation if: sEu ⇔ ∀a. [sent-56, score-0.457]
25 The relation ∼ is the union of all bisimulation relations and two states in an MDP are said to be bisimilar if s ∼ u. [sent-59, score-0.586]
26 From this definition, it follows that bisimilar states can match each other’s actions to achieve the same returns. [sent-60, score-0.235]
27 Hence, bisimilar states have the same optimal value (Givan et al. [sent-61, score-0.187]
28 However, bisimulation is not robust to small changes in the rewards or the transition probabilities. [sent-63, score-0.429]
29 One way to avoid this problem is to quantify the similarity between states using a (pseudo)-metric. [sent-64, score-0.123]
30 Ferns et al. (2004) proposed a bisimulation metric, defined as the least fixed point of the following operator on the lattice of 1-bounded metrics $d : S \times S \to [0, 1]$: $G(d)(s, u) = \max_a \left( c_r |R(s, a) - R(u, a)| + c_p K(d)(P(s, a, \cdot), P(u, a, \cdot)) \right)$ (1). The first term above measures reward similarity. [sent-66, score-0.464]
31 The second term is the Kantorovich metric between the probability distributions of the two states. [sent-67, score-0.129]
32 Given probability distributions P and Q over the state space S, and a semimetric d on S, the Kantorovich metric K(d)(P, Q) is defined by the following linear program: $\max \sum_{i=1}^{|S|} (P(s_i) - Q(s_i)) v_i$ subject to: ∀i, j. [sent-68, score-0.248]
33 Ferns et al. (2004) showed that by applying (1) iteratively, the least fixed point $e_{fix}$ can be obtained, and that s and u are bisimilar if and only if $e_{fix}(s, u) = 0$. [sent-74, score-0.531]
34 In other words, bisimulation is the kernel of this metric. [sent-75, score-0.393]
35 3 Lax bisimulation In many cases of practical interest, actions with exactly the same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between different actions at certain states. [sent-76, score-0.587]
36 Because of symmetry, going south in state N6 is “equivalent” to going north in state S6. [sent-78, score-0.108]
37 Recent work in process algebra has rethought the definition of bisimulation to allow certain distinct actions to be essentially equivalent (Arun-Kumar, 2006). [sent-80, score-0.46]
38 Here, we define lax bisimulation in the context of MDPs. [sent-81, score-0.614]
39 A relation B is a lax (probabilistic) bisimulation relation if whenever sBu we have that: ∀a ∃b such that R(s, a) = R(u, b) and for all B-closed sets X we have that Pr(X|s, a) = Pr(X|u, b), and vice versa. [sent-83, score-0.678]
40 The lax bisimulation ∼ is the union of all the lax bisimulation relations. [sent-84, score-1.228]
41 It is easy to see that B is an equivalence relation and we denote the equivalence classes of S by S/B. [sent-85, score-0.104]
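The definition of a lax bisimulation relation can be checked mechanically on a small finite MDP. The sketch below assumes the relation is given as a partition of the states (its equivalence classes), so that B-closed sets are unions of blocks and it suffices to compare probabilities block by block. The function names and the dictionary layout are hypothetical.

```python
def block_probs(P, s, a, partition):
    """Probability of landing in each block of the partition from (s, a)."""
    return tuple(
        round(sum(P[s][a].get(t, 0.0) for t in block), 12) for block in partition
    )


def is_lax_bisimulation(partition, actions, R, P, tol=1e-12):
    """Check the lax bisimulation condition for every pair of states in a block.

    Assumed layout (illustrative only): R[s][a] is a reward, P[s][a] is a dict
    of next-state probabilities, and every action is available in every state.
    """
    def matched(s, u):
        # Every action a at s needs some action b at u with the same reward
        # and the same probability of reaching each block.
        return all(
            any(
                abs(R[s][a] - R[u][b]) <= tol
                and block_probs(P, s, a, partition) == block_probs(P, u, b, partition)
                for b in actions
            )
            for a in actions
        )

    return all(
        matched(s, u) and matched(u, s)
        for block in partition
        for s in block
        for u in block
    )
```

Because the matching action b may differ from a, two mirror-image states can be grouped even though no single action label behaves the same way in both.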
42 Note that the definition above assumes that any action can be matched by any other action. [sent-86, score-0.065]
43 However, the set of actions that can be used to match another action can be restricted based on prior knowledge. [sent-87, score-0.154]
44 Lax bisimulation is very closely related to the idea of MDP homomorphisms (Ravindran & Barto, 2003). [sent-88, score-0.528]
45 We now show that homomorphisms are identical to lax probabilistic bisimulation. [sent-92, score-0.356]
46 Two states s and u are bisimilar if and only if they are related by some MDP homomorphism $\langle f, \{g_s : s \in S\} \rangle$ in the sense that $f(s) = f(u)$. [sent-94, score-0.221]
47 Proof: For the first direction, let h be an MDP homomorphism and define the relation B such that sBu iff $f(s) = f(u)$. [sent-95, score-0.107]
48 Since $g_u$ is a surjection to A, there must be some b ∈ A with $g_u(b) = g_s(a)$. [sent-96, score-0.851]
49 Hence, $R(s, a) = R'(f(s), g_s(a)) = R'(f(u), g_u(b)) = R(u, b)$. Let X be a non-empty B-closed set such that $f^{-1}(f(s')) = X$ for some s′. [sent-97, score-0.634]
50 Then: $P(s, a, X) = P'(f(s), g_s(a), f(s')) = P'(f(u), g_u(b), f(s')) = P(u, b, X)$, so B is a lax bisimulation relation. [sent-98, score-1.248]
51 For the other direction, let B be a lax bisimulation relation. [sent-99, score-0.614]
52 We will construct an MDP homomorphism in which sBu =⇒ f (s) = f (u). [sent-100, score-0.075]
53 Consider the partition S/B induced by the equivalence relation B on set S. [sent-101, score-0.094]
54 For each equivalence class X ∈ S/B, we choose a representative state $s_X \in X$ and define $f(s_X) = s_X$ and $g_{s_X}(a) = a$, ∀a ∈ A. [sent-102, score-0.114]
55 Then, we have: $P'(f(s), g_s(a), f(s')) = P'(f(s_X), b', f^{-1}(f(s'))) = P(s_X, b, f^{-1}(f(s'))) = P(s, a, f^{-1}(f(s')))$. Also, $R'(f(s), g_s(a)) = R'(f(s_X), b) = R(s_X, a)$. [sent-108, score-0.834]
56 ⋄ 4 A metric for lax bisimulation We will now define a lax bisimulation metric for measuring similarity between state-action pairs, following the approach used by Ferns et al. [sent-110, score-1.552]
57 We want to say that states s and u are close exactly when every action of one state is close to some action available in the other state. [sent-112, score-0.269]
58 In order to capture this meaning, we first define similarity between state-action pairs, then we lift this to states using the Hausdorff metric (Munkres, 1999). [sent-113, score-0.237]
59 Given a 1-bounded semi-metric d on S, the metric $\delta(d) : S \times A \to [0, 1]$ is defined as follows: $\delta(d)((s, a), (u, b)) = c_r |R(s, a) - R(u, b)| + c_p K(d)(P(s, a, \cdot), P(u, b, \cdot))$. We now have to measure the distance between the set of actions at state s and the set of actions at state u. [sent-116, score-0.737]
60 Given a metric between pairs of points, the Hausdorff metric can be used to measure the distance between sets of points. [sent-117, score-0.258]
61 Given a finite 1-bounded metric space (M , d), let P (M ) be the set of compact spaces (e. [sent-120, score-0.15]
62 The Hausdorff metric $H(d) : P(M) \times P(M) \to [0, 1]$ is defined as: $H(d)(X, Y) = \max\left(\sup_{x \in X} \inf_{y \in Y} d(x, y), \sup_{y \in Y} \inf_{x \in X} d(x, y)\right)$. Definition 6. [sent-123, score-0.129]
63 We define the operator $F : M \to M$ as $F(d)(s, u) = H(\delta(d))(X_s, X_u)$. We note that the same definition can be applied both for discrete and for compact continuous action spaces. [sent-126, score-0.086]
64 If the action set is compact then $X_s = \{s\} \times A$ is also compact, so the Hausdorff metric is still well defined. [sent-127, score-0.215]
65 For simplicity, we consider the discrete case, so that max and min are defined. [sent-128, score-0.065]
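In the discrete case, where sup and inf become max and min, the Hausdorff lifting of a point metric to finite sets can be sketched as below. The function name `hausdorff` and the callable `d` are illustrative assumptions. In the setting of the text, the points would be state-action pairs and `d` would be δ(d).

```python
def hausdorff(X, Y, d):
    """Hausdorff distance between two finite, non-empty collections X and Y.

    d is any callable returning the distance between two points.
    """
    # Worst case over X of the distance to the closest point of Y ...
    forward = max(min(d(x, y) for y in Y) for x in X)
    # ... and the worst case over Y of the distance to the closest point of X.
    backward = max(min(d(x, y) for x in X) for y in Y)
    return max(forward, backward)
```

The distance is zero exactly when every point of one set has a zero-distance match in the other, which is the "every action of one state is close to some action of the other" reading.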
66 F is monotonic and has a least fixed point $d_{fix}$ in which $d_{fix}(s, u) = 0$ iff s ∼ u. [sent-130, score-0.47]
67 As both $e_{fix}$ and $d_{fix}$ quantify the difference in behaviour between states, it is not surprising to see that they constrain the difference in optimal value. [sent-133, score-0.502]
68 , 2004) for $e_{fix}$, but we also show that our metric $d_{fix}$ is tighter. [sent-135, score-0.599]
69 Then we have: $c_r |V^*(s) - V^*(u)| \le d_{fix}(s, u) \le e_{fix}(s, u)$. Proof: We show via induction on n that for the sequence of iterates $V_n$ encountered during value iteration, $c_r |V_n(s) - V_n(u)| \le d_{fix}(s, u) \le e_{fix}(s, u)$, and then the result follows by merely taking limits. [sent-139, score-1.672]
70 For the base case note that $c_r |V_0(s) - V_0(u)| = d_0(s, u) = e_0(s, u) = 0$. [sent-140, score-0.366]
71 We then continue the inequality: $c_r |V_{n+1}(s) - V_{n+1}(u)| \le \max_a \min_b \left( c_r |R(s, a) - R(u, b)| + c_p K(d_n)(P(s, a), P(u, b)) \right) = F(d_n)(s, u) = d_{n+1}(s, u)$ ⋄ 5 State aggregation We now show how we can use this notion of lax bisimulation metrics to construct approximate MDP homomorphisms. [sent-145, score-1.131]
72 Thus, a policy in the aggregate MDP can be lifted to the original MDP by using this relabeling. [sent-149, score-0.168]
73 If M′ is an aggregation of MDP M and π′ is a policy in M′, then the lifted policy is defined by $\pi(s) = g_s(\pi'(s'))$. [sent-151, score-0.741]
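Lifting a policy from the aggregate MDP back to the original one amounts to a per-state relabeling of actions. The sketch below assumes dictionaries `rho` (state to aggregate state) and `g` (per-state map from aggregate actions to original actions). These names and the data layout are hypothetical.

```python
def lift_policy(aggregate_policy, rho, g):
    """Lift a policy on the aggregate MDP to the original MDP.

    aggregate_policy[c] is the action chosen in aggregate state c,
    rho[s] is the aggregate state of s, and g[s][a_agg] is the original
    action that state s uses to play aggregate action a_agg.
    """
    return {s: g[s][aggregate_policy[rho[s]]] for s in rho}
```

For two mirror-image states grouped in one block, the same aggregate action can lift to "south" in one state and "north" in the other.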
74 Using a lax bisimulation metric, it is possible to choose appropriate re-labelings so that states within a partition can approximately match each other’s actions. [sent-152, score-0.747]
75 If M′ is a $d_\zeta$-consistent aggregation of an MDP M and n ≤ ζ, then ∀s ∈ S we have: $c_r |V_n(\rho(s)) - V_n(s)| \le m(\rho(s)) + M \sum_{k=1}^{n-1} \gamma^{n-k}$, [sent-156, score-0.44]
76 where $m(C) = 2 \max_{u \in C} d_\zeta(s', u)$, s′ denotes the representative state of C and $M = \max_C m(C)$. [sent-157, score-0.078]
77 One appropriate way to aggregate states is to choose some desired error bound ε > 0 and ensure that the states in each partition are within an ε-ball. [sent-163, score-0.196]
78 A simple way to do this is to pick states at random and add to a partition each state within the ε-ball. [sent-164, score-0.165]
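The random ε-ball grouping described above can be sketched as follows. This is only an illustration under stated assumptions: `dist` stands in for a precomputed state metric (for example the lax bisimulation metric), the first state of each block plays the role of its representative, and the function name and `seed` parameter are hypothetical.

```python
import random


def epsilon_ball_partition(states, dist, eps, seed=0):
    """Group states into blocks whose members lie within eps of a chosen center."""
    rng = random.Random(seed)
    remaining = list(states)
    rng.shuffle(remaining)  # visit candidate centers in random order
    partition = []
    while remaining:
        center = remaining.pop(0)
        # The block is the center plus every unassigned state inside its eps-ball.
        block = [center] + [s for s in remaining if dist(center, s) <= eps]
        remaining = [s for s in remaining if dist(center, s) > eps]
        partition.append(block)
    return partition
```

Every state ends up in exactly one block, and the number of blocks shrinks as `eps` grows, which is the trade-off between compression and error that the bound quantifies.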
79 It has been noted that when the above condition holds, then under the unlaxed bisimulation metric $e_{fix}$, we can be assured that for each state s, $|V^*(\rho(s)) - V(s)|$ is bounded by $\frac{2\varepsilon}{c_r(1-\gamma)}$. [sent-166, score-1.244]
80 The theorem above shows that under the lax bisimulation metric $d_{fix}$ this difference is actually bounded by $\frac{4\varepsilon}{c_r(1-\gamma)}$. [sent-167, score-1.362]
81 a massive reduction in the size of the state space can be achieved by moving from $e_{fix}$ to $d_{fix}$, even when using $\varepsilon' = \frac{\varepsilon}{2}$. [sent-169, score-0.524]
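As a small worked illustration of the two error bounds, take the values ε = 0.05, $c_r$ = 0.1 and γ = 0.9. These numbers are assumptions chosen only for the arithmetic and do not come from the paper.

```latex
\[
\frac{2\varepsilon}{c_r(1-\gamma)} = \frac{2 \cdot 0.05}{0.1 \cdot 0.1} = 10,
\qquad
\frac{4\varepsilon}{c_r(1-\gamma)} = \frac{4 \cdot 0.05}{0.1 \cdot 0.1} = 20,
\qquad
\frac{4\varepsilon'}{c_r(1-\gamma)} = \frac{4 \cdot 0.025}{0.1 \cdot 0.1} = 10
\quad \text{for } \varepsilon' = \varepsilon/2 .
\]
```

Halving the ball radius for the lax metric therefore brings its bound back to the value obtained for the unlaxed metric at radius ε.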
82 For large systems, it might not be feasible to compute the metric e f ix in the original MDP. [sent-170, score-0.381]
83 Ravindran & Barto (2003) provided, based on a result from Whitt (1978), a bound on the difference in values between the optimal policy in the aggregated MDP and the lifted policy in the original MDP. [sent-172, score-0.321]
84 We now show that our metric can be used to tighten this bound. [sent-173, score-0.129]
85 The last inequality holds since π′ is an optimal policy and thus by Theorem 8 we know that $\{ V^{\pi'}(C) / c_r : C \in S' \}$ is a feasible solution. [sent-183, score-0.496]
86 Given two finite distributions P and Q, the total variation metric TV(P, Q) is defined as: $TV(P, Q) = \frac{1}{2} \sum_s |P(s) - Q(s)|$. Corollary 16. [sent-186, score-0.129]
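The total variation metric between two finite distributions is half the L1 distance between them and can be computed directly. The sketch below assumes distributions are stored as dictionaries from outcomes to probabilities, with missing keys treated as probability zero. The function name is illustrative.

```python
def total_variation(P, Q):
    """Total variation distance between two finite distributions given as dicts."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(s, 0.0) - Q.get(s, 0.0)) for s in support)
```

The value ranges from 0 for identical distributions to 1 for distributions with disjoint support.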
87 Let $\Delta = \max_{C,a} R(C, a) - \min_{C,a} R(C, a)$ be the maximum difference in rewards in the aggregated MDP. [sent-187, score-0.075]
88 The graph plots the size of the aggregated MDPs obtained against ε, using the lax and the non-lax bisimulation metrics. [sent-192, score-0.668]
89 In the case of the lax metric, we used ε′ = ε/2 to compensate for the factor of 2 difference in the error bound. [sent-193, score-0.221]
90 Under the unlaxed metric, this is not likely to occur, and thus the first states to be partitioned together are the ones neighbouring each other (which can actually have quite different behaviours). [sent-196, score-0.134]
91 7 Discussion and future work We defined a metric for measuring the similarity of state-action pairs in a Markov Decision Process and used it in an algorithm for constructing approximate MDP homomorphisms. [sent-197, score-0.171]
92 Our approach works significantly better than the bisimulation metrics of Ferns et al. [sent-198, score-0.461]
93 Although the metric is potentially expensive to compute, there are domains in which having an accurate aggregation is worth it. [sent-201, score-0.203]
94 The metric can also be used to find subtasks in a larger problem that can be solved using controllers from a pre-supplied library. [sent-203, score-0.129]
95 For example, if a controller is available to navigate single rooms, the metric might be used to lump states in a building schematic into “rooms”. [sent-204, score-0.254]
96 We are also investigating the possibility of replacing the Kantorovich metric (which is very convenient from the theoretical point of view) with a more practical approximation. [sent-209, score-0.129]
97 Finally, the extension to continuous states is very important. [sent-210, score-0.085]
98 Methods for computing state similarity in Markov Decision Processes. [sent-226, score-0.077]
99 Metrics for markov decision processes with infinite state spaces. [sent-239, score-0.108]
100 Towards a unified theory of state abstraction for MDPs. [sent-266, score-0.069]
wordName wordTfidf (topN-words)
[('gs', 0.417), ('bisimulation', 0.393), ('cr', 0.366), ('vn', 0.302), ('mdp', 0.257), ('ix', 0.235), ('lax', 0.221), ('gu', 0.217), ('ferns', 0.16), ('homomorphisms', 0.135), ('metric', 0.129), ('ravindran', 0.118), ('sx', 0.107), ('barto', 0.09), ('lifted', 0.086), ('states', 0.085), ('policy', 0.082), ('dn', 0.077), ('homomorphism', 0.075), ('aggregation', 0.074), ('en', 0.072), ('actions', 0.067), ('action', 0.065), ('max', 0.065), ('bisimilar', 0.061), ('kantorovich', 0.061), ('state', 0.054), ('aggregated', 0.054), ('givan', 0.049), ('gz', 0.049), ('panangaden', 0.049), ('unlaxed', 0.049), ('pk', 0.049), ('cp', 0.046), ('metrics', 0.044), ('tv', 0.039), ('hausdorff', 0.039), ('sbu', 0.037), ('whitt', 0.037), ('equivalence', 0.036), ('precup', 0.035), ('mdps', 0.034), ('puterman', 0.032), ('relation', 0.032), ('decision', 0.03), ('reward', 0.027), ('symmetries', 0.026), ('partition', 0.026), ('larsen', 0.025), ('laxed', 0.025), ('maxc', 0.025), ('maxu', 0.025), ('milner', 0.025), ('munkres', 0.025), ('navigate', 0.025), ('prakash', 0.025), ('skou', 0.025), ('representative', 0.024), ('et', 0.024), ('markov', 0.024), ('similarity', 0.023), ('nition', 0.023), ('match', 0.022), ('rewards', 0.021), ('compact', 0.021), ('partitions', 0.021), ('xs', 0.02), ('aggregates', 0.02), ('crp', 0.02), ('rooms', 0.02), ('measuring', 0.019), ('vi', 0.019), ('assured', 0.018), ('montreal', 0.018), ('prentice', 0.018), ('pr', 0.018), ('theorem', 0.018), ('environment', 0.017), ('optimal', 0.017), ('feasible', 0.017), ('continue', 0.017), ('correspondences', 0.017), ('mcgill', 0.017), ('notion', 0.016), ('proceedings', 0.016), ('grouped', 0.016), ('wolfe', 0.016), ('abstraction', 0.015), ('su', 0.015), ('controller', 0.015), ('proof', 0.015), ('canada', 0.015), ('relations', 0.015), ('transition', 0.015), ('quantify', 0.015), ('nite', 0.014), ('bounding', 0.014), ('inequality', 0.014), ('tuple', 0.014), ('pseudo', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
2 0.16067056 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm.
3 0.15043983 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system.
4 0.088320367 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods. 1 Off-policy learning methods Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an offpolicy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006). 
Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters—the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances ∗ Csaba Szepesv´ ri is on leave from MTA SZTAKI. a 1 of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. 
If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient-descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporaldifference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0). 2 Sub-sampling and i.i.d. formulations of temporal-difference learning In this section we formulate the off-policy policy-evaluation problem for one-step temporaldifference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state. 
The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function: ∞ V π (s) = Eπ γ t−1 rt |s1 = s , (1) t=1 where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn , with n |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn : V π (s) ≈ θ φ(s). (2) Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partialobservability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, s1 , a1 , r1 , s2 , a2 , r2 , . . ., which we can break into causally related 4-tuples, (s1 , a1 , r1 , s1 ), 2 (s2 , a2 , r2 , s2 ), . . . , where st = st+1 . For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and sk denote the associated reward and next state. 
The kth on-policy transition, denoted (sk , rk , sk ), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk ), rk , φ(sk )). The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ P r{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P , where pss = P r{sk = s |sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk |sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk ), rk , φ(sk )) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk , a corresponding sk is generated according to the on-policy state-transition matrix, P , and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk . The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk ), rk , φ(sk )), for k = 1, 2, . . . Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ ) drawn from µ. It remains to define the objective of learning. The TD error for the linear setting is δ = r + γθ φ − θ φ. (3) Given this, we define the one-step linear TD solution as any value of θ at which 0 = E[δφ] = −Aθ + b, (4) where A = E φ(φ − γφ ) and b = E[rφ]. 
The TD solution (4) is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity.
3 The GTD(0) algorithm
We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector $E[\delta\phi]$ can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the squared L2 norm of this vector: $J(\theta) = E[\delta\phi]^\top E[\delta\phi]$. (5) This objective function is quadratic and unimodal; its minimum value of 0 is achieved when $E[\delta\phi] = 0$, which can always be achieved. The gradient of this objective function is $\nabla_\theta J(\theta) = 2\left(\nabla_\theta E[\delta\phi]\right) E[\delta\phi] = 2E\left[\phi(\nabla_\theta\delta)^\top\right]^\top E[\delta\phi] = -2E\left[\phi(\phi - \gamma\phi')^\top\right]^\top E[\delta\phi]$. (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update is in the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.
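The objective (5) and its gradient (6) can be checked numerically on a small batch by treating the empirical distribution of the batch as µ. The following is a minimal sketch under that assumption; the helper names (`expected_td_update`, `objective`, `gradient`, `finite_difference`) and the toy batch are hypothetical and serve only to compare the closed-form gradient with central finite differences.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def expected_td_update(samples, theta, gamma):
    """Empirical E[delta * phi] over the batch."""
    n = len(theta)
    out = [0.0] * n
    for phi, r, phi_next in samples:
        delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
        for i in range(n):
            out[i] += delta * phi[i] / len(samples)
    return out


def objective(samples, theta, gamma):
    """J(theta) = E[delta phi]^T E[delta phi], equation (5)."""
    e = expected_td_update(samples, theta, gamma)
    return dot(e, e)


def gradient(samples, theta, gamma):
    """-2 E[phi (phi - gamma phi')^T]^T E[delta phi], equation (6)."""
    n = len(theta)
    e = expected_td_update(samples, theta, gamma)
    grad = [0.0] * n
    for phi, r, phi_next in samples:
        phi_e = dot(phi, e)  # scalar phi^T E[delta phi]
        for j in range(n):
            grad[j] += -2.0 * (phi[j] - gamma * phi_next[j]) * phi_e / len(samples)
    return grad


def finite_difference(samples, theta, gamma, h=1e-5):
    """Central-difference approximation of the gradient of J."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        grad.append((objective(samples, up, gamma) - objective(samples, down, gamma)) / (2 * h))
    return grad


# Hypothetical two-state batch with one-hot features and reward 1 everywhere.
TOY_SAMPLES = [([1.0, 0.0], 1.0, [0.0, 1.0]),
               ([0.0, 1.0], 1.0, [0.0, 1.0])]
```

Because J is quadratic in θ, the central difference agrees with (6) up to rounding, so any visible mismatch would point to a sign or transpose error in the reconstruction rather than to discretization.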
First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix $A = E\left[\phi(\phi - \gamma\phi')^\top\right]$. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products $\phi(\phi - \gamma\phi')^\top$. Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let $A_k$ be the estimate of A after observing the first k samples, $(\phi_1, r_1, \phi'_1), \ldots, (\phi_k, r_k, \phi'_k)$. Then this algorithm is defined by $A_k = \frac{1}{k}\sum_{i=1}^{k} \phi_i(\phi_i - \gamma\phi'_i)^\top$ (7) along with the gradient descent rule: $\theta_{k+1} = \theta_k + \alpha_k A_k^\top \delta_k \phi_k$, $k \ge 1$, (8) where $\theta_1$ is arbitrary, $\delta_k = r_k + \gamma\theta_k^\top\phi'_k - \theta_k^\top\phi_k$, and $\alpha_k > 0$ is a series of step-size parameters, possibly decreasing over time. We call this algorithm $A^\top$TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix $A^\top$. Although we find this algorithm interesting, we do not consider it further here because it requires $O(n^2)$ memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector $E[\delta\phi]$, and to sample the first expectation, $E\left[\phi(\phi - \gamma\phi')^\top\right]$. Let $u_k$ denote the estimate of $E[\delta\phi]$ after observing the first k − 1 samples, with $u_1 = 0$. The GTD(0) algorithm is defined by $u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)$ (9) and $\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k$, (10) where $\theta_1$ is arbitrary, $\delta_k$ is as in (3) using $\theta_k$, and $\alpha_k > 0$ and $\beta_k > 0$ are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step.
4 Convergence
The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions.
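The GTD(0) updates (9) and (10) translate into a few lines of code. The sketch below is one possible reading of those equations, not the authors' implementation: the names `gtd0_step` and `gtd0`, the use of constant step sizes, and the choice to start both vectors at zero are assumptions made here for illustration. Note that θ is updated with the old estimate $u_k$, and that the scalar $\phi_k^\top u_k$ is formed first so each step costs O(n).

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gtd0_step(theta, u, phi, r, phi_next, gamma, alpha, beta):
    """One GTD(0) update; returns the new (theta, u) without mutating inputs."""
    n = len(theta)
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error (3)
    phi_u = dot(phi, u)  # right-to-left product: a scalar, computed with the old u
    new_theta = [theta[i] + alpha * (phi[i] - gamma * phi_next[i]) * phi_u
                 for i in range(n)]                              # equation (10)
    new_u = [u[i] + beta * (delta * phi[i] - u[i]) for i in range(n)]  # equation (9)
    return new_theta, new_u


def gtd0(samples, gamma, alpha, beta, theta=None, u=None):
    """Run GTD(0) over a sequence of (phi, r, phi_next) samples.

    Constant step sizes are an assumption of this sketch; Theorem 4.1 is
    stated for decreasing step-size sequences.
    """
    n = len(samples[0][0])
    theta = [0.0] * n if theta is None else list(theta)
    u = [0.0] * n if u is None else list(u)
    for phi, r, phi_next in samples:
        theta, u = gtd0_step(theta, u, phi, r, phi_next, gamma, alpha, beta)
    return theta, u
```

As a usage example, feeding the single-feature stream `[([1.0], 1.0, [1.0])] * 1000` with γ = 0.5 and α = β = 0.1 drives θ toward 2, the TD solution of that one-state problem, while u decays toward zero. In that special case the iteration is a fixed linear map, which is why a constant step size suffices there.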
In particular, we have the following result: Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9,10) with step-size sequences $\alpha_k$ and $\beta_k$ satisfying $\beta_k = \eta\alpha_k$, $\eta > 0$, $\alpha_k, \beta_k \in (0, 1]$, $\sum_{k=0}^{\infty} \alpha_k = \infty$, $\sum_{k=0}^{\infty} \alpha_k^2 < \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = E\left[\phi_k(\phi_k - \gamma\phi'_k)^\top\right]$ and $b = E[r_k\phi_k]$ (note that A and b are well-defined because the distribution of $(\phi_k, r_k, \phi'_k)$ does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4). Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm's two iterations as a single iteration in a combined parameter vector with 2n components, $\rho_k^\top = (v_k^\top, \theta_k^\top)$, where $v_k = u_k/\sqrt{\eta}$, and a new reward-related vector with 2n components, $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$: $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G_{k+1}\rho_k + g_{k+1})$, where $G_{k+1} = \begin{pmatrix} -\sqrt{\eta}\,I & \phi_k(\gamma\phi'_k - \phi_k)^\top \\ (\phi_k - \gamma\phi'_k)\phi_k^\top & 0 \end{pmatrix}$. Let $G = E[G_k]$ and $g = E[g_k]$. Note that G and g are well-defined because, by assumption, the process $\{\phi_k, r_k, \phi'_k\}_k$ is i.i.d. In particular, $G = \begin{pmatrix} -\sqrt{\eta}\,I & -A \\ A^\top & 0 \end{pmatrix}$, $g = \begin{pmatrix} b \\ 0 \end{pmatrix}$. Further, note that (4) follows from $G\rho + g = 0$, (11) where $\rho^\top = (v^\top, \theta^\top)$. Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G\rho_k + g + (G_{k+1} - G)\rho_k + (g_{k+1} - g)) = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1})$, where $\alpha'_k = \alpha_k\sqrt{\eta}$, $h(\rho) = g + G\rho$ and $M_{k+1} = (G_{k+1} - G)\rho_k + g_{k+1} - g$. Let $F_k = \sigma(\rho_1, M_1, \ldots, \rho_{k-1}, M_k)$.
Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and $h_\infty(\rho) = \lim_{r\to\infty} h(r\rho)/r$ is well-defined for every $\rho \in \mathbb{R}^{2n}$; (ii-a) The sequence $(M_k, F_k)$ is a martingale difference sequence, and (ii-b) for some $C_0 > 0$, $E\left[\|M_{k+1}\|^2 \mid F_k\right] \le C_0(1 + \|\rho_k\|^2)$ holds for any initial parameter vector $\rho_1$; (iii) The sequence $\alpha'_k$ satisfies $0 < \alpha'_k \le 1$, $\sum_{k=1}^{\infty} \alpha'_k = \infty$, $\sum_{k=1}^{\infty} (\alpha'_k)^2 < +\infty$; and (iv) The ODE $\dot{\rho} = h(\rho)$ has a globally asymptotically stable equilibrium. Clearly, h(ρ) is Lipschitz with coefficient $\|G\|$ and $h_\infty(\rho) = G\rho$. By construction, $(M_k, F_k)$ satisfies $E[M_{k+1} \mid F_k] = 0$ and $M_k \in F_k$, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of $(\phi_k, r_k, \phi'_k)$. Condition (iii) is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative. First, let us show that G is non-singular. Using the determinant rule for partitioned matrices (footnote 1) we get $\det(G) = \det(A^\top A) \ne 0$. This indicates that all the eigenvalues of G are non-zero. Now, let $\lambda \in \mathbb{C}$, $\lambda \ne 0$, be an eigenvalue of G with corresponding normalized eigenvector $x \in \mathbb{C}^{2n}$; that is, $\|x\|^2 = x^* x = 1$, where $x^*$ is the conjugate transpose of x. Hence $x^* G x = \lambda$. Let $x^\top = (x_1^\top, x_2^\top)$, where $x_1, x_2 \in \mathbb{C}^n$. Using the definition of G, $\lambda = x^* G x = -\sqrt{\eta}\,\|x_1\|^2 - x_1^* A x_2 + x_2^* A^\top x_1$. Because A is real, $A^* = A^\top$, and it follows that $(x_1^* A x_2)^* = x_2^* A^\top x_1$, so the last two terms sum to a purely imaginary number. Thus, $\mathrm{Re}(\lambda) = \mathrm{Re}(x^* G x) = -\sqrt{\eta}\,\|x_1\|^2 \le 0$. We are now done if we show that $x_1$ cannot be zero. If $x_1 = 0$, then from $\lambda = x^* G x$ we get that $\lambda = 0$, which contradicts $\lambda \ne 0$. The next result concerns the convergence of GTD(0) when $(\phi_k, r_k, \phi'_k)$ is obtained by the off-policy sub-sampling process described originally in Section 2.
We make the following assumption: Assumption A1 The behavior policy $\pi_b$ (generator of the actions $a_t$) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic. This assumption is needed to ensure that the sub-sampled process $s_k$ is well-defined and that the obtained sample is of "high quality". Under this assumption it holds that $s_k$ is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process $s_t$). The following theorem shows that the conclusion of the previous result continues to hold in this case: Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters $\theta_k$, $u_k$ be updated by (9,10). Further assume that $(\phi_k, r_k, \phi'_k)$ is such that $E\left[\|\phi_k\|^2 \mid s_{k-1}\right]$, $E\left[r_k^2 \mid s_{k-1}\right]$, $E\left[\|\phi'_k\|^2 \mid s_{k-1}\right]$ are uniformly bounded. Assume that the Markov chain $(s_k)$ is aperiodic and irreducible, so that $\lim_{k\to\infty} P(s_k = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let s be a state randomly drawn from µ, and let s' be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s') be the reward incurred. Let $A = E\left[\phi(s)(\phi(s) - \gamma\phi(s'))^\top\right]$ and $b = E[r(s, s')\phi(s)]$. Assume that A is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4), provided that $s_1 \sim \mu$. Proof. The proof of Theorem 4.1 goes through without any changes once we observe that $G = E[G_{k+1} \mid F_k]$ and $g = E[g_{k+1} \mid F_k]$. The condition that $(s_k)$ is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of $(s_k)$ follows from the same property of the original process $(s_t)$.
Footnote 1: According to this rule, if $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{m\times n}$, $D \in \mathbb{R}^{m\times m}$, then for $F = [A\ B;\ C\ D] \in \mathbb{R}^{(n+m)\times(n+m)}$, $\det(F) = \det(A)\det(D - CA^{-1}B)$.
For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerably more work the result can be extended to the case when $s_1$ follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form $\rho_{k+1} = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1} + e_{k+1})$, where $e_{k+1}$ is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)).
5 Extensions to action values, stochastic target policies, and other sample weightings
The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy $\pi_b$ is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: $\phi : S \times A \to \mathbb{R}^n$. The learning equations are unchanged, except that $\phi_t$ and $\phi'_t$ are redefined as follows: $\phi_t = \phi(s_t, a_t)$, (12) $\phi'_t = \sum_a \pi(s_{t+1}, a)\phi(s_{t+1}, a)$. (13) (We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm "one-step gradient-based Q-evaluation," or GQE(0). Theorem 5.1 (Convergence of GQE(0)). Assume that $s_t$ is a state sequence generated by following some stationary policy $\pi_b$ in a finite MDP. Let $r_t$ be the corresponding sequence of rewards and let $\phi_t$, $\phi'_t$ be given by the respective equations (12) and (13), and assume that $E\left[\|\phi_t\|^2 \mid s_{t-1}\right]$, $E\left[r_t^2 \mid s_{t-1}\right]$, $E\left[\|\phi'_t\|^2 \mid s_{t-1}\right]$ are uniformly bounded. Let the parameters $\theta_t$, $u_t$ be updated by Equations (9) and (10). Assume that the Markov chain $(s_t)$ is aperiodic and irreducible, so that $\lim_{t\to\infty} P(s_t = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by $\pi_b$ in s, let s' be the next state obtained and let a' = π(s') be the action chosen by the target policy in state s'.
Further, let r(s, a, s') be the reward incurred in this transition. Let $A = E\left[\phi(s, a)(\phi(s, a) - \gamma\phi(s', a'))^\top\right]$ and $b = E[r(s, a, s')\phi(s, a)]$. Assume that A is non-singular. Then the parameter vector $\theta_t$ converges with probability one to a TD solution (4), provided that $s_1$ is selected from the steady-state distribution µ. The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight $w_t$ for time step t be equal to the probability $\pi(s_t, a_t)$ of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components $(\phi_k, r_k, \phi'_k, w_k)$, with the update rules (9) and (10) replaced by $u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)w_k$ (14) and $\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k w_k$. (15) Each sample is also weighted by $w_k$ in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ. The choice $w_t = \pi(s_t, a_t)$ is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is $w_t = \pi(s_t, a_t)/\pi_b(s_t, a_t)$, where $\pi_b$ is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy's value function (1).
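The weighted rules (14) and (15) differ from (9) and (10) only by the scalar factor $w_k$. A minimal sketch of one weighted step follows; the function names `weighted_gtd0_step` and `likelihood_ratio_weight` are assumptions of this illustration, and the policies are represented simply as the two probabilities of the action actually taken.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def likelihood_ratio_weight(pi_prob, pi_b_prob):
    """w = pi(s, a) / pi_b(s, a) for the action actually taken.

    pi_b_prob must be positive, in line with the assumption that the
    behavior policy selects every action with positive probability.
    """
    if pi_b_prob <= 0.0:
        raise ValueError("behavior policy must give the action positive probability")
    return pi_prob / pi_b_prob


def weighted_gtd0_step(theta, u, phi, r, phi_next, w, gamma, alpha, beta):
    """One step of the weighted updates (14) and (15)."""
    n = len(theta)
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_u = dot(phi, u)  # scalar formed with the old u, keeping the step O(n)
    new_theta = [theta[i] + alpha * (phi[i] - gamma * phi_next[i]) * phi_u * w
                 for i in range(n)]                                   # (15)
    new_u = [u[i] + beta * (delta * phi[i] - u[i]) * w for i in range(n)]  # (14)
    return new_theta, new_u
```

With w = 0 the step leaves both vectors unchanged, and with w = 1 it reduces to the unweighted update, which mirrors the remark above that the choice $w_t = \pi(s_t, a_t)$ recovers sub-sampling when the target policy is deterministic.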
6 Related work
There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of "recognizers" to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, $E[\delta^2]$; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case).
In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution (see footnote 2). Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001). The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-time-step computation that is $O(n^2)$. Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to O(n), but are still $O(n^2)$ in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may apply to O(n) methods as well, further reducing their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007).
In such cases, $O(n^2)$ methods are not feasible.
7 Conclusion
GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird's (1995) "star" problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.
Footnote 2: For a counterexample, consider that given in Dayan's (1992) Figure 2, except now consider that state A is actually two states, A and A', which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).
Acknowledgments
The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.
References
Baird, L. C. (1995).
Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21:84–99.
Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.
Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.
Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.
Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18.
Precup, D., Sutton, R. S., Singh, S.
(2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.
Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.
Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.
Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.
Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
5 0.080679946 195 nips-2008-Regularized Policy Iteration
Author: Amir M. Farahmand, Mohammad Ghavamzadeh, Shie Mannor, Csaba Szepesvári
Abstract: In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementations for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.
6 0.078122579 72 nips-2008-Empirical performance maximization for linear rank statistics
7 0.067785785 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
8 0.066082142 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
9 0.055976227 181 nips-2008-Policy Search for Motor Primitives in Robotics
10 0.050426956 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
11 0.046598952 174 nips-2008-Overlaying classifiers: a practical approach for optimal ranking
12 0.046518568 212 nips-2008-Skill Characterization Based on Betweenness
13 0.045954183 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
14 0.045255173 159 nips-2008-On Bootstrapping the ROC Curve
15 0.044643253 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
16 0.039585847 223 nips-2008-Structure Learning in Human Sequential Decision-Making
17 0.038522437 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
18 0.038457941 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
19 0.03694633 168 nips-2008-Online Metric Learning and Fast Similarity Search
20 0.036331538 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
topicId topicWeight
[(0, -0.096), (1, 0.16), (2, -0.035), (3, -0.059), (4, 0.043), (5, 0.011), (6, 0.137), (7, -0.091), (8, -0.047), (9, 0.018), (10, 0.012), (11, -0.037), (12, 0.051), (13, -0.006), (14, -0.019), (15, -0.011), (16, -0.02), (17, 0.015), (18, 0.002), (19, -0.05), (20, -0.05), (21, 0.018), (22, 0.038), (23, -0.017), (24, -0.028), (25, -0.011), (26, 0.011), (27, 0.06), (28, -0.023), (29, 0.002), (30, 0.037), (31, -0.014), (32, 0.03), (33, -0.084), (34, 0.037), (35, 0.092), (36, -0.021), (37, 0.033), (38, -0.009), (39, -0.004), (40, -0.085), (41, 0.03), (42, -0.035), (43, 0.01), (44, 0.081), (45, 0.004), (46, -0.069), (47, -0.069), (48, 0.058), (49, 0.102)]
simIndex simValue paperId paperTitle
same-paper 1 0.95150298 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
Author: Jonathan Taylor, Doina Precup, Prakash Panagaden
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
2 0.75661445 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s' there is a policy which moves from s to s' in at most D steps (on average). We present a reinforcement learning algorithm with total regret $\tilde{O}(DS\sqrt{AT})$ after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of $\Omega(\sqrt{DSAT})$ on the total regret of any learning algorithm.
3 0.7211625 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system.
4 0.61138552 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE's broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches.
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD's quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
1 Off-policy learning methods
Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be "off-policy" if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).
Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters, the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning's parameters diverge to infinity for any positive step size. This is a severe problem in so far as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. (∗ Csaba Szepesvári is on leave from MTA SZTAKI.) The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation.
If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporal-difference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0).
2 Sub-sampling and i.i.d. formulations of temporal-difference learning
In this section we formulate the off-policy policy-evaluation problem for one-step temporal-difference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, $t = 1, 2, \ldots$, the environment is in a state $s_t \in S$, the agent chooses an action $a_t \in A$, and then the environment emits a reward $r_t \in \mathbb{R}$, and transitions to its next state $s_{t+1} \in S$. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state.
The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy $\pi : S \to A$. The objective is to learn an approximation to its state-value function:
$$V^\pi(s) = E_\pi\Big[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_1 = s\Big], \qquad (1)$$
where $\gamma \in [0, 1)$ is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state $s \in S$ there is a corresponding feature vector $\phi(s) \in \mathbb{R}^n$, with $n \ll |S|$. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector $\theta \in \mathbb{R}^n$:
$$V^\pi(s) \approx \theta^\top \phi(s). \qquad (2)$$
Further, we assume that the states $s_t$ are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partial-observability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$, which we can break into causally related 4-tuples, $(s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), \ldots$, where $s'_t = s_{t+1}$. For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let $s_k$ denote the $k$th state in which an on-policy action was taken, and let $r_k$ and $s'_k$ denote the associated reward and next state.
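The sub-sampling step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `subsample_on_policy`, the list-based trajectory layout, and the callable `target_policy` are hypothetical and do not come from the paper.

```python
def subsample_on_policy(states, actions, rewards, target_policy):
    """Keep only transitions whose action matches the deterministic target policy.

    Assumed layout (hypothetical): states has one more entry than actions and
    rewards, so states[t + 1] is the next state s'_t of the t-th transition.
    Returns a list of (s_k, r_k, s'_k) triples; the action is dropped because
    it can be recovered from the state via the target policy.
    """
    transitions = []
    for t, (a, r) in enumerate(zip(actions, rewards)):
        if a == target_policy(states[t]):
            transitions.append((states[t], r, states[t + 1]))
    return transitions


# Hypothetical trajectory: the target policy always chooses action "a".
kept = subsample_on_policy(
    states=[0, 1, 0, 1],
    actions=["a", "b", "a"],
    rewards=[1.0, 0.0, 2.0],
    target_policy=lambda s: "a",
)
```

In this toy trajectory the middle transition is discarded because its action differs from the target policy's choice.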
The $k$th on-policy transition, denoted $(s_k, r_k, s'_k)$, is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple $(\phi(s_k), r_k, \phi(s'_k))$. The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution $\mu(s) = \lim_{k\to\infty} \Pr\{s_k = s\}$. For any state $s$, the MDP and target policy together determine an $N \times N$ state-transition-probability matrix $P$, where $p_{ss'} = \Pr\{s'_k = s' \mid s_k = s\}$, and an $N \times 1$ expected-reward vector $R$, where $R_s = E[r_k \mid s_k = s]$. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of $(\phi(s_k), r_k, \phi(s'_k))$ respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case.
In the i.i.d. formulation, the states $s_k$ are generated independently and identically distributed according to an arbitrary probability distribution $\mu$. From each $s_k$, a corresponding $s'_k$ is generated according to the on-policy state-transition matrix, $P$, and a corresponding $r_k$ is generated according to an arbitrary bounded distribution with expected value $R_{s_k}$. The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence $(\phi(s_k), r_k, \phi(s'_k))$, for $k = 1, 2, \ldots$ Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables $(\phi, r, \phi')$ drawn from $\mu$.
It remains to define the objective of learning. The TD error for the linear setting is
$$\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi. \qquad (3)$$
Given this, we define the one-step linear TD solution as any value of $\theta$ at which
$$0 = E[\delta\phi] = -A\theta + b, \qquad (4)$$
where $A = E[\phi(\phi - \gamma\phi')^\top]$ and $b = E[r\phi]$.
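To make the TD solution (4) concrete, the following minimal sketch estimates $A$ and $b$ by sample averages and solves $A\theta = b$ directly. This is a batch computation requiring $O(n^2)$ memory, shown only to illustrate the definition; it is not the GTD(0) algorithm. The function name `td_solution` and the array layout (one sample per row) are assumptions made for illustration.

```python
import numpy as np


def td_solution(phi, r, phi_next, gamma):
    """Solve the sample version of (4): A theta = b.

    phi, phi_next: arrays of shape (k, n), one feature vector per sample.
    r: array of shape (k,), the sampled rewards.
    """
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    r = np.asarray(r, dtype=float)
    k = phi.shape[0]
    # A is the average of the outer products phi_i (phi_i - gamma phi'_i)^T
    A = phi.T @ (phi - gamma * phi_next) / k
    # b is the average of r_i phi_i
    b = phi.T @ r / k
    theta = np.linalg.solve(A, b)  # requires A to be non-singular
    return theta, A, b


# Two-state deterministic chain with tabular features (an assumed toy example):
# state 0 -> state 1 with reward 1, state 1 -> state 0 with reward 0.
theta, A, b = td_solution(
    phi=[[1.0, 0.0], [0.0, 1.0]],
    r=[1.0, 0.0],
    phi_next=[[0.0, 1.0], [1.0, 0.0]],
    gamma=0.5,
)
```

With tabular features the TD solution coincides with the true values of this chain, which satisfy $V(0) = 1 + 0.5\,V(1)$ and $V(1) = 0.5\,V(0)$, that is, $V(0) = 4/3$ and $V(1) = 2/3$.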
This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if $\theta$ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity.
3 The GTD(0) algorithm
We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector $E[\delta\phi]$ can be viewed as an error in the current solution $\theta$. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the squared L2 norm of this vector:
$$J(\theta) = E[\delta\phi]^\top E[\delta\phi]. \qquad (5)$$
This objective function is quadratic and unimodal; its minimum value of 0 is achieved when $E[\delta\phi] = 0$, which can always be achieved. The gradient of this objective function is
$$\nabla_\theta J(\theta) = 2\big(\nabla_\theta E[\delta\phi^\top]\big) E[\delta\phi] = 2E\big[\phi(\nabla_\theta\delta)^\top\big]^\top E[\delta\phi] = -2E\big[\phi(\phi - \gamma\phi')^\top\big]^\top E[\delta\phi]. \qquad (6)$$
This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.
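The gradient formula (6) can be checked numerically. Using $E[\delta\phi] = b - A\theta$ from (4), the objective (5) becomes $J(\theta) = \|b - A\theta\|^2$ and (6) reads $\nabla_\theta J(\theta) = -2A^\top(b - A\theta)$. The sketch below compares this closed form with central finite differences; the function names and the randomly drawn $A$, $b$, and $\theta$ are illustrative assumptions, not part of the paper.

```python
import numpy as np


def objective(theta, A, b):
    """J(theta) = E[delta phi]^T E[delta phi], with E[delta phi] = b - A theta."""
    e = b - A @ theta
    return float(e @ e)


def gradient(theta, A, b):
    """Closed-form gradient from (6): -2 A^T E[delta phi]."""
    return -2.0 * A.T @ (b - A @ theta)


def numerical_gradient(theta, A, b, eps=1e-6):
    """Central finite-difference approximation of the gradient of J."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (objective(theta + step, A, b) - objective(theta - step, A, b)) / (2 * eps)
    return g


# Arbitrary test problem (assumed for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
b = rng.normal(size=3)
theta = rng.normal(size=3)
closed_form = gradient(theta, A, b)
finite_diff = numerical_gradient(theta, A, b)
```

Because $J$ is quadratic in $\theta$, the two gradients should agree up to floating-point rounding.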
First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix $A = E[\phi(\phi - \gamma\phi')^\top]$. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products $\phi(\phi - \gamma\phi')^\top$. Note that $A$ is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on $\theta$ and would not need to be re-estimated if $\theta$ were to change. Let $A_k$ be the estimate of $A$ after observing the first $k$ samples, $(\phi_1, r_1, \phi'_1), \ldots, (\phi_k, r_k, \phi'_k)$. Then this algorithm is defined by
$$A_k = \frac{1}{k}\sum_{i=1}^{k} \phi_i(\phi_i - \gamma\phi'_i)^\top \qquad (7)$$
along with the gradient descent rule:
$$\theta_{k+1} = \theta_k + \alpha_k A_k^\top \delta_k\phi_k, \quad k \ge 1, \qquad (8)$$
where $\theta_1$ is arbitrary, $\delta_k = r_k + \gamma\theta_k^\top\phi'_k - \theta_k^\top\phi_k$, and $\alpha_k > 0$ is a series of step-size parameters, possibly decreasing over time. We call this algorithm $A^\top$TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix $A^\top$. Although we find this algorithm interesting, we do not consider it further here because it requires $O(n^2)$ memory and computation per time step.
The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector $E[\delta\phi]$, and to sample the first expectation, $E[\phi(\phi - \gamma\phi')^\top]$. Let $u_k$ denote the estimate of $E[\delta\phi]$ after observing the first $k - 1$ samples, with $u_1 = 0$. The GTD(0) algorithm is defined by
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k) \qquad (9)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k, \qquad (10)$$
where $\theta_1$ is arbitrary, $\delta_k$ is as in (3) using $\theta_k$, and $\alpha_k > 0$ and $\beta_k > 0$ are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is $O(n)$ per time step.
4 Convergence
The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions.
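As a concrete illustration of the updates (9) and (10) whose convergence is analyzed in this section, the following is a minimal Python sketch, not the authors' implementation. The function name `gtd0`, the constant step sizes `alpha` and `beta`, and the optional `theta0` argument are assumptions made for illustration; the text allows an arbitrary $\theta_1$ and step sizes that decrease over time.

```python
import numpy as np


def gtd0(samples, n, gamma, alpha, beta, theta0=None):
    """Apply the GTD(0) updates (9) and (10) to a sequence of samples.

    samples: iterable of (phi, r, phi_next) with feature vectors of length n.
    alpha, beta: constant step sizes (an illustrative simplification).
    """
    theta = np.zeros(n) if theta0 is None else np.array(theta0, dtype=float)
    u = np.zeros(n)  # u_1 = 0, as in the text
    for phi, r, phi_next in samples:
        phi = np.asarray(phi, dtype=float)
        phi_next = np.asarray(phi_next, dtype=float)
        # TD error (3), computed with the current theta_k
        delta = r + gamma * (theta @ phi_next) - theta @ phi
        # Update (10): the scalar phi^T u_k is formed first, so the step is O(n)
        theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
        # Update (9): move u toward the sampled vector delta_k * phi_k
        u = u + beta * (delta * phi - u)
        theta = theta_new
    return theta, u


# Two identical samples from a tabular two-state problem (assumed toy data).
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta, u = gtd0([(e0, 1.0, e1), (e0, 1.0, e1)], n=2, gamma=0.5, alpha=0.1, beta=0.5)
```

Both updates use the old values $\theta_k$ and $u_k$, which is why `theta_new` is computed before `u` is overwritten. Because $u_1 = 0$, the first sample changes only $u$; $\theta$ starts to move from the second sample onward.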
In particular, we have the following result:
Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9,10) with step-size sequences $\alpha_k$ and $\beta_k$ satisfying $\beta_k = \eta\alpha_k$, $\eta > 0$, $\alpha_k, \beta_k \in (0, 1]$, $\sum_{k=0}^{\infty}\alpha_k = \infty$, $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = E[\phi_k(\phi_k - \gamma\phi'_k)^\top]$ and $b = E[r_k\phi_k]$ (note that $A$ and $b$ are well-defined because the distribution of $(\phi_k, r_k, \phi'_k)$ does not depend on the sequence index $k$). Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4).
Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with $2n$ components, $\rho_k^\top = (v_k^\top, \theta_k^\top)$, where $v_k = u_k/\sqrt{\eta}$, and a new reward-related vector with $2n$ components, $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$:
$$\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G_{k+1}\rho_k + g_{k+1}),$$
where
$$G_{k+1} = \begin{pmatrix} -\sqrt{\eta}\,I & \phi_k(\gamma\phi'_k - \phi_k)^\top \\ (\phi_k - \gamma\phi'_k)\phi_k^\top & 0 \end{pmatrix}.$$
Let $G = E[G_k]$ and $g = E[g_k]$. Note that $G$ and $g$ are well-defined as by the assumption the process $\{\phi_k, r_k, \phi'_k\}_k$ is i.i.d. In particular,
$$G = \begin{pmatrix} -\sqrt{\eta}\,I & -A \\ A^\top & 0 \end{pmatrix}, \qquad g = \begin{pmatrix} b \\ 0 \end{pmatrix}.$$
Further, note that (4) follows from
$$G\rho + g = 0, \qquad (11)$$
where $\rho^\top = (v^\top, \theta^\top)$. Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G\rho_k + g + (G_{k+1} - G)\rho_k + (g_{k+1} - g)) = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1})$, where $\alpha'_k = \alpha_k\sqrt{\eta}$, $h(\rho) = g + G\rho$ and $M_{k+1} = (G_{k+1} - G)\rho_k + g_{k+1} - g$. Let $F_k = \sigma(\rho_1, M_1, \ldots, \rho_{k-1}, M_k)$.
Theorem 2.2 requires the verification of the following conditions: (i) The function $h$ is Lipschitz and $h_\infty(\rho) = \lim_{r\to\infty} h(r\rho)/r$ is well-defined for every $\rho \in \mathbb{R}^{2n}$; (ii-a) The sequence $(M_k, F_k)$ is a martingale difference sequence, and (ii-b) for some $C_0 > 0$, $E[\|M_{k+1}\|^2 \mid F_k] \le C_0(1 + \|\rho_k\|^2)$ holds for any initial parameter vector $\rho_1$; (iii) The sequence $\alpha'_k$ satisfies $0 < \alpha'_k \le 1$, $\sum_{k=1}^{\infty}\alpha'_k = \infty$, $\sum_{k=1}^{\infty}(\alpha'_k)^2 < +\infty$; and (iv) The ODE $\dot\rho = h(\rho)$ has a globally asymptotically stable equilibrium.
Clearly, $h(\rho)$ is Lipschitz with coefficient $\|G\|$ and $h_\infty(\rho) = G\rho$. By construction, $(M_k, F_k)$ satisfies $E[M_{k+1} \mid F_k] = 0$ and $M_k \in F_k$, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of $(\phi_k, r_k, \phi'_k)$. Condition (iii) is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of $G$ are negative.
First, let us show that $G$ is non-singular. Using the determinant rule for partitioned matrices (see footnote 1) we get $\det(G) = \det(A^\top A) \ne 0$. This indicates that all the eigenvalues of $G$ are non-zero. Now, let $\lambda \in \mathbb{C}$, $\lambda \ne 0$ be an eigenvalue of $G$ with corresponding normalized eigenvector $x \in \mathbb{C}^{2n}$; that is, $\|x\|^2 = x^*x = 1$, where $x^*$ is the complex conjugate transpose of $x$. Hence $x^*Gx = \lambda$. Let $x^\top = (x_1^\top, x_2^\top)$, where $x_1, x_2 \in \mathbb{C}^n$. Using the definition of $G$, $\lambda = x^*Gx = -\sqrt{\eta}\,\|x_1\|^2 - x_1^*Ax_2 + x_2^*A^\top x_1$. Because $A$ is real, $A^* = A^\top$, and it follows that $(x_1^*Ax_2)^* = x_2^*A^\top x_1$, so the last two terms sum to a purely imaginary number. Thus, $\mathrm{Re}(\lambda) = \mathrm{Re}(x^*Gx) = -\sqrt{\eta}\,\|x_1\|^2 \le 0$. We are now done if we show that $x_1$ cannot be zero. If $x_1 = 0$, then from $\lambda = x^*Gx$ we get that $\lambda = 0$, which contradicts $\lambda \ne 0$.
The next result concerns the convergence of GTD(0) when $(\phi_k, r_k, \phi'_k)$ is obtained by the off-policy sub-sampling process described originally in Section 2.
We make the following assumption:
Assumption A1. The behavior policy $\pi_b$ (generator of the actions $a_t$) selects all actions of the target policy $\pi$ with positive probability in every state, and the target policy is deterministic.
This assumption is needed to ensure that the sub-sampled process $s_k$ is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that $s_k$ is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process $s_t$). The following theorem shows that the conclusion of the previous result continues to hold in this case:
Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters $\theta_k$, $u_k$ be updated by (9,10). Further assume that $(\phi_k, r_k, \phi'_k)$ is such that $E[\|\phi_k\|^2 \mid s_{k-1}]$, $E[r_k^2 \mid s_{k-1}]$, $E[\|\phi'_k\|^2 \mid s_{k-1}]$ are uniformly bounded. Assume that the Markov chain $(s_k)$ is aperiodic and irreducible, so that $\lim_{k\to\infty} P(s_k = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, and let $s'$ be a state obtained by following $\pi$ for one time step in the MDP from $s$. Further, let $r(s, s')$ be the reward incurred. Let $A = E[\phi(s)(\phi(s) - \gamma\phi(s'))^\top]$ and $b = E[r(s, s')\phi(s)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4), provided that $s_1 \sim \mu$.
Proof. The proof of Theorem 4.1 goes through without any changes once we observe that $G = E[G_{k+1} \mid F_k]$ and $g = E[g_{k+1} \mid F_k]$.
[Footnote 1: According to this rule, if $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{m\times n}$, $D \in \mathbb{R}^{m\times m}$ then for $F = [A\ B;\ C\ D] \in \mathbb{R}^{(n+m)\times(n+m)}$, $\det(F) = \det(A)\det(D - CA^{-1}B)$.]
The condition that $(s_k)$ is aperiodic and irreducible guarantees the existence of the steady state distribution $\mu$. Further, the aperiodicity and irreducibility of $(s_k)$ follows from the same property of the original process $(s_t)$.
For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerably more work the result can be extended to the case when $s_1$ follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form $\rho_{k+1} = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1} + e_{k+1})$, where $e_{k+1}$ is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)).
5 Extensions to action values, stochastic target policies, and other sample weightings
The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy $\pi_b$ is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be $\pi$. In this case the basis functions are dependent on both the states and actions: $\phi : S \times A \to \mathbb{R}^n$. The learning equations are unchanged, except that $\phi_t$ and $\phi'_t$ are redefined as follows:
$$\phi_t = \phi(s_t, a_t), \qquad (12)$$
$$\phi'_t = \sum_a \pi(s_{t+1}, a)\,\phi(s_{t+1}, a). \qquad (13)$$
(We use time indices $t$ denoting physical time.) Here $\pi(s, a)$ is the probability of selecting action $a$ in state $s$ under the target policy $\pi$. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0).
Theorem 5.1 (Convergence of GQE(0)). Assume that $s_t$ is a state sequence generated by following some stationary policy $\pi_b$ in a finite MDP. Let $r_t$ be the corresponding sequence of rewards and let $\phi_t$, $\phi'_t$ be given by the respective equations (12) and (13), and assume that $E[\|\phi_t\|^2 \mid s_{t-1}]$, $E[r_t^2 \mid s_{t-1}]$, $E[\|\phi'_t\|^2 \mid s_{t-1}]$ are uniformly bounded. Let the parameters $\theta_t$, $u_t$ be updated by Equations (9) and (10). Assume that the Markov chain $(s_t)$ is aperiodic and irreducible, so that $\lim_{t\to\infty} P(s_t = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, $a$ be an action chosen by $\pi_b$ in $s$, let $s'$ be the next state obtained and let $a' = \pi(s')$ be the action chosen by the target policy in state $s'$.
Further, let $r(s, a, s')$ be the reward incurred in this transition. Let $A = E[\phi(s, a)(\phi(s, a) - \gamma\phi(s', a'))^\top]$ and $b = E[r(s, a, s')\phi(s, a)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_t$ converges with probability one to a TD solution (4), provided that $s_1$ is selected from the steady-state distribution $\mu$.
The proof is almost identical to that of Theorem 4.2, and hence it is omitted.
Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight $w_t$ for time step $t$ be equal to the probability $\pi(s_t, a_t)$ of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components $(\phi_k, r_k, \phi'_k, w_k)$, with the update rules (9) and (10) replaced by
$$u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)w_k, \qquad (14)$$
and
$$\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k w_k. \qquad (15)$$
Each sample is also weighted by $w_k$ in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, $\mu$, and thus our results hold because they hold for all $\mu$. The choice $w_t = \pi(s_t, a_t)$ is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is $w_t = \pi(s_t, a_t)/\pi_b(s_t, a_t)$, where $\pi_b$ is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1).
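The extensions of Section 5 change only how the features and weights are formed. As a hedged sketch, the helper below computes the expected next feature vector of (13) and a single weighted update in the form of (14) and (15). The function names, argument layout, and constant step sizes are illustrative assumptions and do not appear in the paper.

```python
import numpy as np


def expected_next_features(pi_probs, phi_next_by_action):
    """Equation (13): sum over actions of pi(s', a) * phi(s', a).

    pi_probs: shape (num_actions,), target-policy probabilities in s'.
    phi_next_by_action: shape (num_actions, n), row a holds phi(s', a).
    """
    return np.asarray(pi_probs, dtype=float) @ np.asarray(phi_next_by_action, dtype=float)


def weighted_gtd0_step(theta, u, phi, r, phi_next, w, gamma, alpha, beta):
    """One weighted update: (15) for theta and (14) for u, both using old values."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi  # TD error (3)
    theta_next = theta + alpha * (phi - gamma * phi_next) * (phi @ u) * w
    u_next = u + beta * (delta * phi - u) * w
    return theta_next, u_next


# Assumed toy values: two actions with one-hot state-action features.
phi_prime = expected_next_features([0.25, 0.75], [[1.0, 0.0], [0.0, 1.0]])
theta1, u1 = weighted_gtd0_step(
    theta=np.zeros(2), u=np.array([0.5, 0.0]),
    phi=np.array([1.0, 0.0]), r=1.0, phi_next=np.array([0.0, 1.0]),
    w=0.5, gamma=0.5, alpha=0.1, beta=0.5,
)
```

Setting `w = 1` recovers the unweighted updates (9) and (10), and `w = 0` leaves both vectors unchanged, which matches the sub-sampling view in which non-matching transitions are discarded.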
6 Related work
There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and $O(n)$ complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence.
Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, $E[\delta^2]$; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case).
In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution (see footnote 2).
Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001).
The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is $n$, then these methods require memory and per-time-step computation that is $O(n^2)$. Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to $O(n)$, but are still $O(n^2)$ in memory. Although sparsification methods may reduce the complexity further, they do not help in the general case, and they may apply to $O(n)$ methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007).
In such cases, $O(n^2)$ methods are not feasible.
[Footnote 2: For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).]
7 Conclusion
GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.
Acknowledgments
The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.
Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Borkar, V. S. and Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.
Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.
Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Peters, J., Vijayakumar, S. and Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.
Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18.
Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.
Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.
Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.
Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S., Rafols, E. J., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.
Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
6 0.58040279 195 nips-2008-Regularized Policy Iteration
7 0.56406343 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
8 0.54035169 181 nips-2008-Policy Search for Motor Primitives in Robotics
9 0.50205511 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
10 0.48767865 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
11 0.45660692 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
12 0.32877979 174 nips-2008-Overlaying classifiers: a practical approach for optimal ranking
13 0.31261247 72 nips-2008-Empirical performance maximization for linear rank statistics
14 0.30508286 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
15 0.30289486 159 nips-2008-On Bootstrapping the ROC Curve
16 0.29189774 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
17 0.29099295 170 nips-2008-Online Optimization in X-Armed Bandits
18 0.28897113 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
19 0.28848445 212 nips-2008-Skill Characterization Based on Betweenness
20 0.2821877 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
topicId topicWeight
[(4, 0.506), (6, 0.033), (7, 0.038), (12, 0.029), (28, 0.108), (57, 0.033), (59, 0.014), (63, 0.024), (71, 0.028), (77, 0.052), (83, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.79913175 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
Author: Jonathan Taylor, Doina Precup, Prakash Panagaden
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
2 0.79151946 58 nips-2008-Dependence of Orientation Tuning on Recurrent Excitation and Inhibition in a Network Model of V1
Author: Klaus Wimmer, Marcel Stimberg, Robert Martin, Lars Schwabe, Jorge Mariño, James Schummers, David C. Lyon, Mriganka Sur, Klaus Obermayer
Abstract: The computational role of the local recurrent network in primary visual cortex is still a matter of debate. To address this issue, we analyze intracellular recording data of cat V1, which combine measuring the tuning of a range of neuronal properties with a precise localization of the recording sites in the orientation preference map. For the analysis, we consider a network model of Hodgkin-Huxley type neurons arranged according to a biologically plausible two-dimensional topographic orientation preference map. We then systematically vary the strength of the recurrent excitation and inhibition relative to the strength of the afferent input. Each parametrization gives rise to a different model instance for which the tuning of model neurons at different locations of the orientation map is compared to the experimentally measured orientation tuning of membrane potential, spike output, excitatory, and inhibitory conductances. A quantitative analysis shows that the data provides strong evidence for a network model in which the afferent input is dominated by strong, balanced contributions of recurrent excitation and inhibition. This recurrent regime is close to a regime of “instability”, where strong, self-sustained activity of the network occurs. The firing rate of neurons in the best-fitting network is particularly sensitive to small modulations of model parameters, which could be one of the functional benefits of a network operating in this particular regime.
Author: Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, behavior policy, and target policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d. policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.
1 Off-policy learning methods
Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).
(∗ Csaba Szepesvári is on leave from MTA SZTAKI.)
Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters, the most popular form of function approximation in reinforcement learning. Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem insofar as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems. The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998). Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation.
If any one of these four is abandoned, then stable methods can be obtained relatively easily. But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance sampling methods. Our algorithm can be viewed as performing stochastic gradient descent in a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporal-difference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0).
2 Sub-sampling and i.i.d. formulations of temporal-difference learning
In this section we formulate the off-policy policy-evaluation problem for one-step temporal-difference learning such that the data consists of independent, identically-distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, $t = 1, 2, \ldots$, the environment is in a state $s_t \in S$, the agent chooses an action $a_t \in A$, and then the environment emits a reward $r_t \in \mathbb{R}$ and transitions to its next state $s_{t+1} \in S$. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state.
The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy $\pi : S \to A$. The objective is to learn an approximation to its state-value function: $V^\pi(s) = E_\pi\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s\right]$, (1) where $\gamma \in [0, 1)$ is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets. In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state $s \in S$ there is a corresponding feature vector $\phi(s) \in \mathbb{R}^n$, with $n \ll |S|$. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector $\theta \in \mathbb{R}^n$: $V^\pi(s) \approx \theta^\top \phi(s)$. (2) Further, we assume that the states $s_t$ are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partial-observability formulations such as POMDPs as a special case. The environment and the behavior policy together generate a stream of states, actions and rewards, $s_1, a_1, r_1, s_2, a_2, r_2, \ldots$, which we can break into causally related 4-tuples, $(s_1, a_1, r_1, s'_1), (s_2, a_2, r_2, s'_2), \ldots$, where $s'_t = s_{t+1}$. For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy. With a slight abuse of notation, let $s_k$ denote the $k$th state in which an on-policy action was taken, and let $r_k$ and $s'_k$ denote the associated reward and next state.
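The sub-sampling step described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the names `subsample_on_policy`, `trajectory`, and `target_policy` are hypothetical, the trajectory is taken to be a list of `(s, a, r, s_next)` tuples, and the deterministic target policy is a plain dict from state to action.

```python
def subsample_on_policy(trajectory, target_policy):
    """Keep only transitions whose action matches the deterministic target policy.

    trajectory: list of (s, a, r, s_next) tuples generated by the behavior policy.
    target_policy: dict mapping each state to the action the target policy takes.
    Returns (s, r, s_next) triples; the action is dropped because it can be
    recovered from the state via the target policy.
    """
    return [(s, r, s_next)
            for (s, a, r, s_next) in trajectory
            if a == target_policy[s]]


# Hypothetical two-state example: the target policy always chooses "right".
trajectory = [
    (0, "right", 0.0, 1),
    (1, "left", 0.0, 0),   # off-policy action, discarded
    (0, "right", 0.0, 1),
    (1, "right", 1.0, 0),
]
target_policy = {0: "right", 1: "right"}
on_policy = subsample_on_policy(trajectory, target_policy)
```

In a function-approximation setting the retained states would then be mapped through the feature function before being handed to the learner.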
The $k$th on-policy transition, denoted $(s_k, r_k, s'_k)$, is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple $(\phi(s_k), r_k, \phi(s'_k))$. The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution $\mu(s) = \lim_{k\to\infty} \Pr\{s_k = s\}$. For any state $s$, the MDP and target policy together determine an $N \times N$ state-transition-probability matrix $P$, where $p_{ss'} = \Pr\{s'_k = s' \mid s_k = s\}$, and an $N \times 1$ expected-reward vector $R$, where $R_s = E[r_k \mid s_k = s]$. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of $(\phi(s_k), r_k, \phi(s'_k))$ respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case. In the i.i.d. formulation, the states $s_k$ are generated independently and identically distributed according to an arbitrary probability distribution $\mu$. From each $s_k$, a corresponding $s'_k$ is generated according to the on-policy state-transition matrix, $P$, and a corresponding $r_k$ is generated according to an arbitrary bounded distribution with expected value $R_{s_k}$. The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence $(\phi(s_k), r_k, \phi(s'_k))$, for $k = 1, 2, \ldots$ Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables $(\phi, r, \phi')$ drawn from $\mu$. It remains to define the objective of learning. The TD error for the linear setting is $\delta = r + \gamma\theta^\top\phi' - \theta^\top\phi$. (3) Given this, we define the one-step linear TD solution as any value of $\theta$ at which $0 = E[\delta\phi] = -A\theta + b$, (4) where $A = E[\phi(\phi - \gamma\phi')^\top]$ and $b = E[r\phi]$.
This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if $\theta$ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity.
3 The GTD(0) algorithm
We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector $E[\delta\phi]$ can be viewed as an error in the current solution $\theta$. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector: $J(\theta) = E[\delta\phi]^\top E[\delta\phi]$. (5) This objective function is quadratic and unimodal; its minimum value of 0 is achieved when $E[\delta\phi] = 0$, which can always be achieved. The gradient of this objective function is $\nabla_\theta J(\theta) = 2\,(\nabla_\theta E[\delta\phi]^\top)\,E[\delta\phi] = 2\,E[\phi(\nabla_\theta\delta)^\top]^\top E[\delta\phi] = -2\,E[\phi(\phi - \gamma\phi')^\top]^\top E[\delta\phi]$. (6) This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.
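The objective (5) and its gradient (6) can be checked numerically on a small problem. The sketch below is illustrative only: the function names and the two-transition toy chain (one-hot features, $\gamma = 0.9$, each transition weighted equally) are assumptions of this example, and expectations are replaced by plain averages over the listed samples.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def mean_delta_phi(samples, theta, gamma):
    """Average of delta * phi over the samples, an estimate of E[delta phi]."""
    n = len(theta)
    total = [0.0] * n
    for phi, r, phi_next in samples:
        delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error (3)
        for i in range(n):
            total[i] += delta * phi[i]
    return [t / len(samples) for t in total]


def objective(samples, theta, gamma):
    """J(theta) = E[delta phi]^T E[delta phi], as in (5)."""
    e = mean_delta_phi(samples, theta, gamma)
    return dot(e, e)


def gradient(samples, theta, gamma):
    """-2 E[phi (phi - gamma phi')^T]^T E[delta phi], as in (6)."""
    e = mean_delta_phi(samples, theta, gamma)
    n = len(theta)
    grad = [0.0] * n
    for phi, _r, phi_next in samples:
        phi_dot_e = dot(phi, e)  # scalar phi^T E[delta phi]
        for i in range(n):
            grad[i] += -2.0 * (phi[i] - gamma * phi_next[i]) * phi_dot_e
    return [g / len(samples) for g in grad]


# Assumed toy chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
samples = [
    ([1.0, 0.0], 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, [0.0, 0.0]),
]
gamma = 0.9
print(objective(samples, [0.9, 1.0], gamma))  # objective at the TD solution of this toy chain
print(gradient(samples, [0.3, -0.2], gamma))
```

Because $J$ is quadratic in $\theta$, a central finite difference of `objective` should agree with `gradient` up to rounding, which gives a quick way to confirm the sign conventions in (6).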
First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix $A = E[\phi(\phi - \gamma\phi')^\top]$. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products $\phi(\phi - \gamma\phi')^\top$. Note that $A$ is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on $\theta$ and would not need to be re-estimated if $\theta$ were to change. Let $A_k$ be the estimate of $A$ after observing the first $k$ samples, $(\phi_1, r_1, \phi'_1), \ldots, (\phi_k, r_k, \phi'_k)$. Then this algorithm is defined by $A_k = \frac{1}{k}\sum_{i=1}^{k}\phi_i(\phi_i - \gamma\phi'_i)^\top$ (7) along with the gradient descent rule: $\theta_{k+1} = \theta_k + \alpha_k A_k^\top \delta_k \phi_k$, $k \ge 1$, (8) where $\theta_1$ is arbitrary, $\delta_k = r_k + \gamma\theta_k^\top\phi'_k - \theta_k^\top\phi_k$, and $\alpha_k > 0$ is a series of step-size parameters, possibly decreasing over time. We call this algorithm $A^\top$TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix $A^\top$. Although we find this algorithm interesting, we do not consider it further here because it requires $O(n^2)$ memory and computation per time step. The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector $E[\delta\phi]$, and to sample the first expectation, $E[\phi(\phi - \gamma\phi')^\top]$. Let $u_k$ denote the estimate of $E[\delta\phi]$ after observing the first $k - 1$ samples, with $u_1 = 0$. The GTD(0) algorithm is defined by $u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)$ (9) and $\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k$, (10) where $\theta_1$ is arbitrary, $\delta_k$ is as in (3) using $\theta_k$, and $\alpha_k > 0$ and $\beta_k > 0$ are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is $O(n)$ per time step.
4 Convergence
The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions.
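The GTD(0) updates (9) and (10) defined above can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the two-transition toy chain with one-hot features, the constant step sizes, and the deterministic sweep over the two transitions (used instead of i.i.d. sampling so that the run is reproducible) are all assumptions of this example.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def gtd0_step(theta, u, phi, r, phi_next, alpha, beta, gamma):
    """One GTD(0) update following (9) and (10), using O(n) operations."""
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error (3)
    phi_dot_u = dot(phi, u)  # scalar phi_k^T u_k, formed first (right-to-left)
    new_theta = [th + alpha * (p - gamma * pn) * phi_dot_u
                 for th, p, pn in zip(theta, phi, phi_next)]      # update (10)
    new_u = [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]  # update (9)
    return new_theta, new_u


def run_gtd0(samples, n, alpha, beta, gamma, sweeps):
    """Sweep repeatedly over a fixed list of (phi, r, phi_next) samples."""
    theta = [0.0] * n
    u = [0.0] * n
    for _ in range(sweeps):
        for phi, r, phi_next in samples:
            theta, u = gtd0_step(theta, u, phi, r, phi_next, alpha, beta, gamma)
    return theta, u


# Assumed toy chain: state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
samples = [
    ([1.0, 0.0], 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, [0.0, 0.0]),
]
theta, u = run_gtd0(samples, n=2, alpha=0.01, beta=0.01, gamma=0.9, sweeps=20000)
print(theta, u)
```

For this toy chain the TD solution of (4) is $\theta = (0.9, 1)$, so the printed parameters should lie close to those values and `u` should be close to zero. Both updates use the old $\theta_k$ and $u_k$, matching the order in which (9) and (10) are stated.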
In particular, we have the following result:
Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9,10) with step-size sequences $\alpha_k$ and $\beta_k$ satisfying $\beta_k = \eta\alpha_k$, $\eta > 0$, $\alpha_k, \beta_k \in (0, 1]$, $\sum_{k=0}^{\infty}\alpha_k = \infty$, $\sum_{k=0}^{\infty}\alpha_k^2 < \infty$. Further assume that $(\phi_k, r_k, \phi'_k)$ is an i.i.d. sequence with uniformly bounded second moments. Let $A = E[\phi_k(\phi_k - \gamma\phi'_k)^\top]$ and $b = E[r_k\phi_k]$ (note that $A$ and $b$ are well-defined because the distribution of $(\phi_k, r_k, \phi'_k)$ does not depend on the sequence index $k$). Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4).
Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with $2n$ components, $\rho_k^\top = (v_k^\top, \theta_k^\top)$, where $v_k = u_k/\sqrt{\eta}$, and a new reward-related vector with $2n$ components, $g_{k+1}^\top = (r_k\phi_k^\top, 0^\top)$: $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G_{k+1}\rho_k + g_{k+1})$, where $G_{k+1} = \begin{pmatrix} -\sqrt{\eta}\,I & \phi_k(\gamma\phi'_k - \phi_k)^\top \\ (\phi_k - \gamma\phi'_k)\phi_k^\top & 0 \end{pmatrix}$. Let $G = E[G_k]$ and $g = E[g_k]$. Note that $G$ and $g$ are well-defined because, by assumption, the process $\{\phi_k, r_k, \phi'_k\}_k$ is i.i.d. In particular, $G = \begin{pmatrix} -\sqrt{\eta}\,I & -A \\ A^\top & 0 \end{pmatrix}$, $g = \begin{pmatrix} b \\ 0 \end{pmatrix}$. Further, note that (4) follows from $G\rho + g = 0$, (11) where $\rho^\top = (v^\top, \theta^\top)$. Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write $\rho_{k+1} = \rho_k + \alpha_k\sqrt{\eta}\,(G\rho_k + g + (G_{k+1} - G)\rho_k + (g_{k+1} - g)) = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1})$, where $\alpha'_k = \alpha_k\sqrt{\eta}$, $h(\rho) = g + G\rho$ and $M_{k+1} = (G_{k+1} - G)\rho_k + g_{k+1} - g$. Let $\mathcal{F}_k = \sigma(\rho_1, M_1, \ldots, \rho_{k-1}, M_k)$.
Theorem 2.2 requires the verification of the following conditions: (i) The function $h$ is Lipschitz and $h_\infty(\rho) = \lim_{r\to\infty} h(r\rho)/r$ is well-defined for every $\rho \in \mathbb{R}^{2n}$; (ii-a) The sequence $(M_k, \mathcal{F}_k)$ is a martingale difference sequence, and (ii-b) for some $C_0 > 0$, $E[\|M_{k+1}\|^2 \mid \mathcal{F}_k] \le C_0(1 + \|\rho_k\|^2)$ holds for any initial parameter vector $\rho_1$; (iii) The sequence $\alpha'_k$ satisfies $0 < \alpha'_k \le 1$, $\sum_{k=1}^{\infty}\alpha'_k = \infty$, $\sum_{k=1}^{\infty}(\alpha'_k)^2 < +\infty$; and (iv) The ODE $\dot{\rho} = h(\rho)$ has a globally asymptotically stable equilibrium. Clearly, $h(\rho)$ is Lipschitz with coefficient $\|G\|$ and $h_\infty(\rho) = G\rho$. By construction, $(M_k, \mathcal{F}_k)$ satisfies $E[M_{k+1} \mid \mathcal{F}_k] = 0$ and $M_k \in \mathcal{F}_k$, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of $(\phi_k, r_k, \phi'_k)$. Condition (iii) is satisfied by our conditions on the step-size sequences $\alpha_k$, $\beta_k$. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of $G$ are negative. First, let us show that $G$ is non-singular. Using the determinant rule for partitioned matrices (footnote 1) we get $\det(G) = \det(A^\top A) \neq 0$. This indicates that all the eigenvalues of $G$ are non-zero. Now, let $\lambda \in \mathbb{C}$, $\lambda \neq 0$ be an eigenvalue of $G$ with corresponding normalized eigenvector $x \in \mathbb{C}^{2n}$; that is, $\|x\|^2 = x^* x = 1$, where $x^*$ is the conjugate transpose of $x$. Hence $x^* G x = \lambda$. Let $x^\top = (x_1^\top, x_2^\top)$, where $x_1, x_2 \in \mathbb{C}^n$. Using the definition of $G$, $\lambda = x^* G x = -\sqrt{\eta}\,\|x_1\|^2 - x_1^* A x_2 + x_2^* A^\top x_1$. Because $A$ is real, $A^* = A^\top$, and it follows that $(x_1^* A x_2)^* = x_2^* A^\top x_1$, so the last two terms together are purely imaginary. Thus, $\mathrm{Re}(\lambda) = \mathrm{Re}(x^* G x) = -\sqrt{\eta}\,\|x_1\|^2 \le 0$. We are now done if we show that $x_1$ cannot be zero. If $x_1 = 0$, then from $\lambda = x^* G x$ we get that $\lambda = 0$, which contradicts $\lambda \neq 0$. The next result concerns the convergence of GTD(0) when $(\phi_k, r_k, \phi'_k)$ is obtained by the off-policy sub-sampling process described originally in Section 2.
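As a sanity check on the eigenvalue argument in the proof of Theorem 4.1, it may help to work through the scalar case $n = 1$ with $A = a \neq 0$. This special case is an illustration added here, not part of the original proof.

```latex
G = \begin{pmatrix} -\sqrt{\eta} & -a \\ a & 0 \end{pmatrix},
\qquad
\det(G - \lambda I) = \lambda^2 + \sqrt{\eta}\,\lambda + a^2 = 0,
\qquad
\lambda_{\pm} = \frac{-\sqrt{\eta} \pm \sqrt{\eta - 4a^2}}{2}.
```

If $\eta < 4a^2$ the eigenvalues are complex with real part $-\sqrt{\eta}/2 < 0$; otherwise both are real and negative because $\sqrt{\eta - 4a^2} < \sqrt{\eta}$. In either case $\lambda_+\lambda_- = a^2 = \det(A^\top A)$, in line with the determinant computed in the proof.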
We make the following assumption:
Assumption A1. The behavior policy $\pi_b$ (generator of the actions $a_t$) selects all actions of the target policy $\pi$ with positive probability in every state, and the target policy is deterministic.
This assumption is needed to ensure that the sub-sampled process $s_k$ is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that $s_k$ is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the behavior policy form Markov times with respect to the filtration defined by the original process $s_t$). The following theorem shows that the conclusion of the previous result continues to hold in this case:
Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters $\theta_k$, $u_k$ be updated by (9,10). Further assume that $(\phi_k, r_k, \phi'_k)$ is such that $E[\|\phi_k\|^2 \mid s_{k-1}]$, $E[r_k^2 \mid s_{k-1}]$, $E[\|\phi'_k\|^2 \mid s_{k-1}]$ are uniformly bounded. Assume that the Markov chain $(s_k)$ is aperiodic and irreducible, so that $\lim_{k\to\infty} P(s_k = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, and let $s'$ be a state obtained by following $\pi$ for one time step in the MDP from $s$. Further, let $r(s, s')$ be the reward incurred. Let $A = E[\phi(s)(\phi(s) - \gamma\phi(s'))^\top]$ and $b = E[r(s, s')\phi(s)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_k$ converges with probability one to the TD solution (4), provided that $s_1 \sim \mu$.
Footnote 1: According to this rule, if $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times m}$, $C \in \mathbb{R}^{m\times n}$, $D \in \mathbb{R}^{m\times m}$, then for $F = [A\ B;\ C\ D] \in \mathbb{R}^{(n+m)\times(n+m)}$, $\det(F) = \det(A)\det(D - CA^{-1}B)$.
Proof. The proof of Theorem 4.1 goes through without any changes once we observe that $G = E[G_{k+1} \mid \mathcal{F}_k]$ and $g = E[g_{k+1} \mid \mathcal{F}_k]$. The condition that $(s_k)$ is aperiodic and irreducible guarantees the existence of the steady state distribution $\mu$. Further, the aperiodicity and irreducibility of $(s_k)$ follows from the same property of the original process $(s_t)$.
For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996). With considerably more work the result can be extended to the case when $s_1$ follows an arbitrary distribution. This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form $\rho_{k+1} = \rho_k + \alpha'_k(h(\rho_k) + M_{k+1} + e_{k+1})$, where $e_{k+1}$ is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)).
5 Extensions to action values, stochastic target policies, and other sample weightings
The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy $\pi_b$ is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be $\pi$. In this case the basis functions are dependent on both the states and actions: $\phi : S \times A \to \mathbb{R}^n$. The learning equations are unchanged, except that $\phi_t$ and $\phi'_t$ are redefined as follows: $\phi_t = \phi(s_t, a_t)$, (12) $\phi'_t = \sum_a \pi(s_{t+1}, a)\,\phi(s_{t+1}, a)$. (13) (We use time indices $t$ denoting physical time.) Here $\pi(s, a)$ is the probability of selecting action $a$ in state $s$ under the target policy $\pi$. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0).
Theorem 5.1 (Convergence of GQE(0)). Assume that $s_t$ is a state sequence generated by following some stationary policy $\pi_b$ in a finite MDP. Let $r_t$ be the corresponding sequence of rewards and let $\phi_t$, $\phi'_t$ be given by the respective equations (12) and (13), and assume that $E[\|\phi_t\|^2 \mid s_{t-1}]$, $E[r_t^2 \mid s_{t-1}]$, $E[\|\phi'_t\|^2 \mid s_{t-1}]$ are uniformly bounded. Let the parameters $\theta_t$, $u_t$ be updated by Equations (9) and (10). Assume that the Markov chain $(s_t)$ is aperiodic and irreducible, so that $\lim_{t\to\infty} P(s_t = s' \mid s_0 = s) = \mu(s')$ exists and is unique. Let $s$ be a state randomly drawn from $\mu$, $a$ be an action chosen by $\pi_b$ in $s$, let $s'$ be the next state obtained and let $a' = \pi(s')$ be the action chosen by the target policy in state $s'$.
Further, let $r(s, a, s')$ be the reward incurred in this transition. Let $A = E[\phi(s, a)(\phi(s, a) - \gamma\phi(s', a'))^\top]$ and $b = E[r(s, a, s')\phi(s, a)]$. Assume that $A$ is non-singular. Then the parameter vector $\theta_t$ converges with probability one to a TD solution (4), provided that $s_1$ is selected from the steady-state distribution $\mu$.
The proof is almost identical to that of Theorem 4.2, and hence it is omitted. Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight $w_t$ for time step $t$ be equal to the probability $\pi(s_t, a_t)$ of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components $(\phi_k, r_k, \phi'_k, w_k)$, with the update rules (9) and (10) replaced by $u_{k+1} = u_k + \beta_k(\delta_k\phi_k - u_k)w_k$ (14) and $\theta_{k+1} = \theta_k + \alpha_k(\phi_k - \gamma\phi'_k)\phi_k^\top u_k w_k$. (15) Each sample is also weighted by $w_k$ in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, $\mu$, and thus our results hold because they hold for all $\mu$. The choice $w_t = \pi(s_t, a_t)$ is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is $w_t = \pi(s_t, a_t)/\pi_b(s_t, a_t)$, where $\pi_b$ is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1).
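The two extensions in this section, the expected next feature vector (13) and the weighted updates (14) and (15), can be sketched as follows. The function names and the dict-based representation of $\pi(s_{t+1}, \cdot)$ and $\phi(s_{t+1}, \cdot)$ are assumptions made for this illustration, not something the paper prescribes.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def expected_next_features(action_probs, action_features):
    """phi'_t = sum_a pi(s_{t+1}, a) * phi(s_{t+1}, a), as in (13).

    action_probs: dict action -> pi(s_{t+1}, a) under the target policy.
    action_features: dict action -> feature vector phi(s_{t+1}, a).
    """
    n = len(next(iter(action_features.values())))
    out = [0.0] * n
    for a, p in action_probs.items():
        for i, f in enumerate(action_features[a]):
            out[i] += p * f
    return out


def weighted_gtd0_step(theta, u, phi, r, phi_next, w, alpha, beta, gamma):
    """One weighted update following (14) and (15); w is the sample weight w_k."""
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error (3)
    phi_dot_u = dot(phi, u)  # scalar phi_k^T u_k
    new_theta = [th + alpha * (p - gamma * pn) * phi_dot_u * w
                 for th, p, pn in zip(theta, phi, phi_next)]          # update (15)
    new_u = [ui + beta * (delta * p - ui) * w for ui, p in zip(u, phi)]  # update (14)
    return new_theta, new_u


# Hypothetical next state with two actions and one-hot state-action features.
phi_next = expected_next_features({"left": 0.25, "right": 0.75},
                                  {"left": [1.0, 0.0], "right": [0.0, 1.0]})
theta, u = weighted_gtd0_step([0.0, 0.0], [1.0, 0.0], [1.0, 0.0], 0.0,
                              [0.0, 1.0], w=0.5, alpha=0.5, beta=0.5, gamma=0.5)
print(phi_next, theta, u)
```

Setting `w` to 1 recovers the unweighted updates (9) and (10), and setting it to 0 leaves both `theta` and `u` unchanged, which mirrors the sub-sampling case for a deterministic target policy.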
6 Related work
There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and $O(n)$ complexity) but none has been completely successful. One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence. Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features. These methods can be viewed as gradient descent in the expected squared TD error, $E[\delta^2]$; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case).
In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution (see footnote 2). Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001). The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is $n$, then these methods require memory and per-time-step computation that is $O(n^2)$. Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step complexity to $O(n)$, but are still $O(n^2)$ in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may apply to $O(n)$ methods as well to further reduce their complexity. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007).
In such cases, $O(n^2)$ methods are not feasible.
7 Conclusion
GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird’s (1995) “star” problem. On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.
Footnote 2: For a counterexample, consider that given in Dayan’s (1992) Figure 2, except now consider that state A is actually two states, A and A’, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case $V(A)$ and $V(B)$ will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).
Acknowledgments
The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.
Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.
Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.
Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.
Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.
Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.
Geramifard, A., Bowling, M., Sutton, R. S. (2006). Incremental least-square temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.
Lagoudakis, M., Parr, R. (2003). Least squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.
Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural Actor-Critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.
Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy Learning with Recognizers. Advances in Neural Information Processing Systems 18.
Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.
Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.
Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.
Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. In Proceedings of the 5th International Conference on Computers and Games.
Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.
Sutton, R. S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.
Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.
Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.
4 0.55973297 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
Author: Matthew Botvinick, James An
Abstract: Research in animal learning and behavioral neuroscience has distinguished between two forms of action control: a habit-based form, which relies on stored action values, and a goal-directed form, which forecasts and compares action outcomes based on a model of the environment. While habit-based control has been the subject of extensive computational research, the computational principles underlying goal-directed control in animals have so far received less attention. In the present paper, we advance a computational framework for goal-directed control in animals and humans. We take three empirically motivated points as founding premises: (1) Neurons in dorsolateral prefrontal cortex represent action policies, (2) Neurons in orbitofrontal cortex represent rewards, and (3) Neural computation, across domains, can be appropriately understood as performing structured probabilistic inference. On a purely computational level, the resulting account relates closely to previous work using Bayesian inference to solve Markov decision problems, but extends this work by introducing a new algorithm, which provably converges on optimal plans. On a cognitive and neuroscientific level, the theory provides a unifying framework for several different forms of goal-directed action selection, placing emphasis on a novel form, within which orbitofrontal reward representations directly drive policy selection.

1 Goal-directed action control

In the study of human and animal behavior, it is a long-standing idea that reward-based decision making may rely on two qualitatively different mechanisms. In habit-based decision making, stimuli elicit reflex-like responses, shaped by past reinforcement [1]. In goal-directed or purposive decision making, on the other hand, actions are selected based on a prospective consideration of possible outcomes and future lines of action [2].
Over the past twenty years or so, the attention of cognitive neuroscientists and computationally minded psychologists has tended to focus on habit-based control, due in large part to interest in potential links between dopaminergic function and temporal-difference algorithms for reinforcement learning. However, a resurgence of interest in purposive action selection is now being driven by innovations in animal behavior research, which have yielded powerful new behavioral assays [3], and revealed specific effects of focal neural damage on goal-directed behavior [4]. In discussing some of the relevant data, Daw, Niv and Dayan [5] recently pointed out the close relationship between purposive decision making, as understood in the behavioral sciences, and model-based methods for the solution of Markov decision problems (MDPs), where action policies are derived from a joint analysis of a transition function (a mapping from states and actions to outcomes) and a reward function (a mapping from states to rewards). Beyond this important insight, little work has yet been done to characterize the computations underlying goal-directed action selection (though see [6, 7]). As discussed below, a great deal of evidence indicates that purposive action selection depends critically on a particular region of the brain, the prefrontal cortex. However, it is currently a critical, and quite open, question what the relevant computations within this part of the brain might be. Of course, the basic computational problem of formulating an optimal policy given a model of an MDP has been extensively studied, and there is no shortage of algorithms one might consider as potentially relevant to prefrontal function (e.g., value iteration, policy iteration, backward induction, linear programming, and others). However, from a cognitive and neuroscientific perspective, there is one approach to solving MDPs that it seems particularly appealing to consider.
In particular, several researchers have suggested methods for solving MDPs through probabilistic inference [8-12]. The interest of this idea, in the present context, derives from a recent movement toward framing human and animal information processing, as well as the underlying neural computations, in terms of structured probabilistic inference [13, 14]. Given this perspective, it is inviting to consider whether goal-directed action selection, and the neural mechanisms that underlie it, might be understood in those same terms. One challenge in investigating this possibility is that previous research furnishes no ‘off-the-shelf’ algorithm for solving MDPs through probabilistic inference that both provably yields optimal policies and aligns with what is known about action selection in the brain. We endeavor here to start filling in that gap. In the following section, we introduce an account of how goal-directed action selection can be performed based on probabilistic inference, within a network whose components map grossly onto specific brain structures. As part of this account, we introduce a new algorithm for solving MDPs through Bayesian inference, along with a convergence proof. We then present results from a set of simulations illustrating how the framework would account for a variety of behavioral phenomena that are thought to involve purposive action selection.

2 Computational model

As noted earlier, the prefrontal cortex (PFC) is believed to play a pivotal role in purposive behavior. This is indicated by a broad association between prefrontal lesions and impairments in goal-directed action in both humans (see [15]) and animals [4]. Single-unit recording and other data suggest that different sectors of PFC make distinct contributions.
In particular, neurons in dorsolateral prefrontal cortex (DLPFC) appear to encode task-specific mappings from stimuli to responses (e.g., [16]): “task representations,” in the language of psychology, or “policies” in the language of dynamic programming. Although there is some understanding of how policy representations in DLPFC may guide action execution [15], little is yet known about how these representations are themselves selected. Our most basic proposal is that DLPFC policy representations are selected in a prospective, model-based fashion, leveraging information about action-outcome contingencies (i.e., the transition function) and about the incentive value associated with specific outcomes or states (the reward function). There is extensive evidence to suggest that state-reward associations are represented in another area of the PFC, the orbitofrontal cortex (OFC) [17, 18]. As for the transition function, although it is clear that the brain contains detailed representations of action-outcome associations [19], their anatomical localization is not yet entirely clear. However, some evidence suggests that the environmental effects of simple actions may be represented in inferior fronto-parietal cortex [20], and there is also evidence suggesting that medial temporal structures may be important in forecasting action outcomes [21]. As detailed in the next section, our model assumes that policy representations in DLPFC, reward representations in OFC, and representations of states and actions in other brain regions, are coordinated within a network structure that represents their causal or statistical interdependencies, and that policy selection occurs, within this network, through a process of probabilistic inference.

2.1 Architecture

The implementation takes the form of a directed graphical model [22], with the layout shown in Figure 1. Each node represents a discrete random variable.
State variables ($s$), representing the set of m possible world states, serve the role played by parietal and medial temporal cortices in representing action outcomes. Action variables ($a$), representing the set of available actions, play the role of high-level cortical motor areas involved in the programming of action sequences. Policy variables ($\pi$), each representing the set of all deterministic policies associated with a specific state, capture the representational role of DLPFC. Local and global utility variables, described further below, capture the role of OFC in representing incentive value. A separate set of nodes is included for each discrete time-step up to the planning horizon.

Fig 1. Left: Single-step decision. Right: Sequential decision. Each time-slice includes a set of m policy nodes.

The conditional probabilities associated with each variable are represented in tabular form. State probabilities are based on the state and action variables in the preceding time-step, and thus encode the transition function. Action probabilities depend on the current state and its associated policy variable. Utilities depend only on the current state. Rather than representing reward magnitude as a continuous variable, we adopt an approach introduced by [23], representing reward through the posterior probability of a binary variable ($u$). States associated with large positive reward raise $p(u)$ (i.e., $p(u = 1 \mid s)$) near to one; states associated with large negative rewards reduce $p(u)$ to near zero. In the simulations reported below, we used a simple linear transformation to map from scalar reward values to $p(u)$:

\[ p(u \mid s_i) = \frac{1}{2}\left(\frac{R(s_i)}{r_{\max}} + 1\right), \qquad r_{\max} \leftarrow \max_j \lvert R(s_j) \rvert \tag{1} \]

In situations involving sequential actions, expected returns from different time-steps must be integrated into a global representation of expected value. In order to accomplish this, we employ a technique proposed by [8], introducing a “global” utility variable ($u_G$).
Like $u$, this is a binary random variable, but associated with a posterior probability determined as:

\[ p(u_G) = \frac{1}{N} \sum_i p(u_i) \tag{2} \]

where N is the number of u nodes. (Temporal discounting can be incorporated into the framework through minimal modifications to Equation 2.) The network as a whole embodies a generative model for instrumental action. The basic idea is to use this model as a substrate for probabilistic inference, in order to arrive at optimal policies. There are three general methods for accomplishing this, which correspond to three forms of query. First, a desired outcome state can be identified, by treating one of the state variables (as well as the initial state variable) as observed (see [9] for an application of this approach). Second, the expected return for specific plans can be evaluated and compared by conditioning on specific sets of values over the policy nodes (see [5, 21]). However, our focus here is on a less obvious possibility, which is to condition directly on the utility variable $u_G$, as explained next.

2.2 Policy selection by probabilistic inference: an iterative algorithm

Cooper [23] introduced the idea of inferring optimal decisions in influence diagrams by converting utility nodes into binary random variables and then conditioning on these variables. Although this technique has been adopted in some more recent work [9, 12], we are aware of no application that guarantees optimal decisions, in the expected-reward sense, in multi-step tasks. We introduce here a simple algorithm that does furnish such a guarantee. The procedure is as follows: (1) Initialize the policy nodes with any set of non-deterministic priors. (2) Treating the initial state and $u_G$ as observed variables ($u_G = 1$; in the single-action situation, where there is only one u node, it is this variable that is treated as observed, $u = 1$), use standard belief
propagation (or a comparable algorithm) to infer the posterior distributions over all policy nodes. (3) Set the prior distributions over the policy nodes to the values (posteriors) obtained in step 2. (4) Go to step 2. The next two sections present proofs of monotonicity and convergence for this algorithm.

2.2.1 Monotonicity

We show first that, at each policy node, the probability associated with the optimal policy will rise on every iteration. Define $\pi^*$ as follows:

\[ p(u_G \mid \pi^*, \pi^+) > p(u_G \mid \pi', \pi^+) \quad \forall\, \pi' \neq \pi^* \tag{3} \]

where $\pi^+$ is the current set of probability distributions at all policy nodes on subsequent time-steps. (Note that we assume here, for simplicity, that there is a unique optimal policy.) The objective is to establish that:

\[ p(\pi^*_t) > p(\pi^*_{t-1}) \tag{4} \]

where t indexes processing iterations. The dynamics of the network entail that

\[ p(\pi_t) = p(\pi_{t-1} \mid u_G) \tag{5} \]

where $\pi$ represents any value (i.e., policy) of the decision node being considered. Substituting this into (4) gives

\[ p(\pi^*_{t-1} \mid u_G) > p(\pi^*_{t-1}) \tag{6} \]

From this point on the focus is on a single iteration, which permits us to omit the relevant subscripts. Applying Bayes’ law to (6) yields

\[ \frac{p(u_G \mid \pi^*)\, p(\pi^*)}{\sum_{\pi} p(u_G \mid \pi)\, p(\pi)} > p(\pi^*) \tag{7} \]

Canceling, and bringing the denominator up, this becomes

\[ p(u_G \mid \pi^*) > \sum_{\pi} p(u_G \mid \pi)\, p(\pi) \tag{8} \]

Rewriting the left hand side, we obtain

\[ \sum_{\pi} p(u_G \mid \pi^*)\, p(\pi) > \sum_{\pi} p(u_G \mid \pi)\, p(\pi) \tag{9} \]

Subtracting and further rearranging:

\[ \sum_{\pi} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi) \right] p(\pi) > 0 \tag{10} \]

\[ \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi^*) \right] p(\pi^*) + \sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0 \tag{11} \]

\[ \sum_{\pi' \neq \pi^*} \left[ p(u_G \mid \pi^*) - p(u_G \mid \pi') \right] p(\pi') > 0 \tag{12} \]

Note that this last inequality (12) follows from the definition of $\pi^*$. Remark: Of course, the identity of $\pi^*$ depends on $\pi^+$. In particular, the policy $\pi^*$ will only be part of a globally optimal plan if the set of choices $\pi^+$ is optimal. Fortunately, this requirement is guaranteed to be met, as long as no upper bound is placed on the number of processing cycles.
Recalling that we are considering only finite-horizon problems, note that for policies leading to states with no successors, $\pi^+$ is empty. Thus $\pi^*$ at the relevant policy nodes is fixed, and is guaranteed to be part of the optimal policy. The proof above shows that the probability of $\pi^*$ will continuously rise. Once it reaches a maximum, $\pi^*$ at immediately preceding decisions will perforce fit with the globally optimal policy. The process works backward, in the fashion of backward induction.

2.2.2 Convergence

Continuing with the same notation, we show now that

\[ \lim_{t \to \infty} p_t(\pi^* \mid u_G) = 1 \tag{13} \]

Note that, if we apply Bayes’ law recursively,

\[ p_t(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_t(\pi)}{p_t(u_G)} = \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)} = \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_t(u_G)\, p_{t-1}(u_G)\, p_{t-2}(u_G)} \cdots \tag{14} \]

Thus,

\[ p_1(\pi \mid u_G) = \frac{p(u_G \mid \pi)\, p_1(\pi)}{p_1(u_G)}, \quad p_2(\pi \mid u_G) = \frac{p(u_G \mid \pi)^2\, p_1(\pi)}{p_2(u_G)\, p_1(u_G)}, \quad p_3(\pi \mid u_G) = \frac{p(u_G \mid \pi)^3\, p_1(\pi)}{p_3(u_G)\, p_2(u_G)\, p_1(u_G)}, \tag{15} \]

and so forth. Thus, what we wish to prove is

\[ \frac{p(u_G \mid \pi^*)^{\infty}\, p_1(\pi^*)}{\prod_{t=1}^{\infty} p_t(u_G)} = 1 \tag{16} \]

or, rearranging,

\[ \prod_{t=1}^{\infty} \frac{p_t(u_G)}{p(u_G \mid \pi^*)} = p_1(\pi^*). \tag{17} \]

Note that, given the stipulated relationship between $p(\pi)$ on each processing iteration and $p(\pi \mid u_G)$ on the previous iteration,

\[ p_t(u_G) = \sum_{\pi} p(u_G \mid \pi)\, p_t(\pi) = \sum_{\pi} p(u_G \mid \pi)\, p_{t-1}(\pi \mid u_G) = \sum_{\pi} \frac{p(u_G \mid \pi)^2\, p_{t-1}(\pi)}{p_{t-1}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^3\, p_{t-2}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)} = \sum_{\pi} \frac{p(u_G \mid \pi)^4\, p_{t-3}(\pi)}{p_{t-1}(u_G)\, p_{t-2}(u_G)\, p_{t-3}(u_G)} \cdots \tag{18} \]

With this in mind, we can rewrite the left hand side product in (17) as follows:

\[ \frac{p_1(u_G)}{p(u_G \mid \pi^*)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^2\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^3\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)\, p_2(u_G)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^4\, p_1(\pi)}{p(u_G \mid \pi^*)\, p_1(u_G)\, p_2(u_G)\, p_3(u_G)} \cdots \tag{19} \]

Note that, given (18), the numerator in each factor of (19) cancels with the denominator in the subsequent factor, leaving only $p(u_G \mid \pi^*)$ in that denominator. The expression can thus be rewritten as

\[ \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{1}{p(u_G \mid \pi^*)} \cdot \frac{\sum_{\pi} p(u_G \mid \pi)^4\, p_1(\pi)}{p(u_G \mid \pi^*)} \cdots = \sum_{\pi} \frac{p(u_G \mid \pi)^{\infty}}{p(u_G \mid \pi^*)^{\infty}}\, p_1(\pi). \tag{20} \]

The objective is then to show that the above equals $p_1(\pi^*)$.
It proceeds directly from the definition of $\pi^*$ that, for all $\pi$ other than $\pi^*$,

\[ \frac{p(u_G \mid \pi)}{p(u_G \mid \pi^*)} < 1 \tag{21} \]

Thus, all but one of the terms in the sum above approach zero, and the remaining term equals $p_1(\pi^*)$. Thus,

\[ \sum_{\pi} \frac{p(u_G \mid \pi)^{\infty}}{p(u_G \mid \pi^*)^{\infty}}\, p_1(\pi) = p_1(\pi^*) \tag{22} \]

3 Simulations

3.1 Binary choice

We begin with a simulation of a simple incentive choice situation. Here, an animal faces two levers. Pressing the left lever reliably yields a preferred food (r = 2), the right a less preferred food (r = 1). Representing these contingencies in a network structured as in Fig. 1 (left) and employing the iterative algorithm described in section 2.2 yields the results in Figure 2A. Shown here are the posterior probabilities for the policies press left and press right, along with the marginal value of p(u = 1) under these posteriors (labeled EV for expected value). The dashed horizontal line indicates the expected value for the optimal plan, to which the model obviously converges.

A key empirical assay for purposive behavior involves outcome devaluation. Here, actions yielding a previously valued outcome are abandoned after the incentive value of the outcome is reduced, for example by pairing with an aversive event (e.g., [4]). To simulate this within the binary choice scenario just described, we reduced to zero the reward value of the food yielded by the left lever (fL), by making the appropriate change to p(u|fL). This yielded a reversal in lever choice (Fig. 2B).

Another signature of purposive actions is that they are abandoned when their causal connection with rewarding outcomes is removed (contingency degradation, see [4]). We simulated this by starting with the model from Fig. 2A and changing conditional probabilities at s for t=2 to reflect a decoupling of the left action from the fL outcome. The resulting behavior is shown in Fig. 2C.

Fig 2. Simulation results, binary choice.
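The single-step case of the iterative procedure from section 2.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single u node, so that p(u = 1 | policy) depends only on the food delivered by the chosen lever, and it replaces belief propagation with a direct application of Bayes' rule, which is adequate only for a network this small. The function names `reward_to_prob` and `iterate_policy`, the uniform initial prior, and the default iteration count are hypothetical choices made for the example.

```python
def reward_to_prob(rewards):
    """Equation 1: map scalar rewards to p(u = 1 | s) in [0, 1]."""
    r_max = max(abs(r) for r in rewards)  # assumes at least one nonzero reward
    return [0.5 * (r / r_max + 1.0) for r in rewards]


def iterate_policy(p_u_given_pi, n_iter=50):
    """Iterate steps 2-4: condition on u = 1, then reuse the posterior as prior.

    p_u_given_pi[k] is p(u = 1 | policy k). Returns the list of policy
    distributions, starting with the initial prior.
    """
    n = len(p_u_given_pi)
    prior = [1.0 / n] * n  # step 1: a non-deterministic (here uniform) prior
    history = [prior]
    for _ in range(n_iter):
        # Step 2: posterior over policies given u = 1, via Bayes' rule.
        joint = [lik * p for lik, p in zip(p_u_given_pi, prior)]
        evidence = sum(joint)  # p(u = 1) under the current prior
        posterior = [j / evidence for j in joint]
        # Step 3: the posterior becomes the prior for the next iteration.
        prior = posterior
        history.append(prior)
    return history


# Binary choice: press left yields r = 2, press right yields r = 1.
p_u = reward_to_prob([2.0, 1.0])
history = iterate_policy(p_u)
print("p(u = 1 | policy):", p_u)
print("final policy distribution:", history[-1])
```

With rewards of 2 and 1, Equation 1 gives p(u = 1 | press left) = 1.0 and p(u = 1 | press right) = 0.75, so the probability of press left rises on every iteration and approaches one, in line with the monotonicity and convergence arguments. Passing `reward_to_prob([0.0, 1.0])` instead mimics the devaluation manipulation and reverses the preference.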
3.2 Stochastic outcomes

A critical aspect of the present modeling paradigm is that it yields reward-maximizing choices in stochastic domains, a property that distinguishes it from some other recent approaches using graphical models to do planning (e.g., [9]). To illustrate, we used the architecture in Figure 1 (left) to simulate a choice between two fair coins. A ‘left’ coin yields $1 for heads, $0 for tails; a ‘right’ coin $2 for heads but for tails a $3 loss. As illustrated in Fig. 2D, the model maximizes expected value by opting for the left coin.

Fig 3. Simulation results, two-step sequential choice.

3.3 Sequential decision

Here, we adopt the two-step T-maze scenario used by [24] (Fig. 3A). Representing the task contingencies in a graphical model based on the template from Fig 1 (right), and using the reward values indicated in Fig. 3A, yields the choice behavior shown in Figure 3B. Following [24], a shift in motivational state from hunger to thirst can be represented in the graphical model by changing the reward function (R(cheese) = 2, R(X) = 0, R(water) = 4, R(carrots) = 1). Imposing this change at the level of the u variables yields the choice behavior shown in Fig. 3C.

The model can also be used to simulate effort-based decision. Starting with the scenario in Fig. 2A, we simulated the insertion of an effort-demanding scalable barrier at S2 (R(S2) = -2) by making appropriate changes to p(u|s). The resulting behavior is shown in Fig. 3D.

A famous empirical demonstration of purposive control involves detour behavior. Using a maze like the one shown in Fig. 4A, with a food reward placed at s5, Tolman [2] found that rats reacted to a barrier at location A by taking the upper route, but to a barrier at B by taking the longer lower route. We simulated this experiment by representing the corresponding transition and reward functions (see footnote 3) in a graphical model of the form shown in Fig.
1 (right), representing the insertion of barriers by appropriate changes to the transition function. The resulting choice behavior at the critical juncture s2 is shown in Fig. 4.

Fig 4. Simulation results, detour behavior. B: No barrier. C: Barrier at A. D: Barrier at B.

Another classic empirical demonstration involves latent learning. Blodgett [25] allowed rats to explore the maze shown in Fig. 5. Later insertion of a food reward at s13 was followed immediately by dramatic reductions in the running time, reflecting a reduction in entries into blind alleys. We simulated this effect in a model based on the template in Fig. 1 (right), representing the maze layout via an appropriate transition function. In the absence of a reward at s13, random choices occurred at each intersection. However, setting R(s13) = 1 resulted in the set of choices indicated by the heavier arrows in Fig. 5.

Fig 5. Latent learning.

4 Relation to previous work

Initial proposals for how to solve decision problems through probabilistic inference in graphical models, including the idea of encoding reward as the posterior probability of a random utility variable, were put forth by Cooper [23]. Related ideas were presented by Shachter and Peot [12], including the use of nodes that integrate information from multiple utility nodes. More recently, Attias [11] and Verma and Rao [9] have used graphical models to solve shortest-path problems, leveraging probabilistic representations of rewards, though not in a way that guaranteed convergence on optimal (reward maximizing) plans. More closely related to the present research is work by Toussaint and Storkey [10], employing the EM algorithm. The iterative approach we have introduced here has a certain resemblance to the EM procedure, which becomes evident if one views the policy variables in our models as parameters on the mapping from states to actions.
It seems possible that there may be a formal equivalence between the algorithm we have proposed and the one reported by [10]. As a cognitive and neuroscientific proposal, the present work bears a close relation to recent work by Hasselmo [6], addressing the prefrontal computations underlying goal-directed action selection (see also [7]). The present efforts are tied more closely to normative principles of decision-making, whereas the work in [6] is tied more closely to the details of neural circuitry. In this respect, the two approaches may prove complementary, and it will be interesting to further consider their interrelations.

Footnote 3: In this simulation and the next, the set of states associated with each state node was limited to the set of reachable states for the relevant time-step, assuming an initial state of s1.

Acknowledgments

Thanks to Andrew Ledvina, David Blei, Yael Niv, Nathaniel Daw, and Francisco Pereira for useful comments.

References

[1] Hull, C.L., Principles of Behavior. 1943, New York: Appleton-Century. [2] Tolman, E.C., Purposive Behavior in Animals and Men. 1932, New York: Century. [3] Dickinson, A., Actions and habits: the development of behavioral autonomy. Philosophical Transactions of the Royal Society (London), Series B, 1985. 308: p. 67-78. [4] Balleine, B.W. and A. Dickinson, Goal-directed instrumental action: contingency and incentive learning and their cortical substrates. Neuropharmacology, 1998. 37: p. 407-419. [5] Daw, N.D., Y. Niv, and P. Dayan, Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience, 2005. 8: p. 1704-1711. [6] Hasselmo, M.E., A model of prefrontal cortical mechanisms for goal-directed behavior. Journal of Cognitive Neuroscience, 2005. 17: p. 1115-1129. [7] Schmajuk, N.A. and A.D. Thieme, Purposive behavior and cognitive mapping. A neural network model. Biological Cybernetics, 1992. 67: p. 165-174. [8] Tatman, J.A. and R.D.
Shachter, Dynamic programming and influence diagrams. IEEE Transactions on Systems, Man and Cybernetics, 1990. 20: p. 365-379. [9] Verma, D. and R.P.N. Rao. Planning and acting in uncertain environments using probabilistic inference. in IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006. [10] Toussaint, M. and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. in Proceedings of the 23rd International Conference on Machine Learning. 2006. Pittsburgh, PA. [11] Attias, H. Planning by probabilistic inference. in Proceedings of the 9th Int. Workshop on Artificial Intelligence and Statistics. 2003. [12] Shachter, R.D. and M.A. Peot. Decision making using probabilistic inference methods. in Uncertainty in artificial intelligence: Proceedings of the Eighth Conference (1992). 1992. Stanford University: M. Kaufmann. [13] Chater, N., J.B. Tenenbaum, and A. Yuille, Probabilistic models of cognition: conceptual foundations. Trends in Cognitive Sciences, 2006. 10(7): p. 287-291. [14] Doya, K., et al., eds. The Bayesian Brain: Probabilistic Approaches to Neural Coding. 2006, MIT Press: Cambridge, MA. [15] Miller, E.K. and J.D. Cohen, An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 2001. 24: p. 167-202. [16] Asaad, W.F., G. Rainer, and E.K. Miller, Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology, 2000. 84: p. 451-459. [17] Rolls, E.T., The functions of the orbitofrontal cortex. Brain and Cognition, 2004. 55: p. 11-29. [18] Padoa-Schioppa, C. and J.A. Assad, Neurons in the orbitofrontal cortex encode economic value. Nature, 2006. 441: p. 223-226. [19] Gopnik, A., et al., A theory of causal learning in children: causal maps and Bayes nets. Psychological Review, 2004. 111: p. 1-31. [20] Hamilton, A.F.d.C. and S.T. Grafton, Action outcomes are represented in human inferior frontoparietal cortex. Cerebral Cortex, 2008. 18: p.
1160-1168. [21] Johnson, A., M.A.A. van der Meer, and D.A. Redish, Integrating hippocampus and striatum in decision-making. Current Opinion in Neurobiology, 2008. 17: p. 692-697. [22] Jensen, F.V., Bayesian Networks and Decision Graphs. 2001, New York: Springer Verlag. [23] Cooper, G.F. A method for using belief networks as influence diagrams. in Fourth Workshop on Uncertainty in Artificial Intelligence. 1988. University of Minnesota, Minneapolis. [24] Niv, Y., D. Joel, and P. Dayan, A normative perspective on motivation. Trends in Cognitive Sciences, 2006. 10: p. 375-381. [25] Blodgett, H.C., The effect of the introduction of reward upon the maze performance of rats. University of California Publications in Psychology, 1929. 4: p. 113-134.
5 0.54471499 29 nips-2008-Automatic online tuning for fast Gaussian summation
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches.
6 0.50797862 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
7 0.37759906 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
8 0.37599298 131 nips-2008-MDPs with Non-Deterministic Policies
9 0.3741115 181 nips-2008-Policy Search for Motor Primitives in Robotics
10 0.3717716 195 nips-2008-Regularized Policy Iteration
11 0.35598576 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
12 0.34247059 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
13 0.3195349 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
14 0.30766633 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
15 0.3064687 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.30250928 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
17 0.30221692 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
18 0.30219507 231 nips-2008-Temporal Dynamics of Cognitive Control
19 0.29913303 223 nips-2008-Structure Learning in Human Sequential Decision-Making
20 0.29887292 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach