nips nips2005 nips2005-144 knowledge-graph by maker-knowledge-mining

144 nips-2005-Off-policy Learning with Options and Recognizers


Source: pdf

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. [sent-3, score-1.223]

2 We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. [sent-4, score-0.34]

3 This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. [sent-5, score-0.337]

4 Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. [sent-6, score-0.643]

5 Off-policy learning is learning about one way of behaving while actually behaving in another way. [sent-7, score-0.175]

6 For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e. [sent-8, score-1.022]

7 Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. [sent-11, score-0.357]

8 For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. [sent-12, score-0.215]

9 For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. [sent-13, score-0.306]

10 Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). [sent-14, score-0.258]

11 Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. [sent-15, score-0.168]

12 Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). [sent-18, score-0.457]

13 They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. [sent-20, score-0.707]

14 They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). [sent-21, score-0.208]

15 First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. [sent-23, score-0.758]

16 The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. [sent-26, score-0.109]

17 Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. [sent-29, score-1.472]

18 For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. [sent-30, score-0.556]

19 The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. [sent-31, score-0.406]

20 The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. [sent-32, score-0.154]

21 In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. [sent-33, score-0.527]

22 Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. [sent-34, score-1.359]

23 The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. [sent-35, score-0.705]

24 We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. [sent-36, score-0.624]

25 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. [sent-37, score-0.265]

26 Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i. [sent-38, score-0.296]

27 according to probability density b : [0, 1] → ℜ+ (the behavior density). [sent-41, score-0.196]

28 For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. [sent-42, score-0.192]

29 For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . [sent-43, score-0.28]

30 The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0. [sent-47, score-0.441]

31 The most straightforward way is to be equally interested in all actions within the designated region. [sent-51, score-0.251]

32 One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5. [sent-52, score-0.383]

33 5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer . [sent-56, score-1.986]

34 9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. [sent-59, score-1.273]

35 The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . [sent-63, score-0.165]

36 n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0. [sent-64, score-0.213]

37 Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). [sent-67, score-0.179]

38 One professes to be interested in actions to the extent that they are both selected by b and within the designated region. [sent-71, score-0.315]

39 This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). [sent-72, score-0.518]

40 To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . [sent-74, score-0.433]

41 The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0. [sent-75, score-0.473]

42 The target policy is defined in general by c(a)b(a) c(a)b(a) = . [sent-78, score-0.495]

43 (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. [sent-79, score-0.217]

44 The constant µ can easily be estimated as the fraction of recognized actions in the sample. [sent-82, score-0.253]

45 Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. [sent-85, score-0.298]

46 For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. [sent-86, score-0.243]

47 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. [sent-88, score-0.54]

48 Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i. [sent-94, score-0.772]

49 Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . [sent-97, score-1.546]

50 Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. [sent-98, score-1.335]

51 We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . [sent-102, score-0.529]

52 Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . [sent-103, score-0.705]

53 At each time step t, the agent receives a state st and chooses an action at . [sent-106, score-0.239]

54 We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. [sent-107, score-0.489]

55 The behavior policy is used to generate a sequence of experience (observations, actions and rewards). [sent-108, score-0.737]

56 An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. [sent-114, score-1.243]

57 In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. [sent-115, score-0.516]

58 Here, we will instead define options implicitly, using the notion of a recognizer. [sent-116, score-0.145]

59 A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. [sent-117, score-0.929]

60 In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. [sent-119, score-0.175]

61 Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. [sent-120, score-0.348]

62 A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i. [sent-121, score-1.451]

63 , the probability that an action will be accepted at s when behavior is generated according to b. [sent-123, score-0.241]

64 The policy π is only defined at states for which µ(s) > 0. [sent-124, score-0.42]

65 The numerator gives the probability that action a is produced by the behavior and recognized in s. [sent-125, score-0.294]

66 Note that if the recognizer accepts all state-action pairs, i. [sent-126, score-0.417]

67 Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). [sent-129, score-1.415]

68 An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. [sent-130, score-0.428]

69 In other words, the option must terminate if no actions are recognized at a given state. [sent-132, score-0.491]

70 We will focus on computing the reward model of an option o, which represents the expected total return. [sent-134, score-0.311]

71 The expected values of different features at the end of the option can be estimated similarly. [sent-135, score-0.283]

72 + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. [sent-139, score-0.782]

73 Suppose that an option’s policy π was used to generate behavior. [sent-144, score-0.42]

74 In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. [sent-145, score-0.338]

75 Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . [sent-147, score-0.304]

76 This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. [sent-149, score-0.279]

77 n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). [sent-151, score-0.327]

78 In our case, however, trajectories are generated according to the behavior policy b. [sent-152, score-0.57]

79 The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. [sent-153, score-0.376]

80 Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. [sent-154, score-0.2]

81 The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). [sent-156, score-0.278]

82 (1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 ). [sent-166, score-0.278]

83 ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. [sent-171, score-0.202]

84 When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. [sent-172, score-1.237]

85 We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. [sent-173, score-0.434]

86 No corrections are required for the experience prior to this point. [sent-174, score-0.183]

87 This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). [sent-175, score-0.226]

88 Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi . [sent-178, score-0.345]

89 With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. [sent-185, score-0.216]

90 This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). [sent-187, score-0.13]

91 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. [sent-189, score-1.092]

92 This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. [sent-190, score-0.638]

93 Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. [sent-191, score-0.781]

94 Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. [sent-195, score-0.382]

95 We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. [sent-196, score-0.662]

96 In this case, we will define recognizers which behave consistently in each partition, i. [sent-197, score-0.239]

97 This means that an action is either recognized or not recognized in all states of the partition. [sent-200, score-0.245]

98 Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. [sent-202, score-0.112]

99 At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i. [sent-206, score-0.42]

100 Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. [sent-209, score-0.413]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('policy', 0.42), ('recognizer', 0.382), ('rt', 0.31), ('option', 0.258), ('eb', 0.239), ('recognizers', 0.239), ('precup', 0.208), ('sutton', 0.192), ('actions', 0.156), ('corrections', 0.148), ('bi', 0.145), ('yt', 0.139), ('behavior', 0.126), ('options', 0.119), ('ai', 0.116), ('dasgupta', 0.097), ('importance', 0.097), ('action', 0.091), ('sampling', 0.082), ('updates', 0.079), ('recognized', 0.077), ('target', 0.075), ('policies', 0.07), ('koop', 0.069), ('singh', 0.066), ('st', 0.058), ('alberta', 0.053), ('designated', 0.052), ('pss', 0.052), ('rafols', 0.052), ('variance', 0.051), ('state', 0.049), ('termination', 0.049), ('behaving', 0.049), ('zi', 0.048), ('density', 0.046), ('abstraction', 0.044), ('truncated', 0.043), ('edmonton', 0.042), ('predictions', 0.042), ('problematic', 0.041), ('baird', 0.04), ('professes', 0.04), ('psaiisi', 0.04), ('restarts', 0.04), ('tadic', 0.04), ('kt', 0.04), ('specify', 0.039), ('outcome', 0.035), ('experience', 0.035), ('accepts', 0.035), ('eo', 0.035), ('tanner', 0.035), ('ab', 0.034), ('canada', 0.033), ('stationary', 0.033), ('ways', 0.033), ('tsitsiklis', 0.032), ('mb', 0.032), ('rss', 0.032), ('return', 0.032), ('initiated', 0.03), ('convergence', 0.029), ('recognition', 0.028), ('reward', 0.028), ('learn', 0.028), ('roy', 0.028), ('initiation', 0.028), ('approximator', 0.028), ('forward', 0.027), ('variances', 0.027), ('notion', 0.026), ('learning', 0.026), ('starting', 0.026), ('expected', 0.025), ('way', 0.025), ('extent', 0.025), ('according', 0.024), ('picking', 0.024), ('theorem', 0.024), ('selected', 0.024), ('panel', 0.023), ('formulations', 0.021), ('step', 0.021), ('stream', 0.021), ('partition', 0.021), ('outcomes', 0.02), ('proof', 0.02), ('fraction', 0.02), ('suppose', 0.02), ('temporally', 0.02), ('agent', 0.02), ('selecting', 0.019), ('approximation', 0.019), ('specifying', 0.019), ('backward', 0.019), ('empirical', 0.018), ('proven', 0.018), ('interested', 0.018), ('pose', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

2 0.33518034 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

Author: Eddie Rafols, Anna Koop, Richard S. Sutton

Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.

3 0.27866536 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

4 0.23121838 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

5 0.20540768 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

6 0.18512231 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

7 0.15764134 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

8 0.15745306 153 nips-2005-Policy-Gradient Methods for Planning

9 0.1069122 45 nips-2005-Conditional Visual Tracking in Kernel Space

10 0.10642826 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

11 0.10467299 53 nips-2005-Cyclic Equilibria in Markov Games

12 0.077639401 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

13 0.077324942 36 nips-2005-Bayesian models of human action understanding

14 0.070412807 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

15 0.069752745 148 nips-2005-Online Discovery and Learning of Predictive State Representations

16 0.067330927 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

17 0.064535365 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

18 0.051267043 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

19 0.050809894 41 nips-2005-Coarse sample complexity bounds for active learning

20 0.050410621 111 nips-2005-Learning Influence among Interacting Markov Chains


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, -0.015), (2, 0.47), (3, -0.126), (4, -0.026), (5, 0.021), (6, 0.081), (7, 0.018), (8, 0.004), (9, -0.065), (10, -0.012), (11, 0.073), (12, 0.021), (13, 0.075), (14, -0.043), (15, 0.03), (16, -0.02), (17, -0.012), (18, -0.062), (19, -0.061), (20, -0.095), (21, 0.033), (22, -0.127), (23, -0.021), (24, 0.163), (25, 0.133), (26, 0.079), (27, 0.0), (28, -0.051), (29, -0.052), (30, 0.054), (31, 0.164), (32, 0.026), (33, -0.033), (34, -0.048), (35, -0.058), (36, 0.124), (37, 0.023), (38, 0.036), (39, -0.1), (40, -0.105), (41, 0.021), (42, -0.027), (43, -0.069), (44, 0.054), (45, -0.024), (46, -0.094), (47, -0.005), (48, 0.041), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96580291 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

2 0.6955235 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

Author: Jin Yu, Douglas Aberdeen, Nicol N. Schraudolph

Abstract: Reinforcement learning by direct policy gradient estimation is attractive in theory but in practice leads to notoriously ill-behaved optimization problems. We improve its robustness and speed of convergence with stochastic meta-descent, a gain vector adaptation method that employs fast Hessian-vector products. In our experiments the resulting algorithms outperform previously employed online stochastic, offline conjugate, and natural policy gradient methods. 1

3 0.69176263 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

Author: Eddie Rafols, Anna Koop, Richard S. Sutton

Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.

4 0.68363971 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

5 0.64520139 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

6 0.62809765 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

7 0.62652183 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

8 0.55799633 153 nips-2005-Policy-Gradient Methods for Planning

9 0.54838425 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

10 0.40286595 45 nips-2005-Conditional Visual Tracking in Kernel Space

11 0.34631532 36 nips-2005-Bayesian models of human action understanding

12 0.33588493 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

13 0.29702216 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

14 0.27454442 156 nips-2005-Prediction and Change Detection

15 0.26330817 48 nips-2005-Context as Filtering

16 0.25954139 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

17 0.25940832 53 nips-2005-Cyclic Equilibria in Markov Games

18 0.25938284 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

19 0.2582024 148 nips-2005-Online Discovery and Learning of Predictive State Representations

20 0.2380552 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.04), (10, 0.069), (11, 0.017), (26, 0.089), (27, 0.056), (31, 0.095), (34, 0.085), (55, 0.029), (69, 0.046), (73, 0.047), (74, 0.212), (77, 0.011), (88, 0.073), (91, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78738385 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

2 0.68582809 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

Author: Eddie Rafols, Anna Koop, Richard S. Sutton

Abstract: We present a generalization of temporal-difference networks to include temporally abstract options on the links of the question network. Temporal-difference (TD) networks have been proposed as a way of representing and learning a wide variety of predictions about the interaction between an agent and its environment. These predictions are compositional in that their targets are defined in terms of other predictions, and subjunctive in that that they are about what would happen if an action or sequence of actions were taken. In conventional TD networks, the inter-related predictions are at successive time steps and contingent on a single action; here we generalize them to accommodate extended time intervals and contingency on whole ways of behaving. Our generalization is based on the options framework for temporal abstraction. The primary contribution of this paper is to introduce a new algorithm for intra-option learning in TD networks with function approximation and eligibility traces. We present empirical examples of our algorithm’s effectiveness and of the greater representational expressiveness of temporallyabstract TD networks. The primary distinguishing feature of temporal-difference (TD) networks (Sutton & Tanner, 2005) is that they permit a general compositional specification of the goals of learning. The goals of learning are thought of as predictive questions being asked by the agent in the learning problem, such as “What will I see if I step forward and look right?” or “If I open the fridge, will I see a bottle of beer?” Seeing a bottle of beer is of course a complicated perceptual act. It might be thought of as obtaining a set of predictions about what would happen if certain reaching and grasping actions were taken, about what would happen if the bottle were opened and turned upside down, and of what the bottle would look like if viewed from various angles. To predict seeing a bottle of beer is thus to make a prediction about a set of other predictions. The target for the overall prediction is a composition in the mathematical sense of the first prediction with each of the other predictions. TD networks are the first framework for representing the goals of predictive learning in a compositional, machine-accessible form. Each node of a TD network represents an individual question—something to be predicted—and has associated with it a value representing an answer to the question—a prediction of that something. The questions are represented by a set of directed links between nodes. If node 1 is linked to node 2, then node 1 rep- resents a question incorporating node 2’s question; its value is a prediction about node 2’s prediction. Higher-level predictions can be composed in several ways from lower ones, producing a powerful, structured representation language for the targets of learning. The compositional structure is not just in a human designer’s head; it is expressed in the links and thus is accessible to the agent and its learning algorithm. The network of these links is referred to as the question network. An entirely separate set of directed links between the nodes is used to compute the values (predictions, answers) associated with each node. These links collectively are referred to as the answer network. The computation in the answer network is compositional in a conventional way—node values are computed from other node values. The essential insight of TD networks is that the notion of compositionality should apply to questions as well as to answers. A secondary distinguishing feature of TD networks is that the predictions (node values) at each moment in time can be used as a representation of the state of the world at that time. In this way they are an instance of the idea of predictive state representations (PSRs) introduced by Littman, Sutton and Singh (2002), Jaeger (2000), and Rivest and Schapire (1987). Representing a state by its predictions is a potentially powerful strategy for state abstraction (Rafols et al., 2005). We note that the questions used in all previous work with PSRs are defined in terms of concrete actions and observations, not other predictions. They are not compositional in the sense that TD-network questions are. The questions we have discussed so far are subjunctive, meaning that they are conditional on a certain way of behaving. We predict what we would see if we were to step forward and look right, or if we were to open the fridge. The questions in conventional TD networks are subjunctive, but they are conditional only on primitive actions or open-loop sequences of primitive actions (as are conventional PSRs). It is natural to generalize this, as we have in the informal examples above, to questions that are conditional on closed-loop temporally extended ways of behaving. For example, opening the fridge is a complex, high-level action. The arm must be lifted to the door, the hand shaped for grasping the handle, etc. To ask questions like “if I were to go to the coffee room, would I see John?” would require substantial temporal abstraction in addition to state abstraction. The options framework (Sutton, Precup & Singh, 1999) is a straightforward way of talking about temporally extended ways of behaving and about predictions of their outcomes. In this paper we extend the options framework so that it can be applied to TD networks. Significant extensions of the original options framework are needed. Novel features of our option-extended TD networks are that they 1) predict components of option outcomes rather than full outcome probability distributions, 2) learn according to the first intra-option method to use eligibility traces (see Sutton & Barto, 1998), and 3) include the possibility of options whose ‘policies’ are indifferent to which of several actions are selected. 1 The options framework In this section we present the essential elements of the options framework (Sutton, Precup & Singh, 1999) that we will need for our extension of TD networks. In this framework, an agent and an environment interact at discrete time steps t = 1, 2, 3.... In each state st ∈ S, the agent selects an action at ∈ A, determining the next state st+1 .1 An action is a way of behaving for one time step; the options framework lets us talk about temporally extended ways of behaving. An individual option consists of three parts. The first is the initiation set, I ⊂ S, the subset of states in which the option can be started. The second component of an option is its policy, π : S × A ⇒ [0, 1], specifying how the agent behaves when 1 Although the options framework includes rewards, we omit them here because we are concerned only with prediction, not control. following the option. Finally, a termination function, β : S × A ⇒ [0, 1], specifies how the option ends: β(s) denotes the probability of terminating when in state s. The option is thus completely and formally defined by the 3-tuple (I, π, β). 2 Conventional TD networks In this section we briefly present the details of the structure and the learning algorithm comprising TD networks as introduced by Sutton and Tanner (2005). TD networks address a prediction problem in which the agent may not have direct access to the state of the environment. Instead, at each time step the agent receives an observation ot ∈ O dependent on the state. The experience stream thus consists of a sequence of alternating actions and observations, o1 , a1 , o2 , a2 , o3 · · ·. The TD network consists of a set of nodes, each representing a single scalar prediction, interlinked by the question and answer networks as suggested previously. For a network 1 n of n nodes, the vector of all predictions at time step t is denoted yt = (yt , . . . , yt )T . The predictions are estimates of the expected value of some scalar quantity, typically of a bit, in which case they can be interpreted as estimates of probabilities. The predictions are updated at each time step according to a vector-valued function u with modifiable parameter W, which is often taken to be of a linear form: yt = u(yt−1 , at−1 , ot , Wt ) = σ(Wt xt ), (1) where xt ∈ m is an m-vector of features created from (yt−1 , at−1 , ot ), Wt is an n × m matrix (whose elements are sometimes referred to as weights), and σ is the n-vector 1 form of either the identity function or the S-shaped logistic function σ(s) = 1+e−s . The feature vector is an arbitrary vector-valued function of yt−1 , at−1 , and ot . For example, in the simplest case the feature vector is a unit basis vector with the location of the one communicating the current state. In a partially observable environment, the feature vector may be a combination of the agent’s action, observations, and predictions from the previous time step. The overall update u defines the answer network. The question network consists of a set of target functions, z i : O × n → , and condition i y functions, ci : A× n → [0, 1]n . We define zt = z i (ot+1 , ˜t+1 ) as the target for prediction i 2 i i yt . Similarly, we define ct = c (at , yt ) as the condition at time t. The learning algorithm ij for each component wt of Wt can then be written ij ij i i wt+1 = wt + α zt − yt ci t i ∂yt , (2) ij ∂wt where α is a positive step-size parameter. Note that the targets here are functions of the observation and predictions exactly one time step later, and that the conditions are functions of a single primitive action. This is what makes this algorithm suitable only for learning about one-step TD relationships. By chaining together multiple nodes, Sutton and Tanner (2005) used it to predict k steps ahead, for various particular values of k, and to predict the outcome of specific action sequences (as in PSRs, e.g., Littman et al., 2002; Singh et al., 2004). Now we consider the extension to temporally abstract actions. 3 Option-extended TD networks In this section we present our intra-option learning algorithm for TD networks with options and eligibility traces. As suggested earlier, each node’s outgoing link in the question 2 The quantity ˜ is almost the same as y, and we encourage the reader to think of them as identical y here. The difference is that ˜ is calculated by weights that are one step out of date as compared to y, y i.e., ˜t = u(yt−1 , at−1 , ot , Wt−1 ) (cf. equation 1). y network will now correspond to an option applying over possibly many steps. The policy of the ith node’s option corresponds to the condition function ci , which we think of as a recognizer for the option. It inspects each action taken to assess whether the option is being followed: ci = 1 if the agent is acting consistently with the option policy and ci = 0 othert t wise (intermediate values are also possible). When an agent ceases to act consistently with the option policy, we say that the option has diverged. The possibility of recognizing more than one action as consistent with the option is a significant generalization of the original idea of options. If no actions are recognized as acceptable in a state, then the option cannot be followed and thus cannot be initiated. Here we take the set of states with at least one recognized action to be the initiation set of the option. The option-termination function β generalizes naturally to TD networks. Each node i is i given a corresponding termination function, β i : O× n → [0, 1], where βt = β i (ot+1 , yt ) i is the probability of terminating at time t.3 βt = 1 indicates that the option has terminated i at time t; βt = 0 indicates that it has not, and intermediate values of β correspond to soft i or stochastic termination conditions. If an option terminates, then zt acts as the target, but if the option is ongoing without termination, then the node’s own next value, yt+1 , should ˜i be the target. The termination function specifies which of the two targets (or mixture of the two targets) is used to produce a form of TD error for each node i: i i i i i i δt = βt zt + (1 − βt )˜t+1 − yt . y (3) Our option-extended algorithm incorporates eligibility traces (see Sutton & Barto, 1998) as short-term memory variables organized in an n × m matrix E, paralleling the weight matrix. The traces are a record of the effect that each weight could have had on each node’s prediction during the time the agent has been acting consistently with the node’s option. The components eij of the eligibility matrix are updated by i eij = ci λeij (1 − βt ) + t t t−1 i ∂yt ij ∂wt , (4) where 0 ≤ λ ≤ 1 is the trace-decay parameter familiar from the TD(λ) learning algorithm. Because of the ci factor, all of a node’s traces will be immediately reset to zero whenever t the agent deviates from the node’s option’s policy. If the agent follows the policy and the option does not terminate, then the trace decays by λ and increments by the gradient in the way typical of eligibility traces. If the policy is followed and the option does terminate, then the trace will be reset to zero on the immediately following time step, and a new trace will start building. Finally, our algorithm updates the weights on each time step by ij ij i wt+1 = wt + α δt eij . t 4 (5) Fully observable experiment This experiment was designed to test the correctness of the algorithm in a simple gridworld where the environmental state is observable. We applied an options-extended TD network to the problem of learning to predict observations from interaction with the gridworld environment shown on the left in Figure 1. Empty squares indicate spaces where the agent can move freely, and colored squares (shown shaded in the figure) indicate walls. The agent is egocentric. At each time step the agent receives from the environment six bits representing the color it is facing (red, green, blue, orange, yellow, or white). In this first experiment we also provided 6 × 6 × 4 = 144 other bits directly indicating the complete state of the environment (square and orientation). 3 The fact that the option depends only on the current predictions, action, and observation means that we are considering only Markov options. Figure 1: The test world (left) and the question network (right) used in the experiments. The triangle in the world indicates the location and orientation of the agent. The walls are labeled R, O, Y, G, and B representing the colors red, orange, yellow, green and blue. Note that the left wall is mostly blue but partly green. The right diagram shows in full the portion of the question network corresponding to the red bit. This structure is repeated, but not shown, for the other four (non-white) colors. L, R, and F are primitive actions, and Forward and Wander are options. There are three possible actions: A ={F, R, L}. Actions were selected according to a fixed stochastic policy independent of the state. The probability of the F, L, and R actions were 0.5, 0.25, and 0.25 respectively. L and R cause the agent to rotate 90 degrees to the left or right. F causes the agent to move ahead one square with probability 1 − p and to stay in the same square with probability p. The probability p is called the slipping probability. If the forward movement would cause the agent to move into a wall, then the agent does not move. In this experiment, we used p = 0, p = 0.1, and p = 0.5. In addition to these primitive actions, we provided two temporally abstract options, Forward and Wander. The Forward option takes the action F in every state and terminates when the agent senses a wall (color) in front of it. The policy of the Wander option is the same as that actually followed by the agent. Wander terminates with probability 1 when a wall is sensed, and spontaneously with probability 0.5 otherwise. We used the question network shown on the right in Figure 1. The predictions of nodes 1, 2, and 3 are estimates of the probability that the red bit would be observed if the corresponding primitive action were taken. Node 4 is a prediction of whether the agent will see the red bit upon termination of the Wander option if it were taken. Node 5 predicts the probability of observing the red bit given that the Forward option is followed until termination. Nodes 6 and 7 represent predictions of the outcome of a primitive action followed by the Forward option. Nodes 8 and 9 take this one step further: they represent predictions of the red bit if the Forward option were followed to termination, then a primitive action were taken, and then the Forward option were followed again to termination. We applied our algorithm to learn the parameter W of the answer network for this question network. The step-size parameter α was 1.0, and the trace-decay parameter λ was 0.9. The initial W0 , E0 , and y0 were all 0. Each run began with the agent in the state indicated in Figure 1 (left). In this experiment σ(·) was the identity function. For each value of p, we ran 50 runs of 20,000 time steps. On each time step, the root-meansquared (RMS) error in each node’s prediction was computed and then averaged over all the nodes. The nodes corresponding to the Wander option were not included in the average because of the difficulty of calculating their correct predictions. This average was then 0.4 Fully Observable 0.4 RMS Error RMS Error p=0 0 0 Partially Observable p = 0.1 5000 p = 0.5 10000 15000 20000 Steps 0 0 100000 200000 Steps 300000 Figure 2: Learning curves in the fully-observable experiment for each slippage probability (left) and in the partially-observable experiment (right). itself averaged over the 50 runs and bins of 1,000 time steps to produce the learning curves shown on the left in Figure 2. For all slippage probabilities, the error in all predictions fell almost to zero. After approximately 12,000 trials, the agent made almost perfect predictions in all cases. Not surprisingly, learning was slower at the higher slippage probabilities. These results show that our augmented TD network is able to make a complete temporally-abstract model of this world. 5 Partially observable experiment In our second experiment, only the six color observation bits were available to the agent. This experiment provides a more challenging test of our algorithm. To model the environment well, the TD network must construct a representation of state from very sparse information. In fact, completely accurate prediction is not possible in this problem with our question network. In this experiment the input vector consisted of three groups of 46 components each, 138 in total. If the action was R, the first 46 components were set to the 40 node values and the six observation bits, and the other components were 0. If the action was L, the next group of 46 components was filled in in the same way, and the first and third groups were zero. If the action was F, the third group was filled. This technique enables the answer network as function approximator to represent a wider class of functions in a linear form than would otherwise be possible. In this experiment, σ(·) was the S-shaped logistic function. The slippage probability was p = 0.1. As our performance measure we used the RMS error, as in the first experiment, except that the predictions for the primitive actions (nodes 1-3) were not included. These predictions can never become completely accurate because the agent can’t tell in detail where it is located in the open space. As before, we averaged RMS error over 50 runs and 1,000 time step bins, to produce the learning curve shown on the right in Figure 2. As before, the RMS error approached zero. Node 5 in Figure 1 holds the prediction of red if the agent were to march forward to the wall ahead of it. Corresponding nodes in the other subnetworks hold the predictions of the other colors upon Forward. To make these predictions accurately, the agent must keep track of which wall it is facing, even if it is many steps away from it. It has to learn a sort of compass that it can keep updated as it turns in the middle of the space. Figure 3 is a demonstration of the compass learned after a representative run of 200,000 time steps. At the end of the run, the agent was driven manually to the state shown in the first row (relative time index t = 1). On steps 1-25 the agent was spun clockwise in place. The third column shows the prediction for node 5 in each portion of the question network. That is, the predictions shown are for each color-observation bit at termination of the Forward option. At t = 1, the agent is facing the orange wall and it predicts that the Forward option would result in seeing the orange bit and none other. Over steps 2-5 we see that the predictions are maintained accurately as the agent spins despite the fact that its observation bits remain the same. Even after spinning for 25 steps the agent knows exactly which way it is facing. While spinning, the agent correctly never predicts seeing the green bit (after Forward), but if it is driven up and turned, as in the last row of the figure, the green bit is accurately predicted. The fourth column shows the prediction for node 8 in each portion of the question network. Recall that these nodes correspond to the sequence Forward, L, Forward. At time t = 1, the agent accurately predicts that Forward will bring it to orange (third column) and also predicts that Forward, L, Forward will bring it to green. The predictions made for node 8 at each subsequent step of the sequence are also correct. These results show that the agent is able to accurately maintain its long term predictions without directly encountering sensory verification. How much larger would the TD network have to be to handle a 100x100 gridworld? The answer is not at all. The same question network applies to any size problem. If the layout of the colored walls remain the same, then even the answer network transfers across worlds of widely varying sizes. In other experiments, training on successively larger problems, we have shown that the same TD network as used here can learn to make all the long-term predictions correctly on a 100x100 version of the 6x6 gridworld used here. t y5 t st y8 t 1 1 O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG O Y R BG 2 3 4 5 25 29 Figure 3: An illustration of part of what the agent learns in the partially observable environment. The second column is a sequence of states with (relative) time index as given by the first column. The sequence was generated by controlling the agent manually. On steps 1-25 the agent was spun clockwise in place, and the trajectory after that is shown by the line in the last state diagram. The third and fourth columns show the values of the nodes corresponding to 5 and 8 in Figure 1, one for each color-observation bit. 6 Conclusion Our experiments show that option-extended TD networks can learn effectively. They can learn facts about their environments that are not representable in conventional TD networks or in any other method for learning models of the world. One concern is that our intra-option learning algorithm is an off-policy learning method incorporating function approximation and bootstrapping (learning from predictions). The combination of these three is known to produce convergence problems for some methods (see Sutton & Barto, 1998), and they may arise here. A sound solution may require modifications to incorporate importance sampling (see Precup, Sutton & Dasgupta, 2001). In this paper we have considered only intra-option eligibility traces—traces extending over the time span within an option but not persisting across options. Tanner and Sutton (2005) have proposed a method for inter-option traces that could perhaps be combined with our intra-option traces. The primary contribution of this paper is the introduction of a new learning algorithm for TD networks that incorporates options and eligibility traces. Our experiments are small and do little more than exercise the learning algorithm, showing that it does not break immediately. More significant is the greater representational power of option-extended TD networks. Options are a general framework for temporal abstraction, predictive state representations are a promising strategy for state abstraction, and TD networks are able to represent compositional questions. The combination of these three is potentially very powerful and worthy of further study. Acknowledgments The authors gratefully acknowledge the ideas and encouragement they have received in this work from Mark Ring, Brian Tanner, Satinder Singh, Doina Precup, and all the members of the rlai.net group. References Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371-1398. MIT Press. Littman, M., Sutton, R. S., & Singh, S. (2002). Predictive representations of state. In T. G. Dietterich, S. Becker and Z. Ghahramani (eds.), Advances In Neural Information Processing Systems 14, pp. 1555-1561. MIT Press. Precup, D., Sutton, R. S., & Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In C. E. Brodley, A. P. Danyluk (eds.), Proceedings of the Eighteenth International Conference on Machine Learning, pp. 417-424. San Francisco, CA: Morgan Kaufmann. Rafols, E. J., Ring, M., Sutton, R.S., & Tanner, B. (2005). Using predictive representations to improve generalization in reinforcement learning. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence. Rivest, R. L., & Schapire, R. E. (1987). Diversity-based inference of finite automata. In Proceedings of the Twenty Eighth Annual Symposium on Foundations of Computer Science, (pp. 78–87). IEEE Computer Society. Singh, S., James, M. R., & Rudary, M. R. (2004). Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference in Uncertainty in Artificial Intelligence, (pp. 512–519). AUAI Press. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, pp. 181-211. Sutton, R. S., & Tanner, B. (2005). Conference 17. Temporal-difference networks. To appear in Neural Information Processing Systems Tanner, B., Sutton, R. S. (2005) Temporal-difference networks with history. To appear in Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence.

3 0.62830633 88 nips-2005-Gradient Flow Independent Component Analysis in Micropower VLSI

Author: Abdullah Celik, Milutin Stanacevic, Gert Cauwenberghs

Abstract: We present micropower mixed-signal VLSI hardware for real-time blind separation and localization of acoustic sources. Gradient flow representation of the traveling wave signals acquired over a miniature (1cm diameter) array of four microphones yields linearly mixed instantaneous observations of the time-differentiated sources, separated and localized by independent component analysis (ICA). The gradient flow and ICA processors each measure 3mm × 3mm in 0.5 µm CMOS, and consume 54 µW and 180 µW power, respectively, from a 3 V supply at 16 ks/s sampling rate. Experiments demonstrate perceptually clear (12dB) separation and precise localization of two speech sources presented through speakers positioned at 1.5m from the array on a conference room table. Analysis of the multipath residuals shows that they are spectrally diffuse, and void of the direct path.

4 0.59630364 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

5 0.58573192 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

6 0.58254331 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

7 0.57864606 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

8 0.57599688 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

9 0.56387031 46 nips-2005-Consensus Propagation

10 0.56160194 45 nips-2005-Conditional Visual Tracking in Kernel Space

11 0.56111783 36 nips-2005-Bayesian models of human action understanding

12 0.56002939 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

13 0.55897027 23 nips-2005-An Application of Markov Random Fields to Range Sensing

14 0.55893427 184 nips-2005-Structured Prediction via the Extragradient Method

15 0.55810785 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

16 0.55578405 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

17 0.55538684 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

18 0.55304599 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

19 0.55135763 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

20 0.55074126 110 nips-2005-Learning Depth from Single Monocular Images