nips nips2005 nips2005-186 knowledge-graph by maker-knowledge-mining

186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration


Source: pdf

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). For any x ∈ S and u ∈ Ux , y∈S pxy (u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. u∈Ux The optimal cost-to-go function J ∗ ∈ |S| is defined by J ∗ (x) = minµ Jµ (x), for all x ∈ S. A policy µ∗ is said to be optimal if Jµ∗ = J ∗ . It is well-known that an optimal policy exists. Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . This sequence converges to J ∗ for any initialization of J0 . 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . Hence, for any r ∈ K , k, and x ∈ Sk , (Φr)(x) = rk . In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J ∗ . The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . The weighted Euclidean 1/2 |S| 2 norm: J 2,π = . Here, π ∈ + is a vector of weights that assign x∈S π(x)J (x) relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α2 , . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). It turns out that minr J ∗ − Φr ∞ = O(1). To see why, note that for any µ, Jµ = (I − αPµ )−1 gµ = 1 λ µ + hµ , 1−α where λµ (x) is the expected average cost if the process starts in state x and is controlled by policy µ, τ −1 1 t λµ = lim Pµ gµ , τ →∞ τ t=0 and hµ is the discounted differential cost function hµ = (I − αPµ )−1 (gµ − λµ ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1. Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ r r with π = 1, 4α min J ∗ − Φr ∞ − δ. (1 − α) Jµr − J ∗ ∞ ≥ ˜ 1 − α r∈ K This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exists MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) Jµr − J ∗ ˜ ∞ =Θ 1 min J ∗ − Φr 1 − α r∈ K ∞ . In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying minr∈ K J ∗ − Φr ∞ can grow arbitrarily large as α increases, keeping all else fixed. Theorem 4 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r lim inf (1 − α) (Jµr (x) − J ∗ (x)) ≥ L lim min J ∗ − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. This Theorem is also established through an example [22]. For any µ and x, lim ((1 − α)Jµ (x) − λµ (x)) = lim(1 − α)hµ (x) = 0. α↑1 α↑1 Combined with Theorem 4, this yields the following corollary. Corollary 1 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r ∗ lim inf (λµr (x) − λµ∗ (x)) ≥ L lim min J − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. 3 Using the Invariant Distribution In the previous section, we considered an approximation Φ˜ that solves Ππ T Φ˜ = Φ˜ for r r r some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution πr of Pµr as the weight vector.1 This leads to a circular definition: the weights ˜ ˜ are used in defining r and now we are defining the weights in terms of r. What we are ˜ ˜ really after here is a vector r that satisfies Ππr T Φ˜ = Φ˜. The following theorem captures ˜ r r ˜ the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].) Theorem 5 For any MDP and partition, if Φ˜ = Ππr T Φ˜ and πr has support intersecting r r ˜ ˜ T every partition, (1 − α)πr (Jµr − J ∗ ) ≤ 2α minr∈ K J ∗ − Φr ∞ . ˜ ˜ When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, limα↑1 (1 − α)Jµ − λµ ∞ = 0, as explained in Section 2. In particular, based on this, we have lim(1 − α) Jµ − J ∗ ∞ = |λµ − λ∗ | = λµ − λ∗ = lim π T (Jµ − J ∗ ), α↑1 α↑1 for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5. Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φ˜ = Ππr T Φ˜ and r r ˜ lim inf α↑1 x∈Sk πr (x) > 0 for all k, ˜ lim sup λµr − λµ∗ ∞ ≤ 2 lim min J ∗ − Φr ∞ . ˜ α↑1 α↑1 r∈ K The comparison suggests that solving Φ˜ = Ππr T Φ˜ is strongly preferable to solving r r ˜ Φ˜ = Ππ T Φ˜ with π = 1. r r 1 By an invariant state distribution of a transition matrix P , we mean any probability distribution π such that π T P = π T . In the event that Pµr has multiple invariant distributions, πr denotes an ˜ ˜ arbitrary choice. 4 Exploration If a vector r solves Φ˜ = Ππr T Φ˜ and the support of πr intersects every partition, Theorem ˜ r r ˜ ˜ 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Ππr T Φ˜ = Φ˜ does not have a solution. It is easy to produce examples of this; though r r ˜ no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of πr intersect every partition. ˜ To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ Ux , µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ Ux , and u∈Ux µ(x, u) = 1 for all x ∈ S. Given a scalar > 0 and a function J, the -greedy Boltzmann exploration policy with respect to J is defined by µ(x, u) = e−(Tu J)(x)(|Ux |−1)/ e . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e For any > 0 and r, let µr denote the -greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T J)(x) = u∈Ux e−(Tu J)(x)(|Ux |−1)/ e (Tu J)(x) . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e As approaches 0, -greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim ↓0 µr (x, µr (x)) = 1 and lim ↓1 T J = T J. These are immediate consequences of the following result (see [4] for a proof). Lemma 1 For any n, v ∈ mini vi . n , mini vi + ≥ i e−vi (n−1)/ e vi / i e−vi (n−1)/ e ≥ Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each -greedy Boltzmann exploration policy µr and the support of this distribution is S. Let πr denote this distribution. We consider a vector r that ˜ solves Φ˜ = Ππr T Φ˜. For any > 0, there exists a solution to this equation (this is an r r ˜ immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].) Theorem 6 For any MDP, partition, and > 0, if Φ˜ = Ππr T Φ˜ then (1 − r r ˜ T ∗ ∗ α)(πr ) (Jµr − J ) ≤ 2α minr∈ K J − Φr ∞ + . ˜ ˜ 5 Computation: TD(0) Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φ˜ = Ππ T Φ˜. It requires an abilr r ity to sample a sequence of states x(0) , x(1) , x(2) , . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)(Φr)(y)), for any given x and r. This is typically possible when the action set Ux and the support of px· (u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r( ) according to r( +1) = r( ) + γ φ(x( ) ) (T Φr( ) )(x( ) ) − (Φr( ) )(x( ) ) , where γ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r( ) converges to a vector r that solves Φ˜ = Ππ T Φ˜. ˜ r r The equation Φ˜ = Ππ T Φ˜ may have no solution. Further, the requirement that states r r are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φ˜ = Ππr T Φ˜. The algorithm requires simulation of a trajectory r r ˜ x0 , x1 , x2 , . . . of the MDP, with each action ut ∈ Uxt generated by the -greedy Boltzmann exploration policy with respect to Φr(t) . The sequence of vectors r(t) is generated according to r(t+1) = r(t) + γt φ(xt ) (T Φr(t) )(xt ) − (Φr(t) )(xt ) . Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φ˜ = Ππr T Φ˜. Whether such an algorithm converges and whether there are r r ˜ other algorithms that can effectively solve Φ˜ = Ππr T Φ˜ for broad classes of relevant r r ˜ problems remain open issues. 6 Extensions and Open Issues Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes Φ˜ = Ππr (T Φ˜ − λ1). r r ˜ ˜ ˜ The variables include an estimate λ ∈ of the minimal average cost λ∗ ∈ and an approximation Φ˜ of the optimal differential cost function h∗ . The discount factor α is set r to 1 in computing an -greedy Boltzmann exploration policy as well as T . There is an average-cost version of temporal-difference learning for which any limit of convergence ˜ ˜ (λ, r) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective: Theorem 7 For any communicating MDP with an average-cost objective, partition, and r ˜ > 0, if Φ˜ = Ππr (T Φ˜ − λ1) then r ˜ λµr − λ∗ ≤ 2 min h∗ − Φr ˜ r∈ K ∞ + . Here, λµr ∈ denotes the average cost under policy µr , which is well-defined because the ˜ ˜ process is irreducible under an -greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λµr − λ∗ . The limit of minr∈ K J ∗ − Φr ∞ ˜ on the right-hand side is minr∈ K h∗ − Φr ∞ . (This follows from the analysis of [3].) Acknowledgments This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author’s understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers. References [1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994. [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962. [4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000. [5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002. [6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003. [7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. [8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995. [9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201, 1994. [10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimalvalue functions. Machine Learning, 1994. [11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984. [12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. [13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995. [14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press. [15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. [16] J. N. Tsitsiklis and B. Van Roy. Feature–based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal–difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press. [19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997. [20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporaldifference learning. Machine Learning, 49(2-3):179–191, 2002. [22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005. [23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989. [24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. [sent-2, score-0.579]

2 We establish performance loss bounds for policies derived from approximations associated with fixed points. [sent-3, score-0.278]

3 These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. [sent-4, score-0.32]

4 Such projection weighting leads to the same fixed points as TD(0). [sent-5, score-0.131]

5 Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. [sent-6, score-0.507]

6 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . [sent-7, score-0.208]

7 At each state x ∈ S, there is a finite set Ux of admissible actions. [sent-11, score-0.085]

8 If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). [sent-12, score-0.564]

9 A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. [sent-16, score-0.45]

10 If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). [sent-17, score-0.55]

11 The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. [sent-18, score-0.208]

12 Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. [sent-19, score-0.304]

13 A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. [sent-20, score-0.613]

14 Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . [sent-24, score-0.463]

15 Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. [sent-25, score-0.127]

16 Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . [sent-26, score-0.515]

17 This sequence converges to J ∗ for any initialization of J0 . [sent-27, score-0.077]

18 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. [sent-28, score-0.085]

19 One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . [sent-29, score-0.085]

20 This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. [sent-33, score-0.356]

21 To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . [sent-34, score-0.33]

22 The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. [sent-37, score-0.848]

23 One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . [sent-38, score-0.141]

24 The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . [sent-41, score-0.175]

25 Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). [sent-42, score-0.171]

26 Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . [sent-43, score-0.179]

27 It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. [sent-44, score-0.309]

28 By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . [sent-48, score-0.152]

29 Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. [sent-49, score-0.245]

30 The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. [sent-50, score-0.49]

31 The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. [sent-51, score-0.296]

32 Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. [sent-52, score-0.282]

33 Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). [sent-57, score-0.269]

34 For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). [sent-61, score-0.466]

35 Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. [sent-62, score-0.566]

36 The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). [sent-63, score-0.233]

37 Unfortunately, the bound is sharp, as expressed by the following theorem. [sent-67, score-0.089]

38 Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ r r with π = 1, 4α min J ∗ − Φr ∞ − δ. [sent-69, score-0.25]

39 (1 − α) Jµr − J ∗ ∞ ≥ ˜ 1 − α r∈ K This theorem is established through an example in [22]. [sent-70, score-0.193]

40 The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. [sent-71, score-0.284]

41 Based on Theorems 2 and 3, one might expect that there exists MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) Jµr − J ∗ ˜ ∞ =Θ 1 min J ∗ − Φr 1 − α r∈ K ∞ . [sent-72, score-0.25]

42 In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. [sent-73, score-0.207]

43 It turns out that this is not true, at least if we restrict to a finite state space. [sent-74, score-0.085]

44 However, as the following theorem establishes, the coefficient multiplying minr∈ K J ∗ − Φr ∞ can grow arbitrarily large as α increases, keeping all else fixed. [sent-75, score-0.156]

45 Theorem 4 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r lim inf (1 − α) (Jµr (x) − J ∗ (x)) ≥ L lim min J ∗ − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. [sent-76, score-0.804]

46 For any µ and x, lim ((1 − α)Jµ (x) − λµ (x)) = lim(1 − α)hµ (x) = 0. [sent-78, score-0.254]

47 Corollary 1 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r ∗ lim inf (λµr (x) − λµ∗ (x)) ≥ L lim min J − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. [sent-80, score-0.804]

48 3 Using the Invariant Distribution In the previous section, we considered an approximation Φ˜ that solves Ππ T Φ˜ = Φ˜ for r r r some arbitrary pre-selected weights π. [sent-81, score-0.164]

49 We now turn to consider use of an invariant state distribution πr of Pµr as the weight vector. [sent-82, score-0.212]

50 1 This leads to a circular definition: the weights ˜ ˜ are used in defining r and now we are defining the weights in terms of r. [sent-83, score-0.129]

51 The following theorem captures ˜ r r ˜ the associated benefits. [sent-85, score-0.156]

52 ) Theorem 5 For any MDP and partition, if Φ˜ = Ππr T Φ˜ and πr has support intersecting r r ˜ ˜ T every partition, (1 − α)πr (Jµr − J ∗ ) ≤ 2α minr∈ K J ∗ − Φr ∞ . [sent-87, score-0.078]

53 ˜ ˜ When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. [sent-88, score-0.282]

54 In particular, based on this, we have lim(1 − α) Jµ − J ∗ ∞ = |λµ − λ∗ | = λµ − λ∗ = lim π T (Jµ − J ∗ ), α↑1 α↑1 for all policies µ and probability distributions π. [sent-92, score-0.316]

55 Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. [sent-93, score-0.11]

56 Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φ˜ = Ππr T Φ˜ and r r ˜ lim inf α↑1 x∈Sk πr (x) > 0 for all k, ˜ lim sup λµr − λµ∗ ∞ ≤ 2 lim min J ∗ − Φr ∞ . [sent-95, score-0.866]

57 r r 1 By an invariant state distribution of a transition matrix P , we mean any probability distribution π such that π T P = π T . [sent-97, score-0.212]

58 In the event that Pµr has multiple invariant distributions, πr denotes an ˜ ˜ arbitrary choice. [sent-98, score-0.127]

59 4 Exploration If a vector r solves Φ˜ = Ππr T Φ˜ and the support of πr intersects every partition, Theorem ˜ r r ˜ ˜ 5 promises a desirable bound. [sent-99, score-0.112]

60 It is easy to produce examples of this; though r r ˜ no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. [sent-102, score-0.22]

61 A stochastic policy µ maps state-action pairs to probabilities. [sent-105, score-0.334]

62 For each x ∈ S and u ∈ Ux , µ(x, u) is the probability of taking action u when in state x. [sent-106, score-0.146]

63 Given a scalar > 0 and a function J, the -greedy Boltzmann exploration policy with respect to J is defined by µ(x, u) = e−(Tu J)(x)(|Ux |−1)/ e . [sent-108, score-0.505]

64 −(Tu J)(x)(|Ux |−1)/ e u∈Ux e For any > 0 and r, let µr denote the -greedy Boltzmann exploration policy with respect to Φr. [sent-109, score-0.505]

65 Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T J)(x) = u∈Ux e−(Tu J)(x)(|Ux |−1)/ e (Tu J)(x) . [sent-110, score-0.208]

66 −(Tu J)(x)(|Ux |−1)/ e u∈Ux e As approaches 0, -greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. [sent-111, score-0.661]

67 More precisely, for all r, x, and J, lim ↓0 µr (x, µr (x)) = 1 and lim ↓1 T J = T J. [sent-112, score-0.508]

68 n , mini vi + ≥ i e−vi (n−1)/ e vi / i e−vi (n−1)/ e ≥ Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each -greedy Boltzmann exploration policy µr and the support of this distribution is S. [sent-115, score-1.033]

69 For any > 0, there exists a solution to this equation (this is an r r ˜ immediate extension of Theorem 5. [sent-118, score-0.108]

70 We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. [sent-120, score-0.339]

71 It requires an abilr r ity to sample a sequence of states x(0) , x(1) , x(2) , . [sent-125, score-0.096]

72 Also required is a way to efficiently compute (T Φr)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)(Φr)(y)), for any given x and r. [sent-129, score-0.161]

73 This is typically possible when the action set Ux and the support of px· (u) (i. [sent-130, score-0.096]

74 , the set of states that can follow x if action u is selected) are not too large. [sent-132, score-0.088]

75 In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r( ) converges to a vector r that solves Φ˜ = Ππ T Φ˜. [sent-134, score-0.085]

76 Further, the requirement that states r r are sampled independently from the invariant distribution may be impractical. [sent-136, score-0.191]

77 of the MDP, with each action ut ∈ Uxt generated by the -greedy Boltzmann exploration policy with respect to Φr(t) . [sent-141, score-0.566]

78 Whether such an algorithm converges and whether there are r r ˜ other algorithms that can effectively solve Φ˜ = Ππr T Φ˜ for broad classes of relevant r r ˜ problems remain open issues. [sent-144, score-0.106]

79 6 Extensions and Open Issues Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. [sent-145, score-0.692]

80 It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. [sent-146, score-0.189]

81 This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. [sent-147, score-0.3]

82 In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. [sent-148, score-0.831]

83 In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. [sent-153, score-0.228]

84 The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. [sent-154, score-0.237]

85 Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. [sent-155, score-0.401]

86 In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. [sent-156, score-0.377]

87 Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. [sent-157, score-0.11]

88 Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. [sent-158, score-0.141]

89 Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. [sent-159, score-0.108]

90 r r ˜ ˜ ˜ The variables include an estimate λ ∈ of the minimal average cost λ∗ ∈ and an approximation Φ˜ of the optimal differential cost function h∗ . [sent-161, score-0.25]

91 The discount factor α is set r to 1 in computing an -greedy Boltzmann exploration policy as well as T . [sent-162, score-0.499]

92 There is an average-cost version of temporal-difference learning for which any limit of convergence ˜ ˜ (λ, r) satisfies this equation [19, 20, 21]. [sent-163, score-0.076]

93 Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. [sent-164, score-0.138]

94 Here, λµr ∈ denotes the average cost under policy µr , which is well-defined because the ˜ ˜ process is irreducible under an -greedy Boltzmann exploration policy. [sent-166, score-0.57]

95 This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. [sent-167, score-0.245]

96 It is easy to see that the limit of the left-hand side is λµr − λ∗ . [sent-168, score-0.094]

97 The limit of minr∈ K J ∗ − Φr ∞ ˜ on the right-hand side is minr∈ K h∗ − Φr ∞ . [sent-169, score-0.094]

98 On the existence of fixed points for approximate value iteration and temporal-difference learning. [sent-194, score-0.173]

99 An upper-bound on the loss from approximate optimalvalue functions. [sent-232, score-0.172]

100 Performance loss bounds for approximate value iteration with state aggregation. [sent-299, score-0.436]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ux', 0.388), ('policy', 0.304), ('minr', 0.269), ('mdp', 0.263), ('lim', 0.254), ('boltzmann', 0.17), ('exploration', 0.165), ('pxy', 0.161), ('partition', 0.159), ('theorem', 0.156), ('tsitsiklis', 0.147), ('van', 0.129), ('invariant', 0.127), ('farias', 0.124), ('communicating', 0.123), ('iteration', 0.107), ('sk', 0.106), ('loss', 0.106), ('td', 0.102), ('gu', 0.098), ('dynamic', 0.094), ('tu', 0.094), ('bound', 0.089), ('state', 0.085), ('greedy', 0.084), ('programming', 0.081), ('cost', 0.074), ('bounds', 0.072), ('projection', 0.07), ('approximators', 0.069), ('kth', 0.068), ('approximate', 0.066), ('vi', 0.066), ('contraction', 0.065), ('approximator', 0.065), ('approximation', 0.063), ('policies', 0.062), ('mini', 0.062), ('minu', 0.062), ('discounted', 0.062), ('action', 0.061), ('mdps', 0.059), ('min', 0.058), ('parameterized', 0.058), ('weights', 0.051), ('solves', 0.05), ('side', 0.049), ('shortcomings', 0.049), ('ted', 0.046), ('aggregation', 0.046), ('amherst', 0.046), ('inf', 0.046), ('limit', 0.045), ('immediate', 0.044), ('euclidean', 0.043), ('intersecting', 0.043), ('bertsekas', 0.043), ('sequence', 0.042), ('optimal', 0.039), ('open', 0.039), ('performance', 0.038), ('corollary', 0.037), ('requirement', 0.037), ('established', 0.037), ('bene', 0.037), ('respect', 0.036), ('support', 0.035), ('converges', 0.035), ('norm', 0.035), ('indicator', 0.035), ('weighting', 0.034), ('broader', 0.034), ('accommodate', 0.034), ('operations', 0.033), ('exists', 0.033), ('ma', 0.033), ('operator', 0.033), ('whether', 0.032), ('proof', 0.031), ('equation', 0.031), ('xt', 0.03), ('stochastic', 0.03), ('factor', 0.03), ('generates', 0.03), ('linearly', 0.028), ('said', 0.028), ('states', 0.027), ('costs', 0.027), ('omit', 0.027), ('intersects', 0.027), ('bvr', 0.027), ('irreducible', 0.027), ('twelfth', 0.027), ('naval', 0.027), ('ity', 0.027), ('unavailable', 0.027), ('watkins', 0.027), ('counterexample', 0.027), ('machine', 0.027), ('leads', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). For any x ∈ S and u ∈ Ux , y∈S pxy (u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. u∈Ux The optimal cost-to-go function J ∗ ∈ |S| is defined by J ∗ (x) = minµ Jµ (x), for all x ∈ S. A policy µ∗ is said to be optimal if Jµ∗ = J ∗ . It is well-known that an optimal policy exists. Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . This sequence converges to J ∗ for any initialization of J0 . 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . Hence, for any r ∈ K , k, and x ∈ Sk , (Φr)(x) = rk . In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J ∗ . The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . The weighted Euclidean 1/2 |S| 2 norm: J 2,π = . Here, π ∈ + is a vector of weights that assign x∈S π(x)J (x) relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α2 , . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). It turns out that minr J ∗ − Φr ∞ = O(1). To see why, note that for any µ, Jµ = (I − αPµ )−1 gµ = 1 λ µ + hµ , 1−α where λµ (x) is the expected average cost if the process starts in state x and is controlled by policy µ, τ −1 1 t λµ = lim Pµ gµ , τ →∞ τ t=0 and hµ is the discounted differential cost function hµ = (I − αPµ )−1 (gµ − λµ ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1. Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ r r with π = 1, 4α min J ∗ − Φr ∞ − δ. (1 − α) Jµr − J ∗ ∞ ≥ ˜ 1 − α r∈ K This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exists MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) Jµr − J ∗ ˜ ∞ =Θ 1 min J ∗ − Φr 1 − α r∈ K ∞ . In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying minr∈ K J ∗ − Φr ∞ can grow arbitrarily large as α increases, keeping all else fixed. Theorem 4 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r lim inf (1 − α) (Jµr (x) − J ∗ (x)) ≥ L lim min J ∗ − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. This Theorem is also established through an example [22]. For any µ and x, lim ((1 − α)Jµ (x) − λµ (x)) = lim(1 − α)hµ (x) = 0. α↑1 α↑1 Combined with Theorem 4, this yields the following corollary. Corollary 1 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r ∗ lim inf (λµr (x) − λµ∗ (x)) ≥ L lim min J − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. 3 Using the Invariant Distribution In the previous section, we considered an approximation Φ˜ that solves Ππ T Φ˜ = Φ˜ for r r r some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution πr of Pµr as the weight vector.1 This leads to a circular definition: the weights ˜ ˜ are used in defining r and now we are defining the weights in terms of r. What we are ˜ ˜ really after here is a vector r that satisfies Ππr T Φ˜ = Φ˜. The following theorem captures ˜ r r ˜ the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].) Theorem 5 For any MDP and partition, if Φ˜ = Ππr T Φ˜ and πr has support intersecting r r ˜ ˜ T every partition, (1 − α)πr (Jµr − J ∗ ) ≤ 2α minr∈ K J ∗ − Φr ∞ . ˜ ˜ When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, limα↑1 (1 − α)Jµ − λµ ∞ = 0, as explained in Section 2. In particular, based on this, we have lim(1 − α) Jµ − J ∗ ∞ = |λµ − λ∗ | = λµ − λ∗ = lim π T (Jµ − J ∗ ), α↑1 α↑1 for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5. Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φ˜ = Ππr T Φ˜ and r r ˜ lim inf α↑1 x∈Sk πr (x) > 0 for all k, ˜ lim sup λµr − λµ∗ ∞ ≤ 2 lim min J ∗ − Φr ∞ . ˜ α↑1 α↑1 r∈ K The comparison suggests that solving Φ˜ = Ππr T Φ˜ is strongly preferable to solving r r ˜ Φ˜ = Ππ T Φ˜ with π = 1. r r 1 By an invariant state distribution of a transition matrix P , we mean any probability distribution π such that π T P = π T . In the event that Pµr has multiple invariant distributions, πr denotes an ˜ ˜ arbitrary choice. 4 Exploration If a vector r solves Φ˜ = Ππr T Φ˜ and the support of πr intersects every partition, Theorem ˜ r r ˜ ˜ 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Ππr T Φ˜ = Φ˜ does not have a solution. It is easy to produce examples of this; though r r ˜ no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of πr intersect every partition. ˜ To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ Ux , µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ Ux , and u∈Ux µ(x, u) = 1 for all x ∈ S. Given a scalar > 0 and a function J, the -greedy Boltzmann exploration policy with respect to J is defined by µ(x, u) = e−(Tu J)(x)(|Ux |−1)/ e . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e For any > 0 and r, let µr denote the -greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T J)(x) = u∈Ux e−(Tu J)(x)(|Ux |−1)/ e (Tu J)(x) . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e As approaches 0, -greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim ↓0 µr (x, µr (x)) = 1 and lim ↓1 T J = T J. These are immediate consequences of the following result (see [4] for a proof). Lemma 1 For any n, v ∈ mini vi . n , mini vi + ≥ i e−vi (n−1)/ e vi / i e−vi (n−1)/ e ≥ Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each -greedy Boltzmann exploration policy µr and the support of this distribution is S. Let πr denote this distribution. We consider a vector r that ˜ solves Φ˜ = Ππr T Φ˜. For any > 0, there exists a solution to this equation (this is an r r ˜ immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].) Theorem 6 For any MDP, partition, and > 0, if Φ˜ = Ππr T Φ˜ then (1 − r r ˜ T ∗ ∗ α)(πr ) (Jµr − J ) ≤ 2α minr∈ K J − Φr ∞ + . ˜ ˜ 5 Computation: TD(0) Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φ˜ = Ππ T Φ˜. It requires an abilr r ity to sample a sequence of states x(0) , x(1) , x(2) , . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)(Φr)(y)), for any given x and r. This is typically possible when the action set Ux and the support of px· (u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r( ) according to r( +1) = r( ) + γ φ(x( ) ) (T Φr( ) )(x( ) ) − (Φr( ) )(x( ) ) , where γ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r( ) converges to a vector r that solves Φ˜ = Ππ T Φ˜. ˜ r r The equation Φ˜ = Ππ T Φ˜ may have no solution. Further, the requirement that states r r are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φ˜ = Ππr T Φ˜. The algorithm requires simulation of a trajectory r r ˜ x0 , x1 , x2 , . . . of the MDP, with each action ut ∈ Uxt generated by the -greedy Boltzmann exploration policy with respect to Φr(t) . The sequence of vectors r(t) is generated according to r(t+1) = r(t) + γt φ(xt ) (T Φr(t) )(xt ) − (Φr(t) )(xt ) . Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φ˜ = Ππr T Φ˜. Whether such an algorithm converges and whether there are r r ˜ other algorithms that can effectively solve Φ˜ = Ππr T Φ˜ for broad classes of relevant r r ˜ problems remain open issues. 6 Extensions and Open Issues Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes Φ˜ = Ππr (T Φ˜ − λ1). r r ˜ ˜ ˜ The variables include an estimate λ ∈ of the minimal average cost λ∗ ∈ and an approximation Φ˜ of the optimal differential cost function h∗ . The discount factor α is set r to 1 in computing an -greedy Boltzmann exploration policy as well as T . There is an average-cost version of temporal-difference learning for which any limit of convergence ˜ ˜ (λ, r) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective: Theorem 7 For any communicating MDP with an average-cost objective, partition, and r ˜ > 0, if Φ˜ = Ππr (T Φ˜ − λ1) then r ˜ λµr − λ∗ ≤ 2 min h∗ − Φr ˜ r∈ K ∞ + . Here, λµr ∈ denotes the average cost under policy µr , which is well-defined because the ˜ ˜ process is irreducible under an -greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λµr − λ∗ . The limit of minr∈ K J ∗ − Φr ∞ ˜ on the right-hand side is minr∈ K h∗ − Φr ∞ . (This follows from the analysis of [3].) Acknowledgments This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author’s understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers. References [1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994. [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962. [4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000. [5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002. [6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003. [7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. [8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995. [9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201, 1994. [10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimalvalue functions. Machine Learning, 1994. [11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984. [12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. [13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995. [14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press. [15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. [16] J. N. Tsitsiklis and B. Van Roy. Feature–based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal–difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press. [19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997. [20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporaldifference learning. Machine Learning, 49(2-3):179–191, 2002. [22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005. [23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989. [24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

2 0.24938722 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

3 0.19564575 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

4 0.18512231 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

5 0.14694414 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

6 0.13705401 153 nips-2005-Policy-Gradient Methods for Planning

7 0.12525195 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

8 0.10443542 47 nips-2005-Consistency of one-class SVM and related algorithms

9 0.09781491 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

10 0.092227392 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

11 0.089476936 53 nips-2005-Cyclic Equilibria in Markov Games

12 0.089202493 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

13 0.086694039 58 nips-2005-Divergences, surrogate loss functions and experimental design

14 0.086214848 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

15 0.080199137 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

16 0.078581966 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

17 0.073533125 50 nips-2005-Convex Neural Networks

18 0.070564836 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

19 0.068728097 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

20 0.06843584 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.231), (1, 0.054), (2, 0.298), (3, -0.189), (4, -0.032), (5, 0.035), (6, -0.047), (7, -0.041), (8, -0.003), (9, -0.134), (10, 0.002), (11, 0.011), (12, 0.011), (13, 0.031), (14, 0.014), (15, 0.066), (16, 0.021), (17, -0.115), (18, 0.032), (19, -0.094), (20, -0.009), (21, -0.0), (22, -0.086), (23, -0.031), (24, 0.032), (25, 0.116), (26, 0.002), (27, -0.059), (28, -0.095), (29, -0.014), (30, 0.026), (31, 0.067), (32, 0.126), (33, 0.025), (34, 0.087), (35, -0.023), (36, -0.065), (37, 0.05), (38, -0.029), (39, -0.003), (40, -0.061), (41, 0.038), (42, -0.094), (43, -0.085), (44, 0.026), (45, 0.001), (46, -0.029), (47, 0.076), (48, 0.088), (49, 0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95482558 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). For any x ∈ S and u ∈ Ux , y∈S pxy (u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. u∈Ux The optimal cost-to-go function J ∗ ∈ |S| is defined by J ∗ (x) = minµ Jµ (x), for all x ∈ S. A policy µ∗ is said to be optimal if Jµ∗ = J ∗ . It is well-known that an optimal policy exists. Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . This sequence converges to J ∗ for any initialization of J0 . 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . Hence, for any r ∈ K , k, and x ∈ Sk , (Φr)(x) = rk . In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J ∗ . The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . The weighted Euclidean 1/2 |S| 2 norm: J 2,π = . Here, π ∈ + is a vector of weights that assign x∈S π(x)J (x) relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α2 , . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). It turns out that minr J ∗ − Φr ∞ = O(1). To see why, note that for any µ, Jµ = (I − αPµ )−1 gµ = 1 λ µ + hµ , 1−α where λµ (x) is the expected average cost if the process starts in state x and is controlled by policy µ, τ −1 1 t λµ = lim Pµ gµ , τ →∞ τ t=0 and hµ is the discounted differential cost function hµ = (I − αPµ )−1 (gµ − λµ ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1. Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ r r with π = 1, 4α min J ∗ − Φr ∞ − δ. (1 − α) Jµr − J ∗ ∞ ≥ ˜ 1 − α r∈ K This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exists MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) Jµr − J ∗ ˜ ∞ =Θ 1 min J ∗ − Φr 1 − α r∈ K ∞ . In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying minr∈ K J ∗ − Φr ∞ can grow arbitrarily large as α increases, keeping all else fixed. Theorem 4 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r lim inf (1 − α) (Jµr (x) − J ∗ (x)) ≥ L lim min J ∗ − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. This Theorem is also established through an example [22]. For any µ and x, lim ((1 − α)Jµ (x) − λµ (x)) = lim(1 − α)hµ (x) = 0. α↑1 α↑1 Combined with Theorem 4, this yields the following corollary. Corollary 1 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r ∗ lim inf (λµr (x) − λµ∗ (x)) ≥ L lim min J − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. 3 Using the Invariant Distribution In the previous section, we considered an approximation Φ˜ that solves Ππ T Φ˜ = Φ˜ for r r r some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution πr of Pµr as the weight vector.1 This leads to a circular definition: the weights ˜ ˜ are used in defining r and now we are defining the weights in terms of r. What we are ˜ ˜ really after here is a vector r that satisfies Ππr T Φ˜ = Φ˜. The following theorem captures ˜ r r ˜ the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].) Theorem 5 For any MDP and partition, if Φ˜ = Ππr T Φ˜ and πr has support intersecting r r ˜ ˜ T every partition, (1 − α)πr (Jµr − J ∗ ) ≤ 2α minr∈ K J ∗ − Φr ∞ . ˜ ˜ When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, limα↑1 (1 − α)Jµ − λµ ∞ = 0, as explained in Section 2. In particular, based on this, we have lim(1 − α) Jµ − J ∗ ∞ = |λµ − λ∗ | = λµ − λ∗ = lim π T (Jµ − J ∗ ), α↑1 α↑1 for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5. Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φ˜ = Ππr T Φ˜ and r r ˜ lim inf α↑1 x∈Sk πr (x) > 0 for all k, ˜ lim sup λµr − λµ∗ ∞ ≤ 2 lim min J ∗ − Φr ∞ . ˜ α↑1 α↑1 r∈ K The comparison suggests that solving Φ˜ = Ππr T Φ˜ is strongly preferable to solving r r ˜ Φ˜ = Ππ T Φ˜ with π = 1. r r 1 By an invariant state distribution of a transition matrix P , we mean any probability distribution π such that π T P = π T . In the event that Pµr has multiple invariant distributions, πr denotes an ˜ ˜ arbitrary choice. 4 Exploration If a vector r solves Φ˜ = Ππr T Φ˜ and the support of πr intersects every partition, Theorem ˜ r r ˜ ˜ 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Ππr T Φ˜ = Φ˜ does not have a solution. It is easy to produce examples of this; though r r ˜ no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of πr intersect every partition. ˜ To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ Ux , µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ Ux , and u∈Ux µ(x, u) = 1 for all x ∈ S. Given a scalar > 0 and a function J, the -greedy Boltzmann exploration policy with respect to J is defined by µ(x, u) = e−(Tu J)(x)(|Ux |−1)/ e . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e For any > 0 and r, let µr denote the -greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T J)(x) = u∈Ux e−(Tu J)(x)(|Ux |−1)/ e (Tu J)(x) . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e As approaches 0, -greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim ↓0 µr (x, µr (x)) = 1 and lim ↓1 T J = T J. These are immediate consequences of the following result (see [4] for a proof). Lemma 1 For any n, v ∈ mini vi . n , mini vi + ≥ i e−vi (n−1)/ e vi / i e−vi (n−1)/ e ≥ Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each -greedy Boltzmann exploration policy µr and the support of this distribution is S. Let πr denote this distribution. We consider a vector r that ˜ solves Φ˜ = Ππr T Φ˜. For any > 0, there exists a solution to this equation (this is an r r ˜ immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].) Theorem 6 For any MDP, partition, and > 0, if Φ˜ = Ππr T Φ˜ then (1 − r r ˜ T ∗ ∗ α)(πr ) (Jµr − J ) ≤ 2α minr∈ K J − Φr ∞ + . ˜ ˜ 5 Computation: TD(0) Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φ˜ = Ππ T Φ˜. It requires an abilr r ity to sample a sequence of states x(0) , x(1) , x(2) , . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)(Φr)(y)), for any given x and r. This is typically possible when the action set Ux and the support of px· (u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r( ) according to r( +1) = r( ) + γ φ(x( ) ) (T Φr( ) )(x( ) ) − (Φr( ) )(x( ) ) , where γ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r( ) converges to a vector r that solves Φ˜ = Ππ T Φ˜. ˜ r r The equation Φ˜ = Ππ T Φ˜ may have no solution. Further, the requirement that states r r are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φ˜ = Ππr T Φ˜. The algorithm requires simulation of a trajectory r r ˜ x0 , x1 , x2 , . . . of the MDP, with each action ut ∈ Uxt generated by the -greedy Boltzmann exploration policy with respect to Φr(t) . The sequence of vectors r(t) is generated according to r(t+1) = r(t) + γt φ(xt ) (T Φr(t) )(xt ) − (Φr(t) )(xt ) . Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φ˜ = Ππr T Φ˜. Whether such an algorithm converges and whether there are r r ˜ other algorithms that can effectively solve Φ˜ = Ππr T Φ˜ for broad classes of relevant r r ˜ problems remain open issues. 6 Extensions and Open Issues Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes Φ˜ = Ππr (T Φ˜ − λ1). r r ˜ ˜ ˜ The variables include an estimate λ ∈ of the minimal average cost λ∗ ∈ and an approximation Φ˜ of the optimal differential cost function h∗ . The discount factor α is set r to 1 in computing an -greedy Boltzmann exploration policy as well as T . There is an average-cost version of temporal-difference learning for which any limit of convergence ˜ ˜ (λ, r) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective: Theorem 7 For any communicating MDP with an average-cost objective, partition, and r ˜ > 0, if Φ˜ = Ππr (T Φ˜ − λ1) then r ˜ λµr − λ∗ ≤ 2 min h∗ − Φr ˜ r∈ K ∞ + . Here, λµr ∈ denotes the average cost under policy µr , which is well-defined because the ˜ ˜ process is irreducible under an -greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λµr − λ∗ . The limit of minr∈ K J ∗ − Φr ∞ ˜ on the right-hand side is minr∈ K h∗ − Φr ∞ . (This follows from the analysis of [3].) Acknowledgments This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author’s understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers. References [1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994. [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962. [4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000. [5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002. [6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003. [7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. [8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995. [9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201, 1994. [10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimalvalue functions. Machine Learning, 1994. [11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984. [12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. [13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995. [14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press. [15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. [16] J. N. Tsitsiklis and B. Van Roy. Feature–based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal–difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press. [19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997. [20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporaldifference learning. Machine Learning, 49(2-3):179–191, 2002. [22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005. [23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989. [24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

2 0.82988048 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

3 0.74942732 144 nips-2005-Off-policy Learning with Options and Recognizers

Author: Doina Precup, Cosmin Paduraru, Anna Koop, Richard S. Sutton, Satinder P. Singh

Abstract: We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off- policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem do- mains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow. In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus, can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. 1 Non-sequential example The benefits of using recognizers in off-policy learning can be most easily seen in a nonsequential context with a single continuous action. Suppose you are given a sequence of sample actions ai ∈ [0, 1], selected i.i.d. according to probability density b : [0, 1] → ℜ+ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each each action, ai , we observe a corresponding outcome, zi ∈ ℜ, a random variable whose distribution depends only on ai . Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean mb of the outcome density. This problem can be solved simply by averaging the sample outcomes: mb = (1/n) ∑n zi . The off-policy problem is to use this same data to learn what ˆ i=1 the mean would be if actions were selected in some way other than b, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region. One professes to be interested in actions selected according to a target density π : [0, 1] → ℜ+ , which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as in the dashed line in 12 Probability density functions 1.5 Target policy with recognizer 1 Target policy w/o recognizer without recognizer .5 Behavior policy 0 0 Action 0.7 Empirical variances (average of 200 sample variances) 0.9 1 0 10 with recognizer 100 200 300 400 500 Number of sample actions Figure 1: The left panel shows the behavior policy and the target policies for the formulations of the problem with and without recognizers. The right panel shows empirical estimates of the variances for the two formulations as a function of the number sample actions. The lowest line is for the formulation using empirically-estimated recognition probabilities. Figure 1 (left). The importance- sampling estimate of the mean outcome is 1 n π(ai ) mπ = ∑ ˆ zi . n i=1 b(ai ) (1) This approach is problematic if there are parts of the region of interest where the behavior density is zero or very nearly so, such as near 0.72 and 0.85 in the example. Here the importance sampling ratios are exceedingly large and the estimate is poorly conditioned (large variance). The upper curve in Figure 1 (right) shows the empirical variance of this estimate as a function of the number of samples. The spikes and uncertain decline of the empirical variance indicate that the distribution is very skewed and that the estimates are very poorly conditioned. The second way to pose the problem uses recognizers. One professes to be interested in actions to the extent that they are both selected by b and within the designated region. This leads to the target policy shown in blue in the left panel of Figure 1 (it is taller because it still must sum to 1). For this problem, the variance of (1) is much smaller, as shown in the lower two lines of Figure 1 (right). To make this way of posing the problem clear, we introduce the notion of a recognizer function c : A → ℜ+ . The action space in the example is A = [0, 1] and the recognizer is c(a) = 1 for a between 0.7 and 0.9 and is zero elsewhere. The target policy is defined in general by c(a)b(a) c(a)b(a) = . (2) π(a) = µ ∑x c(x)b(x) where µ = ∑x c(x)b(x) is a constant, equal to the probability of recognizing an action from the behavior policy. Given π, mπ from (1) can be rewritten in terms of the recognizer as ˆ n π(ai ) 1 n c(ai )b(ai ) 1 1 n c(ai ) 1 mπ = ∑ zi ˆ = ∑ zi = ∑ zi (3) n i=1 b(ai ) n i=1 µ b(ai ) n i=1 µ Note that the target density does not appear at all in the last expression and that the behavior distribution appears only in µ, which is independent of the sample action. If this constant is known, then this estimator can be computed with no knowledge of π or b. The constant µ can easily be estimated as the fraction of recognized actions in the sample. The lowest line in Figure 1 (right) shows the variance of the estimator using this fraction in place of the recognition probability. Its variance is low, no worse than that of the exact algorithm, and apparently slightly lower. Because this algorithm does not use the behavior density, it can be applied when the behavior density is unknown or does not even exist. For example, suppose actions were selected in some deterministic, systematic way that in the long run produced an empirical distribution like b. This would be problematic for the other algorithms but would require no modification of the recognition-fraction algorithm. 2 Recognizers improve conditioning of off-policy learning The main use of recognizers is in formulating a target density π about which we can successfully learn predictions, based on the current behavior being followed. Here we formalize this intuition. Theorem 1 Let A = {a1 , . . . ak } ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let πA be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer cA is the policy with minimum-variance one-step importance sampling corrections, among those in πA : π(ai ) 2 π as given by (2) = arg min Eb (4) π∈πA b(ai ) Proof: Denote π(ai ) = πi , b(ai ) = bi . Then the expected variance of the one-step importance sampling corrections is: Eb πi bi πi bi 2 2 − Eb = ∑ bi i πi bi 2 −1 = ∑ i π2 i − 1, bi where the summation (here and everywhere below) is such that the action ai ∈ A. We want to find πi that minimizes this expression, subject to the constraint that ∑i πi = 1. This is a constrained optimization problem. To solve it, we write down the corresponding Lagrangian: π2 L(πi , β) = ∑ i − 1 + β(∑ πi − 1) i i bi We take the partial derivatives wrt πi and β and set them to 0: βbi ∂L 2 = πi + β = 0 ⇒ πi = − ∂πi bi 2 (5) ∂L = πi − 1 = 0 ∂β ∑ i (6) By taking (5) and plugging into (6), we get the following expression for β: − β 2 bi = 1 ⇒ β = − 2∑ ∑i bi i By substituting β into (5) we obtain: πi = bi ∑i b i This is exactly the policy induced by the recognizer defined by c(ai ) = 1 iff ai ∈ A. We also note that it is advantageous, from the point of view of minimizing the variance of the updates, to have recognizers that accept a broad range of actions: Theorem 2 Consider two binary recognizers c1 and c2 , such that µ1 > µ2 . Then the importance sampling corrections for c1 have lower variance than the importance sampling corrections for c2 . Proof: From the previous theorem, we have the variance of a recognizer cA : Var = ∑ i π2 bi i −1 = ∑ bi ∑ j∈A b j i 2 1 1 1 −1 = −1 = −1 bi µ ∑ j∈A b j 3 Formal framework for sequential problems We turn now to the full case of learning about sequential decision processes with function approximation. We use the standard framework in which an agent interacts with a stochastic environment. At each time step t, the agent receives a state st and chooses an action at . We assume for the moment that actions are selected according to a fixed behavior policy, b : S × A → [0, 1] where b(s, a) is the probability of selecting action a in state s. The behavior policy is used to generate a sequence of experience (observations, actions and rewards). The goal is to learn, from this data, predictions about different ways of behaving. In this paper we focus on learning predictions about expected returns, but other predictions can be tackled as well (for instance, predictions of transition models for options (Sutton, Precup & Singh, 1999), or predictions specified by a TD-network (Sutton & Tanner, 2005; Sutton, Rafols & Koop, 2006)). We assume that the state space is large or continuous, and function approximation must be used to compute any values of interest. In particular, we assume a space of feature vectors Φ and a mapping φ : S → Φ. We denote by φs the feature vector associated with s. An option is defined as a triple o = I, π, β where I ⊆ S is the set of states in which the option can be initiated, π is the internal policy of the option and β : S → [0, 1] is a stochastic termination condition. In the option work (Sutton, Precup & Singh, 1999), each of these elements has to be explicitly specified and fixed in order for an option to be well defined. Here, we will instead define options implicitly, using the notion of a recognizer. A recognizer is defined as a function c : S × A → [0, 1], where c(s, a) indicates to what extent the recognizer allows action a in state s. An important special case, which we treat in this paper, is that of binary recognizers. In this case, c is an indicator function, specifying a subset of actions that are allowed, or recognized, given a particular state. Note that recognizers do not specify policies; instead, they merely give restrictions on the policies that are allowed or recognized. A recognizer c together with a behavior policy b generates a target policy π, where: b(s, a)c(s, a) b(s, a)c(s, a) π(s, a) = (7) = µ(s) ∑x b(s, x)c(s, x) The denominator of this fraction, µ(s) = ∑x b(s, x)c(s, x), is the recognition probability at s, i.e., the probability that an action will be accepted at s when behavior is generated according to b. The policy π is only defined at states for which µ(s) > 0. The numerator gives the probability that action a is produced by the behavior and recognized in s. Note that if the recognizer accepts all state-action pairs, i.e. c(s, a) = 1, ∀s, a, then π is the same as b. Since a recognizer and a behavior policy can specify together a target policy, we can use recognizers as a way to specify policies for options, using (7). An option can only be initiated at a state for which at least one action is recognized, so µ(s) > 0, ∀s ∈ I. Similarly, the termination condition of such an option, β, is defined as β(s) = 1 if µ(s) = 0. In other words, the option must terminate if no actions are recognized at a given state. At all other states, β can be defined between 0 and 1 as desired. We will focus on computing the reward model of an option o, which represents the expected total return. The expected values of different features at the end of the option can be estimated similarly. The quantity that we want to compute is Eo {R(s)} = E{r1 + r2 + . . . + rT |s0 = s, π, β} where s ∈ I, experience is generated according to the policy of the option, π, and T denotes the random variable representing the time step at which the option terminates according to β. We assume that linear function approximation is used to represent these values, i.e. Eo {R(s)} ≈ θT φs where θ is a vector of parameters. 4 Off-policy learning algorithm In this section we present an adaptation of the off-policy learning algorithm of Precup, Sutton & Dasgupta (2001) to the case of learning about options. Suppose that an option’s policy π was used to generate behavior. In this case, learning the reward model of the option is a special case of temporal-difference learning of value functions. The forward ¯ (n) view of this algorithm is as follows. Let Rt denote the truncated n-step return starting at ¯ (0) time step t and let yt denote the 0-step truncated return, Rt . By the definition of the n-step truncated return, we have: ¯ (n) ¯ (n−1) Rt = rt+1 + (1 − βt+1 )Rt+1 . This is similar to the case of value functions, but it accounts for the possibility of terminating the option at time step t + 1. The λ-return is defined in the usual way: ∞ ¯ (n) ¯ Rtλ = (1 − λ) ∑ λn−1 Rt . n=1 The parameters of the linear function approximator are updated on every time step proportionally to: ¯ ¯ ∆θt = Rtλ − yt ∇θ yt (1 − β1 ) · · · (1 − βt ). In our case, however, trajectories are generated according to the behavior policy b. The main idea of the algorithm is to use importance sampling corrections in order to account for the difference in the state distribution of the two policies. Let ρt = (n) Rt , π(st ,at ) b(st ,at ) be the importance sampling ratio at time step t. The truncated n-step return, satisfies: (n) (n−1) Rt = ρt [rt+1 + (1 − βt+1 )Rt+1 ]. The update to the parameter vector is proportional to: ∆θt = Rtλ − yt ∇θ yt ρ0 (1 − β1 ) · · · ρt−1 (1 − βt ). The following result shows that the expected updates of the on-policy and off-policy algorithms are the same. Theorem 3 For every time step t ≥ 0 and any initial state s, ¯ Eb [∆θt |s] = Eπ [∆θt |s]. (n) (n) ¯ Proof: First we will show by induction that Eb {Rt |s} = Eπ {Rt |s}, ∀n (which implies ¯ that Eb {Rtλ |s} = Eπ (Rtλ |s}). For n = 0, the statement is trivial. Assuming that it is true for n − 1, we have (n) Eb Rt |s = a ∑b(s, a)∑Pss ρ(s, a) a = s ∑∑ a Pss b(s, a) a s = a ∑π(s, a)∑Pss a (n−1) a rss + (1 − β(s ))Eb Rt+1 |s π(s, a) a ¯ (n−1) r + (1 − β(s ))Eπ Rt+1 |s b(s, a) ss a ¯ (n−1) rss + (1 − β(s ))Eπ Rt+1 |s ¯ (n) = Eπ Rt |s . s Now we are ready to prove the theorem’s main statement. Defining Ωt to be the set of all trajectory components up to state st , we have: Eb {∆θt |s} = ∑ ω∈Ωt Pb (ω|s)Eb (Rtλ − yt )∇θ yt |ω t−1 ∏ ρi (1 − βi+1 ) i=0 πi (1 − βi+1 ) i=0 bi t−1 = t−1 ∑ ∏ bi Psaiisi+1 ω∈Ωt Eb Rtλ |st − yt ∇θ yt ∏ i=0 t−1 = ∑ ∏ πi Psaiisi+1 ω∈Ωt = ∑ ω∈Ωt ¯ Eπ Rtλ |st − yt ∇θ yt (1 − β1 )...(1 − βt ) i=0 ¯ ¯ Pπ (ω|s)Eπ (Rtλ − yt )∇θ yt |ω (1 − β1 )...(1 − βt ) = Eπ ∆θt |s . Note that we are able to use st and ω interchangeably because of the Markov property. ¯ Since we have shown that Eb [∆θt |s] = Eπ [∆θt |s] for any state s, it follows that the expected updates will also be equal for any distribution of the initial state s. When learning the model of options with data generated from the behavior policy b, the starting state distribution with respect to which the learning is performed, I0 is determined by the stationary distribution of the behavior policy, as well as the initiation set of the option I. We note also that the importance sampling corrections only have to be performed for the trajectory since the initiation of the updates for the option. No corrections are required for the experience prior to this point. This should generate updates that have significantly lower variance than in the case of learning values of policies (Precup, Sutton & Dasgupta, 2001). Because of the termination condition of the option, β, ∆θ can quickly decay to zero. To avoid this problem, we can use a restart function g : S → [0, 1], such that g(st ) specifies the extent to which the updating episode is considered to start at time t. Adding restarts generates a new forward update: t ∆θt = (Rtλ − yt )∇θ yt ∑ gi ρi ...ρt−1 (1 − βi+1 )...(1 − βt ), (8) i=0 where Rtλ is the same as above. With an adaptation of the proof in Precup, Sutton & Dasgupta (2001), we can show that we get the same expected value of updates by applying this algorithm from the original starting distribution as we would by applying the algorithm without restarts from a starting distribution defined by I0 and g. We can turn this forward algorithm into an incremental, backward view algorithm in the following way: • Initialize k0 = g0 , e0 = k0 ∇θ y0 • At every time step t: δt = θt+1 = kt+1 = et+1 = ρt (rt+1 + (1 − βt+1 )yt+1 ) − yt θt + αδt et ρt kt (1 − βt+1 ) + gt+1 λρt (1 − βt+1 )et + kt+1 ∇θ yt+1 Using a similar technique to that of Precup, Sutton & Dasgupta (2001) and Sutton & Barto (1998), we can prove that the forward and backward algorithm are equivalent (omitted due to lack of space). This algorithm is guaranteed to converge if the variance of the updates is finite (Precup, Sutton & Dasgupta, 2001). In the case of options, the termination condition β can be used to ensure that this is the case. 5 Learning when the behavior policy is unknown In this section, we consider the case in which the behavior policy is unknown. This case is generally problematic for importance sampling algorithms, but the use of recognizers will allow us to define importance sampling corrections, as well as a convergent algorithm. Recall that when using a recognizer, the target policy of the option is defined as: c(s, a)b(s, a) π(s, a) = µ(s) and the recognition probability becomes: π(s, a) c(s, a) = b(s, a) µ(s) Of course, µ(s) depends on b. If b is unknown, instead of µ(s), we will use a maximum likelihood estimate µ : S → [0, 1]. The structure used to compute µ will have to be compatible ˆ ˆ with the feature space used to represent the reward model. We will make this more precise below. Likewise, the recognizer c(s, a) will have to be defined in terms of the features used to represent the model. We will then define the importance sampling corrections as: c(s, a) ˆ ρ(s, a) = µ(s) ˆ ρ(s, a) = We consider the case in which the function approximator used to model the option is actually a state aggregator. In this case, we will define recognizers which behave consistently in each partition, i.e., c(s, a) = c(p, a), ∀s ∈ p. This means that an action is either recognized or not recognized in all states of the partition. The recognition probability µ will have one ˆ entry for every partition p of the state space. Its value will be: N(p, c = 1) µ(p) = ˆ N(p) where N(p) is the number of times partition p was visited, and N(p, c = 1) is the number of times the action taken in p was recognized. In the limit, w.p.1, µ converges to ˆ ∑s d b (s|p) ∑a c(p, a)b(s, a) where d b (s|p) is the probability of visiting state s from partiˆ ˆ tion p under the stationary distribution of b. At this limit, π(s, a) = ρ(s, a)b(s, a) will be a ˆ well-defined policy (i.e., ∑a π(s, a) = 1). Using Theorem 3, off-policy updates using imˆ portance sampling corrections ρ will have the same expected value as on-policy updates ˆ ˆ using π. Note though that the learning algorithm never uses π; the only quantities needed ˆ are ρ, which are learned incrementally from data. For the case of general linear function approximation, we conjecture that a similar idea can be used, where the recognition probability is learned using logistic regression. The development of this part is left for future work. Acknowledgements The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI. References Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML. Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML. Sutton, R.S., Precup D. and Singh, S (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol . 112, pp. 181–211. Sutton,, R.S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17. Sutton R.S., Raffols E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks”. In Proceedings of NIPS-18. Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. In Machine learning vol. 42, pp. 241-267. Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42:674–690.

4 0.71742356 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

5 0.65212107 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

Author: Yaakov Engel, Peter Szabo, Dmitry Volkinshtein

Abstract: The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy evaluation known as Gaussian process temporal difference (GPTD) learning. Our substitute for the real arm is a computer simulation of a 2-dimensional model of an Octopus arm. Even with the simplifications inherent to this model, the state space we face is a high-dimensional one. We apply a GPTDbased algorithm to this domain, and demonstrate its operation on several learning tasks of varying degrees of difficulty. 1

6 0.63742226 153 nips-2005-Policy-Gradient Methods for Planning

7 0.61555928 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

8 0.50792605 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

9 0.47074202 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

10 0.44801891 112 nips-2005-Learning Minimum Volume Sets

11 0.42766735 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

12 0.41390911 47 nips-2005-Consistency of one-class SVM and related algorithms

13 0.39912969 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

14 0.37717146 53 nips-2005-Cyclic Equilibria in Markov Games

15 0.37395242 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

16 0.36362424 85 nips-2005-Generalization to Unseen Cases

17 0.35728833 58 nips-2005-Divergences, surrogate loss functions and experimental design

18 0.35719579 117 nips-2005-Learning from Data of Variable Quality

19 0.35420251 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

20 0.347285 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.061), (10, 0.47), (27, 0.026), (31, 0.082), (34, 0.077), (55, 0.042), (69, 0.037), (73, 0.015), (88, 0.054), (91, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97925419 165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

Author: Wentao Huang, Licheng Jiao, Shan Tan, Maoguo Gong

Abstract: In this paper, we aim at analyzing the characteristic of neuronal population responses to instantaneous or time-dependent inputs and the role of synapses in neural information processing. We have derived an evolution equation of the membrane potential density function with synaptic depression, and obtain the formulas for analytic computing the response of instantaneous re rate. Through a technical analysis, we arrive at several signi cant conclusions: The background inputs play an important role in information processing and act as a switch betwee temporal integration and coincidence detection. the role of synapses can be regarded as a spatio-temporal lter; it is important in neural information processing for the spatial distribution of synapses and the spatial and temporal relation of inputs. The instantaneous input frequency can affect the response amplitude and phase delay. 1

same-paper 2 0.94708264 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

Author: Benjamin V. Roy

Abstract: We consider approximate value iteration with a parameterized approximator in which the state space is partitioned and the optimal cost-to-go function over each partition is approximated by a constant. We establish performance loss bounds for policies derived from approximations associated with fixed points. These bounds identify benefits to having projection weights equal to the invariant distribution of the resulting policy. Such projection weighting leads to the same fixed points as TD(0). Our analysis also leads to the first performance loss bound for approximate value iteration with an average cost objective. 1 Preliminaries Consider a discrete-time communicating Markov decision process (MDP) with a finite state space S = {1, . . . , |S|}. At each state x ∈ S, there is a finite set Ux of admissible actions. If the current state is x and an action u ∈ Ux is selected, a cost of gu (x) is incurred, and the system transitions to a state y ∈ S with probability pxy (u). For any x ∈ S and u ∈ Ux , y∈S pxy (u) = 1. Costs are discounted at a rate of α ∈ (0, 1) per period. Each instance of such an MDP is defined by a quintuple (S, U, g, p, α). A (stationary deterministic) policy is a mapping µ that assigns an action u ∈ Ux to each state x ∈ S. If actions are selected based on a policy µ, the state follows a Markov process with transition matrix Pµ , where each (x, y)th entry is equal to pxy (µ(x)). The restriction to communicating MDPs ensures that it is possible to reach any state from any other state. Each policy µ is associated with a cost-to-go function Jµ ∈ |S| , defined by Jµ = ∞ t t −1 gµ , where, with some abuse of notation, gµ (x) = gµ(x) (x) t=0 α Pµ gµ = (I − αPµ ) for each x ∈ S. A policy µ is said to be greedy with respect to a function J if µ(x) ∈ argmin(gu (x) + α y∈S pxy (u)J(y)) for all x ∈ S. u∈Ux The optimal cost-to-go function J ∗ ∈ |S| is defined by J ∗ (x) = minµ Jµ (x), for all x ∈ S. A policy µ∗ is said to be optimal if Jµ∗ = J ∗ . It is well-known that an optimal policy exists. Further, a policy µ∗ is optimal if and only if it is greedy with respect to J ∗ . Hence, given the optimal cost-to-go function, optimal actions can computed be minimizing the right-hand side of the above inclusion. Value iteration generates a sequence J converging to J ∗ according to J +1 = T J , where T is the dynamic programming operator, defined by (T J)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)J(y)), for all x ∈ S and J ∈ |S| . This sequence converges to J ∗ for any initialization of J0 . 2 Approximate Value Iteration The state spaces of relevant MDPs are typically so large that computation and storage of a cost-to-go function is infeasible. One approach to dealing with this obstacle involves partitioning the state space S into a manageable number K of disjoint subsets S1 , . . . , SK and approximating the optimal cost-to-go function with a function that is constant over each partition. This can be thought of as a form of state aggregation – all states within a given partition are assumed to share a common optimal cost-to-go. To represent an approximation, we define a matrix Φ ∈ |S|×K such that each kth column is an indicator function for the kth partition Sk . Hence, for any r ∈ K , k, and x ∈ Sk , (Φr)(x) = rk . In this paper, we study variations of value iteration, each of which computes a vector r so that Φr approximates J ∗ . The use of such a policy µr which is greedy with respect to Φr is justified by the following result (see [10] for a proof): ˜ Theorem 1 If µ is a greedy policy with respect to a function J ∈ Jµ − J ∗ ≤ ∞ 2α ˜ J∗ − J 1−α |S| then ∞. One common way of approximating a function J ∈ |S| with a function of the form Φr involves projection with respect to a weighted Euclidean norm · π . The weighted Euclidean 1/2 |S| 2 norm: J 2,π = . Here, π ∈ + is a vector of weights that assign x∈S π(x)J (x) relative emphasis among states. The projection Ππ J is the function Φr that attains the minimum of J −Φr 2,π ; if there are multiple functions Φr that attain the minimum, they must form an affine space, and the projection is taken to be the one with minimal norm Φr 2,π . Note that in our context, where each kth column of Φ represents an indicator function for the kth partition, for any π, J, and x ∈ Sk , (Ππ J)(x) = y∈Sk π(y)J(y)/ y∈Sk π(y). Approximate value iteration begins with a function Φr(0) and generates a sequence according to Φr( +1) = Ππ T Φr( ) . It is well-known that the dynamic programming operator T is a contraction mapping with respect to the maximum norm. Further, Ππ is maximum-norm nonexpansive [16, 7, 8]. (This is not true for general Φ, but is true in our context in which columns of Φ are indicator functions for partitions.) It follows that the composition Ππ T is a contraction mapping. By the contraction mapping theorem, Ππ T has a unique fixed point Φ˜, which is the limit of the sequence Φr( ) . Further, the following result holds: r Theorem 2 For any MDP, partition, and weights π with support intersecting every partition, if Φ˜ = Ππ T Φ˜ then r r Φ˜ − J ∗ r ∞ ≤ 2 min J ∗ − Φr 1 − α r∈ K and (1 − α) Jµr − J ∗ ˜ ∞ ≤ ∞, 4α min J ∗ − Φr 1 − α r∈ K ∞. The first inequality of the theorem is an approximation error bound, established in [16, 7, 8] for broader classes of approximators that include state aggregation as a special case. The second is a performance loss bound, derived by simply combining the approximation error bound and Theorem 1. Note that Jµr (x) ≥ J ∗ (x) for all x, so the left-hand side of the performance loss bound ˜ is the maximal increase in cost-to-go, normalized by 1 − α. This normalization is natural, since a cost-to-go function is a linear combination of expected future costs, with coefficients 1, α, α2 , . . ., which sum to 1/(1 − α). Our motivation of the normalizing constant begs the question of whether, for fixed MDP parameters (S, U, g, p) and fixed Φ, minr J ∗ − Φr ∞ also grows with 1/(1 − α). It turns out that minr J ∗ − Φr ∞ = O(1). To see why, note that for any µ, Jµ = (I − αPµ )−1 gµ = 1 λ µ + hµ , 1−α where λµ (x) is the expected average cost if the process starts in state x and is controlled by policy µ, τ −1 1 t λµ = lim Pµ gµ , τ →∞ τ t=0 and hµ is the discounted differential cost function hµ = (I − αPµ )−1 (gµ − λµ ). Both λµ and hµ converge to finite vectors as α approaches 1 [3]. For an optimal policy µ∗ , limα↑1 λµ∗ (x) does not depend on x (in our context of a communicating MDP). Since constant functions lie in the range of Φ, lim min J ∗ − Φr α↑1 r∈ K ∞ ≤ lim hµ∗ α↑1 ∞ < ∞. The performance loss bound still exhibits an undesirable dependence on α through the coefficient 4α/(1 − α). In most relevant contexts, α is close to 1; a representative value might be 0.99. Consequently, 4α/(1 − α) can be very large. Unfortunately, the bound is sharp, as expressed by the following theorem. We will denote by 1 the vector with every component equal to 1. Theorem 3 For any δ > 0, α ∈ (0, 1), and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ r r with π = 1, 4α min J ∗ − Φr ∞ − δ. (1 − α) Jµr − J ∗ ∞ ≥ ˜ 1 − α r∈ K This theorem is established through an example in [22]. The choice of uniform weights (π = 1) is meant to point out that even for such a simple, perhaps natural, choice of weights, the performance loss bound is sharp. Based on Theorems 2 and 3, one might expect that there exists MDP parameters (S, U, g, p) and a partition such that, with π = 1, (1 − α) Jµr − J ∗ ˜ ∞ =Θ 1 min J ∗ − Φr 1 − α r∈ K ∞ . In other words, that the performance loss is both lower and upper bounded by 1/(1 − α) times the smallest possible approximation error. It turns out that this is not true, at least if we restrict to a finite state space. However, as the following theorem establishes, the coefficient multiplying minr∈ K J ∗ − Φr ∞ can grow arbitrarily large as α increases, keeping all else fixed. Theorem 4 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r lim inf (1 − α) (Jµr (x) − J ∗ (x)) ≥ L lim min J ∗ − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. This Theorem is also established through an example [22]. For any µ and x, lim ((1 − α)Jµ (x) − λµ (x)) = lim(1 − α)hµ (x) = 0. α↑1 α↑1 Combined with Theorem 4, this yields the following corollary. Corollary 1 For any L and ∆ ≥ 0, there exists MDP parameters (S, U, g, p) and a partition such that limα↑1 minr∈ K J ∗ − Φr ∞ = ∆ and, if Φ˜ = Ππ T Φ˜ with π = 1, r r ∗ lim inf (λµr (x) − λµ∗ (x)) ≥ L lim min J − Φr ∞ , ˜ α↑1 α↑1 r∈ K for all x ∈ S. 3 Using the Invariant Distribution In the previous section, we considered an approximation Φ˜ that solves Ππ T Φ˜ = Φ˜ for r r r some arbitrary pre-selected weights π. We now turn to consider use of an invariant state distribution πr of Pµr as the weight vector.1 This leads to a circular definition: the weights ˜ ˜ are used in defining r and now we are defining the weights in terms of r. What we are ˜ ˜ really after here is a vector r that satisfies Ππr T Φ˜ = Φ˜. The following theorem captures ˜ r r ˜ the associated benefits. (Due to space limitations, we omit the proof, which is provided in the full length version of this paper [22].) Theorem 5 For any MDP and partition, if Φ˜ = Ππr T Φ˜ and πr has support intersecting r r ˜ ˜ T every partition, (1 − α)πr (Jµr − J ∗ ) ≤ 2α minr∈ K J ∗ − Φr ∞ . ˜ ˜ When α is close to 1, which is typical, the right-hand side of our new performance loss bound is far less than that of Theorem 2. The primary improvement is in the omission of a factor of 1 − α from the denominator. But for the bounds to be compared in a meaningful way, we must also relate the left-hand-side expressions. A relation can be based on the fact that for all µ, limα↑1 (1 − α)Jµ − λµ ∞ = 0, as explained in Section 2. In particular, based on this, we have lim(1 − α) Jµ − J ∗ ∞ = |λµ − λ∗ | = λµ − λ∗ = lim π T (Jµ − J ∗ ), α↑1 α↑1 for all policies µ and probability distributions π. Hence, the left-hand-side expressions from the two performance bounds become directly comparable as α approaches 1. Another interesting comparison can be made by contrasting Corollary 1 against the following immediate consequence of Theorem 5. Corollary 2 For all MDP parameters (S, U, g, p) and partitions, if Φ˜ = Ππr T Φ˜ and r r ˜ lim inf α↑1 x∈Sk πr (x) > 0 for all k, ˜ lim sup λµr − λµ∗ ∞ ≤ 2 lim min J ∗ − Φr ∞ . ˜ α↑1 α↑1 r∈ K The comparison suggests that solving Φ˜ = Ππr T Φ˜ is strongly preferable to solving r r ˜ Φ˜ = Ππ T Φ˜ with π = 1. r r 1 By an invariant state distribution of a transition matrix P , we mean any probability distribution π such that π T P = π T . In the event that Pµr has multiple invariant distributions, πr denotes an ˜ ˜ arbitrary choice. 4 Exploration If a vector r solves Φ˜ = Ππr T Φ˜ and the support of πr intersects every partition, Theorem ˜ r r ˜ ˜ 5 promises a desirable bound. However, there are two significant shortcomings to this solution concept, which we will address in this section. First, in some cases, the equation Ππr T Φ˜ = Φ˜ does not have a solution. It is easy to produce examples of this; though r r ˜ no example has been documented for the particular class of approximators we are using here, [2] offers an example involving a different linearly parameterized approximator that captures the spirit of what can happen. Second, it would be nice to relax the requirement that the support of πr intersect every partition. ˜ To address these shortcomings, we introduce stochastic policies. A stochastic policy µ maps state-action pairs to probabilities. For each x ∈ S and u ∈ Ux , µ(x, u) is the probability of taking action u when in state x. Hence, µ(x, u) ≥ 0 for all x ∈ S and u ∈ Ux , and u∈Ux µ(x, u) = 1 for all x ∈ S. Given a scalar > 0 and a function J, the -greedy Boltzmann exploration policy with respect to J is defined by µ(x, u) = e−(Tu J)(x)(|Ux |−1)/ e . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e For any > 0 and r, let µr denote the -greedy Boltzmann exploration policy with respect to Φr. Further, we define a modified dynamic programming operator that incorporates Boltzmann exploration: (T J)(x) = u∈Ux e−(Tu J)(x)(|Ux |−1)/ e (Tu J)(x) . −(Tu J)(x)(|Ux |−1)/ e u∈Ux e As approaches 0, -greedy Boltzmann exploration policies become greedy and the modified dynamic programming operators become the dynamic programming operator. More precisely, for all r, x, and J, lim ↓0 µr (x, µr (x)) = 1 and lim ↓1 T J = T J. These are immediate consequences of the following result (see [4] for a proof). Lemma 1 For any n, v ∈ mini vi . n , mini vi + ≥ i e−vi (n−1)/ e vi / i e−vi (n−1)/ e ≥ Because we are only concerned with communicating MDPs, there is a unique invariant state distribution associated with each -greedy Boltzmann exploration policy µr and the support of this distribution is S. Let πr denote this distribution. We consider a vector r that ˜ solves Φ˜ = Ππr T Φ˜. For any > 0, there exists a solution to this equation (this is an r r ˜ immediate extension of Theorem 5.1 from [4]). We have the following performance loss bound, which parallels Theorem 5 but with an equation for which a solution is guaranteed to exist and without any requirement on the resulting invariant distribution. (Again, we omit the proof, which is available in [22].) Theorem 6 For any MDP, partition, and > 0, if Φ˜ = Ππr T Φ˜ then (1 − r r ˜ T ∗ ∗ α)(πr ) (Jµr − J ) ≤ 2α minr∈ K J − Φr ∞ + . ˜ ˜ 5 Computation: TD(0) Though computation is not a focus of this paper, we offer a brief discussion here. First, we describe a simple algorithm from [16], which draws on ideas from temporal-difference learning [11, 12] and Q-learning [23, 24] to solve Φ˜ = Ππ T Φ˜. It requires an abilr r ity to sample a sequence of states x(0) , x(1) , x(2) , . . ., each independent and identically distributed according to π. Also required is a way to efficiently compute (T Φr)(x) = minu∈Ux (gu (x) + α y∈S pxy (u)(Φr)(y)), for any given x and r. This is typically possible when the action set Ux and the support of px· (u) (i.e., the set of states that can follow x if action u is selected) are not too large. The algorithm generates a sequence of vectors r( ) according to r( +1) = r( ) + γ φ(x( ) ) (T Φr( ) )(x( ) ) − (Φr( ) )(x( ) ) , where γ is a step size and φ(x) denotes the column vector made up of components from the xth row of Φ. In [16], using results from [15, 9], it is shown that under appropriate assumptions on the step size sequence, r( ) converges to a vector r that solves Φ˜ = Ππ T Φ˜. ˜ r r The equation Φ˜ = Ππ T Φ˜ may have no solution. Further, the requirement that states r r are sampled independently from the invariant distribution may be impractical. However, a natural extension of the above algorithm leads to an easily implementable version of TD(0) that aims at solving Φ˜ = Ππr T Φ˜. The algorithm requires simulation of a trajectory r r ˜ x0 , x1 , x2 , . . . of the MDP, with each action ut ∈ Uxt generated by the -greedy Boltzmann exploration policy with respect to Φr(t) . The sequence of vectors r(t) is generated according to r(t+1) = r(t) + γt φ(xt ) (T Φr(t) )(xt ) − (Φr(t) )(xt ) . Under suitable conditions on the step size sequence, if this algorithm converges, the limit satisfies Φ˜ = Ππr T Φ˜. Whether such an algorithm converges and whether there are r r ˜ other algorithms that can effectively solve Φ˜ = Ππr T Φ˜ for broad classes of relevant r r ˜ problems remain open issues. 6 Extensions and Open Issues Our results demonstrate that weighting a Euclidean norm projection by the invariant distribution of a greedy (or approximately greedy) policy can lead to a dramatic performance gain. It is intriguing that temporal-difference learning implicitly carries out such a projection, and consequently, any limit of convergence obeys the stronger performance loss bound. This is not the first time that the invariant distribution has been shown to play a critical role in approximate value iteration and temporal-difference learning. In prior work involving approximation of a cost-to-go function for a fixed policy (no control) and a general linearly parameterized approximator (arbitrary matrix Φ), it was shown that weighting by the invariant distribution is key to ensuring convergence and an approximation error bound [17, 18]. Earlier empirical work anticipated this [13, 14]. The temporal-difference learning algorithm presented in Section 5 is a version of TD(0), This is a special case of TD(λ), which is parameterized by λ ∈ [0, 1]. It is not known whether the results of this paper can be extended to the general case of λ ∈ [0, 1]. Prior research has suggested that larger values of λ lead to superior results. In particular, an example of [1] and the approximation error bounds of [17, 18], both of which are restricted to the case of a fixed policy, suggest that approximation error is amplified by a factor of 1/(1 − α) as λ is changed from 1 to 0. The results of Sections 3 and 4 suggest that this factor vanishes if one considers a controlled process and performance loss rather than approximation error. Whether the results of this paper can be extended to accommodate approximate value iteration with general linearly parameterized approximators remains an open issue. In this broader context, error and performance loss bounds of the kind offered by Theorem 2 are unavailable, even when the invariant distribution is used to weight the projection. Such error and performance bounds are available, on the other hand, for the solution to a certain linear program [5, 6]. Whether a factor of 1/(1 − α) can similarly be eliminated from these bounds is an open issue. Our results can be extended to accommodate an average cost objective, assuming that the MDP is communicating. With Boltzmann exploration, the equation of interest becomes Φ˜ = Ππr (T Φ˜ − λ1). r r ˜ ˜ ˜ The variables include an estimate λ ∈ of the minimal average cost λ∗ ∈ and an approximation Φ˜ of the optimal differential cost function h∗ . The discount factor α is set r to 1 in computing an -greedy Boltzmann exploration policy as well as T . There is an average-cost version of temporal-difference learning for which any limit of convergence ˜ ˜ (λ, r) satisfies this equation [19, 20, 21]. Generalization of Theorem 2 does not lead to a useful result because the right-hand side of the bound becomes infinite as α approaches 1. On the other hand, generalization of Theorem 6 yields the first performance loss bound for approximate value iteration with an average-cost objective: Theorem 7 For any communicating MDP with an average-cost objective, partition, and r ˜ > 0, if Φ˜ = Ππr (T Φ˜ − λ1) then r ˜ λµr − λ∗ ≤ 2 min h∗ − Φr ˜ r∈ K ∞ + . Here, λµr ∈ denotes the average cost under policy µr , which is well-defined because the ˜ ˜ process is irreducible under an -greedy Boltzmann exploration policy. This theorem can be proved by taking limits on the left and right-hand sides of the bound of Theorem 6. It is easy to see that the limit of the left-hand side is λµr − λ∗ . The limit of minr∈ K J ∗ − Φr ∞ ˜ on the right-hand side is minr∈ K h∗ − Φr ∞ . (This follows from the analysis of [3].) Acknowledgments This material is based upon work supported by the National Science Foundation under Grant ECS-9985229 and by the Office of Naval Research under Grant MURI N00014-001-0637. The author’s understanding of the topic benefited from collaborations with Dimitri Bertsekas, Daniela de Farias, and John Tsitsiklis. A full length version of this paper has been submitted to Mathematics of Operations Research and has benefited from a number of useful comments and suggestions made by reviewers. References [1] D. P. Bertsekas. A counterexample to temporal-difference learning. Neural Computation, 7:270–279, 1994. [2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996. [3] D. Blackwell. Discrete dynamic programming. Annals of Mathematical Statistics, 33:719–726, 1962. [4] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000. [5] D. P. de Farias and B. Van Roy. Approximate dynamic programming via linear programming. In Advances in Neural Information Processing Systems 14. MIT Press, 2002. [6] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003. [7] G. J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995. [8] G. J. Gordon. Stable function approximation in dynamic programming. In Machine Learning: Proceedings of the Twelfth International Conference (ICML), San Francisco, CA, 1995. [9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201, 1994. [10] S. P. Singh and R. C. Yee. An upper-bound on the loss from approximate optimalvalue functions. Machine Learning, 1994. [11] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, Amherst, MA, 1984. [12] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. [13] R. S. Sutton. On the virtues of linear learning and trajectory distributions. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, 1995. [14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, Cambridge, MA, 1996. MIT Press. [15] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994. [16] J. N. Tsitsiklis and B. Van Roy. Feature–based methods for large scale dynamic programming. Machine Learning, 22:59–94, 1996. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal–difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. [18] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems 9, Cambridge, MA, 1997. MIT Press. [19] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. In Proceedings of the IEEE Conference on Decision and Control, 1997. [20] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999. [21] J. N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporaldifference learning. Machine Learning, 49(2-3):179–191, 2002. [22] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Under review with Mathematics of Operations Research, available at www.stanford.edu/ bvr/psfiles/aggregation.pdf, 2005. [23] C. J. C. H. Watkins. Learning From Delayed Rewards. PhD thesis, Cambridge University, Cambridge, UK, 1989. [24] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

3 0.94093394 75 nips-2005-Fixing two weaknesses of the Spectral Method

Author: Kevin Lang

Abstract: We discuss two intrinsic weaknesses of the spectral graph partitioning method, both of which have practical consequences. The first is that spectral embeddings tend to hide the best cuts from the commonly used hyperplane rounding method. Rather than cleaning up the resulting suboptimal cuts with local search, we recommend the adoption of flow-based rounding. The second weakness is that for many “power law” graphs, the spectral method produces cuts that are highly unbalanced, thus decreasing the usefulness of the method for visualization (see figure 4(b)) or as a basis for divide-and-conquer algorithms. These balance problems, which occur even though the spectral method’s quotient-style objective function does encourage balance, can be fixed with a stricter balance constraint that turns the spectral mathematical program into an SDP that can be solved for million-node graphs by a method of Burer and Monteiro. 1 Background Graph partitioning is the NP-hard problem of finding a small graph cut subject to the constraint that neither side of the resulting partitioning of the nodes is “too small”. We will be dealing with several versions: the graph bisection problem, which requires perfect 1 : 1 2 2 balance; the β-balanced cut problem (with β a fraction such as 1 ), which requires at least 3 β : (1 − β) balance; and the quotient cut problem, which requires the small side to be large enough to “pay for” the edges in the cut. The quotient cut metric is c/ min(a, b), where c is the cutsize and a and b are the sizes of the two sides of the cut. All of the well-known variants of the quotient cut metric (e.g. normalized cut [15]) have similar behavior with respect to the issues discussed in this paper. The spectral method for graph partitioning was introduced in 1973 by Fiedler and Donath & Hoffman [6]. In the mid-1980’s Alon & Milman [1] proved that spectral cuts can be at worst quadratically bad; in the mid 1990’s Guattery & Miller [10] proved that this analysis is tight by exhibiting a family of n-node graphs whose spectral bisections cut O(n 2/3 ) edges versus the optimal O(n1/3 ) edges. On the other hand, Spielman & Teng [16] have proved stronger performance guarantees for the special case of spacelike graphs. The spectral method can be derived by relaxing a quadratic integer program which encodes the graph bisection problem (see section 3.1). The solution to this relaxation is the “Fiedler vector”, or second smallest eigenvector of the graph’s discrete Laplacian matrix, whose elements xi can be interpreted as an embedding of the graph on the line. To obtain a (A) Graph with nearly balanced 8-cut (B) Spectral Embedding (C) Notional Flow-based Embedding Figure 1: The spectral embedding hides the best solution from hyperplane rounding. specific cut, one must apply a “rounding method” to this embedding. The hyperplane rounding method chooses one of the n − 1 cuts which separate the nodes whose x i values lie above and below some split value x. ˆ 2 Using flow to find cuts that are hidden from hyperplane rounding Theorists have long known that the spectral method cannot distinguish between deep cuts and long paths, and that this confusion can cause it to cut a graph in the wrong direction thereby producing the spectral method’s worst-case behavior [10]. In this section we will show by example that even when the spectral method is not fooled into cutting in the wrong direction, the resulting embedding can hide the best cuts from the hyperplane rounding method. This is a possible explanation for the frequently made empirical observation (see e.g. [12]) that hyperplane roundings of spectral embeddings are noisy and therefore benefit from cleanup with a local search method such as Fiduccia-Matheyses [8]. Consider the graph in figure 1(a), which has a near-bisection cutting 8 edges. For this graph the spectral method produces the embedding shown in figure 1(b), and recommends that we make a vertical cut (across the horizontal dimension which is based on the Fiedler vector). This is correct in a generalized sense, but it is obvious that no hyperplane (or vertical line in this picture) can possibly extract the optimal 8-edge cut. Some insight into why spectral embeddings tend to have this problem can be obtained from the spectral method’s electrical interpretation. In this view the graph is represented by a resistor network [7]. Current flowing in this network causes voltage drops across the resistors, thus determining the nodes’ voltages and hence their positions. When current flows through a long series of resistors, it induces a progressive voltage drop. This is what causes the excessive length of the embeddings of the horizontal girder-like structures which are blocking all vertical hyperplane cuts in figure 1(b). If the embedding method were somehow not based on current, but rather on flow, which does not distinguish between a pipe and a series of pipes, then the long girders could retract into the two sides of the embedding, as suggested by figure 1(c), and the best cut would be revealed. Because theoretical flow-like embedding methods such as [14] are currently not practical, we point out that in cases like figure 1(b), where the spectral method has not chosen an incorrect direction for the cut, one can use an S-T max flow problem with the flow running in the recommended direction (horizontally for this embedding) to extract the good cut even though it is hidden from all hyperplanes. We currently use two different flow-based rounding methods. A method called MQI looks for quotient cuts, and is already described in [13]. Another method, that we shall call Midflow, looks for β-balanced cuts. The input to Midflow is a graph and an ordering of its nodes (obtained e.g. from a spectral embedding or from the projection of any embedding onto a line). We divide the graph’s nodes into 3 sets F, L, and U. The sets F and L respectively contain the first βn and last βn nodes in the ordering, and U contains the remaining 50-50 balance ng s ro un di Hy pe r pl an e neg-pos split quotient cut score (cutsize / size of small side) 0.01 ctor r ve iedle of F 0.004 0.003 0.00268 0.00232 Best hyperplane rounding of Fiedler Vector Best improvement with local search 0.002 0.00138 0.001 60000 80000 Midflow rounding beta = 1/4 100000 120000 0.00145 140000 Midflow rounding of Fiedler Vector beta = 1/3 160000 180000 200000 220000 240000 number of nodes on ’left’ side of cut (out of 324800) Figure 2: A typical example (see section 2.1) where flow-based rounding beats hyperplane rounding, even when the hyperplane cuts are improved with Fiduccia-Matheyses search. Note that for this spacelike graph, the best quotient cuts have reasonably good balance. U = n − 2βn nodes, which are “up for grabs”. We set up an S-T max flow problem with one node for every graph node plus 2 new nodes for the source and sink. For each graph edge there are two arcs, one in each direction, with unit capacity. Finally, the nodes in F are pinned to the source and the nodes in L are pinned to sink by infinite capacity arcs. This max-flow problem can be solved by a good implementation of the push-relabel algorithm (such as Goldberg and Cherkassky’s hi pr [4]) in time that empirically is nearly linear with a very good constant factor. Figure 6 shows that solving a MidFlow problem with hi pr can be 1000 times cheaper than finding a spectral embedding with ARPACK. When the goal is finding good β-balanced cuts, MidFlow rounding is strictly more powerful than hyperplane rounding; from a given node ordering hyperplane rounding chooses the best of U + 1 candidate cuts, while MidFlow rounding chooses the best of 2U candidates, including all of those considered by hyperplane rounding. [Similarly, MQI rounding is strictly more powerful than hyperplane rounding for the task of finding good quotient cuts.] 2.1 A concrete example The plot in figure 2 shows a number of cuts in a 324,800 node nearly planar graph derived from a 700x464 pixel downward-looking view of some clouds over some mountains.1 The y-axis of the plot is quotient cut score; smaller values are better. We note in passing that the commonly used split point x = 0 does not yield the best hyperplane cut. Our main ˆ point is that the two cuts generated by MidFlow rounding of the Fiedler vector (with β = 1 3 and β = 1 ) are nearly twice as good as the best hyperplane cut. Even after the best 4 hyperplane cut has been improved by taking the best result of 100 runs of a version of Fiduccia-Matheyses local search, it is still much worse than the cuts obtained by flowbased rounding. 1 The graph’s edges are unweighted but are chosen by a randomized rule which is more likely to include an edge between two neighboring pixels if they have a similar grey value. Good cuts in the graph tend to run along discontinuities in the image, as one would expect. quotient cut score 1 SDP-LB (smaller is better) 0.1 Scatter plot showing cuts in a

4 0.90822834 156 nips-2005-Prediction and Change Detection

Author: Mark Steyvers, Scott Brown

Abstract: We measure the ability of human observers to predict the next datum in a sequence that is generated by a simple statistical process undergoing change at random points in time. Accurate performance in this task requires the identification of changepoints. We assess individual differences between observers both empirically, and using two kinds of models: a Bayesian approach for change detection and a family of cognitively plausible fast and frugal models. Some individuals detect too many changes and hence perform sub-optimally due to excess variability. Other individuals do not detect enough changes, and perform sub-optimally because they fail to notice short-term temporal trends. 1 Intr oduction Decision-making often requires a rapid response to change. For example, stock analysts need to quickly detect changes in the market in order to adjust investment strategies. Coaches need to track changes in a player’s performance in order to adjust strategy. When tracking changes, there are costs involved when either more or less changes are observed than actually occurred. For example, when using an overly conservative change detection criterion, a stock analyst might miss important short-term trends and interpret them as random fluctuations instead. On the other hand, a change may also be detected too readily. For example, in basketball, a player who makes a series of consecutive baskets is often identified as a “hot hand” player whose underlying ability is perceived to have suddenly increased [1,2]. This might lead to sub-optimal passing strategies, based on random fluctuations. We are interested in explaining individual differences in a sequential prediction task. Observers are shown stimuli generated from a simple statistical process with the task of predicting the next datum in the sequence. The latent parameters of the statistical process change discretely at random points in time. Performance in this task depends on the accurate detection of those changepoints, as well as inference about future outcomes based on the outcomes that followed the most recent inferred changepoint. There is much prior research in statistics on the problem of identifying changepoints [3,4,5]. In this paper, we adopt a Bayesian approach to the changepoint identification problem and develop a simple inference procedure to predict the next datum in a sequence. The Bayesian model serves as an ideal observer model and is useful to characterize the ways in which individuals deviate from optimality. The plan of the paper is as follows. We first introduce the sequential prediction task and discuss a Bayesian analysis of this prediction problem. We then discuss the results from a few individuals in this prediction task and show how the Bayesian approach can capture individual differences with a single “twitchiness” parameter that describes how readily changes are perceived in random sequences. We will show that some individuals are too twitchy: their performance is too variable because they base their predictions on too little of the recent data. Other individuals are not twitchy enough, and they fail to capture fast changes in the data. We also show how behavior can be explained with a set of fast and frugal models [6]. These are cognitively realistic models that operate under plausible computational constraints. 2 A pr ediction task wit h m ult iple c hange points In the prediction task, stimuli are presented sequentially and the task is to predict the next stimulus in the sequence. After t trials, the observer has been presented with stimuli y1, y2, …, yt and the task is to make a prediction about yt+1. After the prediction is made, the actual outcome yt+1 is revealed and the next trial proceeds to the prediction of yt+2. This procedure starts with y1 and is repeated for T trials. The observations yt are D-dimensional vectors with elements sampled from binomial distributions. The parameters of those distributions change discretely at random points in time such that the mean increases or decreases after a change point. This generates a sequence of observation vectors, y1, y2, …, yT, where each yt = {yt,1 … yt,D}. Each of the yt,d is sampled from a binomial distribution Bin(θt,d,K), so 0 ≤ yt,d ≤ K. The parameter vector θt ={θt,1 … θt,D} changes depending on the locations of the changepoints. At each time step, xt is a binary indicator for the occurrence of a changepoint occurring at time t+1. The parameter α determines the probability of a change occurring in the sequence. The generative model is specified by the following algorithm: 1. For d=1..D sample θ1,d from a Uniform(0,1) distribution 2. For t=2..T, (a) Sample xt-1 from a Bernoulli(α) distribution (b) If xt-1=0, then θt=θt-1, else for d=1..D sample θt,d from a Uniform(0,1) distribution (c) for d=1..D, sample yt from a Bin(θt,d,K) distribution Table 1 shows some data generated from the changepoint model with T=20, α=.1,and D=1. In the prediction task, y will be observed, but x and θ are not. Table 1: Example data t x θ y 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 .68 .68 .68 .68 .48 .48 .48 .74 .74 .74 .74 .74 .74 .19 .19 .87 .87 .87 .87 .87 9 7 8 7 4 4 4 9 8 3 6 7 8 2 1 8 9 9 8 8 3 A Bayesian pr ediction m ode l In both our Bayesian and fast-and-frugal analyses, the prediction task is decomposed into two inference procedures. First, the changepoint locations are identified. This is followed by predictive inference for the next outcome based on the most recent changepoint locations. Several Bayesian approaches have been developed for changepoint problems involving single or multiple changepoints [3,5]. We apply a Markov Chain Monte Carlo (MCMC) analysis to approximate the joint posterior distribution over changepoint assignments x while integrating out θ. Gibbs sampling will be used to sample from this posterior marginal distribution. The samples can then be used to predict the next outcome in the sequence. 3.1 I n f e r e nc e f o r c h a n g e p o i n t a s s i g n m e n t s . To apply Gibbs sampling, we evaluate the conditional probability of assigning a changepoint at time i, given all other changepoint assignments and the current α value. By integrating out θ, the conditional probability is P ( xi | x−i , y, α ) = ∫ P ( xi ,θ , α | x− i , y ) (1) θ where x− i represents all switch point assignments except xi. This can be simplified by considering the location of the most recent changepoint preceding and following time i and the outcomes occurring between these locations. Let niL be the number of time steps from the last changepoint up to and including the current time step i such that xi − nL =1 and xi − nL + j =0 for 0 < niL . Similarly, let niR be the number of time steps that i i follow time step i up to the next changepoint such that xi + n R =1 and xi + nR − j =0 for i R i 0 < n . Let y = L i ∑ i − niL < k ≤ i i yk and y = ∑ k < k ≤i + n R yk . The update equation for the R i i changepoint assignment can then be simplified to P ( xi = m | x−i ) ∝ ( ) ( ( ) D Γ 1 + y L + y R Γ 1 + Kn L + Kn R − y L − y R ⎧ i, j i, j i i i, j i, j ⎪ (1 − α ) ∏ L R Γ 2 + Kni + Kni ⎪ j =1 ⎪ ⎨ L L L R R R ⎪ D Γ 1 + yi, j Γ 1 + Kni − yi, j Γ 1 + yi, j Γ 1 + Kni − yi, j α∏ ⎪ Γ 2 + KniL Γ 2 + KniR ⎪ j =1 ⎩ ( ) ( ( ) ( ) ( ) ) ( ) m=0 ) (2) m =1 We initialize the Gibbs sampler by sampling each xt from a Bernoulli(α) distribution. All changepoint assignments are then updated sequentially by the Gibbs sampling equation above. The sampler is run for M iterations after which one set of changepoint assignments is saved. The Gibbs sampler is then restarted multiple times until S samples have been collected. Although we could have included an update equation for α, in this analysis we treat α as a known constant. This will be useful when characterizing the differences between human observers in terms of differences in α. 3.2 P r e d i c ti v e i n f er e n ce The next latent parameter value θt+1 and outcome yt+1 can be predicted on the basis of observed outcomes that occurred after the last inferred changepoint: θ t+1, j = t ∑ i =t* +1 yt+1, j = round (θt +1, j K ) yi, j / K , (3) where t* is the location of the most recent change point. By considering multiple Gibbs samples, we get a distribution over outcomes yt+1. We base the model predictions on the mean of this distribution. 3.3 I l l u s t r a t i o n o f m o d e l p er f o r m a n c e Figure 1 illustrates the performance of the model on a one dimensional sequence (D=1) generated from the changepoint model with T=160, α=0.05, and K=10. The Gibbs sampler was run for M=30 iterations and S=200 samples were collected. The top panel shows the actual changepoints (triangles) and the distribution of changepoint assignments averaged over samples. The bottom panel shows the observed data y (thin lines) as well as the θ values in the generative model (rescaled between 0 and 10). At locations with large changes between observations, the marginal changepoint probability is quite high. At other locations, the true change in the mean is very small, and the model is less likely to put in a changepoint. The lower right panel shows the distribution over predicted θt+1 values. xt 1 0.5 0 yt 10 1 5 θt+1 0.5 0 20 40 60 80 100 120 140 160 0 Figure 1. Results of model simulation. 4 Prediction experiment We tested performance of 9 human observers in the prediction task. The observers included the authors, a visitor, and one student who were aware of the statistical nature of the task as well as naïve students. The observers were seated in front of an LCD touch screen displaying a two-dimensional grid of 11 x 11 buttons. The changepoint model was used to generate a sequence of T=1500 stimuli for two binomial variables y1 and y2 (D=2, K=10). The change probability α was set to 0.1. The two variables y1 and y2 specified the two-dimensional button location. The same sequence was used for all observers. On each trial, the observer touched a button on the grid displayed on the touch screen. Following each button press, the button corresponding to the next {y1,y2} outcome in the sequence was highlighted. Observers were instructed to press the button that best predicted the next location of the highlighted button. The 1500 trials were divided into three blocks of 500 trials. Breaks were allowed between blocks. The whole experiment lasted between 15 and 30 minutes. Figure 2 shows the first 50 trials from the third block of the experiment. The top and bottom panels show the actual outcomes for the y1 and y2 button grid coordinates as well as the predictions for two observers (SB and MY). The figure shows that at trial 15, the y1 and y2 coordinates show a large shift followed by an immediate shift in observer’s MY predictions (on trial 16). Observer SB waits until trial 17 to make a shift. 10 5 0 outcomes SB predictions MY predictions 10 5 0 0 5 10 15 20 25 Trial 30 35 40 45 50 Figure 2. Trial by trial predictions from two observers. 4.1 T a s k er r o r We assessed prediction performance by comparing the prediction with the actual outcome in the sequence. Task error was measured by normalized city-block distance T 1 (4) task error= ∑ yt ,1 − ytO,1 + yt ,2 − ytO,2 (T − 1) t =2 where yO represents the observer’s prediction. Note that the very first trial is excluded from this calculation. Even though more suitable probabilistic measures for prediction error could have been adopted, we wanted to allow comparison of observer’s performance with both probabilistic and non-probabilistic models. Task error ranged from 2.8 (for participant MY) to 3.3 (for ML). We also assessed the performance of five models – their task errors ranged from 2.78 to 3.20. The Bayesian models (Section 3) had the lowest task errors, just below 2.8. This fits with our definition of the Bayesian models as “ideal observer” models – their task error is lower than any other model’s and any human observer’s task error. The fast and frugal models (Section 5) had task errors ranging from 2.85 to 3.20. 5 Modeling R esults We will refer to the models with the following letter codes: B=Bayesian Model, LB=limited Bayesian model, FF1..3=fast and frugal models 1..3. We assessed model fit by comparing the model’s prediction against the human observers’ predictions, again using a normalized city-block distance model error= T 1 ∑ ytM − ytO,1 + ytM − ytO,2 ,1 ,2 (T − 1) t=2 (5) where yM represents the model’s prediction. The model error for each individual observer is shown in Figure 3. It is important to note that because each model is associated with a set of free parameters, the parameters optimized for task error and model error are different. For Figure 3, the parameters were optimized to minimize Equation (5) for each individual observer, showing the extent to which these models can capture the performance of individual observers, not necessarily providing the best task performance. B LB FF1 FF2 MY MS MM EJ FF3 Model Error 2 1.5 1 0.5 0 PH NP DN SB ML 1 Figure 3. Model error for each individual observer. 5.1 B ay e s i a n p re d i ct i o n m o d e l s At each trial t, the model was provided with the sequence of all previous outcomes. The Gibbs sampling and inference procedures from Eq. (2) and (3) were applied with M=30 iterations and S=200 samples. The change probability α was a free parameter. In the full Bayesian model, the whole sequence of observations up to the current trial is available for prediction, leading to a memory requirement of up to T=1500 trials – a psychologically unreasonable assumption. We therefore also simulated a limited Bayesian model (LB) where the observed sequence was truncated to the last 10 outcomes. The LB model showed almost no decrement in task performance compared to the full Bayesian model. Figure 3 also shows that it fit human data quite well. 5.2 I n d i v i d u a l D i f f er e nc e s The right-hand panel of Figure 4 plots each observer’s task error as a function of the mean city-block distance between their subsequent button presses. This shows a clear U-shaped function. Observers with very variable predictions (e.g., ML and DN) had large average changes between successive button pushes, and also had large task error: These observers were too “twitchy”. Observers with very small average button changes (e.g., SB and NP) were not twitchy enough, and also had large task error. Observers in the middle had the lowest task error (e.g., MS and MY). The left-hand panel of Figure 4 shows the same data, but with the x-axis based on the Bayesian model fits. Instead of using mean button change distance to index twitchiness (as in 1 Error bars indicate bootstrapped 95% confidence intervals. the right-hand panel), the left-hand panel uses the estimated α parameters from the Bayesian model. A similar U-shaped pattern is observed: individuals with too large or too small α estimates have large task errors. 3.3 DN 3.2 Task Error ML SB 3.2 NP 3.1 Task Error 3.3 PH EJ 3 MM MS MY 2.9 2.8 10 -4 10 -3 10 -2 DN NP 3.1 3 PH EJ MM MS 2.9 B ML SB MY 2.8 10 -1 10 0 0.5 1 α 1.5 2 Mean Button Change 2.5 3 Figure 4. Task error vs. “twitchiness”. Left-hand panel indexes twitchiness using estimated α parameters from Bayesian model fits. Right-hand panel uses mean distance between successive predictions. 5.3 F a s t - a n d - F r u g a l ( F F ) p r e d ic t i o n m o d e l s These models perform the prediction task using simple heuristics that are cognitively plausible. The FF models keep a short memory of previous stimulus values and make predictions using the same two-step process as the Bayesian model. First, a decision is made as to whether the latent parameter θ has changed. Second, remembered stimulus values that occurred after the most recently detected changepoint are used to generate the next prediction. A simple heuristic is used to detect changepoints: If the distance between the most recent observation and prediction is greater than some threshold amount, a change is inferred. We defined the distance between a prediction (p) and an observation (y) as the difference between the log-likelihoods of y assuming θ=p and θ=y. Thus, if fB(.|θ, K) is the binomial density with parameters θ and K, the distance between observation y and prediction p is defined as d(y,p)=log(fB(y|y,K))-log(fB(y|p,K)). A changepoint on time step t+1 is inferred whenever d(yt,pt)>C. The parameter C governs the twitchiness of the model predictions. If C is large, only very dramatic changepoints will be detected, and the model will be too conservative. If C is small, the model will be too twitchy, and will detect changepoints on the basis of small random fluctuations. Predictions are based on the most recent M observations, which are kept in memory, unless a changepoint has been detected in which case only those observations occurring after the changepoint are used for prediction. The prediction for time step t+1 is simply the mean of these observations, say p. Human observers were reticent to make predictions very close to the boundaries. This was modeled by allowing the FF model to change its prediction for the next time step, yt+1, towards the mean prediction (0.5). This change reflects a two-way bet. If the probability of a change occurring is α, the best guess will be 0.5 if that change occurs, or the mean p if the change does not occur. Thus, the prediction made is actually yt+1=1/2 α+(1-α)p. Note that we do not allow perfect knowledge of the probability of a changepoint, α. Instead, an estimated value of α is used based on the number of changepoints detected in the data series up to time t. The FF model nests two simpler FF models that are psychologically interesting. If the twitchiness threshold parameter C becomes arbitrarily large, the model never detects a change and instead becomes a continuous running average model. Predictions from this model are simply a boxcar smooth of the data. Alternatively, if we assume no memory the model must based each prediction on only the previous stimulus (i.e., M=1). Above, in Figure 3, we labeled the complete FF model as FF1, the boxcar model as FF2 and the memoryless model FF3. Figure 3 showed that the complete FF model (FF1) fit the data from all observers significantly better than either the boxcar model (FF2) or the memoryless model (FF3). Exceptions were observers PH, DN and ML, for whom all three FF model fit equally well. This result suggests that our observers were (mostly) doing more than just keeping a running average of the data, or using only the most recent observation. The FF1 model fit the data about as well as the Bayesian models for all observers except MY and MS. Note that, in general, the FF1 and Bayesian model fits are very good: the average city block distance between the human data and the model prediction is around 0.75 (out of 10) buttons on both the x- and y-axes. 6 C onclusion We used an online prediction task to study changepoint detection. Human observers had to predict the next observation in stochastic sequences containing random changepoints. We showed that some observers are too “twitchy”: They perform poorly on the prediction task because they see changes where only random fluctuation exists. Other observers are not twitchy enough, and they perform poorly because they fail to see small changes. We developed a Bayesian changepoint detection model that performed the task optimally, and also provided a good fit to human data when sub-optimal parameter settings were used. Finally, we developed a fast-and-frugal model that showed how participants may be able to perform well at the task using minimal information and simple decision heuristics. Acknowledgments We thank Eric-Jan Wagenmakers and Mike Yi for useful discussions related to this work. This work was supported in part by a grant from the US Air Force Office of Scientific Research (AFOSR grant number FA9550-04-1-0317). R e f er e n ce s [1] Gilovich, T., Vallone, R. and Tversky, A. (1985). The hot hand in basketball: on the misperception of random sequences. Cognitive Psychology17, 295-314. [2] Albright, S.C. (1993a). A statistical analysis of hitting streaks in baseball. Journal of the American Statistical Association, 88, 1175-1183. [3] Stephens, D.A. (1994). Bayesian retrospective multiple changepoint identification. Applied Statistics 43(1), 159-178. [4] Carlin, B.P., Gelfand, A.E., & Smith, A.F.M. (1992). Hierarchical Bayesian analysis of changepoint problems. Applied Statistics 41(2), 389-405. [5] Green, P.J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711-732. [6] Gigerenzer, G., & Goldstein, D.G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650-669.

5 0.86144853 110 nips-2005-Learning Depth from Single Monocular Images

Author: Ashutosh Saxena, Sung H. Chung, Andrew Y. Ng

Abstract: We consider the task of depth estimation from a single monocular image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured outdoor environments which include forests, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a discriminatively-trained Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models both depths at individual points as well as the relation between depths at different points. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. 1

6 0.58754617 177 nips-2005-Size Regularized Cut for Data Clustering

7 0.5864377 46 nips-2005-Consensus Propagation

8 0.58315581 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

9 0.58072257 53 nips-2005-Cyclic Equilibria in Markov Games

10 0.56428516 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches

11 0.54524153 153 nips-2005-Policy-Gradient Methods for Planning

12 0.54218793 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

13 0.51427829 144 nips-2005-Off-policy Learning with Options and Recognizers

14 0.51070714 34 nips-2005-Bayesian Surprise Attracts Human Attention

15 0.51057208 173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

16 0.50695515 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

17 0.50465792 23 nips-2005-An Application of Markov Random Fields to Range Sensing

18 0.49741495 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

19 0.49321708 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

20 0.4929066 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine