nips nips2010 nips2010-37 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sridhar Mahadevan, Bo Liu
Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. [sent-5, score-0.267]
2 Krylov and Bellman error bases are based on the Neumann series expansion. [sent-7, score-0.496]
3 These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. [sent-8, score-0.436]
4 The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. [sent-9, score-0.268]
5 Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. [sent-11, score-0.417]
6 An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. [sent-12, score-0.798]
7 Recently, there has been growing interest in automatic basis construction methods for constructing a problem-specific low-dimensional representation of an MDP. [sent-14, score-0.177]
8 Functions on the original state space, such as the reward function or the value function, are “compressed” by projecting them onto a basis matrix Φ, whose column space spans a low-dimensional subspace of the function space on the states of the original MDP. [sent-15, score-0.383]
9 Among the various approaches proposed are reward-sensitive bases, such as Krylov bases [10] and an incremental variant called Bellman error basis functions (BEBFs) [9]. [sent-16, score-0.678]
10 These approaches construct bases by dilating the (sampled) reward function by geometric powers of the (sampled) transition matrix of a policy. [sent-17, score-0.653]
11 An alternative approach, called proto-value functions, constructs reward-invariant bases by finding the eigenvectors of the symmetric graph Laplacian matrix induced by the neighborhood topology of the state space under the given actions [7]. [sent-18, score-0.505]
12 A fundamental dilemma that is revealed by these prior studies is that neither reward-sensitive nor reward-invariant eigenvector bases by themselves appear to be fully satisfactory. [sent-19, score-0.372]
13 A Chebyshev polynomial bound for the error due to approximation using Krylov bases was derived in [10], extending a known similar result for general Krylov approximation [12]. [sent-20, score-0.483]
14 This bound shows that performance of Krylov bases (and BEBFs) tends to degrade as the discount factor γ → 1. [sent-21, score-0.403]
15 Intuitively, the initial basis vectors capture short-term transient behavior near rewarding regions, and tend to poorly ap1 proximate the value function over the entire state space until a sufficiently large time scale is reached. [sent-22, score-0.189]
16 A straightforward geometrical analysis of approximation errors using least-squares fixed point approximation onto a basis shows that the Bellman error decomposes into the sum of two terms: a reward error and a second term involving the feature prediction error [1, 8] (see Figure 1). [sent-23, score-0.638]
17 This analysis helps reveal sources of error: Krylov bases and BEBFs tend to have low reward error (or zero in the non-sampled case), and hence a large component of the error in using these bases tends to be due to the feature prediction error. [sent-24, score-1.11]
18 In contrast, PVFs tend to have large reward error since typical spiky goal reward functions are poorly approximated by smooth low-order eigenvectors; however, their feature prediction error can be quite low as the eigenvectors often capture invariant subspaces of the model transition matrix. [sent-25, score-0.682]
19 A hybrid approach that combined low-order eigenvectors of the transition matrix (or PVFs) with higher-order Krylov bases was proposed in [10], which empirically resulted in a better approach. [sent-26, score-0.486]
20 This paper demonstrates a more principled approach to address this problem, by constructing new bases that emerge from investigating the links between basis construction methods and different power series expansions of value functions. [sent-27, score-0.641]
21 In particular, instead of using the eigenvectors of the transition matrix, the proposed approach uses the average-reward or gain as the first basis vector, and dilates the reward function by powers of the average-adjusted transition matrix. [sent-28, score-0.512]
22 It turns out that the gain is an element of the space of eigenvectors associated with the eigenvalue λ = 1 of the transition matrix. [sent-29, score-0.112]
23 The relevance of power series expansion to approximations of value functions was hinted at in early work by Schwartz [13] on undiscounted optimization, although he did not discuss basis construction. [sent-30, score-0.32]
24 Krylov and Bellman error basis functions (BEBFs) [10, 9, 12], as well as proto-value functions [7], can be related to terms in the Neumann series expansion. [sent-31, score-0.332]
25 Ultimately, the performance of these bases is limited by the speed of convergence of the Neumann expansion, and of course, other errors arising due to reward and feature prediction error. [sent-32, score-0.591]
26 It is well-known that discounted value functions can be written in the form of a Laurent series expansion, where the first two terms 1 correspond to the average-reward term (scaled by 1−γ ), and the average-adjusted sum of rewards (or bias). [sent-34, score-0.176]
27 Higher order terms involve powers of the Drazin (or group) inverse of a singular matrix related to the transition matrix. [sent-35, score-0.172]
28 This expansion provides a mathematical framework for analyzing the properties of basis construction methods and developing newer representations. [sent-36, score-0.215]
29 In particular, Krylov bases converge slowly for high discount factors since the value function is dominated by the scaled average-reward term, which is poorly approximated by the initial BEBF or Krylov basis vectors as it involves the long-term limiting matrix P ∗ . [sent-37, score-0.666]
30 The Laurent series expansion leads to a new type of basis called a Drazin basis [6]. [sent-38, score-0.403]
31 An approximation of Drazin bases called Bellman average-reward bases (BARBs) is described and compared with BEBFs, Krylov bases, and PVFs. [sent-39, score-0.777]
32 The value function V associated with a deterministic policy π : S → A is defined as the long-term expected sum of rewards received starting from a state, and following the policy π indefinitely. [sent-41, score-0.133]
33 For a fixed policy π, the induced discounted Markov reward process is defined as (P, R, γ). [sent-44, score-0.264]
34 ˆ A popular approach to approximating V is to use a linear combination of basis functions V ≈ V = Φw, where the basis matrix Φ is of size |S| × k, and k |S|. [sent-45, score-0.353]
35 The Bellman error for a given basis Φ, denoted BE(Φ), is defined as the difference between the two sides of the Bellman equation, when 1 In what follows, we suppress the dependence of P , R, and V on the policy π to avoid clutter. [sent-46, score-0.27]
36 2 γP ΦwΦ R (I − ΠΦ )R ΠΦ R (I − ΠΦ )γP ΦwΦ ΠΦ γΦwΦ Φ Figure 1: The Bellman error due to a basis and its decomposition. [sent-47, score-0.226]
37 If the Markov chain defined by P is irreducible and aperiodic, then ΠΦ = Φ(ΦT D∗ Φ)−1 ΦT D∗ , where D∗ is a diagonal matrix whose entries contain the stationary distribution of the Markov chain. [sent-52, score-0.096]
38 Krylov bases correspond to successive terms in the Neumann series [10, 12]. [sent-58, score-0.446]
39 An incremental variant of the Krylov-based approach is called Bellman error basis functions (BEBFs) [9]. [sent-68, score-0.306]
40 In particular, given a set of basis functions Φk (where the first one is assumed to equal R), the next basis is defined to be φk+1 = T (Φk wΦk ) − Φk wΦk . [sent-69, score-0.316]
41 In the model-free reinforcement learning setting, φk+1 ˆ ˆ can be approximated by the temporal-difference (TD) error φk+1 = r + γ Qk (s , πk (s )) − Qk (s, a), ˆ given a set of stored transitions in the form (s, a, r, s ). [sent-70, score-0.124]
42 Here, Qk is the fixed-point least-squares approximation to the action-value function Q(s, a) on the basis Φk . [sent-71, score-0.16]
43 It can be easily shown that BEBFs and Krylov bases define the same space [8]. [sent-72, score-0.372]
44 1 Convergence Analysis A key issue in evaluating the effectiveness of Krylov bases (and BEBFs) is the speed of convergence of the Neumann series. [sent-74, score-0.387]
45 As γ → 1, Krylov bases and BEBFs converge rather slowly, owing to a large increase in the weighted feature error. [sent-75, score-0.472]
46 Petrik [10] derived a bound for the error due to Krylov approximation, which depends on the condition number of I − γP , and the ratio of two Chebyshev polynomials on the complex plane. [sent-79, score-0.097]
47 It has been shown that BEBFs reduce the Bellman error at a rate bounded by value iteration [9]. [sent-81, score-0.081]
48 For standard value iteration, A = I − γP , and consequently the natural decomposition is to set 2 The two components of the Bellman error may partially (or fully) cancel each other out: the Bellman error of V itself is 0, but it generates non-zero reward and feature prediction errors. [sent-84, score-0.366]
49 99 Figure 2: Condition number of I − γP as γ → 1, where P is the transition matrix of the optimal policy in a chain MDP of 50 states with rewards in states 10 and 41 [9]. [sent-87, score-0.213]
50 First, note that the overall Bellman error can be reduced to the weighted feature error, since the reward error is 0 as R is in the span of both BEBFs and Krylov bases: ˆ BE(Φ) 2 = T (ΦwΦ ) − ΦwΦ 2 = R − (I − γP )V 2 = (I − ΠΦ )γP ΦwΦ 2 . [sent-94, score-0.463]
51 Figure 2 shows empirically that one reason for the slow convergence of BEBFs and Krylov bases is that as γ → 1, the condition number of I − γP significantly increases. [sent-101, score-0.401]
52 Figure 3 compares the weighted feature error of BEBF bases (the performance of Krylov bases is identical and not shown) on a 50 state chain domain with a single goal reward of 1 in state 25. [sent-102, score-1.201]
53 Notice as γ increases, the feature error increases dramatically. [sent-104, score-0.116]
54 4 Laurent Series Expansion and Drazin Bases A potential solution to the slow convergence of BEBFs and Krylov bases is suggested by a different power series called the Laurent expansion. [sent-105, score-0.478]
55 This connection uses the following Laurent series expansion of V in terms of the average reward ρ of the policy π, the average-adjusted sum of rewards h, and higher order terms that involve the generalized spectral inverse (Drazin or group inverse) of I − P . [sent-107, score-0.422]
56 As γ increases, the weighted feature error grows much larger as the value function becomes progressively smoother and less like the reward function. [sent-146, score-0.361]
57 As γ → 1, the first term in the Laurent series expansion grows quite large causing the slow convergence of Krylov bases and BEBFs. [sent-148, score-0.507]
58 (I − P )D is a generalized inverse of the singular matrix I − P called the Drazin inverse [2, 11]. [sent-149, score-0.126]
59 The matrix I − P of a Markov chain has index k = 1. [sent-154, score-0.091]
60 For index 1 matrices, the Drazin inverse turns out to be the same as the group inverse, which is defined as (I − P )D = (I − P + P ∗ )−1 − P ∗ , 1 where the long-term limiting matrix P ∗ = limn→∞ n k=0 P k = I − (I − P )(I − P )D . [sent-155, score-0.102]
61 In terms of the expansion above, y−1 is the gain of the policy, y0 is its bias, and so on. [sent-169, score-0.087]
62 Analogous to the Krylov bases, the successive terms of the Laurent series expansion can be viewed as basis vectors. [sent-174, score-0.271]
63 More formally, the Drazin basis is defined as the space spanned by the vectors [6]: Dm = {P ∗ R, (I − P )D R, ((I − P )D )2 R, . [sent-175, score-0.166]
64 (2) The first basis vector is the average-reward or gain g = P ∗ R of policy π. [sent-179, score-0.213]
65 The second basis vector is the bias, or average-adjusted sum of rewards h. [sent-180, score-0.19]
66 Subsequent basis vectors correspond to higher-order terms in the Laurent series. [sent-181, score-0.156]
67 Building on this insight, an approximate Drazin basis called Bellman average-reward bases (BARBs) can be defined as follows. [sent-185, score-0.535]
68 First, the approximate Drazin basis is defined as the space spanned by the vectors Am = {P ∗ R, (P − P ∗ )R, (P − P ∗ )2 R, . [sent-186, score-0.166]
69 BARBs are similar to Krylov bases, except that the reward function is being dilated by the averageadjusted transition matrix P − P ∗ , and the first basis element is the gain. [sent-193, score-0.387]
70 Analogous to BEBFs, in the model-free reinforcement learning setting, BARBs can be computed using the average-adjusted TD error ˆ ˆ φk+1 (s) = r − ρk (s) + Qk (s , πk (s )) − Qk (s, a). [sent-197, score-0.106]
71 1 Expressivity Properties Some results concerning the expressivity of approximate Drazin bases and BARBs are now discussed. [sent-201, score-0.391]
72 Theorem 2 For any k > 1, the following hold: span {Ak (R)} ⊆ span {BARBk+1 (R)} . [sent-203, score-0.092]
73 For k = 1, both approximate Drazin bases and BARBs contain the gain ρ. [sent-208, score-0.396]
74 For general k > 2, the new basis vector in Ak is P k−1 R, which can be shown to be part of BARBk+1 (R). [sent-210, score-0.145]
75 There is a similar decomposition of the average-adjusted Bellman error into a component that depends on the average-adjusted reward error and an undiscounted weighted feature error. [sent-212, score-0.455]
76 Theorem 3 Given a basis Φ, for any average reward Markov reward process (P, R), the Bellman error can be decomposed as follows: ˆ ˆ T (V ) − V = R − ρ + P ΦwΦ − ΦwΦ = (I − ΠΦ )(R − ρ) + (I − ΠΦ )P ΦwΦ = (I − ΠΦ )R − (I − ΠΦ )ρ + (I − ΠΦ )P ΦwΦ . [sent-213, score-0.578]
77 Proof: The three terms represent the reward error, the average-reward error, and the undiscounted weighted feature error. [sent-214, score-0.304]
78 A more detailed convergence analysis of BARBs is given in [4], based on the relationship between the approximation error and the mixing rate of the Markov chain defined by P . [sent-216, score-0.162]
79 5 0 Bellman Error 6 BARBs BEBF PVF−MP 20 30 # Basis Functions 20 0 0 0 10 20 30 # Basis Functions 40 50 0 10 20 30 # Basis Functions 40 50 0 0 10 20 30 # Basis Functions 40 50 Figure 4: Experimental comparison on a 50 state chain MDP with rewards in state 10 and 41. [sent-227, score-0.154]
80 6 Experimental Comparisons Figure 4 compares the performance of Bellman average-reward basis functions (BARBs) vs. [sent-234, score-0.183]
81 Bellman-error basis functions (BEBFs), and a variant of proto-value functions (PVF-MP) on a 50 state chain MDP. [sent-235, score-0.289]
82 The PVF-MP algorithm selects basis functions incrementally based upon the Bellman error, where basis function k+1 is the PVF that has the largest inner product with the Bellman error resulting from the previous k basis functions. [sent-243, score-0.542]
83 PVFs have a high reward error, since the reward function is a set of two delta functions that is poorly approximated by the eigenvectors of the combinatorial Laplacian on the chain graph. [sent-244, score-0.489]
84 The overall Bellman error remains large due to the high reward error. [sent-246, score-0.25]
85 The reward error for BEBFs is by definition 0 as R is a basis vector itself. [sent-247, score-0.395]
86 However, the weighted feature error for BEBFs grows quite large as γ increases from 0. [sent-248, score-0.178]
87 Drazin and Krylov bases in the two-room gridworld MDP [7]. [sent-256, score-0.372]
88 Drazin bases perform the best, followed by BARBs, and then Krylov bases. [sent-257, score-0.372]
89 BEBFs on a 10 × 10 grid world MDP with a reward placed in the upper left corner state. [sent-260, score-0.18]
90 The Neumann series, which underlies Bellman error and Krylov bases, tends to converge slowly as γ → 1. [sent-265, score-0.114]
91 Drazin and Krylov bases in a 100 state two-room MDP [7]. [sent-267, score-0.401]
92 The reward was set at +100 for reaching a corner goal state in one of the rooms. [sent-269, score-0.209]
93 2 0 0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 0 5 10 15 20 25 30 35 Figure 6: Experimental comparison of BARBs and BEBFs on a 10 × 10 grid world MDP with a reward in the upper left corner. [sent-298, score-0.169]
94 An incremental version of Drazin bases called Bellman average-reward bases (BARBs) was investigated. [sent-306, score-0.786]
95 Numerical experiments on simple MDPs show superior performance of Drazin bases and BARBs to BEBFs, Krylov bases, and PVFs. [sent-307, score-0.372]
96 Scaling BARBs and Drazin bases to large MDPs requires addressing sampling issues, and exploiting structure in transition matrices, such as using factored representations. [sent-308, score-0.419]
97 Reinforcement learning methods to estimate the first few terms of the Laurent series were proposed in [5], and can be adapted for basis construction. [sent-310, score-0.199]
98 The Schultz expansion provides a way of rewriting the Neumann series using a multiplicative series of dyadic powers of the transition matrix, which is useful for multiscale bases [6]. [sent-311, score-0.596]
99 An investigation of basis construction from power series expansions of value functions. [sent-332, score-0.255]
100 Sensitive-discount optimality: Unifying discounted and average reward reinforcement learning. [sent-335, score-0.245]
wordName wordTfidf (topN-words)
[('drazin', 0.393), ('bases', 0.372), ('krylov', 0.346), ('bebf', 0.338), ('barbs', 0.273), ('bebfs', 0.262), ('pvf', 0.24), ('bellman', 0.24), ('barb', 0.229), ('reward', 0.169), ('basis', 0.145), ('laurent', 0.115), ('mp', 0.1), ('error', 0.081), ('neumann', 0.072), ('dbf', 0.066), ('barbk', 0.055), ('expansion', 0.052), ('chain', 0.051), ('discounted', 0.051), ('weighted', 0.051), ('transition', 0.047), ('span', 0.046), ('rewards', 0.045), ('pvfs', 0.044), ('policy', 0.044), ('series', 0.043), ('mdp', 0.043), ('eigenvectors', 0.041), ('powers', 0.039), ('undiscounted', 0.038), ('mdps', 0.036), ('qk', 0.036), ('ak', 0.035), ('feature', 0.035), ('expansions', 0.033), ('inverse', 0.033), ('vf', 0.033), ('discount', 0.031), ('markov', 0.03), ('cm', 0.029), ('state', 0.029), ('chebyshev', 0.029), ('functions', 0.026), ('matrix', 0.026), ('reinforcement', 0.025), ('incremental', 0.024), ('gain', 0.024), ('amherst', 0.022), ('bebfk', 0.022), ('spanned', 0.021), ('successive', 0.02), ('pm', 0.019), ('actions', 0.019), ('slowly', 0.019), ('aperiodic', 0.019), ('expressivity', 0.019), ('irreducible', 0.019), ('approximated', 0.018), ('called', 0.018), ('construction', 0.018), ('bias', 0.017), ('laplacian', 0.017), ('inverses', 0.016), ('geometrical', 0.016), ('mahadevan', 0.016), ('squaring', 0.016), ('power', 0.016), ('singular', 0.016), ('insight', 0.016), ('polynomials', 0.016), ('poorly', 0.015), ('convergence', 0.015), ('approximation', 0.015), ('limiting', 0.015), ('massachusetts', 0.015), ('parr', 0.014), ('progressively', 0.014), ('converge', 0.014), ('column', 0.014), ('group', 0.014), ('decomposed', 0.014), ('index', 0.014), ('constructing', 0.014), ('slow', 0.014), ('kj', 0.013), ('td', 0.013), ('km', 0.013), ('visits', 0.013), ('ax', 0.013), ('decision', 0.012), ('compares', 0.012), ('variant', 0.012), ('afosr', 0.012), ('scaled', 0.011), ('corner', 0.011), ('approximating', 0.011), ('ends', 0.011), ('terms', 0.011), ('grows', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
Author: Sridhar Mahadevan, Bo Liu
Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. 1
2 0.10649352 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
Author: Amir-massoud Farahmand, Csaba Szepesvári, Rémi Munos
Abstract: We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulted policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum – as opposed to what has been suggested by the previous results. Also our results indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and the effect of an error term in the earlier iterations decays exponentially fast. 1
3 0.099463947 229 nips-2010-Reward Design via Online Gradient Ascent
Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh
Abstract: Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent’s goals different from the designer’s. This gives rise to the optimization problem of designing the artificial agent’s goals—in the RL framework, designing the agent’s reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent’s lifetime nor do they take advantage of knowledge about the agent’s structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent’s lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations. 1 The Optimal Reward Problem In this work, we consider the scenario of an agent designer building an autonomous agent. The designer has his or her own goals which must be translated into goals for the autonomous agent. We represent goals using the Reinforcement Learning (RL) formalism of the reward function. This leads to the optimal reward problem of designing the agent’s reward function so as to maximize the objective reward received by the agent designer. Typically, the designer assigns his or her own reward to the agent. However, there is ample work which demonstrates the benefit of assigning reward which does not match the designer’s. For example, work on reward shaping [11] has shown how to modify rewards to accelerate learning without altering the optimal policy, and PAC-MDP methods [5, 20] including approximate Bayesian methods [7, 19] add bonuses to the objective reward to achieve optimism under uncertainty. These approaches explicitly or implicitly assume that the asymptotic behavior of the agent should be the same as that which would occur using the objective reward function. These methods do not explicitly consider the optimal reward problem; however, they do show improved performance through reward modification. In our recent work that does explicitly consider the optimal reward problem [18], we analyzed an explicit hypothesis about the benefit of reward design—that it helps mitigate the performance loss caused by computational constraints (bounds) on agent architectures. We considered various types of agent limitations—limits on planning depth, failure to account for partial observability, and other erroneous modeling assumptions—and demonstrated the benefits of good reward functions in each case empirically. Crucially, in bounded agents, the optimal reward function often leads to behavior that is different from the asymptotic behavior achieved with the objective reward function. In this work, we develop an algorithm, Policy Gradient for Reward Design (PGRD), for improving reward functions for a family of bounded agents that behave according to repeated local (from the current state) model-based planning. We show that this algorithm is capable of improving the reward functions in agents with computational limitations necessitating small bounds on the depth of planning, and also from the use of an inaccurate model (which may be inaccurate due to computationally-motivated approximations). PGRD has few parameters, improves the reward 1 function online during an agent’s lifetime, takes advantage of knowledge about the agent’s structure (through the gradient computation), and is linear in the number of reward function parameters. Notation. Formally, we consider discrete-time partially-observable environments with a finite number of hidden states s ∈ S, actions a ∈ A, and observations o ∈ O; these finite set assumptions are useful for our theorems, but our algorithm can handle infinite sets in practice. Its dynamics are governed by a state-transition function P (s |s, a) that defines a distribution over next-states s conditioned on current state s and action a, and an observation function Ω(o|s) that defines a distribution over observations o conditioned on current state s. The agent designer’s goals are specified via the objective reward function RO . At each time step, the designer receives reward RO (st ) ∈ [0, 1] based on the current state st of the environment, where the subscript denotes time. The designer’s objective return is the expected mean objective reward N 1 obtained over an infinite horizon, i.e., limN →∞ E N t=0 RO (st ) . In the standard view of RL, the agent uses the same reward function as the designer to align the interests of the agent and the designer. Here we allow for a separate agent reward function R(· ). An agent’s reward function can in general be defined in terms of the history of actions and observations, but is often more pragmatically defined in terms of some abstraction of history. We define the agent’s reward function precisely in Section 2. Optimal Reward Problem. An RL agent attempts to act so as to maximize its own cumulative reward, or return. Crucially, as a result, the sequence of environment-states {st }∞ is affected by t=0 the choice of reward function; therefore, the agent designer’s return is affected as well. The optimal reward problem arises from the fact that while the objective reward function is fixed as part of the problem description, the reward function is a choice to be made by the designer. We capture this choice abstractly by letting the reward be parameterized by some vector of parameters θ chosen from space of parameters Θ. Each θ ∈ Θ specifies a reward function R(· ; θ) which in turn produces a distribution over environment state sequences via whatever RL method the agent uses. The expected N 1 return obtained by the designer for choice θ is U(θ) = limN →∞ E N t=0 RO (st ) R(·; θ) . The optimal reward parameters are given by the solution to the optimal reward problem [16, 17, 18]: θ∗ = arg max U(θ) = arg max lim E θ∈Θ θ∈Θ N →∞ 1 N N RO (st ) R(·; θ) . (1) t=0 Our previous research on solving the optimal reward problem has focused primarily on the properties of the optimal reward function and its correspondence to the agent architecture and the environment [16, 17, 18]. This work has used inefficient exhaustive search methods for finding good approximations to θ∗ (though there is recent work on using genetic algorithms to do this [6, 9, 12]). Our primary contribution in this paper is a new convergent online stochastic gradient method for finding approximately optimal reward functions. To our knowledge, this is the first algorithm that improves reward functions in an online setting—during a single agent’s lifetime. In Section 2, we present the PGRD algorithm, prove its convergence, and relate it to OLPOMDP [2], a policy gradient algorithm. In Section 3, we present experiments demonstrating PGRD’s ability to approximately solve the optimal reward problem online. 2 PGRD: Policy Gradient for Reward Design PGRD builds on the following insight: the agent’s planning algorithm procedurally converts the reward function into behavior; thus, the reward function can be viewed as a specific parameterization of the agent’s policy. Using this insight, PGRD updates the reward parameters by estimating the gradient of the objective return with respect to the reward parameters, θ U(θ), from experience, using standard policy gradient techniques. In fact, we show that PGRD can be viewed as an (independently interesting) generalization of the policy gradient method OLPOMDP [2]. Specifically, we show that OLPOMDP is special case of PGRD when the planning depth d is zero. In this section, we first present the family of local planning agents for which PGRD improves the reward function. Next, we develop PGRD and prove its convergence. Finally, we show that PGRD generalizes OLPOMDP and discuss how adding planning to OLPOMDP affects the space of policies available to the optimization method. 2 1 2 3 4 5 Input: T , θ0 , {αt }∞ , β, γ t=0 o0 , i0 = initializeStart(); for t = 0, 1, 2, 3, . . . do ∀a Qt (a; θt ) = plan(it , ot , T, R(it , ·, ·; θt ), d,γ); at ∼ µ(a|it ; Qt ); rt+1 , ot+1 = takeAction(at ); µ(a |i ;Q ) 6 7 8 9 t zt+1 = βzt + θt t |itt ;Qt ) t ; µ(a θt+1 = θt + αt (rt+1 zt+1 − λθt ) ; it+1 = updateInternalState(it , at , ot+1 ); end Figure 1: PGRD (Policy Gradient for Reward Design) Algorithm A Family of Limited Agents with Internal State. Given a Markov model T defined over the observation space O and action space A, denote T (o |o, a) the probability of next observation o given that the agent takes action a after observing o. Our agents use the model T to plan. We do not assume that the model T is an accurate model of the environment. The use of an incorrect model is one type of agent limitation we examine in our experiments. In general, agents can use non-Markov models defined in terms of the history of observations and actions; we leave this for future work. The agent maintains an internal state feature vector it that is updated at each time step using it+1 = updateInternalState(it , at , ot+1 ). The internal state allows the agent to use reward functions T that depend on the agent’s history. We consider rewards of the form R(it , o, a; θt ) = θt φ(it , o, a), where θt is the reward parameter vector at time t, and φ(it , o, a) is a vector of features based on internal state it , planning state o, and action a. Note that if φ is a vector of binary indicator features, this representation allows for arbitrary reward functions and thus the representation is completely general. Many existing methods use reward functions that depend on history. Reward functions based on empirical counts of observations, as in PAC-MDP approaches [5, 20], provide some examples; see [14, 15, 13] for others. We present a concrete example in our empirical section. At each time step t, the agent’s planning algorithm, plan, performs depth-d planning using the model T and reward function R(it , o, a; θt ) with current internal state it and reward parameters θt . Specifically, the agent computes a d-step Q-value function Qd (it , ot , a; θt ) ∀a ∈ A, where Qd (it , o, a; θt ) = R(it , o, a; θt ) + γ o ∈O T (o |o, a) maxb∈A Qd−1 (it , o , b; θt ) and Q0 (it , o, a; θt ) = R(it , o, a; θt ). We emphasize that the internal state it and reward parameters θt are held invariant while planning. Note that the d-step Q-values are only computed for the current observation ot , in effect by building a depth-d tree rooted at ot . In the d = 0 special case, the planning procedure completely ignores the model T and returns Q0 (it , ot , a; θt ) = R(it , ot , a; θt ). Regardless of the value of d, we treat the end result of planning as providing a scoring function Qt (a; θt ) where the dependence on d, it and ot is dropped from the notation. To allow for gradient calculations, our agents act according to the τ Qt (a;θt ) def Boltzmann (soft-max) stochastic policy parameterized by Q: µ(a|it ; Qt ) = e eτ Qt (b;θt ) , where τ b is a temperature parameter that determines how stochastically the agent selects the action with the highest score. When the planning depth d is small due to computational limitations, the agent cannot account for events beyond the planning depth. We examine this limitation in our experiments. Gradient Ascent. To develop a gradient algorithm for improving the reward function, we need to compute the gradient of the objective return with respect to θ: θ U(θ). The main insight is to break the gradient calculation into the calculation of two gradients. The first is the gradient of the objective return with respect to the policy µ, and the second is the gradient of the policy with respect to the reward function parameters θ. The first gradient is exactly what is computed in standard policy gradient approaches [2]. The second gradient is challenging because the transformation from reward parameters to policy involves a model-based planning procedure. We draw from the work of Neu and Szepesv´ ri [10] which shows that this gradient computation resembles planning itself. We a develop PGRD, presented in Figure 1, explicitly as a generalization of OLPOMDP, a policy gradient algorithm developed by Bartlett and Baxter [2], because of its foundational simplicity relative to other policy-gradient algorithms such as those based on actor-critic methods (e.g., [4]). Notably, the reward parameters are the only parameters being learned in PGRD. 3 PGRD follows the form of OLPOMDP (Algorithm 1 in Bartlett and Baxter [2]) but generalizes it in three places. In Figure 1 line 3, the agent plans to compute the policy, rather than storing the policy directly. In line 6, the gradient of the policy with respect to the parameters accounts for the planning procedure. In line 8, the agent maintains a general notion of internal state that allows for richer parameterization of policies than typically considered (similar to Aberdeen and Baxter [1]). The algorithm takes as parameters a sequence of learning rates {αk }, a decaying-average parameter β, and regularization parameter λ > 0 which keeps the the reward parameters θ bounded throughout learning. Given a sequence of calculations of the gradient of the policy with respect to the parameters, θt µ(at |it ; Qt ), the remainder of the algorithm climbs the gradient of objective return θ U(θ) using OLPOMDP machinery. In the next subsection, we discuss how to compute θt µ(at |it ; Qt ). Computing the Gradient of the Policy with respect to Reward. For the Boltzmann distribution, the gradient of the policy with respect to the reward parameters is given by the equation θt µ(a|it ; Qt ) = τ · µ(a|Qt )[ θt Qt (a|it ; θt ) − θt Qt (b; θt )], where τ is the Boltzmann b∈A temperature (see [10]). Thus, computing θt µ(a|it ; Qt ) reduces to computing θt Qt (a; θt ). The value of Qt depends on the reward parameters θt , the model, and the planning depth. However, as we present below, the process of computing the gradient closely resembles the process of planning itself, and the two computations can be interleaved. Theorem 1 presented below is an adaptation of Proposition 4 from Neu and Szepesv´ ri [10]. It presents the gradient computation for depth-d a planning as well as for infinite-depth discounted planning. We assume that the gradient of the reward function with respect to the parameters is bounded: supθ,o,i,a θ R(i, o, a, θ) < ∞. The proof of the theorem follows directly from Proposition 4 of Neu and Szepesv´ ri [10]. a Theorem 1. Except on a set of measure zero, for any depth d, the gradient θ Qd (o, a; θ) exists and is given by the recursion (where we have dropped the dependence on i for simplicity) d θ Q (o, a; θ) = θ R(o, a; θ) π d−1 (b|o ) T (o |o, a) +γ o ∈O d−1 (o θQ , b; θ), (2) b∈A where θ Q0 (o, a; θ) = θ R(o, a; θ) and π d (a|o) ∈ arg maxa Qd (o, a; θ) is any policy that is greedy with respect to Qd . The result also holds for θ Q∗ (o, a; θ) = θ limd→∞ Qd (o, a; θ). The Q-function will not be differentiable when there are multiple optimal policies. This is reflected in the arbitrary choice of π in the gradient calculation. However, it was shown by Neu and Szepesv´ ri [10] that even for values of θ which are not differentiable, the above computation produces a a valid calculation of a subgradient; we discuss this below in our proof of convergence of PGRD. Convergence of PGRD (Figure 1). Given a particular fixed reward function R(·; θ), transition model T , and planning depth, there is a corresponding fixed randomized policy µ(a|i; θ)—where we have explicitly represented the reward’s dependence on the internal state vector i in the policy parameterization and dropped Q from the notation as it is redundant given that everything else is fixed. Denote the agent’s internal-state update as a (usually deterministic) distribution ψ(i |i, a, o). Given a fixed reward parameter vector θ, the joint environment-state–internal-state transitions can be modeled as a Markov chain with a |S||I| × |S||I| transition matrix M (θ) whose entries are given by M s,i , s ,i (θ) = p( s , i | s, i ; θ) = o,a ψ(i |i, a, o)Ω(o|s )P (s |s, a)µ(a|i; θ). We make the following assumptions about the agent and the environment: Assumption 1. The transition matrix M (θ) of the joint environment-state–internal-state Markov chain has a unique stationary distribution π(θ) = [πs1 ,i1 (θ), πs2 ,i2 (θ), . . . , πs|S| ,i|I| (θ)] satisfying the balance equations π(θ)M (θ) = π(θ), for all θ ∈ Θ. Assumption 2. During its execution, PGRD (Figure 1) does not reach a value of it , and θt at which µ(at |it , Qt ) is not differentiable with respect to θt . It follows from Assumption 1 that the objective return, U(θ), is independent of the start state. The original OLPOMDP convergence proof [2] has a similar condition that only considers environment states. Intuitively, this condition allows PGRD to handle history-dependence of a reward function in the same manner that it handles partial observability in an environment. Assumption 2 accounts for the fact that a planning algorithm may not be fully differentiable everywhere. However, Theorem 1 showed that infinite and bounded-depth planning is differentiable almost everywhere (in a measure theoretic sense). Furthermore, this assumption is perhaps stronger than necessary, as stochastic approximation algorithms, which provide the theory upon which OLPOMDP is based, have been shown to converge using subgradients [8]. 4 In order to state the convergence theorem, we must define the approximate gradient which OLPOMDP def T calculates. Let the approximate gradient estimate be β U(θ) = limT →∞ t=1 rt zt for a fixed θ and θ PGRD parameter β, where zt (in Figure 1) represents a time-decaying average of the θt µ(at |it , Qt ) calculations. It was shown by Bartlett and Baxter [2] that β U(θ) is close to the true value θ U(θ) θ for large values of β. Theorem 2 proves that PGRD converges to a stable equilibrium point based on this approximate gradient measure. This equilibrium point will typically correspond to some local optimum in the return function U(θ). Given our development and assumptions, the theorem is a straightforward extension of Theorem 6 from Bartlett and Baxter [2] (proof omitted). ∞ Theorem 2. Given β ∈ [0, 1), λ > 0, and a sequence of step sizes αt satisfying t=0 αt = ∞ and ∞ 2 t=0 (αt ) < ∞, PGRD produces a sequence of reward parameters θt such that θt → L as t → ∞ a.s., where L is the set of stable equilibrium points of the differential equation ∂θ = β U(θ) − λθ. θ ∂t PGRD generalizes OLPOMDP. As stated above, OLPOMDP, when it uses a Boltzmann distribution in its policy representation (a common case), is a special case of PGRD when the planning depth is zero. First, notice that in the case of depth-0 planning, Q0 (i, o, a; θ) = R(i, o, a, θ), regardless of the transition model and reward parameterization. We can also see from Theorem 1 that 0 θ Q (i, o, a; θ) = θ R(i, o, a; θ). Because R(i, o, a; θ) can be parameterized arbitrarily, PGRD can be configured to match standard OLPOMDP with any policy parameterization that also computes a score function for the Boltzmann distribution. In our experiments, we demonstrate that choosing a planning depth d > 0 can be beneficial over using OLPOMDP (d = 0). In the remainder of this section, we show theoretically that choosing d > 0 does not hurt in the sense that it does not reduce the space of policies available to the policy gradient method. Specifically, we show that when using an expressive enough reward parameterization, PGRD’s space of policies is not restricted relative to OLPOMDP’s space of policies. We prove the result for infinite planning, but the extension to depth-limited planning is straightforward. Theorem 3. There exists a reward parameterization such that, for an arbitrary transition model T , the space of policies representable by PGRD with infinite planning is identical to the space of policies representable by PGRD with depth 0 planning. Proof. Ignoring internal state for now (holding it constant), let C(o, a) be an arbitrary reward function used by PGRD with depth 0 planning. Let R(o, a; θ) be a reward function for PGRD with infinite (d = ∞) planning. The depth-∞ agent uses the planning result Q∗ (o, a; θ) to act, while the depth-0 agent uses the function C(o, a) to act. Therefore, it suffices to show that one can always choose θ such that the planning solution Q∗ (o, a; θ) equals C(o, a). For all o ∈ O, a ∈ A, set R(o, a; θ) = C(o, a) − γ o T (o |o, a) maxa C(o , a ). Substituting Q∗ for C, this is the Bellman optimality equation [22] for infinite-horizon planning. Setting R(o, a; θ) as above is possible if it is parameterized by a table with an entry for each observation–action pair. Theorem 3 also shows that the effect of an arbitrarily poor model can be overcome with a good choice of reward function. This is because a Boltzmann distribution can, allowing for an arbitrary scoring function C, represent any policy. We demonstrate this ability of PGRD in our experiments. 3 Experiments The primary objective of our experiments is to demonstrate that PGRD is able to use experience online to improve the reward function parameters, thereby improving the agent’s obtained objective return. Specifically, we compare the objective return achieved by PGRD to the objective return achieved by PGRD with the reward adaptation turned off. In both cases, the reward function is initialized to the objective reward function. A secondary objective is to demonstrate that when a good model is available, adding the ability to plan—even for small depths—improves performance relative to the baseline algorithm of OLPOMDP (or equivalently PGRD with depth d = 0). Foraging Domain for Experiments 1 to 3: The foraging environment illustrated in Figure 2(a) is a 3 × 3 grid world with 3 dead-end corridors (rows) separated by impassable walls. The agent (bird) has four available actions corresponding to each cardinal direction. Movement in the intended direction fails with probability 0.1, resulting in movement in a random direction. If the resulting direction is 5 Objective Return 0.15 D=6, α=0 & D=6, α=5×10 −5 D=4, α=2×10 −4 D=0, α=5×10 −4 0.1 0.05 0 D=4, α=0 D=0, α=0 1000 2000 3000 4000 5000 Time Steps C) Objective Return B) A) 0.15 D=6, α=0 & D=6, α=5×10 −5 D=3, α=3×10 −3 D=1, α=3×10 −4 0.1 D=3, α=0 0.05 D=0, α=0.01 & D=1, α=0 0 1000 2000 3000 4000 5000 D=0, α=0 Time Steps Figure 2: A) Foraging Domain, B) Performance of PGRD with observation-action reward features, C) Performance of PGRD with recency reward features blocked by a wall or the boundary, the action results in no movement. There is a food source (worm) located in one of the three right-most locations at the end of each corridor. The agent has an eat action, which consumes the worm when the agent is at the worm’s location. After the agent consumes the worm, a new worm appears randomly in one of the other two potential worm locations. Objective Reward for the Foraging Domain: The designer’s goal is to maximize the average number of worms eaten per time step. Thus, the objective reward function RO provides a reward of 1.0 when the agent eats a worm, and a reward of 0 otherwise. The objective return is defined as in Equation (1). Experimental Methodology: We tested PGRD for depth-limited planning agents of depths 0–6. Recall that PGRD for the agent with planning depth 0 is the OLPOMDP algorithm. For each depth, we jointly optimized over the PGRD algorithm parameters, α and β (we use a fixed α throughout learning). We tested values for α on an approximate logarithmic scale in the range (10−6 , 10−2 ) as well as the special value of α = 0, which corresponds to an agent that does not adapt its reward function. We tested β values in the set 0, 0.4, 0.7, 0.9, 0.95, 0.99. Following common practice [3], we set the λ parameter to 0. We explicitly bound the reward parameters and capped the reward function output both to the range [−1, 1]. We used a Boltzmann temperature parameter of τ = 100 and planning discount factor γ = 0.95. Because we initialized θ so that the initial reward function was the objective reward function, PGRD with α = 0 was equivalent to a standard depth-limited planning agent. Experiment 1: A fully observable environment with a correct model learned online. In this experiment, we improve the reward function in an agent whose only limitation is planning depth, using (1) a general reward parameterization based on the current observation and (2) a more compact reward parameterization which also depends on the history of observations. Observation: The agent observes the full state, which is given by the pair o = (l, w), where l is the agent’s location and w is the worm’s location. Learning a Correct Model: Although the theorem of convergence of PGRD relies on the agent having a fixed model, the algorithm itself is readily applied to the case of learning a model online. In this experiment, the agent’s model T is learned online based on empirical transition probabilities between observations (recall this is a fully observable environment). Let no,a,o be the number of times that o was reached after taking action a after observing o. The agent models the probability of seeing o as no,a,o T (o |o, a) = . n o o,a,o Reward Parameterizations: Recall that R(i, o, a; θ) = θT φ(i, o, a), for some φ(i, o, a). (1) In the observation-action parameterization, φ(i, o, a) is a binary feature vector with one binary feature for each observation-action pair—internal state is ignored. This is effectively a table representation over all reward functions indexed by (o, a). As shown in Theorem 3, the observation-action feature representation is capable of producing arbitrary policies over the observations. In large problems, such a parameterization would not be feasible. (2) The recency parameterization is a more compact representation which uses features that rely on the history of observations. The feature vector is φ(i, o, a) = [RO (o, a), 1, φcl (l, i), φcl,a (l, a, i)], where RO (o, a) is the objective reward function defined as above. The feature φcl (l) = 1 − 1/c(l, i), where c(l, i) is the number of time steps since the agent has visited location l, as represented in the agent’s internal state i. Its value is normalized to the range [0, 1) and is high when the agent has not been to location l recently. The feature φcl,a (l, a, i) = 1 − 1/c(l, a, i) is similarly defined with respect to the time since the agent has taken action a in location l. Features based on recency counts encourage persistent exploration [21, 18]. 6 Results & Discussion: Figure 2(b) and Figure 2(c) present results for agents that use the observationaction parameterization and the recency parameterization of the reward function respectively. The horizontal axis is the number of time steps of experience. The vertical axis is the objective return, i.e., the average objective reward per time step. Each curve is an average over 130 trials. The values of d and the associated optimal algorithm parameters for each curve are noted in the figures. First, note that with d = 6, the agent is unbounded, because food is never more than 6 steps away. Therefore, the agent does not benefit from adapting the reward function parameters (given that we initialize to the objective reward function). Indeed, the d = 6, α = 0 agent performs as well as the best reward-optimizing agent. The performance for d = 6 improves with experience because the model improves with experience (and thus from the curves it is seen that the model gets quite accurate in about 1500 time steps). The largest objective return obtained for d = 6 is also the best objective return that can be obtained for any value of d. Several results can be observed in both Figures 2(b) and (c). 1) Each curve that uses α > 0 (solid lines) improves with experience. This is a demonstration of our primary contribution, that PGRD is able to effectively improve the reward function with experience. That the improvement over time is not just due to model learning is seen in the fact that for each value of d < 6 the curve for α > 0 (solid-line) which adapts the reward parameters does significantly better than the corresponding curve for α = 0 (dashed-line); the α = 0 agents still learn the model. 2) For both α = 0 and α > 0 agents, the objective return obtained by agents with equivalent amounts of experience increases monotonically as d is increased (though to maintain readability we only show selected values of d in each figure). This demonstrates our secondary contribution, that the ability to plan in PGRD significantly improves performance over standard OLPOMDP (PGRD with d = 0). There are also some interesting differences between the results for the two different reward function parameterizations. With the observation-action parameterization, we noted that there always exists a setting of θ for all d that will yield optimal objective return. This is seen in Figure 2(b) in that all solid-line curves approach optimal objective return. In contrast, the more compact recency reward parameterization does not afford this guarantee and indeed for small values of d (< 3), the solid-line curves in Figure 2(c) converge to less than optimal objective return. Notably, OLPOMDP (d = 0) does not perform well with this feature set. On the other hand, for planning depths 3 ≤ d < 6, the PGRD agents with the recency parameterization achieve optimal objective return faster than the corresponding PGRD agent with the observation-action parameterization. Finally, we note that this experiment validates our claim that PGRD can improve reward functions that depend on history. Experiment 2: A fully observable environment and poor given model. Our theoretical analysis showed that PGRD with an incorrect model and the observation–action reward parameterization should (modulo local maxima issues) do just as well asymptotically as it would with a correct model. Here we illustrate this theoretical result empirically on the same foraging domain and objective reward function used in Experiment 1. We also test our hypothesis that a poor model should slow down the rate of learning relative to a correct model. Poor Model: We gave the agents a fixed incorrect model of the foraging environment that assumes there are no internal walls separating the 3 corridors. Reward Parameterization: We used the observation–action reward parameterization. With a poor model it is no longer interesting to initialize θ so that the initial reward function is the objective reward function because even for d = 6 such an agent would do poorly. Furthermore, we found that this initialization leads to excessively bad exploration and therefore poor learning of how to modify the reward. Thus, we initialize θ to uniform random values near 0, in the range (−10−3 , 10−3 ). Results: Figure 3(a) plots the objective return as a function of number of steps of experience. Each curve is an average over 36 trials. As hypothesized, the bad model slows learning by a factor of more than 10 (notice the difference in the x-axis scales from those in Figure 2). Here, deeper planning results in slower learning and indeed the d = 0 agent that does not use the model at all learns the fastest. However, also as hypothesized, because they used the expressive observation–action parameterization, agents of all planning depths mitigate the damage caused by the poor model and eventually converge to the optimal objective return. Experiment 3: Partially observable foraging world. Here we evaluate PGRD’s ability to learn in a partially observable version of the foraging domain. In addition, the agents learn a model under the erroneous (and computationally convenient) assumption that the domain is fully observable. 7 0.1 −4 D = 0, α = 2 ×10 D = 2, α = 3 ×10 −5 −5 D = 6, α = 2 ×10 0.05 D = 0&2&6, α = 0 0 1 2 3 Time Steps 4 5 x 10 4 0.06 D = 6, α = 7 ×10 D = 2, α = 7 ×10 −4 0.04 D = 1, α = 7 ×10 −4 D = 0, α = 5 ×10 −4 D = 0, α = 0 D = 1&2&6, α = 0 0.02 0 C) −4 1000 2000 3000 4000 5000 Time Steps Objective Return B) 0.08 0.15 Objective Return Objective Return A) 2.5 2 x 10 −3 D=6, α=3×10 −6 D=0, α=1×10 −5 1.5 D=0&6, α=0 1 0.5 1 2 3 Time Steps 4 5 x 10 4 Figure 3: A) Performance of PGRD with a poor model, B) Performance of PGRD in a partially observable world with recency reward features, C) Performance of PGRD in Acrobot Partial Observation: Instead of viewing the location of the worm at all times, the agent can now only see the worm when it is colocated with it: its observation is o = (l, f ), where f indicates whether the agent is colocated with the food. Learning an Incorrect Model: The model is learned just as in Experiment 1. Because of the erroneous full observability assumption, the model will hallucinate about worms at all the corridor ends based on the empirical frequency of having encountered them there. Reward Parameterization: We used the recency parameterization; due to the partial observability, agents with the observation–action feature set perform poorly in this environment. The parameters θ are initialized such that the initial reward function equals the objective reward function. Results & Discussion: Figure 3(b) plots the mean of 260 trials. As seen in the solid-line curves, PGRD improves the objective return at all depths (only a small amount for d = 0 and significantly more for d > 0). In fact, agents which don’t adapt the reward are hurt by planning (relative to d = 0). This experiment demonstrates that the combination of planning and reward improvement can be beneficial even when the model is erroneous. Because of the partial observability, optimal behavior in this environment achieves less objective return than in Experiment 1. Experiment 4: Acrobot. In this experiment we test PGRD in the Acrobot environment [22], a common benchmark task in the RL literature and one that has previously been used in the testing of policy gradient approaches [23]. This experiment demonstrates PGRD in an environment in which an agent must be limited due to the size of the state space and further demonstrates that adding model-based planning to policy gradient approaches can improve performance. Domain: The version of Acrobot we use is as specified by Sutton and Barto [22]. It is a two-link robot arm in which the position of one shoulder-joint is fixed and the agent’s control is limited to 3 actions which apply torque to the elbow-joint. Observation: The fully-observable state space is 4 dimensional, with two joint angles ψ1 and ψ2 , and ˙ ˙ two joint velocities ψ1 and ψ2 . Objective Reward: The designer receives an objective reward of 1.0 when the tip is one arm’s length above the fixed shoulder-joint, after which the bot is reset to its initial resting position. Model: We provide the agent with a perfect model of the environment. Because the environment is continuous, value iteration is intractable, and computational limitations prevent planning deep enough to compute the optimal action in any state. The feature vector contains 13 entries. One feature corresponds to the objective reward signal. For each action, there are 5 features corresponding to each of the state features plus an additional feature representing the height of the tip: φ(i, o, a) = ˙ ˙ [RO (o), {ψ1 (o), ψ2 (o), ψ1 (o), ψ2 (o), h(o)}a ]. The height feature has been used in previous work as an alternative definition of objective reward [23]. Results & Discussion: We plot the mean of 80 trials in Figure 3(c). Agents that use the fixed (α = 0) objective reward function with bounded-depth planning perform according to the bottom two curves. Allowing PGRD and OLPOMDP to adapt the parameters θ leads to improved objective return, as seen in the top two curves in Figure 3(c). Finally, the PGRD d = 6 agent outperforms the standard OLPOMDP agent (PGRD with d = 0), further demonstrating that PGRD outperforms OLPOMDP. Overall Conclusion: We developed PGRD, a new method for approximately solving the optimal reward problem in bounded planning agents that can be applied in an online setting. We showed that PGRD is a generalization of OLPOMDP and demonstrated that it both improves reward functions in limited agents and outperforms the model-free OLPOMDP approach. 8 References [1] Douglas Aberdeen and Jonathan Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. Proceedings of the Nineteenth International Conference on Machine Learning, 2002. [2] Peter L. Bartlett and Jonathan Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000. [3] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001. [4] Shalabh Bhatnagar, Richard S. Sutton, M Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 2009. [5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A General Polynomial Time Algorithm for NearOptimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2001. [6] S. Elfwing, Eiji Uchibe, K. Doya, and H. I. Christensen. Co-evolution of Shaping Rewards and MetaParameters in Reinforcement Learning. Adaptive Behavior, 16(6):400–412, 2008. [7] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, 2009. [8] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2010. [9] Cetin Mericli, Tekin Mericli, and H. Levent Akin. A Reward Function Generation Method Using Genetic ¸ ¸ ¸ Algorithms : A Robot Soccer Case Study (Extended Abstract). In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), number 2, pages 1513–1514, 2010. [10] Gergely Neu and Csaba Szepesv´ ri. Apprenticeship learning using inverse reinforcement learning and a gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007. [11] Andrew Y. Ng, Stuart J. Russell, and D. Harada. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999. [12] Scott Niekum, Andrew G. Barto, and Lee Spector. Genetic Programming for Reward Function Search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010. [13] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, April 2007. [14] J¨ rgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on u Neural Networks, pages 1458–1463, 1991. [15] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS), pages 1281–1288, 2005. [16] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where Do Rewards Come From? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009. [17] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transations on Autonomous Mental Development, 2(2):70–82, 2010. [18] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Internal Rewards Mitigate Agent Boundedness. In Proceedings of the 27th International Conference on Machine Learning, 2010. [19] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010. [20] Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. [21] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In The Seventh International Conference on Machine Learning, pages 216–224. 1990. [22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. [23] Lex Weaver and Nigel Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 538–545. 2001. 9
4 0.096632861 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
5 0.088485666 59 nips-2010-Deep Coding Network
Author: Yuanqing Lin, Zhang Tong, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
6 0.084250793 134 nips-2010-LSTD with Random Projections
7 0.084229976 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
8 0.083829708 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
9 0.081665576 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
10 0.074145809 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
11 0.071279675 43 nips-2010-Bootstrapping Apprenticeship Learning
12 0.07102447 208 nips-2010-Policy gradients in linearly-solvable MDPs
13 0.063866504 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
14 0.062638327 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
15 0.061362743 212 nips-2010-Predictive State Temporal Difference Learning
16 0.059642691 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient
17 0.057432227 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
18 0.054626614 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
19 0.051182713 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
20 0.049042489 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
topicId topicWeight
[(0, 0.105), (1, -0.128), (2, -0.017), (3, 0.024), (4, 0.018), (5, -0.029), (6, 0.026), (7, 0.026), (8, -0.033), (9, 0.047), (10, -0.022), (11, 0.009), (12, 0.007), (13, -0.065), (14, -0.032), (15, -0.065), (16, -0.044), (17, 0.041), (18, -0.002), (19, 0.009), (20, 0.017), (21, -0.056), (22, -0.067), (23, -0.023), (24, 0.026), (25, 0.011), (26, -0.035), (27, -0.019), (28, 0.039), (29, -0.011), (30, -0.049), (31, 0.067), (32, -0.118), (33, 0.028), (34, -0.036), (35, 0.005), (36, 0.007), (37, 0.055), (38, -0.048), (39, -0.042), (40, 0.009), (41, -0.0), (42, 0.088), (43, -0.078), (44, 0.117), (45, 0.051), (46, 0.051), (47, -0.081), (48, -0.082), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.93144327 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
Author: Sridhar Mahadevan, Bo Liu
Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. 1
2 0.63300663 229 nips-2010-Reward Design via Online Gradient Ascent
Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh
Abstract: Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent’s goals different from the designer’s. This gives rise to the optimization problem of designing the artificial agent’s goals—in the RL framework, designing the agent’s reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent’s lifetime nor do they take advantage of knowledge about the agent’s structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent’s lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations. 1 The Optimal Reward Problem In this work, we consider the scenario of an agent designer building an autonomous agent. The designer has his or her own goals which must be translated into goals for the autonomous agent. We represent goals using the Reinforcement Learning (RL) formalism of the reward function. This leads to the optimal reward problem of designing the agent’s reward function so as to maximize the objective reward received by the agent designer. Typically, the designer assigns his or her own reward to the agent. However, there is ample work which demonstrates the benefit of assigning reward which does not match the designer’s. For example, work on reward shaping [11] has shown how to modify rewards to accelerate learning without altering the optimal policy, and PAC-MDP methods [5, 20] including approximate Bayesian methods [7, 19] add bonuses to the objective reward to achieve optimism under uncertainty. These approaches explicitly or implicitly assume that the asymptotic behavior of the agent should be the same as that which would occur using the objective reward function. These methods do not explicitly consider the optimal reward problem; however, they do show improved performance through reward modification. In our recent work that does explicitly consider the optimal reward problem [18], we analyzed an explicit hypothesis about the benefit of reward design—that it helps mitigate the performance loss caused by computational constraints (bounds) on agent architectures. We considered various types of agent limitations—limits on planning depth, failure to account for partial observability, and other erroneous modeling assumptions—and demonstrated the benefits of good reward functions in each case empirically. Crucially, in bounded agents, the optimal reward function often leads to behavior that is different from the asymptotic behavior achieved with the objective reward function. In this work, we develop an algorithm, Policy Gradient for Reward Design (PGRD), for improving reward functions for a family of bounded agents that behave according to repeated local (from the current state) model-based planning. We show that this algorithm is capable of improving the reward functions in agents with computational limitations necessitating small bounds on the depth of planning, and also from the use of an inaccurate model (which may be inaccurate due to computationally-motivated approximations). PGRD has few parameters, improves the reward 1 function online during an agent’s lifetime, takes advantage of knowledge about the agent’s structure (through the gradient computation), and is linear in the number of reward function parameters. Notation. Formally, we consider discrete-time partially-observable environments with a finite number of hidden states s ∈ S, actions a ∈ A, and observations o ∈ O; these finite set assumptions are useful for our theorems, but our algorithm can handle infinite sets in practice. Its dynamics are governed by a state-transition function P (s |s, a) that defines a distribution over next-states s conditioned on current state s and action a, and an observation function Ω(o|s) that defines a distribution over observations o conditioned on current state s. The agent designer’s goals are specified via the objective reward function RO . At each time step, the designer receives reward RO (st ) ∈ [0, 1] based on the current state st of the environment, where the subscript denotes time. The designer’s objective return is the expected mean objective reward N 1 obtained over an infinite horizon, i.e., limN →∞ E N t=0 RO (st ) . In the standard view of RL, the agent uses the same reward function as the designer to align the interests of the agent and the designer. Here we allow for a separate agent reward function R(· ). An agent’s reward function can in general be defined in terms of the history of actions and observations, but is often more pragmatically defined in terms of some abstraction of history. We define the agent’s reward function precisely in Section 2. Optimal Reward Problem. An RL agent attempts to act so as to maximize its own cumulative reward, or return. Crucially, as a result, the sequence of environment-states {st }∞ is affected by t=0 the choice of reward function; therefore, the agent designer’s return is affected as well. The optimal reward problem arises from the fact that while the objective reward function is fixed as part of the problem description, the reward function is a choice to be made by the designer. We capture this choice abstractly by letting the reward be parameterized by some vector of parameters θ chosen from space of parameters Θ. Each θ ∈ Θ specifies a reward function R(· ; θ) which in turn produces a distribution over environment state sequences via whatever RL method the agent uses. The expected N 1 return obtained by the designer for choice θ is U(θ) = limN →∞ E N t=0 RO (st ) R(·; θ) . The optimal reward parameters are given by the solution to the optimal reward problem [16, 17, 18]: θ∗ = arg max U(θ) = arg max lim E θ∈Θ θ∈Θ N →∞ 1 N N RO (st ) R(·; θ) . (1) t=0 Our previous research on solving the optimal reward problem has focused primarily on the properties of the optimal reward function and its correspondence to the agent architecture and the environment [16, 17, 18]. This work has used inefficient exhaustive search methods for finding good approximations to θ∗ (though there is recent work on using genetic algorithms to do this [6, 9, 12]). Our primary contribution in this paper is a new convergent online stochastic gradient method for finding approximately optimal reward functions. To our knowledge, this is the first algorithm that improves reward functions in an online setting—during a single agent’s lifetime. In Section 2, we present the PGRD algorithm, prove its convergence, and relate it to OLPOMDP [2], a policy gradient algorithm. In Section 3, we present experiments demonstrating PGRD’s ability to approximately solve the optimal reward problem online. 2 PGRD: Policy Gradient for Reward Design PGRD builds on the following insight: the agent’s planning algorithm procedurally converts the reward function into behavior; thus, the reward function can be viewed as a specific parameterization of the agent’s policy. Using this insight, PGRD updates the reward parameters by estimating the gradient of the objective return with respect to the reward parameters, θ U(θ), from experience, using standard policy gradient techniques. In fact, we show that PGRD can be viewed as an (independently interesting) generalization of the policy gradient method OLPOMDP [2]. Specifically, we show that OLPOMDP is special case of PGRD when the planning depth d is zero. In this section, we first present the family of local planning agents for which PGRD improves the reward function. Next, we develop PGRD and prove its convergence. Finally, we show that PGRD generalizes OLPOMDP and discuss how adding planning to OLPOMDP affects the space of policies available to the optimization method. 2 1 2 3 4 5 Input: T , θ0 , {αt }∞ , β, γ t=0 o0 , i0 = initializeStart(); for t = 0, 1, 2, 3, . . . do ∀a Qt (a; θt ) = plan(it , ot , T, R(it , ·, ·; θt ), d,γ); at ∼ µ(a|it ; Qt ); rt+1 , ot+1 = takeAction(at ); µ(a |i ;Q ) 6 7 8 9 t zt+1 = βzt + θt t |itt ;Qt ) t ; µ(a θt+1 = θt + αt (rt+1 zt+1 − λθt ) ; it+1 = updateInternalState(it , at , ot+1 ); end Figure 1: PGRD (Policy Gradient for Reward Design) Algorithm A Family of Limited Agents with Internal State. Given a Markov model T defined over the observation space O and action space A, denote T (o |o, a) the probability of next observation o given that the agent takes action a after observing o. Our agents use the model T to plan. We do not assume that the model T is an accurate model of the environment. The use of an incorrect model is one type of agent limitation we examine in our experiments. In general, agents can use non-Markov models defined in terms of the history of observations and actions; we leave this for future work. The agent maintains an internal state feature vector it that is updated at each time step using it+1 = updateInternalState(it , at , ot+1 ). The internal state allows the agent to use reward functions T that depend on the agent’s history. We consider rewards of the form R(it , o, a; θt ) = θt φ(it , o, a), where θt is the reward parameter vector at time t, and φ(it , o, a) is a vector of features based on internal state it , planning state o, and action a. Note that if φ is a vector of binary indicator features, this representation allows for arbitrary reward functions and thus the representation is completely general. Many existing methods use reward functions that depend on history. Reward functions based on empirical counts of observations, as in PAC-MDP approaches [5, 20], provide some examples; see [14, 15, 13] for others. We present a concrete example in our empirical section. At each time step t, the agent’s planning algorithm, plan, performs depth-d planning using the model T and reward function R(it , o, a; θt ) with current internal state it and reward parameters θt . Specifically, the agent computes a d-step Q-value function Qd (it , ot , a; θt ) ∀a ∈ A, where Qd (it , o, a; θt ) = R(it , o, a; θt ) + γ o ∈O T (o |o, a) maxb∈A Qd−1 (it , o , b; θt ) and Q0 (it , o, a; θt ) = R(it , o, a; θt ). We emphasize that the internal state it and reward parameters θt are held invariant while planning. Note that the d-step Q-values are only computed for the current observation ot , in effect by building a depth-d tree rooted at ot . In the d = 0 special case, the planning procedure completely ignores the model T and returns Q0 (it , ot , a; θt ) = R(it , ot , a; θt ). Regardless of the value of d, we treat the end result of planning as providing a scoring function Qt (a; θt ) where the dependence on d, it and ot is dropped from the notation. To allow for gradient calculations, our agents act according to the τ Qt (a;θt ) def Boltzmann (soft-max) stochastic policy parameterized by Q: µ(a|it ; Qt ) = e eτ Qt (b;θt ) , where τ b is a temperature parameter that determines how stochastically the agent selects the action with the highest score. When the planning depth d is small due to computational limitations, the agent cannot account for events beyond the planning depth. We examine this limitation in our experiments. Gradient Ascent. To develop a gradient algorithm for improving the reward function, we need to compute the gradient of the objective return with respect to θ: θ U(θ). The main insight is to break the gradient calculation into the calculation of two gradients. The first is the gradient of the objective return with respect to the policy µ, and the second is the gradient of the policy with respect to the reward function parameters θ. The first gradient is exactly what is computed in standard policy gradient approaches [2]. The second gradient is challenging because the transformation from reward parameters to policy involves a model-based planning procedure. We draw from the work of Neu and Szepesv´ ri [10] which shows that this gradient computation resembles planning itself. We a develop PGRD, presented in Figure 1, explicitly as a generalization of OLPOMDP, a policy gradient algorithm developed by Bartlett and Baxter [2], because of its foundational simplicity relative to other policy-gradient algorithms such as those based on actor-critic methods (e.g., [4]). Notably, the reward parameters are the only parameters being learned in PGRD. 3 PGRD follows the form of OLPOMDP (Algorithm 1 in Bartlett and Baxter [2]) but generalizes it in three places. In Figure 1 line 3, the agent plans to compute the policy, rather than storing the policy directly. In line 6, the gradient of the policy with respect to the parameters accounts for the planning procedure. In line 8, the agent maintains a general notion of internal state that allows for richer parameterization of policies than typically considered (similar to Aberdeen and Baxter [1]). The algorithm takes as parameters a sequence of learning rates {αk }, a decaying-average parameter β, and regularization parameter λ > 0 which keeps the the reward parameters θ bounded throughout learning. Given a sequence of calculations of the gradient of the policy with respect to the parameters, θt µ(at |it ; Qt ), the remainder of the algorithm climbs the gradient of objective return θ U(θ) using OLPOMDP machinery. In the next subsection, we discuss how to compute θt µ(at |it ; Qt ). Computing the Gradient of the Policy with respect to Reward. For the Boltzmann distribution, the gradient of the policy with respect to the reward parameters is given by the equation θt µ(a|it ; Qt ) = τ · µ(a|Qt )[ θt Qt (a|it ; θt ) − θt Qt (b; θt )], where τ is the Boltzmann b∈A temperature (see [10]). Thus, computing θt µ(a|it ; Qt ) reduces to computing θt Qt (a; θt ). The value of Qt depends on the reward parameters θt , the model, and the planning depth. However, as we present below, the process of computing the gradient closely resembles the process of planning itself, and the two computations can be interleaved. Theorem 1 presented below is an adaptation of Proposition 4 from Neu and Szepesv´ ri [10]. It presents the gradient computation for depth-d a planning as well as for infinite-depth discounted planning. We assume that the gradient of the reward function with respect to the parameters is bounded: supθ,o,i,a θ R(i, o, a, θ) < ∞. The proof of the theorem follows directly from Proposition 4 of Neu and Szepesv´ ri [10]. a Theorem 1. Except on a set of measure zero, for any depth d, the gradient θ Qd (o, a; θ) exists and is given by the recursion (where we have dropped the dependence on i for simplicity) d θ Q (o, a; θ) = θ R(o, a; θ) π d−1 (b|o ) T (o |o, a) +γ o ∈O d−1 (o θQ , b; θ), (2) b∈A where θ Q0 (o, a; θ) = θ R(o, a; θ) and π d (a|o) ∈ arg maxa Qd (o, a; θ) is any policy that is greedy with respect to Qd . The result also holds for θ Q∗ (o, a; θ) = θ limd→∞ Qd (o, a; θ). The Q-function will not be differentiable when there are multiple optimal policies. This is reflected in the arbitrary choice of π in the gradient calculation. However, it was shown by Neu and Szepesv´ ri [10] that even for values of θ which are not differentiable, the above computation produces a a valid calculation of a subgradient; we discuss this below in our proof of convergence of PGRD. Convergence of PGRD (Figure 1). Given a particular fixed reward function R(·; θ), transition model T , and planning depth, there is a corresponding fixed randomized policy µ(a|i; θ)—where we have explicitly represented the reward’s dependence on the internal state vector i in the policy parameterization and dropped Q from the notation as it is redundant given that everything else is fixed. Denote the agent’s internal-state update as a (usually deterministic) distribution ψ(i |i, a, o). Given a fixed reward parameter vector θ, the joint environment-state–internal-state transitions can be modeled as a Markov chain with a |S||I| × |S||I| transition matrix M (θ) whose entries are given by M s,i , s ,i (θ) = p( s , i | s, i ; θ) = o,a ψ(i |i, a, o)Ω(o|s )P (s |s, a)µ(a|i; θ). We make the following assumptions about the agent and the environment: Assumption 1. The transition matrix M (θ) of the joint environment-state–internal-state Markov chain has a unique stationary distribution π(θ) = [πs1 ,i1 (θ), πs2 ,i2 (θ), . . . , πs|S| ,i|I| (θ)] satisfying the balance equations π(θ)M (θ) = π(θ), for all θ ∈ Θ. Assumption 2. During its execution, PGRD (Figure 1) does not reach a value of it , and θt at which µ(at |it , Qt ) is not differentiable with respect to θt . It follows from Assumption 1 that the objective return, U(θ), is independent of the start state. The original OLPOMDP convergence proof [2] has a similar condition that only considers environment states. Intuitively, this condition allows PGRD to handle history-dependence of a reward function in the same manner that it handles partial observability in an environment. Assumption 2 accounts for the fact that a planning algorithm may not be fully differentiable everywhere. However, Theorem 1 showed that infinite and bounded-depth planning is differentiable almost everywhere (in a measure theoretic sense). Furthermore, this assumption is perhaps stronger than necessary, as stochastic approximation algorithms, which provide the theory upon which OLPOMDP is based, have been shown to converge using subgradients [8]. 4 In order to state the convergence theorem, we must define the approximate gradient which OLPOMDP def T calculates. Let the approximate gradient estimate be β U(θ) = limT →∞ t=1 rt zt for a fixed θ and θ PGRD parameter β, where zt (in Figure 1) represents a time-decaying average of the θt µ(at |it , Qt ) calculations. It was shown by Bartlett and Baxter [2] that β U(θ) is close to the true value θ U(θ) θ for large values of β. Theorem 2 proves that PGRD converges to a stable equilibrium point based on this approximate gradient measure. This equilibrium point will typically correspond to some local optimum in the return function U(θ). Given our development and assumptions, the theorem is a straightforward extension of Theorem 6 from Bartlett and Baxter [2] (proof omitted). ∞ Theorem 2. Given β ∈ [0, 1), λ > 0, and a sequence of step sizes αt satisfying t=0 αt = ∞ and ∞ 2 t=0 (αt ) < ∞, PGRD produces a sequence of reward parameters θt such that θt → L as t → ∞ a.s., where L is the set of stable equilibrium points of the differential equation ∂θ = β U(θ) − λθ. θ ∂t PGRD generalizes OLPOMDP. As stated above, OLPOMDP, when it uses a Boltzmann distribution in its policy representation (a common case), is a special case of PGRD when the planning depth is zero. First, notice that in the case of depth-0 planning, Q0 (i, o, a; θ) = R(i, o, a, θ), regardless of the transition model and reward parameterization. We can also see from Theorem 1 that 0 θ Q (i, o, a; θ) = θ R(i, o, a; θ). Because R(i, o, a; θ) can be parameterized arbitrarily, PGRD can be configured to match standard OLPOMDP with any policy parameterization that also computes a score function for the Boltzmann distribution. In our experiments, we demonstrate that choosing a planning depth d > 0 can be beneficial over using OLPOMDP (d = 0). In the remainder of this section, we show theoretically that choosing d > 0 does not hurt in the sense that it does not reduce the space of policies available to the policy gradient method. Specifically, we show that when using an expressive enough reward parameterization, PGRD’s space of policies is not restricted relative to OLPOMDP’s space of policies. We prove the result for infinite planning, but the extension to depth-limited planning is straightforward. Theorem 3. There exists a reward parameterization such that, for an arbitrary transition model T , the space of policies representable by PGRD with infinite planning is identical to the space of policies representable by PGRD with depth 0 planning. Proof. Ignoring internal state for now (holding it constant), let C(o, a) be an arbitrary reward function used by PGRD with depth 0 planning. Let R(o, a; θ) be a reward function for PGRD with infinite (d = ∞) planning. The depth-∞ agent uses the planning result Q∗ (o, a; θ) to act, while the depth-0 agent uses the function C(o, a) to act. Therefore, it suffices to show that one can always choose θ such that the planning solution Q∗ (o, a; θ) equals C(o, a). For all o ∈ O, a ∈ A, set R(o, a; θ) = C(o, a) − γ o T (o |o, a) maxa C(o , a ). Substituting Q∗ for C, this is the Bellman optimality equation [22] for infinite-horizon planning. Setting R(o, a; θ) as above is possible if it is parameterized by a table with an entry for each observation–action pair. Theorem 3 also shows that the effect of an arbitrarily poor model can be overcome with a good choice of reward function. This is because a Boltzmann distribution can, allowing for an arbitrary scoring function C, represent any policy. We demonstrate this ability of PGRD in our experiments. 3 Experiments The primary objective of our experiments is to demonstrate that PGRD is able to use experience online to improve the reward function parameters, thereby improving the agent’s obtained objective return. Specifically, we compare the objective return achieved by PGRD to the objective return achieved by PGRD with the reward adaptation turned off. In both cases, the reward function is initialized to the objective reward function. A secondary objective is to demonstrate that when a good model is available, adding the ability to plan—even for small depths—improves performance relative to the baseline algorithm of OLPOMDP (or equivalently PGRD with depth d = 0). Foraging Domain for Experiments 1 to 3: The foraging environment illustrated in Figure 2(a) is a 3 × 3 grid world with 3 dead-end corridors (rows) separated by impassable walls. The agent (bird) has four available actions corresponding to each cardinal direction. Movement in the intended direction fails with probability 0.1, resulting in movement in a random direction. If the resulting direction is 5 Objective Return 0.15 D=6, α=0 & D=6, α=5×10 −5 D=4, α=2×10 −4 D=0, α=5×10 −4 0.1 0.05 0 D=4, α=0 D=0, α=0 1000 2000 3000 4000 5000 Time Steps C) Objective Return B) A) 0.15 D=6, α=0 & D=6, α=5×10 −5 D=3, α=3×10 −3 D=1, α=3×10 −4 0.1 D=3, α=0 0.05 D=0, α=0.01 & D=1, α=0 0 1000 2000 3000 4000 5000 D=0, α=0 Time Steps Figure 2: A) Foraging Domain, B) Performance of PGRD with observation-action reward features, C) Performance of PGRD with recency reward features blocked by a wall or the boundary, the action results in no movement. There is a food source (worm) located in one of the three right-most locations at the end of each corridor. The agent has an eat action, which consumes the worm when the agent is at the worm’s location. After the agent consumes the worm, a new worm appears randomly in one of the other two potential worm locations. Objective Reward for the Foraging Domain: The designer’s goal is to maximize the average number of worms eaten per time step. Thus, the objective reward function RO provides a reward of 1.0 when the agent eats a worm, and a reward of 0 otherwise. The objective return is defined as in Equation (1). Experimental Methodology: We tested PGRD for depth-limited planning agents of depths 0–6. Recall that PGRD for the agent with planning depth 0 is the OLPOMDP algorithm. For each depth, we jointly optimized over the PGRD algorithm parameters, α and β (we use a fixed α throughout learning). We tested values for α on an approximate logarithmic scale in the range (10−6 , 10−2 ) as well as the special value of α = 0, which corresponds to an agent that does not adapt its reward function. We tested β values in the set 0, 0.4, 0.7, 0.9, 0.95, 0.99. Following common practice [3], we set the λ parameter to 0. We explicitly bound the reward parameters and capped the reward function output both to the range [−1, 1]. We used a Boltzmann temperature parameter of τ = 100 and planning discount factor γ = 0.95. Because we initialized θ so that the initial reward function was the objective reward function, PGRD with α = 0 was equivalent to a standard depth-limited planning agent. Experiment 1: A fully observable environment with a correct model learned online. In this experiment, we improve the reward function in an agent whose only limitation is planning depth, using (1) a general reward parameterization based on the current observation and (2) a more compact reward parameterization which also depends on the history of observations. Observation: The agent observes the full state, which is given by the pair o = (l, w), where l is the agent’s location and w is the worm’s location. Learning a Correct Model: Although the theorem of convergence of PGRD relies on the agent having a fixed model, the algorithm itself is readily applied to the case of learning a model online. In this experiment, the agent’s model T is learned online based on empirical transition probabilities between observations (recall this is a fully observable environment). Let no,a,o be the number of times that o was reached after taking action a after observing o. The agent models the probability of seeing o as no,a,o T (o |o, a) = . n o o,a,o Reward Parameterizations: Recall that R(i, o, a; θ) = θT φ(i, o, a), for some φ(i, o, a). (1) In the observation-action parameterization, φ(i, o, a) is a binary feature vector with one binary feature for each observation-action pair—internal state is ignored. This is effectively a table representation over all reward functions indexed by (o, a). As shown in Theorem 3, the observation-action feature representation is capable of producing arbitrary policies over the observations. In large problems, such a parameterization would not be feasible. (2) The recency parameterization is a more compact representation which uses features that rely on the history of observations. The feature vector is φ(i, o, a) = [RO (o, a), 1, φcl (l, i), φcl,a (l, a, i)], where RO (o, a) is the objective reward function defined as above. The feature φcl (l) = 1 − 1/c(l, i), where c(l, i) is the number of time steps since the agent has visited location l, as represented in the agent’s internal state i. Its value is normalized to the range [0, 1) and is high when the agent has not been to location l recently. The feature φcl,a (l, a, i) = 1 − 1/c(l, a, i) is similarly defined with respect to the time since the agent has taken action a in location l. Features based on recency counts encourage persistent exploration [21, 18]. 6 Results & Discussion: Figure 2(b) and Figure 2(c) present results for agents that use the observationaction parameterization and the recency parameterization of the reward function respectively. The horizontal axis is the number of time steps of experience. The vertical axis is the objective return, i.e., the average objective reward per time step. Each curve is an average over 130 trials. The values of d and the associated optimal algorithm parameters for each curve are noted in the figures. First, note that with d = 6, the agent is unbounded, because food is never more than 6 steps away. Therefore, the agent does not benefit from adapting the reward function parameters (given that we initialize to the objective reward function). Indeed, the d = 6, α = 0 agent performs as well as the best reward-optimizing agent. The performance for d = 6 improves with experience because the model improves with experience (and thus from the curves it is seen that the model gets quite accurate in about 1500 time steps). The largest objective return obtained for d = 6 is also the best objective return that can be obtained for any value of d. Several results can be observed in both Figures 2(b) and (c). 1) Each curve that uses α > 0 (solid lines) improves with experience. This is a demonstration of our primary contribution, that PGRD is able to effectively improve the reward function with experience. That the improvement over time is not just due to model learning is seen in the fact that for each value of d < 6 the curve for α > 0 (solid-line) which adapts the reward parameters does significantly better than the corresponding curve for α = 0 (dashed-line); the α = 0 agents still learn the model. 2) For both α = 0 and α > 0 agents, the objective return obtained by agents with equivalent amounts of experience increases monotonically as d is increased (though to maintain readability we only show selected values of d in each figure). This demonstrates our secondary contribution, that the ability to plan in PGRD significantly improves performance over standard OLPOMDP (PGRD with d = 0). There are also some interesting differences between the results for the two different reward function parameterizations. With the observation-action parameterization, we noted that there always exists a setting of θ for all d that will yield optimal objective return. This is seen in Figure 2(b) in that all solid-line curves approach optimal objective return. In contrast, the more compact recency reward parameterization does not afford this guarantee and indeed for small values of d (< 3), the solid-line curves in Figure 2(c) converge to less than optimal objective return. Notably, OLPOMDP (d = 0) does not perform well with this feature set. On the other hand, for planning depths 3 ≤ d < 6, the PGRD agents with the recency parameterization achieve optimal objective return faster than the corresponding PGRD agent with the observation-action parameterization. Finally, we note that this experiment validates our claim that PGRD can improve reward functions that depend on history. Experiment 2: A fully observable environment and poor given model. Our theoretical analysis showed that PGRD with an incorrect model and the observation–action reward parameterization should (modulo local maxima issues) do just as well asymptotically as it would with a correct model. Here we illustrate this theoretical result empirically on the same foraging domain and objective reward function used in Experiment 1. We also test our hypothesis that a poor model should slow down the rate of learning relative to a correct model. Poor Model: We gave the agents a fixed incorrect model of the foraging environment that assumes there are no internal walls separating the 3 corridors. Reward Parameterization: We used the observation–action reward parameterization. With a poor model it is no longer interesting to initialize θ so that the initial reward function is the objective reward function because even for d = 6 such an agent would do poorly. Furthermore, we found that this initialization leads to excessively bad exploration and therefore poor learning of how to modify the reward. Thus, we initialize θ to uniform random values near 0, in the range (−10−3 , 10−3 ). Results: Figure 3(a) plots the objective return as a function of number of steps of experience. Each curve is an average over 36 trials. As hypothesized, the bad model slows learning by a factor of more than 10 (notice the difference in the x-axis scales from those in Figure 2). Here, deeper planning results in slower learning and indeed the d = 0 agent that does not use the model at all learns the fastest. However, also as hypothesized, because they used the expressive observation–action parameterization, agents of all planning depths mitigate the damage caused by the poor model and eventually converge to the optimal objective return. Experiment 3: Partially observable foraging world. Here we evaluate PGRD’s ability to learn in a partially observable version of the foraging domain. In addition, the agents learn a model under the erroneous (and computationally convenient) assumption that the domain is fully observable. 7 0.1 −4 D = 0, α = 2 ×10 D = 2, α = 3 ×10 −5 −5 D = 6, α = 2 ×10 0.05 D = 0&2&6, α = 0 0 1 2 3 Time Steps 4 5 x 10 4 0.06 D = 6, α = 7 ×10 D = 2, α = 7 ×10 −4 0.04 D = 1, α = 7 ×10 −4 D = 0, α = 5 ×10 −4 D = 0, α = 0 D = 1&2&6, α = 0 0.02 0 C) −4 1000 2000 3000 4000 5000 Time Steps Objective Return B) 0.08 0.15 Objective Return Objective Return A) 2.5 2 x 10 −3 D=6, α=3×10 −6 D=0, α=1×10 −5 1.5 D=0&6, α=0 1 0.5 1 2 3 Time Steps 4 5 x 10 4 Figure 3: A) Performance of PGRD with a poor model, B) Performance of PGRD in a partially observable world with recency reward features, C) Performance of PGRD in Acrobot Partial Observation: Instead of viewing the location of the worm at all times, the agent can now only see the worm when it is colocated with it: its observation is o = (l, f ), where f indicates whether the agent is colocated with the food. Learning an Incorrect Model: The model is learned just as in Experiment 1. Because of the erroneous full observability assumption, the model will hallucinate about worms at all the corridor ends based on the empirical frequency of having encountered them there. Reward Parameterization: We used the recency parameterization; due to the partial observability, agents with the observation–action feature set perform poorly in this environment. The parameters θ are initialized such that the initial reward function equals the objective reward function. Results & Discussion: Figure 3(b) plots the mean of 260 trials. As seen in the solid-line curves, PGRD improves the objective return at all depths (only a small amount for d = 0 and significantly more for d > 0). In fact, agents which don’t adapt the reward are hurt by planning (relative to d = 0). This experiment demonstrates that the combination of planning and reward improvement can be beneficial even when the model is erroneous. Because of the partial observability, optimal behavior in this environment achieves less objective return than in Experiment 1. Experiment 4: Acrobot. In this experiment we test PGRD in the Acrobot environment [22], a common benchmark task in the RL literature and one that has previously been used in the testing of policy gradient approaches [23]. This experiment demonstrates PGRD in an environment in which an agent must be limited due to the size of the state space and further demonstrates that adding model-based planning to policy gradient approaches can improve performance. Domain: The version of Acrobot we use is as specified by Sutton and Barto [22]. It is a two-link robot arm in which the position of one shoulder-joint is fixed and the agent’s control is limited to 3 actions which apply torque to the elbow-joint. Observation: The fully-observable state space is 4 dimensional, with two joint angles ψ1 and ψ2 , and ˙ ˙ two joint velocities ψ1 and ψ2 . Objective Reward: The designer receives an objective reward of 1.0 when the tip is one arm’s length above the fixed shoulder-joint, after which the bot is reset to its initial resting position. Model: We provide the agent with a perfect model of the environment. Because the environment is continuous, value iteration is intractable, and computational limitations prevent planning deep enough to compute the optimal action in any state. The feature vector contains 13 entries. One feature corresponds to the objective reward signal. For each action, there are 5 features corresponding to each of the state features plus an additional feature representing the height of the tip: φ(i, o, a) = ˙ ˙ [RO (o), {ψ1 (o), ψ2 (o), ψ1 (o), ψ2 (o), h(o)}a ]. The height feature has been used in previous work as an alternative definition of objective reward [23]. Results & Discussion: We plot the mean of 80 trials in Figure 3(c). Agents that use the fixed (α = 0) objective reward function with bounded-depth planning perform according to the bottom two curves. Allowing PGRD and OLPOMDP to adapt the parameters θ leads to improved objective return, as seen in the top two curves in Figure 3(c). Finally, the PGRD d = 6 agent outperforms the standard OLPOMDP agent (PGRD with d = 0), further demonstrating that PGRD outperforms OLPOMDP. Overall Conclusion: We developed PGRD, a new method for approximately solving the optimal reward problem in bounded planning agents that can be applied in an online setting. We showed that PGRD is a generalization of OLPOMDP and demonstrated that it both improves reward functions in limited agents and outperforms the model-free OLPOMDP approach. 8 References [1] Douglas Aberdeen and Jonathan Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. Proceedings of the Nineteenth International Conference on Machine Learning, 2002. [2] Peter L. Bartlett and Jonathan Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000. [3] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001. [4] Shalabh Bhatnagar, Richard S. Sutton, M Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 2009. [5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A General Polynomial Time Algorithm for NearOptimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2001. [6] S. Elfwing, Eiji Uchibe, K. Doya, and H. I. Christensen. Co-evolution of Shaping Rewards and MetaParameters in Reinforcement Learning. Adaptive Behavior, 16(6):400–412, 2008. [7] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, 2009. [8] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2010. [9] Cetin Mericli, Tekin Mericli, and H. Levent Akin. A Reward Function Generation Method Using Genetic ¸ ¸ ¸ Algorithms : A Robot Soccer Case Study (Extended Abstract). In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), number 2, pages 1513–1514, 2010. [10] Gergely Neu and Csaba Szepesv´ ri. Apprenticeship learning using inverse reinforcement learning and a gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007. [11] Andrew Y. Ng, Stuart J. Russell, and D. Harada. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999. [12] Scott Niekum, Andrew G. Barto, and Lee Spector. Genetic Programming for Reward Function Search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010. [13] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, April 2007. [14] J¨ rgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on u Neural Networks, pages 1458–1463, 1991. [15] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS), pages 1281–1288, 2005. [16] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where Do Rewards Come From? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009. [17] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transations on Autonomous Mental Development, 2(2):70–82, 2010. [18] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Internal Rewards Mitigate Agent Boundedness. In Proceedings of the 27th International Conference on Machine Learning, 2010. [19] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010. [20] Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. [21] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In The Seventh International Conference on Machine Learning, pages 216–224. 1990. [22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. [23] Lex Weaver and Nigel Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 538–545. 2001. 9
3 0.59323174 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, Françcois Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, what makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
4 0.54232514 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: The goal of inverse reinforcement learning is to find a reward function for a Markov decision process, given example traces from its optimal policy. Current IRL techniques generally rely on user-supplied features that form a concise basis for the reward. We present an algorithm that instead constructs reward features from a large collection of component features, by building logical conjunctions of those component features that are relevant to the example policy. Given example traces, the algorithm returns a reward function as well as the constructed features. The reward function can be used to recover a full, deterministic, stationary policy, and the features can be used to transplant the reward function into any novel environment on which the component features are well defined. 1
5 0.52798569 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
Author: J. Z. Kolter, Siddharth Batra, Andrew Y. Ng
Abstract: Energy disaggregation is the task of taking a whole-home energy signal and separating it into its component appliances. Studies have shown that having devicelevel energy information can cause users to conserve significant amounts of energy, but current electricity meters only report whole-home data. Thus, developing algorithmic methods for disaggregation presents a key technical challenge in the effort to maximize energy conservation. In this paper, we examine a large scale energy disaggregation task, and apply a novel extension of sparse coding to this problem. In particular, we develop a method, based upon structured prediction, for discriminatively training sparse coding algorithms specifically to maximize disaggregation performance. We show that this significantly improves the performance of sparse coding algorithms on the energy task and illustrate how these disaggregation results can provide useful information about energy usage. 1
6 0.52086294 59 nips-2010-Deep Coding Network
7 0.49169075 4 nips-2010-A Computational Decision Theory for Interactive Assistants
8 0.46554548 212 nips-2010-Predictive State Temporal Difference Learning
9 0.45858365 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
10 0.45051688 43 nips-2010-Bootstrapping Apprenticeship Learning
11 0.43320367 134 nips-2010-LSTD with Random Projections
12 0.419155 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
13 0.4186132 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
14 0.40944639 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
15 0.40502024 168 nips-2010-Monte-Carlo Planning in Large POMDPs
16 0.39610225 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
17 0.3815904 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
18 0.37707958 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
19 0.36977619 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
20 0.36316502 203 nips-2010-Parametric Bandits: The Generalized Linear Case
topicId topicWeight
[(13, 0.037), (17, 0.01), (27, 0.054), (30, 0.037), (35, 0.011), (45, 0.184), (50, 0.048), (52, 0.027), (60, 0.025), (67, 0.385), (77, 0.036), (79, 0.018), (90, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.72902536 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
Author: Sridhar Mahadevan, Bo Liu
Abstract: This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost. 1
2 0.62054586 152 nips-2010-Learning from Logged Implicit Exploration Data
Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade
Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.
3 0.48024157 168 nips-2010-Monte-Carlo Planning in Large POMDPs
Author: David Silver, Joel Veness
Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, MonteCarlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 1018 and 1056 states respectively. Our MonteCarlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs. 1
4 0.47770476 162 nips-2010-Link Discovery using Graph Feature Tracking
Author: Emile Richard, Nicolas Baskiotis, Theodoros Evgeniou, Nicolas Vayatis
Abstract: We consider the problem of discovering links of an evolving undirected graph given a series of past snapshots of that graph. The graph is observed through the time sequence of its adjacency matrix and only the presence of edges is observed. The absence of an edge on a certain snapshot cannot be distinguished from a missing entry in the adjacency matrix. Additional information can be provided by examining the dynamics of the graph through a set of topological features, such as the degrees of the vertices. We develop a novel methodology by building on both static matrix completion methods and the estimation of the future state of relevant graph features. Our procedure relies on the formulation of an optimization problem which can be approximately solved by a fast alternating linearized algorithm whose properties are examined. We show experiments with both simulated and real data which reveal the interest of our methodology. 1
5 0.47530928 158 nips-2010-Learning via Gaussian Herding
Author: Koby Crammer, Daniel D. Lee
Abstract: We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data. 1
6 0.47516212 63 nips-2010-Distributed Dual Averaging In Networks
7 0.47490579 287 nips-2010-Worst-Case Linear Discriminant Analysis
8 0.47445184 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
9 0.47433022 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
10 0.47427318 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
11 0.47391936 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades
12 0.47383824 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.47378638 17 nips-2010-A biologically plausible network for the computation of orientation dominance
14 0.47372216 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
15 0.47367433 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
16 0.47337469 204 nips-2010-Penalized Principal Component Regression on Graphs for Analysis of Subnetworks
17 0.47320259 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
18 0.47305942 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
19 0.47300071 155 nips-2010-Learning the context of a category
20 0.47287905 257 nips-2010-Structured Determinantal Point Processes