nips nips2010 nips2010-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. [sent-3, score-0.106]
2 Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., [sent-4, score-0.258]
3 problems whose objective explicitly implies reducing the uncertainty on the state. [sent-6, score-0.077]
4 To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. [sent-7, score-0.492]
5 We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. [sent-8, score-0.238]
6 Let us consider active sensing problems, where the objective is to act so as to acquire knowledge about certain state variables. [sent-11, score-0.216]
7 This can be formalized as a POMDP by rewarding a final action that expresses the diagnoser's “best guess”, provided that guess is correct. [sent-13, score-0.083]
8 Actually, a large body of work formalizes active sensing with POMDPs [2, 3, 4]. [sent-14, score-0.157]
9 for example, to minimize the entropy over a given state variable. [sent-17, score-0.103]
10 In such cases, POMDPs are not appropriate because the reward function depends on the state and the action, not on the knowledge of the agent. [sent-18, score-0.317]
11 Instead, we need a model where the instant reward depends on the current belief state. [sent-19, score-0.455]
12 The belief MDP formalism provides the needed expressiveness for these problems. [sent-20, score-0.221]
13 One can argue that acquiring information is always a means, not an end, and thus that a "well-defined" sequential decision-making problem with partial observability must always be modeled as a normal POMDP. [sent-22, score-0.119]
14 After reviewing some background knowledge on POMDPs in Section 2, Section 3 introduces ρPOMDPs—an extension of POMDPs where the reward is a (typically convex) function of the belief state—and proves that the convexity of the value function is preserved. [sent-25, score-0.556]
15 Then we show how classical solving algorithms can be adapted depending on whether the reward function is piecewise linear (Sec. 3.3) or not (Sec. 4). [sent-26, score-0.383]
16 Compared to classical deterministic planning, the agent has to account for a system whose dynamics are uncertain and whose current state is imperfectly known. [sent-31, score-0.135]
17 Using belief states, a POMDP can be rewritten as an MDP over the belief space, or belief MDP, ⟨∆, A, τ, ρ⟩, where the new transition function τ and reward function ρ are defined respectively over ∆ × A × ∆ and ∆ × A. [sent-37, score-0.877]
18 An issue is that, even if a POMDP has a finite number of states, the corresponding belief MDP is defined over a continuous—and thus infinite—belief space. [sent-39, score-0.197]
19 In this continuous MDP, the objective is to maximize the cumulative reward by looking for a policy taking the current belief state as input. [sent-40, score-0.605]
20 More formally, we are searching for a policy verifying π* = argmax_{π∈A^∆} J^π(b₀), where J^π(b₀) = E[ Σ_{t=0}^{∞} γ^t ρ_t | b₀, π ], ρ_t being the expected immediate reward obtained at time step t, and γ a discount factor. [sent-41, score-0.3]
21 The POMDP framework presents a reward function r(s, a) based on the state and action. [sent-43, score-0.317]
22 On the other hand, the belief MDP presents a reward function ρ(b, a) based on beliefs. [sent-44, score-0.455]
23 (Footnote 1: Π(S) forms a simplex because ‖b‖₁ = 1, which is why we use ∆ to denote the set of all possible b.) [sent-45, score-0.143]
24 This belief-based reward function is derived as the expectation of the POMDP rewards: ρ(b, a) = Σ_{s∈S} b(s) r(s, a). [sent-46, score-0.258]
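As a concrete illustration of the belief MDP quantities above, the following is a minimal sketch of the Bayesian belief update τ and of the belief-based reward ρ(b, a) = Σ_s b(s) r(s, a); it assumes a small discrete POMDP stored as NumPy arrays, and all array names and layouts are ours, not the paper's.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update of a belief after taking action a and observing o.

    b: belief over states, shape (S,)
    T: transition probabilities, T[s, a, s2] = P(s2 | s, a)
    O: observation probabilities, O[a, s2, o] = P(o | s2, a)
    """
    unnorm = O[a, :, o] * (b @ T[:, a, :])   # proportional to P(s2, o | b, a)
    total = unnorm.sum()                     # P(o | b, a)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnorm / total

def rho_from_state_reward(b, a, R):
    """Belief-based reward rho(b, a) as the expectation of the POMDP reward r(s, a)."""
    return float(b @ R[:, a])
```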
25 A belief MDP with a reward defined in this way has the property of generating piecewise-linear and convex (PWLC) value functions for each horizon [1], i.e., [sent-48, score-0.157]
26 each function is determined by a set of hyperplanes (each represented by a vector), the value at a given belief point being that of the highest hyperplane. [sent-50, score-0.264]
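To make the PWLC representation concrete: such a value function can be stored as a set of α-vectors (hyperplanes) and evaluated at a belief by taking the highest one. The sketch below is illustrative only; the function names are ours.

```python
import numpy as np

def pwlc_value(b, alpha_vectors):
    """Value of belief b under a PWLC function given by a list of alpha-vectors.

    Each alpha-vector has shape (S,); the value is the upper envelope
    max_alpha sum_s b(s) * alpha(s).
    """
    return max(float(b @ alpha) for alpha in alpha_vectors)

def best_alpha(b, alpha_vectors):
    """Hyperplane (alpha-vector) attaining the maximum at belief b."""
    return max(alpha_vectors, key=lambda alpha: float(b @ alpha))
```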
27 There are several algorithms based on pruning techniques like Batch Enumeration [8] or more efficient algorithms such as Witness or Incremental Pruning [6]. [sent-61, score-0.122]
28 2.3 Solving POMDPs with Approximate Updates. The value function update processes presented above are exact and provide value functions that can be used whatever the initial belief state b₀. [sent-63, score-0.331]
29 A number of approximate POMDP solutions have been proposed to reduce the complexity of these computations, using for example heuristic estimates of the value function, or applying the value update only on selected belief points [9]. [sent-64, score-0.228]
30 Each iteration of a point-based (PB) algorithm typically: 1. selects a new set of belief points B_n based on B_{n−1} and the current approximation V_{n−1}; [sent-69, score-0.271]
31 2. performs a Bellman backup at each belief point b ∈ B_n, resulting in one α-vector per point; 3. [sent-70, score-0.262]
32 The various PB algorithms differ mainly in how belief points are selected, and in how the update is performed. [sent-72, score-0.259]
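For reference, a standard point-based Bellman backup for an ordinary POMDP (state-dependent reward r(s, a)) can be sketched as follows; this is the textbook PBVI-style backup, not the paper's code, and the array conventions are assumptions.

```python
import numpy as np

def point_based_backup(b, alpha_set, T, O, R, gamma):
    """One point-based backup at belief b; returns the alpha-vector generated for b.

    T[s, a, s2] = P(s2 | s, a),  O[a, s2, o] = P(o | s2, a),  R[s, a] = r(s, a).
    alpha_set: list of alpha-vectors (shape (S,)) representing V_{n-1}.
    """
    S, A, _ = T.shape
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(A):
        backed_up = R[:, a].astype(float)
        for o in range(n_obs):
            # g(s) = sum_{s2} P(s2|s,a) P(o|s2,a) alpha(s2), for each alpha of V_{n-1}
            candidates = [T[:, a, :] @ (O[a, :, o] * alpha) for alpha in alpha_set]
            # keep the candidate that is best at the current belief point b
            backed_up = backed_up + gamma * max(candidates, key=lambda v: float(b @ v))
        value_at_b = float(b @ backed_up)
        if value_at_b > best_val:
            best_vec, best_val = backed_up, value_at_b
    return best_vec
```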
33 3 POMDP extension for Active Sensing. 3.1 Introducing ρPOMDPs. All problems with partial observability confront the issue of getting more information to achieve some goal. [sent-76, score-0.123]
34 This problem is usually addressed implicitly in the solution process, where acquiring information is only a means for optimizing an expected reward based on the system state. [sent-77, score-0.317]
35 Some active sensing problems can be modeled this way. [sent-78, score-0.157]
36 However, the reward of a POMDP cannot model this since it is only based on the current state and action. [sent-84, score-0.317]
37 We prefer to consider a new way of defining rewards based on the acquired knowledge represented by belief states. [sent-86, score-0.293]
38 The rest of the paper explores the fact that belief MDPs can be used beyond the specific definition of ρ(b, a) in Eq. 2, [sent-87, score-0.197]
39 and therefore discusses how to solve this special type of active sensing problem. [sent-88, score-0.157]
40 A way of fixing this is to generalize the POMDP framework to a ρ-based POMDP (ρPOMDP), where the reward is not defined as a function r(s, a), but directly as a function ρ(b, a). [sent-92, score-0.258]
41 The nature of the ρ(b, a) function depends on the problem, but is usually related to some uncertainty or error measure [3, 2, 4]. [sent-93, score-0.117]
42 Also, other simpler functions based on the same idea can be used, such as the distance from the simplex center (DSC), ρ_dsc(b) = ‖b − c‖_m, where c is the center of the simplex and m a positive integer denoting the order of the metric. [sent-96, score-0.338]
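For illustration, two belief-dependent rewards of the kind mentioned here can be written as below; the small eps added to the entropy is our own smoothing to avoid log(0), and the function names are ours.

```python
import numpy as np

def rho_neg_entropy(b, eps=1e-12):
    """Negative Shannon entropy of the belief: highest near the corners of the simplex."""
    b = np.asarray(b, dtype=float)
    return float(np.sum(b * np.log2(b + eps)))

def rho_dsc(b, m=2):
    """Distance from the simplex center (DSC) in m-norm."""
    b = np.asarray(b, dtype=float)
    c = np.full_like(b, 1.0 / b.size)   # center of the simplex
    return float(np.linalg.norm(b - c, ord=m))
```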
43 The work in [3] defines the active sensing problem as optimizing a weighted sum of uncertainty measurements and costs, where the former depend on the belief and the latter on the system state. [sent-100, score-0.49]
44 To that end, we discuss the convexity of the value function, which permits extending these algorithms using PWLC approximations. [sent-102, score-0.095]
45 For ρPOMDPs, this property also holds if the reward function ρ(b, a) is convex, as shown in Theorem 3. [sent-105, score-0.285]
46 If ρ and V0 are convex functions over ∆, then the value function Vn of the belief MDP is convex over ∆ at any time step n. [sent-109, score-0.357]
47 Thus, a reward function meant to reduce the uncertainty must provide high payoffs near the corners of the simplex, and low payoffs near its center. [sent-111, score-0.467]
48 For that reason, in the rest of the paper we focus only on convex reward functions. [sent-112, score-0.35]
49 Moreover, starting with V₀ = 0, it is also easy to prove by induction that, if ρ is continuous (respectively differentiable), then V_n is continuous (respectively piecewise differentiable). [sent-115, score-0.142]
50 3.3 Piecewise Linear Reward Functions. This section focuses on the case where ρ is a PWLC function and shows that only a small adaptation of the exact and approximate updates of the POMDP case is necessary to compute the optimal value function. [sent-117, score-0.109]
51 The reward is computed as ρ(b, a) = max_{α∈Γ_ρ^a} Σ_s b(s) α(s). [sent-123, score-0.258]
52 The only difference is again the reward function representation as an envelope of hyperplanes. [sent-133, score-0.258]
53 PB algorithms select the hyperplane that maximizes the value function at each belief point, so the same simplification can be applied to the set Γ_ρ^a. [sent-134, score-0.275]
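Continuing the earlier point-based backup sketch, the only modification needed when ρ is PWLC (represented by a set Γ_ρ^a of reward vectors per action) is to start from the reward hyperplane that is highest at the current belief instead of the fixed column r(·, a); again, an illustrative sketch with our own naming.

```python
import numpy as np

def point_based_backup_pwlc(b, alpha_set, T, O, Gamma_rho, gamma):
    """Point-based backup when rho(b, a) = max over Gamma_rho[a] of b . alpha."""
    S, A, _ = T.shape
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(A):
        # only difference w.r.t. the state-reward case: pick the reward hyperplane
        # of Gamma_rho[a] that is highest at the current belief point b
        backed_up = np.array(max(Gamma_rho[a], key=lambda v: float(b @ v)), dtype=float)
        for o in range(n_obs):
            candidates = [T[:, a, :] @ (O[a, :, o] * alpha) for alpha in alpha_set]
            backed_up = backed_up + gamma * max(candidates, key=lambda v: float(b @ v))
        value_at_b = float(b @ backed_up)
        if value_at_b > best_val:
            best_vec, best_val = backed_up, value_at_b
    return best_vec
```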
54 4 Generalizing to Other Reward Functions. Uncertainty measures such as the negative entropy or the DSC (with m > 1 and m ≠ ∞) are not piecewise linear functions. [sent-135, score-0.171]
55 However, convex functions can be efficiently approximated by piecewise linear functions, making it possible to apply the techniques described in Section 3.3 [sent-138, score-0.188]
56 with a bounded error, as long as the error of the approximation of ρ is bounded. [sent-139, score-0.099]
57 Consider a continuous, convex and piecewise differentiable reward function ρ(b), and an arbitrary (and finite) set of points B ⊂ ∆ where the gradient is well defined. [sent-142, score-0.512]
58 A lower PWLC approximation of ρ(b) can be obtained by using each element b ∈ B as a base point for constructing a tangent hyperplane which is always a lower bound of ρ(b). [sent-143, score-0.178]
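A minimal sketch of this tangent-hyperplane construction, assuming callables rho and grad_rho are available for the chosen reward; names are ours.

```python
import numpy as np

def pwlc_lower_approx(B_points, rho, grad_rho):
    """Tangent-hyperplane PWLC lower bound omega_B of a convex, differentiable rho.

    B_points: beliefs (arrays of shape (S,)) at which the gradient is well defined.
    Returns omega_B with omega_B(b) = max_{b'} [rho(b') + grad_rho(b') . (b - b')].
    """
    tangents = [(np.asarray(bp, dtype=float),
                 float(rho(bp)),
                 np.asarray(grad_rho(bp), dtype=float)) for bp in B_points]

    def omega_B(b):
        b = np.asarray(b, dtype=float)
        # upper envelope of the tangent hyperplanes; each one lower-bounds rho by convexity
        return max(val + grad @ (b - bp) for bp, val, grad in tangents)

    return omega_B
```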
59 At any point b ∈ ∆ the error of the approximation can be written as ε_B(b) = |ρ(b) − ω_B(b)|, and if we pick b̃ as the point where ε_B(b) is maximal (worst error), then we can try to bound this error depending on the nature of ρ. [sent-146, score-0.201]
60 It is well known that the error of a piecewise linear approximation of a Lipschitz function is bounded, because the gradient ∇ρ(b′) used to construct the hyperplane ω_{b′}(b) has bounded norm [18]. [sent-147, score-0.32]
61 Unfortunately, the negative entropy is not Lipschitz (f (x) = x log2 (x) has an infinite slope when x → 0), so this result is not generic enough to cover a wide range of active sensing problems. [sent-148, score-0.201]
62 First, we will introduce some basic results over the simplex and the convexity of ρ. [sent-151, score-0.207]
63 Lemma 4.1 will show that, for each b, it is possible to find a belief point in B far enough from the boundary of the simplex but within a bounded distance of b. [sent-153, score-0.499]
64 Then, in a second step, we will assume that the function ρ(b) satisfies the α-Hölder condition, in order to bound the norm of the gradient in Lemma 4.2. [sent-154, score-0.16]
65 Theorem 4.3 will then use both lemmas to bound the error of ρ's approximation under these assumptions. [sent-157, score-0.107]
66 For each point b ∈ ∆, it is possible to associate a point b* = argmax_{x∈B} ω_x(b), corresponding to the point in B whose tangent hyperplane gives the best approximation of ρ at b. [sent-159, score-0.208]
67 Consider the point b̃ ∈ ∆ where ε_B(b) is maximal: this error can be easily computed using the gradient ∇ρ(b*). [sent-160, score-0.091]
68 Unfortunately, some partial derivatives of ρ may diverge to infinity on the boundary of the simplex in the non-Lipschitz case, making the error hard to analyze. [sent-161, score-0.264]
69 Therefore, to ensure that this error can be bounded, instead of b* we will take a safe b′′ ∈ B (far enough from the boundary) by using an intermediate point b′ in an inner simplex ∆_ε, where ∆_ε = {b ∈ [ε, 1]^N | Σ_i b_i = 1} with N = |S|. [sent-162, score-0.21]
70 Thus, for a given b̃ ∈ ∆ and ε ∈ (0, 1/N], we define the point b′ = argmin_{x∈∆_ε} ‖x − b̃‖₁ as the closest point to b̃ in ∆_ε, and b′′ = argmin_{x∈B} ‖x − b′‖₁ as the closest point to b′ in B (see Figure 1). [sent-163, score-0.081]
71 These two points will be used to find an upper bound for the distance ‖b̃ − b′′‖₁, based on the density of B, defined as δ_B = max_{b∈∆} min_{x∈B} ‖b − x‖₁. [sent-164, score-0.079]
72 The distance (1-norm) between the maximum-error point b̃ ∈ ∆ and the selected point b′′ ∈ B is bounded by ‖b̃ − b′′‖₁ ≤ 2(N − 1)ε + δ_B. [sent-167, score-0.147]
73 [Proof in [17, Appendix]] If we pick ε > δ_B, then we are sure that b′′ is not on the boundary of the simplex ∆, with a minimum distance from the boundary of η = ε − δ_B. [sent-168, score-0.271]
74 This allows the approximation of convex α-Hölder functions, a broader family of functions that includes the negative entropy, convex Lipschitz functions and others. [sent-170, score-0.343]
75 The α-Hölder condition is a generalization of the Lipschitz condition. [sent-171, score-0.112]
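For reference, the α-Hölder condition (written here in the 1-norm, which is the norm used throughout this section) takes the standard form:

```latex
\[
  |f(x) - f(y)| \;\le\; K_\alpha \, \lVert x - y \rVert_1^{\alpha}
  \qquad \text{for all } x, y \in \Delta,\ \alpha \in (0, 1],
\]
```

with α = 1 recovering the Lipschitz condition with constant K₁.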
76 The limit case, where a convex α-Hölder function has an infinite gradient norm, always lies on the boundary of the simplex ∆ (due to convexity), and therefore the point b′′ is free of this predicament because of η. [sent-175, score-0.4]
77 More precisely, an α-Hölder function on ∆ with constant K_α in 1-norm complies with the Lipschitz condition on ∆_η with a constant K_α η^{α−1} (see [17, Appendix]). [sent-176, score-0.159]
78 Moreover, the norm of the gradient ‖∇f(b′′)‖₁ is also bounded, as stated by Lemma 4.2. [sent-177, score-0.08]
79 Let η > 0 and f be an α-Hölder (with constant K_α), bounded and convex function from ∆ to R, with f differentiable everywhere in ∆° (the interior of ∆). [sent-181, score-0.329]
80 Let ρ be a continuous and convex function over ∆, differentiable everywhere in ∆° (the interior of ∆), and satisfying the α-Hölder condition with constant K_α. [sent-186, score-0.297]
81 The error of the approximation ω_B can then be bounded by C·δ_B^α, where C is a scalar constant. [sent-187, score-0.139]
82 4.2 Exact Updates. Knowing that the approximation of ρ is bounded for a wide family of functions, the techniques described in Sec. 3.3 [sent-189, score-0.099]
83 can be directly applied using ω_B(b) as the PWLC reward function. [sent-192, score-0.258]
84 These algorithms can be safely used because the propagation of the error due to exact updates is bounded. [sent-193, score-0.18]
85 Let V̂_t be the value function computed using the PWLC approximation described above and V_t* the optimal value function, both at time t, with H being the exact update operator and Ĥ the same operator with the PWLC approximation. [sent-195, score-0.09]
86 (The bound is obtained by contraction of the update operator and by summing a geometric series, yielding a 1/(1 − γ) factor.) For these algorithms, the selection of the set B remains open, raising issues similar to the selection of belief points in PB algorithms. [sent-197, score-0.228]
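A sketch of the standard argument behind this kind of bound, combining the reward-approximation error C δ_B^α from the previous subsection with the γ-contraction of the exact update operator (our rendering, under the assumptions stated above):

```latex
\[
  \|\hat{H}^t V_0 - H^t V_0\|_\infty
  \;\le\; \sum_{k=0}^{t-1} \gamma^k \,\|\omega_B - \rho\|_\infty
  \;\le\; \frac{C\,\delta_B^{\alpha}}{1-\gamma}.
\]
```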
87 The first term can be bounded by the same reasoning as in [11], where ‖V̂_t − V_t‖_∞ ≤ (R_max − R_min + C δ_B^α) δ_B / (1 − γ), with R_min and R_max the minimum and maximum values of the reward. (Footnote 5: points from ∆'s boundary can be removed where the gradient is not defined, as the proofs only rely on interior points.) [sent-210, score-0.163]
88 5 Conclusions. We have introduced ρPOMDPs, an extension of POMDPs that allows expressing sequential decision-making problems where reducing the uncertainty on some state variables is an explicit objective. [sent-213, score-0.201]
89 In this model, the reward ρ is typically a convex function of the belief state. [sent-214, score-0.521]
90 Using the convexity of ρ, a first important result that we prove is that a Bellman backup Vn = HVn−1 preserves convexity. [sent-215, score-0.102]
91 Yet, if ρ is not PWLC, performing exact updates is much more complex. [sent-217, score-0.109]
92 We therefore propose employing PWLC approximations of the convex reward function at hand to come back to a simple case, and show that the resulting algorithms converge to the optimal value function in the limit. [sent-218, score-0.399]
93 In the robotics field, uncertainty measurements within POMDPs have been widely used as heuristics [2], with very good results but no convergence guarantees. [sent-222, score-0.145]
94 These techniques use only state-dependent rewards, but uncertainty measurements are employed to speed up the solving process, at the cost of losing some basic properties. [sent-223, score-0.11]
95 An important point is that the time complexity of the new algorithms changes only with the size of the approximation of ρ. [sent-229, score-0.101]
96 The optimal control of partially observable Markov decision processes over a finite horizon. [sent-238, score-0.16]
97 Exact and approximate algorithms for partially observable Markov decision processes. [sent-270, score-0.191]
98 Computationally feasible bounds for partially observed Markov decision processes. [sent-290, score-0.08]
99 SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. [sent-315, score-0.287]
100 A POMDP extension with belief-dependent rewards – extended version. [sent-331, score-0.133]
wordName wordTfidf (topN-words)
[('pwlc', 0.467), ('pomdp', 0.37), ('pomdps', 0.368), ('reward', 0.258), ('hvt', 0.234), ('belief', 0.197), ('vt', 0.167), ('vn', 0.154), ('simplex', 0.143), ('lder', 0.112), ('pb', 0.104), ('rewards', 0.096), ('piecewise', 0.094), ('sensing', 0.088), ('mdp', 0.08), ('observable', 0.08), ('uncertainty', 0.077), ('dsc', 0.07), ('active', 0.069), ('convex', 0.066), ('convexity', 0.064), ('updates', 0.062), ('sarsop', 0.062), ('spaan', 0.062), ('pruning', 0.06), ('state', 0.059), ('observability', 0.057), ('bounded', 0.056), ('action', 0.055), ('pineau', 0.053), ('boundary', 0.052), ('agent', 0.05), ('lipschitz', 0.05), ('bellman', 0.049), ('hyperplane', 0.047), ('complies', 0.047), ('mihaylova', 0.047), ('payloads', 0.047), ('perseus', 0.047), ('pez', 0.047), ('exact', 0.047), ('partially', 0.046), ('shannon', 0.046), ('approximations', 0.044), ('entropy', 0.044), ('approximation', 0.043), ('policy', 0.042), ('bn', 0.042), ('kurniawati', 0.041), ('rmin', 0.041), ('hyperplanes', 0.04), ('appendix', 0.04), ('error', 0.04), ('differentiable', 0.039), ('backup', 0.038), ('buffet', 0.038), ('corners', 0.038), ('pbvi', 0.038), ('surveillance', 0.038), ('tangent', 0.037), ('extension', 0.037), ('horizon', 0.036), ('robotics', 0.035), ('decision', 0.034), ('planning', 0.034), ('measurements', 0.033), ('acquiring', 0.033), ('argminx', 0.033), ('hero', 0.033), ('mdps', 0.033), ('meanwhile', 0.032), ('reachable', 0.032), ('points', 0.031), ('interior', 0.031), ('jair', 0.031), ('algorithms', 0.031), ('argmax', 0.03), ('partial', 0.029), ('beliefs', 0.029), ('markov', 0.028), ('functions', 0.028), ('expressing', 0.028), ('ra', 0.028), ('yet', 0.027), ('point', 0.027), ('property', 0.027), ('rmax', 0.026), ('system', 0.026), ('inria', 0.026), ('looking', 0.025), ('generates', 0.025), ('everywhere', 0.025), ('formalism', 0.024), ('approximating', 0.024), ('distance', 0.024), ('visited', 0.024), ('gradient', 0.024), ('continuous', 0.024), ('lemma', 0.024), ('bound', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
2 0.21796936 168 nips-2010-Monte-Carlo Planning in Large POMDPs
Author: David Silver, Joel Veness
Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, MonteCarlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 1018 and 1056 states respectively. Our MonteCarlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs. 1
3 0.19260314 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O T 2/3 (ln T )1/3 , giving the first rigorously proved regret bound for the problem. 1
4 0.18681204 229 nips-2010-Reward Design via Online Gradient Ascent
Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh
Abstract: Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent’s goals different from the designer’s. This gives rise to the optimization problem of designing the artificial agent’s goals—in the RL framework, designing the agent’s reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent’s lifetime nor do they take advantage of knowledge about the agent’s structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent’s lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations. 1
5 0.16436574 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum
Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. 1
6 0.12100951 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
7 0.10828316 4 nips-2010-A Computational Decision Theory for Interactive Assistants
8 0.10820178 43 nips-2010-Bootstrapping Apprenticeship Learning
9 0.10442769 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
10 0.09747903 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
11 0.096521966 192 nips-2010-Online Classification with Specificity Constraints
12 0.086597659 212 nips-2010-Predictive State Temporal Difference Learning
13 0.083829708 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
14 0.081540778 64 nips-2010-Distributionally Robust Markov Decision Processes
15 0.079616159 152 nips-2010-Learning from Logged Implicit Exploration Data
16 0.079153299 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
17 0.077962942 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
18 0.077081166 268 nips-2010-The Neural Costs of Optimal Control
19 0.074064799 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
20 0.073710479 203 nips-2010-Parametric Bandits: The Generalized Linear Case
topicId topicWeight
[(0, 0.19), (1, -0.197), (2, 0.013), (3, 0.001), (4, -0.013), (5, 0.042), (6, -0.064), (7, 0.01), (8, 0.004), (9, 0.15), (10, -0.042), (11, 0.044), (12, -0.004), (13, -0.077), (14, 0.009), (15, -0.059), (16, -0.145), (17, 0.054), (18, 0.06), (19, -0.004), (20, 0.005), (21, -0.051), (22, -0.024), (23, -0.073), (24, 0.066), (25, -0.054), (26, -0.037), (27, -0.038), (28, 0.077), (29, -0.121), (30, -0.059), (31, 0.019), (32, 0.032), (33, -0.019), (34, -0.036), (35, -0.037), (36, 0.005), (37, 0.049), (38, 0.04), (39, -0.064), (40, 0.004), (41, 0.008), (42, 0.024), (43, -0.137), (44, 0.073), (45, -0.018), (46, -0.04), (47, 0.026), (48, -0.088), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.93085021 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
2 0.88115412 229 nips-2010-Reward Design via Online Gradient Ascent
Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh
Abstract: Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent’s goals different from the designer’s. This gives rise to the optimization problem of designing the artificial agent’s goals—in the RL framework, designing the agent’s reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent’s lifetime nor do they take advantage of knowledge about the agent’s structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent’s lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations. 1 The Optimal Reward Problem In this work, we consider the scenario of an agent designer building an autonomous agent. The designer has his or her own goals which must be translated into goals for the autonomous agent. We represent goals using the Reinforcement Learning (RL) formalism of the reward function. This leads to the optimal reward problem of designing the agent’s reward function so as to maximize the objective reward received by the agent designer. Typically, the designer assigns his or her own reward to the agent. However, there is ample work which demonstrates the benefit of assigning reward which does not match the designer’s. For example, work on reward shaping [11] has shown how to modify rewards to accelerate learning without altering the optimal policy, and PAC-MDP methods [5, 20] including approximate Bayesian methods [7, 19] add bonuses to the objective reward to achieve optimism under uncertainty. These approaches explicitly or implicitly assume that the asymptotic behavior of the agent should be the same as that which would occur using the objective reward function. These methods do not explicitly consider the optimal reward problem; however, they do show improved performance through reward modification. In our recent work that does explicitly consider the optimal reward problem [18], we analyzed an explicit hypothesis about the benefit of reward design—that it helps mitigate the performance loss caused by computational constraints (bounds) on agent architectures. We considered various types of agent limitations—limits on planning depth, failure to account for partial observability, and other erroneous modeling assumptions—and demonstrated the benefits of good reward functions in each case empirically. Crucially, in bounded agents, the optimal reward function often leads to behavior that is different from the asymptotic behavior achieved with the objective reward function. In this work, we develop an algorithm, Policy Gradient for Reward Design (PGRD), for improving reward functions for a family of bounded agents that behave according to repeated local (from the current state) model-based planning. We show that this algorithm is capable of improving the reward functions in agents with computational limitations necessitating small bounds on the depth of planning, and also from the use of an inaccurate model (which may be inaccurate due to computationally-motivated approximations). 
PGRD has few parameters, improves the reward function online during an agent’s lifetime, takes advantage of knowledge about the agent’s structure (through the gradient computation), and is linear in the number of reward function parameters. Notation. Formally, we consider discrete-time partially-observable environments with a finite number of hidden states s ∈ S, actions a ∈ A, and observations o ∈ O; these finite set assumptions are useful for our theorems, but our algorithm can handle infinite sets in practice. Its dynamics are governed by a state-transition function P(s′|s, a) that defines a distribution over next-states s′ conditioned on current state s and action a, and an observation function Ω(o|s) that defines a distribution over observations o conditioned on current state s. The agent designer’s goals are specified via the objective reward function RO. At each time step, the designer receives reward RO(st) ∈ [0, 1] based on the current state st of the environment, where the subscript denotes time. The designer’s objective return is the expected mean objective reward obtained over an infinite horizon, i.e., lim_{N→∞} E[(1/N) ∑_{t=0}^{N} RO(st)]. In the standard view of RL, the agent uses the same reward function as the designer to align the interests of the agent and the designer. Here we allow for a separate agent reward function R(·). An agent’s reward function can in general be defined in terms of the history of actions and observations, but is often more pragmatically defined in terms of some abstraction of history. We define the agent’s reward function precisely in Section 2. Optimal Reward Problem. An RL agent attempts to act so as to maximize its own cumulative reward, or return. Crucially, as a result, the sequence of environment-states {st}∞_{t=0} is affected by the choice of reward function; therefore, the agent designer’s return is affected as well. The optimal reward problem arises from the fact that while the objective reward function is fixed as part of the problem description, the reward function is a choice to be made by the designer. We capture this choice abstractly by letting the reward be parameterized by some vector of parameters θ chosen from space of parameters Θ. Each θ ∈ Θ specifies a reward function R(·; θ) which in turn produces a distribution over environment state sequences via whatever RL method the agent uses. The expected return obtained by the designer for choice θ is U(θ) = lim_{N→∞} E[(1/N) ∑_{t=0}^{N} RO(st) | R(·; θ)]. The optimal reward parameters are given by the solution to the optimal reward problem [16, 17, 18]: θ∗ = arg max_{θ∈Θ} U(θ) = arg max_{θ∈Θ} lim_{N→∞} E[(1/N) ∑_{t=0}^{N} RO(st) | R(·; θ)].   (1) Our previous research on solving the optimal reward problem has focused primarily on the properties of the optimal reward function and its correspondence to the agent architecture and the environment [16, 17, 18]. This work has used inefficient exhaustive search methods for finding good approximations to θ∗ (though there is recent work on using genetic algorithms to do this [6, 9, 12]). Our primary contribution in this paper is a new convergent online stochastic gradient method for finding approximately optimal reward functions. To our knowledge, this is the first algorithm that improves reward functions in an online setting—during a single agent’s lifetime. In Section 2, we present the PGRD algorithm, prove its convergence, and relate it to OLPOMDP [2], a policy gradient algorithm.
In Section 3, we present experiments demonstrating PGRD’s ability to approximately solve the optimal reward problem online. 2 PGRD: Policy Gradient for Reward Design PGRD builds on the following insight: the agent’s planning algorithm procedurally converts the reward function into behavior; thus, the reward function can be viewed as a specific parameterization of the agent’s policy. Using this insight, PGRD updates the reward parameters by estimating the gradient of the objective return with respect to the reward parameters, ∇θ U(θ), from experience, using standard policy gradient techniques. In fact, we show that PGRD can be viewed as an (independently interesting) generalization of the policy gradient method OLPOMDP [2]. Specifically, we show that OLPOMDP is a special case of PGRD when the planning depth d is zero. In this section, we first present the family of local planning agents for which PGRD improves the reward function. Next, we develop PGRD and prove its convergence. Finally, we show that PGRD generalizes OLPOMDP and discuss how adding planning to OLPOMDP affects the space of policies available to the optimization method.
Input: T, θ0, {αt}∞_{t=0}, β, γ
1: o0, i0 = initializeStart();
2: for t = 0, 1, 2, 3, . . . do
3:   ∀a Qt(a; θt) = plan(it, ot, T, R(it, ·, ·; θt), d, γ);
4:   at ∼ µ(a|it; Qt);
5:   rt+1, ot+1 = takeAction(at);
6:   zt+1 = βzt + ∇θt µ(at|it; Qt) / µ(at|it; Qt);
7:   θt+1 = θt + αt(rt+1 zt+1 − λθt);
8:   it+1 = updateInternalState(it, at, ot+1);
9: end
Figure 1: PGRD (Policy Gradient for Reward Design) Algorithm
A Family of Limited Agents with Internal State. Given a Markov model T defined over the observation space O and action space A, denote T(o′|o, a) the probability of next observation o′ given that the agent takes action a after observing o. Our agents use the model T to plan. We do not assume that the model T is an accurate model of the environment. The use of an incorrect model is one type of agent limitation we examine in our experiments. In general, agents can use non-Markov models defined in terms of the history of observations and actions; we leave this for future work. The agent maintains an internal state feature vector it that is updated at each time step using it+1 = updateInternalState(it, at, ot+1). The internal state allows the agent to use reward functions that depend on the agent’s history. We consider rewards of the form R(it, o, a; θt) = θt^T φ(it, o, a), where θt is the reward parameter vector at time t, and φ(it, o, a) is a vector of features based on internal state it, planning state o, and action a. Note that if φ is a vector of binary indicator features, this representation allows for arbitrary reward functions and thus the representation is completely general. Many existing methods use reward functions that depend on history. Reward functions based on empirical counts of observations, as in PAC-MDP approaches [5, 20], provide some examples; see [14, 15, 13] for others. We present a concrete example in our empirical section. At each time step t, the agent’s planning algorithm, plan, performs depth-d planning using the model T and reward function R(it, o, a; θt) with current internal state it and reward parameters θt. Specifically, the agent computes a d-step Q-value function Q^d(it, ot, a; θt) ∀a ∈ A, where Q^d(it, o, a; θt) = R(it, o, a; θt) + γ ∑_{o′∈O} T(o′|o, a) max_{b∈A} Q^{d−1}(it, o′, b; θt) and Q^0(it, o, a; θt) = R(it, o, a; θt).
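To make the depth-d backup concrete, here is a minimal Python sketch of a plan-style routine under tabular assumptions: model[(o, a)] is a hypothetical dictionary mapping each next observation o′ to T(o′|o, a), and reward(o, a) stands in for the parameterized reward R(it, ·, ·; θt). It is an illustration of the recursion, not the authors' implementation.

```python
def plan(model, reward, actions, o, d, gamma):
    """Depth-d planning: returns {a: Q^d(o, a)} for the current observation o.

    model[(o, a)] is assumed to map each next observation o' to T(o'|o, a);
    reward(o, a) stands in for the current parameterized reward R(i_t, o, a; theta_t).
    """
    def q(obs, a, depth):
        r = reward(obs, a)
        if depth == 0:
            return r                      # Q^0 = R
        future = sum(p * max(q(o_next, b, depth - 1) for b in actions)
                     for o_next, p in model[(obs, a)].items())
        return r + gamma * future         # Q^d = R + gamma * E[max_b Q^{d-1}]

    return {a: q(o, a, d) for a in actions}
```

Calling this sketch with d = 0 simply scores each action by its immediate reward, which matches the OLPOMDP special case discussed later.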
We emphasize that the internal state it and reward parameters θt are held invariant while planning. Note that the d-step Q-values are only computed for the current observation ot, in effect by building a depth-d tree rooted at ot. In the d = 0 special case, the planning procedure completely ignores the model T and returns Q^0(it, ot, a; θt) = R(it, ot, a; θt). Regardless of the value of d, we treat the end result of planning as providing a scoring function Qt(a; θt) where the dependence on d, it and ot is dropped from the notation. To allow for gradient calculations, our agents act according to the Boltzmann (soft-max) stochastic policy parameterized by Q: µ(a|it; Qt) ≝ e^{τ Qt(a; θt)} / ∑_{b∈A} e^{τ Qt(b; θt)}, where τ is a temperature parameter that determines how stochastically the agent selects the action with the highest score. When the planning depth d is small due to computational limitations, the agent cannot account for events beyond the planning depth. We examine this limitation in our experiments. Gradient Ascent. To develop a gradient algorithm for improving the reward function, we need to compute the gradient of the objective return with respect to θ: ∇θ U(θ). The main insight is to break the gradient calculation into the calculation of two gradients. The first is the gradient of the objective return with respect to the policy µ, and the second is the gradient of the policy with respect to the reward function parameters θ. The first gradient is exactly what is computed in standard policy gradient approaches [2]. The second gradient is challenging because the transformation from reward parameters to policy involves a model-based planning procedure. We draw from the work of Neu and Szepesvári [10] which shows that this gradient computation resembles planning itself. We develop PGRD, presented in Figure 1, explicitly as a generalization of OLPOMDP, a policy gradient algorithm developed by Bartlett and Baxter [2], because of its foundational simplicity relative to other policy-gradient algorithms such as those based on actor-critic methods (e.g., [4]). Notably, the reward parameters are the only parameters being learned in PGRD. PGRD follows the form of OLPOMDP (Algorithm 1 in Bartlett and Baxter [2]) but generalizes it in three places. In Figure 1 line 3, the agent plans to compute the policy, rather than storing the policy directly. In line 6, the gradient of the policy with respect to the parameters accounts for the planning procedure. In line 8, the agent maintains a general notion of internal state that allows for richer parameterization of policies than typically considered (similar to Aberdeen and Baxter [1]). The algorithm takes as parameters a sequence of learning rates {αk}, a decaying-average parameter β, and regularization parameter λ > 0 which keeps the reward parameters θ bounded throughout learning. Given a sequence of calculations of the gradient of the policy with respect to the parameters, ∇θt µ(at|it; Qt), the remainder of the algorithm climbs the gradient of objective return ∇θ U(θ) using OLPOMDP machinery. In the next subsection, we discuss how to compute ∇θt µ(at|it; Qt). Computing the Gradient of the Policy with respect to Reward. For the Boltzmann distribution, the gradient of the policy with respect to the reward parameters is given by the equation ∇θt µ(a|it; Qt) = τ · µ(a|Qt)[∇θt Qt(a; θt) − ∑_{b∈A} µ(b|Qt) ∇θt Qt(b; θt)], where τ is the Boltzmann temperature (see [10]). Thus, computing ∇θt µ(a|it; Qt) reduces to computing ∇θt Qt(a; θt).
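The soft-max policy and its gradient with respect to the reward parameters can be sketched as follows, assuming the per-action gradients ∇θt Qt(a; θt) are already available (for instance from the Theorem 1 recursion below). The NumPy interface is an illustrative choice, not taken from the paper.

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """mu(a) proportional to exp(tau * Q(a)), computed stably."""
    prefs = tau * np.asarray(q_values, dtype=float)
    prefs -= prefs.max()
    weights = np.exp(prefs)
    return weights / weights.sum()

def policy_gradient_wrt_reward(q_values, grad_q, tau):
    """Returns d mu(a) / d theta for every action (shape: actions x params).

    grad_q[a] is assumed to be the vector nabla_theta Q_t(a; theta_t) for action a.
    """
    mu = boltzmann_policy(q_values, tau)
    grad_q = np.asarray(grad_q, dtype=float)
    baseline = mu @ grad_q                      # sum_b mu(b) * grad Q(b)
    return tau * mu[:, None] * (grad_q - baseline)
```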
The value of Qt depends on the reward parameters θt, the model, and the planning depth. However, as we present below, the process of computing the gradient closely resembles the process of planning itself, and the two computations can be interleaved. Theorem 1 presented below is an adaptation of Proposition 4 from Neu and Szepesvári [10]. It presents the gradient computation for depth-d planning as well as for infinite-depth discounted planning. We assume that the gradient of the reward function with respect to the parameters is bounded: sup_{θ,o,i,a} ‖∇θ R(i, o, a, θ)‖ < ∞. The proof of the theorem follows directly from Proposition 4 of Neu and Szepesvári [10]. Theorem 1. Except on a set of measure zero, for any depth d, the gradient ∇θ Q^d(o, a; θ) exists and is given by the recursion (where we have dropped the dependence on i for simplicity) ∇θ Q^d(o, a; θ) = ∇θ R(o, a; θ) + γ ∑_{o′∈O} T(o′|o, a) ∑_{b∈A} π^{d−1}(b|o′) ∇θ Q^{d−1}(o′, b; θ),   (2) where ∇θ Q^0(o, a; θ) = ∇θ R(o, a; θ) and π^d(a|o) ∈ arg max_a Q^d(o, a; θ) is any policy that is greedy with respect to Q^d. The result also holds for ∇θ Q∗(o, a; θ) = ∇θ lim_{d→∞} Q^d(o, a; θ). The Q-function will not be differentiable when there are multiple optimal policies. This is reflected in the arbitrary choice of π in the gradient calculation. However, it was shown by Neu and Szepesvári [10] that even for values of θ which are not differentiable, the above computation produces a valid calculation of a subgradient; we discuss this below in our proof of convergence of PGRD. Convergence of PGRD (Figure 1). Given a particular fixed reward function R(·; θ), transition model T, and planning depth, there is a corresponding fixed randomized policy µ(a|i; θ)—where we have explicitly represented the reward’s dependence on the internal state vector i in the policy parameterization and dropped Q from the notation as it is redundant given that everything else is fixed. Denote the agent’s internal-state update as a (usually deterministic) distribution ψ(i′|i, a, o). Given a fixed reward parameter vector θ, the joint environment-state–internal-state transitions can be modeled as a Markov chain with a |S||I| × |S||I| transition matrix M(θ) whose entries are given by M_{⟨s,i⟩,⟨s′,i′⟩}(θ) = p(⟨s′, i′⟩ | ⟨s, i⟩; θ) = ∑_{o,a} ψ(i′|i, a, o) Ω(o|s′) P(s′|s, a) µ(a|i; θ). We make the following assumptions about the agent and the environment: Assumption 1. The transition matrix M(θ) of the joint environment-state–internal-state Markov chain has a unique stationary distribution π(θ) = [π_{s1,i1}(θ), π_{s2,i2}(θ), . . . , π_{s|S|,i|I|}(θ)] satisfying the balance equations π(θ)M(θ) = π(θ), for all θ ∈ Θ. Assumption 2. During its execution, PGRD (Figure 1) does not reach a value of it and θt at which µ(at|it, Qt) is not differentiable with respect to θt. It follows from Assumption 1 that the objective return, U(θ), is independent of the start state. The original OLPOMDP convergence proof [2] has a similar condition that only considers environment states. Intuitively, this condition allows PGRD to handle history-dependence of a reward function in the same manner that it handles partial observability in an environment. Assumption 2 accounts for the fact that a planning algorithm may not be fully differentiable everywhere. However, Theorem 1 showed that infinite and bounded-depth planning is differentiable almost everywhere (in a measure theoretic sense).
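A rough sketch of the interleaved computation suggested by Theorem 1 above, propagating ∇θ Q^d alongside Q^d with a greedy choice at each level. It reuses the hypothetical model/reward interfaces from the planning sketch earlier and adds grad_reward(o, a) for ∇θ R(o, a; θ); none of these names come from the paper.

```python
import numpy as np

def plan_with_gradient(model, reward, grad_reward, actions, o, d, gamma):
    """Returns {a: (Q^d(o, a), nabla_theta Q^d(o, a))} following the recursion in Eq. (2).

    grad_reward(o, a) is assumed to return the vector nabla_theta R(o, a; theta).
    """
    def recurse(obs, a, depth):
        q = reward(obs, a)
        g = np.asarray(grad_reward(obs, a), dtype=float).copy()
        if depth == 0:
            return q, g
        for o_next, p in model[(obs, a)].items():
            children = [recurse(o_next, b, depth - 1) for b in actions]
            best = max(range(len(actions)), key=lambda k: children[k][0])  # greedy pi^{d-1}
            q += gamma * p * children[best][0]
            g += gamma * p * children[best][1]
        return q, g

    return {a: recurse(o, a, d) for a in actions}
```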
Furthermore, this assumption is perhaps stronger than necessary, as stochastic approximation algorithms, which provide the theory upon which OLPOMDP is based, have been shown to converge using subgradients [8]. In order to state the convergence theorem, we must define the approximate gradient which OLPOMDP calculates. Let the approximate gradient estimate be ∇θ^β U(θ) ≝ lim_{T→∞} ∑_{t=1}^{T} rt zt for a fixed θ and PGRD parameter β, where zt (in Figure 1) represents a time-decaying average of the ∇θt µ(at|it, Qt) calculations. It was shown by Bartlett and Baxter [2] that ∇θ^β U(θ) is close to the true value ∇θ U(θ) for large values of β. Theorem 2 proves that PGRD converges to a stable equilibrium point based on this approximate gradient measure. This equilibrium point will typically correspond to some local optimum in the return function U(θ). Given our development and assumptions, the theorem is a straightforward extension of Theorem 6 from Bartlett and Baxter [2] (proof omitted). Theorem 2. Given β ∈ [0, 1), λ > 0, and a sequence of step sizes αt satisfying ∑_{t=0}^{∞} αt = ∞ and ∑_{t=0}^{∞} (αt)^2 < ∞, PGRD produces a sequence of reward parameters θt such that θt → L as t → ∞ a.s., where L is the set of stable equilibrium points of the differential equation ∂θ/∂t = ∇θ^β U(θ) − λθ. PGRD generalizes OLPOMDP. As stated above, OLPOMDP, when it uses a Boltzmann distribution in its policy representation (a common case), is a special case of PGRD when the planning depth is zero. First, notice that in the case of depth-0 planning, Q^0(i, o, a; θ) = R(i, o, a, θ), regardless of the transition model and reward parameterization. We can also see from Theorem 1 that ∇θ Q^0(i, o, a; θ) = ∇θ R(i, o, a; θ). Because R(i, o, a; θ) can be parameterized arbitrarily, PGRD can be configured to match standard OLPOMDP with any policy parameterization that also computes a score function for the Boltzmann distribution. In our experiments, we demonstrate that choosing a planning depth d > 0 can be beneficial over using OLPOMDP (d = 0). In the remainder of this section, we show theoretically that choosing d > 0 does not hurt in the sense that it does not reduce the space of policies available to the policy gradient method. Specifically, we show that when using an expressive enough reward parameterization, PGRD’s space of policies is not restricted relative to OLPOMDP’s space of policies. We prove the result for infinite planning, but the extension to depth-limited planning is straightforward. Theorem 3. There exists a reward parameterization such that, for an arbitrary transition model T, the space of policies representable by PGRD with infinite planning is identical to the space of policies representable by PGRD with depth 0 planning. Proof. Ignoring internal state for now (holding it constant), let C(o, a) be an arbitrary reward function used by PGRD with depth 0 planning. Let R(o, a; θ) be a reward function for PGRD with infinite (d = ∞) planning. The depth-∞ agent uses the planning result Q∗(o, a; θ) to act, while the depth-0 agent uses the function C(o, a) to act. Therefore, it suffices to show that one can always choose θ such that the planning solution Q∗(o, a; θ) equals C(o, a). For all o ∈ O, a ∈ A, set R(o, a; θ) = C(o, a) − γ ∑_{o′} T(o′|o, a) max_{a′} C(o′, a′). Substituting Q∗ for C, this is the Bellman optimality equation [22] for infinite-horizon planning. Setting R(o, a; θ) as above is possible if it is parameterized by a table with an entry for each observation–action pair.
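The reward construction used in the proof of Theorem 3 amounts to a one-pass table computation. The snippet below is a sketch under the same hypothetical tabular interfaces as above, with scores playing the role of the arbitrary scoring function C; it is not code from the paper.

```python
def reward_from_scores(scores, model, observations, actions, gamma):
    """Builds R(o, a) = C(o, a) - gamma * sum_{o'} T(o'|o, a) * max_{a'} C(o', a').

    scores[(o, a)] plays the role of the arbitrary scoring function C.
    With this R, infinite-horizon planning recovers Q* = C.
    """
    best = {o: max(scores[(o, a)] for a in actions) for o in observations}
    return {
        (o, a): scores[(o, a)] - gamma * sum(p * best[o_next]
                                             for o_next, p in model[(o, a)].items())
        for o in observations
        for a in actions
    }
```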
Theorem 3 also shows that the effect of an arbitrarily poor model can be overcome with a good choice of reward function. This is because a Boltzmann distribution can, allowing for an arbitrary scoring function C, represent any policy. We demonstrate this ability of PGRD in our experiments. 3 Experiments The primary objective of our experiments is to demonstrate that PGRD is able to use experience online to improve the reward function parameters, thereby improving the agent’s obtained objective return. Specifically, we compare the objective return achieved by PGRD to the objective return achieved by PGRD with the reward adaptation turned off. In both cases, the reward function is initialized to the objective reward function. A secondary objective is to demonstrate that when a good model is available, adding the ability to plan—even for small depths—improves performance relative to the baseline algorithm of OLPOMDP (or equivalently PGRD with depth d = 0). Foraging Domain for Experiments 1 to 3: The foraging environment illustrated in Figure 2(a) is a 3 × 3 grid world with 3 dead-end corridors (rows) separated by impassable walls. The agent (bird) has four available actions corresponding to each cardinal direction. Movement in the intended direction fails with probability 0.1, resulting in movement in a random direction. If the resulting direction is blocked by a wall or the boundary, the action results in no movement. Figure 2: A) Foraging Domain, B) Performance of PGRD with observation-action reward features, C) Performance of PGRD with recency reward features. There is a food source (worm) located in one of the three right-most locations at the end of each corridor. The agent has an eat action, which consumes the worm when the agent is at the worm’s location. After the agent consumes the worm, a new worm appears randomly in one of the other two potential worm locations. Objective Reward for the Foraging Domain: The designer’s goal is to maximize the average number of worms eaten per time step. Thus, the objective reward function RO provides a reward of 1.0 when the agent eats a worm, and a reward of 0 otherwise. The objective return is defined as in Equation (1). Experimental Methodology: We tested PGRD for depth-limited planning agents of depths 0–6. Recall that PGRD for the agent with planning depth 0 is the OLPOMDP algorithm. For each depth, we jointly optimized over the PGRD algorithm parameters, α and β (we use a fixed α throughout learning). We tested values for α on an approximate logarithmic scale in the range (10^−6, 10^−2) as well as the special value of α = 0, which corresponds to an agent that does not adapt its reward function. We tested β values in the set 0, 0.4, 0.7, 0.9, 0.95, 0.99. Following common practice [3], we set the λ parameter to 0. We explicitly bound the reward parameters and capped the reward function output both to the range [−1, 1]. We used a Boltzmann temperature parameter of τ = 100 and planning discount factor γ = 0.95. Because we initialized θ so that the initial reward function was the objective reward function, PGRD with α = 0 was equivalent to a standard depth-limited planning agent.
Experiment 1: A fully observable environment with a correct model learned online. In this experiment, we improve the reward function in an agent whose only limitation is planning depth, using (1) a general reward parameterization based on the current observation and (2) a more compact reward parameterization which also depends on the history of observations. Observation: The agent observes the full state, which is given by the pair o = (l, w), where l is the agent’s location and w is the worm’s location. Learning a Correct Model: Although the theorem of convergence of PGRD relies on the agent having a fixed model, the algorithm itself is readily applied to the case of learning a model online. In this experiment, the agent’s model T is learned online based on empirical transition probabilities between observations (recall this is a fully observable environment); a sketch of this count-based estimate follows below. Let n_{o,a,o′} be the number of times that o′ was reached after taking action a after observing o. The agent models the probability of seeing o′ as T(o′|o, a) = n_{o,a,o′} / ∑_{o′′} n_{o,a,o′′}. Reward Parameterizations: Recall that R(i, o, a; θ) = θ^T φ(i, o, a), for some φ(i, o, a). (1) In the observation-action parameterization, φ(i, o, a) is a binary feature vector with one binary feature for each observation-action pair—internal state is ignored. This is effectively a table representation over all reward functions indexed by (o, a). As shown in Theorem 3, the observation-action feature representation is capable of producing arbitrary policies over the observations. In large problems, such a parameterization would not be feasible. (2) The recency parameterization is a more compact representation which uses features that rely on the history of observations. The feature vector is φ(i, o, a) = [RO(o, a), 1, φ_cl(l, i), φ_{cl,a}(l, a, i)], where RO(o, a) is the objective reward function defined as above. The feature φ_cl(l, i) = 1 − 1/c(l, i), where c(l, i) is the number of time steps since the agent has visited location l, as represented in the agent’s internal state i. Its value is normalized to the range [0, 1) and is high when the agent has not been to location l recently. The feature φ_{cl,a}(l, a, i) = 1 − 1/c(l, a, i) is similarly defined with respect to the time since the agent has taken action a in location l. Features based on recency counts encourage persistent exploration [21, 18]. Results & Discussion: Figure 2(b) and Figure 2(c) present results for agents that use the observation-action parameterization and the recency parameterization of the reward function respectively. The horizontal axis is the number of time steps of experience. The vertical axis is the objective return, i.e., the average objective reward per time step. Each curve is an average over 130 trials. The values of d and the associated optimal algorithm parameters for each curve are noted in the figures. First, note that with d = 6, the agent is unbounded, because food is never more than 6 steps away. Therefore, the agent does not benefit from adapting the reward function parameters (given that we initialize to the objective reward function). Indeed, the d = 6, α = 0 agent performs as well as the best reward-optimizing agent. The performance for d = 6 improves with experience because the model improves with experience (and thus from the curves it is seen that the model gets quite accurate in about 1500 time steps). The largest objective return obtained for d = 6 is also the best objective return that can be obtained for any value of d.
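The count-based transition estimate referenced above can be sketched as follows; the EmpiricalModel class and its method names are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict

class EmpiricalModel:
    """Maximum-likelihood estimate of T(o'|o, a) from observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # (o, a) -> {o': n}

    def update(self, o, a, o_next):
        self.counts[(o, a)][o_next] += 1

    def prob(self, o_next, o, a):
        seen = self.counts[(o, a)]
        total = sum(seen.values())
        if total == 0:
            return 0.0            # unvisited (o, a) pair: no estimate yet
        return seen[o_next] / total
```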
Several results can be observed in both Figures 2(b) and (c). 1) Each curve that uses α > 0 (solid lines) improves with experience. This is a demonstration of our primary contribution, that PGRD is able to effectively improve the reward function with experience. That the improvement over time is not just due to model learning is seen in the fact that for each value of d < 6 the curve for α > 0 (solid-line) which adapts the reward parameters does significantly better than the corresponding curve for α = 0 (dashed-line); the α = 0 agents still learn the model. 2) For both α = 0 and α > 0 agents, the objective return obtained by agents with equivalent amounts of experience increases monotonically as d is increased (though to maintain readability we only show selected values of d in each figure). This demonstrates our secondary contribution, that the ability to plan in PGRD significantly improves performance over standard OLPOMDP (PGRD with d = 0). There are also some interesting differences between the results for the two different reward function parameterizations. With the observation-action parameterization, we noted that there always exists a setting of θ for all d that will yield optimal objective return. This is seen in Figure 2(b) in that all solid-line curves approach optimal objective return. In contrast, the more compact recency reward parameterization does not afford this guarantee and indeed for small values of d (< 3), the solid-line curves in Figure 2(c) converge to less than optimal objective return. Notably, OLPOMDP (d = 0) does not perform well with this feature set. On the other hand, for planning depths 3 ≤ d < 6, the PGRD agents with the recency parameterization achieve optimal objective return faster than the corresponding PGRD agent with the observation-action parameterization. Finally, we note that this experiment validates our claim that PGRD can improve reward functions that depend on history. Experiment 2: A fully observable environment and poor given model. Our theoretical analysis showed that PGRD with an incorrect model and the observation–action reward parameterization should (modulo local maxima issues) do just as well asymptotically as it would with a correct model. Here we illustrate this theoretical result empirically on the same foraging domain and objective reward function used in Experiment 1. We also test our hypothesis that a poor model should slow down the rate of learning relative to a correct model. Poor Model: We gave the agents a fixed incorrect model of the foraging environment that assumes there are no internal walls separating the 3 corridors. Reward Parameterization: We used the observation–action reward parameterization. With a poor model it is no longer interesting to initialize θ so that the initial reward function is the objective reward function because even for d = 6 such an agent would do poorly. Furthermore, we found that this initialization leads to excessively bad exploration and therefore poor learning of how to modify the reward. Thus, we initialize θ to uniform random values near 0, in the range (−10^−3, 10^−3). Results: Figure 3(a) plots the objective return as a function of number of steps of experience. Each curve is an average over 36 trials. As hypothesized, the bad model slows learning by a factor of more than 10 (notice the difference in the x-axis scales from those in Figure 2). Here, deeper planning results in slower learning and indeed the d = 0 agent that does not use the model at all learns the fastest.
However, also as hypothesized, because they used the expressive observation–action parameterization, agents of all planning depths mitigate the damage caused by the poor model and eventually converge to the optimal objective return. Experiment 3: Partially observable foraging world. Here we evaluate PGRD’s ability to learn in a partially observable version of the foraging domain. In addition, the agents learn a model under the erroneous (and computationally convenient) assumption that the domain is fully observable. Figure 3: A) Performance of PGRD with a poor model, B) Performance of PGRD in a partially observable world with recency reward features, C) Performance of PGRD in Acrobot. Partial Observation: Instead of viewing the location of the worm at all times, the agent can now only see the worm when it is colocated with it: its observation is o = (l, f), where f indicates whether the agent is colocated with the food. Learning an Incorrect Model: The model is learned just as in Experiment 1. Because of the erroneous full observability assumption, the model will hallucinate about worms at all the corridor ends based on the empirical frequency of having encountered them there. Reward Parameterization: We used the recency parameterization; due to the partial observability, agents with the observation–action feature set perform poorly in this environment. The parameters θ are initialized such that the initial reward function equals the objective reward function. Results & Discussion: Figure 3(b) plots the mean of 260 trials. As seen in the solid-line curves, PGRD improves the objective return at all depths (only a small amount for d = 0 and significantly more for d > 0). In fact, agents which don’t adapt the reward are hurt by planning (relative to d = 0). This experiment demonstrates that the combination of planning and reward improvement can be beneficial even when the model is erroneous. Because of the partial observability, optimal behavior in this environment achieves less objective return than in Experiment 1. Experiment 4: Acrobot. In this experiment we test PGRD in the Acrobot environment [22], a common benchmark task in the RL literature and one that has previously been used in the testing of policy gradient approaches [23]. This experiment demonstrates PGRD in an environment in which an agent must be limited due to the size of the state space and further demonstrates that adding model-based planning to policy gradient approaches can improve performance. Domain: The version of Acrobot we use is as specified by Sutton and Barto [22]. It is a two-link robot arm in which the position of one shoulder-joint is fixed and the agent’s control is limited to 3 actions which apply torque to the elbow-joint. Observation: The fully-observable state space is 4 dimensional, with two joint angles ψ1 and ψ2, and two joint velocities ψ̇1 and ψ̇2. Objective Reward: The designer receives an objective reward of 1.0 when the tip is one arm’s length above the fixed shoulder-joint, after which the bot is reset to its initial resting position.
Model: We provide the agent with a perfect model of the environment. Because the environment is continuous, value iteration is intractable, and computational limitations prevent planning deep enough to compute the optimal action in any state. The feature vector contains 13 entries. One feature corresponds to the objective reward signal. For each action, there are 5 features corresponding to each of the state features plus an additional feature representing the height of the tip: φ(i, o, a) = [RO(o), {ψ1(o), ψ2(o), ψ̇1(o), ψ̇2(o), h(o)}a]. The height feature has been used in previous work as an alternative definition of objective reward [23]. Results & Discussion: We plot the mean of 80 trials in Figure 3(c). Agents that use the fixed (α = 0) objective reward function with bounded-depth planning perform according to the bottom two curves. Allowing PGRD and OLPOMDP to adapt the parameters θ leads to improved objective return, as seen in the top two curves in Figure 3(c). Finally, the PGRD d = 6 agent outperforms the standard OLPOMDP agent (PGRD with d = 0), further demonstrating that PGRD outperforms OLPOMDP. Overall Conclusion: We developed PGRD, a new method for approximately solving the optimal reward problem in bounded planning agents that can be applied in an online setting. We showed that PGRD is a generalization of OLPOMDP and demonstrated that it both improves reward functions in limited agents and outperforms the model-free OLPOMDP approach. References [1] Douglas Aberdeen and Jonathan Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. Proceedings of the Nineteenth International Conference on Machine Learning, 2002. [2] Peter L. Bartlett and Jonathan Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000. [3] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001. [4] Shalabh Bhatnagar, Richard S. Sutton, M. Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 2009. [5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2001. [6] S. Elfwing, Eiji Uchibe, K. Doya, and H. I. Christensen. Co-evolution of Shaping Rewards and Meta-Parameters in Reinforcement Learning. Adaptive Behavior, 16(6):400–412, 2008. [7] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, 2009. [8] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2010. [9] Çetin Meriçli, Tekin Meriçli, and H. Levent Akın. A Reward Function Generation Method Using Genetic Algorithms: A Robot Soccer Case Study (Extended Abstract). In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), number 2, pages 1513–1514, 2010. [10] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007. [11] Andrew Y. Ng, Stuart J. Russell, and D. Harada. Policy invariance under reward transformations: Theory and application to reward shaping.
In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999. [12] Scott Niekum, Andrew G. Barto, and Lee Spector. Genetic Programming for Reward Function Search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010. [13] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, April 2007. [14] Jürgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on Neural Networks, pages 1458–1463, 1991. [15] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS), pages 1281–1288, 2005. [16] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where Do Rewards Come From? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009. [17] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010. [18] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Internal Rewards Mitigate Agent Boundedness. In Proceedings of the 27th International Conference on Machine Learning, 2010. [19] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010. [20] Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008. [21] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In The Seventh International Conference on Machine Learning, pages 216–224, 1990. [22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. [23] Lex Weaver and Nigel Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.
3 0.77919531 168 nips-2010-Monte-Carlo Planning in Large POMDPs
Author: David Silver, Joel Veness
Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately 10^18 and 10^56 states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs. 1
4 0.76576239 4 nips-2010-A Computational Decision Theory for Interactive Assistants
Author: Alan Fern, Prasad Tadepalli
Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution. 1
5 0.68293291 93 nips-2010-Feature Construction for Inverse Reinforcement Learning
Author: Sergey Levine, Zoran Popovic, Vladlen Koltun
Abstract: The goal of inverse reinforcement learning is to find a reward function for a Markov decision process, given example traces from its optimal policy. Current IRL techniques generally rely on user-supplied features that form a concise basis for the reward. We present an algorithm that instead constructs reward features from a large collection of component features, by building logical conjunctions of those component features that are relevant to the example policy. Given example traces, the algorithm returns a reward function as well as the constructed features. The reward function can be used to recover a full, deterministic, stationary policy, and the features can be used to transplant the reward function into any novel environment on which the component features are well defined. 1
6 0.65813482 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.65206861 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
8 0.59101117 43 nips-2010-Bootstrapping Apprenticeship Learning
9 0.57132775 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains
10 0.5668543 192 nips-2010-Online Classification with Specificity Constraints
11 0.52343011 64 nips-2010-Distributionally Robust Markov Decision Processes
12 0.52091628 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
13 0.51201642 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories
14 0.49464121 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning
15 0.49300063 203 nips-2010-Parametric Bandits: The Generalized Linear Case
16 0.43251181 212 nips-2010-Predictive State Temporal Difference Learning
17 0.42735058 183 nips-2010-Non-Stochastic Bandit Slate Problems
18 0.42029911 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
19 0.39428529 283 nips-2010-Variational Inference over Combinatorial Spaces
20 0.3710399 215 nips-2010-Probabilistic Deterministic Infinite Automata
topicId topicWeight
[(13, 0.016), (27, 0.04), (30, 0.049), (45, 0.68), (50, 0.027), (52, 0.017), (60, 0.027), (77, 0.023), (78, 0.011), (90, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99910301 11 nips-2010-A POMDP Extension with Belief-dependent Rewards
Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet
Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes. 1
2 0.9988246 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs
Author: Nikos Karampatziakis
Abstract: We cast the problem of identifying basic blocks of code in a binary executable as learning a mapping from a byte sequence to a segmentation of the sequence. In general, inference in segmentation models, such as semi-CRFs, can be cubic in the length of the sequence. By taking advantage of the structure of our problem, we derive a linear-time inference algorithm which makes our approach practical, given that even small programs are tens or hundreds of thousands bytes long. Furthermore, we introduce two loss functions which are appropriate for our problem and show how to use structural SVMs to optimize the learned mapping for these losses. Finally, we present experimental results that demonstrate the advantages of our method against a strong baseline. 1
3 0.99860859 235 nips-2010-Self-Paced Learning for Latent Variable Models
Author: M. P. Kumar, Benjamin Packer, Daphne Koller
Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
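A minimal sketch of the self-paced loop described above, assuming the common formulation in which samples with loss below a threshold 1/K count as easy and K is annealed each iteration; the squared-loss model, the annealing rate and the toy data are placeholders, not the latent structural SVM setup from the paper.

import numpy as np

def self_paced_regression(X, y, K=5.0, anneal=1.3, iters=20):
    """Alternate between selecting easy samples (loss < 1/K) and refitting w."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        losses = (X @ w - y) ** 2
        v = losses < 1.0 / K                  # indicator of "easy" samples
        if not v.any():                       # cold start: take the easiest sample
            v = losses <= losses.min()
        w, *_ = np.linalg.lstsq(X[v], y[v], rcond=None)   # refit on the selected easy set
        K /= anneal                           # anneal: the threshold 1/K grows,
        if v.all():                           # until the entire training set is included
            break
    return w

# Usage on toy data with one gross outlier that is only picked up late in the schedule.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)); y = X @ np.array([1.0, -2.0])
y[0] += 25.0
print(self_paced_regression(X, y).round(2))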
4 0.99805325 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
Author: Daniel Lowd, Pedro Domingos
Abstract: Arithmetic circuits (ACs) exploit context-specific independence and determinism to allow exact inference even in networks with high treewidth. In this paper, we introduce the first ever approximate inference methods using ACs, for domains where exact inference remains intractable. We propose and evaluate a variety of techniques based on exact compilation, forward sampling, AC structure learning, Markov network parameter learning, variational inference, and Gibbs sampling. In experiments on eight challenging real-world domains, we find that the methods based on sampling and learning work best: one such method (AC2 -F) is faster and usually more accurate than loopy belief propagation, mean field, and Gibbs sampling; another (AC2 -G) has a running time similar to Gibbs sampling but is consistently more accurate than all baselines. 1
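As background for the abstract above (not the paper's compilation or learning code), the snippet below evaluates a tiny arithmetic circuit, with sum and product nodes over indicator and parameter leaves, bottom-up for a given evidence setting; the example circuit encodes a single Bernoulli variable and is purely illustrative.

import math

def eval_ac(node, evidence):
    """Evaluate an arithmetic circuit bottom-up.
    Nodes: ('ind', var, value), ('param', theta), ('+', children), ('*', children)."""
    kind = node[0]
    if kind == 'ind':                      # indicator leaf; a variable absent from evidence is summed out
        _, var, value = node
        return 1.0 if evidence.get(var, value) == value else 0.0
    if kind == 'param':                    # network parameter leaf
        return node[1]
    vals = [eval_ac(child, evidence) for child in node[1]]
    return sum(vals) if kind == '+' else math.prod(vals)

# Illustrative circuit for a single Bernoulli variable A with P(A=1) = 0.3:
# 0.3 * lambda_{A=1} + 0.7 * lambda_{A=0}
ac = ('+', [('*', [('param', 0.3), ('ind', 'A', 1)]),
            ('*', [('param', 0.7), ('ind', 'A', 0)])])
print(eval_ac(ac, {'A': 1}))   # 0.3: probability of the evidence A=1
print(eval_ac(ac, {}))         # 1.0: no evidence, the circuit sums to the partition function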
5 0.9979431 167 nips-2010-Mixture of time-warped trajectory models for movement decoding
Author: Elaine Corbett, Eric Perreault, Konrad Koerding
Abstract: Applications of Brain-Machine-Interfaces typically estimate user intent based on biological signals that are under voluntary control. For example, we might want to estimate how a patient with a paralyzed arm wants to move based on residual muscle activity. To solve such problems it is necessary to integrate obtained information over time. To do so, state of the art approaches typically use a probabilistic model of how the state, e.g. position and velocity of the arm, evolves over time – a so-called trajectory model. We wanted to further develop this approach using two intuitive insights: (1) At any given point of time there may be a small set of likely movement targets, potentially identified by the location of objects in the workspace or by gaze information from the user. (2) The user may want to produce movements at varying speeds. We thus use a generative model with a trajectory model incorporating these insights. Approximate inference on that generative model is implemented using a mixture of extended Kalman filters. We find that the resulting algorithm allows us to decode arm movements dramatically better than when we use a trajectory model with linear dynamics.

1 Introduction

When patients have lost a limb or the ability to communicate with the outside world, brain machine interfaces (BMIs) are often used to enable robotic prostheses or restore communication. To achieve this, the user's intended state of the device must be decoded from biological signals. In the context of Bayesian statistics, two aspects are important for the design of an estimator of a temporally evolving state: the observation model, which describes how measured variables relate to the system’s state, and the trajectory model, which describes how the state changes over time in a probabilistic manner. Following this logic many recent BMI applications have relied on Bayesian estimation for a wide range of problems including the decoding of intended human [1] and animal [2] movements. In the context of BMIs, Bayesian approaches offer a principled way of formalizing the uncertainty about signals and thus often result in improvements over other signal processing techniques [1]-[3]. Most work on state estimation in dynamical systems has assumed linear dynamics and Gaussian noise. Under these circumstances, efficient algorithms result from belief propagation. The most frequent application uses the Kalman filter (KF), which recursively combines noisy state observations with the probabilistic evolution of state defined by the trajectory model to estimate the marginal distribution over states [4]. Such approaches have been used widely for applications including upper [1] and lower [5] extremity prosthetic devices, functional electric stimulation [6] and human computer interactions [7]. As these algorithms are so commonly used, it seems promising to develop extensions to nonlinear trajectory models that may better describe the probabilistic distribution of movements in everyday life. One salient departure from the standard assumptions is that people tend to produce both slow and fast movements, depending on the situation. Models with linear dynamics only allow such deviation through the noise term, which makes these models poor at describing the natural variation of movement speeds during real world tasks. Explicitly incorporating movement speed into the trajectory model should lead to better movement estimates. Knowledge of the target position should also strongly affect trajectory models.
After all, we tend to accelerate our arm early during movement and slow down later on. Target information can be linearly incorporated into the trajectory model, and this has greatly improved predictions [8]-[12]. Alternatively, if there are a small number of potential targets then a mixture of trajectory models approach [13] can be used. Here we are interested in the case where available data provide a prior over potential targets but where movement targets may be anywhere. We want to incorporate target uncertainty and allow generalization to novel targets. Prior information about potential targets could come from a number of sources but would generally be noisy. For example, activity in the dorsal premotor cortex provides information about intended target location prior to movement and may be used where such recordings are available [14]. Target information may also be found noninvasively by tracking eye movements. However, such data will generally provide non-zero priors for a number of possible target locations as the subject saccades over the scene. While subjects almost always look at a target before reaching for it [15], there may be a delay of up to a second between looking at the target and the reach – a time interval over which up to 3 saccades are typically made. Each of these fixations could be the target. Hence, a probabilistic distribution of targets is appropriate when using either neural recordings or eye tracking to estimate potential reach targets. Here we present an algorithm that uses a mixture of extended Kalman Filters (EKFs) to combine our insights related to the variation of movement speed and the availability of probabilistic target knowledge. Each of the mixture components allows the speed of the movement to vary continuously over time. We tested how well we could use EMGs and eye movements to decode hand position of humans performing a three-dimensional large workspace reaching task. We find that using a trajectory model that allows for probabilistic target information and variation of speed leads to dramatic improvements in decoding quality.

2 General Decoding Setting

We wanted to test how well different decoding algorithms can decode human movement, over a wide range of dynamics. While many recent studies have looked at more restrictive, two-dimensional movements, a system to restore arm function should produce a wide range of 3D trajectories. We recorded arm kinematics and EMGs of healthy subjects during unconstrained 3D reaches to targets over a large workspace. Two healthy subjects were asked to reach at slow, normal and fast speeds, as they would in everyday life. Subjects were seated as they reached towards 16 LEDs in blocks of 150s, which were located on two planes positioned such that all targets were just reachable (Fig 1A). The target LED was lit for one second prior to an auditory go cue, at which time the subject would reach to the target at the appropriate speed. Slow, normal and fast reaches were allotted 3s, 1.5s and 1s respectively; however, subjects determined the speed. An approximate total of 450 reaches were performed per subject. The subjects provided informed consent, and the protocol was approved by the Northwestern University Institutional Review Board. EMG signals were measured from the pectoralis major, and the three deltoid muscles of the shoulder.
This represents a small subset of the muscles involved in reaching, and approximates those muscles retaining some voluntary control following mid-level cervical spinal cord injuries. The EMG signals were band-pass filtered between 10 and 1,000 Hz, and subsequently anti-alias filtered. Hand, wrist, shoulder and head positions were tracked using an Optotrak motion analysis system. We simultaneously recorded eye movements with an ASL EYETRAC-6 head-mounted eye tracker. Approximately 25% of the reaches were assigned to the test set, and the rest were used for training. Reaches for which either the motion capture data was incomplete, or there was visible motion artifact on the EMG, were removed. As the state we used hand positions and joint angles (3 shoulder, 2 elbow; position, velocity and acceleration; 24 dimensions). Joint angles were calculated from the shoulder and wrist marker data using digitized bony landmarks which defined a coordinate system for the upper limb as detailed by Wu et al. [16]. As the motion data were sampled at 60Hz, the mean absolute value of the EMG in the corresponding 16.7ms windows was used as an observation of the state at each time-step. Algorithm accuracy was quantified by normalizing the root-mean-squared error by the straight-line distance between the first and final position of the endpoint for each reach. We compared the algorithms statistically using repeated measures ANOVAs with Tukey post-hoc tests, treating reach and subject as random effects. In the rest of the paper we will ask how well these reaching movements can be decoded from EMG and eye-tracking data.

Figure 1: A Experimental setup and B sample kinematics and processed EMGs for one reach.

3 Kalman Filters with Target Information

All models that we consider in this paper assume linear observations with Gaussian noise: (1) where x is the state, y is the observation and v is the measurement noise with p(v) ~ N(0,R), and R is the observation covariance matrix. The model fitted the measured EMGs with an average r^2 of 0.55. This highlights the need to integrate information over time. The standard approach also assumes linear dynamics and Gaussian process noise: (2) where xt represents the hand and joint angle positions, w is the process noise with p(w) ~ N(0,Q), and Q is the state covariance matrix. The Kalman filter does optimal inference for this generative model. This model can effectively capture the dynamics of stereotypical reaches to a single target by appropriately tuning its parameters. However, when used to describe reaches to multiple targets, the model cannot describe target-dependent aspects of reaching but boils down to a random drift model. Fast velocities are underestimated as they are unlikely under the trajectory model and there is excessive drift close to the target (Fig. 2A). In many decoding applications we may know the subject’s target. A range of recent studies have addressed the issue of incorporating this information into the trajectory model [8, 13], and we might assume the effect of the target on the dynamics to be linear. This naturally suggests adding the target to the state space, which works well in practice [9, 12]. By appending the target to the state vector (KFT), the simple linear format of the KF may be retained: (3) where xTt is the vector of target positions, with dimensionality less than or equal to that of xt.
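Equations (1)-(3) referenced above did not survive extraction, so as a hedged illustration only, the following sketch runs one generic linear-Gaussian Kalman filter predict/update step with the target appended to the state, in the spirit of the KFT; all matrices and the 1-D reach model are placeholder assumptions, not the fitted model from the paper.

import numpy as np

def kalman_step(x, P, z, A, C, Q, R):
    """One predict/update cycle of a linear-Gaussian Kalman filter.
    x, P: state mean/covariance; z: observation; A, C, Q, R play the roles named around eqs. (1)-(2)."""
    x_pred = A @ x                          # dynamics: x_t ~ N(A x_{t-1}, Q)
    P_pred = A @ P @ A.T + Q
    S = C @ P_pred @ C.T + R                # innovation covariance (observations: z ~ N(C x, R))
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# KFT-style augmentation (illustrative 1-D case): state = [position, velocity, target].
dt = 1.0 / 60.0                             # 60 Hz sampling, as in the recordings above
A = np.array([[1.0, dt, 0.0],               # position integrates velocity
              [-0.1, 0.9, 0.1],             # velocity drawn toward the (static) target
              [0.0, 0.0, 1.0]])             # target is constant over the reach
C = np.array([[0.0, 1.0, 0.0]])             # assume the EMG-like signal reflects velocity
Q = np.diag([1e-4, 1e-3, 1e-8]); R = np.array([[1e-2]])
x, P = np.array([0.0, 0.0, 1.0]), np.eye(3) * 0.1
x, P = kalman_step(x, P, z=np.array([0.2]), A=A, C=C, Q=Q, R=R)
print(x.round(3))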
This trajectory model thus allows describing both the rapid acceleration that characterizes the beginning of a reach and the stabilization towards its end. We compared the accuracy of the KF and the KFT to the Single Target Model (STM), a KF trained only on reaches to the target being tested (Fig. 2). The STM represents the best possible prediction that could be obtained with a Kalman filter. Assuming the target is perfectly known, we implemented the KFT by correctly initializing the target state xT at the beginning of the reach. We will relax this assumption below. The initial hand and joint angle positions were also assumed to be known.

Figure 2: A Sample reach and predictions and B average accuracies with standard errors for KFT, KF and MTM.

Consistent with the recent literature, both methods that incorporated target information produced higher prediction accuracy than the standard KF (both p<0.0001). Interestingly, there was no significant difference between the KFT and the STM (p=0.9). It seems that when we have knowledge of the target, we do not lose much by training a single model over the whole workspace rather than modeling the targets individually. This is encouraging, as we desire a BMI system that can generalize to any target within the workspace, not just specifically to those that are available in the training data. Clearly, adding the target to the state space allows the dynamics of typical movements to be modeled effectively, resulting in dramatic increases in decoding performance.

4 Time Warping

4.1 Implementing a time-warped trajectory model

While the KFT above can capture the general reach trajectory profile, it does not allow for natural variability in the speed of movements. Depending on our task objectives, which would not directly be observed by a BMI, we might lazily reach toward a target or move at maximal speed. We aim to change the trajectory model to explicitly incorporate a warping factor by which the average movement speed is scaled, allowing for such variability. As the movement speed will be positive in all practical cases, we model the logarithm of this factor, and append it to the state vector: (4) We create a time-warped trajectory model by noting that if the average rate of a trajectory is to be scaled by a factor S, the position at time t will equal that of the original trajectory at time St. Differentiating, the velocity will be multiplied by S, and the acceleration by S^2. For simplicity, the trajectory noise is assumed to be additive and Gaussian, and the model is assumed to be stationary: (5) where Ip is the p-dimensional identity matrix and 0p is a p × p matrix of zeros. Only the terms used to predict the acceleration states need to be estimated to build the state transition matrix, and they are scaled as a nonlinear function of xs. After adding the variable movement speed to the state space the system is no longer linear. Therefore we need a different solution strategy. Instead of the typical KFT we use the Extended Kalman Filter (EKFT) to implement a nonlinear trajectory model by linearizing the dynamics around the best estimate at each time-step [17]. With this approach we add only small computational overhead to the KFT recursions.

4.2 Training the time warping model

The filter parameters were trained using a variant of the Expectation Maximization (EM) algorithm [18]. For extended Kalman filter learning the initialization for the variables may matter.
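Returning to Section 4.1, the time-warping construction (equations (4)-(5) are missing from this extract) can be sketched as follows: the log-speed state scales velocity contributions by S and acceleration terms by S^2, and an EKF linearizes the resulting nonlinear dynamics around the current estimate. The block layout, dimensions and numbers below are assumptions for illustration only, not the fitted transition matrix from the paper.

import numpy as np

def warped_transition(x, A_acc, dt=1.0 / 60.0, p=1):
    """Time-warped mean dynamics for a state [pos(p), vel(p), acc(p), log-speed].
    Velocity contributions scale with S = exp(xs), acceleration terms with S**2."""
    pos, vel, acc, xs = x[:p], x[p:2*p], x[2*p:3*p], x[-1]
    S = np.exp(xs)
    new_pos = pos + S * dt * vel
    new_vel = vel + S * dt * acc
    new_acc = A_acc @ np.concatenate([pos, vel, acc]) * S**2   # acceleration terms, rescaled by S^2
    return np.concatenate([new_pos, new_vel, new_acc, [xs]])   # log-speed follows a random walk

def numerical_jacobian(f, x, eps=1e-6):
    """EKF-style linearization of the nonlinear dynamics around the current estimate."""
    n = len(x); J = np.zeros((len(f(x)), n))
    for i in range(n):
        d = np.zeros(n); d[i] = eps
        J[:, i] = (f(x + d) - f(x - d)) / (2 * eps)
    return J

A_acc = np.array([[0.0, 0.0, 0.8]])          # placeholder acceleration dynamics (p = 1)
x = np.array([0.0, 0.5, 0.1, np.log(1.5)])   # reaching at ~1.5x the average speed
F = numerical_jacobian(lambda s: warped_transition(s, A_acc), x)
print(F.round(3))                            # Jacobian used in the EKF recursion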
S was initialized with the ground truth average reach speeds for each movement relative to the average speed across all movements. The state transition parameters were estimated using nonlinear least squares regression, while C, Q and R were estimated linearly for the new system, using the maximum likelihood solution [18] (M-step). For the E-step we used a standard extended Kalman smoother. We thus found the expected values for the states given the current filter parameters. For this computation, and later when testing the algorithm, xs was initialized to its average value across all reaches while the remaining states were initialized to their true values. The smoothed estimate for xs was then used, along with the true values for the other states, to re-estimate the filter parameters in the M-step as before. We alternated between the E and M steps until the log likelihood converged (which it did in all cases). Following the training procedure, the diagonal of the state covariance matrix Q corresponding to xs was set to the variance of the smoothed xs over all reaches, according to how much this state should be allowed to change during prediction. This allowed the estimate of xs to develop over the course of the reach due to the evidence provided by the observations, better capturing the dynamics of reaches at different speeds.

4.3 Performance of the time-warped EKFT

Incorporating time warping explicitly into the trajectory model produced a noticeable increase in decoding performance over the KFT. As the speed state xs is estimated throughout the course of the reach, based on the evidence provided by the observations, the trajectory model has the flexibility to follow the dynamics of the reach more accurately (Fig. 3). While at the normal self-selected speed the difference between the algorithms is small, for the slow and fast speeds, where the dynamics deviate from average, there is a clear advantage to the time warping model.

Figure 3: Hand positions and predictions of the KFT and EKFT for sample reaches at A slow, B normal and C fast speeds. Note the different time scales between reaches.

The models were first trained using data from all speeds (Fig. 4A). The EKFT was 1.8% more accurate on average (p<0.01), and the effect was significant at the slow (1.9%, p<0.05) and the fast (2.8%, p<0.01), but not at the normal (p=0.3) speed. We also trained the models from data using only reaches at the self-selected normal speed, as we wanted to see if there was enough variation to effectively train the EKFT (Fig. 4B). Interestingly, the performance of the EKFT was reduced by only 0.6%, and the KFT by 1.1%. The difference in performance between the EKFT and KFT was even more pronounced on average (2.3%, p<0.001), and for the slow and fast speeds (3.6 and 4.1%, both p<0.0001). At the normal speed, the algorithms again were not statistically different (p=0.6). This result demonstrates that the EKFT is a practical option for a real BMI system, as it is not necessary to greatly vary the speeds while collecting training data for the model to be effective over a wide range of intended speeds. Explicitly incorporating speed information into the trajectory model helps decoding, by modeling the natural variation in volitional speed.

Figure 4: Mean and standard error of EKFT and KFT accuracy at the different subject-selected speeds. Models were trained on reaches at A all speeds and B just normal speed reaches.
Asterisks indicate statistically significant differences between the algorithms.

5 Mixtures of Targets

So far, we have assumed that the targets of our reaches are perfectly known. In a real-world system, there will be uncertainty about the intended target of the reach. However, in typical applications there are a small number of possible objectives. Here we address this situation. Drawing on the recent literature, we use a mixture model to consider each of the possible targets [11, 13]. We condition the posterior probability for the state on the N possible targets, T: (6) Using Bayes' Rule, this equation becomes: (7) As we are dealing with a mixture model, we perform the Kalman filter recursion for each possible target, xT, and our solution is a weighted sum of the outputs. The weights are proportional to the prior for that target and the likelihood of the model given that target; the normalizing term is independent of the target and does not need to be calculated. We tested mixtures of both algorithms, the mKFT and mEKFT, with real uncertain priors obtained from eye-tracking in the one-second period preceding movement. As the targets were situated on two planes, the three-dimensional location of the eye gaze was found by projecting its direction onto those planes. The first, middle and last eye samples were selected, and all other samples were assigned to a group according to which of the three was closest. The mean and variance of these three groups were used to initialize three Kalman filters in the mixture model. The priors of the three groups were assigned proportional to the number of samples in them. If the subject looks at multiple positions prior to reaching, this method ensures with a high probability that the correct target was accounted for in one of the filters in the mixture. We also compared the MTM approach of Yu et al. [13], where a different KF model was generated for each target, and a mixture is performed over these models. This approach explicitly captures the dynamics of stereotypical reaches to specific targets. Given perfect target information, it would reduce to the STM described above. Priors for the MTM were found by assigning each valid eye sample to its closest two targets, and weighting the models proportional to the number of samples assigned to the corresponding target, divided by its distance from the mean of those samples. We tried other ways of assigning priors and the one presented gave the best results. We calculated the reduction in decoding quality when instead of perfect priors we provide eye-movement-based noisy priors (Fig. 5). The accuracies of the mEKFT, the mKFT and the MTM were only degraded by 0.8, 1.9 and 2.1% respectively, compared to the perfect prior situation. The mEKFT was still close to 10% better than the KF. The mixture model framework is effective in accounting for uncertain priors.

Figure 5: Mean and standard errors of accuracy for algorithms with perfect priors, and uncertain priors with full and partial training set. The asterisk indicates a statistically significant effect between the two training types, where real priors are used.

Here, only reaches at normal speed were used to train the models, as this is a more realistic training set for a BMI application. This accounts for the degraded performance of the MTM with perfect priors relative to the STM from above (Fig. 2). With even more stereotyped training data for each target, the MTM doesn't generalize as well to new speeds.
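The mixture computation around equations (6)-(7) amounts to running one filter per candidate target and weighting each filter by its prior times its observation likelihood; the sketch below shows that weighting step for generic per-target Kalman filters, using the Gaussian innovation likelihood as the model-evidence term (an assumption of this illustration, not code from the paper).

import numpy as np

def innovation_loglik(z, x_pred, P_pred, C, R):
    """Log-likelihood of observation z under one filter's predictive distribution."""
    mean = C @ x_pred
    S = C @ P_pred @ C.T + R
    diff = z - mean
    return float(-0.5 * (diff @ np.linalg.solve(S, diff)
                         + np.log(np.linalg.det(S)) + len(z) * np.log(2 * np.pi)))

def update_target_weights(log_prior, log_liks):
    """Posterior target weights: proportional to prior times model likelihood."""
    logw = np.asarray(log_prior) + np.asarray(log_liks)
    w = np.exp(logw - logw.max())            # the normalizing constant cancels, as noted in the text
    return w / w.sum()

# Usage: three candidate targets from eye tracking, with gaze-based priors.
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_liks = [-1.2, -0.4, -3.0]                # per-filter innovation log-likelihoods (cf. innovation_loglik)
w = update_target_weights(log_prior, log_liks)
print(w.round(3))                            # mixture weights for the weighted-sum state estimate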
We also wanted to know if the algorithms could generalize to new targets. In a real application, the available training data will generally not span the entire usable workspace. We compared the algorithms where reaches to all targets except the one being tested had been used to train the models. The performance of the MTM was significantly degraded, unsurprisingly, as it was designed for reaches to a set of known targets. Performance of the mKFT and mEKFT degraded by about 1%, but not significantly (both p>0.7), demonstrating that the continuous approach to target information is preferable when the target could be anywhere in space, not just at locations for which training data is available.

6 Discussion and conclusions

The goal of this work was to design a trajectory model that would improve decoding for BMIs with an application to reaching. We incorporated two features that prominently influence the dynamics of natural reach: the movement speed and the target location. Our approach is appropriate where uncertain target information is available. The model generalizes well to new regions of the workspace for which there is no training data, and across a broad range of reaching dynamics to widely spaced targets in three dimensions. The advantages over linear models in decoding precision we report here could be equally obtained using mixtures over many targets and speeds. While mixture models [11, 13] could allow for slow versus fast movements and any number of potential targets, this strategy will generally require many mixture components. Such an approach would require a lot more training data, as we have shown that it does not generalize well. It would also be run-time intensive which is problematic for prosthetic devices that rely on low power controllers. In contrast, the algorithm introduced here only takes a small amount of additional run-time in comparison to the standard KF approach. The EKF is only marginally slower than the standard KF and the algorithm will not generally need to consider more than 3 mixture components assuming the subject fixates the target within the second preceding the reach. In this paper we assumed that subjects always would fixate a reach target – along with other non-targets. While this is close to the way humans usually coordinate eyes and reaches [15], there might be cases where people do not fixate a reach target. Our approach could be easily extended to deal with such situations by adding a dummy mixture component that allows the description of movements to any target. As an alternative to mixture approaches, a system can explicitly estimate the target position in the state vector [9]. This approach, however, would not straightforwardly allow for the rich target information available; we look at the target but also at other locations, strongly suggesting mixture distributions. A combination of the two approaches could further improve decoding quality. We could both estimate speed and target position for the EKFT in a continuous manner while retaining the mixture over target priors. We believe that the issues that we have addressed here are almost universal. Virtually all types of movements are executed at varying speed. A probabilistic distribution for a small number of action candidates may also be expected in most BMI applications – after all, there are usually only a small number of actions that make sense in a given environment.
While this work is presented in the context of decoding human reaching, it may be applied to a wide range of BMI applications including lower limb prosthetic devices and human computer interactions, as well as different signal sources such as electrode grid recordings and electroencephalograms. The increased user control in conveying their intended movements would significantly improve the functionality of a neuroprosthetic device.

Acknowledgements

The authors thank T. Haswell, E. Krepkovich, and V. Ravichandran for assistance with experiments. This work was funded by the NSF Program in Cyber-Physical Systems.

References

[1] L.R. Hochberg, M.D. Serruya, G.M. Friehs, J.A. Mukand, M. Saleh, A.H. Caplan, A. Branner, D. Chen, R.D. Penn, and J.P. Donoghue, “Neuronal ensemble control of prosthetic devices by a human with tetraplegia,” Nature, vol. 442, 2006, pp. 164–171.
[2] W. Wu, Y. Gao, E. Bienenstock, J.P. Donoghue, and M.J. Black, “Bayesian population decoding of motor cortical activity using a Kalman filter,” Neural Computation, vol. 18, 2006, pp. 80–118.
[3] W. Wu, M.J. Black, Y. Gao, E. Bienenstock, M. Serruya, A. Shaikhouni, and J.P. Donoghue, “Neural decoding of cursor motion using a Kalman filter,” Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference, 2003, p. 133.
[4] R.E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, 1960, pp. 35–45.
[5] G.G. Scandaroli, G.A. Borges, J.Y. Ishihara, M.H. Terra, A.F.D. Rocha, and F.A.D.O. Nascimento, “Estimation of foot orientation with respect to ground for an above knee robotic prosthesis,” Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA: IEEE Press, 2009, pp. 1112–1117.
[6] I. Cikajlo, Z. Matjačić, and T. Bajd, “Efficient FES triggering applying Kalman filter during sensory supported treadmill walking,” Journal of Medical Engineering & Technology, vol. 32, 2008, pp. 133–144.
[7] S. Kim, J.D. Simeral, L.R. Hochberg, J.P. Donoghue, and M.J. Black, “Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia,” Journal of Neural Engineering, vol. 5, 2008, pp. 455–476.
[8] L. Srinivasan, U.T. Eden, A.S. Willsky, and E.N. Brown, “A state-space analysis for reconstruction of goal-directed movements using neural signals,” Neural Computation, vol. 18, 2006, pp. 2465–2494.
[9] G.H. Mulliken, S. Musallam, and R.A. Andersen, “Decoding trajectories from posterior parietal cortex ensembles,” Journal of Neuroscience, vol. 28, 2008, p. 12913.
[10] W. Wu, J.E. Kulkarni, N.G. Hatsopoulos, and L. Paninski, “Neural decoding of hand motion using a linear state-space model with hidden states,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 17, 2009, p. 1.
[11] J.E. Kulkarni and L. Paninski, “State-space decoding of goal-directed movements,” IEEE Signal Processing Magazine, vol. 25, 2008, p. 78.
[12] C. Kemere and T. Meng, “Optimal estimation of feed-forward-controlled linear systems,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), 2005.
[13] B.M. Yu, C. Kemere, G. Santhanam, A. Afshar, S.I. Ryu, T.H. Meng, M. Sahani, and K.V. Shenoy, “Mixture of trajectory models for neural decoding of goal-directed movements,” Journal of Neurophysiology, vol. 97, 2007, p. 3763.
[14] N. Hatsopoulos, J. Joshi, and J.G. O'Leary, “Decoding continuous and discrete motor behaviors using motor and premotor cortical ensembles,” Journal of Neurophysiology, vol. 92, 2004, p. 1165.
[15] R.S. Johansson, G. Westling, A. Backstrom, and J.R. Flanagan, “Eye-hand coordination in object manipulation,” Journal of Neuroscience, vol. 21, 2001, p. 6917.
[16] G. Wu, F.C. van der Helm, H.E.J. Veeger, M. Makhsous, P. Van Roy, C. Anglin, J. Nagels, A.R. Karduna, and K. McQuade, “ISB recommendation on definitions of joint coordinate systems of various joints for the reporting of human joint motion–Part II: shoulder, elbow, wrist and hand,” Journal of Biomechanics, vol. 38, 2005, pp. 981–992.
[17] D. Simon, Optimal State Estimation: Kalman, H∞ and Nonlinear Approaches, John Wiley and Sons, 2006.
[18] Z. Ghahramani and G.E. Hinton, “Parameter estimation for linear dynamical systems,” University of Toronto technical report CRG-TR-96-2, vol. 6, 1996.
6 0.99736977 187 nips-2010-Occlusion Detection and Motion Estimation with Convex Optimization
7 0.99675059 100 nips-2010-Gaussian Process Preference Elicitation
8 0.99611318 133 nips-2010-Kernel Descriptors for Visual Recognition
9 0.99600321 108 nips-2010-Graph-Valued Regression
10 0.98001039 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
11 0.97722006 83 nips-2010-Evidence-Specific Structures for Rich Tractable CRFs
12 0.97515982 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning
13 0.9724915 144 nips-2010-Learning Efficient Markov Networks
14 0.97214419 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
15 0.97204763 197 nips-2010-Optimal Bayesian Recommendation Sets and Myopically Optimal Choice Query Sets
16 0.96980661 212 nips-2010-Predictive State Temporal Difference Learning
17 0.96864754 118 nips-2010-Implicit Differentiation by Perturbation
18 0.96783161 61 nips-2010-Direct Loss Minimization for Structured Prediction
19 0.96612483 25 nips-2010-Active Learning by Querying Informative and Representative Examples
20 0.96606743 94 nips-2010-Feature Set Embedding for Incomplete Data