nips nips2010 nips2010-168 knowledge-graph by maker-knowledge-mining

168 nips-2010-Monte-Carlo Planning in Large POMDPs

Source: pdf

Author: David Silver, Joel Veness

Abstract: This paper introduces a Monte-Carlo algorithm for online planning in large POMDPs. The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. The new algorithm, POMCP, has two important properties. First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. Second, only a black box simulator of the POMDP is required, rather than explicit probability distributions. These properties enable POMCP to plan effectively in significantly larger POMDPs than has previously been possible. We demonstrate its effectiveness in three large POMDPs. We scale up a well-known benchmark problem, rocksample, by several orders of magnitude. We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately $10^{18}$ and $10^{56}$ states respectively. Our Monte-Carlo planning algorithm achieved a high level of performance with no prior knowledge, and was also able to exploit simple domain knowledge to achieve better results with less search. POMCP is the first general purpose planner to achieve high performance in such large and unfactored POMDPs.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The algorithm combines a Monte-Carlo update of the agent’s belief state with a Monte-Carlo tree search from the current belief state. [sent-4, score-0.642]

2 First, Monte-Carlo sampling is used to break the curse of dimensionality both during belief state updates and during planning. [sent-6, score-0.339]

3 We also introduce two challenging new POMDPs: 10 × 10 battleship and partially observable PacMan, with approximately $10^{18}$ and $10^{56}$ states respectively. [sent-11, score-0.288]

4 1 Introduction Monte-Carlo tree search (MCTS) is a new approach to online planning that has provided exceptional performance in large, fully observable domains. [sent-14, score-0.571]

5 The key idea is to evaluate each state in a search tree by the average outcome of simulations from that state. [sent-16, score-0.457]

6 It breaks the curse of dimensionality by sampling state transitions instead of considering all possible state transitions. [sent-19, score-0.296]

7 Full-width planning algorithms, such as value iteration [6], scale poorly for two reasons, sometimes referred to as the curse of dimensionality and the curse of history [12]. [sent-25, score-0.338]

8 The basic idea of our approach is to use Monte-Carlo sampling to break both curses, by sampling start states from the belief state, and by sampling histories using a black box simulator. [sent-28, score-0.412]

9 Our search algorithm constructs, online, a search tree of histories. [sent-29, score-0.375]

10 Each node of the search tree estimates the value of a history by Monte-Carlo simulation. [sent-30, score-0.388]

11 For each simulation, the start state is sampled from the current belief state, and state transitions and observations are sampled from a black box simulator. [sent-31, score-0.472]

12 We show that if the belief state is correct, then this simple procedure converges to the optimal policy for any ﬁnite horizon POMDP. [sent-32, score-0.396]

13 As the search tree is constructed, we store the set of sample states encountered by the black box simulator in each node of the search tree. [sent-35, score-0.616]

14 We approximate the belief state by the set of sample states corresponding to the actual history. [sent-36, score-0.296]

15 Our algorithm, Partially Observable MonteCarlo Planning (POMCP), eﬃciently uses the same set of Monte-Carlo simulations for both tree search and belief state updates. [sent-37, score-0.595]

16 For any state $s \in S$ and for any action $a \in A$, the transition probabilities $P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ determine the next state distribution $s'$, and the reward function $R^a_s = E[r_{t+1} \mid s_t = s, a_t = a]$ determines the expected reward. [sent-40, score-0.351]

17 A history is a sequence of actions and observations, ht = {a1 , o1 , . [sent-44, score-0.428]

18 The value function $V^{\pi}(h)$ is the expected return from history $h$ when following policy $\pi$, $V^{\pi}(h) = E_{\pi}[R_t \mid h_t = h]$. [sent-53, score-0.296]

19 The belief state is the probability distribution over states given history $h$, $B(s, h) = \Pr(s_t = s \mid h_t = h)$. [sent-56, score-0.398]

20 2 Online Planning in POMDPs Online POMDP planners use forward search, from the current history or belief state, to form a local approximation to the optimal value function. [sent-58, score-0.344]

21 They construct a search tree of belief states, using a heuristic best-ﬁrst expansion procedure. [sent-61, score-0.386]

22 Each value in the search tree is updated by a full-width computation that takes account of all possible actions, observations and next states. [sent-62, score-0.298]

23 Monte-Carlo planning is a very diﬀerent paradigm for online planning in POMDPs [2, 7]. [sent-66, score-0.389]

24 The simulator provides a sample of a successor state, observation and reward, given a state and action, $(s_{t+1}, o_{t+1}, r_{t+1}) \sim G(s_t, a_t)$, and can also be reset to a start state $s$. [sent-68, score-0.388]

25 However, prior Monte-Carlo planners have been limited to ﬁxed horizon, depth-ﬁrst search [7] (also known as sparse sampling), or to simple rollout methods with no search tree [2], and have not so far proven to be competitive with best-ﬁrst, full-width planning methods. [sent-73, score-0.897]

26 3 Rollouts In fully observable MDPs, Monte-Carlo simulation provides a simple method for evaluating a state s. [sent-75, score-0.288]

27 Sequences of states are generated by an MDP simulator, starting from s and using a random rollout policy, until a terminal state or discount horizon is reached. [sent-76, score-0.479]

28 The value of state $s$ is estimated by the mean return of $N$ simulations from $s$, $V(s) = \frac{1}{N} \sum_{i=1}^{N} R_i$, where $R_i$ is the return from the beginning of the $i$th simulation. [sent-77, score-0.377]

29 Monte-Carlo simulation can be turned into a simple control algorithm by evaluating all legal actions and selecting the action with highest evaluation [15]. [sent-78, score-0.351]
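
The rollout-based control idea in the two sentences above can be sketched as follows. This is a minimal illustration, not code from the paper: the simulator signature `(state, action, rng) -> (next_state, reward, done)`, the fixed `horizon`, and the toy `corridor` MDP are all assumptions made for the example.

```python
import random

def rollout_return(simulator, state, first_action, actions,
                   gamma=0.95, horizon=50, rng=random):
    """Discounted return of one simulation: first_action, then uniform random actions."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward, done = simulator(state, action, rng)
        total += discount * reward
        discount *= gamma
        if done:
            break
        action = rng.choice(actions)  # uniform random rollout policy
    return total

def monte_carlo_control(simulator, state, actions, n_sims=200, rng=random):
    """Estimate each action by its mean return over n_sims rollouts; return the best."""
    values = {}
    for a in actions:
        returns = [rollout_return(simulator, state, a, actions, rng=rng)
                   for _ in range(n_sims)]
        values[a] = sum(returns) / n_sims
    return max(values, key=values.get), values

def corridor(state, action, rng):
    """Hypothetical toy MDP: positions 0..3; reaching 3 pays +1 and terminates."""
    nxt = max(0, min(3, state + action))
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3
```

Calling `monte_carlo_control(corridor, 2, [-1, 1])` evaluates both moves from position 2 and selects the one with the higher mean return.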

30 Monte-Carlo simulation can be extended to partially observable MDPs [2] by using a history based rollout policy πrollout (h, a). [sent-79, score-0.658]

31 To evaluate candidate action a in history h, simulations are generated from ha using a POMDP simulator and the rollout policy. [sent-80, score-0.832]

32 The value of ha is estimated by the mean return of N simulations from ha. [sent-81, score-0.4]

33 4 Monte-Carlo Tree Search Monte-Carlo tree search [3] uses Monte-Carlo simulation to evaluate the nodes of a search tree in a sequentially best-ﬁrst order. [sent-83, score-0.564]

34 There is one node in the tree for each state $s$, containing a value $Q(s, a)$ and a visitation count $N(s, a)$ for each action $a$, and an overall count $N(s) = \sum_a N(s, a)$. [sent-84, score-0.349]

35 The value is estimated by the mean return from s of all simulations in which action a was selected from state s. [sent-86, score-0.365]

36 Each simulation starts from the current state st , and is divided into two stages: a tree policy that is used while within the search tree; and a rollout policy that is used once simulations leave the scope of the search tree. [sent-87, score-1.134]

37 The simplest version of MCTS uses a greedy tree policy during the ﬁrst stage, which selects the action with the highest value; and a uniform random rollout policy during the second stage. [sent-88, score-0.631]

38 After each simulation, one new node is added to the search tree, containing the ﬁrst state visited in the second stage. [sent-89, score-0.283]

39 Each state of the search tree is viewed as a multi-armed bandit, and actions are chosen by using the UCB1 algorithm [1]. [sent-91, score-0.54]

40 Once all actions from state $s$ are represented in the search tree, the tree policy selects the action maximising the augmented action-value, $\arg\max_a Q^{\oplus}(s, a)$. [sent-94, score-0.706]

41 Otherwise, the rollout policy is used to select actions. [sent-95, score-0.344]
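
A small sketch of the tree policy described above may help. The summary does not reproduce the exact form of the augmented value, so the code assumes the standard UCB1 bonus, $Q^{\oplus}(s, a) = Q(s, a) + c \sqrt{\log N(s) / N(s, a)}$; the function name, the dictionary layout, and the convention of returning `None` when some action is still untried are illustrative choices.

```python
import math

def ucb1_action(q, n_sa, c=1.0):
    """Return argmax_a Q(s,a) + c*sqrt(log N(s)/N(s,a)).

    q    : dict action -> mean return Q(s, a)
    n_sa : dict action -> visit count N(s, a)
    Returns None if any action is untried, signalling that the caller
    should fall back to the rollout policy.
    """
    if any(n == 0 for n in n_sa.values()):
        return None
    n_s = sum(n_sa.values())  # N(s) = sum_a N(s, a)
    return max(q, key=lambda a: q[a] + c * math.sqrt(math.log(n_s) / n_sa[a]))
```

With `c=0` the rule reduces to the greedy tree policy; larger values of the assumed constant `c` favour rarely visited actions.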

42 3 Monte-Carlo Planning in POMDPs Partially Observable Monte-Carlo Planning (POMCP) consists of a UCT search that selects actions at each time-step; and a particle ﬁlter that updates the agent’s belief state. [sent-100, score-0.571]

43 1 Partially Observable UCT (PO–UCT) We extend the UCT algorithm to partially observable environments by using a search tree of histories instead of states. [sent-102, score-0.454]

44 The tree contains a node $T(h) = \langle N(h), V(h) \rangle$ for each represented history $h$. [sent-103, score-0.261]

45 V (h) is the value of history h, estimated by the mean return of all simulations starting with h. [sent-105, score-0.277]

46 We assume for now that the belief state B(s, h) is known exactly. [sent-107, score-0.256]

47 Each simulation starts in an initial state that is sampled from B(·, ht ). [sent-108, score-0.338]

48 The agent constructs a search tree from multiple simulations, and evaluates each history by its mean return (left). [sent-114, score-0.524]

49 The agent uses the search tree to select a real action a, and observes a real observation o (middle). [sent-115, score-0.447]

50 The agent then prunes the tree and begins a new search from the updated history hao (right). [sent-116, score-0.547]

51 stage of simulation, actions are selected by a history based rollout policy πrollout (h, a) (e. [sent-117, score-0.62]

52 2 Monte-Carlo Belief State Updates In small state spaces, the belief state can be updated exactly by Bayes’ theorem, $B(s', hao) = \frac{\sum_{s \in S} Z^a_{s'o} P^a_{ss'} B(s, h)}{\sum_{s \in S} \sum_{s'' \in S} Z^a_{s''o} P^a_{ss''} B(s, h)}$. [sent-122, score-0.397]
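
For comparison with the Monte-Carlo update, the exact Bayes update can be sketched for a small, explicitly specified POMDP. The sketch assumes that the transition probabilities are given as `trans[(s, a)] = {s2: prob}` and the observation probabilities as `obs_prob[(s2, a)] = {o: prob}`; these container names and the two-state listening example are hypothetical.

```python
def exact_belief_update(belief, action, obs, trans, obs_prob):
    """Exact Bayes update of a belief {state: prob} after (action, obs)."""
    unnorm = {}
    for s, b in belief.items():
        for s2, p in trans[(s, action)].items():
            z = obs_prob[(s2, action)].get(obs, 0.0)  # prob. of obs in s2
            unnorm[s2] = unnorm.get(s2, 0.0) + z * p * b
    total = sum(unnorm.values())  # normalising constant (the denominator)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: w / total for s2, w in unnorm.items()}

# Hypothetical two-state example: listening leaves the state unchanged and
# reports the true state with probability 0.85.
trans = {("L", "listen"): {"L": 1.0}, ("R", "listen"): {"R": 1.0}}
obs_prob = {("L", "listen"): {"L": 0.85, "R": 0.15},
            ("R", "listen"): {"L": 0.15, "R": 0.85}}
posterior = exact_belief_update({"L": 0.5, "R": 0.5}, "listen", "L", trans, obs_prob)
```

The double loop over states is what makes this update costly in large state spaces, which is the motivation for the particle approximation.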

53 To plan eﬃciently in large POMDPs, we approximate the belief state using an unweighted particle ﬁlter, and use a Monte-Carlo procedure to update particles based on sample observations, rewards, and state transitions. [sent-126, score-0.573]

54 Although weighted particle ﬁlters are used widely to represent belief states, an unweighted particle ﬁlter can be implemented particularly eﬃciently with a black box simulator, without requiring an explicit model of the POMDP, and providing excellent scalability to larger problems. [sent-127, score-0.427]

55 We approximate the belief state for history $h_t$ by $K$ particles, $B_t^i \in S$, $1 \le i \le K$. [sent-128, score-0.51]

56 Each particle corresponds to a sample state, and the belief state is the sum of all particles, $\hat{B}(s, h_t) = \frac{1}{K} \sum_{i=1}^{K} \delta_{s B_t^i}$, where $\delta_{ss'}$ is the Kronecker delta function. [sent-129, score-0.517]

57 After a real action at is executed, and a real observation ot is observed, the particles are updated by Monte-Carlo simulation. [sent-131, score-0.288]

58 A state $s$ is sampled from the current belief state $\hat{B}(s, h_t)$, by selecting a particle at random from $B_t$. [sent-132, score-0.635]

59 This particle is passed into the black box simulator, to give a successor state $s'$ and observation $o'$, $(s', o', r) \sim G(s, a_t)$. [sent-133, score-0.358]

60 This approximation to the belief state approaches the true belief state with sufficient particles, $\lim_{K \to \infty} \hat{B}(s, h_t) = B(s, h_t)$, $\forall s \in S$. [sent-136, score-0.816]
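
The unweighted particle update described in the preceding sentences can be sketched as rejection sampling. The acceptance step (keep a successor only when its simulated observation equals the real one) is not quoted in this summary and is an assumption here, as are the simulator signature, the `max_tries` cap, and the toy `tiger_listen` simulator.

```python
import random

def update_particles(particles, action, real_obs, simulator, k,
                     rng=random, max_tries=100000):
    """One unweighted particle filter step using a black box simulator."""
    new_particles, tries = [], 0
    while len(new_particles) < k and tries < max_tries:
        tries += 1
        s = rng.choice(particles)                    # sample s from the belief
        s2, obs, _reward = simulator(s, action, rng)  # (s', o', r) ~ G(s, a)
        if obs == real_obs:                          # keep consistent successors
            new_particles.append(s2)
    return new_particles

def tiger_listen(state, action, rng):
    """Hypothetical simulator: state is unchanged, observation is correct w.p. 0.85."""
    other = "R" if state == "L" else "L"
    obs = state if rng.random() < 0.85 else other
    return state, obs, -1.0
```

Starting from an even mix of `"L"` and `"R"` particles and observing `"L"`, the fraction of `"L"` particles in the result should approach the exact posterior of 0.85 as `k` grows.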

61 In practice we combine the belief state update with particle reinvigoration. [sent-138, score-0.365]

62 3 Partially Observable Monte-Carlo POMCP combines Monte-Carlo belief state updates with PO–UCT, and shares the same simulations for both Monte-Carlo procedures. [sent-141, score-0.37]

63 The search procedure is called from the current history ht . [sent-143, score-0.381]

64 Each simulation begins from a start state that is sampled from the belief state B(ht ). [sent-144, score-0.442]

65 Simulate($s'$, $hao$, depth + 1); $B(h) \leftarrow B(h) \cup \{s\}$; $N(h) \leftarrow N(h) + 1$; $N(ha) \leftarrow N(ha) + 1$; $V(ha) \leftarrow V(ha) + \frac{R - V(ha)}{N(ha)}$; return $R$; end procedure … using the partially observable UCT algorithm, as described above. [sent-147, score-0.265]

66 For every history h encountered during simulation, the belief state B(h) is updated to include the simulation state. [sent-148, score-0.449]
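
The search procedure outlined above can be sketched compactly. This is an illustrative reading rather than the authors' implementation: the `Node` class, the fixed `max_depth` cut-off, the `done` flag returned by the simulator, and the toy `tiger` problem are all assumptions.

```python
import math
import random

class Node:
    """Node for a history h or a history-action pair ha."""
    def __init__(self):
        self.n, self.v = 0, 0.0   # visit count and mean return
        self.children = {}        # h: action -> Node, ha: observation -> Node
        self.particles = []       # B(h): states that reached this history

def rollout(s, depth, sim, actions, gamma, max_depth, rng):
    ret, disc = 0.0, 1.0
    while depth < max_depth:
        s, _o, r, done = sim(s, rng.choice(actions), rng)
        ret += disc * r
        if done:
            break
        disc *= gamma
        depth += 1
    return ret

def simulate(s, h, depth, sim, actions, gamma, max_depth, c, rng):
    if depth >= max_depth:
        return 0.0
    if not h.children:            # new history: expand, then roll out
        h.children = {a: Node() for a in actions}
        return rollout(s, depth, sim, actions, gamma, max_depth, rng)
    def ucb(b):                   # untried actions get an infinite bonus
        ch = h.children[b]
        return float("inf") if ch.n == 0 else ch.v + c * math.sqrt(math.log(h.n) / ch.n)
    a = max(actions, key=ucb)
    ha = h.children[a]
    s2, o, r, done = sim(s, a, rng)
    ret = r
    if not done:
        hao = ha.children.setdefault(o, Node())
        ret += gamma * simulate(s2, hao, depth + 1, sim, actions, gamma, max_depth, c, rng)
    h.particles.append(s)         # B(h) <- B(h) U {s}
    h.n += 1
    ha.n += 1
    ha.v += (ret - ha.v) / ha.n   # incremental mean of returns
    return ret

def search(belief, sim, actions, n_sims=500, gamma=0.95, max_depth=20, c=1.0, rng=random):
    root = Node()
    for _ in range(n_sims):       # each simulation starts from a sampled state
        simulate(rng.choice(belief), root, 0, sim, actions, gamma, max_depth, c, rng)
    return max(actions, key=lambda a: root.children[a].v), root

ACTIONS = ["listen", "open_L", "open_R"]

def tiger(s, a, rng):
    """Hypothetical toy POMDP: s is the tiger's door; opening it costs 100."""
    if a == "listen":
        o = s if rng.random() < 0.85 else ("R" if s == "L" else "L")
        return s, o, -1.0, False
    return s, "none", (-100.0 if a[-1] == s else 10.0), True
```

After `search` returns, the particles stored under the child for the executed action and the received observation can serve as the next belief, which is how the same simulations are shared between planning and belief updating.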

67 When search is complete, the agent selects the action at with greatest value, and receives a real observation ot from the world. [sent-149, score-0.44]

68 At this point, the node $T(h_t a_t o_t)$ becomes the root of the new search tree, and the belief state $B(h_t a_t o_t)$ determines the agent’s new belief state. [sent-150, score-0.625]

69 This suggests two simple ways to apply UCT to POMDPs: either by converting every belief state into an MDP state, or by converting every history into an MDP state, and then applying UCT directly to the derived MDP. [sent-154, score-0.358]

70 However, the ﬁrst approach is computationally expensive in large POMDPs, where even a single belief state update can be prohibitively costly. [sent-155, score-0.256]

71 The key innovation of the PO–UCT algorithm is to apply a UCT search to a history-based MDP, but using a state-based simulator to eﬃciently sample states from the current beliefs. [sent-157, score-0.259]

72 In this section we prove that given the true belief state B(s, h), PO–UCT also converges to the optimal value function. [sent-158, score-0.256]

73 This is the distribution of histories generated by sampling an initial state $s_t \sim B(s, h_t)$, and then repeatedly sampling actions from policy $\pi(h, a)$ and sampling states, observations and rewards from $M$, until terminating at time $T$. [sent-166, score-0.749]

74 This is the distribution of histories generated by starting at $h_t$ and then repeatedly sampling actions from policy $\pi$ and sampling state transitions and rewards from $\tilde{M}$, until terminating at time $T$. [sent-168, score-0.652]

75 For any rollout policy $\pi$, the POMDP rollout distribution is equal to the derived MDP rollout distribution, $\forall \pi \; D^{\pi}(h_T) = \tilde{D}^{\pi}(h_T)$. [sent-170, score-0.844]

76 5 Experiments We applied POMCP to the benchmark rocksample problem, and to two new problems: battleship and pocman. [sent-179, score-0.275]

77 In the smaller rocksample problems, we compared POMCP to the best full-width online planning algorithms. [sent-182, score-0.392]

78 The PO-rollout algorithm used Monte-Carlo belief state updates, as described in section 3. [sent-185, score-0.256]

79 On the larger battleship and pocman problems, we combined POMCP with particle reinvigoration. [sent-192, score-0.382]

80 After each real action and observation, additional particles were added to the belief state, by applying a domain speciﬁc local transformation to existing particles. [sent-193, score-0.321]

81 When n simulations were used, n/16 new particles were added to the belief set. [sent-194, score-0.319]

82 We also introduced domain knowledge into the search algorithm, by deﬁning a set of preferred actions Ap . [sent-195, score-0.472]

83 When preferred actions were used, the rollout policy selected actions uniformly from $A_p$, and each new node $T(ha)$ in the tree was initialised to $V_{init}(ha) = R_{hi}$, $N_{init}(ha) = 10$ for preferred actions $a \in A_p$, and to $V_{init}(ha) = R_{lo}$, $N_{init}(ha) = 0$ for all other actions. [sent-197, score-1.382]

84 Otherwise, the rollout policy selected actions uniformly among all legal actions, and each new node $T(ha)$ was initialised to $V_{init}(ha) = 0$, $N_{init}(ha) = 0$ for all $a \in A$. [sent-198, score-0.452]
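
The initialisation rule in the two sentences above can be written down directly. The function name and the dictionary return type are illustrative; the values of the high and low rewards are passed in as parameters because this summary does not state them.

```python
def init_action_nodes(legal_actions, preferred, r_hi, r_lo):
    """Return {action: (V_init, N_init)} for the action nodes of a new history.

    preferred is the set of preferred actions; pass an empty set when no
    domain knowledge is used.
    """
    if not preferred:
        return {a: (0.0, 0) for a in legal_actions}
    return {a: (r_hi, 10) if a in preferred else (r_lo, 0)
            for a in legal_actions}
```

Giving preferred actions a non-zero initial count means their optimistic initial value persists for several simulations before real returns dominate the average.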

85 When provided with an exactly factored representation, online full-width planners have been successful in rocksample (7, 8) [13], and an oﬄine full-width planner has been successful in the much larger rocksample (11, 11) problem [11]. [sent-201, score-0.554]

86 On rocksample (7, 8), the performance of POMCP with preferred actions was close to the best prior online planning methods combined with oﬄine solvers. [sent-208, score-0.716]

87 On rocksample (11, 11), POMCP achieved the same performance with 4 seconds of online computation to the state-of-the-art solver SARSOP with 1000 seconds of oﬄine computation [11]. [sent-209, score-0.296]

88 28 Table 1: Comparison of Monte-Carlo planning with full-width planning on rocksample. [sent-231, score-0.336]

89 In the battleship POMDP, 5 ships are placed at random into a 10 × 10 grid, subject to the constraint that no ship may be placed adjacent or diagonally adjacent to another ship. [sent-238, score-0.311]

90 When preferred actions were used, impossible cells for ships were deduced automatically, by marking oﬀ the diagonally adjacent cells to each hit. [sent-250, score-0.45]

91 POMCP was able to sink all ships more than 50 moves faster, on average, than random play, and more than 25 moves faster than randomly selecting amongst preferred actions (which corresponds to the simple strategy used by many humans when playing the Battleship game). [sent-253, score-0.392]

92 Using preferred actions, POMCP achieved better results with less search; however, even without preferred actions, POMCP was able to deduce the diagonal constraints from its rollouts, and performed almost as well given more simulations per move. [sent-254, score-0.391]

93 The PocMan agent receives a reward of −1 at each step, +10 for each food pellet, +25 for eating a ghost and −100 for dying. [sent-262, score-0.281]

94 The search time for POMCP with preferred actions is shown on the top axis. [sent-281, score-0.451]

95 When using preferred actions, if PocMan was under the eﬀect of a power pill, then he preferred to move in directions where he saw ghosts. [sent-286, score-0.3]

96 Using preferred actions, POMCP achieved an average undiscounted return of over 300, compared to 230 for the PO-rollout algorithm. [sent-289, score-0.268]

97 In these challenging POMDPs, Monte-Carlo simulation provides an eﬀective mechanism both for tree search and for belief state updates, breaking the curse of dimensionality and allowing much greater scalability than has previously been possible. [sent-300, score-0.606]

98 Unlike previous approaches to Monte-Carlo planning in POMDPs, the PO–UCT algorithm provides a computationally eﬃcient best-ﬁrst search that focuses its samples in the most promising regions of the search space. [sent-301, score-0.422]

99 The battleship and pocman problems provide two examples of large POMDPs which cannot easily be factored and are intractable to prior algorithms for POMDP planning. [sent-303, score-0.299]

100 SARSOP: Eﬃcient point-based POMDP planning by approximating optimally reachable belief spaces. [sent-352, score-0.306]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pomcp', 0.519), ('rollout', 0.25), ('uct', 0.239), ('ha', 0.225), ('actions', 0.174), ('rocksample', 0.171), ('pocman', 0.169), ('planning', 0.168), ('pomdp', 0.156), ('ht', 0.152), ('preferred', 0.15), ('pomdps', 0.145), ('belief', 0.138), ('search', 0.127), ('tree', 0.121), ('state', 0.118), ('particle', 0.109), ('battleship', 0.104), ('planners', 0.104), ('observable', 0.102), ('history', 0.102), ('policy', 0.094), ('simulator', 0.092), ('simulations', 0.091), ('ninit', 0.091), ('particles', 0.09), ('agent', 0.09), ('ine', 0.089), ('return', 0.084), ('hao', 0.084), ('po', 0.078), ('action', 0.072), ('simulation', 0.068), ('ships', 0.068), ('ot', 0.066), ('mcts', 0.065), ('pss', 0.065), ('vinit', 0.065), ('ghost', 0.063), ('histories', 0.062), ('mdp', 0.057), ('sarsop', 0.057), ('ship', 0.057), ('initialised', 0.057), ('online', 0.053), ('ghosts', 0.052), ('pomc', 0.052), ('rocks', 0.052), ('rollouts', 0.052), ('zs', 0.049), ('receives', 0.048), ('horizon', 0.046), ('st', 0.044), ('reward', 0.043), ('partially', 0.042), ('box', 0.042), ('states', 0.04), ('rhi', 0.039), ('rlo', 0.039), ('ra', 0.039), ('node', 0.038), ('food', 0.037), ('legal', 0.037), ('depth', 0.037), ('observation', 0.037), ('seconds', 0.036), ('undiscounted', 0.034), ('diagonally', 0.034), ('ective', 0.034), ('montecarlo', 0.034), ('discounted', 0.034), ('curse', 0.034), ('black', 0.029), ('planner', 0.029), ('hb', 0.029), ('bits', 0.029), ('erent', 0.028), ('observations', 0.027), ('factored', 0.026), ('amazons', 0.026), ('invigoration', 0.026), ('pellets', 0.026), ('qinit', 0.026), ('swapped', 0.026), ('unvaluable', 0.026), ('sampling', 0.026), ('game', 0.025), ('valuable', 0.025), ('discount', 0.025), ('adjacent', 0.024), ('exploration', 0.024), ('bandit', 0.024), ('updated', 0.023), ('di', 0.023), ('updates', 0.023), ('successor', 0.023), ('basic', 0.023), ('domain', 0.021), ('manhattan', 0.021), ('eat', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 168 nips-2010-Monte-Carlo Planning in Large POMDPs

Author: David Silver, Joel Veness

2 0.21796936 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet

Abstract: Partially Observable Markov Decision Processes (POMDPs) model sequential decision-making problems under uncertainty and partial observability. Unfortunately, some problems cannot be modeled with state-dependent reward functions, e.g., problems whose objective explicitly implies reducing the uncertainty on the state. To that end, we introduce ρPOMDPs, an extension of POMDPs where the reward function ρ depends on the belief state. We show that, under the common assumption that ρ is convex, the value function is also convex, which makes it possible to (1) approximate ρ arbitrarily well with a piecewise linear and convex (PWLC) function, and (2) use state-of-the-art exact or approximate solving algorithms with limited changes.

3 0.18522991 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

Author: Finale Doshi-velez, David Wingate, Nicholas Roy, Joshua B. Tenenbaum

Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.

4 0.13965389 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “ofﬂine” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

5 0.13661259 229 nips-2010-Reward Design via Online Gradient Ascent

Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh

Abstract: Recent work has demonstrated that when artiﬁcial agents are limited in their ability to achieve their goals, the agent designer can beneﬁt by making the agent’s goals different from the designer’s. This gives rise to the optimization problem of designing the artiﬁcial agent’s goals—in the RL framework, designing the agent’s reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent’s lifetime nor do they take advantage of knowledge about the agent’s structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent’s lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations. 1 The Optimal Reward Problem In this work, we consider the scenario of an agent designer building an autonomous agent. The designer has his or her own goals which must be translated into goals for the autonomous agent. We represent goals using the Reinforcement Learning (RL) formalism of the reward function. This leads to the optimal reward problem of designing the agent’s reward function so as to maximize the objective reward received by the agent designer. Typically, the designer assigns his or her own reward to the agent. However, there is ample work which demonstrates the beneﬁt of assigning reward which does not match the designer’s. For example, work on reward shaping [11] has shown how to modify rewards to accelerate learning without altering the optimal policy, and PAC-MDP methods [5, 20] including approximate Bayesian methods [7, 19] add bonuses to the objective reward to achieve optimism under uncertainty. 
These approaches explicitly or implicitly assume that the asymptotic behavior of the agent should be the same as that which would occur using the objective reward function. These methods do not explicitly consider the optimal reward problem; however, they do show improved performance through reward modiﬁcation. In our recent work that does explicitly consider the optimal reward problem [18], we analyzed an explicit hypothesis about the beneﬁt of reward design—that it helps mitigate the performance loss caused by computational constraints (bounds) on agent architectures. We considered various types of agent limitations—limits on planning depth, failure to account for partial observability, and other erroneous modeling assumptions—and demonstrated the beneﬁts of good reward functions in each case empirically. Crucially, in bounded agents, the optimal reward function often leads to behavior that is different from the asymptotic behavior achieved with the objective reward function. In this work, we develop an algorithm, Policy Gradient for Reward Design (PGRD), for improving reward functions for a family of bounded agents that behave according to repeated local (from the current state) model-based planning. We show that this algorithm is capable of improving the reward functions in agents with computational limitations necessitating small bounds on the depth of planning, and also from the use of an inaccurate model (which may be inaccurate due to computationally-motivated approximations). PGRD has few parameters, improves the reward 1 function online during an agent’s lifetime, takes advantage of knowledge about the agent’s structure (through the gradient computation), and is linear in the number of reward function parameters. Notation. 
Formally, we consider discrete-time partially-observable environments with a ﬁnite number of hidden states s ∈ S, actions a ∈ A, and observations o ∈ O; these ﬁnite set assumptions are useful for our theorems, but our algorithm can handle inﬁnite sets in practice. Its dynamics are governed by a state-transition function P (s |s, a) that deﬁnes a distribution over next-states s conditioned on current state s and action a, and an observation function Ω(o|s) that deﬁnes a distribution over observations o conditioned on current state s. The agent designer’s goals are speciﬁed via the objective reward function RO . At each time step, the designer receives reward RO (st ) ∈ [0, 1] based on the current state st of the environment, where the subscript denotes time. The designer’s objective return is the expected mean objective reward N 1 obtained over an inﬁnite horizon, i.e., limN →∞ E N t=0 RO (st ) . In the standard view of RL, the agent uses the same reward function as the designer to align the interests of the agent and the designer. Here we allow for a separate agent reward function R(· ). An agent’s reward function can in general be deﬁned in terms of the history of actions and observations, but is often more pragmatically deﬁned in terms of some abstraction of history. We deﬁne the agent’s reward function precisely in Section 2. Optimal Reward Problem. An RL agent attempts to act so as to maximize its own cumulative reward, or return. Crucially, as a result, the sequence of environment-states {st }∞ is affected by t=0 the choice of reward function; therefore, the agent designer’s return is affected as well. The optimal reward problem arises from the fact that while the objective reward function is ﬁxed as part of the problem description, the reward function is a choice to be made by the designer. We capture this choice abstractly by letting the reward be parameterized by some vector of parameters θ chosen from space of parameters Θ. 
Each θ ∈ Θ speciﬁes a reward function R(· ; θ) which in turn produces a distribution over environment state sequences via whatever RL method the agent uses. The expected N 1 return obtained by the designer for choice θ is U(θ) = limN →∞ E N t=0 RO (st ) R(·; θ) . The optimal reward parameters are given by the solution to the optimal reward problem [16, 17, 18]: θ∗ = arg max U(θ) = arg max lim E θ∈Θ θ∈Θ N →∞ 1 N N RO (st ) R(·; θ) . (1) t=0 Our previous research on solving the optimal reward problem has focused primarily on the properties of the optimal reward function and its correspondence to the agent architecture and the environment [16, 17, 18]. This work has used inefﬁcient exhaustive search methods for ﬁnding good approximations to θ∗ (though there is recent work on using genetic algorithms to do this [6, 9, 12]). Our primary contribution in this paper is a new convergent online stochastic gradient method for ﬁnding approximately optimal reward functions. To our knowledge, this is the ﬁrst algorithm that improves reward functions in an online setting—during a single agent’s lifetime. In Section 2, we present the PGRD algorithm, prove its convergence, and relate it to OLPOMDP [2], a policy gradient algorithm. In Section 3, we present experiments demonstrating PGRD’s ability to approximately solve the optimal reward problem online. 2 PGRD: Policy Gradient for Reward Design PGRD builds on the following insight: the agent’s planning algorithm procedurally converts the reward function into behavior; thus, the reward function can be viewed as a speciﬁc parameterization of the agent’s policy. Using this insight, PGRD updates the reward parameters by estimating the gradient of the objective return with respect to the reward parameters, θ U(θ), from experience, using standard policy gradient techniques. In fact, we show that PGRD can be viewed as an (independently interesting) generalization of the policy gradient method OLPOMDP [2]. 
Specifically, we show that OLPOMDP is a special case of PGRD when the planning depth d is zero. In this section, we first present the family of local planning agents for which PGRD improves the reward function. Next, we develop PGRD and prove its convergence. Finally, we show that PGRD generalizes OLPOMDP and discuss how adding planning to OLPOMDP affects the space of policies available to the optimization method.

Input: $T$, $\theta_0$, $\{\alpha_t\}_{t=0}^{\infty}$, $\beta$, $\gamma$
1 $o_0, i_0$ = initializeStart();
2 for $t = 0, 1, 2, 3, \ldots$ do
3   $\forall a \; Q_t(a; \theta_t)$ = plan($i_t$, $o_t$, $T$, $R(i_t, \cdot, \cdot; \theta_t)$, $d$, $\gamma$);
4   $a_t \sim \mu(a \mid i_t; Q_t)$;
5   $r_{t+1}, o_{t+1}$ = takeAction($a_t$);
6   $z_{t+1} = \beta z_t + \frac{\nabla_{\theta_t} \mu(a_t \mid i_t; Q_t)}{\mu(a_t \mid i_t; Q_t)}$;
7   $\theta_{t+1} = \theta_t + \alpha_t (r_{t+1} z_{t+1} - \lambda \theta_t)$;
8   $i_{t+1}$ = updateInternalState($i_t$, $a_t$, $o_{t+1}$);
9 end
Figure 1: PGRD (Policy Gradient for Reward Design) Algorithm

A Family of Limited Agents with Internal State. Given a Markov model $T$ defined over the observation space $O$ and action space $A$, denote $T(o' \mid o, a)$ the probability of next observation $o'$ given that the agent takes action $a$ after observing $o$. Our agents use the model $T$ to plan. We do not assume that the model $T$ is an accurate model of the environment. The use of an incorrect model is one type of agent limitation we examine in our experiments. In general, agents can use non-Markov models defined in terms of the history of observations and actions; we leave this for future work. The agent maintains an internal state feature vector $i_t$ that is updated at each time step using $i_{t+1}$ = updateInternalState($i_t$, $a_t$, $o_{t+1}$). The internal state allows the agent to use reward functions that depend on the agent’s history. We consider rewards of the form $R(i_t, o, a; \theta_t) = \theta_t^{\mathsf{T}} \phi(i_t, o, a)$, where $\theta_t$ is the reward parameter vector at time $t$, and $\phi(i_t, o, a)$ is a vector of features based on internal state $i_t$, planning state $o$, and action $a$.
Note that if φ is a vector of binary indicator features, this representation allows for arbitrary reward functions and thus the representation is completely general. Many existing methods use reward functions that depend on history. Reward functions based on empirical counts of observations, as in PAC-MDP approaches [5, 20], provide some examples; see [14, 15, 13] for others. We present a concrete example in our empirical section.

At each time step t, the agent’s planning algorithm, plan, performs depth-d planning using the model T and reward function $R(i_t, o, a; \theta_t)$ with current internal state $i_t$ and reward parameters $\theta_t$. Specifically, the agent computes a d-step Q-value function $Q^d(i_t, o_t, a; \theta_t)$ for all a ∈ A, where

$$Q^d(i_t, o, a; \theta_t) = R(i_t, o, a; \theta_t) + \gamma \sum_{o' \in O} T(o' \mid o, a) \max_{b \in A} Q^{d-1}(i_t, o', b; \theta_t)$$

and $Q^0(i_t, o, a; \theta_t) = R(i_t, o, a; \theta_t)$. We emphasize that the internal state $i_t$ and reward parameters $\theta_t$ are held invariant while planning. Note that the d-step Q-values are only computed for the current observation $o_t$, in effect by building a depth-d tree rooted at $o_t$. In the d = 0 special case, the planning procedure completely ignores the model T and returns $Q^0(i_t, o_t, a; \theta_t) = R(i_t, o_t, a; \theta_t)$. Regardless of the value of d, we treat the end result of planning as providing a scoring function $Q_t(a; \theta_t)$ where the dependence on d, $i_t$ and $o_t$ is dropped from the notation. To allow for gradient calculations, our agents act according to the Boltzmann (soft-max) stochastic policy parameterized by Q:

$$\mu(a \mid i_t; Q_t) \stackrel{\text{def}}{=} \frac{e^{\tau Q_t(a; \theta_t)}}{\sum_{b} e^{\tau Q_t(b; \theta_t)}},$$

where τ is a temperature parameter that determines how stochastically the agent selects the action with the highest score. When the planning depth d is small due to computational limitations, the agent cannot account for events beyond the planning depth. We examine this limitation in our experiments.

Gradient Ascent.
To develop a gradient algorithm for improving the reward function, we need to compute the gradient of the objective return with respect to θ: $\nabla_\theta U(\theta)$. The main insight is to break the gradient calculation into the calculation of two gradients. The first is the gradient of the objective return with respect to the policy µ, and the second is the gradient of the policy with respect to the reward function parameters θ. The first gradient is exactly what is computed in standard policy gradient approaches [2]. The second gradient is challenging because the transformation from reward parameters to policy involves a model-based planning procedure. We draw from the work of Neu and Szepesvári [10], which shows that this gradient computation resembles planning itself. We develop PGRD, presented in Figure 1, explicitly as a generalization of OLPOMDP, a policy gradient algorithm developed by Bartlett and Baxter [2], because of its foundational simplicity relative to other policy-gradient algorithms such as those based on actor-critic methods (e.g., [4]). Notably, the reward parameters are the only parameters being learned in PGRD.

PGRD follows the form of OLPOMDP (Algorithm 1 in Bartlett and Baxter [2]) but generalizes it in three places. In Figure 1 line 3, the agent plans to compute the policy, rather than storing the policy directly. In line 6, the gradient of the policy with respect to the parameters accounts for the planning procedure. In line 8, the agent maintains a general notion of internal state that allows for richer parameterization of policies than typically considered (similar to Aberdeen and Baxter [1]). The algorithm takes as parameters a sequence of learning rates $\{\alpha_t\}$, a decaying-average parameter β, and a regularization parameter λ > 0 which keeps the reward parameters θ bounded throughout learning.
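The parameter update in lines 6 and 7 of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pgrd_step` and its argument names are hypothetical, and the policy gradient `grad_mu` is assumed to be supplied by the caller.

```python
import numpy as np

def pgrd_step(theta, z, grad_mu, mu_a, reward, alpha, beta, lam):
    """One PGRD parameter update (hypothetical helper; mirrors Figure 1, lines 6-7).

    theta   : current reward parameters, shape (k,)
    z       : current eligibility trace, shape (k,)
    grad_mu : gradient of mu(a_t | i_t; Q_t) with respect to theta, shape (k,)
    mu_a    : probability mu(a_t | i_t; Q_t) of the action actually taken
    reward  : objective reward r_{t+1} received after the action
    """
    # Line 6: decaying average of the likelihood-ratio term grad(mu) / mu.
    z_next = beta * z + grad_mu / mu_a
    # Line 7: gradient step on the objective return with L2-style shrinkage lam.
    theta_next = theta + alpha * (reward * z_next - lam * theta)
    return theta_next, z_next
```

Setting `lam` to 0 removes the shrinkage term, and `alpha = 0` leaves the reward parameters unchanged, which corresponds to an agent that does not adapt its reward function.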
Given a sequence of calculations of the gradient of the policy with respect to the parameters, $\nabla_{\theta_t} \mu(a_t \mid i_t; Q_t)$, the remainder of the algorithm climbs the gradient of objective return $\nabla_\theta U(\theta)$ using OLPOMDP machinery. In the next subsection, we discuss how to compute $\nabla_{\theta_t} \mu(a_t \mid i_t; Q_t)$.

Computing the Gradient of the Policy with respect to Reward. For the Boltzmann distribution, the gradient of the policy with respect to the reward parameters is given by the equation

$$\nabla_{\theta_t} \mu(a \mid i_t; Q_t) = \tau \cdot \mu(a \mid i_t; Q_t) \Big[ \nabla_{\theta_t} Q_t(a; \theta_t) - \sum_{b \in A} \mu(b \mid i_t; Q_t) \, \nabla_{\theta_t} Q_t(b; \theta_t) \Big],$$

where τ is the Boltzmann temperature (see [10]). Thus, computing $\nabla_{\theta_t} \mu(a \mid i_t; Q_t)$ reduces to computing $\nabla_{\theta_t} Q_t(a; \theta_t)$. The value of $Q_t$ depends on the reward parameters $\theta_t$, the model, and the planning depth. However, as we present below, the process of computing the gradient closely resembles the process of planning itself, and the two computations can be interleaved. Theorem 1 presented below is an adaptation of Proposition 4 from Neu and Szepesvári [10]. It presents the gradient computation for depth-d planning as well as for infinite-depth discounted planning. We assume that the gradient of the reward function with respect to the parameters is bounded: $\sup_{\theta, o, i, a} \lVert \nabla_\theta R(i, o, a; \theta) \rVert < \infty$. The proof of the theorem follows directly from Proposition 4 of Neu and Szepesvári [10].

Theorem 1. Except on a set of measure zero, for any depth d, the gradient $\nabla_\theta Q^d(o, a; \theta)$ exists and is given by the recursion (where we have dropped the dependence on i for simplicity)

$$\nabla_\theta Q^d(o, a; \theta) = \nabla_\theta R(o, a; \theta) + \gamma \sum_{o' \in O} T(o' \mid o, a) \sum_{b \in A} \pi^{d-1}(b \mid o') \, \nabla_\theta Q^{d-1}(o', b; \theta), \tag{2}$$

where $\nabla_\theta Q^0(o, a; \theta) = \nabla_\theta R(o, a; \theta)$ and $\pi^d(a \mid o) \in \arg\max_a Q^d(o, a; \theta)$ is any policy that is greedy with respect to $Q^d$. The result also holds for $\nabla_\theta Q^*(o, a; \theta) = \nabla_\theta \lim_{d \to \infty} Q^d(o, a; \theta)$.

The Q-function will not be differentiable when there are multiple optimal policies. This is reflected in the arbitrary choice of π in the gradient calculation.
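The interleaving of planning and gradient computation described in Theorem 1 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes a linear reward (so the gradient of R is the feature array `phi`), ignores internal state, and computes values for all observations by dynamic programming instead of building a tree rooted at the current observation. The function names are hypothetical.

```python
import numpy as np

def plan_with_gradient(theta, phi, T, gamma, depth):
    """Return Q^d and its gradient with respect to theta for every (o, a).

    theta : (k,) reward parameters
    phi   : (n_obs, n_act, k) reward features, so R = phi @ theta
    T     : (n_obs, n_act, n_obs) model with T[o, a, o2] = T(o2 | o, a)
    """
    R = phi @ theta                  # reward table, shape (n_obs, n_act)
    Q = R.copy()                     # Q^0 = R
    dQ = phi.copy()                  # grad Q^0 = grad R = phi
    rows = np.arange(Q.shape[0])
    for _ in range(depth):
        greedy = Q.argmax(axis=1)    # one greedy action per observation
        V = Q[rows, greedy]          # max_b Q^{d-1}(o', b), shape (n_obs,)
        dV = dQ[rows, greedy]        # gradient at the greedy action, (n_obs, k)
        Q = R + gamma * (T @ V)      # depth-d Q-value recursion
        dQ = phi + gamma * (T @ dV)  # gradient recursion, Equation (2)
    return Q, dQ

def boltzmann(q, tau):
    """Soft-max policy over the scores q with temperature parameter tau."""
    p = np.exp(tau * (q - q.max()))  # subtract the max for numerical stability
    return p / p.sum()

def boltzmann_grad(q, dq, tau):
    """Policy mu and its gradient: tau * mu(a) * (dq[a] - sum_b mu(b) * dq[b])."""
    mu = boltzmann(q, tau)
    return mu, tau * mu[:, None] * (dq - mu @ dq)
```

With `depth=0` the model `T` is never used and the scores reduce to the reward table, which matches the d = 0 special case in the text.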
However, it was shown by Neu and Szepesvári [10] that even for values of θ which are not differentiable, the above computation produces a valid calculation of a subgradient; we discuss this below in our proof of convergence of PGRD.

Convergence of PGRD (Figure 1). Given a particular fixed reward function R(·; θ), transition model T, and planning depth, there is a corresponding fixed randomized policy µ(a|i; θ), where we have explicitly represented the reward’s dependence on the internal state vector i in the policy parameterization and dropped Q from the notation as it is redundant given that everything else is fixed. Denote the agent’s internal-state update as a (usually deterministic) distribution ψ(i′|i, a, o). Given a fixed reward parameter vector θ, the joint environment-state–internal-state transitions can be modeled as a Markov chain with a |S||I| × |S||I| transition matrix M(θ) whose entries are given by

$$M_{\langle s,i \rangle, \langle s',i' \rangle}(\theta) = p(\langle s', i' \rangle \mid \langle s, i \rangle; \theta) = \sum_{o,a} \psi(i' \mid i, a, o) \, \Omega(o \mid s') \, P(s' \mid s, a) \, \mu(a \mid i; \theta).$$

We make the following assumptions about the agent and the environment:

Assumption 1. The transition matrix M(θ) of the joint environment-state–internal-state Markov chain has a unique stationary distribution $\pi(\theta) = [\pi_{\langle s_1, i_1 \rangle}(\theta), \pi_{\langle s_2, i_2 \rangle}(\theta), \ldots, \pi_{\langle s_{|S|}, i_{|I|} \rangle}(\theta)]$ satisfying the balance equations π(θ)M(θ) = π(θ), for all θ ∈ Θ.

Assumption 2. During its execution, PGRD (Figure 1) does not reach a value of $i_t$ and $\theta_t$ at which $\mu(a_t \mid i_t, Q_t)$ is not differentiable with respect to $\theta_t$.

It follows from Assumption 1 that the objective return, U(θ), is independent of the start state. The original OLPOMDP convergence proof [2] has a similar condition that only considers environment states. Intuitively, this condition allows PGRD to handle history-dependence of a reward function in the same manner that it handles partial observability in an environment. Assumption 2 accounts for the fact that a planning algorithm may not be fully differentiable everywhere.
However, Theorem 1 showed that infinite- and bounded-depth planning are differentiable almost everywhere (in a measure theoretic sense). Furthermore, this assumption is perhaps stronger than necessary, as stochastic approximation algorithms, which provide the theory upon which OLPOMDP is based, have been shown to converge using subgradients [8].

In order to state the convergence theorem, we must define the approximate gradient which OLPOMDP calculates. Let the approximate gradient estimate be $\nabla_\theta^{\beta} U(\theta) \stackrel{\text{def}}{=} \lim_{T \to \infty} \sum_{t=1}^{T} r_t z_t$ for a fixed θ and PGRD parameter β, where $z_t$ (in Figure 1) represents a time-decaying average of the $\nabla_{\theta_t} \mu(a_t \mid i_t, Q_t)$ calculations. It was shown by Bartlett and Baxter [2] that $\nabla_\theta^{\beta} U(\theta)$ is close to the true value $\nabla_\theta U(\theta)$ for large values of β. Theorem 2 proves that PGRD converges to a stable equilibrium point based on this approximate gradient measure. This equilibrium point will typically correspond to some local optimum in the return function U(θ). Given our development and assumptions, the theorem is a straightforward extension of Theorem 6 from Bartlett and Baxter [2] (proof omitted).

Theorem 2. Given β ∈ [0, 1), λ > 0, and a sequence of step sizes $\alpha_t$ satisfying $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} (\alpha_t)^2 < \infty$, PGRD produces a sequence of reward parameters $\theta_t$ such that $\theta_t \to L$ as $t \to \infty$ a.s., where L is the set of stable equilibrium points of the differential equation $\frac{\partial \theta}{\partial t} = \nabla_\theta^{\beta} U(\theta) - \lambda \theta$.

PGRD generalizes OLPOMDP. As stated above, OLPOMDP, when it uses a Boltzmann distribution in its policy representation (a common case), is a special case of PGRD when the planning depth is zero. First, notice that in the case of depth-0 planning, $Q^0(i, o, a; \theta) = R(i, o, a; \theta)$, regardless of the transition model and reward parameterization. We can also see from Theorem 1 that $\nabla_\theta Q^0(i, o, a; \theta) = \nabla_\theta R(i, o, a; \theta)$.
Because R(i, o, a; θ) can be parameterized arbitrarily, PGRD can be configured to match standard OLPOMDP with any policy parameterization that also computes a score function for the Boltzmann distribution. In our experiments, we demonstrate that choosing a planning depth d > 0 can be beneficial over using OLPOMDP (d = 0). In the remainder of this section, we show theoretically that choosing d > 0 does not hurt in the sense that it does not reduce the space of policies available to the policy gradient method. Specifically, we show that when using an expressive enough reward parameterization, PGRD’s space of policies is not restricted relative to OLPOMDP’s space of policies. We prove the result for infinite planning, but the extension to depth-limited planning is straightforward.

Theorem 3. There exists a reward parameterization such that, for an arbitrary transition model T, the space of policies representable by PGRD with infinite planning is identical to the space of policies representable by PGRD with depth 0 planning.

Proof. Ignoring internal state for now (holding it constant), let C(o, a) be an arbitrary reward function used by PGRD with depth 0 planning. Let R(o, a; θ) be a reward function for PGRD with infinite (d = ∞) planning. The depth-∞ agent uses the planning result $Q^*(o, a; \theta)$ to act, while the depth-0 agent uses the function C(o, a) to act. Therefore, it suffices to show that one can always choose θ such that the planning solution $Q^*(o, a; \theta)$ equals C(o, a). For all o ∈ O, a ∈ A, set

$$R(o, a; \theta) = C(o, a) - \gamma \sum_{o'} T(o' \mid o, a) \max_{a'} C(o', a').$$

Substituting $Q^*$ for C, this is the Bellman optimality equation [22] for infinite-horizon planning. Setting R(o, a; θ) as above is possible if it is parameterized by a table with an entry for each observation–action pair.

Theorem 3 also shows that the effect of an arbitrarily poor model can be overcome with a good choice of reward function.
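The construction in the proof of Theorem 3 can be checked numerically on a small random problem. The sketch below is an illustration under assumptions: tabular rewards, a randomly generated model, and value iteration run for a fixed number of sweeps standing in for infinite-depth planning. The helper names `reward_from_scores` and `q_star` are hypothetical.

```python
import numpy as np

def reward_from_scores(C, T, gamma):
    """Tabular reward R(o, a) = C(o, a) - gamma * sum_o' T(o'|o, a) * max_a' C(o', a')."""
    return C - gamma * (T @ C.max(axis=1))

def q_star(R, T, gamma, sweeps=2000):
    """Approximate infinite-depth planning by repeated Bellman optimality backups."""
    Q = np.zeros_like(R)
    for _ in range(sweeps):
        Q = R + gamma * (T @ Q.max(axis=1))
    return Q

rng = np.random.default_rng(1)
C = rng.normal(size=(5, 3))            # arbitrary depth-0 scoring function
T = rng.random((5, 3, 5))              # arbitrary (possibly poor) model
T /= T.sum(axis=2, keepdims=True)      # normalize rows into probabilities
R = reward_from_scores(C, T, gamma=0.9)
Q = q_star(R, T, gamma=0.9)
# Planning with the constructed reward recovers the target scores C.
recovered = np.allclose(Q, C, atol=1e-8)
```

Because the backup is a contraction with factor γ and C satisfies the Bellman optimality equation for the constructed reward, the iterates converge to C whatever model is used, which is the point the proof makes.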
This is because a Boltzmann distribution can, allowing for an arbitrary scoring function C, represent any policy. We demonstrate this ability of PGRD in our experiments.

3 Experiments

The primary objective of our experiments is to demonstrate that PGRD is able to use experience online to improve the reward function parameters, thereby improving the agent’s obtained objective return. Specifically, we compare the objective return achieved by PGRD to the objective return achieved by PGRD with the reward adaptation turned off. In both cases, the reward function is initialized to the objective reward function. A secondary objective is to demonstrate that when a good model is available, adding the ability to plan (even for small depths) improves performance relative to the baseline algorithm of OLPOMDP (or equivalently PGRD with depth d = 0).

Foraging Domain for Experiments 1 to 3: The foraging environment illustrated in Figure 2(a) is a 3 × 3 grid world with 3 dead-end corridors (rows) separated by impassable walls. The agent (bird) has four available actions corresponding to each cardinal direction. Movement in the intended direction fails with probability 0.1, resulting in movement in a random direction. If the resulting direction is blocked by a wall or the boundary, the action results in no movement.

Figure 2: A) Foraging Domain, B) Performance of PGRD with observation-action reward features, C) Performance of PGRD with recency reward features. (Panels B and C plot Objective Return against Time Steps.)

There is a food source (worm) located in one of the three right-most locations at the end of each corridor. The agent has an eat action, which consumes the worm when the agent is at the worm’s location.
After the agent consumes the worm, a new worm appears randomly in one of the other two potential worm locations.

Objective Reward for the Foraging Domain: The designer’s goal is to maximize the average number of worms eaten per time step. Thus, the objective reward function $R_O$ provides a reward of 1.0 when the agent eats a worm, and a reward of 0 otherwise. The objective return is defined as in Equation (1).

Experimental Methodology: We tested PGRD for depth-limited planning agents of depths 0–6. Recall that PGRD for the agent with planning depth 0 is the OLPOMDP algorithm. For each depth, we jointly optimized over the PGRD algorithm parameters, α and β (we use a fixed α throughout learning). We tested values for α on an approximate logarithmic scale in the range $(10^{-6}, 10^{-2})$ as well as the special value of α = 0, which corresponds to an agent that does not adapt its reward function. We tested β values in the set {0, 0.4, 0.7, 0.9, 0.95, 0.99}. Following common practice [3], we set the λ parameter to 0. We explicitly bound the reward parameters and capped the reward function output both to the range [−1, 1]. We used a Boltzmann temperature parameter of τ = 100 and planning discount factor γ = 0.95. Because we initialized θ so that the initial reward function was the objective reward function, PGRD with α = 0 was equivalent to a standard depth-limited planning agent.

Experiment 1: A fully observable environment with a correct model learned online. In this experiment, we improve the reward function in an agent whose only limitation is planning depth, using (1) a general reward parameterization based on the current observation and (2) a more compact reward parameterization which also depends on the history of observations.

Observation: The agent observes the full state, which is given by the pair o = (l, w), where l is the agent’s location and w is the worm’s location.
Learning a Correct Model: Although the theorem of convergence of PGRD relies on the agent having a fixed model, the algorithm itself is readily applied to the case of learning a model online. In this experiment, the agent’s model T is learned online based on empirical transition probabilities between observations (recall this is a fully observable environment). Let $n_{o,a,o'}$ be the number of times that o′ was reached after taking action a after observing o. The agent models the probability of seeing o′ as

$$T(o' \mid o, a) = \frac{n_{o,a,o'}}{\sum_{o''} n_{o,a,o''}}.$$

Reward Parameterizations: Recall that $R(i, o, a; \theta) = \theta^{\top} \phi(i, o, a)$, for some φ(i, o, a). (1) In the observation-action parameterization, φ(i, o, a) is a binary feature vector with one binary feature for each observation-action pair; internal state is ignored. This is effectively a table representation over all reward functions indexed by (o, a). As shown in Theorem 3, the observation-action feature representation is capable of producing arbitrary policies over the observations. In large problems, such a parameterization would not be feasible. (2) The recency parameterization is a more compact representation which uses features that rely on the history of observations. The feature vector is $\phi(i, o, a) = [R_O(o, a), 1, \phi_{cl}(l, i), \phi_{cl,a}(l, a, i)]$, where $R_O(o, a)$ is the objective reward function defined as above. The feature $\phi_{cl}(l, i) = 1 - 1/c(l, i)$, where c(l, i) is the number of time steps since the agent has visited location l, as represented in the agent’s internal state i. Its value is normalized to the range [0, 1) and is high when the agent has not been to location l recently. The feature $\phi_{cl,a}(l, a, i) = 1 - 1/c(l, a, i)$ is similarly defined with respect to the time since the agent has taken action a in location l. Features based on recency counts encourage persistent exploration [21, 18].
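The count-based model and the recency feature described above can be sketched as follows. This is a minimal illustration with hypothetical names (`CountModel`, `recency_feature`). Two details are assumptions not stated in the text: unseen observation-action pairs are given probability 0, and the recency count is taken to be at least 1 so that the feature stays in [0, 1).

```python
from collections import defaultdict

class CountModel:
    """Empirical transition model T(o' | o, a) built from visit counts."""

    def __init__(self):
        # counts[(o, a)][o_next] = number of times o_next followed (o, a)
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, o, a, o_next):
        """Record one observed transition."""
        self.counts[(o, a)][o_next] += 1

    def prob(self, o_next, o, a):
        """Return n(o, a, o_next) / sum over o'' of n(o, a, o'')."""
        row = self.counts.get((o, a))
        if not row:
            return 0.0  # assumption: no data for this pair yet
        return row.get(o_next, 0) / sum(row.values())


def recency_feature(steps_since):
    """Recency feature 1 - 1/c, where c >= 1 is the number of steps since the last visit."""
    if steps_since < 1:
        raise ValueError("steps_since must be at least 1")
    return 1.0 - 1.0 / steps_since
```

In use, the agent would call `update` after every real transition and query `prob` during planning, so the model (and with it the planned scores) changes as experience accumulates.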
Results & Discussion: Figure 2(b) and Figure 2(c) present results for agents that use the observation-action parameterization and the recency parameterization of the reward function respectively. The horizontal axis is the number of time steps of experience. The vertical axis is the objective return, i.e., the average objective reward per time step. Each curve is an average over 130 trials. The values of d and the associated optimal algorithm parameters for each curve are noted in the figures. First, note that with d = 6, the agent is unbounded, because food is never more than 6 steps away. Therefore, the agent does not benefit from adapting the reward function parameters (given that we initialize to the objective reward function). Indeed, the d = 6, α = 0 agent performs as well as the best reward-optimizing agent. The performance for d = 6 improves with experience because the model improves with experience (and thus from the curves it is seen that the model gets quite accurate in about 1500 time steps). The largest objective return obtained for d = 6 is also the best objective return that can be obtained for any value of d.

Several results can be observed in both Figures 2(b) and (c). 1) Each curve that uses α > 0 (solid lines) improves with experience. This is a demonstration of our primary contribution, that PGRD is able to effectively improve the reward function with experience. That the improvement over time is not just due to model learning is seen in the fact that for each value of d < 6 the curve for α > 0 (solid-line) which adapts the reward parameters does significantly better than the corresponding curve for α = 0 (dashed-line); the α = 0 agents still learn the model. 2) For both α = 0 and α > 0 agents, the objective return obtained by agents with equivalent amounts of experience increases monotonically as d is increased (though to maintain readability we only show selected values of d in each figure).
This demonstrates our secondary contribution, that the ability to plan in PGRD signiﬁcantly improves performance over standard OLPOMDP (PGRD with d = 0). There are also some interesting differences between the results for the two different reward function parameterizations. With the observation-action parameterization, we noted that there always exists a setting of θ for all d that will yield optimal objective return. This is seen in Figure 2(b) in that all solid-line curves approach optimal objective return. In contrast, the more compact recency reward parameterization does not afford this guarantee and indeed for small values of d (< 3), the solid-line curves in Figure 2(c) converge to less than optimal objective return. Notably, OLPOMDP (d = 0) does not perform well with this feature set. On the other hand, for planning depths 3 ≤ d < 6, the PGRD agents with the recency parameterization achieve optimal objective return faster than the corresponding PGRD agent with the observation-action parameterization. Finally, we note that this experiment validates our claim that PGRD can improve reward functions that depend on history. Experiment 2: A fully observable environment and poor given model. Our theoretical analysis showed that PGRD with an incorrect model and the observation–action reward parameterization should (modulo local maxima issues) do just as well asymptotically as it would with a correct model. Here we illustrate this theoretical result empirically on the same foraging domain and objective reward function used in Experiment 1. We also test our hypothesis that a poor model should slow down the rate of learning relative to a correct model. Poor Model: We gave the agents a ﬁxed incorrect model of the foraging environment that assumes there are no internal walls separating the 3 corridors. Reward Parameterization: We used the observation–action reward parameterization. 
With a poor model it is no longer interesting to initialize θ so that the initial reward function is the objective reward function because even for d = 6 such an agent would do poorly. Furthermore, we found that this initialization leads to excessively bad exploration and therefore poor learning of how to modify the reward. Thus, we initialize θ to uniform random values near 0, in the range $(-10^{-3}, 10^{-3})$.

Results: Figure 3(a) plots the objective return as a function of number of steps of experience. Each curve is an average over 36 trials. As hypothesized, the bad model slows learning by a factor of more than 10 (notice the difference in the x-axis scales from those in Figure 2). Here, deeper planning results in slower learning and indeed the d = 0 agent that does not use the model at all learns the fastest. However, also as hypothesized, because they used the expressive observation–action parameterization, agents of all planning depths mitigate the damage caused by the poor model and eventually converge to the optimal objective return.

Experiment 3: Partially observable foraging world. Here we evaluate PGRD’s ability to learn in a partially observable version of the foraging domain. In addition, the agents learn a model under the erroneous (and computationally convenient) assumption that the domain is fully observable.
Figure 3: A) Performance of PGRD with a poor model, B) Performance of PGRD in a partially observable world with recency reward features, C) Performance of PGRD in Acrobot. (All panels plot Objective Return against Time Steps.)

Partial Observation: Instead of viewing the location of the worm at all times, the agent can now only see the worm when it is colocated with it: its observation is o = (l, f), where f indicates whether the agent is colocated with the food.

Learning an Incorrect Model: The model is learned just as in Experiment 1. Because of the erroneous full observability assumption, the model will hallucinate about worms at all the corridor ends based on the empirical frequency of having encountered them there.

Reward Parameterization: We used the recency parameterization; due to the partial observability, agents with the observation–action feature set perform poorly in this environment. The parameters θ are initialized such that the initial reward function equals the objective reward function.

Results & Discussion: Figure 3(b) plots the mean of 260 trials. As seen in the solid-line curves, PGRD improves the objective return at all depths (only a small amount for d = 0 and significantly more for d > 0). In fact, agents which don’t adapt the reward are hurt by planning (relative to d = 0). This experiment demonstrates that the combination of planning and reward improvement can be beneficial even when the model is erroneous. Because of the partial observability, optimal behavior in this environment achieves less objective return than in Experiment 1.

Experiment 4: Acrobot.
In this experiment we test PGRD in the Acrobot environment [22], a common benchmark task in the RL literature and one that has previously been used in the testing of policy gradient approaches [23]. This experiment demonstrates PGRD in an environment in which an agent must be limited due to the size of the state space and further demonstrates that adding model-based planning to policy gradient approaches can improve performance.

Domain: The version of Acrobot we use is as specified by Sutton and Barto [22]. It is a two-link robot arm in which the position of one shoulder-joint is fixed and the agent’s control is limited to 3 actions which apply torque to the elbow-joint.

Observation: The fully-observable state space is 4 dimensional, with two joint angles $\psi_1$ and $\psi_2$, and two joint velocities $\dot{\psi}_1$ and $\dot{\psi}_2$.

Objective Reward: The designer receives an objective reward of 1.0 when the tip is one arm’s length above the fixed shoulder-joint, after which the bot is reset to its initial resting position.

Model: We provide the agent with a perfect model of the environment. Because the environment is continuous, value iteration is intractable, and computational limitations prevent planning deep enough to compute the optimal action in any state. The feature vector contains 13 entries. One feature corresponds to the objective reward signal. For each action, there are 5 features corresponding to each of the state features plus an additional feature representing the height of the tip: $\phi(i, o, a) = [R_O(o), \{\psi_1(o), \psi_2(o), \dot{\psi}_1(o), \dot{\psi}_2(o), h(o)\}_a]$. The height feature has been used in previous work as an alternative definition of objective reward [23].

Results & Discussion: We plot the mean of 80 trials in Figure 3(c). Agents that use the fixed (α = 0) objective reward function with bounded-depth planning perform according to the bottom two curves. Allowing PGRD and OLPOMDP to adapt the parameters θ leads to improved objective return, as seen in the top two curves in Figure 3(c).
Finally, the PGRD d = 6 agent outperforms the standard OLPOMDP agent (PGRD with d = 0), further demonstrating that PGRD outperforms OLPOMDP.

Overall Conclusion: We developed PGRD, a new method for approximately solving the optimal reward problem in bounded planning agents that can be applied in an online setting. We showed that PGRD is a generalization of OLPOMDP and demonstrated that it both improves reward functions in limited agents and outperforms the model-free OLPOMDP approach.

References

[1] Douglas Aberdeen and Jonathan Baxter. Scalable Internal-State Policy-Gradient Methods for POMDPs. Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
[2] Peter L. Bartlett and Jonathan Baxter. Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control, 2000.
[3] Jonathan Baxter, Peter L. Bartlett, and Lex Weaver. Experiments with Infinite-Horizon, Policy-Gradient Estimation, 2001.
[4] Shalabh Bhatnagar, Richard S. Sutton, M Ghavamzadeh, and Mark Lee. Natural actor-critic algorithms. Automatica, 2009.
[5] Ronen I. Brafman and Moshe Tennenholtz. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research, 3:213–231, 2001.
[6] S. Elfwing, Eiji Uchibe, K. Doya, and H. I. Christensen. Co-evolution of Shaping Rewards and Meta-Parameters in Reinforcement Learning. Adaptive Behavior, 16(6):400–412, 2008.
[7] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, 2009.
[8] Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2010.
[9] Çetin Meriçli, Tekin Meriçli, and H. Levent Akin. A Reward Function Generation Method Using Genetic Algorithms: A Robot Soccer Case Study (Extended Abstract). In Proc. of the 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010), number 2, pages 1513–1514, 2010.
[10] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, pages 295–302, 2007.
[11] Andrew Y. Ng, Stuart J. Russell, and D. Harada. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.
[12] Scott Niekum, Andrew G. Barto, and Lee Spector. Genetic Programming for Reward Function Search. IEEE Transactions on Autonomous Mental Development, 2(2):83–90, 2010.
[13] Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic Motivation Systems for Autonomous Mental Development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, April 2007.
[14] Jürgen Schmidhuber. Curious model-building control systems. In IEEE International Joint Conference on Neural Networks, pages 1458–1463, 1991.
[15] Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically Motivated Reinforcement Learning. In Proceedings of Advances in Neural Information Processing Systems 17 (NIPS), pages 1281–1288, 2005.
[16] Satinder Singh, Richard L. Lewis, and Andrew G. Barto. Where Do Rewards Come From? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601–2606, 2009.
[17] Satinder Singh, Richard L. Lewis, Andrew G. Barto, and Jonathan Sorg. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective. IEEE Transactions on Autonomous Mental Development, 2(2):70–82, 2010.
[18] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Internal Rewards Mitigate Agent Boundedness. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[19] Jonathan Sorg, Satinder Singh, and Richard L. Lewis. Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.
[20] Alexander L. Strehl and Michael L. Littman. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
[21] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In The Seventh International Conference on Machine Learning, pages 216–224, 1990.
[22] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[23] Lex Weaver and Nigel Tao. The Optimal Reward Baseline for Gradient-Based Reinforcement Learning. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 538–545, 2001.

6 0.12433978 4 nips-2010-A Computational Decision Theory for Interactive Assistants

7 0.1223362 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

8 0.11504161 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

9 0.10282134 64 nips-2010-Distributionally Robust Markov Decision Processes

10 0.098806262 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

11 0.09391962 212 nips-2010-Predictive State Temporal Difference Learning

12 0.085742749 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

13 0.084781907 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

14 0.083897509 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

15 0.079226285 192 nips-2010-Online Classification with Specificity Constraints

16 0.07622622 43 nips-2010-Bootstrapping Apprenticeship Learning

17 0.074394517 208 nips-2010-Policy gradients in linearly-solvable MDPs

18 0.073752239 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

19 0.073161885 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

20 0.070504621 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.155), (1, -0.21), (2, -0.046), (3, -0.015), (4, -0.067), (5, 0.02), (6, -0.032), (7, 0.019), (8, 0.04), (9, 0.128), (10, -0.046), (11, 0.065), (12, -0.056), (13, -0.04), (14, 0.045), (15, -0.085), (16, -0.088), (17, 0.021), (18, -0.011), (19, -0.024), (20, -0.057), (21, -0.001), (22, -0.007), (23, -0.12), (24, 0.091), (25, 0.005), (26, -0.039), (27, 0.008), (28, 0.043), (29, -0.115), (30, -0.008), (31, 0.054), (32, 0.029), (33, -0.046), (34, 0.031), (35, 0.055), (36, -0.023), (37, 0.038), (38, -0.006), (39, -0.052), (40, 0.069), (41, -0.058), (42, -0.036), (43, -0.079), (44, -0.011), (45, 0.0), (46, -0.071), (47, 0.083), (48, -0.056), (49, -0.04)]
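
The page does not state how a simValue is derived from these topic weights. A common choice for comparing topic vectors is cosine similarity, and the sketch below assumes that choice. The function name `cosine_similarity` and the `other_paper` vector are hypothetical and serve only as illustration; they are not taken from the site.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two topic vectors given as (topicId, weight) pairs.

    Topics missing from a vector are treated as weight 0, so the same
    function works for dense (LSI-style) and sparse (LDA-style) listings.
    """
    wa, wb = dict(a), dict(b)
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# First three LSI weights listed above for this paper.
this_paper = [(0, 0.155), (1, -0.21), (2, -0.046)]
# Hypothetical weights for some other paper (not taken from the page).
other_paper = [(0, 0.120), (1, -0.180), (3, 0.030)]

print(cosine_similarity(this_paper, other_paper))
```

Under plain cosine similarity a vector compared with itself scores 1.0, whereas the same-paper row in the list that follows reports a value below 1, so the site's actual pipeline presumably differs from this sketch in some detail.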

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96315861 168 nips-2010-Monte-Carlo Planning in Large POMDPs

Author: David Silver, Joel Veness

2 0.77669692 4 nips-2010-A Computational Decision Theory for Interactive Assistants

Author: Alan Fern, Prasad Tadepalli

Abstract: We study several classes of interactive assistants from the points of view of decision theory and computational complexity. We first introduce a class of POMDPs called hidden-goal MDPs (HGMDPs), which formalize the problem of interactively assisting an agent whose goal is hidden and whose actions are observable. In spite of its restricted nature, we show that optimal action selection in finite horizon HGMDPs is PSPACE-complete even in domains with deterministic dynamics. We then introduce a more restricted model called helper action MDPs (HAMDPs), where the assistant’s action is accepted by the agent when it is helpful, and can be easily ignored by the agent otherwise. We show classes of HAMDPs that are complete for PSPACE and NP along with a polynomial time class. Furthermore, we show that for general HAMDPs a simple myopic policy achieves a regret, compared to an omniscient assistant, that is bounded by the entropy of the initial goal distribution. A variation of this policy is shown to achieve worst-case regret that is logarithmic in the number of goals for any goal distribution.

3 0.73474866 229 nips-2010-Reward Design via Online Gradient Ascent

Author: Jonathan Sorg, Richard L. Lewis, Satinder P. Singh

4 0.71423352 11 nips-2010-A POMDP Extension with Belief-dependent Rewards

Author: Mauricio Araya, Olivier Buffet, Vincent Thomas, François Charpillet

5 0.66413492 64 nips-2010-Distributionally Robust Markov Decision Processes

Author: Huan Xu, Shie Mannor

Abstract: We consider Markov decision processes where the values of the parameters are uncertain. This uncertainty is described by a sequence of nested sets (that is, each set contains the previous one), each of which corresponds to a probabilistic guarantee for a different confidence level so that a set of admissible probability distributions of the unknown parameters is specified. This formulation models the case where the decision maker is aware of and wants to exploit some (yet imprecise) a-priori information of the distribution of parameters, and arises naturally in practice where methods to estimate the confidence region of parameters abound. We propose a decision criterion based on distributional robustness: the optimal policy maximizes the expected total reward under the most adversarial probability distribution over realizations of the uncertain parameters that is admissible (i.e., it agrees with the a-priori information). We show that finding the optimal distributionally robust policy can be reduced to a standard robust MDP where the parameters belong to a single uncertainty set, hence it can be computed in polynomial time under mild technical conditions.

6 0.61484003 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

7 0.6138621 93 nips-2010-Feature Construction for Inverse Reinforcement Learning

8 0.58042806 50 nips-2010-Constructing Skill Trees for Reinforcement Learning Agents from Demonstration Trajectories

9 0.56604964 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

10 0.52408296 43 nips-2010-Bootstrapping Apprenticeship Learning

11 0.50217146 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

12 0.47508982 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

13 0.45241132 39 nips-2010-Bayesian Action-Graph Games

14 0.44452673 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions

15 0.43018708 212 nips-2010-Predictive State Temporal Difference Learning

16 0.42776996 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

17 0.40371513 66 nips-2010-Double Q-learning

18 0.40109831 192 nips-2010-Online Classification with Specificity Constraints

19 0.39556915 152 nips-2010-Learning from Logged Implicit Exploration Data

20 0.39314297 215 nips-2010-Probabilistic Deterministic Infinite Automata

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.02), (27, 0.072), (30, 0.041), (35, 0.011), (45, 0.24), (50, 0.039), (52, 0.038), (60, 0.05), (67, 0.028), (77, 0.034), (90, 0.022), (95, 0.296)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81153446 256 nips-2010-Structural epitome: a way to summarize one’s visual experience

Author: Nebojsa Jojic, Alessandro Perina, Vittorio Murino

Abstract: In order to study the properties of total visual input in humans, a single subject wore a camera for two weeks capturing, on average, an image every 20 seconds. The resulting new dataset contains a mix of indoor and outdoor scenes as well as numerous foreground objects. Our first goal is to create a visual summary of the subject’s two weeks of life using unsupervised algorithms that would automatically discover recurrent scenes, familiar faces or common actions. Direct application of existing algorithms, such as panoramic stitching (e.g., Photosynth) or appearance-based clustering models (e.g., the epitome), is impractical due to either the large dataset size or the dramatic variations in the lighting conditions. As a remedy to these problems, we introduce a novel image representation, the “structural element (stel) epitome,” and an associated efficient learning algorithm. In our model, each image or image patch is characterized by a hidden mapping T which, as in previous epitome models, defines a mapping between the image coordinates and the coordinates in the large “all-I-have-seen” epitome matrix. The limited epitome real-estate forces the mappings of different images to overlap which indicates image similarity. However, the image similarity no longer depends on direct pixel-to-pixel intensity/color/feature comparisons as in previous epitome models, but on spatial configuration of scene or object parts, as the model is based on the palette-invariant stel models. As a result, stel epitomes capture structure that is invariant to non-structural changes, such as illumination changes, that tend to uniformly affect pixels belonging to a single scene or object part.

same-paper 2 0.80506319 168 nips-2010-Monte-Carlo Planning in Large POMDPs

Author: David Silver, Joel Veness

3 0.76528352 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms

Author: Margareta Ackerman, Shai Ben-David, David Loker

Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill-defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof.

4 0.66261244 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

Author: Akshat Kumar, Shlomo Zilberstein

Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances.

5 0.66064072 287 nips-2010-Worst-Case Linear Discriminant Analysis

Author: Yu Zhang, Dit-Yan Yeung

Abstract: Dimensionality reduction is often needed in many applications due to the high dimensionality of the data involved. In this paper, we first analyze the scatter measures used in the conventional linear discriminant analysis (LDA) model and note that the formulation is based on the average-case view. Based on this analysis, we then propose a new dimensionality reduction method called worst-case linear discriminant analysis (WLDA) by defining new between-class and within-class scatter measures. This new model adopts the worst-case view which arguably is more suitable for applications such as classification. When the number of training data points or the number of features is not very large, we relax the optimization problem involved and formulate it as a metric learning problem. Otherwise, we take a greedy approach by finding one direction of the transformation at a time. Moreover, we also analyze a special case of WLDA to show its relationship with conventional LDA. Experiments conducted on several benchmark datasets demonstrate the effectiveness of WLDA when compared with some related dimensionality reduction methods.

6 0.66060042 152 nips-2010-Learning from Logged Implicit Exploration Data

7 0.65731025 103 nips-2010-Generating more realistic images using gated MRF's

8 0.65719563 63 nips-2010-Distributed Dual Averaging In Networks

9 0.6565361 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

10 0.65636981 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

11 0.65580499 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect

12 0.65563339 150 nips-2010-Learning concept graphs from text with stick-breaking priors

13 0.65550041 1 nips-2010-(RF)^2 -- Random Forest Random Field

14 0.65467602 158 nips-2010-Learning via Gaussian Herding

15 0.65429872 224 nips-2010-Regularized estimation of image statistics by Score Matching

16 0.65428752 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

17 0.65419269 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

18 0.65388417 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades

19 0.6536181 155 nips-2010-Learning the context of a category

20 0.65350473 212 nips-2010-Predictive State Temporal Difference Learning