Abstract: We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.
1 We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. [sent-3, score-0.835]
2 Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. [sent-4, score-1.309]
3 We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning. [sent-5, score-0.834]
4 1 Introduction We address the reinforcement learning (RL) problem of finding a good policy in an unknown, stochastic, and partially observable domain, given both data from independent exploration and expert demonstrations. [sent-6, score-1.145]
5 In contrast, imitation and inverse reinforcement learning [5, 6] use expert trajectories to learn reward models. [sent-9, score-0.671]
6 We consider cases where we have data from both independent exploration and expert trajectories. [sent-11, score-0.463]
7 Data from independent observation gives direct information about the dynamics, while expert demonstrations show outputs of good policies and thus provide indirect information about the underlying model. [sent-12, score-0.677]
8 Because dynamics and policies are linked through a complex, nonlinear function, leveraging information about both these aspects at once is challenging. [sent-14, score-0.275]
9 In previous work [7, 8, 9, 10], the model prior p(M ) was defined as a distribution directly on the dynamics and rewards models, making it difficult to incorporate expert trajectories. [sent-17, score-0.678]
10 Our main contribution is a new approach to defining this prior: our prior uses the assumption that the expert knew something about the world model when computing his optimal policy. [sent-18, score-0.682]
11 In this domain, one of our key technical contributions is the insight that the Bayesian approach used for building models of transition dynamics can also be used as policy priors, if we exchange the typical role of actions and observations. [sent-21, score-0.671]
12 For example, algorithms for learning partially observable Markov decision processes (POMDPs) build models that output observations and take in actions as exogenous variables. [sent-22, score-0.361]
13 By using nonparametric priors [12], our agent can scale the sophistication of its policies and world models based on the data. [sent-24, score-0.783]
14 First, our choices for the policy prior and a world model prior can be viewed as a joint prior which introduces a bias for world models which are both simple and easy to control. [sent-26, score-1.164]
15 This bias is especially beneficial in the case of direct policy search, where it is easier to search directly for good controllers than it is to first construct a complete POMDP model and then plan with it. [sent-27, score-0.677]
16 Our method can also be used with approximately optimal expert data; in these cases the expert data can be used to bias which models are likely but not set hard constraints on the model. [sent-28, score-0.895]
17 The state transition function T (s′ |s, a) defines the distribution over next-states s′ to which the agent may transition after taking action a from state s. [sent-33, score-0.423]
18 Bayesian nonparametric approaches are well-suited for partially observable environments because they can also infer the dimensionality of the underlying state space. [sent-40, score-0.294]
19 Other approaches [15, 16, 9] try to balance the off-line computation of a good policy (the computational complexity) and the cost of getting data online (the sample complexity). [sent-45, score-0.471]
20 Finite State Controllers Another possibility for choosing actions, including in our partially observable reinforcement learning setting, is to consider a parametric family of policies, and attempt to estimate the optimal policy parameters from data. [sent-46, score-0.533]
21 This is the approach underlying, for example, much work on policy gradients. [sent-47, score-0.441]
22 The node transition function β(n′|n, o) defines the distribution over next-nodes n′ to which the agent may transition after receiving observation o in node n. [sent-51, score-0.341]
23 The policy function π(a|n) is a distribution over actions that the finite state controller may output in node n. [sent-52, score-0.716]
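The two functions β and π can be made concrete with a minimal sketch of a finite state controller. The class name, the list-based tables, and the two-node example (move in one direction until a "wall" observation, then switch) are hypothetical choices for illustration, not the authors' implementation.

```python
import random


class FiniteStateController:
    """Illustrative controller (hypothetical layout, not from the paper).

    beta[n][o] is a distribution over next nodes n' given node n and observation o.
    pi[n] is a distribution over actions emitted in node n.
    """

    def __init__(self, beta, pi, start_node=0, seed=0):
        self.beta = beta
        self.pi = pi
        self.node = start_node
        self.rng = random.Random(seed)

    def act(self):
        # Sample a ~ pi(a | n) for the current node.
        probs = self.pi[self.node]
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def observe(self, o):
        # Sample n' ~ beta(n' | n, o) and move to it.
        probs = self.beta[self.node][o]
        self.node = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Assumed example: action 0 until observation 1 is seen, then action 1 forever.
beta = [
    [[1.0, 0.0], [0.0, 1.0]],  # node 0: obs 0 -> stay, obs 1 -> node 1
    [[0.0, 1.0], [0.0, 1.0]],  # node 1: absorbing
]
pi = [[1.0, 0.0], [0.0, 1.0]]

fsc = FiniteStateController(beta, pi)
actions = []
for o in [0, 0, 1, 0]:
    actions.append(fsc.act())
    fsc.observe(o)
```

Note that the controller's memory is only its current node, so the size of the policy is governed by the number of nodes rather than by the size of the world's state space.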
24 3 Nonparametric Bayesian Policy Priors We now describe our framework for combining world models and expert data. [sent-54, score-0.626]
25 Recall that our key assumption is that the expert used knowledge about the underlying world to derive his policy. [sent-55, score-0.586]
26 Figure 1: Two graphical models of expert data generation. [sent-57, score-0.471]
27 Left: the prior only addresses world dynamics and rewards. [sent-58, score-0.323]
28 Right: the prior addresses both world dynamics and controllable policies. [sent-59, score-0.353]
29 Combined with the world model M , the expert’s policy πe and agent’s policy πa produce the expert’s and agent’s data De and Da . [sent-62, score-1.06]
30 The data consist of a sequence of histories, where a history $h_t$ is a sequence of actions $a_1, \ldots, a_t$, observations $o_1, \ldots, o_t$, and rewards $r_1, \ldots, r_t$. [sent-63, score-0.267]
31 The agent has access to all histories, but the true world model and optimal policy are hidden. [sent-64, score-0.779]
32 Both graphical models assume that a particular world M is sampled from a prior over POMDPs, gM (M ). [sent-65, score-0.366]
33 In what would be the standard application of Bayesian RL with expert data (Fig. [sent-66, score-0.408]
34 1(a)), the prior gM (M ) fully encapsulates our initial belief over world models. [sent-67, score-0.297]
35 An expert, who knows the true world model M , executes a planning algorithm plan(M ) to construct an optimal policy πe . [sent-68, score-0.716]
36 The expert then executes the policy to generate expert data De , distributed according to p(De |M, πe ), where πe = plan(M ). [sent-69, score-1.281]
37 1(a) does not easily allow us to encode a prior bias toward more controllable world models. [sent-71, score-0.39]
38 In particular, we choose a distribution of the form $p(\pi_e \mid M) \propto f_M(\pi_e)\, g_\pi(\pi_e)$ (1), where we interpret $g_\pi(\pi_e)$ as a prior over policies and $f_M(\pi_e)$ as a likelihood of a policy given a model. [sent-74, score-0.763]
39 For example, if the policy class is the set of finite state controllers as discussed in Sec. [sent-77, score-0.555]
40 2, the policy prior gπ(πe) might encode preferences for a smaller number of nodes used by the policy, while gM(M) might encode preferences for a smaller number of visited states in the world. [sent-78, score-0.802]
41 The function fM (πe ) can also be made more general to encode how likely it is that the expert uses the policy πe given world model M . [sent-79, score-1.074]
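As a rough illustration of how the two factors in Eq. 1 interact, the sketch below normalizes the product of a model-dependent likelihood and a policy prior over a small finite set of candidate policies. Everything specific here is an assumption made for the example: the candidates, the exponential "soft optimality" form of the likelihood, and the geometric penalty on node count are not taken from the paper.

```python
import math


def policy_conditional(candidates, f_M, g_pi):
    """Normalize f_M(pi) * g_pi(pi) over a finite candidate set (Eq. 1)."""
    weights = [f_M(p) * g_pi(p) for p in candidates]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical candidates: (number of controller nodes, value under model M).
candidates = [(1, 2.0), (2, 5.0), (6, 5.1)]


def f_M(p, temperature=1.0):
    # Assumed soft-optimality likelihood: higher value under M is more likely.
    return math.exp(p[1] / temperature)


def g_pi(p, decay=0.5):
    # Assumed policy prior: each extra node halves the prior weight.
    return decay ** p[0]


probs = policy_conditional(candidates, f_M, g_pi)
best = probs.index(max(probs))
```

In this toy setting the two-node policy wins: it is nearly as good as the six-node policy under the model, and the prior penalizes the extra nodes heavily.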
42 Similarly, the conditional distribution over policies given data $D_e$ and $D_a$ is $p(\pi_e \mid D_e, D_a) \propto g_\pi(\pi_e)\, p(D_e \mid \pi_e) \int_M f_M(\pi_e)\, p(D_e^{o,r}, D_a \mid M)\, g_M(M)$. We next describe three inference approaches for using Eqs. [sent-83, score-0.3]
43 If fM (πe ) = δ(plan(M )) and we believe that all policies are equally likely (graphical model 1(a)), then we can leverage the expert’s data by simply considering how well that world model’s policy plan(M ) matches the expert’s actions for a particular world model M . [sent-86, score-1.151]
44 The uniform policy prior implied by standard Bayesian RL does not allow us to encode prior biases about the policy. [sent-94, score-0.68]
45 6 becomes $E[q(a)] = \int_M q(a \mid M)\, p(D_e^{o,r}, D_a \mid M)\, g_M(M)\, g_\pi(\mathrm{plan}(M))\, p(D_e^{a} \mid \mathrm{plan}(M))$ (7), where we still assume that the expert uses an optimal policy, that is, $f_M(\pi_e) = \delta(\mathrm{plan}(M))$. [sent-97, score-0.408]
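One way to read Eq. 7 is as an importance-weighted average: models are drawn from a proposal that uses only the model prior and the dynamics data, and each sample is then reweighted by the policy prior of its plan and by how likely the expert's actions are under that plan. The sketch below assumes this reading. The memoryless stochastic policies standing in for plan(M), the uniform policy prior, and all numbers are hypothetical simplifications.

```python
import math


def log_expert_action_likelihood(policy, expert_steps):
    """Log-likelihood of expert actions under a memoryless policy.

    policy[o][a] is the probability of action a after observation o (assumed form).
    """
    return sum(math.log(policy[o][a]) for o, a in expert_steps)


def importance_weighted_q(samples, expert_steps, log_g_pi):
    """samples: list of (policy, q), with q[a] = q(a | M) for a sampled model M."""
    log_w = [
        log_g_pi(policy) + log_expert_action_likelihood(policy, expert_steps)
        for policy, _ in samples
    ]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    n_actions = len(samples[0][1])
    return [
        sum(wi * q[a] for wi, (_, q) in zip(w, samples)) / z
        for a in range(n_actions)
    ]


# Two hypothetical sampled models, each with its planned policy and q-values.
samples = [
    ({0: [0.9, 0.1], 1: [0.1, 0.9]}, [1.0, 0.0]),
    ({0: [0.1, 0.9], 1: [0.9, 0.1]}, [0.0, 1.0]),
]
expert_steps = [(0, 0), (1, 1), (0, 0)]  # (observation, action) pairs


def uniform_log_prior(policy):
    return 0.0  # assumed uniform policy prior


expected_q = importance_weighted_q(samples, expert_steps, uniform_log_prior)
```

With a deterministic optimal plan the action likelihood would be zero whenever the expert deviates from the plan, which is one reason a softer likelihood is attractive for near-optimal experts.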
46 It also assumes that the expert used the optimal policy, whereas a more realistic assumption might be that the expert uses a near-optimal policy. [sent-100, score-0.816]
47 While the model-based inference for policy priors is correct, using importance weights often suffers when the proposal distribution is not near the true posterior. [sent-105, score-0.612]
48 In particular, sampling world models and policies—both very high dimensional objects—from distributions that ignore large parts of the evidence means that large numbers of samples may be needed to get accurate estimates. [sent-106, score-0.283]
49 We now describe an inference approach, alternating between sampling models and sampling policies, that both avoids importance sampling and can be used even [Footnote 1: We omit the belief over world states b(s) from the equations that follow for clarity; all references to q(a|M) are q(a|bM(s), M).] [sent-107, score-0.638]
50 The inference proceeds in two alternating stages: first, we sample a new policy given a sampled model. [sent-110, score-0.514]
51 1, we use the iPOMDP [12] as a conjugate prior over policies encoded as finite state controllers. [sent-114, score-0.385]
52 While applying MH can suffer from the same issues as the importance sampling in the model-based approach, Gibbs sampling new policies removes one set of proposal distributions from the inference, resulting in better estimates with fewer samples. [sent-123, score-0.319]
53 1 Priors over State Controller Policies We now turn to the definition of the policy prior p(πe ). [sent-125, score-0.537]
54 In theory, any policy prior can be used, but there are some practical considerations. [sent-126, score-0.537]
55 Mathematically, the policy prior serves as a regularizer to avoid overfitting the expert data, so it should encode a preference toward simple policies. [sent-127, score-0.992]
56 In discrete domains, one choice for the policy prior (as well as the model prior) is the iPOMDP [12]. [sent-129, score-0.537]
57 The iPOMDP posits that there are an infinite number of states s but a few popular states are visited most of the time; the beam sampler [17] can efficiently draw samples of state transition, observation, and reward models for visited states. [sent-131, score-0.499]
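The idea that a few popular states carry most of the probability mass can be illustrated with a truncated stick-breaking draw of the kind commonly used in Dirichlet-process models. This is only a sketch of that general construction under an assumed concentration parameter; it is not the iPOMDP prior itself and not the beam sampler of [17].

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Return the first `truncation` stick-breaking weights and the leftover mass."""
    remaining = 1.0
    weights = []
    for _ in range(truncation):
        frac = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking(alpha=1.0, truncation=20, rng=rng)
```

Smaller values of `alpha` concentrate the mass on fewer components, which corresponds to a stronger preference for small state spaces.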
58 To use the iPOMDP as a policy prior, we simply reverse the roles of actions and observations, treating the observations as inputs and the actions as outputs. [sent-133, score-0.69]
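The role reversal can be sketched as two views of the same history. In the model view, actions are inputs and observations are outputs; in the policy view, observations are inputs and actions are outputs. The tuple layout and the choice to pair each action with the previous step's observation (with a placeholder for the first step) are assumptions made for this example.

```python
def model_view(history):
    """(input, output) pairs for world-model learning: action in, observation out."""
    return [(a, o) for a, o, _ in history]


def policy_view(history, initial_input=None):
    """(input, output) pairs for controller learning: observation in, action out.

    The action at step t is paired with the observation received at step t-1.
    """
    inputs = [initial_input] + [o for _, o, _ in history[:-1]]
    outputs = [a for a, _, _ in history]
    return list(zip(inputs, outputs))


# Hypothetical expert history of (action, observation, reward) triples.
history = [("east", "clear", 0.0), ("east", "wall", 0.0), ("south", "goal", 1.0)]

as_model_data = model_view(history)
as_policy_data = policy_view(history)
```

Because both views have the same input-output shape, the same inference machinery can in principle be applied to either one.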
59 Now, the iPOMDP posits that there is a state controller with an infinite number of nodes n, but probable policies use only a small subset of the nodes a majority of the time. [sent-134, score-0.27]
60 We perform joint inference over the node transition and policy parameters β and π as well as the visited nodes n. [sent-135, score-0.615]
61 As in the model prior application, using the iPOMDP as a policy prior biases the agent towards simpler policies—those that visit fewer nodes—but allows the number of nodes to grow with new expert experience. [sent-138, score-1.303]
62 There are some mild conditions on the world and policy priors to ensure consistency: since the policy prior and model prior are specified independently, we require that there exist models for which both the policy prior and model prior are non-zero in the limit of data. [sent-141, score-2.052]
63 Formally, we also require that the expert provide optimal trajectories; in practice, we see that this assumption can be relaxed. [sent-142, score-0.408]
64 3 output samples of models or policies to be used for planning. [sent-148, score-0.295]
65 During the testing phase, the internal belief state of the models (in the model-based approaches) or the internal node state of the policies (in the policy-based approaches), is updated after each action-observation pair. [sent-152, score-0.437]
66 Actions are chosen by first selecting, depending on the approach, a model or policy based on their weights, and then performing its most preferred action. [sent-154, score-0.441]
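The test-phase procedure for the model-based case can be sketched as a standard discrete belief update followed by weighted model selection. The table layouts `T[s][a][s2]` and `O[a][s2][o]`, the tiger-style numbers, and the q-values are assumptions for illustration only.

```python
import random


def belief_update(b, a, o, T, O):
    """Bayes filter: b'(s2) is proportional to O[a][s2][o] * sum_s T[s][a][s2] * b[s]."""
    n = len(b)
    unnorm = [
        O[a][s2][o] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def choose_action(weights, q_values, rng):
    """Select a sampled model by its weight, then take its most preferred action."""
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    q = q_values[i]
    return max(range(len(q)), key=q.__getitem__)


# Assumed two-state problem with one "listen" action that is 85% accurate.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]
b1 = belief_update([0.5, 0.5], a=0, o=0, T=T, O=O)

# Hypothetical weights and per-model action values at the current belief.
rng = random.Random(0)
action = choose_action([1.0, 0.0], [[0.2, 0.7], [0.9, 0.1]], rng)
```

In the policy-based case the belief update would be replaced by the controller's node transition, with the same selection step applied to sampled policies.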
67 4 Experiments We first describe a pair of demonstrations that show two important properties of using policy priors: (1) that policy priors can be useful even in the absence of expert data and (2) that our approach works even when the expert trajectories are not optimal. [sent-156, score-1.921]
68 We then compare policy priors with the basic iPOMDP [12] and finite-state model learner trained with EM on several standard problems. [sent-157, score-0.597]
69 The agent was provided with an expert trajectory with probability . [sent-160, score-0.568]
70 No expert trajectories were provided in the last quarter of the iterations. [sent-162, score-0.482]
71 Models and policies were updated every 100 iterations, and each episode was capped at 50 iterations (though it could be shorter, if the task was achieved in fewer iterations). [sent-164, score-0.29]
72 Following each update, we ran 50 test episodes (not included in the agent’s experience) with the new models and policies to empirically evaluate the current value of the agents’ policy. [sent-165, score-0.287]
73 One iteration of bounded policy iteration [19] was performed per sampled model. [sent-168, score-0.47]
74 Policy Priors with No Expert Data The combined policy and model prior can be used to encode a prior bias towards models with simpler control policies. [sent-171, score-0.786]
75 6 be useful even without expert data: the left pane of Fig. [sent-173, score-0.408]
76 2 shows the performance of the policy prior-biased approaches and the standard iPOMDP on a gridworld problem in which observations correspond to both the adjacent walls (relevant for planning) and the color of the square (not relevant for planning). [sent-174, score-0.595]
77 The optimal policy for this gridworld was simple: go east until the agent hits a wall, then go south. [sent-176, score-0.69]
78 Without expert data, Approach 1 cannot do better than iPOMDP. [sent-178, score-0.408]
79 By biasing the agent towards worlds that admit simpler policies, the model-based inference with policy priors (Approach 2) creates a faster learner. [sent-179, score-0.824]
80 Policy Priors with Imperfect Experts While we focused on optimal expert data, in practice policy priors can be applied even if the expert is imperfect. [sent-180, score-1.384]
81 We generated expert data by first deriving 16 motor primitives for the action space using a clustering technique on a near-optimal trajectory produced by a rapidly-exploring random tree (RRT). [sent-185, score-0.498]
82 Trajectories from this controller were treated as expert data for our policy prior model. [sent-187, score-1.028]
83 In the beach problem, the agent needed to track a beach ball on a 2D grid. [sent-191, score-0.288]
84 We compared our inference approaches with two approaches that did not leverage the expert data: expectation-maximization (EM) used to learn a finite world model of the correct size and the infinite POMDP [12], which placed the same nonparametric prior over world models as we did. [sent-193, score-1.077]
85 3 shows the learning curves for our policy priors approaches (problems ordered by state space size); the cumulative rewards and final values are shown in Table 1. [sent-200, score-0.833]
86 As expected, approaches that leverage expert trajectories generally perform better than those that ignore the near-optimality of the expert data. [sent-201, score-0.941]
87 Here, even though the inferred state spaces could grow large, policies remained relatively simple. [sent-203, score-0.312]
88 The optimization used in the policy-based approach—recall we use the stochastic search to find a probable policy—was also key to producing reasonable policies with limited computation. [sent-204, score-0.289]
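89 [Table 1: cumulative reward and final value for the iPOMDP and the proposed approaches on the tiger, network, shuttle, follow, gridworld, hallway, beach, rocksample, tag, and image problems; numeric entries not recoverable.] [sent-205, score-0.362]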
90 Both [1, 4] describe how expert data augment learning. [sent-313, score-0.408]
91 In contrast, [4] lets the agent query an expert for optimal actions. [sent-317, score-0.594]
92 While policy information may be much easier to specify, incorporating the result of a single query into the prior over models is challenging; the particle-filtering approach of [4] can be brittle as model spaces grow large. [sent-318, score-0.656]
93 Our policy priors approach uses entire trajectories; by learning policies rather than single actions, we can generalize better and evaluate models more holistically. [sent-319, score-0.834]
94 Targeted criteria for asking for expert trajectories, especially ones with performance guarantees such as [4], would be an interesting extension to our approach. [sent-321, score-0.408]
95 6 Conclusion We addressed a key gap in the learning-by-demonstration literature: learning from both expert and agent data in a partially observable setting. [sent-322, score-0.717]
96 Prior work used expert data in MDP and imitation-learning cases, but less work exists for the general POMDP case. [sent-323, score-0.408]
97 Our Bayesian approach combined priors over the world models and policies, connecting information about world dynamics and expert trajectories. [sent-324, score-0.98]
98 Taken together, these priors are a new way to think about specifying priors over models: instead of simply putting a prior over the dynamics, our prior provides a bias towards models with simple dynamics and simple optimal policies. [sent-325, score-0.601]
99 We show that, with our approach, expert data never reduces performance, and that our extra bias towards controllability improves performance even without expert data. [sent-326, score-0.882]
100 Our policy priors over nonparametric finite state controllers were relatively simple; developing classes of priors that address more problems is an interesting direction for future work. [sent-327, score-0.861]
