nips nips2007 nips2007-30 knowledge-graph by maker-knowledge-mining

30 nips-2007-Bayes-Adaptive POMDPs


Source: pdf

Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We describe approximations for belief tracking and planning in this model. [sent-13, score-0.505]

2 Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. [sent-14, score-0.22]

3 Thus we seek a decision-theoretic planner which can take into account the uncertainty over model parameters during the planning process, while also learning the values of these unknown parameters from experience. [sent-20, score-0.253]

4 (2) how to approximate the infinite dimensional belief space to perform belief monitoring and compute the optimal policy. [sent-25, score-0.578]

5 The first problem is solved by including the Dirichlet parameters in the state space and maintaining belief states over these parameters. [sent-27, score-0.464]

6 We provide theoretical results for bounding the state space while preserving the value function and we use these results to derive approximate solving and belief monitoring algorithms. [sent-29, score-0.458]

7 We compare several belief approximations in two problem domains. [sent-30, score-0.299]

8 Empirical results show that the agent is able to learn good POMDP models and improve its return as it learns better model estimates. [sent-31, score-0.19]

9 It has transition probabilities $\{T^{sas'}\}_{s,s' \in S, a \in A}$, where $T^{sas'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, and observation probabilities $\{O^{saz}\}_{s \in S, a \in A, z \in Z}$, where $O^{saz} = \Pr(z_t = z \mid s_t = s, a_{t-1} = a)$. [sent-33, score-0.819]

10 The reward function R : S × A → R specifies the immediate reward obtained by the agent. [sent-34, score-0.172]

11 Instead the agent perceives an observation z ∈ Z at each time step, which (along with the action sequence) allows it to maintain a belief state b ∈ ∆S. [sent-36, score-0.668]

12 The belief state specifies the probability of being in each state given the history of actions and observations experienced so far, starting from an initial belief $b_0$. [sent-37, score-0.904]

13 It can be updated at each time step using Bayes' rule: $b_{t+1}(s') = \frac{O^{s' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s'} b_t(s)}{\sum_{s'' \in S} O^{s'' a_t z_{t+1}} \sum_{s \in S} T^{s a_t s''} b_t(s)}$. [sent-38, score-0.142]
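
As a concrete illustration of this filter, here is a minimal sketch of the standard POMDP belief update in Python, assuming dense NumPy arrays T[s, a, s'] and O[s', a, z]; all names are illustrative, not taken from the paper.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One step of the POMDP belief filter (Bayes' rule).

    b: current belief over states, shape (|S|,)
    a: action index, z: observation index
    T[s, a, s'] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    O[s', a, z] = Pr(z_{t+1} = z | s_{t+1} = s', a_t = a)
    """
    predicted = b @ T[:, a, :]            # Pr(s' | b, a)
    unnormalized = O[:, a, z] * predicted # weight by observation likelihood
    norm = unnormalized.sum()
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / norm
```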

14 A policy π : ∆S → A indicates how the agent should select actions as a function of the current belief. [sent-39, score-0.194]

15 Solving a POMDP involves finding the optimal policy π ∗ that maximizes the expected discounted return over the infinite horizon. [sent-40, score-0.144]

16 The return obtained by following $\pi^*$ from a belief $b$ is defined by Bellman's equation: $V^*(b) = \max_{a \in A} \left[ \sum_{s \in S} b(s) R(s,a) + \gamma \sum_{z \in Z} \Pr(z \mid b, a) V^*(\tau(b,a,z)) \right]$, where $\tau(b,a,z)$ is the new belief after performing action $a$ and observing $z$, and $\gamma \in [0,1)$ is the discount factor. [sent-41, score-0.806]

17 3 Bayes-Adaptive POMDP In this section, we introduce the Bayes-Adaptive POMDP (BAPOMDP) model, an optimal decision-theoretic algorithm for learning and planning in POMDPs under parameter uncertainty. [sent-46, score-0.179]

18 Throughout we assume that the state, action, and observation spaces are finite and known, but that the transition and observation probabilities are unknown or partially known. [sent-47, score-0.32]

19 We also assume that the reward function is known as it is generally specified by the user for the specific task he wants to accomplish, but the model can easily be generalised to learn the reward function as well. [sent-48, score-0.202]

20 To model the uncertainty on the transition $T^{sas'}$ and observation $O^{saz}$ parameters, we use Dirichlet distributions, which are probability distributions over the parameters of multinomial distributions. [sent-49, score-0.518]

21 The uncertainty on the distributions $T^{sa\cdot}$ and $O^{s'a\cdot}$ can be represented by experience counts: $\phi^a_{ss'}$ ($\forall s'$) represents the number of times the transition $(s,a,s')$ occurred; similarly, $\psi^a_{s'z}$ ($\forall z$) is the number of times observation $z$ was made in state $s'$ after doing action $a$. [sent-71, score-0.396]

22 Let φ be the vector of all transition counts and ψ be the vector of all observation counts. [sent-72, score-0.201]

23 Given the count vectors $\phi$ and $\psi$, the expected transition probability for $T^{sas'}$ is $T^{sas'}_{\phi} = \frac{\phi^a_{ss'}}{\sum_{s'' \in S} \phi^a_{ss''}}$, and similarly for $O^{s'az}$: $O^{s'az}_{\psi} = \frac{\psi^a_{s'z}}{\sum_{z' \in Z} \psi^a_{s'z'}}$. [sent-73, score-1.624]
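
A small sketch of these expected parameters, assuming the counts are stored as dense arrays phi[s, a, s'] and psi[s', a, z] (our layout, chosen for illustration):

```python
import numpy as np

def expected_T(phi, s, a):
    """Expected transition distribution T^{sa.}_phi from the Dirichlet counts."""
    counts = phi[s, a, :].astype(float)
    return counts / counts.sum()

def expected_O(psi, s_next, a):
    """Expected observation distribution O^{s'a.}_psi from the Dirichlet counts."""
    counts = psi[s_next, a, :].astype(float)
    return counts / counts.sum()
```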

24 The objective of the BAPOMDP is to learn an optimal policy, such that actions are chosen to maximize reward taking into account both state and parameter uncertainty. [sent-74, score-0.258]

25 To model this, we follow the Bayes-Adaptive MDP framework, and include the φ and ψ vectors in the state of the BAPOMDP. [sent-75, score-0.216]

26 Thus, the state space $S'$ of the BAPOMDP is defined as $S' = S \times \mathcal{T} \times \mathcal{O}$, where $\mathcal{T} = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall (s,a), \sum_{s' \in S} \phi^a_{ss'} > 0\}$ represents the space in which $\phi$ lies and $\mathcal{O} = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall (s,a), \sum_{z \in Z} \psi^a_{sz} > 0\}$ represents the space in which $\psi$ lies. [sent-76, score-0.377]

27 The action and observation sets of the BAPOMDP are the same as in the original POMDP. [sent-77, score-0.164]

28 Transition and observation functions of the BAPOMDP must capture how the state and count vectors φ, ψ evolve after every time step. [sent-78, score-0.401]

29 Consider an agent in a given state s with count vectors φ and ψ, which performs action a, causing it to move to state s′ and observe z. [sent-79, score-0.57]

30 Then the vector $\phi'$ after the transition is defined as $\phi' = \phi + \delta^a_{ss'}$, where $\delta^a_{ss'}$ is a vector full of zeroes, with a 1 for the count $\phi^a_{ss'}$, and the vector $\psi'$ after the observation is defined as $\psi' = \psi + \delta^a_{s'z}$, where $\delta^a_{s'z}$ is a vector full of zeroes, with a 1 for the count $\psi^a_{s'z}$. [sent-80, score-0.68]

31 Hence, we define $T'$ and $O'$ such that $T'((s, \phi, \psi), a, (s', \phi', \psi')) = T^{sas'}_{\phi} O^{s'az}_{\psi}$ if $\phi' = \phi + \delta^a_{ss'}$ and $\psi' = \psi + \delta^a_{s'z}$, and $0$ otherwise. [sent-82, score-0.418]
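
The hyper-state dynamics can be sketched as follows, again assuming dense count arrays; this is only an illustrative rendering of the definition above, not code from the paper.

```python
def bapomdp_transition_prob(phi, psi, s, a, s_next, z):
    """T'_phi * O'_psi: probability of reaching (s', phi + delta^a_{ss'}, psi + delta^a_{s'z})
    and observing z, given hyper-state (s, phi, psi) and action a."""
    t = phi[s, a, s_next] / phi[s, a, :].sum()
    o = psi[s_next, a, z] / psi[s_next, a, :].sum()
    return t * o

def increment_counts(phi, psi, s, a, s_next, z):
    """Return copies of the counts with the observed transition and observation added."""
    phi2, psi2 = phi.copy(), psi.copy()
    phi2[s, a, s_next] += 1   # phi' = phi + delta^a_{ss'}
    psi2[s_next, a, z] += 1   # psi' = psi + delta^a_{s'z}
    return phi2, psi2
```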

32 Note here that the observation probabilities are folded into the transition function, and that the observation function becomes deterministic. [sent-84, score-0.288]

33 This happens because a state transition in the BAPOMDP automatically specifies which observation is acquired after transition, via the way the counts are incremented. [sent-85, score-0.309]

34 Since the counts do not affect the reward, the reward function of the BAPOMDP is defined as R′ ((s, φ, ψ), a) = R(s, a); the discount factor of the BAPOMDP remains the same. [sent-86, score-0.157]

35 The belief state of the BAPOMDP represents a distribution over both states and count values. [sent-88, score-0.553]

36 The model is learned by simply maintaining this belief state, as the distribution will concentrate over most likely models, given the prior and experience so far. [sent-89, score-0.328]

37 If $b_0$ is the initial belief state of the unknown POMDP, and the count vectors $\phi_0 \in \mathcal{T}$ and $\psi_0 \in \mathcal{O}$ represent the prior knowledge on this POMDP, then the initial belief of the BAPOMDP is $b'_0(s, \phi, \psi) = b_0(s)$ if $(\phi, \psi) = (\phi_0, \psi_0)$, and $0$ otherwise. [sent-90, score-0.802]

38 After actions are taken, the uncertainty on the POMDP model is represented by mixtures of Dirichlet distributions. [sent-91, score-0.138]

39 Hence the belief update function and optimal value function are still defined as in Section 2. [sent-95, score-0.262]

40 Maintaining the belief state is practical only if the number of states with non-zero probabilities is finite. [sent-97, score-0.474]

41 The proof of this theorem suggests that it is sufficient to iterate over $S$ and $S'_{b'_{t-1}}$ in order to compute the belief state $b'_t$ when an action and observation are taken in the environment. [sent-105, score-0.605]

42 for all $(s, \phi, \psi, s') \in S'_b \times S$ do: $b'(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}) \leftarrow b'(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z}) + b(s, \phi, \psi)\, T^{sas'}_{\phi} O^{s'az}_{\psi}$; end for; return the normalized $b'$. [sent-111, score-0.222]
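
A minimal Python rendering of this exact update loop, assuming the belief is a dictionary from hyper-states (s, phi, psi) to probabilities, with the count tensors stored as nested tuples so they can serve as dictionary keys (a representation we chose for the sketch):

```python
import numpy as np

def totuple(a):
    """Hashable nested-tuple view of a count tensor."""
    return tuple(totuple(x) for x in a) if hasattr(a, "__iter__") else a

def exact_bapomdp_update(b, a, z, n_states):
    """Exact BAPOMDP belief update over all reachable hyper-states."""
    b_next = {}
    for (s, phi, psi), p in b.items():
        phi_arr = np.array(phi, dtype=float)
        psi_arr = np.array(psi, dtype=float)
        for s_next in range(n_states):
            t = phi_arr[s, a, s_next] / phi_arr[s, a].sum()
            o = psi_arr[s_next, a, z] / psi_arr[s_next, a].sum()
            if t * o == 0.0:
                continue
            phi2 = phi_arr.copy(); phi2[s, a, s_next] += 1
            psi2 = psi_arr.copy(); psi2[s_next, a, z] += 1
            key = (s_next,
                   totuple(phi2.astype(int).tolist()),
                   totuple(psi2.astype(int).tolist()))
            b_next[key] = b_next.get(key, 0.0) + p * t * o
    norm = sum(b_next.values())
    return {k: v / norm for k, v in b_next.items()}
```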

43 In order to compute these more efficiently, we show in the next section that the infinite state space can be reduced to a finite state space, while still preserving the value function to arbitrary precision for any horizon t. [sent-117, score-0.3]

44 4 Approximating the BAPOMDP: Theory and Algorithms Solving a BAPOMDP exactly for all belief states is impossible in practice due to the dimensionality of the state space (in particular, the fact that the count vectors can grow unbounded). [sent-118, score-0.598]

45 We now show how we can reduce this infinite state space to a finite state space. [sent-119, score-0.216]

46 This allows us to compute an ǫ-optimal value function over the resulting finite-dimensional belief space using standard POMDP techniques. [sent-120, score-0.262]

47 Various methods for belief tracking in the infinite model are also presented. [sent-121, score-0.319]

48 This bound uses the following definitions: given $\phi, \phi' \in \mathcal{T}$ and $\psi, \psi' \in \mathcal{O}$, define $D^{sa}_S(\phi, \phi') = \sum_{s' \in S} |T^{sas'}_{\phi} - T^{sas'}_{\phi'}|$, $D^{sa}_Z(\psi, \psi') = \sum_{z \in Z} |O^{saz}_{\psi} - O^{saz}_{\psi'}|$, $N^{sa}_{\phi} = \sum_{s' \in S} \phi^a_{ss'}$, and $N^{sa}_{\psi} = \sum_{z \in Z} \psi^a_{sz}$. [sent-124, score-0.945]

49 This theorem suggests that if we want a precision of $\epsilon$ on the value function, we just need to restrict the space of Dirichlet parameters to count vectors $\phi \in \tilde{\mathcal{T}}_{\epsilon} = \{\phi \in \mathbb{N}^{|S|^2|A|} \mid \forall a \in A, s \in S, 0 < N^{sa}_{\phi} \leq N^{\epsilon}_S\}$ and $\psi \in \tilde{\mathcal{O}}_{\epsilon} = \{\psi \in \mathbb{N}^{|S||A||Z|} \mid \forall a \in A, s \in S, 0 < N^{sa}_{\psi} \leq N^{\epsilon}_Z\}$. [sent-137, score-0.468]

50 To define the transition and observation functions over that finite state space $\tilde{S}_{\epsilon}$, we need to make sure that when the count vectors are incremented, they stay within the finite space. [sent-139, score-0.439]

51 The projection $P_{\epsilon}$ maps every state in $S'$ to its closest state in $\tilde{S}_{\epsilon}$. [sent-141, score-0.216]

52 This defines a distance between states that only differ by their $\phi$ and $\psi$ vectors, and uses an upper bound on that value when the states differ. [sent-147, score-0.16]

53 Thus $P_{\epsilon}$ always maps states $(s, \phi, \psi) \in S'$ to some state $(s, \phi', \psi') \in \tilde{S}_{\epsilon}$. [sent-148, score-0.166]

54 Using $P_{\epsilon}$, the transition and observation functions are defined as follows: $\tilde{T}_{\epsilon}((s, \phi, \psi), a, (s', \phi', \psi')) = T^{sas'}_{\phi} O^{s'az}_{\psi}$ if $(s', \phi', \psi') = P_{\epsilon}(s', \phi + \delta^a_{ss'}, \psi + \delta^a_{s'z})$, and $0$ otherwise. [sent-150, score-0.579]

55 These definitions are the same as the ones in the infinite BAPOMDP, except that now we add an extra projection to make sure that the incremented count vectors stay in $\tilde{S}_{\epsilon}$. [sent-152, score-0.2]
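
One simple way to realize such a projection, sketched under our own assumptions and not necessarily the exact $P_{\epsilon}$ used in the paper, is to rescale any count vector whose total exceeds the allowed bound so that the implied expected probabilities are approximately preserved:

```python
import numpy as np

def project_counts(counts, max_total):
    """Keep a Dirichlet count vector inside the finite space {0 < N <= max_total}.

    If the total count exceeds max_total, rescale the counts so their sum equals
    max_total; this approximately preserves the expected multinomial parameters.
    (Rounding back to integers, as the discrete space requires, is omitted here.)
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= max_total:
        return counts
    return counts * (max_total / total)
```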

56 The number of states with non-zero probability grows exponentially in the planning horizon, thus exact belief monitoring can quickly become intractable. [sent-171, score-0.598]

57 We now discuss different particle-based approximations that allow polynomial-time belief tracking. [sent-172, score-0.299]

58 Monte Carlo sampling: Given a prior belief $b$, followed by action $a$ and observation $z$, the new belief $b'$ is obtained by first sampling $K$ states from the distribution $b$; then, for each sampled $s$, a new state $s'$ is sampled from $T(s, a, \cdot)$. [sent-174, score-0.854]

59 Finally, the probability O(s′ , a, z) is added to b′ (s′ ) and the belief b′ is re-normalized. [sent-175, score-0.262]

60 Most Probable: Alternatively, we can do the exact belief update at a given time step, but then only keep the $K$ most probable states in the new belief $b'$ and renormalize $b'$. [sent-179, score-0.742]

61 Weighted Distance: As in the Most Probable method, we do an exact belief update; however, in this case we fit the posterior distribution using a greedy K-means procedure. [sent-183, score-0.351]

62 The distance used is the one defined in Definition 4.1, weighted by the probability of the state to remove. [sent-184, score-0.156]
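
Two of these approximations admit very short sketches on top of the dictionary belief representation used earlier; the Weighted Distance variant would additionally need the state distance of Definition 4.1 for its K-means step, so it is omitted here. Names and signatures are ours.

```python
import random

def most_probable_approx(b, K):
    """Keep only the K most probable hyper-states of an exact belief, then renormalize."""
    top = sorted(b.items(), key=lambda kv: kv[1], reverse=True)[:K]
    norm = sum(p for _, p in top)
    return {state: p / norm for state, p in top}

def monte_carlo_particles(b, K, rng=random):
    """Draw K hyper-state particles from belief b; each particle is then propagated
    through the dynamics and reweighted by the observation probability."""
    states, probs = zip(*b.items())
    return rng.choices(list(states), weights=probs, k=K)
```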

63 3 Online planning While the finite model presented in Section 4. [sent-187, score-0.209]

64 Our approach simply performs dynamic programming over all the beliefs reachable within some fixed finite planning horizon from the current belief. [sent-190, score-0.229]

65 The action with highest return over that finite horizon is executed and then planning is conducted again on the next belief. [sent-191, score-0.399]
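
The online planner can be sketched as a plain recursive lookahead over reachable beliefs; the helper functions (update, reward, obs_prob) stand in for whichever belief approximation and reward/observation models are used, and their names are our own.

```python
def lookahead(b, depth, actions, observations, update, reward, obs_prob, gamma=0.95):
    """Return (value, best_action) of a depth-step lookahead from belief b.

    update(b, a, z) -> successor belief; reward(b, a) -> expected immediate reward;
    obs_prob(b, a, z) -> Pr(z | b, a). The cost is O((|A||Z|)^depth * C_b),
    matching the complexity quoted in the text.
    """
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        value = reward(b, a)
        for z in observations:
            p = obs_prob(b, a, z)
            if p > 0.0:
                future, _ = lookahead(update(b, a, z), depth - 1, actions,
                                      observations, update, reward, obs_prob, gamma)
                value += gamma * p * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```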

66 To further limit the complexity of the online planning algorithm, we used the approximate belief monitoring methods detailed above. [sent-192, score-0.547]

67 Its overall complexity is in $O((|A||Z|)^D C_b)$, where $D$ is the planning horizon and $C_b$ is the complexity of updating the belief. [sent-193, score-0.229]

68 5 Empirical Results We begin by evaluating the different belief approximations introduced above. [sent-194, score-0.299]

69 To do so, we use a simple online d-step lookahead search, and compare the overall expected return and model accuracy in two different problems: the well-known Tiger [5] and a new domain called Follow. [sent-195, score-0.255]

70 Given $T^{sas'}$ and $O^{s'az}$, the exact probabilities of the (unknown) POMDP, the model accuracy is measured in terms of the weighted sum of L1-distances, denoted $WL1$, between the exact model and the probable models in a belief state $b$: $WL1(b) = \sum_{(s, \phi, \psi) \in S'_b} b(s, \phi, \psi) L1(\phi, \psi)$. [sent-196, score-1.176]

71 $L1(\phi, \psi) = \sum_{a \in A} \sum_{s' \in S} \left( \sum_{s \in S} |T^{sas'}_{\phi} - T^{sas'}| + \sum_{z \in Z} |O^{s'az}_{\psi} - O^{s'az}| \right)$ (6). 5.1 Tiger: In the Tiger problem [5], we consider the case where the transition and reward parameters are known, but the observation probabilities are not. [sent-197, score-1.129]
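
A sketch of this accuracy measure, assuming the dictionary belief representation from before and dense arrays T_true[s, a, s'] and O_true[s', a, z] for the exact (hidden) model; all names are illustrative.

```python
import numpy as np

def wl1(b, T_true, O_true):
    """Weighted L1 distance between the true model and the models in belief b (Eq. 6)."""
    total = 0.0
    for (_, phi, psi), p in b.items():
        phi_arr = np.asarray(phi, dtype=float)
        psi_arr = np.asarray(psi, dtype=float)
        T_hat = phi_arr / phi_arr.sum(axis=2, keepdims=True)   # expected T_phi
        O_hat = psi_arr / psi_arr.sum(axis=2, keepdims=True)   # expected O_psi
        total += p * (np.abs(T_hat - T_true).sum() + np.abs(O_hat - O_true).sum())
    return total
```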

72 We define the observation count vector ψ = (ψLl , ψLr , ψRl , ψRr ). [sent-199, score-0.206]

73 Episodes terminate when the agent opens a door, at which point the POMDP state (i.e., the tiger's position) is reset. [sent-203, score-0.181]

74 The distribution over count vectors, however, is carried over to the next episode. [sent-205, score-0.125]

75 Figures 1 and 2 show how the average return and model accuracy evolve over the 100 episodes (results are averaged over 1000 simulations), using an online 3-step lookahead search with varying belief approximations and parameters. [sent-206, score-0.653]

76 Returns obtained by planning directly with the prior and exact model (without learning) are shown for comparison. [sent-207, score-0.254]

77 Model accuracy is measured on the initial belief of each episode. [sent-208, score-0.291]

78 Figure 3 compares the average planning time per action taken by each approach. [sent-209, score-0.262]

79 Figure 1: Return with different belief approximations; Figure 2: Model accuracy with different belief approximations; Figure 3: Planning time per action (ms) with different belief approximations (MP, MC, WD). [sent-215, score-0.291]

80 The goal of the Follow task is for a robot to continuously follow one of two individuals in a 2D open area. [sent-223, score-0.185]

81 The two subjects have different motion behavior, requiring the robot to use a different policy for each. [sent-224, score-0.254]

82 At every episode, the target person is selected randomly with P r = 0. [sent-225, score-0.154]

83 The state space has two features: a binary variable indicating which person is being followed, and a position variable indicating the person’s position relative to the robot (5 × 5 square grid with the robot always at the center). [sent-228, score-0.566]

84 Initially, the robot and person are at the same position. [sent-229, score-0.306]

85 Both the robot and the person can perform five motion actions {NoAction, North, East, South, West}. [sent-230, score-0.415]

86 The person follows a fixed stochastic policy (stationary over space and time), but the parameters of this behavior are unknown. [sent-231, score-0.211]

87 The robot perceives observations indicating the person's position relative to the robot: {Same, North, East, South, West, Unseen}. [sent-232, score-0.213]

88 The robot perceives the correct observation P r = 0. [sent-233, score-0.294]

89 The reward R = +1 if the robot and person are at the same position (central grid cell), R = 0 if the person is one cell away from the robot, and R = −1 if the person is two cells away. [sent-236, score-0.7]

90 The task terminates if the person reaches a distance of 3 cells away from the robot, also causing a reward of -20. [sent-237, score-0.312]

91 When formulating the BAPOMDP, the robot’s motion model (deterministic), the observation probabilities and the rewards are assumed to be known. [sent-240, score-0.202]

92 We maintain a separate count vector for each person, representing the number of times they move in each direction, i.e., one count per motion action. [sent-241, score-0.125]

93 We assume a prior $\phi^1_0 = (2, 3, 1, 2, 2)$ for person 1 and $\phi^2_0 = (2, 1, 3, 2, 2)$ for person 2 (components ordered as NoAction, North, East, South, West), while in reality person 1 moves with probabilities Pr = (0. [sent-244, score-0.508]

94 The count vectors’ distributions are reset after every simulation, and the target person is reset after every episode. [sent-256, score-0.359]

95 We use a 2-step lookahead search for planning in the BAPOMDP. [sent-257, score-0.236]

96 Figures 4 and 5 show how the average return and model accuracy evolve over the 100 episodes (averaged over the 200 simulations) with different belief approximations. [sent-258, score-0.507]

97 Figure 6 compares the planning time taken by each approach. [sent-259, score-0.179]

98 We observe from these figures that the results for the Weighted Distance approximations are much better both in terms of return and model accuracy, even with fewer particles (16). [sent-260, score-0.216]

99 Figure 4: Return with different belief approximations; Figure 5: Model accuracy with different belief approximations; Figure 6: Planning time per action (ms) with different belief approximations (MP, MC, WD). [sent-265, score-0.321]

100 We provided practical approaches for belief tracking and online planning in this model, and validated these using two experimental domains. [sent-271, score-0.52]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bapomdp', 0.465), ('pomdp', 0.348), ('sas', 0.283), ('ss', 0.269), ('belief', 0.262), ('planning', 0.179), ('person', 0.154), ('robot', 0.152), ('sa', 0.149), ('az', 0.135), ('count', 0.125), ('probable', 0.115), ('state', 0.108), ('pomdps', 0.099), ('return', 0.087), ('reward', 0.086), ('action', 0.083), ('observation', 0.081), ('tiger', 0.08), ('transition', 0.08), ('dirichlet', 0.079), ('agent', 0.073), ('episode', 0.071), ('carlo', 0.071), ('monte', 0.071), ('nz', 0.07), ('sb', 0.067), ('nite', 0.066), ('os', 0.064), ('actions', 0.064), ('particles', 0.062), ('osaz', 0.061), ('perceives', 0.061), ('states', 0.058), ('observable', 0.057), ('policy', 0.057), ('lookahead', 0.057), ('episodes', 0.057), ('monitoring', 0.054), ('mcgill', 0.053), ('online', 0.052), ('horizon', 0.05), ('weighted', 0.048), ('qc', 0.048), ('ns', 0.048), ('probabilities', 0.046), ('canada', 0.046), ('exact', 0.045), ('pineau', 0.045), ('motion', 0.045), ('vectors', 0.045), ('uncertainty', 0.044), ('distance', 0.044), ('evolve', 0.042), ('dz', 0.042), ('nseen', 0.04), ('olr', 0.04), ('orth', 0.04), ('saz', 0.04), ('south', 0.04), ('sz', 0.04), ('reset', 0.04), ('counts', 0.04), ('bt', 0.039), ('reinforcement', 0.039), ('st', 0.038), ('approximations', 0.037), ('pk', 0.037), ('maintaining', 0.036), ('zeroes', 0.035), ('horizons', 0.035), ('brahim', 0.035), ('joelle', 0.035), ('pi', 0.034), ('nitions', 0.034), ('preserving', 0.034), ('follow', 0.033), ('cb', 0.032), ('east', 0.032), ('phane', 0.032), ('ross', 0.032), ('sat', 0.032), ('wd', 0.032), ('partially', 0.032), ('theorem', 0.032), ('discount', 0.031), ('model', 0.03), ('mdp', 0.03), ('montr', 0.03), ('qu', 0.03), ('incremented', 0.03), ('pr', 0.03), ('accuracy', 0.029), ('mc', 0.028), ('causing', 0.028), ('jair', 0.028), ('ds', 0.028), ('tracking', 0.027), ('est', 0.027), ('ln', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 30 nips-2007-Bayes-Adaptive POMDPs

Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1

2 0.38002518 215 nips-2007-What makes some POMDP problems easy to approximate?

Author: Wee S. Lee, Nan Rong, Daniel J. Hsu

Abstract: Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms’ success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a suitable set of points that “cover” an optimal reachable space well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity of POMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices. 1

3 0.27887988 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

4 0.13644198 86 nips-2007-Exponential Family Predictive Representations of State

Author: David Wingate, Satinder S. Baveja

Abstract: In order to represent state in controlled, partially observable, stochastic dynamical systems, some sort of sufficient statistic for history is necessary. Predictive representations of state (PSRs) capture state as statistics of the future. We introduce a new model of such systems called the “Exponential family PSR,” which defines as state the time-varying parameters of an exponential family distribution which models n sequential observations in the future. This choice of state representation explicitly connects PSRs to state-of-the-art probabilistic modeling, which allows us to take advantage of current efforts in high-dimensional density estimation, and in particular, graphical models and maximum entropy models. We present a parameter learning algorithm based on maximum likelihood, and we show how a variety of current approximate inference methods apply. We evaluate the quality of our model with reinforcement learning by directly evaluating the control performance of the model. 1

5 0.13322547 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

6 0.13242815 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

7 0.11963373 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

8 0.11449058 162 nips-2007-Random Sampling of States in Dynamic Programming

9 0.10120187 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

10 0.097551577 185 nips-2007-Stable Dual Dynamic Programming

11 0.086943261 102 nips-2007-Incremental Natural Actor-Critic Algorithms

12 0.08526174 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

13 0.078878306 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

14 0.076484703 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

15 0.067036405 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

16 0.064278647 191 nips-2007-Temporal Difference Updating without a Learning Rate

17 0.063987076 52 nips-2007-Competition Adds Complexity

18 0.059858177 100 nips-2007-Hippocampal Contributions to Control: The Third Way

19 0.058922067 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding

20 0.054843716 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.198), (1, -0.27), (2, 0.051), (3, -0.124), (4, -0.161), (5, 0.017), (6, 0.001), (7, 0.01), (8, 0.137), (9, -0.038), (10, -0.034), (11, 0.068), (12, 0.007), (13, -0.133), (14, -0.051), (15, -0.194), (16, 0.186), (17, -0.16), (18, 0.189), (19, -0.086), (20, -0.001), (21, -0.203), (22, -0.008), (23, 0.128), (24, -0.223), (25, 0.157), (26, -0.029), (27, 0.126), (28, 0.025), (29, -0.067), (30, 0.029), (31, 0.036), (32, 0.114), (33, -0.053), (34, 0.111), (35, 0.075), (36, -0.002), (37, -0.052), (38, 0.047), (39, 0.006), (40, 0.042), (41, -0.039), (42, 0.022), (43, 0.036), (44, -0.012), (45, -0.025), (46, -0.029), (47, 0.062), (48, -0.014), (49, 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96692717 30 nips-2007-Bayes-Adaptive POMDPs

Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1

2 0.93853551 215 nips-2007-What makes some POMDP problems easy to approximate?

Author: Wee S. Lee, Nan Rong, Daniel J. Hsu

Abstract: Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms’ success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a suitable set of points that “cover” an optimal reachable space well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity of POMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices. 1

3 0.84209681 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

4 0.51992947 52 nips-2007-Competition Adds Complexity

Author: Judy Goldsmith, Martin Mundhenk

Abstract: It is known that determinining whether a DEC-POMDP, namely, a cooperative partially observable stochastic game (POSG), has a cooperative strategy with positive expected reward is complete for NEXP. It was not known until now how cooperation affected that complexity. We show that, for competitive POSGs, the complexity of determining whether one team has a positive-expected-reward strategy is complete for NEXPNP .

5 0.45221809 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

6 0.44369709 86 nips-2007-Exponential Family Predictive Representations of State

7 0.43331215 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process

8 0.41919369 162 nips-2007-Random Sampling of States in Dynamic Programming

9 0.3681311 171 nips-2007-Scan Strategies for Meteorological Radars

10 0.32135841 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

11 0.29516852 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

12 0.29315808 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

13 0.286614 185 nips-2007-Stable Dual Dynamic Programming

14 0.28496501 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

15 0.28237581 174 nips-2007-Selecting Observations against Adversarial Objectives

16 0.27978832 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

17 0.27437785 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

18 0.26190588 3 nips-2007-A Bayesian Model of Conditioned Perception

19 0.25013059 191 nips-2007-Temporal Difference Updating without a Learning Rate

20 0.23593296 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.034), (9, 0.377), (13, 0.063), (16, 0.019), (18, 0.024), (21, 0.032), (31, 0.025), (34, 0.034), (35, 0.023), (47, 0.073), (49, 0.016), (83, 0.105), (85, 0.021), (87, 0.021), (90, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7275157 30 nips-2007-Bayes-Adaptive POMDPs

Author: Stephane Ross, Brahim Chaib-draa, Joelle Pineau

Abstract: Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows us to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can tradeoff between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent’s return improve over time, as the agent learns better model estimates. 1

2 0.65094024 96 nips-2007-Heterogeneous Component Analysis

Author: Shigeyuki Oba, Motoaki Kawanabe, Klaus-Robert Müller, Shin Ishii

Abstract: In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept. 1

3 0.47991991 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray

Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n2 ) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 1014 , many orders of magnitude beyond the previous state of the art. 1

4 0.46547863 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs

Author: Stephane Ross, Joelle Pineau, Brahim Chaib-draa

Abstract: Planning in partially observable environments remains a challenging problem, despite significant recent advances in offline approximation techniques. A few online methods have also been proposed recently, and proven to be remarkably scalable, but without the theoretical guarantees of their offline counterparts. Thus it seems natural to try to unify offline and online techniques, preserving the theoretical properties of the former, and exploiting the scalability of the latter. In this paper, we provide theoretical guarantees on an anytime algorithm for POMDPs which aims to reduce the error made by approximate offline value iteration algorithms through the use of an efficient online searching procedure. The algorithm uses search heuristics based on an error analysis of lookahead search, to guide the online search towards reachable beliefs with the most potential to reduce error. We provide a general theorem showing that these search heuristics are admissible, and lead to complete and ǫ-optimal algorithms. This is, to the best of our knowledge, the strongest theoretical result available for online POMDP solution methods. We also provide empirical evidence showing that our approach is also practical, and can find (provably) near-optimal solutions in reasonable time. 1

5 0.45926809 215 nips-2007-What makes some POMDP problems easy to approximate?

Author: Wee S. Lee, Nan Rong, Daniel J. Hsu

Abstract: Point-based algorithms have been surprisingly successful in computing approximately optimal solutions for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms’ success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, which is the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, which is the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a suitable set of points that “cover” an optimal reachable space well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that reduce the complexity of POMDP planning in practice, e.g., fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices. 1

6 0.44146118 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

7 0.41082621 158 nips-2007-Probabilistic Matrix Factorization

8 0.39190173 63 nips-2007-Convex Relaxations of Latent Variable Training

9 0.38664806 84 nips-2007-Expectation Maximization and Posterior Constraints

10 0.3839727 86 nips-2007-Exponential Family Predictive Representations of State

11 0.38247716 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

12 0.38102955 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

13 0.38036594 115 nips-2007-Learning the 2-D Topology of Images

14 0.37912792 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

15 0.3790893 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

16 0.37896255 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

17 0.37839049 102 nips-2007-Incremental Natural Actor-Critic Algorithms

18 0.37708077 134 nips-2007-Multi-Task Learning via Conic Programming

19 0.37601015 185 nips-2007-Stable Dual Dynamic Programming

20 0.37600079 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model