jmlr jmlr2006 jmlr2006-74 knowledge-graph by maker-knowledge-mining

74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs


Source: pdf

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic, ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. [sent-15, score-0.843]

2 Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. [sent-17, score-0.728]

3 Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. [sent-19, score-0.351]

4 Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration. [sent-20, score-0.778]

5 When the state of the environment is only partially observable through noisy measurements, and actions have stochastic effects, optimizing the course of action is a non-trivial task. [sent-27, score-0.508]

6 In contrast, model-based approaches assume knowledge about a transition model encoding the effects of actions on environment states, an observation model defining the correlations between environment states and sensor observations, and a reward model encoding the utility of environment states. [sent-49, score-0.644]

7 For instance, in a robot navigation task, the state space may correspond to robot poses (x, y, θ), the observations may correspond to distances to obstacles measured by sonars or laser range finders, and actions may correspond to velocity and angular controls. [sent-57, score-0.547]

8 Duff (2002) considered a special case of continuous POMDPs in the context of model-based Bayesian reinforcement learning, in which beliefs are maintained over the space of unknown parameters of the transition model of a discrete-state MDP. [sent-63, score-0.398]

9 Moreover, they demonstrated that the value function for finite horizon is PWLC over the robot belief space for any functional form of the beliefs. [sent-69, score-0.419]

10 We demonstrate that the value function for arbitrary continuous POMDPs is in general convex, and it is PWLC in the particular case when the states are continuous but the actions and observations are discrete. [sent-75, score-0.492]

11 , beliefs, observation, action and reward models) can have arbitrary forms that may not be parameterizable. [sent-79, score-0.372]

12 In order to design feasible algorithms for continuous POMDPs, it is crucial to work with classes of functions that have simple parameterizations and that yield closed belief updates and Bellman backups. [sent-80, score-0.365]

13 Using these representations, we extend the PERSEUS algorithm (Spaan and Vlassis, 2005) to solve POMDPs with continuous states but discrete actions and observations. [sent-82, score-0.383]

14 We also show that POMDPs with continuous states, actions, and observations can be reduced to POMDPs with continuous states, discrete actions, and discrete observations using sampling strategies. [sent-83, score-0.447]

15 The objective of MDP-based planning is to determine an optimal policy, π ∗ , that is, a policy that assigns to each state the action from which the most future reward can be expected. [sent-100, score-0.586]

16 When the transition and the reward model are known in advance, we can use planning algorithms mainly developed within the operations research field. [sent-102, score-0.342]
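
As an illustration of such a planning algorithm, here is a minimal sketch of value iteration for a fully observable, discrete MDP; the arrays T and R and the function name are illustrative, not the paper's notation:

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Plain MDP value iteration with known models.

    T: (A, S, S) transition probabilities p(s'|s,a);
    R: (A, S) expected immediate reward for taking a in s.
    Returns the value function V and a greedy policy."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_{s'} p(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->as', T, V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new
```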

17 The initial belief is assumed to be known and, if b is the belief of the agent about the state, the updated belief after executing action a and observing o is b_{a,o}(s') = p(o|s') p(s'|b, a) / p(o|b, a) (2), with p(s'|b, a) the propagation of the belief b through the transition model. [sent-120, score-1.188]
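
A minimal sketch of the belief update of Eq. 2 on a discretized state grid, assuming the transition and observation models are given as arrays T and O (hypothetical names and shapes):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update of Eq. 2 on a discretized state grid.

    b: (S,) current belief; T: (A, S, S) with T[a][s, s'] = p(s'|s, a);
    O: (A, S, n_obs) with O[a][s', o] = p(o|s', a)."""
    # prediction: p(s'|b, a) = sum_s p(s'|s, a) b(s)
    predicted = T[a].T @ b
    # correction with the observation likelihood p(o|s')
    unnormalized = O[a][:, o] * predicted
    # p(o|b, a) is the normalizing constant
    return unnormalized / unnormalized.sum()
```

With Gaussian-mixture models the same two steps, prediction and correction, are carried out in closed form instead of on a grid, as described later in the paper.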

18 For discrete observation and action spaces, the integral over the observation space is replaced by a sum and the sup over the action space by a max operator. [sent-128, score-0.55]

19 Each α-vector is generated for a particular action, and the optimal action with planning horizon n for a given belief is the action associated with the α-vector that defines V_n for that belief. [sent-134, score-0.684]
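
A small sketch of how this representation is used: the value of a belief is the maximum of its inner products with the α-vectors, and the optimal action is the one attached to the maximizing vector (discretized beliefs, illustrative names):

```python
import numpy as np

def value_and_action(b, alphas, alpha_actions):
    """V_n(b) = max_i <alpha_i, b>; the optimal action is the one attached
    to the maximizing alpha-vector.

    b: (S,) belief; alphas: (K, S) alpha-vectors; alpha_actions: length-K list."""
    scores = alphas @ b                 # inner products <alpha_i, b>
    best = int(np.argmax(scores))
    return scores[best], alpha_actions[best]
```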

20 In Section 6, we discuss how to tackle POMDPs with continuous actions and observations via sampling. [sent-150, score-0.33]

21 First, we prove that the value function for a continuous POMDP is convex and, next, that it is PWLC for the case of continuous states, but discrete observations and actions. [sent-151, score-0.33]

22 For planning horizon 0, we only have to take into account the immediate reward and, thus, we have that V_0(b) = sup_{a∈A} ⟨r_a, b⟩ and, therefore, if we define the continuous set {α_0^i}_i = {r_a}_{a∈A} (5), we have, as desired, V_0(b) = sup_i ⟨α_0^i, b⟩. [sent-163, score-0.625]

23 The set {r_a + γ ∫_o α_{a,o,b} do}_{a∈A} (8) is a continuous set of functions parameterized in the continuous action set. [sent-174, score-0.367]

24 Intuitively, each α_n-function corresponds to a plan and the action associated with a given α_n-function is the optimal action for planning horizon n for all beliefs that have that function as the maximizing one. [sent-175, score-0.675]

25 Lemma 2 When the state space is continuous but the observation and action sets are discrete, the finite horizon value function is piecewise-linear convex (PWLC). [sent-181, score-0.416]

26 For the general case, we have to observe that, for discrete actions and observations and assuming M = |{α_{n−1}^j}|, the sets {α_{a,o}^j} are discrete: for a given action a and observation o we can generate at most M α_{a,o}-functions. [sent-186, score-0.515]

27 Proof Let us denote as a_1 the action that maximizes HV at point b and a_2 the action that does so for HU: HV(b) = H^{a_1}V(b), HU(b) = H^{a_2}U(b). [sent-211, score-0.338]

28 In this section, we concentrate on continuous-state POMDPs (POMDPs with continuous states, but discrete actions and observations). [sent-221, score-0.342]

29 Extensions of PERSEUS to deal with continuous action and observation spaces are detailed in Section 6. [sent-226, score-0.357]

30 A full backup, that is, a backup for the whole belief space, involves the computation of all relevant α-elements for V_n. [sent-231, score-0.364]

31 Full backups are computationally expensive (they involve an exponentially growing set of α-elements), but the backup for a single belief point is relatively cheap. [sent-232, score-0.417]

32 The α-elements for this restricted set of belief points generalize over the whole belief space and, thus, they can be used to approximate the value function for any belief point. [sent-235, score-0.706]
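
The following sketch shows a randomized PERSEUS-style value-update stage over a fixed belief set B, written for finite-dimensional belief vectors so that α-elements are plain arrays; the `backup` callable is assumed given and problem specific, so this is only an illustration of the control flow, not the paper's continuous-state implementation:

```python
import numpy as np

def perseus_value_update(B, alphas, backup, rng):
    """One randomized value-update stage in the spirit of PERSEUS
    (Spaan and Vlassis, 2005).

    B: (M, S) belief points; alphas: (K, S) current alpha-elements;
    backup(b, alphas) -> one new alpha-vector for belief b (assumed given)."""
    V_old = (B @ alphas.T).max(axis=1)          # V_n(b) for every b in B
    new_alphas = []
    improved = np.zeros(len(B), dtype=bool)
    while not improved.all():
        # pick a not-yet-improved belief point at random
        i = rng.choice(np.flatnonzero(~improved))
        alpha = backup(B[i], alphas)
        if alpha @ B[i] < V_old[i]:
            # fall back to the best old alpha for this point
            alpha = alphas[np.argmax(alphas @ B[i])]
        new_alphas.append(alpha)
        # beliefs whose value is already matched need no further backup
        V_new = (B @ np.array(new_alphas).T).max(axis=1)
        improved = V_new >= V_old
    return np.array(new_alphas)
```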

33 The backup for a given belief point b is backup(b) = arg max_{{α_n^i}_i} ⟨α_n^i, b⟩, where α_n^i(s) is defined in Eqs. [sent-236, score-0.364]

34 6) since these α-elements are independent of the belief point and are the base components to define the α_{a,o,b} for any particular belief point, b. [sent-242, score-0.493]

35 First, the belief update has to be closed, that is, the representation used for the beliefs must be closed under the propagation through the transition model (Eq. [sent-273, score-0.525]

36 Finally, the reward model r_a(s) is defined by a linear combination of (a fixed number of) Gaussians, r_a(s) = ∑_i w_i φ(s|μ_i^a, Σ_i^a), where μ_i^a and Σ_i^a are the mean and covariance of each Gaussian. [sent-300, score-0.477]
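
Because the integral of a product of two Gaussians is again a Gaussian evaluation, ∫ φ(s|m_1, S_1) φ(s|m_2, S_2) ds = φ(m_1|m_2, S_1 + S_2), the expectation ⟨r_a, b⟩ of a Gaussian-mixture reward under a Gaussian-mixture belief reduces to a double sum. A hedged sketch, where the (weight, mean, covariance) list format is an assumption:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def expected_reward(belief, reward):
    """<r_a, b> for a Gaussian-mixture belief and a Gaussian-mixture reward,
    each given as a list of (weight, mean, covariance) triples."""
    total = 0.0
    for w_b, m_b, S_b in belief:
        for w_r, m_r, S_r in reward:
            # int phi(s|m_b,S_b) phi(s|m_r,S_r) ds = phi(m_b | m_r, S_b + S_r)
            total += w_b * w_r * mvn.pdf(m_b, mean=m_r, cov=S_b + S_r)
    return total
```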

37 With this, we have α_{a,o}^j(s) = ∑_{k,l} w_k^j w_l^o ∫_{s'} δ_{k,l}^{j,o,a} β_{k,l}^{j,o,a}(s) φ(s'|s, Σ) ds' = ∑_{k,l} w_k^j w_l^o δ_{k,l}^{j,o,a} β_{k,l}^{j,o,a}(s) ∫_{s'} φ(s'|s, Σ) ds' = ∑_{k,l} w_k^j w_l^o δ_{k,l}^{j,o,a} β_{k,l}^{j,o,a}(s). [sent-314, score-0.641]

38 If N_o is the number of components in the observation model and C_r is the average number of components in the reward model, the number of components in the α_n-functions scales with O((N_o)^n C_r). [sent-319, score-0.368]

39 The first one is the application of the action model on the current belief state. [sent-331, score-0.397]

40 In the second step of the belief update, the prediction obtained with the action model is corrected using the information provided by the observation model: b_{a,o}(s') ∝ ∑_i w_i^o φ(s'|s_i^o, Σ_i^o) ∑_j w_j φ(s'|s_j + ∆(a), Σ_j + Σ_a) = ∑_{i,j} w_i^o w_j φ(s'|s_i^o, Σ_i^o) φ(s'|s_j + ∆(a), Σ_j + Σ_a). [sent-335, score-0.597]
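
A sketch of this correction step as a renormalized product of two Gaussian mixtures, which also makes visible the growth in the number of components mentioned next; the list-of-(weight, mean, covariance) format and the function name are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def correct_belief(predicted, observation_mixture):
    """Correction step: renormalized product of two Gaussian mixtures.

    Both inputs are lists of (w, mean, cov) with array means and matrix
    covariances; the result has len(predicted) * len(observation_mixture)
    components."""
    out = []
    for w_j, m_j, S_j in predicted:
        for w_i, m_i, S_i in observation_mixture:
            P_j, P_i = np.linalg.inv(S_j), np.linalg.inv(S_i)
            S = np.linalg.inv(P_j + P_i)            # covariance of the product
            m = S @ (P_j @ m_j + P_i @ m_i)         # mean of the product
            w = w_j * w_i * mvn.pdf(m_j, mean=m_i, cov=S_j + S_i)
            out.append((w, m, S))
    total = sum(w for w, _, _ in out)
    return [(w / total, m, S) for w, m, S in out]
```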

41 An increase in the number of components representing a belief occurs when computing the belief update just detailed. [sent-340, score-0.575]

42 PARTICLE-BASED REPRESENTATION: An alternative to parameterizing the belief densities using Gaussian mixtures is to represent the belief using N random samples, or particles, positioned at points s_i and with weights w_i. [sent-346, score-0.533]

43 Thus, the belief is b_t(s) = ∑_{i=1}^{N} w_i d(s − s_i), where d(s − s_i) is a Dirac delta function centered at 0. [sent-347, score-0.34]

44 , 1999; Isard and Blake, 1998) samples particles s_i using the motion model p(s'|s_i, a), then assigns a new weight to each of these particles proportional to the likelihood p(o|s_i), and finally re-samples particles using these new weights in order to make all particle weights equal. [sent-354, score-0.476]
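
A compact sketch of this propagate/reweight/resample cycle; `transition_sample` and `obs_likelihood` stand in for the problem-specific motion and observation models and are assumptions, not the paper's code:

```python
import numpy as np

def particle_filter_step(particles, weights, a, o,
                         transition_sample, obs_likelihood, rng):
    """One condensation-style update: propagate, reweight, resample.

    transition_sample(s, a, rng) -> s' and obs_likelihood(o, s) -> float
    are assumed callables; particles is an (N, ...) array, weights (N,)."""
    # propagate every particle through the motion model p(s'|s, a)
    propagated = np.array([transition_sample(s, a, rng) for s in particles])
    # reweight by the observation likelihood
    w = weights * np.array([obs_likelihood(o, s) for s in propagated])
    w = w / w.sum()
    # resample so that all particle weights become equal again
    idx = rng.choice(len(propagated), size=len(propagated), p=w)
    return propagated[idx], np.full(len(propagated), 1.0 / len(propagated))
```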

45 Using the sample-based belief representation, the averaging operator ⟨·, ·⟩ becomes ⟨α, b⟩ = ∫_s ∑_k w_k φ(s|s_k, Σ_k) ∑_l w_l d(s − s_l) ds = ∑_k w_k ∫_s φ(s|s_k, Σ_k) ∑_l w_l d(s − s_l) ds = ∑_k w_k ∑_l w_l φ(s_l|s_k, Σ_k) = ∑_{k,l} w_k w_l φ(s_l|s_k, Σ_k). [sent-365, score-0.978]
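
In code, this inner product between a Gaussian-mixture α-function and a particle belief is a weighted sum of Gaussian evaluations at the particle positions (argument formats are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def alpha_dot_particle_belief(alpha_mixture, particles, particle_weights):
    """<alpha, b> = sum_k sum_l w_k w_l phi(s_l | s_k, Sigma_k).

    alpha_mixture: list of (w_k, mean_k, cov_k); particles: (N, d) array;
    particle_weights: (N,) array."""
    total = 0.0
    for w_k, s_k, S_k in alpha_mixture:
        total += w_k * np.sum(particle_weights *
                              mvn.pdf(particles, mean=s_k, cov=S_k))
    return total
```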

46 , beliefs and α-functions are both continuous functions of the state space), the α-functions also admit a particle representation. [sent-369, score-0.388]

47 Note, however, that we cannot have both beliefs and α-functions represented by particles, since the computation of ⟨α, b⟩ requires that either b or α be in functional form to generalize over the entire state space. [sent-370, score-0.351]

48 Now, we describe how to extend the presented framework to deal with continuous sets of actions and observations so that fully continuous POMDPs can also be addressed. [sent-378, score-0.429]

49 When the action space A is finite and small, the above optimization can be carried out efficiently by enumerating all possible actions and choosing the best, but in very large discrete action spaces this is computationally inefficient. [sent-383, score-0.616]

50 As proposed by Spaan and Vlassis (2005), we can replace the full maximization over actions with a sampled max operator that performs the maximization over a subset of actions randomly sampled from A. [sent-385, score-0.45]
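
A minimal sketch of such a sampled max operator; `action_sampler` and `q_value` are assumed, problem-specific callables (for example, uniform sampling over A, or reusing the best action found so far, as discussed next):

```python
import numpy as np

def sampled_max(b, action_sampler, q_value, n_samples, rng):
    """Maximize over a random subset of actions instead of the whole set A.

    action_sampler(rng) draws one candidate action; q_value(b, a) evaluates
    the point-based backup value of action a at belief b."""
    candidates = [action_sampler(rng) for _ in range(n_samples)]
    values = [q_value(b, a) for a in candidates]
    best = int(np.argmax(values))
    return candidates[best], values[best]
```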

51 One may devise several sampling schemes for choosing actions from A, for example, uniform over A or using the best action up to a given moment. [sent-386, score-0.38]

52 The use of a sampled max operator is very well suited for the point-based backups of PERSEUS, in which we only require that the values of belief points do not decrease over two consecutive backup stages. [sent-389, score-0.477]

53 However, some modifications need to be introduced in the action and reward models described in Section 5. [sent-390, score-0.372]

54 The action model described can be easily extended to continuous actions by defining a continuous instead of a discrete function ∆ : A → S and evaluating it for the actions in the newly sampled A. [sent-392, score-0.818]

55 Observe that, before computing the backup for the randomly selected belief point b (line 14b), we have to sample a new set of actions, A (line 14a) and the transition and reward models have to be modified accordingly, since they depend on the action set. [sent-395, score-0.796]

56 Note, however, that the actions are sampled specifically for each belief b and, therefore, we cannot compute something similar to the α_{a,o}^j-elements (see Eq. [sent-397, score-0.436]

57 Dealing with Continuous Observations: in value iteration, the backup of a belief point b involves computing the expectation over observations, V_n(b) = arg max_a ⟨r_a, b⟩ + γ V_{n−1}(b_a), with V_{n−1}(b_a) = ∫_o p(o|b, a) V_{n−1}(b_{a,o}) do. [sent-400, score-0.554]

58 12, all observations that lead to belief states b_{a,o} with the same maximizing α-element can be aggregated together into one meta observation O_{a,b}^j defined as follows: O_{a,b}^j = {o | α_{n−1}^j = arg max_α ⟨α, b_{a,o}⟩}. [sent-404, score-0.372]
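
A hedged Monte-Carlo sketch of this idea: draw observations from p(o|b,a), update the belief for each, and group the samples by the maximizing α-element, which plays the role of the meta-observation; all callables and names here are assumptions:

```python
import numpy as np
from collections import defaultdict

def sampled_observation_expectation(b, a, sample_obs, belief_update,
                                    alphas_prev, inner, n_obs, rng):
    """Monte-Carlo treatment of the expectation over continuous observations.

    sample_obs(b, a, rng) draws o ~ p(o|b,a); belief_update(b, a, o) returns
    b_{a,o}; inner(alpha, b) evaluates <alpha, b>.  Returns the estimated
    expectation and the sampled meta-observations O^j_{a,b}."""
    groups = defaultdict(list)
    total = 0.0
    for _ in range(n_obs):
        o = sample_obs(b, a, rng)
        b_ao = belief_update(b, a, o)
        values = [inner(alpha, b_ao) for alpha in alphas_prev]
        j = int(np.argmax(values))       # maximizing alpha-element for b_{a,o}
        groups[j].append(o)
        total += values[j]
    return total / n_obs, groups
```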

59 Therefore this new operator can be used in PERSEUS with either discrete or sampled continuous actions (see Table 1 and Table 2, respectively). [sent-427, score-0.402]

60 Note that if we work with continuous observation spaces, the α_{a,b}^j-functions are computed specifically for each belief and, therefore, no precomputation similar to that of the α_{a,o}^j-elements is possible. [sent-428, score-0.381]

61 The robot only receives positive reward when it enters the target door (see Fig. [sent-436, score-0.435]

62 Two Gaussians represent the negative reward for trying to enter a door at the wrong position (with means ±25, covariance 12. [sent-442, score-0.339]

63 In all reported experiments, the set of beliefs B used in the PERSEUS algorithm contains 500 unique belief points collected using random walks departing from a uniform belief, the latter being approximated with a Gaussian mixture with four components. [sent-446, score-0.454]

64 The walks of the robot along the corridor are organized in episodes of 30 actions (thus, for instance, the robot can repeatedly try to enter the correct door, accumulating positive reward). [sent-447, score-0.625]

65 The second plot (top-right) shows the expected discounted reward computed by running for 50 episodes the policy available at the corresponding time slice. [sent-475, score-0.355]

66 Figure 2: Results for the simulated robotic problem using continuous states but discrete actions and observations. [sent-479, score-0.406]

67 Top: Evolution of the value for all the beliefs in B and the average accumulated discounted reward for 10 episodes. [sent-480, score-0.55]

68 The changes in the policy are computed as the number of beliefs in B with a different optimal action from one time slice to the next. [sent-487, score-0.47]

69 The snapshots show the evolution of the belief of the robot, and the actions taken, from the beginning of the episode (the robot starts at location 7) until the target door is entered. [sent-491, score-0.665]

70 Figure 5: Execution time in seconds for the first iteration of PERSEUS as the number of components representing the beliefs increases (left) and as the number of components representing the α-functions increases (right). [sent-502, score-0.338]

71 The plotted data correspond to the time in seconds for the first PERSEUS value update stage, that is, for the computation of the first backup (line 14 in Table 1) and the new value for all the beliefs in B (line 18 in the algorithm). [sent-513, score-0.379]

72 Figure 6: Reduction in the obtained average accumulated discounted reward when reducing the number of components in the beliefs to just one (dotted line) and in the α-functions to three (dashed line). [sent-529, score-0.764]

73 The solid line is the average accumulated reward when using 4 components for the beliefs and 9 for the α-functions. [sent-530, score-0.515]

74 reward when representing beliefs with one component and α-functions with three components. [sent-531, score-0.402]

75 When discretizing the environment, the granularity has to be in accordance with the size of the actions taken by the robot (±2 left/right) and, thus, the number of states and, consequently, the cost of planning grow as the environment grows. [sent-543, score-0.46]

76 With more than 100 states the execution is slower than that for the continuous version when using 4 components for the beliefs and 9 for the α-functions (the dashed line in Fig. [sent-548, score-0.409]

77 7-right shows the average accumulated discounted reward obtained with the discrete version of PERSEUS working on different numbers of states, compared with the one obtained in the first experiment (see Fig. [sent-561, score-0.431]

78 The average accumulated discounted reward with a discretization with 200 states is shown in Fig. [sent-565, score-0.4]

79 Top: Evolution of the value for all the beliefs in B and the average accumulated discounted reward for 10 episodes. [sent-578, score-0.55]

80 9 shows the average accumulated discounted reward using two different sets of actions. [sent-583, score-0.329]

81 When using an action set including short robot movements (±1), the number of steps to reach the target increases and, since positive reward is only obtained at the end of the run when entering the door, the average accumulated reward decreases. [sent-584, score-0.781]

82 When using a set of actions with too large movements (±4), the robot has problems aiming at the correct door and the average accumulated reward also decreases. [sent-585, score-0.693]

83 Since the appropriate set of actions for each problem is hard to forecast, it would be nice to have a planning system able to determine a proper set of actions by itself. [sent-586, score-0.443]

84 For this purpose in the next experiment we let the robot execute actions in the continuous range [−6, 6], where an action can be regarded here as a measure of velocity of the robot. [sent-587, score-0.58]

85 In each backup at planning stage n, we consider the optimal action according to V_{n−1} and three more actions. [sent-591, score-0.384]

86 Figure 9: Average accumulated discounted reward using different sets of actions (±1, ±2, ±4). [sent-593, score-0.329]

87 2, meaning that the algorithm is able to determine better motion actions than the ones we manually fixed in the initial version of the problem (±2), and that it is able to select enter-door actions when necessary. [sent-599, score-0.5]

88 Top: Evolution of the value for all the beliefs in B and the average accumulated discounted reward for 10 episodes. [sent-621, score-0.55]

89 Top: Evolution of the value for all the beliefs in B and the average accumulated discounted reward for 100 episodes. [sent-637, score-0.55]

90 The MC-POMDP method approximates the Bellman backup operator by sampling from the belief transition model, whereas in our case, we compute the Bellman backup operator analytically given the particular value-function representation. [sent-642, score-0.657]

91 In the case of continuous action spaces, only a few methods exist that can handle them directly (Thrun, 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001). [sent-658, score-0.606]

92 Certain policy search methods tackle continuous actions, for instance Pegasus (Ng and Jordan, 2000), which estimates the value of a policy by simulating trajectories from the POMDP using a fixed random seed, and adapts its policy in order to maximize this value. [sent-659, score-0.427]

93 Pegasus can handle continuous action spaces at the cost of a sample complexity that is polynomial in the size of the state space (Ng and Jordan, 2000, Theorem 3). [sent-660, score-0.336]

94 Finally, the analytic solution for the quadratic reward case mentioned above can also handle continuous observations with a linear noise model (Bertsekas, 2001). [sent-668, score-0.351]

95 For POMDPs with continuous states, we demonstrated the piecewise linearity and convexity of value functions defined over infinite-dimensional belief states induced by continuous states. [sent-671, score-0.489]

96 We also demonstrated that continuous Bellman backups are isotonic and contracting, allowing value iteration to be adapted to continuous POMDPs. [sent-672, score-0.348]

97 These are expressive representations that are closed under Bellman backups and belief updates. [sent-674, score-0.347]

98 Finally, we also extended PERSEUS to continuous actions and observations by particular sampling strategies that reduce the problem to a continuous-state POMDP that can be tackled by the PERSEUS algorithm. [sent-675, score-0.491]

99 Another interesting research direction would be to investigate which families of functions (beyond mixtures of Gaussians) are closed under Bellman backups and belief updates for different types of transition, observation and reward models. [sent-682, score-0.617]

100 Solving POMDPs with continuous or large discrete observation spaces. [sent-835, score-0.554]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pomdp', 0.4), ('pomdps', 0.34), ('erseus', 0.293), ('belief', 0.228), ('vn', 0.219), ('reward', 0.203), ('beliefs', 0.199), ('actions', 0.182), ('action', 0.169), ('backup', 0.136), ('robot', 0.13), ('orta', 0.127), ('oupart', 0.127), ('paan', 0.127), ('teration', 0.127), ('ra', 0.119), ('particles', 0.119), ('lassis', 0.107), ('ontinuous', 0.107), ('policy', 0.102), ('door', 0.102), ('continuous', 0.099), ('poupart', 0.093), ('ds', 0.088), ('hv', 0.085), ('wk', 0.082), ('planning', 0.079), ('vlassis', 0.079), ('accumulated', 0.076), ('wo', 0.073), ('bellman', 0.073), ('spaan', 0.068), ('sk', 0.067), ('porta', 0.067), ('gaussians', 0.066), ('observable', 0.063), ('discrete', 0.061), ('pwlc', 0.06), ('transition', 0.06), ('particle', 0.057), ('observation', 0.054), ('backups', 0.053), ('pineau', 0.053), ('psfrag', 0.052), ('hu', 0.05), ('discounted', 0.05), ('observations', 0.049), ('replacements', 0.049), ('agent', 0.047), ('corridor', 0.047), ('isotonic', 0.047), ('recursion', 0.044), ('sup', 0.043), ('states', 0.041), ('mixtures', 0.041), ('robotics', 0.04), ('elementn', 0.04), ('valuen', 0.04), ('reinforcement', 0.04), ('oi', 0.04), ('horizon', 0.039), ('closed', 0.038), ('si', 0.038), ('wl', 0.037), ('components', 0.037), ('roy', 0.036), ('wi', 0.036), ('spaces', 0.035), ('enter', 0.034), ('operator', 0.034), ('duff', 0.034), ('execution', 0.033), ('oij', 0.033), ('state', 0.033), ('partially', 0.033), ('sl', 0.032), ('mdp', 0.032), ('ba', 0.03), ('discretization', 0.03), ('sampling', 0.029), ('iteration', 0.028), ('representations', 0.028), ('hoey', 0.028), ('environment', 0.028), ('mixture', 0.027), ('gaussian', 0.027), ('cassandra', 0.027), ('sampled', 0.026), ('markov', 0.025), ('boutilier', 0.024), ('evolution', 0.023), ('condensation', 0.023), ('navigation', 0.023), ('robotic', 0.023), ('value', 0.022), ('uncertainty', 0.02), ('sensor', 0.02), ('plan', 0.02), ('wj', 0.02), ('localization', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decision Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic, ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the PERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend PERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

2 0.19699313 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V . Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρN ) (with ρ < 1) up to a threshold that depends on the approximation error V − A V , where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V ), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

3 0.19498451 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation

4 0.1247794 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.

5 0.11433259 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

6 0.11208703 75 jmlr-2006-Policy Gradient in Continuous Time

7 0.094208896 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

8 0.063554034 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

9 0.061354592 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

10 0.048790943 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

11 0.045556199 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

12 0.036470063 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

13 0.032754377 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

14 0.031115089 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

15 0.0280242 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

16 0.027457388 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

17 0.027217994 25 jmlr-2006-Distance Patterns in Structural Similarity

18 0.026837459 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis

19 0.025718389 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

20 0.025260193 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.184), (1, 0.233), (2, -0.366), (3, 0.052), (4, 0.33), (5, -0.003), (6, -0.115), (7, 0.068), (8, -0.06), (9, -0.026), (10, 0.034), (11, 0.068), (12, -0.023), (13, 0.015), (14, 0.04), (15, -0.028), (16, -0.005), (17, -0.027), (18, -0.078), (19, -0.017), (20, 0.013), (21, -0.115), (22, -0.081), (23, 0.067), (24, 0.013), (25, -0.061), (26, -0.009), (27, 0.012), (28, -0.036), (29, -0.087), (30, -0.019), (31, -0.072), (32, 0.007), (33, 0.025), (34, 0.049), (35, -0.075), (36, 0.051), (37, 0.113), (38, 0.151), (39, 0.06), (40, 0.064), (41, -0.051), (42, 0.03), (43, 0.044), (44, -0.026), (45, -0.022), (46, -0.021), (47, -0.047), (48, 0.045), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96642345 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

2 0.66152364 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V . Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρN ) (with ρ < 1) up to a threshold that depends on the approximation error V − A V , where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V ), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂αV of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

3 0.60426521 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation

4 0.55887955 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.

5 0.49769458 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

6 0.44255078 75 jmlr-2006-Policy Gradient in Continuous Time

7 0.41263691 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

8 0.28051206 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

9 0.24832815 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

10 0.20862526 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

11 0.20795831 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

12 0.20787013 49 jmlr-2006-Learning Parts-Based Representations of Data

13 0.1878107 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

14 0.14757967 43 jmlr-2006-Large Scale Multiple Kernel Learning     (Special Topic on Machine Learning and Optimization)

15 0.13928521 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

16 0.13130264 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

17 0.12720662 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

18 0.12088609 42 jmlr-2006-Kernels on Prolog Proof Trees: Statistical Learning in the ILP Setting     (Special Topic on Inductive Programming)

19 0.11635835 93 jmlr-2006-Universal Kernels

20 0.11325901 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.019), (36, 0.045), (45, 0.028), (49, 0.468), (50, 0.037), (63, 0.041), (67, 0.024), (68, 0.017), (76, 0.013), (78, 0.026), (81, 0.041), (84, 0.019), (90, 0.036), (91, 0.027), (96, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75008744 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

2 0.23728532 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation

3 0.23544435 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

4 0.23082738 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

5 0.22274217 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

6 0.21949576 70 jmlr-2006-Online Passive-Aggressive Algorithms

7 0.21780694 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

8 0.21406838 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

9 0.21333903 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

10 0.21309385 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

11 0.21299848 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

12 0.21236876 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

13 0.21218815 44 jmlr-2006-Large Scale Transductive SVMs

14 0.2117427 66 jmlr-2006-On Model Selection Consistency of Lasso

15 0.20846084 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

16 0.20817249 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

17 0.20722152 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

18 0.20715332 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

19 0.2069148 53 jmlr-2006-Learning a Hidden Hypergraph

20 0.20569062 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis