jmlr jmlr2006 jmlr2006-20 knowledge-graph by maker-knowledge-mining

20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation


Source: pdf

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. [sent-7, score-1.001]

2 First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. [sent-8, score-0.9]

3 group of agents in which the success of the team is measured by the specific combination of actions of the agents (Parker, 2002). [sent-21, score-0.936]

4 However, the fact that the behavior of one agent now influences the outcome of the individually selected actions of the other agents results in a dynamic environment and possibly compromises convergence. [sent-40, score-1.041]

5 In a CG each node represents an agent and connected agents indicate a local coordination dependency. [sent-52, score-1.052]

6 The algorithms are distributed in the sense that each agent only needs to communicate with the neighboring agents on which it depends. [sent-59, score-0.959]

7 This problem is different than finding the joint action that maximizes predefined payoff relations, since in this case the payoff relations themselves have to be learned. [sent-74, score-1.024]

8 In the agent-based decomposition the local function of an agent is based on its own action and those of its neighboring agents. [sent-78, score-0.897]

9 Each agent i selects an individual action ai from its action set A i and the resulting joint action a = (a1 , . [sent-98, score-1.595]

10 Fortunately, in many problems the action of one agent does not depend on the actions of all other agents, but only on a small subset. [sent-108, score-0.957]

11 The global payoff decomposes as u(a) = ∑_{i=1}^{n} f_i(a_i) (1). Each local payoff function f_i depends on a subset of all actions, a_i ⊆ a, where a_i ∈ A_i × (×_{j∈Γ(i)} A_j), corresponding to the action of agent i and those of the agents on which it depends. [sent-114, score-1.704]

12 This decomposition can be depicted using an undirected graph G = (V, E) in which each node i ∈ V represents an agent and an edge (i, j) ∈ E indicates that the corresponding agents have to coordinate their actions, that is, i ∈ Γ( j) and j ∈ Γ(i). [sent-115, score-1.018]

13 This new agent contains an individual local payoff function that is defined over the combined actions of the involved agents, and returns the corresponding value of the original function. [sent-122, score-1.04]

14 Furthermore, new pairwise payoff functions have to be defined between each involved agent and the new agent in order to ensure that the action selected by the involved agent corresponds to its part of the (combined) action selected by the new agent. [sent-124, score-2.378]

15 A local payoff function fi (ai ) specifies the payoff contribution for the individual action ai of agent i, and fi j defines the payoff contribution for pairs of actions (ai , a j ) of neighboring agents (i, j) ∈ E. [sent-130, score-2.567]
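
To make the pairwise decomposition u(a) = ∑_i f_i(a_i) + ∑_{(i,j)∈E} f_ij(a_i, a_j) concrete, here is a minimal sketch (not taken from the paper) of how such a factored payoff can be stored and evaluated; the dictionary layout and function name are illustrative assumptions.

# Local payoffs f_i[i][a_i] and pairwise payoffs f_ij[(i, j)][(a_i, a_j)] for edges (i, j).
def global_payoff(f_i, f_ij, a):
    """Evaluate u(a) = sum_i f_i(a_i) + sum_{(i,j) in E} f_ij(a_i, a_j) for a joint action a."""
    value = sum(table[a[i]] for i, table in f_i.items())
    value += sum(table[(a[i], a[j])] for (i, j), table in f_ij.items())
    return value

# Two agents with two actions each and a single coordination edge.
f_i = {0: {0: 0.0, 1: 0.5}, 1: {0: 0.2, 1: 0.0}}
f_ij = {(0, 1): {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.3}}
print(global_payoff(f_i, f_ij, {0: 0, 1: 0}))   # 0.0 + 0.2 + 1.0 = 1.2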

16 Before an agent (node) is eliminated, the agent first collects all payoff functions related to its edges. [sent-136, score-1.289]

17 Next, it computes a conditional payoff function which returns the maximal value it is able to contribute to the system for every action combination of its neighbors, and a best-response function (or conditional strategy) which returns the action corresponding to the maximizing value. [sent-137, score-0.929]

18 Note that when the neighboring agent receives a function including an action of an agent on which it did not depend before, a new coordination dependency is added between these agents. [sent-139, score-1.439]

19 The agents are iteratively eliminated until one agent remains. [sent-140, score-0.873]

20 This agent selects the action that maximizes the final conditional payoff function. [sent-141, score-1.131]

21 A second pass in the reverse order is then performed in which every agent computes its optimal action based on its conditional strategy and the fixed actions of its neighbors. [sent-143, score-0.957]

22 The elimination of agent 1 induces a new dependency between agent 2 and 3 and thus a change in the graph’s topology. [sent-149, score-1.013]

23 A second pass in the reverse elimination order is performed in which each agent computes its optimal (unconditional) action from its best-response function and the fixed actions of its neighbors. [sent-156, score-0.992]
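
The elimination-and-reverse-pass scheme sketched in the surrounding sentences can be written compactly as follows. This is an illustrative sketch only: the function name, the table-based representation of payoff functions, and the explicit elimination order are our assumptions, not the paper's.

from itertools import product

def variable_elimination(actions, functions, order):
    """actions: {agent: list of actions}; functions: list of (scope, table), where scope is a
    tuple of agents and table maps action tuples (ordered as scope) to payoffs; order: the
    elimination order. Returns a maximizing joint action {agent: action}."""
    best_response = {}                  # agent -> (scope, conditional strategy table)
    funcs = list(functions)
    for agent in order:
        related = [f for f in funcs if agent in f[0]]        # collect functions on its edges
        funcs = [f for f in funcs if agent not in f[0]]
        # agents that remain in the scopes of the collected functions
        scope = tuple(sorted({v for s, _ in related for v in s if v != agent}))
        cond, strat = {}, {}
        for assignment in product(*(actions[v] for v in scope)):
            ctx = dict(zip(scope, assignment))
            best_val, best_act = float("-inf"), None
            for a in actions[agent]:                          # maximize over the eliminated agent
                ctx[agent] = a
                val = sum(t[tuple(ctx[v] for v in s)] for s, t in related)
                if val > best_val:
                    best_val, best_act = val, a
            cond[assignment] = best_val                       # conditional payoff function
            strat[assignment] = best_act                      # best-response (conditional strategy)
        funcs.append((scope, cond))                           # may add new dependencies
        best_response[agent] = (scope, strat)
    joint = {}                                                # reverse pass: fix actions
    for agent in reversed(order):
        scope, strat = best_response[agent]
        joint[agent] = strat[tuple(joint[v] for v in scope)]
    return joint

On the small example above, functions would contain one entry of scope (i,) per agent and one entry of scope (i, j) per edge.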

24 In order to compute the optimal joint action a∗ that maximizes (2), each agent i (node in G) repeatedly sends a message µi j to its neighbors j ∈ Γ(i). [sent-191, score-1.0]

25 This message is an approximation of the maximum payoff agent i is able to achieve for a given action of agent j, and is computed by maximizing (over the actions of agent i) the sum of the payoff functions fi and fi j and all incoming messages to agent i except that from agent j. [sent-193, score-3.698]

26 Note that this message only depends on the payoff relations between agent i and agent j and the incoming messages to agent i. [sent-194, score-1.862]

27 Second, an agent i only has to sum over the received messages from its neighbors which are defined over individual actions, instead of enumerating over all possible action combinations of its neighbors. [sent-200, score-0.953]

28 In the VE algorithm, the elimination of an agent often results in new dependencies between agents that did not have to coordinate initially. [sent-203, score-0.95]

29 Each agent i then selects its individual action a_i∗ = arg max_{a_i} g_i(a_i), where g_i(a_i) sums f_i(a_i) and all incoming messages µ_ji(a_i) (8). If there is only one maximizing action for every agent i, the globally optimal joint action a∗ = arg max_a u(a) is unique and has elements a∗ = (a_i∗). [sent-209, score-1.305]
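
A centralized sketch of the max-plus iteration described above. The fixed iteration count, the table layout, and the helper names are assumptions; in practice the messages are usually also normalized so they stay bounded on graphs with cycles.

def max_plus(actions, f_i, f_ij, neighbors, num_iterations=50):
    """Approximate a* by message passing: mu_ij(a_j) = max_{a_i} [ f_i(a_i) + f_ij(a_i, a_j)
    + sum of incoming messages to i except the one from j ]. `neighbors` is a symmetric
    adjacency dict {agent: list of neighboring agents}."""
    mu = {(i, j): {a: 0.0 for a in actions[j]} for i in neighbors for j in neighbors[i]}

    def f_edge(i, j, a_i, a_j):
        # pairwise payoffs are stored once per undirected edge
        return f_ij[(i, j)][(a_i, a_j)] if (i, j) in f_ij else f_ij[(j, i)][(a_j, a_i)]

    for _ in range(num_iterations):
        for i in neighbors:
            for j in neighbors[i]:
                for a_j in actions[j]:
                    mu[(i, j)][a_j] = max(
                        f_i[i][a_i] + f_edge(i, j, a_i, a_j)
                        + sum(mu[(k, i)][a_i] for k in neighbors[i] if k != j)
                        for a_i in actions[i])
    # each agent i picks a_i* = argmax_{a_i} g_i(a_i), with g_i(a_i) = f_i(a_i) + sum_j mu_ji(a_i)
    return {i: max(actions[i], key=lambda a_i, i=i: f_i[i][a_i]
                   + sum(mu[(j, i)][a_i] for j in neighbors[i]))
            for i in neighbors}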

30 In this case, each agent informs its neighbors in a predefined order about its action choice such that the other agents are able to fix their actions accordingly. [sent-215, score-1.391]

31 For the anytime extension, we insert the current computed joint action into (2) after every iteration and only update the joint action when it improves upon the best value found so far. [sent-238, score-0.954]
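
A sketch of the centralized anytime wrapper: after every iteration the currently favoured joint action is plugged into the global payoff u(a) and kept only if it improves on the best value found so far. The callable arguments are assumptions made to keep the sketch self-contained.

def anytime_max_plus(run_one_iteration, current_joint_action, global_payoff, num_iterations=50):
    """Return the best joint action encountered during the max-plus iterations."""
    best_action, best_value = None, float("-inf")
    for _ in range(num_iterations):
        run_one_iteration()                 # one round of message updates
        a = current_joint_action()          # actions currently favoured by the agents
        value = global_payoff(a)            # evaluate u(a) as in (2)
        if value > best_value:              # only replace when the value improves
            best_action, best_value = a, value
    return best_action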

32 For the distributed case, the implementation of the anytime extension is much more complex since the agents do not have direct access to the actions of the other agents or the global payoff function (2). [sent-245, score-1.476]

33 Therefore, the evaluation of the (distributed) joint action is only initiated by an agent when it believes it is worthwhile to do so, for example, after a big increase in the values of the received messages. [sent-246, score-0.917]

34 An agent receiving an evaluation message fixes its individual action until after the evaluation. [sent-250, score-0.862]

35 When an agent is a leaf of ST it also computes its local contribution to the global payoff and sends this value to its parent in ST . [sent-251, score-0.935]

36 In order to generate balanced graphs in which each agent approximately has the same degree, we start with a graph without edges and iteratively connect the two agents with the minimum number of neighbors. [sent-272, score-0.989]

37 In case multiple agents satisfy this condition, an agent is picked at random from the possibilities. [sent-273, score-0.873]

38 In the first set, each edge (i, j) ∈ E is associated with a payoff function fi j defined over five actions per agent and each action combination is assigned a random payoff from a standard normal distribution, that is, fi j (ai , a j ) ∼ N (0, 1). [sent-286, score-1.738]
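
A sketch of how such test graphs and random pairwise payoffs could be generated. The tie-breaking procedure and the function names are assumptions, and the graph builder assumes enough unconnected pairs remain for the requested number of edges.

import random

def random_balanced_graph(n_agents, n_edges):
    """Start without edges; repeatedly connect two not-yet-connected minimum-degree agents."""
    edges, degree = set(), {i: 0 for i in range(n_agents)}
    while len(edges) < n_edges:
        order = sorted(range(n_agents), key=lambda i: (degree[i], random.random()))
        i = order[0]                                             # a minimum-degree agent
        j = next(k for k in order[1:] if (min(i, k), max(i, k)) not in edges)
        edges.add((min(i, j), max(i, j)))
        degree[i] += 1
        degree[j] += 1
    return edges

def random_pairwise_payoffs(edges, n_actions=5):
    """Draw f_ij(a_i, a_j) ~ N(0, 1) for every action pair of every edge."""
    return {e: {(a_i, a_j): random.gauss(0.0, 1.0)
                for a_i in range(n_actions) for a_j in range(n_actions)}
            for e in edges}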

39 For the third test set, we specify a payoff function based on 10 actions per agent resulting in 10^15 different joint actions. [sent-289, score-1.073]

40 The graphs with 10 actions per agent require more time compared to the two other sets because the computation of every message involves a maximization over 100 instead of 25 joint actions. [sent-336, score-0.871]

41 Furthermore, the elimination of an agent often causes a neighboring agent to receive a conditional strategy involving agents it did not have to coordinate with before, changing the graph topology to an even denser graph. [sent-343, score-1.497]

42 For example, during a local maximization of an agent with five neighbors, 5^5 = 3,125 actions have to be enumerated in the case of 5 actions per agent. [sent-353, score-0.935]

43 A relative payoff of 1 indicates that the found joint action corresponds to the optimal joint action, while a relative payoff of 0 indicates that it corresponds to the joint action with the minimal possible payoff. [sent-372, score-1.48]

44 Collaborative Multiagent Reinforcement Learning Until now, we have been discussing the problem of selecting an optimal joint action in a group of agents for a given payoff structure and a single state only. [sent-388, score-1.098]

45 In such problems, the agents select a joint action which provides them a reward and causes a transition to a new state. [sent-390, score-1.006]

46 Figure 6: Relative payoff compared to VE for both standard max-plus (graphs on the left) and anytime max-plus (graphs on the right) for graphs with 15 agents and cycles. [sent-446, score-0.907]

47 • A reward function R_i : S × A → R which provides agent i with an individual reward r_i^t = R_i(s^t, a^t) based on the joint action a^t taken in state s^t. [sent-473, score-1.432]

48 The agents do observe the current state and also receive an individual reward depending on the performed joint action and the unknown reward function. [sent-502, score-1.308]

49 2 MDP Learners In principle, a collaborative multiagent MDP can be regarded as one large single agent in which each joint action is represented as a single action. [sent-505, score-1.087]

50 In this MDP learners approach, either a central controller models the complete MDP and communicates to each agent its individual action, or each agent models the complete MDP separately and selects the individual action that corresponds to its own identity. [sent-507, score-1.378]

51 In the latter case, the agents do not need to communicate but they have to be able to observe the executed joint action and the received individual rewards. [sent-508, score-0.859]

52 This approach results in large storage and computational savings in the action space; for example, with 7 agents and 6 actions per agent only 42 Q-values have to be stored per state. [sent-520, score-1.079]
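
The storage argument can be checked directly; the numbers below simply reproduce the example from the text.

# With n = 7 agents and 6 actions per agent, a joint-action (MDP) learner stores
# 6**7 Q-values per state, while independent local Q-functions need only 7 * 6.
n_agents, n_actions = 7, 6
print(n_actions ** n_agents)   # 279936 joint-action values per state
print(n_agents * n_actions)    # 42 individual values per state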

53 Because the actions of the other agents are ignored in the representation of the Q-functions, and these agents also change their behavior while learning, the system becomes nonstationary from the perspective of an individual agent. [sent-522, score-0.967]

54 4 Coordinated Reinforcement Learning In many situations an agent has to coordinate its actions with a few agents only, and acts independently with respect to the other agents. [sent-527, score-1.06]

55 Each local Q_i is based on a subset of all state and action variables, Q(s, a) = ∑_{i=1}^{n} Q_i(s_i, a_i) (12), where s_i and a_i are respectively the subset of state and action variables related to agent i. [sent-532, score-1.355]
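
A minimal sketch of this factored representation; the dictionary-of-tables layout and the zero default for unseen pairs are assumptions.

def factored_q_value(q_i, scopes, state, joint_action):
    """Q(s, a) = sum_i Q_i(s_i, a_i), where (s_i, a_i) are the variables in agent i's scope."""
    total = 0.0
    for i, table in q_i.items():
        state_vars, action_vars = scopes[i]
        s_i = tuple(state[v] for v in state_vars)
        a_i = tuple(joint_action[v] for v in action_vars)
        total += table.get((s_i, a_i), 0.0)      # unseen state-action pairs default to zero
    return total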

56 The corresponding CG is constructed by adding an edge between agent i and j when the action of agent j is included in the action variables of agent i, that is, a j ∈ ai . [sent-535, score-2.219]

57 The estimate of the global Q-value in s, Q(s, a) in (13), is computed by fixing the action of every agent to the one assigned in a and applying a message passing scheme similar to the one used in the VE algorithm. [sent-541, score-0.886]

58 Each agent maintains an individual local Q-function, Qi (si , ai ), based on its individual action and updates it by incorporating the Q-functions of its neighboring agents. [sent-555, score-1.041]

59 A weight function f (i, j) determines how much the Q-value of an agent j contributes to the update of the Q-value of agent i. [sent-556, score-1.015]

60 This function defines a graph structure of agent dependencies, in which an edge is added between agents i and j if the corresponding function f (i, j) is non-zero. [sent-557, score-0.973]

61 In the agent-based decomposition the local function of an agent is based on its own action and those of its neighboring agents. [sent-568, score-0.897]

62 Every agent i is associated with a local Q-function Qi (si , ai ) which only depends on a subset of all possible state and action variables. [sent-579, score-0.921]

63 The Qi -functions correspond to a CG which is constructed by connecting each agent with all agents in which its action variable is involved. [sent-581, score-1.173]

64 However, we can use the VE algorithm to compute, in a distributed manner, the maximizing joint action a∗ = arg max_{a′} Q(s′, a′) in state s′, and from this compute the local contribution Q_i(s′_i, a_i∗) of each agent to the total action value Q(s′, a∗). [sent-586, score-1.302]

65 Note that the local contribution of an agent to the global action value might be lower than the maximizing value of its local Q-function because it is unaware of the dependencies of its neighboring agents with the other agents in the CG. [sent-587, score-1.798]
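
A sketch of the resulting agent-based update, where a_i denotes the actions in agent i's scope (its own and its neighbors'), and the maximizing joint action a∗ and the contribution Q_i(s′_i, a_i∗) are assumed to have been computed beforehand with VE or max-plus; the learning rate and discount factor are placeholder values, and the exact form in the paper may differ in details.

def sparseq_agent_update(q_i, s_i, a_i, reward_i, next_s_i, a_star_i, alpha=0.2, gamma=0.9):
    """Q_i(s_i, a_i) <- Q_i(s_i, a_i) + alpha * [R_i + gamma * Q_i(s'_i, a_i*) - Q_i(s_i, a_i)]."""
    old = q_i.get((s_i, a_i), 0.0)
    contribution = q_i.get((next_s_i, a_star_i), 0.0)   # agent i's part of max_a Q(s', a)
    q_i[(s_i, a_i)] = old + alpha * (reward_i + gamma * contribution - old)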

66 As a consequence, the agents are not able to distinguish which agents are responsible for the received reward, and all functions, including the ones which are not related to the received reward, are updated equally. [sent-595, score-0.868]

67 Then, the local Q-function Q_i of agent i is defined as the summation of half the value of all local Q-functions Q_ij of agent i and its neighbors j ∈ Γ(i), that is, Q_i(s_i, a_i) = ½ ∑_{j∈Γ(i)} Q_ij(s_ij, a_i, a_j). [sent-616, score-1.258]

68 Because one half of every local Q-function Q_ij is updated by agent i and the other half by agent j, agent j updates the local Q-function Q_ij using a decomposition similar to (19). [sent-629, score-1.625]

69 Each local Q-function Q_ij is updated with a proportional part of the received reward of the two agents it is related to (R_i/|Γ(i)| and R_j/|Γ(j)| in (20)) and with the contribution of this edge to the maximizing joint action a∗ = (a_i∗) = arg max_{a′} Q(s′, a′) in the state s′. [sent-631, score-1.24]
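
A sketch of the edge-based update just described, with the reward of each incident agent split in proportion to its degree; the constants and the detailed form of (20) in the paper may differ, so treat this as illustrative.

def sparseq_edge_update(q_ij, s_ij, a_i, a_j, r_i, r_j, deg_i, deg_j,
                        next_s_ij, a_star_i, a_star_j, alpha=0.2, gamma=0.9):
    """Update one edge Q-function Q_ij with its share of the rewards of agents i and j
    and with this edge's contribution to the maximizing joint action a*."""
    old = q_ij.get((s_ij, a_i, a_j), 0.0)
    next_value = q_ij.get((next_s_ij, a_star_i, a_star_j), 0.0)
    q_ij[(s_ij, a_i, a_j)] = old + alpha * (r_i / deg_i + r_j / deg_j + gamma * next_value - old)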

70 After the execution of a joint action a, the episode is immediately ended and the system provides each agent an individual reward Ri (a). [sent-672, score-1.199]

71 The local reward R_i received by an agent i only depends on a subset of the actions of the other agents. [sent-674, score-0.976]

72 These dependencies are modeled using a graph in which each edge corresponds to a local reward function that assigns a value r(ai , a j ) to each possible action combination of the actions of agent i and agent j. [sent-675, score-1.846]

73 Figure 9 shows an example of the construction of the individual reward received by an agent based on its interaction with its four neighbors, together with an example reward function r(a_i, a_j) corresponding to an edge between agent i and agent j. [sent-681, score-2.09]

74 The goal of the agents is to learn, based on the received individual rewards, to select a joint action that maximizes the global reward. [sent-682, score-0.914]
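
A sketch of this stateless test problem; it assumes the per-edge reward tables r(a_i, a_j) are stored once per undirected edge.

def individual_rewards(joint_action, edge_reward, neighbors):
    """R_i(a) = sum_{j in Gamma(i)} r(a_i, a_j); the global reward is the sum of all R_i."""
    def r(i, j):
        a_i, a_j = joint_action[i], joint_action[j]
        return (edge_reward[(i, j)][(a_i, a_j)] if (i, j) in edge_reward
                else edge_reward[(j, i)][(a_j, a_i)])
    return {i: sum(r(i, j) for j in neighbors[i]) for i in neighbors}

The global reward the learners should maximize is then sum(individual_rewards(...).values()).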

75 First, the outcome of a selected action of an agent also depends on the actions of its neighbors. [sent-684, score-0.957]

76 The difference with the earlier setting, in which the agents have to select a joint action that maximizes predefined payoff functions, is that in this case the payoff relations themselves have to be learned based on the received rewards. [sent-689, score-1.45]

77 [Figure 9 residue: the individual reward of agent 1 is constructed as R_1 = ∑_{j∈Γ(1)} r(a_1, a_j) over its neighbors 2-5, shown together with an example reward table r(a_i, a_j) indexed by the actions of agent i and agent j.] [sent-690, score-1.578]

78 Furthermore, we assume that the agents have access to a CG which for each agent specifies on which other agents it depends. [sent-723, score-1.257]

79 Apart from the different Q-learning methods, we also apply an approach that selects a joint action uniformly at random and keeps track of the best joint action found so far, and a method that enumerates all possible joint actions and stores the one with the highest reward. [sent-725, score-1.063]

80 An agent selects an action that maximizes its own local Q-function Qi . [sent-728, score-0.861]

81 Coordinated reinforcement learning (CoordRL) Each agent i stores an individual Q-function based on its own action and the actions of its neighbors j ∈ Γ(i). [sent-737, score-1.091]

82 Sparse cooperative Q-learning, agent-based (SparseQ agent) Each agent stores a Q-function that is based on its own action and the actions of its neighbors j ∈ Γ(i). [sent-741, score-1.08]

83 An exploration action should be made (probability ε), and this exploration action should equal the optimal joint action (probability of 1/4). [sent-762, score-1.026]

84 In the IL/DVF method each agent only stores functions based on its individual action, so its storage is constant in the number of dependencies in the graph. [sent-779, score-0.865]

85-88 [Figure residue: learning-curve legends comparing Optimal, SparseQ edge-based with agent updates (VE and anytime), and SparseQ edge-based with edge updates (VE and anytime), plotted against time steps for graphs with (a) average degree less than 2 and (c) average degree between 3 and 4.]

89 [Table residue: comparison of Random with memory, IL, CoordRL, SparseQ agent (VE), SparseQ edge, agent (VE), SparseQ edge, edge (VE), SparseQ edge, agent (anytime), and SparseQ edge, edge (anytime) for graphs with average degree in (1, 2], (2, 3], (3, 4], and (4, 5].]

90 However, in order to learn this policy based on the received rewards, the agents have to cope with the delayed reward and learn how to coordinate their actions such that multiple targets are hit simultaneously. [sent-974, score-0.88]

91 The different SparseQ variants have access to a CG which specifies for each agent on which other agents it depends. [sent-1003, score-0.873]

92 As a result the agents are able to quickly learn a good policy that captures the targets in a few steps, but it takes a long time to converge to a joint action that does not involve the unnecessary focus actions of some of the agents. [sent-1029, score-0.969]

93-94 [Figure residue: plots of the average reward, average number of steps, cumulative average reward, and cumulative time (s) for Optimal, SparseQ edge, edge (anytime), SparseQ agent (VE), SparseQ edge, agent (anytime), MDP, DVF, and IL.]

95 Because in the agent-based decomposition the full action space is decomposed into different independent local action values, it does result in a better performance than the MDP learners, both in the obtained average reward and the number of steps needed to capture the targets. [sent-1038, score-0.933]

96 [Table residue: reward and number of steps for SparseQ edge, edge (anytime), SparseQ edge, edge (VE), SparseQ agent (VE), SparseQ edge, agent (VE), and SparseQ edge, agent (anytime).]

97 This method stores a Q-function based on all action combinations of an agent and its neighbors in the CG. [sent-1077, score-0.861]

98 First, we described a payoff propagation algorithm (max-plus) that can be used as an alternative to variable elimination (VE) for finding the optimal joint action in a coordination graph (CG) with predefined payoff functions. [sent-1089, score-1.209]

99 Effectively, each agent learns its part of the global solution by only coordinating with the agents on which it depends. [sent-1105, score-0.928]

100 Since all Q-functions and updates are defined locally, it is possible to compensate for the addition or removal of an agent by redefining only the Q-functions in which this agent is involved. [sent-1121, score-1.012]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('agent', 0.489), ('agents', 0.384), ('payoff', 0.311), ('sparseq', 0.303), ('action', 0.3), ('reward', 0.236), ('actions', 0.168), ('multiagent', 0.165), ('anytime', 0.145), ('kok', 0.126), ('coordination', 0.12), ('qi', 0.117), ('cg', 0.096), ('ve', 0.088), ('joint', 0.086), ('edge', 0.078), ('ollaborative', 0.075), ('ropagation', 0.075), ('ultiagent', 0.075), ('ai', 0.074), ('graphs', 0.067), ('lassis', 0.067), ('mdp', 0.057), ('einforcement', 0.057), ('episode', 0.057), ('global', 0.055), ('degree', 0.052), ('coordrl', 0.051), ('cooperative', 0.051), ('neighbors', 0.05), ('collaborative', 0.047), ('vlassis', 0.043), ('si', 0.043), ('dvf', 0.043), ('received', 0.042), ('message', 0.042), ('local', 0.041), ('neighboring', 0.041), ('messages', 0.041), ('guestrin', 0.038), ('maxa', 0.038), ('st', 0.037), ('update', 0.037), ('rewards', 0.036), ('plus', 0.036), ('elimination', 0.035), ('updates', 0.034), ('dsn', 0.031), ('reinforcement', 0.031), ('policy', 0.031), ('fi', 0.031), ('individual', 0.031), ('average', 0.03), ('coordinated', 0.03), ('sensors', 0.03), ('distributed', 0.029), ('edges', 0.027), ('send', 0.027), ('ri', 0.027), ('decomposition', 0.026), ('timing', 0.025), ('decompositions', 0.025), ('propagation', 0.024), ('centralized', 0.024), ('dependencies', 0.023), ('learners', 0.023), ('stores', 0.022), ('graph', 0.022), ('earning', 0.022), ('contribution', 0.022), ('sensor', 0.022), ('autonomous', 0.021), ('il', 0.021), ('maxai', 0.02), ('msg', 0.02), ('qfunction', 0.02), ('exploration', 0.02), ('width', 0.019), ('per', 0.019), ('coordinate', 0.019), ('receive', 0.018), ('gi', 0.018), ('maximizing', 0.018), ('discounted', 0.018), ('connected', 0.018), ('densely', 0.018), ('cumulative', 0.017), ('enumeration', 0.017), ('deadline', 0.017), ('sends', 0.017), ('cycles', 0.017), ('state', 0.017), ('maximizes', 0.016), ('aamas', 0.016), ('communicate', 0.016), ('payoffs', 0.016), ('zilberstein', 0.016), ('updated', 0.016), ('ji', 0.016), ('selects', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation

2 0.19498451 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

3 0.16583602 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.

4 0.13072564 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

5 0.1261819 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

Author: Shimon Whiteson, Peter Stone

Abstract: Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice. Keywords: reinforcement learning, temporal difference methods, evolutionary computation, neuroevolution, on-

6 0.10770007 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

7 0.089212149 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

8 0.08426223 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

9 0.05253743 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

10 0.051001363 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs

11 0.050863471 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

12 0.039078094 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

13 0.037189227 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

14 0.034166817 25 jmlr-2006-Distance Patterns in Structural Similarity

15 0.031954382 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

16 0.031876583 32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems

17 0.029514799 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

18 0.029072834 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

19 0.027642613 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

20 0.026783269 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, 0.222), (2, -0.375), (3, 0.101), (4, 0.291), (5, 0.053), (6, 0.023), (7, -0.031), (8, -0.033), (9, -0.168), (10, 0.066), (11, 0.013), (12, -0.132), (13, -0.011), (14, -0.026), (15, -0.089), (16, 0.103), (17, 0.124), (18, 0.094), (19, -0.153), (20, -0.039), (21, -0.033), (22, 0.108), (23, 0.045), (24, 0.019), (25, 0.044), (26, -0.076), (27, -0.104), (28, -0.038), (29, 0.101), (30, -0.022), (31, 0.019), (32, 0.028), (33, -0.013), (34, -0.142), (35, -0.05), (36, -0.066), (37, 0.045), (38, -0.041), (39, -0.029), (40, -0.056), (41, -0.007), (42, 0.047), (43, -0.018), (44, 0.065), (45, -0.025), (46, 0.001), (47, 0.019), (48, 0.083), (49, -0.007)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98065263 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation

2 0.70262021 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.

3 0.63632649 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning

Author: Shimon Whiteson, Peter Stone

Abstract: Temporal difference methods are theoretically grounded and empirically effective methods for addressing reinforcement learning problems. In most real-world reinforcement learning tasks, TD methods require a function approximator to represent the value function. However, using function approximators requires manually making crucial representational decisions. This paper investigates evolutionary function approximation, a novel approach to automatically selecting function approximator representations that enable efficient individual learning. This method evolves individuals that are better able to learn. We present a fully implemented instantiation of evolutionary function approximation which combines NEAT, a neuroevolutionary optimization technique, with Q-learning, a popular TD method. The resulting NEAT+Q algorithm automatically discovers effective representations for neural network function approximators. This paper also presents on-line evolutionary computation, which improves the on-line performance of evolutionary computation by borrowing selection mechanisms used in TD methods to choose individual actions and using them in evolutionary computation to select policies for evaluation. We evaluate these contributions with extended empirical studies in two domains: 1) the mountain car task, a standard reinforcement learning benchmark on which neural network function approximators have previously performed poorly and 2) server job scheduling, a large probabilistic domain drawn from the field of autonomic computing. The results demonstrate that evolutionary function approximation can significantly improve the performance of TD methods and on-line evolutionary computation can significantly improve evolutionary methods. This paper also presents additional tests that offer insight into what factors can make neural network function approximation difficult in practice. Keywords: reinforcement learning, temporal difference methods, evolutionary computation, neuroevolution, on-

4 0.58127236 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart

Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration

5 0.48022628 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

6 0.38824689 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving     (Special Topic on Inductive Programming)

7 0.27054217 75 jmlr-2006-Policy Gradient in Continuous Time

8 0.26448932 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

9 0.19327135 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

10 0.17944935 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs

11 0.17497292 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

12 0.15604019 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

13 0.13838321 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

14 0.13764414 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

15 0.13149557 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

16 0.1184618 25 jmlr-2006-Distance Patterns in Structural Similarity

17 0.11587296 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

18 0.11356435 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

19 0.11332784 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

20 0.11171678 53 jmlr-2006-Learning a Hidden Hypergraph


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.013), (15, 0.39), (21, 0.023), (36, 0.072), (45, 0.024), (49, 0.026), (50, 0.054), (61, 0.015), (63, 0.033), (68, 0.011), (70, 0.012), (76, 0.032), (78, 0.02), (79, 0.011), (81, 0.025), (84, 0.018), (90, 0.017), (91, 0.041), (96, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73950493 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

Author: Jelle R. Kok, Nikos Vlassis

Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation

2 0.29790494 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

Author: Mikio L. Braun

Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
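To make the object of study concrete, here is a minimal numpy sketch (ours, not from the paper) that computes the eigenvalues of an RBF kernel matrix and applies the usual 1/n scaling under which they approximate the eigenvalues of the underlying integral operator; the bounds discussed above control the error of each such estimate.

import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # 200 samples in R^2
K = rbf_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)[::-1]   # eigenvalues in descending order
print(eigvals[:5] / X.shape[0])         # empirical estimates of the operator eigenvalues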

3 0.29753411 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour

Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε^2) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
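To make the sample-complexity statement concrete (and only that statement; this is not the authors' action-elimination procedure), the naive strategy below pulls every arm the same number of times and returns the empirically best one. By Hoeffding's inequality, (2/ε^2) log(2n/δ) pulls per arm suffice for rewards in [0, 1], giving O((n/ε^2) log(n/δ)) pulls in total, which carries an extra log n factor compared with the bound quoted in the abstract. All names below are hypothetical.

import math
import random

def naive_best_arm(pull, n_arms, epsilon, delta):
    """Uniformly sample each arm and return the index of the empirical best.

    pull(i) must return a stochastic reward in [0, 1] for arm i.
    Each arm is pulled ceil((2/epsilon**2) * log(2*n_arms/delta)) times, so the
    returned arm is epsilon-optimal with probability at least 1 - delta.
    """
    pulls_per_arm = math.ceil((2.0 / epsilon ** 2) * math.log(2 * n_arms / delta))
    means = []
    for arm in range(n_arms):
        rewards = [pull(arm) for _ in range(pulls_per_arm)]
        means.append(sum(rewards) / pulls_per_arm)
    return max(range(n_arms), key=lambda arm: means[arm])

# Hypothetical Bernoulli bandit with 5 arms.
true_means = [0.2, 0.5, 0.55, 0.8, 0.78]
best = naive_best_arm(lambda i: float(random.random() < true_means[i]),
                      n_arms=5, epsilon=0.1, delta=0.05)
print("selected arm:", best)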

4 0.29516214 53 jmlr-2006-Learning a Hidden Hypergraph

Author: Dana Angluin, Jiang Chen

Abstract: We consider the problem of learning a hypergraph using edge-detecting queries. In this model, the learner may query whether a set of vertices induces an edge of the hidden hypergraph or not. We show that an r-uniform hypergraph with m edges and n vertices is learnable with O(2^{4r} m · poly(r, log n)) queries with high probability. The queries can be made in O(min(2^r (log m + r)^2, (log m + r)^3)) rounds. We also give an algorithm that learns an almost uniform hypergraph of dimension r using O(2^{O((1+∆/2)r)} · m^{1+∆/2} · poly(log n)) queries with high probability, where ∆ is the difference between the maximum and the minimum edge sizes. This upper bound matches our lower bound of Ω((m/(1+∆/2))^{1+∆/2}) for this class of hypergraphs in terms of dependence on m. The queries can also be made in O((1 + ∆) · min(2^r (log m + r)^2, (log m + r)^3)) rounds. Keywords: query learning, hypergraph, multiple round algorithm, sampling, chemical reaction network
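To clarify the query model (this only illustrates edge-detecting queries on a toy instance, not the learning algorithm from the paper), the sketch below defines an oracle that answers whether a vertex set contains some edge of a hidden hypergraph, together with a brute-force learner that spends O(n^r) queries and is therefore usable only for very small examples.

from itertools import combinations

def make_edge_detecting_oracle(hidden_edges):
    """Return query(S) -> True iff the vertex set S contains some hidden edge."""
    edges = [frozenset(e) for e in hidden_edges]
    return lambda S: any(e <= frozenset(S) for e in edges)

def brute_force_learn(query, vertices, r):
    """Recover all edges of an r-uniform hidden hypergraph.

    For an r-uniform hypergraph, an r-set of vertices contains an edge iff it
    *is* an edge, so one query per candidate r-set suffices.  This uses O(n^r)
    queries and is purely illustrative, nothing like the paper's bounds.
    """
    return [frozenset(c) for c in combinations(vertices, r) if query(c)]

oracle = make_edge_detecting_oracle([{1, 2, 3}, {2, 4, 5}])
print(brute_force_learn(oracle, vertices=range(1, 6), r=3))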

5 0.29397464 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs
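Since the abstract relies on dynamic programming over the Markov tree of the label hierarchy, here is a generic max-sum sketch over a rooted tree with binary labels at each node; it illustrates the kind of tree DP involved, not the authors' marginal-dual update computation, and the label names and scores are hypothetical.

def tree_max_sum(children, node_score, edge_score, root):
    """Max-sum dynamic programming on a rooted tree with binary node labels.

    children[u]            -> list of the children of node u
    node_score(u, y)       -> local score of assigning label y in {0, 1} to node u
    edge_score(u, v, y, z) -> score of label y on parent u together with z on child v
    Returns (best total score, dict mapping each node to its optimal label).
    """
    M, arg = {}, {}

    def up(u):                              # bottom-up pass: best subtree scores
        for v in children.get(u, []):
            up(v)
        M[u], arg[u] = {}, {}
        for y in (0, 1):
            total, arg[u][y] = node_score(u, y), {}
            for v in children.get(u, []):
                z_best = max((0, 1), key=lambda z: edge_score(u, v, y, z) + M[v][z])
                total += edge_score(u, v, y, z_best) + M[v][z_best]
                arg[u][y][v] = z_best
            M[u][y] = total

    def down(u, y, labels):                 # top-down pass: recover the labeling
        labels[u] = y
        for v in children.get(u, []):
            down(v, arg[u][y][v], labels)

    up(root)
    y_root = max((0, 1), key=lambda y: M[root][y])
    labels = {}
    down(root, y_root, labels)
    return M[root][y_root], labels

# Hypothetical 3-node hierarchy: root 'news' with children 'sports' and 'politics'.
children = {"news": ["sports", "politics"]}
node_score = lambda u, y: {("news", 1): 1.0, ("sports", 1): 0.5,
                           ("politics", 1): -0.2}.get((u, y), 0.0)
# Penalize labeling a child positive when its parent is negative.
edge_score = lambda u, v, y, z: -5.0 if (z == 1 and y == 0) else 0.0
print(tree_max_sum(children, node_score, edge_score, "news"))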

6 0.29330924 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

7 0.29198876 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

8 0.29021829 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

9 0.28971148 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

10 0.28870553 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

11 0.28829238 44 jmlr-2006-Large Scale Transductive SVMs

12 0.28806055 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

13 0.28513098 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

14 0.28468692 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

15 0.28468275 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

16 0.28451937 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

17 0.28427476 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

18 0.283602 94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation

19 0.28330532 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)

20 0.28283212 70 jmlr-2006-Online Passive-Aggressive Algorithms