nips nips2005 nips2005-153 knowledge-graph by maker-knowledge-mining

153 nips-2005-Policy-Gradient Methods for Planning


Source: pdf

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. [sent-4, score-0.632]

2 These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. [sent-5, score-0.261]

3 1 Introduction To date, only a few planning tools have attempted to handle general probabilistic temporal planning problems. [sent-10, score-0.559]

4 We apply policy-gradient reinforcement learning (RL) to these domains with the goal of creating tools that produce good policies in real-world domains rather than perfect policies in toy domains. [sent-12, score-0.589]

5 Policy gradient methods do not enumerate states and are applicable to multi-agent settings with function approximation [1, 2]; thus they are a natural match for our approach to handling large planning problems. [sent-14, score-0.373]

6 We use the GPOMDP algorithm [3] to estimate the gradient of a long-term average reward of the planner’s performance, with respect to the parameters of each task policy. [sent-15, score-0.397]

7 We show that maximising a simple reward function naturally minimises plan durations and maximises the probability of reaching the plan goal. [sent-16, score-0.49]

8 A minor contribution of this paper is a description of how policy-gradient methods can be used to prune a decision tree over possible policies. [sent-18, score-0.274]

9 After training, the decision tree can be translated into a list of policy rules. [sent-19, score-0.484]

10 Previous probabilistic temporal planners include CPTP [4], Prottle [5], Tempastic [6] and a military operations planner [7]. [sent-20, score-0.417]

11 CPTP, Prottle, and Tempastic minimise either plan duration or failure probability, not both. [sent-25, score-0.286]

12 2 Probabilistic temporal planning Tasks are the basic planning unit corresponding to grounded1 durative actions. [sent-27, score-0.564]

13 A task is eligible to begin when its preconditions are satisfied and sufficient resources are available. [sent-31, score-0.553]

14 As tasks end, a set of effects appropriate to the outcome is applied. [sent-33, score-0.296]

15 Typically, but not necessarily, succeeding tasks set some facts to true, while failing tasks do nothing or negate facts. [sent-34, score-0.414]

16 The planning goal is to set a subset of the conditions to a desired value. [sent-37, score-0.282]
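
The task model just described (preconditions, resource requirements, a fixed duration, and multiple uncertain outcomes whose effects set or negate conditions) can be sketched as a small data structure. This is illustrative only; the field and method names below are assumptions, not the planner's actual representation.

```python
import random
from dataclasses import dataclass

@dataclass
class Outcome:
    probability: float   # chance of this outcome when the task ends
    effects: dict        # condition -> truth value applied on this outcome (may be empty)

@dataclass
class Task:
    name: str
    preconditions: dict  # condition -> required truth value
    resource_cost: dict  # resource -> amount consumed when the task starts
    duration: int        # fixed duration; an end-task event is queued this far ahead
    outcomes: list       # Outcome objects whose probabilities sum to 1

    def eligible(self, conditions, resources):
        """A task may begin when its preconditions hold and sufficient resources remain."""
        return (all(conditions.get(c) == v for c, v in self.preconditions.items())
                and all(resources.get(r, 0) >= amt for r, amt in self.resource_cost.items()))

    def sample_outcome(self):
        """Draw one of the task's uncertain outcomes at completion."""
        r, acc = random.random(), 0.0
        for o in self.outcomes:
            acc += o.probability
            if r <= acc:
                return o
        return self.outcomes[-1]
```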

17 Actions in temporal planning consist of launching multiple tasks concurrently. [sent-42, score-0.493]

18 The set of candidate actions available in a given state is the power set of the tasks that are eligible to start. [sent-43, score-0.710]

19 That is, with N eligible tasks there are 2^N possible actions. [sent-44, score-0.546]
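
To see the scale of the action space, a quick count, assuming for illustration that 20 tasks are simultaneously eligible:

```latex
% With N eligible tasks the joint actions are the subsets of the eligible set:
%   |A(s)| = 2^N,   e.g.  N = 20  =>  2^20 = 1,048,576 joint actions,
% whereas N independent Yes/No agents (the factored approach below) make only
% N = 20 binary choices per step.
\[
  |\mathcal{A}(s)| = 2^{N}, \qquad 2^{20} = 1\,048\,576 .
\]
```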

20 When combined with probabilistic outcomes, the state space explosion cripples existing planners for tens of tasks and actions. [sent-46, score-0.537]

21 A key reason to treat each task as an individual policy agent is to deal with this explosion of the action space. [sent-47, score-0.588]

22 We replace the single agent choosing from the power-set of eligible tasks with a single simple agent for each task. [sent-48, score-0.852]

23 The policy learnt by each agent is whether to start its associated task given its observation, independent of the decisions made by the other agents. [sent-49, score-0.519]

24 Indeed, if the agents received perfect state information they could learn to predict the decision of the other agents and still act optimally. [sent-51, score-0.655]

25 3 POMDP formulation of planning Our intention is to deliberately use simple agents that only consider partial state information. [sent-53, score-0.577]

26 Goal states are states where all the goal state variables are satisfied. [sent-57, score-0.269]

27 From failure states it is impossible to reach a goal state, usually because time or resources have run out. [sent-58, score-0.316]

28 These two classes of state are combined to form the set of reset states that produce an immediate reset to the initial state s0. (Footnote 1: Grounded means that tasks do not have parameters that can be instantiated.) [sent-59, score-0.515]

29 A single trajectory through the state space consists of many individual trials that automatically reset to s0 each time a goal state or failure state is reached. [sent-61, score-0.6]

30 An entry of 1 at index n means ‘Yes’, begin task n, and an entry of 0 means ‘No’, do not start task n. [sent-65, score-0.255]

31 The probability of actions is Pr[a|o, θ], where conditioning on θ reflects the fact that the policy is controlled by a set of real-valued parameters θ ∈ R^p. [sent-66, score-0.283]

32 The long-term average reward is η(θ) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t) ]. In the context of planning, the instantaneous reward provides the agent with a measure of progress toward the goal. [sent-73, score-0.335]

33 A simple reward scheme is to set r(s) = 1 for all states s that represent the goal state, and 0 for all other states. [sent-74, score-0.276]

34 To maximise η(θ), successful planning outcomes must be reached as frequently as possible. [sent-75, score-0.336]

35 It is tempting to provide a negative reward for failure states, but this can introduce poor local maxima in the form of policies that avoid negative rewards by avoiding progress altogether. [sent-77, score-0.504]

36 We provide a reward of 1000 each time the goal is achieved, plus an admissible heuristic reward for progress toward the goal. [sent-78, score-0.407]

37 This additional shaping reward provides a reward of 1 for every goal state variable achieved, and -1 for every goal variable that becomes unset. [sent-79, score-0.574]

38 Policies that are optimal with the additional shaping reward are still optimal under the basic goal state reward [9]. [sent-80, score-0.531]
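
The reward scheme described above (1000 whenever the goal is reached, plus a shaping term of +1 for each goal state variable newly achieved and -1 for each that becomes unset) can be written down directly. The function signature is an assumption for illustration; the constants come from the text.

```python
GOAL_BONUS = 1000.0  # reward each time the full planning goal is achieved

def shaped_reward(prev_goal_vars, curr_goal_vars, goal_reached):
    """prev_goal_vars / curr_goal_vars: dicts mapping goal state variable -> bool."""
    r = GOAL_BONUS if goal_reached else 0.0
    for var, now_true in curr_goal_vars.items():
        was_true = prev_goal_vars.get(var, False)
        if now_true and not was_true:
            r += 1.0        # progress: a goal variable became set
        elif was_true and not now_true:
            r -= 1.0        # regress: a goal variable became unset
    return r
```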

39 1 Planning state space For probabilistic temporal planning our state description contains [7]: the state’s absolute time, a queue of impending events, the status of each task, the truth value of each condition, and the available resources. [sent-82, score-0.736]

40 In a particular state, only a subset of the eligible tasks will satisfy all preconditions for execution. [sent-83, score-0.604]

41 When a decision to start a fixed-duration task is made, an end-task event is added to a time-ordered event queue. [sent-85, score-0.538]

42 The event queue holds a list of events that the planner is committed to, although the outcome of those events may be uncertain. [sent-86, score-0.483]

43 The state update then proceeds to process events until there is at least one task that is eligible to begin. [sent-91, score-0.601]

44 Future states are only generated at points where tasks can be started. [sent-95, score-0.258]

45 Thus, if an event outcome is processed and no tasks are enabled, the search recurses to the next event in the queue. [sent-96, score-0.446]
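
The state update described in this section (a time-ordered event queue that is processed until at least one task becomes eligible, with time simply advancing when nothing is pending) might look roughly like the sketch below, reusing the hypothetical Task class from the earlier sketch; heapq and the deadline guard are implementation assumptions.

```python
import heapq
import itertools

class PlannerState:
    """Minimal sketch of the planning state: absolute time, a time-ordered queue
    of impending events, condition truth values, and remaining resources."""
    def __init__(self, tasks, conditions, resources, deadline=1000):
        self.time = 0
        self.deadline = deadline
        self._tie = itertools.count()   # tie-breaker so the heap never compares tasks
        self.event_queue = []           # heap of (end_time, tie, task)
        self.tasks = tasks
        self.conditions = dict(conditions)
        self.resources = dict(resources)

    def start(self, task):
        """Commit to a fixed-duration task: consume resources, queue its end-task event."""
        for r, amt in task.resource_cost.items():
            self.resources[r] -= amt
        heapq.heappush(self.event_queue, (self.time + task.duration, next(self._tie), task))

    def eligible_tasks(self):
        return [t for t in self.tasks if t.eligible(self.conditions, self.resources)]

    def advance(self):
        """Process events (or let time pass) until some task is eligible or time runs out."""
        while not self.eligible_tasks() and self.time < self.deadline:
            if self.event_queue:
                self.time, _, task = heapq.heappop(self.event_queue)
                outcome = task.sample_outcome()   # uncertain outcome applied at task end
                self.conditions.update(outcome.effects)
            else:
                self.time += 1                    # nothing pending: time passes
```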

46 4 Factored Policy-Gradient We assume the presence of policy agents, parameterised with independent sets of parameters for each agent: θ = {θ1, . . . , θN}. [sent-97, score-0.481]

47 We seek to adjust the parameters of the policy to maximise the long-term average reward η(θ). [sent-101, score-0.483]

48 The GPOMDP algorithm [3] estimates the gradient ∇η(θ) of the long-term average reward with respect to the current set of policy parameters. [sent-102, score-0.475]

49 Alg. 2 (Gradient Estimator): 1: Set s0 to the initial state, t = 0, e_t = [0]; 2: while t < T do; 3: e_t = β e_{t−1}; 4: Generate observation o_t of s_t; 5: for each eligible task n do; 6: Sample a_tn = Yes or a_tn = No; 7: e_t = e_t + ∇ log Pr[a_tn | o_t, θ_n]; 8: end for; 9: Try action a_t = {a_t1, a_t2, . . . }. [sent-118, score-1.473]

50 Once a gradient estimate ∇η(θ) is computed over T simulation steps, we maximise the long-term average reward with the gradient ascent θ ← θ + α ∇η(θ), where α is a small step size. [sent-122, score-0.323]
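
A rough Python rendering of Alg. 2 and the subsequent ascent step is given below. It assumes each agent exposes sample and grad_log_prob methods and that the environment provides reset/observe/eligible_agents/step helpers; these names are assumptions for the sketch, not the FPG-Planner's actual interface.

```python
import numpy as np

def gpomdp_estimate(env, agents, theta, T=10000, beta=0.95):
    """Estimate the gradient of the long-term average reward w.r.t. theta.
    theta: dict of agent index -> parameter vector."""
    e = {n: np.zeros_like(th) for n, th in theta.items()}   # eligibility traces
    g = {n: np.zeros_like(th) for n, th in theta.items()}   # gradient estimates
    state = env.reset()
    for t in range(T):
        for n in e:
            e[n] *= beta                                    # decay the traces
        obs = env.observe(state)
        action = {}
        for n in env.eligible_agents(state):
            a = agents[n].sample(obs, theta[n])             # 'Yes' or 'No'
            e[n] += agents[n].grad_log_prob(a, obs, theta[n])
            action[n] = a
        state, reward = env.step(state, action)             # auto-resets at goal/failure states
        for n in g:
            g[n] += reward * e[n] / T                       # average of r_t * e_t over the run
    return g

def ascend(theta, grad, alpha=1e-3):
    """One gradient-ascent step on the long-term average reward: theta <- theta + alpha * grad."""
    return {n: th + alpha * grad[n] for n, th in theta.items()}
```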

51 We do not guarantee that the best representable policy is found, but our experiments have produced policies comparable to global methods like real-time dynamic programming [8]. [sent-124, score-0.469]

52 (6) The agents receive the global reward for the new state and action, and update their gradient estimates. [sent-127, score-0.698]

53 Each vector action a_t is a combination of independent ‘Yes’ or ‘No’ choices made by each eligible agent. [sent-129, score-0.434]

54 Each agent is parameterised by an independent set of parameters that make up θ ∈ R^p: θ1, θ2, . . . , θN. [sent-130, score-0.271]

55 If a_tn represents the binary decision made by agent n at time t about whether to start its corresponding task, then the policy factors into Pr[a_t | o_t, θ] = Pr[a_t1, . . . , a_tN | o_t, θ] = Π_{n=1}^{N} Pr[a_tn | o_t, θ_n]. [sent-134, score-0.737]
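
The factorisation can be sketched directly: the joint probability of the vector action is the product of each eligible agent's Yes/No probability, so its log splits into a sum of per-agent terms, which is what lets each agent update only its own parameters. The helper names are illustrative.

```python
import numpy as np

def factored_action(agents, theta, obs, eligible):
    """Sample a vector action a_t = {a_t1, ..., a_tN} and its log-probability,
    one independent Yes/No choice per eligible agent."""
    action, logp = {}, 0.0
    for n in eligible:
        p_yes = agents[n].prob_yes(obs, theta[n])
        a = 'Yes' if np.random.rand() < p_yes else 'No'
        action[n] = a
        logp += np.log(p_yes if a == 'Yes' else 1.0 - p_yes)
    # log Pr[a_t | o_t, theta] = sum_n log Pr[a_tn | o_t, theta_n]
    return action, logp
```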

56 It is not necessary for all agents to receive the same observation, and it may be advantageous to show different agents different parts of the state, leading to a decentralised planning algorithm. [sent-141, score-0.667]

57 The main requirement for each policy-agent is that log Pr[a_tn | o_t, θ_n] be differentiable with respect to the parameters for each choice a_tn = ‘Yes’ or ‘No’ of whether to start the task. [sent-145, score-0.403]

58 1 Linear approximator agents One representation of agents is a linear network mapped into probabilities using a logistic regression function. [sent-148, score-0.496]

59–60 [Figure: each task agent maps the observation o_t (current state time, conditions, eligible tasks, task status, resources, event queue) through its linear network θ_n to Pr[Yes | o_t, θ_n] and Pr[No | o_t, θ_n].] [sent-149, score-0.583] [sent-153, score-0.338]

61 After some experimentation we chose an observation vector that is a binary description of the eligible tasks and the state variable truth values plus a constant 1 bit to provide bias to the agents’ linear networks. [sent-163, score-0.755]
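
One way to realise the linear-network agent over this binary observation (eligible-task bits, condition truth values, and a constant bias bit) is plain logistic regression. A minimal sketch, assuming the standard logistic squashing function and matching the interface used in the gradient-estimator sketch above:

```python
import numpy as np

class LinearAgent:
    """One agent per task: a linear network over the observation, squashed to
    Pr[Yes | o_t, theta_n] by the logistic function."""
    def prob_yes(self, obs, theta_n):
        return 1.0 / (1.0 + np.exp(-theta_n @ obs))

    def sample(self, obs, theta_n):
        return 'Yes' if np.random.rand() < self.prob_yes(obs, theta_n) else 'No'

    def grad_log_prob(self, action, obs, theta_n):
        """d/dtheta_n log Pr[a | o, theta_n] for the logistic policy."""
        p = self.prob_yes(obs, theta_n)
        return obs * ((1.0 - p) if action == 'Yes' else -p)

def make_observation(eligible_bits, condition_bits):
    """Binary observation: eligible tasks, condition truth values, plus a constant bias bit."""
    return np.concatenate([eligible_bits, condition_bits, [1.0]]).astype(float)
```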

62 2 Decision tree agents Often we have a selection of potential control rules. [sent-165, score-0.385]

63 A decision tree can represent all such control rules at the leaves. [sent-166, score-0.315]

64 An action a is selected by starting at the root node and following a path down the tree, visiting a set of decision nodes D. [sent-168, score-0.345]

65 For multi-agent domains, such as our formulation of planning, we have a decision tree for each task agent. [sent-174, score-0.373]

66 Nodes A, D, F, H represent hard-coded rules that switch with probability one between the Yes and No branches, based on a boolean observation that gives the truth of the statement in the node for the current state. [sent-177, score-0.299]

67 Parameter adjustments have the simple effect of pruning parts of the tree that represent poor policies, leaving the hard-coded rules to choose the best action given the observation. [sent-180, score-0.348]

68 The policy encoded by the parameter is written in the node label. [sent-181, score-0.29]

69 For example for task agent n, and decision node C “task duration matters? [sent-182, score-0.526]

70 We would be following the policy: if the task IS eligible, and the success probability of this task does NOT matter, and the duration of this task DOES matter, and this task IS fast, then start; otherwise do not start. [sent-185, score-0.487]

71 Apart from making it easy to interpret the optimised decision tree as a set of (possibly stochastic) if-then rules, this representation can also encode highly expressive policies with only a few parameters. [sent-186, score-0.539]
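
A sketch of the decision-tree representation: internal nodes are either hard-coded tests that follow one branch with probability one, or parameterised switches whose branch probability is a logistic function of a single parameter, so that gradient updates effectively prune branches. The node layout and names below are invented for illustration and do not reproduce the tree in the paper's figure.

```python
import math
import random

class HardNode:
    """Hard-coded rule: follow the Yes/No branch given by a boolean observation bit."""
    def __init__(self, obs_key, yes_branch, no_branch):
        self.obs_key, self.yes, self.no = obs_key, yes_branch, no_branch
    def decide(self, obs, params):
        return (self.yes if obs[self.obs_key] else self.no).decide(obs, params)

class ParamNode:
    """Stochastic switch with one parameter; learning pushes its probability toward
    0 or 1, pruning whichever subtree encodes the poorer policy."""
    def __init__(self, name, yes_branch, no_branch):
        self.name, self.yes, self.no = name, yes_branch, no_branch
    def decide(self, obs, params):
        p_yes = 1.0 / (1.0 + math.exp(-params[self.name]))
        return (self.yes if random.random() < p_yes else self.no).decide(obs, params)

class Leaf:
    """Leaf policy: a final Yes/No answer to 'start this task now?'."""
    def __init__(self, start):
        self.start = start
    def decide(self, obs, params):
        return 'Yes' if self.start else 'No'

# Hypothetical tree for one task agent: never start an ineligible task, and let a
# learned switch decide whether the task's duration should influence the choice.
tree = HardNode('task_eligible',
                yes_branch=ParamNode('duration_matters',
                                     yes_branch=HardNode('task_is_fast', Leaf(True), Leaf(False)),
                                     no_branch=Leaf(True)),
                no_branch=Leaf(False))

choice = tree.decide({'task_eligible': True, 'task_is_fast': True},
                     params={'duration_matters': 0.7})
```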

72 Line 8 computes the log gradient of the sampled action probability and adds the gradient for the n’th agent’s parameters into the eligibility trace. [sent-198, score-0.34]

73 The gradient for parameters not relating to agent n is 0. [sent-199, score-0.269]

74 If all eligible agents decide not to start their tasks, we issue a null-action. [sent-201, score-0.61]

75 If the state event queue is not empty, we process the next event, otherwise time is incremented by 1 to ensure all possible policies will eventually reach a reset state. [sent-202, score-0.57]

76 1 Experiments Comparison with previous work We compare the FPG-Planner with our earlier RTDP-based planner for military operations [7], which is based on real-time dynamic programming [8]. [sent-204, score-0.340]

77 The domains come from the Australian Defence Science and Technology Organisation, and represent military operations planning scenarios. [sent-205, score-0.418]

78 There are two problems, the first with 18 tasks and 12 conditions, and the second with 41 tasks and 51 conditions. [sent-206, score-0.414]

79 FPG-Planning with a linear approximator significantly shortens the duration of plans, without increasing the failure rate. [sent-259, score-0.294]

80 The very simple decision tree performs less well than the linear approximator, but better than the dynamic programming algorithm. [sent-260, score-0.346]

81 The shorter duration for the Webber decision tree is probably due to the slightly higher failure rate. [sent-262, score-0.5]

82 Table 1 assumes that the observation vector o presented to linear agents is a binary description of the eligible tasks and the condition truth values plus a constant 1 bit to provide bias to the agents’ linear networks. [sent-264, score-0.845]

83 Table 2 shows that giving the agents less information in the observation harms performance. [sent-265, score-0.262]

84 2 Large artificial domains Each scenario consists of N tasks and C state variables. [sent-267, score-0.417]

85 The goal state of the synthetic scenarios requires 90% of the state variables, chosen during scenario synthesis, to be true. [sent-268, score-0.291]

86 All synthetic scenarios are guaranteed to have at least one policy which will reach the operation goal assuming all tasks succeed. [sent-271, score-0.49]

87 Even a few tens of tasks and conditions can generate a state space too large for main memory. [sent-272, score-0.365]

88 Although the number of tasks and conditions is similar to the Webber problem described above, these problems present significantly more choices to the planner, making planning nontrivial. [sent-274, score-0.446]

89 Unlike the initial experiments, all tasks can be repeated as often as necessary so the overall probability of failure depends on how well the planner chooses and orders tasks to avoid running out of time and resources. [sent-275, score-0.724]

90 Thus, to demonstrate that FPG-Planning is having some effect, we compared the optimised policies to two simple policies. [sent-277, score-0.265]

91 The random policy starts each eligible task with probability 0. [sent-278, score-0.648]

92 The results for the decision tree illustrated in Fig. [sent-283, score-0.274]

93 The “Dumb” row is a decision stub, with one parameter per task that simply learns whether to start when eligible. [sent-286, score-0.259]

94 This tests the sensitivity of the tree to node ordering. [sent-289, score-0.251]

95 For example, when node E is swapped with B, the resultant policies use fewer resources. [sent-291, score-0.267]

96 Linear network agents (20,200 parameters) optimised for 14 hours, before terminating with small gradients, and resulted in a plan with 20. [sent-296, score-0.352]

97 The decision tree agent (800 parameters) optimised for 6 hours before terminating with a 1. [sent-298, score-0.505]

98 The smaller number of parameters and the a priori policies embedded in the tree allow the decision tree to perform well in very large domains. [sent-300, score-0.494]

99 Inspection of the resulting parameters demonstrated that different tasks pruned different regions of the decision tree. [sent-301, score-0.343]

100 Policy invariance under reward transformations: Theory and application to reward shaping. [sent-365, score-0.364]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eligible', 0.339), ('ot', 0.245), ('planning', 0.239), ('atn', 0.214), ('agents', 0.214), ('policy', 0.21), ('tasks', 0.207), ('policies', 0.187), ('reward', 0.182), ('planner', 0.175), ('tree', 0.171), ('pr', 0.17), ('agent', 0.153), ('dur', 0.136), ('failure', 0.135), ('state', 0.124), ('yes', 0.119), ('gpomdp', 0.117), ('decision', 0.103), ('task', 0.099), ('action', 0.095), ('event', 0.094), ('duration', 0.091), ('domains', 0.086), ('parameterised', 0.085), ('queue', 0.085), ('gradient', 0.083), ('node', 0.08), ('optimised', 0.078), ('rtdp', 0.078), ('webber', 0.078), ('res', 0.077), ('optimisation', 0.072), ('factored', 0.068), ('approximator', 0.068), ('planners', 0.068), ('plan', 0.06), ('maximise', 0.058), ('minimises', 0.058), ('ndsuccessor', 0.058), ('preconditions', 0.058), ('prottle', 0.058), ('thi', 0.058), ('military', 0.058), ('resources', 0.057), ('start', 0.057), ('branch', 0.055), ('plans', 0.054), ('resource', 0.052), ('branches', 0.052), ('states', 0.051), ('outcome', 0.051), ('aberdeen', 0.051), ('peshkin', 0.051), ('reset', 0.05), ('st', 0.049), ('observation', 0.048), ('temporal', 0.047), ('status', 0.046), ('eligibility', 0.046), ('fail', 0.045), ('australian', 0.043), ('goal', 0.043), ('rules', 0.041), ('coded', 0.041), ('actions', 0.04), ('dynamic', 0.04), ('outcomes', 0.039), ('events', 0.039), ('assault', 0.039), ('baux', 0.039), ('conds', 0.039), ('cptp', 0.039), ('defence', 0.039), ('dumb', 0.039), ('durative', 0.039), ('icaps', 0.039), ('sylvie', 0.039), ('tempastic', 0.039), ('douglas', 0.039), ('end', 0.038), ('truth', 0.037), ('australia', 0.037), ('nodes', 0.036), ('operations', 0.035), ('tens', 0.034), ('reaching', 0.034), ('optimising', 0.034), ('maximising', 0.034), ('concurrent', 0.034), ('probabilistic', 0.034), ('parameters', 0.033), ('et', 0.033), ('programming', 0.032), ('starting', 0.031), ('explosion', 0.031), ('durations', 0.031), ('maximises', 0.031), ('return', 0.031), ('reach', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

2 0.28619727 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

3 0.2715089 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

4 0.23925412 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

5 0.16754018 36 nips-2005-Bayesian models of human action understanding

Author: Chris Baker, Rebecca Saxe, Joshua B. Tenenbaum

Abstract: We present a Bayesian framework for explaining how people reason about and predict the actions of an intentional agent, based on observing its behavior. Action-understanding is cast as a problem of inverting a probabilistic generative model, which assumes that agents tend to act rationally in order to achieve their goals given the constraints of their environment. Working in a simple sprite-world domain, we show how this model can be used to infer the goal of an agent and predict how the agent will act in novel situations or when environmental constraints change. The model provides a qualitative account of several kinds of inferences that preverbal infants have been shown to perform, and also fits quantitative predictions that adult observers make in a new experiment.

6 0.16752176 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

7 0.15745306 144 nips-2005-Off-policy Learning with Options and Recognizers

8 0.13705401 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

9 0.13477251 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

10 0.1333099 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

11 0.12154437 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

12 0.11973672 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

13 0.11152304 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

14 0.089822419 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

15 0.082910739 53 nips-2005-Cyclic Equilibria in Markov Games

16 0.078924417 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

17 0.0767859 148 nips-2005-Online Discovery and Learning of Predictive State Representations

18 0.075465038 125 nips-2005-Message passing for task redistribution on sparse graphs

19 0.074006386 64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

20 0.070040256 26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.201), (1, -0.027), (2, 0.487), (3, -0.123), (4, -0.061), (5, 0.0), (6, -0.009), (7, 0.048), (8, 0.063), (9, 0.018), (10, -0.001), (11, -0.106), (12, 0.033), (13, 0.044), (14, -0.025), (15, 0.009), (16, -0.005), (17, -0.059), (18, 0.017), (19, 0.009), (20, 0.033), (21, 0.012), (22, 0.04), (23, -0.034), (24, -0.015), (25, 0.016), (26, -0.091), (27, -0.072), (28, 0.054), (29, 0.011), (30, 0.08), (31, -0.02), (32, 0.06), (33, 0.036), (34, -0.036), (35, 0.003), (36, -0.075), (37, -0.006), (38, -0.017), (39, 0.053), (40, 0.065), (41, -0.012), (42, 0.008), (43, 0.034), (44, 0.023), (45, 0.035), (46, 0.001), (47, 0.032), (48, 0.005), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97663724 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

2 0.87319487 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

3 0.81871498 87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

Author: Deepak Verma, Rajesh P. Rao

Abstract: Humans are extremely adept at learning new skills by imitating the actions of others. A progression of imitative abilities has been observed in children, ranging from imitation of simple body movements to goalbased imitation based on inferring intent. In this paper, we show that the problem of goal-based imitation can be formulated as one of inferring goals and selecting actions using a learned probabilistic graphical model of the environment. We first describe algorithms for planning actions to achieve a goal state using probabilistic inference. We then describe how planning can be used to bootstrap the learning of goal-dependent policies by utilizing feedback from the environment. The resulting graphical model is then shown to be powerful enough to allow goal-based imitation. Using a simple maze navigation task, we illustrate how an agent can infer the goals of an observed teacher and imitate the teacher even when the goals are uncertain and the demonstration is incomplete.

4 0.7551201 78 nips-2005-From Weighted Classification to Policy Search

Author: Doron Blatt, Alfred O. Hero

Abstract: This paper proposes an algorithm to convert a T -stage stochastic decision problem with a continuous state space to a sequence of supervised learning problems. The optimization problem associated with the trajectory tree and random trajectory methods of Kearns, Mansour, and Ng, 2000, is solved using the Gauss-Seidel method. The algorithm breaks a multistage reinforcement learning problem into a sequence of single-stage reinforcement learning subproblems, each of which is solved via an exact reduction to a weighted-classification problem that can be solved using off-the-self methods. Thus the algorithm converts a reinforcement learning problem into simpler supervised learning subproblems. It is shown that the method converges in a finite number of steps to a solution that cannot be further improved by componentwise optimization. The implication of the proposed algorithm is that a plethora of classification methods can be applied to find policies in the reinforcement learning problem. 1

5 0.64039207 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

Author: Yaakov Engel, Peter Szabo, Dmitry Volkinshtein

Abstract: The Octopus arm is a highly versatile and complex limb. How the Octopus controls such a hyper-redundant arm (not to mention eight of them!) is as yet unknown. Robotic arms based on the same mechanical principles may render present day robotic arms obsolete. In this paper, we tackle this control problem using an online reinforcement learning algorithm, based on a Bayesian approach to policy evaluation known as Gaussian process temporal difference (GPTD) learning. Our substitute for the real arm is a computer simulation of a 2-dimensional model of an Octopus arm. Even with the simplifications inherent to this model, the state space we face is a high-dimensional one. We apply a GPTDbased algorithm to this domain, and demonstrate its operation on several learning tasks of varying degrees of difficulty. 1

6 0.61902332 144 nips-2005-Off-policy Learning with Options and Recognizers

7 0.60884213 36 nips-2005-Bayesian models of human action understanding

8 0.6072315 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

9 0.58369482 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

10 0.58046538 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

11 0.53407609 91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

12 0.51941133 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

13 0.43265611 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

14 0.36966476 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

15 0.36574155 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

16 0.34778535 53 nips-2005-Cyclic Equilibria in Markov Games

17 0.34354371 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

18 0.33962205 125 nips-2005-Message passing for task redistribution on sparse graphs

19 0.32243282 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

20 0.31736371 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.035), (10, 0.064), (17, 0.313), (27, 0.074), (31, 0.123), (34, 0.053), (39, 0.019), (55, 0.041), (65, 0.012), (69, 0.052), (73, 0.023), (88, 0.074), (91, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78446722 153 nips-2005-Policy-Gradient Methods for Planning

Author: Douglas Aberdeen

Abstract: Probabilistic temporal planning attempts to find good policies for acting in domains with concurrent durative tasks, multiple uncertain outcomes, and limited resources. These domains are typically modelled as Markov decision problems and solved using dynamic programming methods. This paper demonstrates the application of reinforcement learning — in the form of a policy-gradient method — to these domains. Our emphasis is large domains that are infeasible for dynamic programming. Our approach is to construct simple policies, or agents, for each planning task. The result is a general probabilistic temporal planner, named the Factored Policy-Gradient Planner (FPG-Planner), which can handle hundreds of tasks, optimising for probability of success, duration, and resource use. 1

2 0.48280746 145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

Author: Drew Bagnell, Andrew Y. Ng

Abstract: We consider the scaling of the number of examples necessary to achieve good performance in distributed, cooperative, multi-agent reinforcement learning, as a function of the the number of agents n. We prove a worstcase lower bound showing that algorithms that rely solely on a global reward signal to learn policies confront a fundamental limit: They require a number of real-world examples that scales roughly linearly in the number of agents. For settings of interest with a very large number of agents, this is impractical. We demonstrate, however, that there is a class of algorithms that, by taking advantage of local reward signals in large distributed Markov Decision Processes, are able to ensure good performance with a number of samples that scales as O(log n). This makes them applicable even in settings with a very large number of agents n. 1

3 0.47769895 65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

Author: Martin J. Wainwright

Abstract: Consider the problem of joint parameter estimation and prediction in a Markov random field: i.e., the model parameters are estimated on the basis of an initial set of data, and then the fitted model is used to perform prediction (e.g., smoothing, denoising, interpolation) on a new noisy observation. Working in the computation-limited setting, we analyze a joint method in which the same convex variational relaxation is used to construct an M-estimator for fitting parameters, and to perform approximate marginalization for the prediction step. The key result of this paper is that in the computation-limited setting, using an inconsistent parameter estimator (i.e., an estimator that returns the “wrong” model even in the infinite data limit) is provably beneficial, since the resulting errors can partially compensate for errors made by using an approximate prediction technique. En route to this result, we analyze the asymptotic properties of M-estimators based on convex variational relaxations, and establish a Lipschitz stability property that holds for a broad class of variational methods. We show that joint estimation/prediction based on the reweighted sum-product algorithm substantially outperforms a commonly used heuristic based on ordinary sum-product. 1 Keywords: Markov random fields; variational method; message-passing algorithms; sum-product; belief propagation; parameter estimation; learning. 1

4 0.47313827 108 nips-2005-Layered Dynamic Textures

Author: Antoni B. Chan, Nuno Vasconcelos

Abstract: A dynamic texture is a video model that treats a video as a sample from a spatio-temporal stochastic process, specifically a linear dynamical system. One problem associated with the dynamic texture is that it cannot model video where there are multiple regions of distinct motion. In this work, we introduce the layered dynamic texture model, which addresses this problem. We also introduce a variant of the model, and present the EM algorithm for learning each of the models. Finally, we demonstrate the efficacy of the proposed model for the tasks of segmentation and synthesis of video.

5 0.4673017 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

Author: Jin Yu, Douglas Aberdeen, Nicol N. Schraudolph

Abstract: Reinforcement learning by direct policy gradient estimation is attractive in theory but in practice leads to notoriously ill-behaved optimization problems. We improve its robustness and speed of convergence with stochastic meta-descent, a gain vector adaptation method that employs fast Hessian-vector products. In our experiments the resulting algorithms outperform previously employed online stochastic, offline conjugate, and natural policy gradient methods. 1

6 0.46661019 78 nips-2005-From Weighted Classification to Policy Search

7 0.46511245 204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

8 0.4636327 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

9 0.46092567 36 nips-2005-Bayesian models of human action understanding

10 0.45206827 142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

11 0.45108265 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

12 0.4461951 144 nips-2005-Off-policy Learning with Options and Recognizers

13 0.44112217 111 nips-2005-Learning Influence among Interacting Markov Chains

14 0.44053626 187 nips-2005-Temporal Abstraction in Temporal-difference Networks

15 0.43840799 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

16 0.43724748 199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

17 0.43676496 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

18 0.43661132 45 nips-2005-Conditional Visual Tracking in Kernel Space

19 0.43660137 46 nips-2005-Consensus Propagation

20 0.43557179 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks