nips nips2002 nips2002-134 knowledge-graph by maker-knowledge-mining

134 nips-2002-Learning to Take Concurrent Actions


Source: pdf

Author: Khashayar Rohanimanesh, Sridhar Mahadevan

Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. [sent-5, score-0.774]

2 We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. [sent-6, score-0.74]

3 1 Introduction We investigate a general framework for modeling concurrent actions. [sent-7, score-0.276]

4 The notion of concurrent action is formalized in a general way, to capture both situations where a single agent can execute multiple parallel processes, as well as the multi-agent case where many agents act in parallel. [sent-8, score-0.622]

5 Most previous work on concurrency has focused on parallelizing primitive (unit step) actions. [sent-10, score-0.185]

6 Reiter developed axioms for concurrent planning using the situation calculus framework [4]. [sent-11, score-0.307]

7 Knoblock [3] and Boutilier [1] modify the STRIPS representation of actions to allow for concurrent actions. [sent-12, score-0.6]

8 Prior work in decision-theoretic planning includes work on multi-dimensional vector action spaces [2], and models based on dynamic merging of multiple MDPs [6]. [sent-14, score-0.149]

9 There is also a massive literature on concurrent processes, dynamic logic, and temporal logic. [sent-15, score-0.276]

10 Parts of these lines of research deal with the specification and synthesis of concurrent actions, including probabilistic ones [8]. [sent-16, score-0.276]

11 We provide a detailed analysis of three termination schemes for composing parallel action structures. [sent-19, score-0.626]

12 The three schemes – any, all, and continue – are illustrated in Figure 1. [sent-20, score-0.269]

13 We characterize the class of policies under each scheme. [sent-21, score-0.164]

14 We also theoretically compare the optimality of the concurrent policies under each scheme with that of the typical sequential case. [sent-22, score-0.631]

15 Here, a concurrent action is simply represented as a set of primary actions (hereafter called a multi-action), where each primary action is either a single-step action or a temporally extended action (e.g., modeled as a closed-loop policy over single-step actions [7]). [sent-25, score-1.278]

17 We denote the set of multi-actions that can be executed in a state s by A(s). [sent-28, score-0.176]

18 In practice, this function can capture resource constraints that limit how many actions an agent can execute in parallel. [sent-29, score-0.51]

19 Thus, the transition probability distribution in practice may be defined over a much smaller subset than the power set of primary actions (e.g., in the grid world example in Figure 3, the power set has size > 100, but the set of concurrent actions has size ≈ 10). [sent-30, score-0.446]
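
To make the resource-constraint point concrete, here is a minimal Python sketch (not from the paper; the action classes and counts are illustrative, so they do not match the numbers quoted above) that enumerates a feasible multi-action set by picking exactly one primary action per resource class, with a "nop" standing for leaving that resource idle, and contrasts its size with the full power set of primary actions.

```python
from itertools import product

# Hypothetical primary-action classes; at most one action per class may run
# concurrently (e.g., one navigation action and one action per key process).
ACTION_CLASSES = {
    "navigation": ["Up", "Down", "Left", "Right"],
    "key1": ["pickup-key1", "putback-key1", "key1-nop"],
    "key2": ["pickup-key2", "putback-key2", "key2-nop"],
}

def multi_actions(classes):
    """Enumerate feasible multi-actions: pick exactly one primary action
    from each class (a 'nop' means the corresponding resource stays idle)."""
    names = list(classes)
    return [dict(zip(names, combo)) for combo in product(*(classes[n] for n in names))]

feasible = multi_actions(ACTION_CLASSES)
n_primary = sum(len(v) for v in ACTION_CLASSES.values())
print(len(feasible), "feasible multi-actions vs", 2 ** n_primary, "subsets of primary actions")
```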

21 A principal goal of this paper is to understand how to define decision epochs for concurrent processes, since the primary actions in a multi-action may not terminate at the same time. [sent-36, score-0.92]

22 The event of termination of a multi-action can be defined in many ways. [sent-37, score-0.423]

23 In the Tall termination scheme (Figure 1, middle), the next decision epoch is the earliest time at which all the primary actions within the multi-action currently being executed have terminated. [sent-40, score-1.148]
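
As a minimal sketch, assuming each primary action in the running multi-action exposes a boolean terminated flag (the flag representation is an assumption, not the paper's implementation), the Tany and Tall decision-epoch tests described above reduce to an any/all check.

```python
def t_any(terminated_flags):
    """Tany: a decision epoch occurs as soon as at least one primary
    action in the current multi-action has terminated."""
    return any(terminated_flags)

def t_all(terminated_flags):
    """Tall: a decision epoch occurs only once every primary action
    in the current multi-action has terminated."""
    return all(terminated_flags)

# Example: with actions a1..a4 and only a1 finished, Tany declares a new
# decision epoch while Tall does not.
flags = {"a1": True, "a2": False, "a3": False, "a4": False}
print(t_any(flags.values()), t_all(flags.values()))  # True False
```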

24 A deterministic Markovian (memoryless) policy in CAMs is defined as the mapping π : S → ℘(A). [sent-42, score-0.219]

25 Note that even though the mapping is defined independent of the termination scheme, the behavior of a multi-action policy depends on the termination scheme that is used in the model. [sent-43, score-1.127]

26 To illustrate this, let < π, τ > (called a policy-termination construct) denote the process of executing the multi-action policy π using the termination scheme τ ∈ {Tany , Tall }. [sent-44, score-0.76]

27 To simplify notation, we only use this form whenever we want to explicitly point out what termination scheme is being used for executing the policy π. [sent-45, score-0.747]
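
The following hedged sketch illustrates what executing a policy-termination construct ⟨π, τ⟩ could look like as a rollout loop: at each decision epoch the policy selects a multi-action, which is then executed one primitive step at a time until the chosen termination test fires. The environment interface (reset, start, step, terminated_flags, done) is hypothetical and only meant to show the control flow.

```python
def execute(env, policy, termination_test, gamma=0.99, max_steps=1000):
    """Run the construct <policy, termination_test>: at each decision epoch
    select a multi-action from the policy, execute it one primitive step at a
    time, and declare the next decision epoch when the termination test fires.
    Returns the discounted return accumulated from the start state."""
    s = env.reset()
    G, discount, t = 0.0, 1.0, 0
    while not env.done() and t < max_steps:
        multi_action = policy(s)              # set of primary actions
        env.start(multi_action)               # begin executing them in parallel
        while True:
            s, r = env.step()                 # one primitive time step
            G += discount * r
            discount *= gamma
            t += 1
            if termination_test(env.terminated_flags()) or env.done() or t >= max_steps:
                break                         # next decision epoch reached
    return G
```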

28 For a given Markovian policy, we can write the value of that policy in an arbitrary state given the termination mechanism used in the model. [sent-46, score-0.69]

29 Let Θ(π, st , τ ) denote the event of initiating the multi-action π(st ) at time t and terminating it according to the τ ∈ {Tany , Tall } termination scheme. [sent-47, score-0.782]

30 Also let π ∗τ denote the optimal multi-action policy within the space of policies over multi-actions that terminate according to the τ ∈ {Tany , Tall } termination scheme. [sent-48, score-0.973]

31 To simplify notation, we may alternatively use ∗τ to denote optimality with respect to the τ termination scheme. [sent-49, score-0.475]

32 The optimal multi-action value function can then be written as V ∗τ (st ) = E{ rt+1 + γ rt+2 + · · · + γ k−1 rt+k + γ k maxa∈A(st+k ) Q∗τ (st+k , a) | Θ(π ∗τ , st , τ ) }, where Q∗τ (st+k , a) denotes the multi-action value of executing a in state st+k (terminated using τ ) and following the optimal policy π ∗τ thereafter. [sent-53, score-0.643]
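
A minimal sketch of the backup target implied by this equation: the k rewards collected while the multi-action runs are discounted step by step, and the value of the state reached at the next decision epoch is bootstrapped with a max over the multi-actions admissible there. The dictionary-based Q table and the admissible function are assumptions for illustration.

```python
def smdp_backup_target(rewards, next_state, Q, admissible, gamma=0.99):
    """Compute r_{t+1} + gamma*r_{t+2} + ... + gamma^{k-1}*r_{t+k}
    + gamma^k * max_{a in A(s_{t+k})} Q(s_{t+k}, a) for a multi-action
    that ran for k = len(rewards) primitive steps."""
    target, discount = 0.0, 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    target += discount * max(Q[(next_state, a)] for a in admissible(next_state))
    return target

# Tiny illustrative usage with made-up values.
Q = {("s'", "a1"): 1.0, ("s'", "a2"): 2.0}
print(smdp_backup_target([0.5, 0.5], "s'", Q, lambda s: ["a1", "a2"]))
```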

33 The continue policy is defined as the mapping πcont : S × H → ℘(A) in which H is a set of continue-sets ht . [sent-55, score-0.666]

34 Note that the value function for the continue policy should be defined over both the state st and the continue-set ht (represented by the pair ⟨st , ht ⟩). [sent-56, score-1.523]

35 Let the function A(st , ht ) return the set of multi-actions that can be executed in state st that include the continuing primary actions in ht . [sent-59, score-1.413]

36 Then the continue policy is formally defined as: πcont (⟨st , ht ⟩) = arg maxa∈A(st ,ht ) Qπcont (⟨st , ht ⟩, a). [sent-60, score-1.475]
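
A hedged sketch of this selection rule, assuming multi-actions and continue-sets are represented as frozensets of primary-action names and that A(st , ht ) is non-empty: the argmax is restricted to multi-actions in A(st ) that contain every still-running primary action in ht . The function and table names are hypothetical.

```python
def continue_policy(state, continue_set, Q, admissible):
    """pi_cont(<s_t, h_t>): among the multi-actions executable in state s_t
    that contain every still-running primary action in h_t, pick the one with
    the highest Q-value. Multi-actions and continue-sets are frozensets."""
    candidates = [a for a in admissible(state) if continue_set <= a]
    return max(candidates, key=lambda a: Q.get((state, continue_set, a), 0.0))

# Example mirroring the text: a1 has terminated, a2-a4 are still running, so
# only multi-actions containing {a2, a3, a4} are considered.
h = frozenset({"a2", "a3", "a4"})
A = lambda s: [frozenset({"a1", "a2", "a3", "a4"}), h | {"a5"}, frozenset({"a1"})]
print(continue_policy("s_tk", h, {}, A))
```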

37 To illustrate this, assume that the current state is st and the multi-action at = {a1 , a2 , a3 , a4 } is executed in state st . [sent-61, score-0.777]

38 Also, assume that the primary action a1 is the first action that terminates after k steps in state st+k . [sent-62, score-0.472]

39 According to the definition of the continue termination scheme (which terminates based on Tany ), the multi-action at is terminated at time t + k and we need to select a new multi-action to execute in state st+k (with the continue-set ht+k = {a2 , a3 , a4 }). [sent-63, score-0.956]

40 The continue policy will select the best multi-action at+k that includes the primary actions {a2 , a3 , a4 }, since they did not terminate in state st+k (see Figure 1, right). [sent-64, score-1.021]

41 3 Theoretical Results In this section we present some of our theoretical results comparing the optimality of various policies under different termination schemes introduced in the previous section. [sent-65, score-0.691]

42 Note that in Theorems 1 and 3, which compare the continue policy with the π ∗any and π ∗all policies, the value function is written over the pair ⟨st , ht ⟩ to be consistent with the definition of the continue policy. [sent-68, score-0.898]

43 This does not influence the original definition of the value function for the optimal policies under the Tany and Tall termination schemes, since they are independent of the continue-set ht . [sent-69, score-0.621]

44 First, we compare the optimal multi-action policies based on the Tany termination scheme and the continue policy. [sent-70, score-0.902]

45 Theorem 1: For every state st ∈ S and every continue-set ht ∈ H, V πcont (⟨st , ht ⟩) ≤ V ∗any (⟨st , ht ⟩). [sent-71, score-1.626]

46 The proof follows because the continue policy maximizes over the restricted set A(st , ht ), which is a subset of the larger set A(st ) that is maximized over in the π ∗any case. [sent-74, score-0.243]

47 Next, we show that optimal plans over multi-actions that terminate according to the Tany termination scheme are at least as good as optimal plans over multi-actions that terminate according to the Tall termination scheme: Theorem 2: For every state s ∈ S, V ∗all (s) ≤ V ∗any (s). [sent-75, score-1.342]

48 Proof: The proof is based on the following lemma, which states that if we alter the execution of the optimal multi-action policy based on Tall (i.e., π ∗all ) in such a way that at every decision epoch the next multi-action is still selected from π ∗all but we terminate it based on Tany , then the new policy-termination construct, represented by < ∗all , any >, is better than the π ∗all policy. [sent-76, score-0.332]

50 Intuitively this makes sense: if we interrupt π ∗all (s) when the first primary action ai ∈ a = π ∗all (s) terminates in some future state s′ , then due to the optimality of π ∗all , executing π ∗all (s′ ) is always better than or equal to continuing some other policy, such as the one in progress. [sent-79, score-0.69]

51 Note that the proof is not as simple as in the first theorem, since the two different policies discussed in this theorem (i.e., π ∗any and π ∗all ) are not being executed using the same termination method. [sent-82, score-0.236]

53 Proof: Let V ∗all n,any (s) denote the value of following the optimal π ∗all policy in state s, where for the first n decision epochs we use the Tany termination scheme and for the rest we use the Tall termination scheme. [sent-86, score-1.329]

54 This suggests that if we always terminate a multi-action π ∗all (st ) according to the Tany termination scheme, we achieve a better return; or mathematically, V ∗all (s) ≤ limn→∞ V ∗all n,any (s) = V <∗all ,any> (s). [sent-88, score-0.556]

55 Using Lemma 1, and the optimality of π ∗any in the space of policies with termination scheme according to Tany , it follows that V ∗all (s) ≤ V <∗all ,any> (s) ≤ V ∗any (s). [sent-89, score-0.704]

56 Under π ∗all policies, multi-actions are executed until all of the primary actions of that multi-action terminate. [sent-91, score-0.561]

57 The continue policy, however, may also initiate new useful primary actions in addition to those already running, which may achieve a better return. [sent-92, score-0.444]

58 By induction on n, it can be shown that V ∗all (⟨st , ht ⟩) ≤ V ∗all n,cont (⟨st , ht ⟩) for all n. [sent-94, score-1.052]

59 This suggests that as we increase n, the altered policy behaves more like the continue policy, and thus in the limit we have V ∗all (⟨st , ht ⟩) ≤ limn→∞ V ∗all n,cont (⟨st , ht ⟩) = V πcont (⟨st , ht ⟩), which proves the theorem. [sent-95, score-2.237]

60 Finally, we show that the optimal multi-action policies based on the Tall termination scheme are as good as the case where the agent always executes a single primary action at a time, as is the case in standard SMDPs. [sent-96, score-1.049]

61 Note that this theorem does not state that concurrent plans are always better than sequential ones; it simply says that if, in a given problem, sequential execution of the primary actions is the best policy, then a CAM is able to represent and find that policy. [sent-97, score-1.008]

62 Proof: It suffices to show that sequential policies are within the space of concurrent policies. [sent-99, score-0.502]

63 This holds since a single primary action can be considered as a multi-action containing only one primary action, whose termination is consistent with either of the multi-action termination schemes (i.e., in the sequential case the Tany and Tall termination schemes are the same). [sent-100, score-1.391]

65 It shows how different policies in a concurrent action model using different termination schemes compare to each other in terms of optimality. [sent-104, score-1.061]

66 According to this figure, the optimal multi-action policies based on Tany and Tall , and also the continue multi-action policies, dominate (with respect to the partial ordering relation defined over policies) the optimal policies over the sequential case. [sent-108, score-0.871]

67 Furthermore, policies based on continue multi-actions dominate the optimal multi-action policies based on the Tall termination scheme, while themselves being dominated by the optimal multi-action policies based on the Tany termination scheme. [sent-109, score-1.651]

68 Figure 2: Comparison of policies over multi-actions and sequential primary actions using different termination schemes (the figure orders four classes: multi-action policies using Tany , continue multi-action policies, multi-action policies using Tall , and policies over sequential actions). [sent-110, score-1.973]

69 4 Experimental Results In this section we present experimental results using a grid world task comparing various termination schemes (see Figure 3). [sent-111, score-0.488]

70 Each hallway connects two rooms, and has a door with two locks. [sent-112, score-0.116]

71 An agent has to retrieve two keys and hold both keys at the same time in order to open both locks. [sent-113, score-0.289]

72 The process of picking up keys is modeled as a temporally extended action that takes a different amount of time for each key. [sent-114, score-0.298]

73 Moreover, keys cannot be held indefinitely, since the agent may drop a key occasionally. [sent-115, score-0.272]

74 Therefore the agent needs to find an efficient solution for picking up the keys in parallel with navigation to act optimally. [sent-116, score-0.304]

75 This is an episodic task, in which at the beginning of each episode the agent is placed in a fixed position (upper left corner) and the goal of the agent is to navigate to a fixed goal position (hallway H3). [sent-117, score-0.241]

76 The agent can execute two types of action concurrently: (1) navigation actions, and (2) key actions. [sent-121, score-0.432]

77 Navigation actions include a set of one-step stochastic navigation actions (Up, Left, Down and Right) that move the agent in the corresponding direction with probability 0.9; upon failure the agent moves instead in one of the other three directions, each with probability 1/30. [sent-122, score-0.832]
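
A small sketch of the one-step stochastic navigation dynamics as described (intended direction with probability 0.9, each of the other three directions with probability 1/30); the grid geometry is omitted and the sampling helper is only illustrative.

```python
import random

DIRECTIONS = ["Up", "Down", "Left", "Right"]

def sample_move(intended, rng=random):
    """Return the direction actually taken for a one-step navigation action:
    the intended direction with probability 0.9, otherwise one of the other
    three directions with probability 1/30 each."""
    others = [d for d in DIRECTIONS if d != intended]
    return rng.choices([intended] + others, weights=[0.9, 1/30, 1/30, 1/30])[0]

print(sample_move("Up"))
```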

79 There is also a set of temporally extended actions defined over the one-step navigation actions that transport the agent from within the room to one of the two hallway cells leading out of the room (Figure 4 (left)). [sent-126, score-1.063]

80 Key actions are defined to manipulate each key (get-key, putback-key, pickup-key, etc). [sent-127, score-0.403]

81 Among them pickup-key is a temporally extended action (Figure 4 (right)). [sent-128, score-0.198]

85 [Figure 4 diagram content omitted: state-transition labels for the door, key, and hallway processes.] Figure 4: Left: the policy associated with one of the hallway temporally extended actions. [sent-146, score-0.374]

86 Right: representation of the key pickup actions for each key process. [sent-147, score-0.438]

87 In this example, navigation actions can be executed concurrently with key actions. [sent-148, score-0.587]

88 Actions that manipulate different keys can also be executed concurrently. [sent-149, score-0.225]

89 However, the agent is not allowed to execute more than one navigation action, or more than one key action (from the same key action set) concurrently. [sent-150, score-0.607]

90 In order to properly handle concurrent execution of actions, we have used a factored state space defined by state variables position (104 positions), key1-state (11 states) and key2-state (7 states). [sent-151, score-0.411]
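
A minimal sketch of this factored state representation, using the component sizes quoted in the text; it mainly shows that the joint state space is the product of the factors. The field names are assumptions.

```python
from collections import namedtuple

# Factored state: agent position plus the state of each key process.
State = namedtuple("State", ["position", "key1_state", "key2_state"])

N_POSITIONS, N_KEY1, N_KEY2 = 104, 11, 7   # sizes quoted in the text
print("joint state space size:", N_POSITIONS * N_KEY1 * N_KEY2)  # 104 * 11 * 7 = 8008

s = State(position=0, key1_state=0, key2_state=0)
```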

91 In our previous work we showed that concurrent actions formed an SMDP over primitive actions [5], which turns out to hold for all the termination schemes described above. [sent-152, score-1.498]

92 Thus, we can use SMDP Q-learning to learn concurrent policies under the different termination schemes and compare them with purely sequential policy learning using the same method [7]. [sent-153, score-1.224]
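
A hedged sketch of a tabular SMDP Q-learning update of the kind referred to here: when a multi-action started in state s terminates after k primitive steps (under whichever termination scheme is in force), the accumulated discounted reward plus the discounted max over the next state's admissible multi-actions forms the target. The learning rate, discount, and table layout are assumptions, not the paper's settings.

```python
from collections import defaultdict

def smdp_q_update(Q, s, a, rewards, s_next, admissible, alpha=0.1, gamma=0.99):
    """One SMDP Q-learning update for multi-action a started in s, which ran
    for k = len(rewards) primitive steps before its termination scheme fired."""
    k = len(rewards)
    R = sum(gamma ** i * r for i, r in enumerate(rewards))        # discounted k-step reward
    bootstrap = max(Q[(s_next, b)] for b in admissible(s_next))   # max over A(s_next)
    Q[(s, a)] += alpha * (R + gamma ** k * bootstrap - Q[(s, a)])

# Tiny illustrative usage with made-up states, multi-actions, and rewards.
Q = defaultdict(float)  # Q[(state, multi_action)] -> value
a = frozenset({"Up", "pickup-key1"})
smdp_q_update(Q, "s0", a, [-1, -1, -1], "s1", lambda s: [frozenset({"Right"})])
print(Q[("s0", a)])
```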

93 The agent is penalized by −1 for each primitive action. [sent-155, score-0.199]

94 Figure 5 (left) compares the number of primitive actions taken until success, and Figure 5 (right) shows the median number of decision epochs per trial, where for trial n, it is the median of all trials from 1 to n. [sent-156, score-0.582]

95 As shown in Figure 5 (left), concurrent actions under every termination scheme yield a faster plan than sequential execution. [sent-158, score-1.163]

96 We conjecture that sequential execution and Tall converge faster compared to Tany , due to the frequency with which multi-actions are terminated. [sent-163, score-0.117]

97 This is intuitive since Tall terminates only when all of the primary actions in a multi-action are completed, and hence it involves less interruption compared to learning based on Tany . [sent-165, score-0.532]

98 Right: moving median of number of multi-action level decision epochs taken to the goal. [sent-169, score-0.123]

99 Even though the continue scheme terminates based on Tany , it continues executing primary actions that did not terminate naturally when the first primary action terminates, making it similar to Tall . [sent-171, score-1.285]

100 We can also exploit the hierarchical structure of multi-actions to compile them into an effective policy over primary actions. [sent-173, score-0.341]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('termination', 0.423), ('actions', 0.324), ('tany', 0.309), ('st', 0.283), ('concurrent', 0.276), ('tall', 0.258), ('ht', 0.243), ('policy', 0.219), ('continue', 0.204), ('policies', 0.164), ('cont', 0.157), ('primary', 0.122), ('action', 0.118), ('executed', 0.115), ('agent', 0.113), ('terminate', 0.104), ('keys', 0.088), ('primitive', 0.086), ('hallway', 0.075), ('concurrency', 0.074), ('execute', 0.073), ('navigation', 0.071), ('terminates', 0.066), ('schemes', 0.065), ('scheme', 0.062), ('sequential', 0.062), ('key', 0.057), ('temporally', 0.056), ('epoch', 0.055), ('terminated', 0.055), ('dn', 0.055), ('erent', 0.054), ('seq', 0.049), ('state', 0.048), ('decision', 0.047), ('epochs', 0.047), ('executing', 0.043), ('di', 0.042), ('door', 0.041), ('execution', 0.039), ('plans', 0.039), ('optimality', 0.039), ('room', 0.038), ('smdp', 0.037), ('optimal', 0.034), ('planning', 0.031), ('median', 0.029), ('proof', 0.026), ('ready', 0.025), ('khashayar', 0.025), ('limn', 0.025), ('maxa', 0.025), ('multiaction', 0.025), ('parallelizing', 0.025), ('rohanimanesh', 0.025), ('sridhar', 0.025), ('terminating', 0.025), ('extended', 0.024), ('rt', 0.024), ('theorem', 0.023), ('agents', 0.022), ('amherst', 0.022), ('continuing', 0.022), ('manipulate', 0.022), ('cam', 0.022), ('initiating', 0.022), ('concurrently', 0.02), ('interruption', 0.02), ('parallel', 0.02), ('trial', 0.02), ('corollary', 0.019), ('interrupted', 0.018), ('boutilier', 0.018), ('mdps', 0.017), ('altered', 0.017), ('faster', 0.016), ('dominate', 0.016), ('according', 0.016), ('max', 0.016), ('closed', 0.016), ('nition', 0.016), ('ordering', 0.016), ('compare', 0.015), ('making', 0.015), ('left', 0.015), ('right', 0.015), ('lemma', 0.014), ('naturally', 0.014), ('drop', 0.014), ('dropped', 0.014), ('always', 0.013), ('rest', 0.013), ('return', 0.013), ('ned', 0.013), ('denote', 0.013), ('theorems', 0.013), ('theoretically', 0.013), ('partial', 0.013), ('picking', 0.012), ('de', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 134 nips-2002-Learning to Take Concurrent Actions

Author: Khashayar Rohanimanesh, Sridhar Mahadevan

Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1

2 0.24083254 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

3 0.14684393 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

Author: Christopher G. Atkeson, Jun Morimoto

Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.

4 0.14630994 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.

5 0.13936865 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

Author: Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, Alexander Russell

Abstract: We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.

6 0.12131394 20 nips-2002-Adaptive Caching by Refetching

7 0.11441427 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

8 0.10010564 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

9 0.095400259 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

10 0.087869078 185 nips-2002-Speeding up the Parti-Game Algorithm

11 0.083356515 78 nips-2002-Efficient Learning Equilibrium

12 0.076181568 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

13 0.070723213 205 nips-2002-Value-Directed Compression of POMDPs

14 0.068933167 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

15 0.068777598 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives

16 0.059407204 199 nips-2002-Timing and Partial Observability in the Dopamine System

17 0.052796267 61 nips-2002-Convergent Combinations of Reinforcement Learning with Linear Function Approximation

18 0.049727816 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

19 0.048994988 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA

20 0.045555357 45 nips-2002-Boosted Dyadic Kernel Discriminants


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.104), (1, -0.009), (2, -0.313), (3, -0.102), (4, 0.016), (5, -0.123), (6, 0.068), (7, -0.088), (8, 0.037), (9, -0.008), (10, -0.002), (11, 0.047), (12, 0.003), (13, 0.014), (14, 0.018), (15, -0.021), (16, 0.026), (17, -0.038), (18, 0.103), (19, 0.045), (20, -0.032), (21, -0.056), (22, -0.148), (23, -0.083), (24, -0.132), (25, 0.028), (26, -0.053), (27, 0.018), (28, -0.053), (29, -0.033), (30, 0.039), (31, -0.021), (32, 0.096), (33, -0.016), (34, 0.037), (35, 0.011), (36, 0.064), (37, -0.023), (38, -0.015), (39, -0.027), (40, -0.075), (41, 0.015), (42, 0.043), (43, 0.154), (44, -0.072), (45, -0.037), (46, -0.023), (47, -0.195), (48, -0.016), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98076886 134 nips-2002-Learning to Take Concurrent Actions

Author: Khashayar Rohanimanesh, Sridhar Mahadevan

Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1

2 0.6982677 3 nips-2002-A Convergent Form of Approximate Policy Iteration

Author: Theodore J. Perkins, Doina Precup

Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.

3 0.67899561 20 nips-2002-Adaptive Caching by Refetching

Author: Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, Ismail Ari

Abstract: We are constructing caching policies that have 13-20% lower miss rates than the best of twelve baseline policies over a large variety of request streams. This represents an improvement of 49–63% over Least Recently Used, the most commonly implemented policy. We achieve this not by designing a specific new policy but by using on-line Machine Learning algorithms to dynamically shift between the standard policies based on their observed miss rates. A thorough experimental evaluation of our techniques is given, as well as a discussion of what makes caching an interesting on-line learning problem.

4 0.66092628 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics

Author: Eric Allender, Sanjeev Arora, Michael Kearns, Cristopher Moore, Alexander Russell

Abstract: We establish a new hardness result that shows that the difficulty of planning in factored Markov decision processes is representational rather than just computational. More precisely, we give a fixed family of factored MDPs with linear rewards whose optimal policies and value functions simply cannot be represented succinctly in any standard parametric form. Previous hardness results indicated that computing good policies from the MDP parameters was difficult, but left open the possibility of succinct function approximation for any fixed factored MDP. Our result applies even to policies which yield a polynomially poor approximation to the optimal value, and highlights interesting connections with the complexity class of Arthur-Merlin games.

5 0.65307707 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions

Author: Michail G. Lagoudakis, Ronald Parr

Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.

6 0.56098545 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming

7 0.50346929 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach

8 0.48125422 185 nips-2002-Speeding up the Parti-Game Algorithm

9 0.42512763 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations

10 0.40696537 205 nips-2002-Value-Directed Compression of POMDPs

11 0.36642703 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

12 0.32819515 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

13 0.30838978 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

14 0.3006658 78 nips-2002-Efficient Learning Equilibrium

15 0.28645223 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives

16 0.25567278 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA

17 0.25440225 142 nips-2002-Maximum Likelihood and the Information Bottleneck

18 0.20225595 199 nips-2002-Timing and Partial Observability in the Dopamine System

19 0.19970518 167 nips-2002-Rational Kernels

20 0.19203509 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.064), (41, 0.449), (42, 0.026), (54, 0.08), (55, 0.022), (67, 0.022), (73, 0.013), (74, 0.074), (92, 0.048), (98, 0.082)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95720017 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations

Author: Daniel J. Navarro, Michael D. Lee

Abstract: unkown-abstract

same-paper 2 0.81879222 134 nips-2002-Learning to Take Concurrent Actions

Author: Khashayar Rohanimanesh, Sridhar Mahadevan

Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1

3 0.69723618 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

Author: Dan Klein, Christopher D. Manning

Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.

4 0.67765117 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion

Author: Leonid Taycher, John Iii, Trevor Darrell

Abstract: Accurate representation of articulated motion is a challenging problem for machine perception. Several successful tracking algorithms have been developed that model human body as an articulated tree. We propose a learning-based method for creating such articulated models from observations of multiple rigid motions. This paper is concerned with recovering topology of the articulated model, when the rigid motion of constituent segments is known. Our approach is based on finding the Maximum Likelihood tree shaped factorization of the joint probability density function (PDF) of rigid segment motions. The topology of graphical model formed from this factorization corresponds to topology of the underlying articulated body. We demonstrate the performance of our algorithm on both synthetic and real motion capture data.

5 0.62918121 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

Author: Patrick J. Wolfe, Simon J. Godsill

Abstract: The Bayesian paradigm provides a natural and effective means of exploiting prior knowledge concerning the time-frequency structure of sound signals such as speech and music—something which has often been overlooked in traditional audio signal processing approaches. Here, after constructing a Bayesian model and prior distributions capable of taking into account the time-frequency characteristics of typical audio waveforms, we apply Markov chain Monte Carlo methods in order to sample from the resultant posterior distribution of interest. We present speech enhancement results which compare favourably in objective terms with standard time-varying filtering techniques (and in several cases yield superior performance, both objectively and subjectively); moreover, in contrast to such methods, our results are obtained without an assumption of prior knowledge of the noise power.

6 0.35040972 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

7 0.3451207 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition

8 0.3377808 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

9 0.32121116 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA

10 0.31946111 147 nips-2002-Monaural Speech Separation

11 0.31820494 137 nips-2002-Location Estimation with a Differential Update Network

12 0.31731561 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

13 0.30984783 3 nips-2002-A Convergent Form of Approximate Policy Iteration

14 0.30912209 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

15 0.30842733 27 nips-2002-An Impossibility Theorem for Clustering

16 0.30774897 74 nips-2002-Dynamic Structure Super-Resolution

17 0.30732626 55 nips-2002-Combining Features for BCI

18 0.30717695 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

19 0.30699143 44 nips-2002-Binary Tuning is Optimal for Neural Rate Coding with High Temporal Resolution

20 0.30679122 10 nips-2002-A Model for Learning Variance Components of Natural Images