nips nips2002 nips2002-185 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. [sent-5, score-0.676]
2 We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains. [sent-6, score-0.158]
3 1 Introduction We recently developed Lifelong Planning A* (LPA*), a search algorithm for deterministic domains that combines incremental and heuristic search to reduce its search time [1]. [sent-7, score-0.595]
4 We believe that the resulting search algorithm, called Minimax LPA*, is the first incremental heuristic minimax search algorithm. [sent-10, score-0.714]
5 We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the popular parti-game algorithm [3]. [sent-11, score-0.158]
6 Our first experiments suggest that this implementation of the parti-game algorithm can be an order of magnitude faster in two-dimensional domains than one with uninformed search from scratch and thus might allow the parti-game algorithm to scale up to larger domains. [sent-12, score-0.484]
7 There also exist other ways of decreasing the amount of search performed by the parti-game algorithm. [sent-13, score-0.12]
8 2 Parti-Game Algorithm The objective of the parti-game algorithm is to move an agent from given start coordinates to given goal coordinates in continuous and potentially high-dimensional domains with obstacles of arbitrary shapes. [sent-15, score-0.553]
9 To solve these problems, one can first discretize the domains and then use conventional search algorithms to determine plans that move the agent to the goal coordinates. [sent-17, score-0.532]
10 The parti-game algorithm solves this dilemma by starting with a coarse discretization and refining it during execution only when and where it is needed (for example, around obstacles), resulting in a nonuniform discretization. [sent-19, score-0.169]
11 Figure 1(a) shows the initial discretization of our example domain into 12 large cells together with the start coordinates of the agent (A) and the goal region (the cell that contains the goal coordinates). [sent-21, score-0.456]
12 Thus, it can always attempt to move towards the center of each adjacent cell (that is, each cell that its current cell shares a border line with). [sent-22, score-0.348]
13 The agent can initially attempt to move towards the centers of three adjacent cells, as shown in the figure. [sent-23, score-0.344]
14 Each state corresponds to a cell and each action corresponds to a movement option. [sent-25, score-0.257]
15 The parti-game algorithm initially ignores obstacles and makes the optimistic (and sometimes wrong) assumption that each action deterministically reaches the intended state, for example, that the agent indeed reaches an adjacent cell if it moves towards the center of that cell from anywhere in its current cell. [sent-26, score-0.8]
16 The cost of an action outcome approximates the Euclidean distance from the center of the old cell of the agent to the center of its new cell. [sent-27, score-0.653]
17 The parti-game algorithm then determines whether the minimax goal distance of the current state of the agent is finite. [sent-30, score-1.013]
18 If so, the parti-game algorithm repeatedly chooses the action that minimizes the worst-case plan-execution cost, until the agent reaches the goal region or observes additional action outcomes. [sent-31, score-0.617]
19 The minimax goal distance of the current state of the agent is finite in our example, and the agent minimizes the worst-case plan-execution cost by moving towards the center of one of two adjacent cells. [sent-32, score-0.762]
20 Assume that it decides to move towards the center of one of these two cells. [sent-33, score-0.15]
21 The agent always continues to move until it either gets blocked by an obstacle or enters a new cell. [sent-34, score-0.362]
22 When the agent observes additional action outcomes, it adds them to the state space. [sent-36, score-0.503]
23 Thus, it now assumes that it can end up in either of two cells if it is somewhere in its current cell and moves towards the center of the intended cell. [sent-37, score-0.173]
24 The same scenario repeats when the agent first attempts to move towards the center of one adjacent cell and then attempts to move towards the center of another, but gets blocked twice by the obstacle. [sent-38, score-0.652]
25 Figure 1(c) shows the state space after the first two attempted moves, and Figure 1(d) shows the state space after the third attempted move. [sent-39, score-0.438]
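To make the state-space bookkeeping concrete, here is a minimal sketch of one possible representation (not the authors' code; the class and method names are hypothetical): cells are states, actions are "move towards the center of an adjacent cell", and the outcome set of each state-action pair starts with the optimistic deterministic assumption and grows as outcomes are observed.

```python
from collections import defaultdict

class PartiGameStateSpace:
    """Hypothetical container for the parti-game state space.

    States are cell identifiers; an action is identified by the adjacent cell
    whose center the agent moves towards. Outcomes of each (state, action)
    pair start out optimistic (the intended cell only) and grow as the agent
    observes additional outcomes.
    """

    def __init__(self):
        # (state, intended_cell) -> set of observed successor states
        self.outcomes = defaultdict(set)

    def add_action(self, state, intended_cell):
        # Optimistic initialization: assume the action deterministically
        # reaches the intended adjacent cell.
        self.outcomes[(state, intended_cell)] = {intended_cell}

    def record_outcome(self, state, intended_cell, observed_cell):
        # Called when the agent ends up in observed_cell (possibly its own
        # cell, if it was blocked by an obstacle) instead of intended_cell.
        self.outcomes[(state, intended_cell)].add(observed_cell)

    def successors(self, state, intended_cell):
        return self.outcomes[(state, intended_cell)]
```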
26 We say that a state is unsolvable if an agent in that state is not guaranteed to reach the goal region with finite plan-execution cost. [sent-41, score-0.378]
27 In this case, the parti-game algorithm refines the discretization by splitting all solvable cells that border unsolvable cells and all unsolvable cells that border solvable cells. [sent-42, score-0.634]
28 The parti-game algorithm then removes from the state space those states (and their actions) that correspond to the old cells and adds states (and actions) for the new cells, again making the optimistic assumption that each action for the new states deterministically reaches the intended state. [sent-46, score-0.707]
29 This ensures that the minimax goal distance of the current state of the agent becomes finite. [sent-47, score-0.388]
30 The parti-game algorithm now repeats the process until either the agent reaches the goal region or the domain cannot be discretized any further because the resolution limit is reached. [sent-49, score-0.387]
31 For example, an agent can end up not only in two possible cells but also in a third one if it moves from somewhere in its current cell towards the center of an adjacent cell. [sent-54, score-0.42]
32 The parti-game algorithm therefore needs to determine the minimax goal distances of the states with a minimax search algorithm. [sent-55, score-0.922]
33 Furthermore, the parti-game algorithm repeatedly determines plans that minimize the worst-case plan-execution cost. The cost of an action outcome is computed from the x and y coordinates of the imaginary grid cells that contain the centers of the new and old state of the agent. [sent-56, score-0.265]
34 Similarly, the heuristic of a state is computed as the maximum of the absolute differences of the x and y coordinates between the imaginary grid cell that contains the center of the state of the agent and the imaginary grid cell that contains the center of the state in question. [sent-57, score-1.026]
35 Furthermore, the imaginary grid is only used to compute the costs and heuristics and does not restrict either the placement of obstacles or the movement of the agent. [sent-59, score-0.169]
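As a small illustration of the cost and heuristic just described, here is a sketch assuming cells are summarized by the (x, y) coordinates of their centers and by the integer coordinates of the imaginary grid cells that contain those centers (the function names are ours, not from the paper):

```python
import math

def action_cost(old_center, new_center):
    # The cost of an action outcome approximates the Euclidean distance
    # between the center of the agent's old cell and the center of its new
    # cell; old_center and new_center are (x, y) tuples.
    return math.hypot(new_center[0] - old_center[0],
                      new_center[1] - old_center[1])

def heuristic(grid_cell, agent_grid_cell):
    # The heuristic of a state is the maximum of the absolute differences of
    # the x and y coordinates between the imaginary grid cell that contains
    # the center of the agent's current state and the imaginary grid cell
    # that contains the center of the state in question.
    return max(abs(grid_cell[0] - agent_grid_cell[0]),
               abs(grid_cell[1] - agent_grid_cell[1]))
```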
36 The pseudocode uses the following functions to manage the priority queue U: [sent-60, score-0.269]
37 U.Top() returns a state with the smallest priority of all states in U. [sent-61, score-0.397]
38 U.TopKey() returns the smallest priority of all states in U. [sent-66, score-0.296]
39 (It does nothing if the current priority of the state already equals the new priority.) [sent-72, score-0.182]
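A minimal sketch of a priority queue U that supports the operations described above; the binary-heap-with-lazy-updates implementation is our assumption (the paper only reports using binary heaps), and the method names simply mirror the pseudocode interface:

```python
import heapq
import itertools

class PriorityQueueU:
    """Priority queue U over states, keyed by a comparable priority ("key")."""

    def __init__(self):
        self._heap = []                    # entries are (key, tie_breaker, state)
        self._key_of = {}                  # current key of each state in U
        self._counter = itertools.count()  # tie-breaker so states are never compared

    def top(self):
        # U.Top(): a state with the smallest priority of all states in U.
        self._discard_stale()
        return self._heap[0][2]

    def top_key(self):
        # U.TopKey(): the smallest priority of all states in U
        # (an "infinite" key if U is empty).
        self._discard_stale()
        return self._heap[0][0] if self._heap else (float('inf'), float('inf'))

    def insert(self, state, key):
        # U.Insert(state, key): insert state with priority key.
        self._key_of[state] = key
        heapq.heappush(self._heap, (key, next(self._counter), state))

    def update(self, state, key):
        # U.Update(state, key): change the priority of state to key.
        # Does nothing if the current priority of state already equals key.
        if self._key_of.get(state) == key:
            return
        self.insert(state, key)            # the old heap entry becomes stale

    def remove(self, state):
        # U.Remove(state): remove state from U (its heap entry becomes stale).
        self._key_of.pop(state, None)

    def pop(self):
        # U.Pop(): remove and return a state with the smallest priority.
        self._discard_stale()
        key, _, state = heapq.heappop(self._heap)
        del self._key_of[state]
        return state

    def _discard_stale(self):
        # Drop heap entries whose key no longer matches the state's current key.
        while self._heap and self._key_of.get(self._heap[0][2]) != self._heap[0][0]:
            heapq.heappop(self._heap)
```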
40 [Pseudocode residue from the ComputePlan() figure, lines 10-16: the state with the smallest key is popped {10}; if it is locally overconsistent {11}, its g-value is updated {12} and UpdateState() is called for the affected states {13}; else it is locally underconsistent {14}, its g-value is updated {15} and UpdateState() is called for the affected states {16}.] [sent-88, score-0.205]
41 Figures 1(b), (c), (d) and (f) show the state spaces for our example directly after the parti-game algorithm has used Minimax LPA* to determine the minimax goal distance of the current state of the agent. [sent-116, score-0.519]
42 All expanded states (that is, all states whose minimax goal distances have been computed) are shown in gray. [sent-117, score-0.569]
43 Minimax LPA* speeds up the searches by reusing information from previous searches, which is the reason why it expands only three states in Figure 1(d). [sent-118, score-0.241]
44 Minimax LPA* also speeds up the searches by using heuristics to focus them, which is the reason why it expands only four states in Figure 1(f). [sent-119, score-0.295]
45 3 Minimax LPA* Minimax LPA* repeatedly determines plans that minimize the worst-case plan-execution cost from the current state of the agent to the goal state as the agent moves towards the goal, in nondeterministic domains where the costs of actions increase or decrease over time. [sent-120, score-0.664]
46 It generalizes two incremental search algorithms, namely our LPA* [1] and DynamicSWSF-FP [7]. [sent-121, score-0.216]
47 The successor set of a state-action pair is the set of states that can result from the execution of that action in that state. [sent-128, score-0.141]
48 The agent incurs the corresponding action cost if the execution of an action in a state results in a particular successor state. [sent-131, score-0.344]
49 The minimax goal distance $gd(s)$ of a state $s$ is defined as the solution of the system of equations: $gd(s) = 0$ if $s = s_{goal}$, and $gd(s) = \min_{a \in A(s)} \max_{s' \in Succ(s,a)} \big( c(s,a,s') + gd(s') \big)$ for all $s$ with $s \neq s_{goal}$. [sent-132, score-0.388]
50 The minimal worst-case plan-execution cost from the current state of the agent to the goal state is the minimax goal distance of the current state. [sent-133, score-0.148]
51 2 Heuristics and Variables Minimax LPA* searches backward from the goal state to the current state of the agent and uses heuristics to focus its search. [sent-143, score-0.149]
52 The g-value of a state estimates its minimax goal distance. [sent-149, score-0.464]
53 It is carried forward from one search to the next one and can be used after each search to determine a plan that minimizes the worst-case plan-execution cost from the current state of the agent to the goal state. [sent-150, score-0.391]
54 The rhs-value of a state also estimates its minimax goal distance. [sent-151, score-0.464]
55 A state is called locally consistent iff its g-value is equal to its rhs-value. [sent-154, score-0.171]
56 Minimax LPA* also maintains a priority queue that always contains exactly the locally inconsistent states (Invariant 2). [sent-155, score-0.467]
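The following sketch shows how Invariants 1-3 might be maintained. It follows the standard LPA* key and update scheme, which we assume carries over to Minimax LPA* with the minimax (minimum over actions, maximum over outcomes) backup; all names are ours rather than the paper's. It assumes the priority queue sketched earlier and dictionaries g and rhs that map every state to a value (both initialized to infinity, with the rhs-value of the goal state set to zero).

```python
INF = float('inf')

def calculate_key(s, g, rhs, h, s_start):
    # Standard LPA*-style key, assumed here for Minimax LPA*: the smaller of
    # the g- and rhs-value of s plus the heuristic (computed with respect to
    # the current state of the agent), with min(g, rhs) as tie-breaker.
    v = min(g[s], rhs[s])
    return (v + h(s, s_start), v)

def update_state(s, g, rhs, U, h, s_start, s_goal, actions, succ, cost):
    # Invariant 1: for every state other than the goal state, the rhs-value is
    # the minimum over its actions of the worst-case (maximum) sum of action
    # cost and g-value of the resulting successor state.
    if s != s_goal:
        rhs[s] = min(
            (max(cost(s, a, s2) + g[s2] for s2 in succ(s, a)) for a in actions(s)),
            default=INF,
        )
    # Invariants 2 and 3: the priority queue U contains exactly the locally
    # inconsistent states (g != rhs), each with its current key.
    U.remove(s)
    if g[s] != rhs[s]:
        U.insert(s, calculate_key(s, g, rhs, h, s_start))
```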
57 The main function Main() first calls Initialize() {18} to set the g-values and rhs-values of all states to infinity {03}. [sent-164, score-0.127]
58 Thus, the goal state, whose rhs-value is set to zero, is the only locally inconsistent state and is inserted into the otherwise empty priority queue {02, 05}. [sent-166, score-0.477]
59 (Note that, in an actual implementation, Minimax LPA* needs to initialize a state only once it encounters it during the search and thus does not need to initialize all states up front.) [sent-167, score-0.382]
60 Then, Minimax LPA* calls ComputePlan() to compute a plan that minimizes the worst-case plan-execution cost from the current state of the agent to the goal state {19}. [sent-169, score-0.14]
61 If the agent has not reached the goal state yet {20}, it executes the first action of the plan {22} and updates the current state of the agent {23}. [sent-170, score-0.499]
62 To maintain Invariants 1, 2, and 3, it calls UpdateState() if some action costs have changed {28} to update the rhs-values and keys of the states potentially affected by the changed action costs as well as their membership in the priority queue if they become locally consistent or inconsistent. [sent-172, score-0.906]
63 It then recalculates the priorities of all states in the priority queue {29-30}. [sent-173, score-0.42]
64 This is necessary because the heuristics change when the agent moves, since they are computed with respect to the current state of the agent. [sent-174, score-0.301]
65 [Pseudocode residue from Figure 4: the current state of the agent is set to the new state after the action execution, and UpdateState() is called for the affected states.] [sent-188, score-0.447]
66 [Figure 4: Parti-game algorithm using Minimax LPA*.] This changes the priorities of the states in the priority queue but not which states are locally inconsistent and thus in the priority queue. [sent-198, score-0.767]
67 Finally, it recalculates a plan {31} and repeats the process. [sent-199, score-0.137]
68 It repeatedly removes the locally inconsistent state with the smallest key from the priority queue {10} and expands it {11-16}. [sent-201, score-0.566]
69 A state is called locally overconsistent iff its g-value is larger than its rhs-value. [sent-203, score-0.21]
70 We can prove that the rhs-value of a locally overconsistent state that is about to be expanded is equal to its minimax goal distance. [sent-204, score-0.597]
71 A state is called locally underconsistent iff its g-value is smaller than its rhs-value. [sent-206, score-0.197]
72 ComputePlan() terminates as soon as the current state of the agent is locally consistent and its key is less than or equal to the keys of all locally inconsistent states. [sent-209, score-0.229]
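Putting these pieces together, here is a minimal sketch of the ComputePlan() loop as we read it from the description above (pop the state with the smallest key, handle the locally overconsistent and underconsistent cases, propagate to the affected states, stop once the termination condition holds). It reuses calculate_key, update_state, and INF from the earlier sketch and is not the authors' pseudocode.

```python
def compute_plan(g, rhs, U, h, s_start, s_goal, actions, succ, cost, pred):
    # pred(s) returns the set of states u that have an action whose outcomes
    # include s, i.e. the states whose rhs-values may depend on g[s].
    while (U.top_key() < calculate_key(s_start, g, rhs, h, s_start)
           or rhs[s_start] != g[s_start]):
        s = U.pop()
        if g[s] > rhs[s]:
            # Locally overconsistent: the rhs-value of the expanded state
            # equals its minimax goal distance, so make it locally consistent.
            g[s] = rhs[s]
            affected = set(pred(s))
        else:
            # Locally underconsistent: reset the g-value to infinity and
            # reconsider the state itself as well.
            g[s] = INF
            affected = set(pred(s)) | {s}
        for u in affected:
            update_state(u, g, rhs, U, h, s_start, s_goal, actions, succ, cost)
```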
73 " 0 0 " 0 " 0 " 0 " ¨§ ¤ ¢ ¡¦¤ £ ¨ ¥§ Theorem 1 ComputePlan of Minimax LPA* expands each state at most twice and thus terminates. [sent-210, score-0.156]
74 Assume that, after ComputePlan() terminates, one starts in the current state of the agent and always executes an action in the current state that minimizes the worst-case sum of action cost and g-value of the successor state, until the goal state is reached (ties can be broken arbitrarily). [sent-211, score-0.26]
75 Then, the plan-execution cost is no larger than the minimax goal distance of the start state. [sent-212, score-0.365]
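For illustration, the greedy action choice described in the corollary above might look as follows (a sketch under the same assumptions and with the same hypothetical helper names as before):

```python
def next_action(s, g, actions, succ, cost):
    # Choose an action that minimizes the worst-case sum of action cost and
    # g-value of the resulting successor state (ties broken arbitrarily).
    return min(actions(s),
               key=lambda a: max(cost(s, a, s2) + g[s2] for s2 in succ(s, a)))
```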
76 We can also prove several additional theorems about the efficiency of Minimax LPA*, including the fact that it only expands those states whose g-values are not already correct [5]. [sent-214, score-0.146]
77 4 Using Minimax LPA* to Implement the Parti-Game Algorithm Figure 4 shows how Minimax LPA* can be used to implement the parti-game algorithm in a more efficient way than with uninformed search from scratch, using some of the functions from Figure 3. [sent-218, score-0.252]
78 If the minimax goal distance of the current state of the agent is infinity, then that state is unsolvable. [sent-220, score-0.388]
79 Otherwise, it repeatedly executes the action that minimizes the worst-case plan-execution cost {26'-27'}. [sent-223, score-0.24]
80 If it observes an unknown action outcome {28'}, then it records it {29'-31'}, ensures that Invariants 1, 2 and 3 continue to hold {32'-34'}, uses ComputePlan() to find a new plan incrementally {35'}, and then continues to execute actions until either the current state of the agent becomes unsolvable or the agent reaches the goal state {24'}. [sent-224, score-0.755]
81 In the former case, it refines the discretization {19'}, uses ComputePlan() to find a new plan from scratch rather than incrementally (because the discretization changes the state space substantially) {20'-22'}, and then repeats the process. [sent-225, score-0.504]
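A high-level sketch of the control loop just described, with the replanner treated as a black box; all interfaces here are hypothetical, and details such as the actual cell splitting are omitted:

```python
def parti_game_loop(s_start, s_goal, state_space, replanner, execute, refine):
    """Sketch of the parti-game control loop around Minimax LPA*.

    state_space: container of observed action outcomes (see the earlier sketch).
    replanner:   hypothetical wrapper around Minimax LPA* offering
                 plan_from_scratch(s), replan_incrementally(),
                 start_is_solvable(), and next_action(s).
    execute:     callable executing an action; returns the cell the agent ends up in.
    refine:      callable that splits cells around the unsolvable region and
                 returns the (possibly new) current state of the agent.
    """
    s = s_start
    replanner.plan_from_scratch(s)
    while s != s_goal:
        if not replanner.start_is_solvable():
            # Refine the discretization and replan from scratch, since the
            # discretization changes the state space substantially.
            s = refine()
            replanner.plan_from_scratch(s)
            continue
        a = replanner.next_action(s)
        s_new = execute(s, a)
        if s_new not in state_space.successors(s, a):
            # Unknown action outcome: record it and replan incrementally.
            state_space.record_outcome(s, a, s_new)
            replanner.replan_incrementally()
        s = s_new
```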
82 All expanded states are shown in gray, and all locally inconsistent states (that is, states in the priority queue) are shown in bold. [sent-230, score-0.586]
83 It happens quite frequently that the current state of the agent is unsolvable and the parti-game algorithm thus has to refine the discretization. [sent-231, score-0.161]
84 If the current state of the agent is unsolvable, Minimax LPA* expands a large number of states because it has to disprove the existence of a plan rather than find one. [sent-232, score-0.221]
85 We speed up Minimax LPA* for the special case where the current state of the agent is unsolvable but every other state is solvable, since this case occurs about half of the time when states become unsolvable. [sent-233, score-0.131]
86 If states other than the current state become unsolvable, some of them need to be predecessors of the current state. [sent-234, score-0.273]
87 To prove that the current state is unsolvable but every other state is solvable, Minimax LPA* can therefore show that all predecessors of the current state are solvable but the current state itself is not. [sent-235, score-0.313]
88 To show that all predecessors of the current state are solvable, Minimax LPA* checks that they are locally consistent and that their keys are no larger than the smallest key in the priority queue. [sent-236, score-0.176]
89 It can also use uninformed search (using the zero heuristic) and informed search (using the heuristic that we used in the context of the example from Figure 1). [sent-241, score-0.443]
90 All of them use binary heaps to implement the priority queue and the same optimizations but the implementations with search from scratch do not contain any code needed only for incremental search. [sent-243, score-0.644]
91 In each case, the goal coordinates are in the center of the terrain, and the start coordinates are in the vertical center and ten percent to the right of the left edge. [sent-245, score-0.276]
92 We also report the average of the ratios of the three measures between each of the four implementations and the one with incremental heuristic search (which is different from the ratio of the averages), together with their 95-percent confidence intervals. [sent-246, score-0.321]
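To illustrate the distinction in parentheses with made-up numbers (not from the paper): for two mazes, the average of the per-maze ratios generally differs from the ratio of the averages.

```python
a = [10.0, 100.0]  # hypothetical measure for one implementation on two mazes
b = [5.0, 20.0]    # hypothetical measure for the incremental heuristic search implementation
avg_of_ratios = sum(x / y for x, y in zip(a, b)) / len(a)   # (2.0 + 5.0) / 2 = 3.5
ratio_of_averages = (sum(a) / len(a)) / (sum(b) / len(b))   # 55.0 / 12.5 = 4.4
```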
93 The average number of searches, measured by calls to ComputePlan(), is 29,885 until the agent reaches the goal region. [sent-274, score-0.42]
94 The table shows that the search times of the parti-game algorithm are substantial due to the large number of searches performed (even though each search is fast), and that the searches take up most of its run time. [sent-275, score-0.46]
95 The table also shows that incremental and heuristic search individually speed up the parti-game algorithm and together speed it up even more. [sent-277, score-0.306]
96 The implementations of the parti-game algorithm in [3] and [6] make slightly different assumptions from ours; for example, they minimize the number of state transitions rather than the cost. [sent-278, score-0.176]
97 Our results show that our implementation with Minimax LPA* performs about 5 percent of the state expansions of the implementation with uninformed search from scratch. [sent-280, score-0.485]
98 However, these results are very preliminary since the time per state expansion is different for the different implementations and it is future work to compare the various implementations of the parti-game algorithm in a common testbed. [sent-282, score-0.221]
99 Speeding up reinforcement learning with incremental heuristic minimax search. [sent-311, score-0.496]
100 An incremental algorithm for a generalization of the shortest-path problem. [sent-320, score-0.126]
wordName wordTfidf (topN-words)
[('rhs', 0.46), ('lpa', 0.418), ('minimax', 0.318), ('gd', 0.287), ('agent', 0.247), ('computeplan', 0.222), ('priority', 0.182), ('unsolvable', 0.131), ('search', 0.12), ('scratch', 0.114), ('uninformed', 0.102), ('state', 0.101), ('action', 0.099), ('incremental', 0.096), ('searches', 0.095), ('calculatekey', 0.091), ('states', 0.091), ('discretization', 0.089), ('queue', 0.087), ('sgoal', 0.078), ('updatestate', 0.078), ('plan', 0.075), ('locally', 0.07), ('costs', 0.063), ('actions', 0.061), ('heuristic', 0.06), ('cell', 0.057), ('expands', 0.055), ('heuristics', 0.054), ('center', 0.053), ('nondeterministic', 0.052), ('obstacles', 0.052), ('keys', 0.052), ('towards', 0.051), ('reaches', 0.05), ('execution', 0.05), ('solvable', 0.05), ('domains', 0.049), ('sec', 0.049), ('imaginary', 0.048), ('cost', 0.047), ('move', 0.046), ('goal', 0.045), ('implementations', 0.045), ('expansions', 0.043), ('coordinates', 0.042), ('informed', 0.041), ('deterministically', 0.041), ('percent', 0.041), ('overconsistent', 0.039), ('implementation', 0.039), ('min', 0.038), ('inconsistent', 0.037), ('calls', 0.036), ('repeats', 0.036), ('initialize', 0.035), ('moves', 0.035), ('obstacle', 0.035), ('priorities', 0.034), ('somewhere', 0.034), ('georgia', 0.034), ('blocked', 0.034), ('outcome', 0.034), ('repeatedly', 0.034), ('cells', 0.033), ('changed', 0.032), ('invariants', 0.031), ('executes', 0.031), ('optimistic', 0.031), ('predecessors', 0.031), ('hp', 0.031), ('algorithm', 0.03), ('sr', 0.03), ('observes', 0.029), ('execute', 0.029), ('grid', 0.029), ('minimizes', 0.029), ('outcomes', 0.027), ('speeding', 0.027), ('border', 0.027), ('intended', 0.027), ('ffivw', 0.026), ('koenig', 0.026), ('likhachev', 0.026), ('maxim', 0.026), ('recalculates', 0.026), ('terrain', 0.026), ('underconsistent', 0.026), ('plans', 0.025), ('distance', 0.025), ('expanded', 0.024), ('resolution', 0.024), ('returns', 0.023), ('nity', 0.023), ('checks', 0.023), ('sweeping', 0.023), ('prioritized', 0.023), ('atlanta', 0.023), ('old', 0.022), ('reinforcement', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 185 nips-2002-Speeding up the Parti-Game Algorithm
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
2 0.10528453 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.092559658 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
4 0.087869078 134 nips-2002-Learning to Take Concurrent Actions
Author: Khashayar Rohanimanesh, Sridhar Mahadevan
Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1
5 0.073695995 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
Author: Xiaofeng Wang, Tuomas Sandholm
Abstract: Multiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm's parameters are easy to set to meet the convergence conditions.
6 0.071558051 78 nips-2002-Efficient Learning Equilibrium
7 0.066873819 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
8 0.063848682 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
9 0.060635563 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
10 0.05944987 89 nips-2002-Feature Selection by Maximum Marginal Diversity
11 0.058404941 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
12 0.049072307 42 nips-2002-Bias-Optimal Incremental Problem Solving
13 0.044103485 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
14 0.043783754 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
15 0.042893939 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
16 0.036479376 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
17 0.034461297 115 nips-2002-Informed Projections
18 0.033756398 205 nips-2002-Value-Directed Compression of POMDPs
19 0.033155464 178 nips-2002-Robust Novelty Detection with Single-Class MPM
20 0.032481186 181 nips-2002-Self Supervised Boosting
topicId topicWeight
[(0, -0.097), (1, -0.002), (2, -0.155), (3, -0.035), (4, 0.005), (5, -0.015), (6, 0.009), (7, -0.047), (8, -0.004), (9, -0.016), (10, 0.022), (11, 0.068), (12, 0.056), (13, -0.011), (14, -0.025), (15, 0.043), (16, 0.022), (17, -0.035), (18, 0.017), (19, 0.025), (20, -0.017), (21, 0.048), (22, -0.049), (23, -0.057), (24, -0.042), (25, -0.015), (26, -0.078), (27, -0.024), (28, 0.055), (29, -0.116), (30, -0.023), (31, 0.144), (32, 0.105), (33, -0.064), (34, -0.034), (35, 0.063), (36, 0.047), (37, -0.005), (38, 0.216), (39, 0.013), (40, -0.015), (41, -0.14), (42, 0.013), (43, 0.03), (44, 0.052), (45, -0.002), (46, 0.053), (47, -0.045), (48, -0.135), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.96133912 185 nips-2002-Speeding up the Parti-Game Algorithm
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
2 0.58484298 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
3 0.45866489 134 nips-2002-Learning to Take Concurrent Actions
Author: Khashayar Rohanimanesh, Sridhar Mahadevan
Abstract: We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes – all, any and continue – and theoretically and experimentally compare them. 1
4 0.44550988 130 nips-2002-Learning in Zero-Sum Team Markov Games Using Factored Value Functions
Author: Michail G. Lagoudakis, Ronald Parr
Abstract: We present a new method for learning good strategies in zero-sum Markov games in which each side is composed of multiple agents collaborating against an opposing team of agents. Our method requires full observability and communication during learning, but the learned policies can be executed in a distributed manner. The value function is represented as a factored linear architecture and its structure determines the necessary computational resources and communication bandwidth. This approach permits a tradeoff between simple representations with little or no communication between agents and complex, computationally intensive representations with extensive coordination between agents. Thus, we provide a principled means of using approximation to combat the exponential blowup in the joint action space of the participants. The approach is demonstrated with an example that shows the efficiency gains over naive enumeration.
5 0.41885129 155 nips-2002-Nonparametric Representation of Policies and Value Functions: A Trajectory-Based Approach
Author: Christopher G. Atkeson, Jun Morimoto
Abstract: A longstanding goal of reinforcement learning is to develop nonparametric representations of policies and value functions that support rapid learning without suffering from interference or the curse of dimensionality. We have developed a trajectory-based approach, in which policies and value functions are represented nonparametrically along trajectories. These trajectories, policies, and value functions are updated as the value function becomes more accurate or as a model of the task is updated. We have applied this approach to periodic tasks such as hopping and walking, which required handling discount factors and discontinuities in the task dynamics, and using function approximation to represent value functions at discontinuities. We also describe extensions of the approach to make the policies more robust to modeling error and sensor noise.
6 0.38566568 178 nips-2002-Robust Novelty Detection with Single-Class MPM
7 0.38231149 33 nips-2002-Approximate Linear Programming for Average-Cost Dynamic Programming
8 0.36158863 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
9 0.35352045 42 nips-2002-Bias-Optimal Incremental Problem Solving
10 0.33205596 3 nips-2002-A Convergent Form of Approximate Policy Iteration
11 0.32372114 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
12 0.32230836 78 nips-2002-Efficient Learning Equilibrium
13 0.31369427 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
14 0.30379435 6 nips-2002-A Formulation for Minimax Probability Machine Regression
15 0.28292435 13 nips-2002-A Note on the Representational Incompatibility of Function Approximation and Factored Dynamics
16 0.26640093 89 nips-2002-Feature Selection by Maximum Marginal Diversity
17 0.26569667 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
18 0.26404932 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
19 0.26077998 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
20 0.25471464 9 nips-2002-A Minimal Intervention Principle for Coordinated Movement
topicId topicWeight
[(11, 0.019), (23, 0.017), (41, 0.02), (42, 0.075), (54, 0.092), (55, 0.044), (57, 0.016), (67, 0.022), (68, 0.034), (74, 0.099), (86, 0.385), (92, 0.022), (98, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.80561888 185 nips-2002-Speeding up the Parti-Game Algorithm
Author: Maxim Likhachev, Sven Koenig
Abstract: In this paper, we introduce an efficient replanning algorithm for nondeterministic domains, namely what we believe to be the first incremental heuristic minimax search algorithm. We apply it to the dynamic discretization of continuous domains, resulting in an efficient implementation of the parti-game reinforcement-learning algorithm for control in high-dimensional domains.
2 0.75691348 144 nips-2002-Minimax Differential Dynamic Programming: An Application to Robust Biped Walking
Author: Jun Morimoto, Christopher G. Atkeson
Abstract: We developed a robust control policy design method in high-dimensional state space by using differential dynamic programming with a minimax criterion. As an example, we applied our method to a simulated five link biped robot. The results show lower joint torques from the optimal control policy compared to a hand-tuned PD servo controller. Results also show that the simulated biped robot can successfully walk with unknown disturbances that cause controllers generated by standard differential dynamic programming and the hand-tuned PD servo to fail. Learning to compensate for modeling error and previously unknown disturbances in conjunction with robust control design is also demonstrated.
3 0.49163151 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
4 0.40109223 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
5 0.39668322 2 nips-2002-A Bilinear Model for Sparse Coding
Author: David B. Grimes, Rajesh P. Rao
Abstract: Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformationinvariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.
6 0.39536262 169 nips-2002-Real-Time Particle Filters
7 0.39522973 3 nips-2002-A Convergent Form of Approximate Policy Iteration
8 0.39491785 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.39479613 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
10 0.39392638 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
11 0.39344037 163 nips-2002-Prediction and Semantic Association
12 0.39335737 74 nips-2002-Dynamic Structure Super-Resolution
13 0.39281005 124 nips-2002-Learning Graphical Models with Mercer Kernels
14 0.39278844 152 nips-2002-Nash Propagation for Loopy Graphical Games
15 0.39197171 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
16 0.39100239 39 nips-2002-Bayesian Image Super-Resolution
17 0.39096263 10 nips-2002-A Model for Learning Variance Components of Natural Images
18 0.3903912 53 nips-2002-Clustering with the Fisher Score
19 0.38844433 173 nips-2002-Recovering Intrinsic Images from a Single Image
20 0.38834816 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories