nips nips2003 nips2003-34 knowledge-graph by maker-knowledge-mining

34 nips-2003-Approximate Policy Iteration with a Policy Language Bias


Source: pdf

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Lafayette, IN 47907 Abstract We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. [sent-2, score-0.767]

2 In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. [sent-4, score-0.701]

3 1 Introduction Dynamic-programming approaches to finding optimal control policies in Markov decision processes (MDPs) [4, 14] using explicit (flat) state space representations break down when the state space becomes extremely large. [sent-5, score-0.378]

4 These extensions have not yet shown the capacity to solve large classical planning problems such as the benchmark problems used in planning competitions [2]. [sent-7, score-0.649]

5 For familiar STRIPS planning domains (among others), useful cost functions can be difficult or impossible to represent compactly. [sent-9, score-0.467]

6 Existing inductive forms of approximate policy iteration (API) select compactly represented, approximate cost functions at each iteration of dynamic programming [5], again suffering when such representation is difficult. [sent-12, score-0.701]

7 1 Perhaps one reason is the complexity of typical cost functions for these problems, for which it is often more natural to specify a policy space. [sent-14, score-0.421]

8 Recent work on inductive policy selection in relational planning domains [17, 19, 28] has shown that useful policies can be learned using a policy-space bias, described by a generic knowledge representation language. [sent-15, score-1.248]

9 Here, we incorporate that work into a practical approach to API for STRIPS planning domains. [sent-16, score-0.269]

10 We replace the use of cost-function approximations as policy representations in API with direct, compact state-action mappings, and use a standard relational learner to learn these mappings. [sent-17, score-0.625]

11 We inherit from familiar API methods a (sampled) policy-evaluation phase using simulation of the current policy, or rollout [25], and an inductive policy-selection phase inducing an approximate next policy from sampled current policy values. [sent-18, score-0.839]

12 1 Recent work in relational reinforcement learning has been applied to STRIPS problems with much simpler goals than typical benchmark planning domains, and is discussed below in Section 5. [sent-19, score-0.49]

13 We evaluate our API approach in several STRIPS planning domains, showing iterative policy improvement. [sent-21, score-0.64]

14 Our technique solves entire planning domains, finding a policy that can be applied to any problem in the domain, rather than solving just a single problem instance from the domain. [sent-22, score-0.66]

15 We view each planning domain as a single large MDP where each “state” specifies both the current world and the goal. [sent-23, score-0.383]

16 The API method thus learns control knowledge (a “policy”) for the given planning domain. [sent-24, score-0.35]

17 Our API technique naturally leverages heuristic functions (cost function estimates), if available—this allows us to benefit from recent advances in domain-independent heuristics for classical planning, as discussed below. [sent-25, score-0.201]

18 Even when greedy heuristic search solves essentially none of the domain instances, our API technique successfully bootstraps from the heuristic guidance. [sent-26, score-0.298]

19 We also demonstrate that our technique is able to iteratively improve policies that correspond to previously published hand-coded control knowledge (for TL-plan [3]) and policies learned by Yoon et al. [sent-27, score-0.531]

20 Our technique gives a new way of using heuristics in planning domains, complementing traditional heuristic search strategies. [sent-29, score-0.419]

21 2 Approximate Policy Iteration We first review API for a general, action-simulator–based MDP representation, and later, in Section 3, detail a particular representation of planning domains as relational MDPs and the corresponding policy-space learning bias. [sent-30, score-0.574]

22 We represent an MDP using a generative model ⟨S, A, T, C, I⟩, where S is a finite set of states, A is a finite set of actions, and T is a randomized “action-simulation” algorithm that, given state s and action a, returns a next state t. [sent-33, score-0.238]

23 For MDP M = ⟨S, A, T, C, I⟩, a policy π is a (possibly stochastic) mapping from S to A. [sent-36, score-0.371]

24 Given a current policy π, we can define a new improved policy PI[π] by PI[π](s) = argmin_{a∈A} Q^π(s, a). [sent-39, score-0.742]
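The generative-model setting makes this improvement step easy to approximate without enumerating S: estimate Q^π(s, a) by sampling the action simulator T and following π for a bounded horizon, then take the argmin. The Python sketch below is illustrative only (Table 1 indicates the authors' own implementation was in Scheme); T, C, policy, and heuristic are assumed callables standing in for the MDP components and the heuristic function H discussed later in the summary.

```python
def q_rollout(s, a, T, C, policy, heuristic, horizon, width):
    """Monte-Carlo estimate of Q^pi(s, a): take action a, then follow the
    current policy for `horizon` steps, capping the truncated trajectory
    with the heuristic cost estimate H."""
    total = 0.0
    for _ in range(width):              # sampling width w
        cost = C(s, a)
        t = T(s, a)                     # sample a successor state
        for _ in range(horizon - 1):
            b = policy(t)
            cost += C(t, b)
            t = T(t, b)
        total += cost + heuristic(t)
    return total / width

def improved_action(s, actions, T, C, policy, heuristic, horizon=20, width=5):
    """Approximate PI[pi](s) = argmin over a of Q^pi(s, a)."""
    return min(actions, key=lambda a: q_rollout(s, a, T, C, policy,
                                                heuristic, horizon, width))
```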

25 The cost function of PI[π] is guaranteed to be no worse than that of π at each state and to improve at some state for non-optimal π. [sent-40, score-0.214]

26 Exact policy iteration iterates policy improvement (PI) from any initial policy to reach an optimal fixed point. [sent-41, score-1.295]

27 API, as described in [5], heuristically approximates policy iteration in large state spaces by using an approximate policy-improvement operator trained with Monte-Carlo simulation. [sent-44, score-0.567]

28 The use of API assumes that states and perhaps actions are represented in factored form (typically, a feature vector) that facilitates generalizing properties of the training data to the entire state and action spaces. [sent-46, score-0.321]

29 Due to API’s inductive nature, there are typically no guarantees for policy improvement—nevertheless, API often “converges” usefully. [sent-47, score-0.415]

30 We start API by providing it with an initial policy π0 and a real-valued heuristic function H, where H(s) is interpreted as an estimate of the cost of state s (presumably with respect to the optimal policy). [sent-50, score-0.642]

31 For example, in goal-based planning domains either π0 should occasionally reach a goal or H should provide non-trivial goal-distance information. [sent-55, score-0.449]

32 In our experiments we consider scenarios that use different types of initial policies and heuristics to bootstrap API. [sent-56, score-0.237]

33 Given an MDP ⟨S, {a1, . . . , am}, T, C, I⟩, API produces a policy sequence by iterating steps of approximate policy improvement—note that π0 is used in only the initial iteration but the heuristic is always used. [sent-60, score-0.994]

34 Approximate policy improvement computes an (approximate) improvement π′ of a policy π by attempting to approximate the output of exact policy improvement, i.e., PI[π]. [sent-61, score-1.228]

35 (see [25]) Given π, we construct a training set D describing an improved policy π′, consisting of tuples ⟨s, π′(s), Q̂(s, a1), . . .⟩. [sent-67, score-0.397]

36 Select π′ with the goal of minimizing the cumulative Q-cost for π′ over D (approximating the same minimization over S in exact policy iteration). [sent-74, score-0.423]

37 Our learner (see Section 3) uses a bias that focuses on states where large improvement appears possible. [sent-79, score-0.21]

38 3 API for Relational Planning In order to use our API framework, we represent classical planning domains (not just single instances) as relationally factored MDPs. [sent-80, score-0.497]

39 We then describe our compact relational policy language and the associated learner for use in step 2 of our API framework. [sent-81, score-0.623]

40 We say that an MDP ⟨S, A, T, C, I⟩ is relational when S and A are defined by giving the finite sets of objects O, predicates P, and action types Y. [sent-83, score-0.367]

41 An action is an action type applied to the appropriate number of objects, and the action space A is the set of all actions. [sent-86, score-0.312]

42 A classical planning domain is specified by providing a set of world predicates, action types, and an action simulator. [sent-87, score-0.642]

43 We simultaneously solve all problem instances of such a planning domain by constructing a relational MDP as described below. [sent-88, score-0.468]

44 Let O be a fixed set of objects and Y be the set of action types from the planning domain. [sent-89, score-0.407]

45 4 As an example, the blocks world is a classical planning domain, where a problem instance is an initial block configuration and a set of goal conditions. [sent-92, score-0.5]

46 API(n, w, h, H, π0) // training set size n, sampling width w, horizon h, initial policy π0, cost estimator (heuristic function) H. [sent-94, score-0.628]
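To connect this API(n, w, h, H, π0) signature with the policy-improvement step of Section 2, here is a rough Python sketch of the main loop. It is a reconstruction under stated assumptions rather than the authors' code; I, actions, and learn_policy are assumed interfaces for the instance generator, the applicable-action set, and the Section 3 policy learner, and q_rollout is the estimator sketched after sentence 24 above.

```python
def api_step(n, w, h, H, policy, I, actions, T, C, learn_policy):
    """One approximate policy-improvement step: build a training set that
    describes an improved policy, then induce a compact policy from it."""
    dataset = []
    for _ in range(n):                      # n sampled problem instances
        s = I()                             # MDP state = world facts + goal facts
        q = {a: q_rollout(s, a, T, C, policy, H, h, w) for a in actions(s)}
        best = min(q, key=q.get)            # greedy improved action at s
        dataset.append((s, best, q))        # tuple <s, pi'(s), Q-costs>
    return learn_policy(dataset)            # Step 2: policy-space learner

def api(n, w, h, H, pi0, I, actions, T, C, learn_policy, iterations=10):
    """Iterate approximate policy improvement starting from pi0; the heuristic
    H is used at every iteration, pi0 only as the first rollout policy."""
    policy = pi0
    for _ in range(iterations):
        policy = api_step(n, w, h, H, policy, I, actions, T, C, learn_policy)
    return policy
```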

47 Each MDP state encodes a problem instance (i.e., an initial state and a goal) from the planning domain by specifying both the current world and the goal. [sent-103, score-0.494]

48 We achieve this by letting P be the set of world predicates from the classical domain together with a new set of goal predicates, one for each world predicate. [sent-104, score-0.325]

49 Thus, the MDP states are sets of world and goal facts involving some or all objects in O. [sent-106, score-0.215]

50 The objective is to reach MDP states where the goal facts are a subset of the world facts (goal states). [sent-107, score-0.218]

51 The state {on-table(a), on(a, b), clear(b), gclear(b)} is thus a goal state in a blocks-world MDP, but would not be a goal state without clear(b). [sent-108, score-0.265]

52 We represent this objective by defining C to assign zero cost to actions taken in goal states and a positive cost to actions in all other states. [sent-109, score-0.266]
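To make this encoding concrete, the following Python sketch shows one possible rendering of the goal test and cost function; the "g" prefix for goal predicates is my own notational assumption, chosen to match the gclear(b) example above.

```python
# A state is a set of ground facts; goal facts carry a "g" prefix,
# mirroring the goal predicates introduced for each world predicate.
state = frozenset({"on-table(a)", "on(a,b)", "clear(b)", "gclear(b)"})

def goal_facts(s):
    return {f[1:] for f in s if f.startswith("g")}    # strip the "g" prefix

def world_facts(s):
    return {f for f in s if not f.startswith("g")}

def is_goal(s):
    # Goal states: every goal fact is also a world fact.
    return goal_facts(s) <= world_facts(s)

def cost(s, a):
    # Zero cost for actions taken in goal states, unit cost elsewhere,
    # so a low-cost policy must reach goal states as quickly as possible.
    return 0.0 if is_goal(s) else 1.0

print(is_goal(state))   # True: gclear(b) is satisfied by clear(b)
```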

53 In addition, we take T to be the action simulator from the planning domain. [sent-110, score-0.431]

54 With this cost function, a low-cost policy must arrive at goal states as “quickly” as possible. [sent-113, score-0.509]

55 Finally, the initial state distribution I can be any program that generates legal problem instances (MDP states) of the planning domain. [sent-114, score-0.422]

56 For example, one might use a problem generator from a planning competition. [sent-116, score-0.269]

57 We adapt the API method of Section 2 by using, for Step 2, the policy-space language bias and learning method of our previous work on learning policies in relational domains from small problem solutions [28], briefly reviewed here. [sent-119, score-0.536]

58 In relational domains, useful rules often take the form “apply action type a to any object in set C”. [sent-120, score-0.284]

59 Given a state s and a concept C expressed in taxonomic syntax, it is straightforward to compute, in time polynomial in the sizes of s and C, the set of domain objects that are represented by C in s. [sent-127, score-0.203]

60 Restricting our attention to one-argument action types, we write a policy as C1:a1, C2:a2, . . . [sent-128, score-0.371]
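Executing such a decision list amounts to scanning the rules in order and applying the first action type whose concept denotes a non-empty object set in the current state. The Python sketch below is an illustrative rendering of that rule; clear_blocks is a toy stand-in for a taxonomic-syntax concept, whose denotation (as noted above) is computable in time polynomial in the sizes of s and C.

```python
def execute_policy(rules, state, objects):
    """rules: ordered list of (concept, action_type) pairs, where
    concept(state, objects) returns the set of objects it denotes in `state`.
    Returns a ground action such as ("pickup", "a"), or None if no rule fires."""
    for concept, action_type in rules:
        denoted = concept(state, objects)
        if denoted:
            return (action_type, sorted(denoted)[0])  # apply a to any member of C
    return None

# Toy concept: "blocks that are clear in the current world state".
clear_blocks = lambda state, objects: {o for o in objects if f"clear({o})" in state}
example_policy = [(clear_blocks, "pickup")]
```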

61 6 Using Q-advantage rather than Q-cost focuses the learner toward instances where large improvement over the previous policy is possible. [sent-143, score-0.531]
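Read literally, the Q-advantage of an action is how much it improves on the current policy's own choice at a state; a minimal sketch under the interfaces assumed in the earlier snippets:

```python
def q_advantage(s, a, policy, q_estimate):
    """Q-advantage of action a at state s: the estimated Q-cost of the current
    policy's action minus that of a (positive means a improves on pi(s))."""
    return q_estimate(s, policy(s)) - q_estimate(s, a)
```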

62 1) Using only the guidance of an (often weak) domain-independent heuristic, API learns effective policies for entire classical planning domains. [sent-145, score-0.513]

63 2) Each learned policy is a domain-specific planner that is fast and empirically compares well to the state-of-the-art domain-independent planner FF [13]. [sent-146, score-0.477]

64 We consider two deterministic domains with standard definitions and three stochastic domains from Yoon et al. [sent-149, score-0.35]

65 [28]—these are: BW(n), the n-block blocks world; LW(l,t,p), the l location, t truck, p package logistics world; SBW(n), a stochastic variant of BW(n); SLW(l,c,t,p), the stochastic logistics world with c cars and t trucks; and SPW(n), a version of SBW(n) with a paint action. [sent-150, score-0.31]

66 We draw problem instances from each domain by generating pairs of random initial states and goal conditions. [sent-151, score-0.25]

67 Each experiment specifies a planning domain and an initial policy and then iterates API until "no more progress" is made. [sent-154, score-0.762]

68 We evaluate each policy on 1000 random problem instances, recording the success ratio. (Footnote 5: Multiple-argument actions can be simulated at some cost with multiple single-argument actions.) [sent-155, score-0.504]
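The evaluation protocol can be pictured as a loop over freshly drawn instances that records the success ratio (SR) and average length (AL); the sketch below is my reading of that protocol, reusing the assumed I and T interfaces from the earlier snippets, with is_goal and a step cutoff as further assumptions.

```python
def evaluate(policy, I, T, is_goal, n_problems=1000, cutoff=1000):
    """Run the policy greedily on n_problems random instances; report the
    success ratio (SR) and the average length (AL) of successful runs."""
    successes, lengths = 0, []
    for _ in range(n_problems):
        s, steps = I(), 0
        while not is_goal(s) and steps < cutoff:
            s = T(s, policy(s))
            steps += 1
        if is_goal(s):
            successes += 1
            lengths.append(steps)
    sr = successes / n_problems
    al = sum(lengths) / len(lengths) if lengths else float("inf")
    return sr, al
```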

69 8 Space precludes a description of this complex and well studied planning heuristic here. [sent-165, score-0.379]

70 Figure 2: Bootstrapping API with a domain-independent heuristic (panels (a)–(c) plot SR and AL(S)/H against the API iteration). [sent-179, score-0.219]

71 Examination of the learned BW(15) policies shows that early iterations find important concepts and later iterations find a policy that achieves a small SR; at that point, rapid improvement ensues. [sent-216, score-0.643]

72 FF-Greedy performs very well here; nevertheless, API yields compact declarative policies of the same quality as FF-Greedy. [sent-218, score-0.224]

73 TL-Plan [3] uses human-coded domain-specific control knowledge to solve classical planning problems. [sent-221, score-0.401]

74 Here we use initial policies for API that correspond to the domain-specific control knowledge appearing in [3]. [sent-222, score-0.298]

75 A sampling width of 1 corresponds to a preference to draw a small number of trajectories from each of a variety of problems rather than a larger number from each of relatively fewer training problems—in either case, the learner must be robust to the noise resulting from stochastic effects. [sent-224, score-0.26]

76 11 We cannot exactly capture the TL-Plan knowledge in our policy language. [sent-226, score-0.403]

77 Instead, we write policies that capture the knowledge but prune away some “bad” actions that TL-Plan might consider. [sent-227, score-0.244]

78 For the blocks world, TL-Plan provides three sets of control knowledge of increasing quality—we use the best and second-best sets to get the policies TL-BW-a and TL-BW-b, respectively. [sent-228, score-0.31]

79 For logistics there is only one set of knowledge given, yielding the policy TL-LW. [sent-229, score-0.472]

80 Starting with TL-BW-a and TL-LW, which have perfect SR, API uncovers policies that maintain SR but improve AL/H by approximately 6. [sent-232, score-0.239]

81 Starting with TL-BW-b, which has SR of only 30%, API quickly uncovers policies with perfect SR. [sent-234, score-0.209]

82 There is a dramatic difference in the quality of FF-Greedy (iteration 0 of Figure 2a), TL-BW-a, and TL-BW-b in BW(10); yet, for each initial policy, API finds policies of roughly identical quality—requiring more iterations for lower-quality initial policies. [sent-235, score-0.307]

83 In Yoon et al. [28], policies were learned from solutions to randomly drawn small problems for the three stochastic domains we test here, among others. [sent-238, score-0.429]

84 A significant range of policy qualities results, due to the random draw. [sent-239, score-0.371]

85 For each domain, API is shown to improve the SR for two arbitrarily selected, below-average, learned starting policies to nearly 1. [sent-242, score-0.283]

86 Each learned policy corresponds to a domain-specific planner for the target planning domain; the blocks-world and logistics-world policies corresponding to the learned policies (beyond iteration 0) are compared to FF with respect to planning time and success. [sent-248, score-0.72]

(Table 1: SR, AL, and Time reported for FF (in C) and API (Scheme), on domains including BW(20), LW(4,4,12), and LW(5,14,20).)

90 We applied FF and the appropriate selected policy to each of 1000 new test problems from each of the domains shown in Table 1. [sent-270, score-0.539]

91 In blocks worlds with more than 10 blocks, the API policy improves on FF in every category and scales much better to 20 and 30 blocks. [sent-273, score-0.452]

92 Using the same heuristic information (in a different way), API uncovers policies that significantly outperform FF. [sent-274, score-0.319]

93 Machine learning has previously been used to acquire control knowledge for planning systems [23, 27, 10, 15, 1], and, more closely related, to learn stand-alone control policies [17, 19, 28]. [sent-280, score-0.246]

94 Regarding the latter, our work is novel in using API to iteratively improve policies. (Footnote 12: For these stochastic domains we provide the heuristic (designed for deterministic domains) with a deterministic STRIPS domain approximation, using the most likely outcome of each action.) [sent-285, score-0.42]

95 In addition, we leverage a domain-independent planning heuristic to avoid the need for access to small problems. [sent-287, score-0.379]

96 The most closely related work is relational reinforcement learning (RRL) [9], a form of online API that learns relational cost-function approximations. [sent-289, score-0.338]

97 Q-cost functions are learned in the form of relational decision trees (Q-trees) and are used to learn corresponding policies (P -trees). [sent-290, score-0.43]

98 The FF planning system: Fast plan generation through heuristic search. [sent-342, score-0.379]

99 A sparse sampling algorithm for near-optimal planning in large Markov decision processes. [sent-354, score-0.316]

100 Learning generalized policies in planning domains using concept languages. [sent-365, score-0.59]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('api', 0.609), ('policy', 0.371), ('planning', 0.269), ('sr', 0.241), ('bw', 0.222), ('policies', 0.173), ('relational', 0.157), ('domains', 0.148), ('ff', 0.121), ('heuristic', 0.11), ('action', 0.104), ('mdp', 0.103), ('yoon', 0.084), ('rrl', 0.083), ('lw', 0.082), ('iteration', 0.073), ('learner', 0.073), ('predicates', 0.072), ('strips', 0.072), ('logistics', 0.069), ('sbw', 0.069), ('state', 0.067), ('horizon', 0.062), ('jm', 0.06), ('domain', 0.058), ('states', 0.056), ('world', 0.056), ('mlj', 0.055), ('slw', 0.055), ('spw', 0.055), ('learned', 0.054), ('classical', 0.051), ('cost', 0.05), ('aij', 0.05), ('control', 0.049), ('blocks', 0.048), ('improvement', 0.045), ('inductive', 0.044), ('taxonomic', 0.044), ('initial', 0.044), ('instances', 0.042), ('actions', 0.039), ('facts', 0.037), ('bias', 0.036), ('uncovers', 0.036), ('objects', 0.034), ('trajectories', 0.034), ('stochastic', 0.034), ('syntax', 0.033), ('worlds', 0.033), ('knowledge', 0.032), ('goal', 0.032), ('planners', 0.031), ('heuristically', 0.031), ('improve', 0.03), ('al', 0.03), ('width', 0.03), ('factored', 0.029), ('sampled', 0.028), ('figures', 0.028), ('argmina', 0.028), ('borrajo', 0.028), ('carbonell', 0.028), ('declarative', 0.028), ('fahiem', 0.028), ('starting', 0.026), ('planner', 0.026), ('boutilier', 0.026), ('craig', 0.026), ('training', 0.026), ('sampling', 0.025), ('approximate', 0.025), ('learn', 0.024), ('fern', 0.024), ('givan', 0.024), ('minton', 0.024), ('raymond', 0.024), ('reinforcement', 0.024), ('mdps', 0.023), ('rules', 0.023), ('quality', 0.023), ('argument', 0.022), ('decision', 0.022), ('language', 0.022), ('programming', 0.021), ('guidance', 0.02), ('iterates', 0.02), ('cumulative', 0.02), ('estimator', 0.02), ('heuristics', 0.02), ('technique', 0.02), ('benchmark', 0.02), ('problems', 0.02), ('deterministic', 0.02), ('cutoff', 0.019), ('bootstrapping', 0.019), ('representative', 0.019), ('dynamic', 0.019), ('utility', 0.018), ('draw', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

2 0.31136706 62 nips-2003-Envelope-based Planning in Relational MDPs

Author: Natalia H. Gardiol, Leslie P. Kaelbling

Abstract: A mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelopebased Planning (REBP) tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and to judiciously expand its envelope as resources permit. 1

3 0.26923925 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

4 0.23243383 78 nips-2003-Gaussian Processes in Reinforcement Learning

Author: Malte Kuss, Carl E. Rasmussen

Abstract: We exploit some useful properties of Gaussian process (GP) regression models for reinforcement learning in continuous state spaces and discrete time. We demonstrate how the GP model allows evaluation of the value function in closed form. The resulting policy iteration algorithm is demonstrated on a simple problem with a two dimensional state space. Further, we speculate that the intrinsic ability of GP models to characterise distributions of functions would allow the method to capture entire distributions over future values instead of merely their expectation, which has traditionally been the focus of much of reinforcement learning.

5 0.15523018 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

Author: Curt Bererton, Geoffrey J. Gordon, Sebastian Thrun

Abstract: The design of cooperative multi-robot systems is a highly active research area in robotics. Two lines of research in particular have generated interest: the solution of large, weakly coupled MDPs, and the design and implementation of market architectures. We propose a new algorithm which joins together these two lines of research. For a class of coupled MDPs, our algorithm automatically designs a market architecture which causes a decentralized multi-robot system to converge to a consistent policy. We can show that this policy is the same as the one which would be produced by a particular centralized planning algorithm. We demonstrate the new algorithm on three simulation examples: multi-robot towing, multi-robot path planning with a limited fuel resource, and coordinating behaviors in a game of paint ball. 1

6 0.14226778 42 nips-2003-Bounded Finite State Controllers

7 0.14194824 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

8 0.12937763 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

9 0.11762726 55 nips-2003-Distributed Optimization in Adaptive Networks

10 0.11388678 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

11 0.11334947 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

12 0.11220861 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs

13 0.095042326 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

14 0.089659736 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

15 0.074913405 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

16 0.071431197 118 nips-2003-Link Prediction in Relational Data

17 0.071395464 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

18 0.063874774 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

19 0.062979512 68 nips-2003-Eye Movements for Reward Maximization

20 0.059508149 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.174), (1, 0.325), (2, -0.167), (3, 0.034), (4, -0.084), (5, 0.02), (6, -0.167), (7, -0.167), (8, 0.159), (9, 0.122), (10, -0.046), (11, -0.182), (12, 0.023), (13, 0.105), (14, -0.011), (15, 0.037), (16, 0.043), (17, -0.126), (18, -0.049), (19, 0.009), (20, -0.022), (21, -0.013), (22, 0.069), (23, 0.044), (24, -0.144), (25, -0.052), (26, 0.054), (27, -0.085), (28, 0.054), (29, -0.198), (30, 0.015), (31, -0.143), (32, -0.037), (33, 0.062), (34, 0.05), (35, 0.06), (36, -0.011), (37, 0.058), (38, -0.101), (39, -0.055), (40, 0.086), (41, -0.069), (42, -0.021), (43, -0.041), (44, -0.018), (45, 0.056), (46, 0.018), (47, -0.094), (48, 0.038), (49, -0.09)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97379082 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

2 0.87170225 62 nips-2003-Envelope-based Planning in Relational MDPs

Author: Natalia H. Gardiol, Leslie P. Kaelbling

Abstract: A mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelopebased Planning (REBP) tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and to judiciously expand its envelope as resources permit. 1

3 0.83205593 158 nips-2003-Policy Search by Dynamic Programming

Author: J. A. Bagnell, Sham M. Kakade, Jeff G. Schneider, Andrew Y. Ng

Abstract: We consider the policy search approach to reinforcement learning. We show that if a “baseline distribution” is given (indicating roughly how often we expect a good policy to visit each state), then we can derive a policy search algorithm that terminates in a finite number of steps, and for which we can provide non-trivial performance guarantees. We also demonstrate this algorithm on several grid-world POMDPs, a planar biped walking robot, and a double-pole balancing problem. 1

4 0.62942439 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

Author: Milos Hauskrecht, Branislav Kveton

Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors. 1

5 0.60207778 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

Author: H. J. Kim, Michael I. Jordan, Shankar Sastry, Andrew Y. Ng

Abstract: Autonomous helicopter flight represents a challenging control problem, with complex, noisy, dynamics. In this paper, we describe a successful application of reinforcement learning to autonomous helicopter flight. We first fit a stochastic, nonlinear model of the helicopter dynamics. We then use the model to learn to hover in place, and to fly a number of maneuvers taken from an RC helicopter competition.

6 0.55815184 78 nips-2003-Gaussian Processes in Reinforcement Learning

7 0.48574448 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

8 0.48228005 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

9 0.47595075 42 nips-2003-Bounded Finite State Controllers

10 0.41144189 55 nips-2003-Distributed Optimization in Adaptive Networks

11 0.39305326 118 nips-2003-Link Prediction in Relational Data

12 0.36765978 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality

13 0.36029351 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

14 0.33799607 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

15 0.32838553 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

16 0.29363558 68 nips-2003-Eye Movements for Reward Maximization

17 0.26190832 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

18 0.24798514 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

19 0.22744823 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

20 0.21634525 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.051), (11, 0.014), (29, 0.012), (30, 0.119), (35, 0.058), (53, 0.047), (71, 0.077), (76, 0.034), (85, 0.063), (88, 0.236), (91, 0.108), (99, 0.077)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81771934 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

Author: Alan Fern, Sungwook Yoon, Robert Givan

Abstract: We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs. 1

2 0.69677389 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

Author: Xavier Carreras, Lluís Màrquez

Abstract: This work presents an architecture based on perceptrons to recognize phrase structures, and an online learning algorithm to train the perceptrons together and dependently. The recognition strategy applies learning in two layers: a filtering layer, which reduces the search space by identifying plausible phrase candidates, and a ranking layer, which recursively builds the optimal phrase structure. We provide a recognition-based feedback rule which reflects to each local function its committed errors from a global point of view, and allows to train them together online as perceptrons. Experimentation on a syntactic parsing problem, the recognition of clause hierarchies, improves state-of-the-art results and evinces the advantages of our global training method over optimizing each function locally and independently. 1

3 0.69256997 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

Author: Michael J. Quinlan, Stephan K. Chalup, Richard H. Middleton

Abstract: This article addresses the issues of colour classification and collision detection as they occur in the legged league robot soccer environment of RoboCup. We show how the method of one-class classification with support vector machines (SVMs) can be applied to solve these tasks satisfactorily using the limited hardware capacity of the prescribed Sony AIBO quadruped robots. The experimental evaluation shows an improvement over our previous methods of ellipse fitting for colour classification and the statistical approach used for collision detection.

4 0.62318629 62 nips-2003-Envelope-based Planning in Relational MDPs

Author: Natalia H. Gardiol, Leslie P. Kaelbling

Abstract: A mobile robot acting in the world is faced with a large amount of sensory data and uncertainty in its action outcomes. Indeed, almost all interesting sequential decision-making domains involve large state spaces and large, stochastic action sets. We investigate a way to act intelligently as quickly as possible in domains where finding a complete policy would take a hopelessly long time. This approach, Relational Envelopebased Planning (REBP) tackles large, noisy problems along two axes. First, describing a domain as a relational MDP (instead of as an atomic or propositionally-factored MDP) allows problem structure and dynamics to be captured compactly with a small set of probabilistic, relational rules. Second, an envelope-based approach to planning lets an agent begin acting quickly within a restricted part of the full state space and to judiciously expand its envelope as resources permit. 1

5 0.6130904 71 nips-2003-Fast Embedding of Sparse Similarity Graphs

Author: John C. Platt

Abstract: This paper applies fast sparse multidimensional scaling (MDS) to a large graph of music similarity, with 267K vertices that represent artists, albums, and tracks; and 3.22M edges that represent similarity between those entities. Once vertices are assigned locations in a Euclidean space, the locations can be used to browse music and to generate playlists. MDS on very large sparse graphs can be effectively performed by a family of algorithms called Rectangular Dijkstra (RD) MDS algorithms. These RD algorithms operate on a dense rectangular slice of the distance matrix, created by calling Dijkstra a constant number of times. Two RD algorithms are compared: Landmark MDS, which uses the Nyström approximation to perform MDS; and a new algorithm called Fast Sparse Embedding, which uses FastMap. These algorithms compare favorably to Laplacian Eigenmaps, both in terms of speed and embedding quality. 1

6 0.58273661 144 nips-2003-One Microphone Blind Dereverberation Based on Quasi-periodicity of Speech Signals

7 0.57432711 158 nips-2003-Policy Search by Dynamic Programming

8 0.57130361 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

9 0.56316501 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

10 0.55780315 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

11 0.55365515 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus

12 0.54736596 162 nips-2003-Probabilistic Inference of Speech Signals from Phaseless Spectrograms

13 0.54703844 157 nips-2003-Plasticity Kernels and Temporal Statistics

14 0.54673302 78 nips-2003-Gaussian Processes in Reinforcement Learning

15 0.54302734 121 nips-2003-Log-Linear Models for Label Ranking

16 0.54246539 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

17 0.54068518 189 nips-2003-Tree-structured Approximations by Expectation Propagation

18 0.53638631 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

19 0.5336023 30 nips-2003-Approximability of Probability Distributions

20 0.53205645 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions