nips2007-98: knowledge-graph by maker-knowledge-mining

98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion


Source: pdf

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. [sent-5, score-0.743]

2 Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. [sent-6, score-0.805]

3 However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. [sent-7, score-0.307]

4 For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. [sent-8, score-0.699]

5 In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. [sent-11, score-1.054]

6 This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. [sent-12, score-0.546]

7 This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. [sent-13, score-0.501]

8 In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. [sent-14, score-1.15]

9 1 Introduction. In this paper we consider apprenticeship learning in the setting of large, complex domains. [sent-15, score-0.436]

10 Apprenticeship learning is based on the insight that often it is easier for an “expert” to demonstrate the desired behavior than it is to specify a reward function that induces this behavior. [sent-17, score-0.256]

11 However, when attempting to apply apprenticeship learning to large domains, several challenges arise. [sent-18, score-0.436]

12 First, past algorithms for apprenticeship learning require the expert to demonstrate complete trajectories in the domain, and we are specifically concerned with domains that are sufficiently complex so that even this task is not feasible. [sent-19, score-0.892]

13 Second, these past algorithms require the ability to solve the “easier” problem of finding a nearly optimal policy given some candidate reward function, and even this is challenging in large domains. [sent-20, score-0.354]

14 Indeed, such domains often necessitate hierarchical control in order to reduce the complexity of the control task [2, 4, 15, 12]. [sent-21, score-0.317]

15 As a motivating application, consider the task of navigating a quadruped robot (shown in Figure 1(a)) over challenging, irregular terrain (shown in Figure 1(b,c)). [sent-22, score-0.819]

16 Fortunately, this control task succumbs very naturally to a hierarchical decomposition: we first plan a general path over the terrain, then plan footsteps along this path, and finally plan joint movements. (Figure 1: (a) LittleDog robot, designed and built by Boston Dynamics, Inc.) [sent-24, score-0.488]

17 However, it is very challenging to specify a proper reward, specifically for the higher levels of control, as this requires quantifying the trade-off between many features, including progress toward a goal, the height differential between feet, the slope of the terrain underneath its feet, etc. [sent-29, score-0.319]

18 Moreover, consider the apprenticeship learning task of specifying a complete set of foot locations, across an entire terrain, that properly captures all the trade-offs above; this itself is a highly non-trivial task. [sent-30, score-0.591]

19 Motivated by these difficulties, we present a unified method for hierarchical apprenticeship learning. [sent-31, score-0.59]

20 At the lower levels of the control hierarchy, our method only requires that the expert be able to demonstrate good local behavior, rather than behavior that is optimal for the entire task. [sent-33, score-0.402]

21 This type of advice is often feasible for the expert to give even when the expert is entirely unable to give full trajectory demonstrations. [sent-34, score-0.851]

22 Thus the approach allows for apprenticeship learning in extremely complex, previously intractable domains. [sent-35, score-0.436]

23 This algorithm extends the apprenticeship learning paradigm to complex, high-dimensional control tasks by allowing an expert to demonstrate desired behavior at multiple levels of abstraction. [sent-38, score-0.887]

24 Second, we apply the hierarchical apprenticeship approach to the quadruped locomotion problem discussed above. [sent-39, score-1.119]

25 By applying this method, we achieve performance that is, to the best of our knowledge, well beyond any published results for quadruped locomotion. [sent-40, score-0.349]

26 In Section 3 we present the general formulation of the hierarchical apprenticeship learning algorithm. [sent-43, score-0.619]

27 In Section 4 we present experimental results, both on a hierarchical multi-room grid world, and on the real-world quadruped locomotion task. [sent-44, score-0.706]

28 As we are often concerned with MDPs for which no reward function is given, we use the notation MDP\R to denote an MDP minus the reward function. [sent-47, score-0.344]

29 The value of a policy $\pi$ is given by $V(\pi) = E\big[\sum_{t=0}^{H} R(s_t) \mid \pi\big]$, where the expectation is taken with respect to the random state sequence $s_0, s_1, \ldots$. [sent-49, score-0.164]

30 In Section 4 we present results crossing terrain of comparable difficulty and distance in 30-35 seconds. [sent-56, score-0.246]

31 Often the reward function R can be represented more compactly as a function of the state. [sent-57, score-0.172]

32 We consider the case where the reward function R is a linear combination of the features: $R(s) = w^T \phi(s)$ for parameters $w \in \mathbb{R}^n$. [sent-59, score-0.172]

33 Then we have that the value of a policy $\pi$ is linear in the reward function weights: $V(\pi) = E\big[\sum_{t=0}^{H} R(s_t) \mid \pi\big] = E\big[\sum_{t=0}^{H} w^T\phi(s_t) \mid \pi\big] = w^T E\big[\sum_{t=0}^{H} \phi(s_t) \mid \pi\big] = w^T \mu_\phi(\pi)$ (1), where we used linearity of expectation to bring $w$ outside of the expectation. [sent-60, score-0.286]
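
Since the value of any policy is linear in $w$, its feature expectations $\mu_\phi(\pi)$ are the only statistic needed to evaluate it. The sketch below is a minimal illustration, not the authors' code; the `env`, `policy`, and `phi` interfaces are hypothetical. It estimates $\mu_\phi(\pi)$ by Monte Carlo rollouts and evaluates $V(\pi) = w^T\mu_\phi(\pi)$ as in (1).

```python
import numpy as np

def feature_expectations(env, policy, phi, horizon, n_rollouts=100):
    """Monte Carlo estimate of mu_phi(pi) = E[sum_{t=0}^{H} phi(s_t) | pi].

    Assumed interfaces: env.reset() -> state, env.step(state, action) -> next state,
    policy(state) -> action, phi(state) -> 1-D numpy feature vector.
    """
    mu = None
    for _ in range(n_rollouts):
        s = env.reset()
        acc = np.zeros_like(phi(s), dtype=float)
        for _ in range(horizon + 1):       # t = 0, ..., H
            acc += phi(s)
            s = env.step(s, policy(s))
        mu = acc if mu is None else mu + acc
    return mu / n_rollouts

def policy_value(w, mu):
    """V(pi) = w^T mu_phi(pi), equation (1)."""
    return float(np.dot(w, mu))
```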

34 3 The Hierarchical Apprenticeship Learning Algorithm. We now present our hierarchical apprenticeship learning algorithm (hereafter HAL). [sent-62, score-0.59]

35 For simplicity, we present a two-level hierarchical formulation of the control task, referred to generically as the low-level and high-level controllers. [sent-63, score-0.308]

36 3.1 Reward Decomposition in HAL. At the heart of the HAL algorithm is a simple decomposition of the reward function that links the two levels of control. [sent-66, score-0.259]

37 For example, in the case of the quadruped locomotion problem, the low-level MDP\R describes the state of all four feet, while the high-level MDP\R describes only the position of the robot’s center of mass. [sent-68, score-0.606]

38 As is standard in apprenticeship learning, we suppose that the rewards in the low-level MDP\R can be represented as a linear function of state features, $R(s_\ell) = w^T \phi(s_\ell)$. [sent-69, score-0.543]

39 The HAL algorithm assumes that the reward of a high-level state is equal to the average reward over all its corresponding low-level states. [sent-70, score-0.394]

40 An important consequence of (2) is that the high-level reward is now also linear in the low-level reward weights w. [sent-73, score-0.438]
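
Equation (2) itself is not reproduced in this extract; based on the description above (the reward of a high-level state equals the average reward of its low-level states), it presumably takes the following form, where $L(s_h)$ denotes the set of low-level states corresponding to the high-level state $s_h$ (the $L(\cdot)$ notation is an assumption, not necessarily the paper's):

```latex
\[
R_h(s_h) \;=\; \frac{1}{|L(s_h)|} \sum_{s_\ell \in L(s_h)} R(s_\ell)
         \;=\; w^T \Big( \frac{1}{|L(s_h)|} \sum_{s_\ell \in L(s_h)} \phi(s_\ell) \Big)
\]
% The high-level reward is w^T applied to averaged low-level features,
% i.e. it is linear in the same weight vector w used at the low level.
```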

41 This will enable us in the subsequent sections to formulate a unified hierarchical apprenticeship learning algorithm that is able to incorporate expert advice at both the high level and the low level simultaneously. [sent-74, score-1.306]

42 3.2 Expert Advice at the High Level. Similar to past apprenticeship learning methods, expert advice at the high level consists of full policies demonstrated by the expert. [sent-76, score-1.134]

43 If the expert suggests that $\pi^{(i)}_{h,E}$ is an optimal policy for some given MDP\R $M_h^{(i)}$, then this corresponds to the following constraint, which states that the expert’s policy outperforms all other policies: $V^{(i)}(\pi^{(i)}_{h,E}) \ge V^{(i)}(\pi_h) \;\; \forall \pi_h$. [sent-78, score-0.566]

44 While there has also been recent work on the automated discovery of state abstractions [5], we have found that there is often a very natural decomposition of control tasks into multiple levels (as we will discuss for the specific case of quadruped locomotion). [sent-81, score-0.539]

45 In practice we do not have access to the expert’s true feature expectations, only a sample from these feature expectations, so we simply use the observed expert feature counts $\mu_\phi(\pi^{(i)}_{h,E})$ in lieu of the true expectations. [sent-82, score-0.336]

46 To resolve the ambiguity in w, and to allow the expert to provide noisy advice, we use regularization and slack variables (similar to standard SVM formulations), which results in the following formulation: $\min_{w,\eta} \; \tfrac{1}{2}\|w\|^2 + C_h \sum_{i=1}^{n} \eta^{(i)}$ s.t. $V^{(i)}(\pi^{(i)}_{h,E}) \ge V^{(i)}(\pi_h) - \eta^{(i)} \;\; \forall i, \pi_h$. [sent-84, score-0.361]

47 3.3 Expert Advice at the Low Level. Our approach differs from standard apprenticeship learning when we consider advice at the low level. [sent-89, score-0.679]

48 Unlike the apprenticeship learning paradigm where an expert specifies full trajectories in the target domain, we allow for an expert to specify single, greedy actions in the low-level domain. [sent-90, score-1.202]

49 This results in the following constraints on the reward function parameters $w$: $w^T\phi(s^\ast_\ell) \ge w^T\phi(s'_\ell)$ for all $s'_\ell$ reachable from $s_\ell$, where $s^\ast_\ell$ is the low-level state chosen by the expert. [sent-94, score-0.25]

50 As before, to resolve the ambiguity in w and to allow the expert to provide noisy advice, we use regularization and slack variables. [sent-95, score-0.361]

51 $w^T\phi(s^{(j)\ast}_\ell) \ge w^T\phi(s'_\ell) + 1 - \xi^{(j)} \;\; \forall s'_\ell, j$, where $s'_\ell$ indexes over all states reachable from $s^{(j)}_\ell$ and $j$ indexes over all low-level demonstrations provided by the expert. [sent-98, score-0.19]
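
Concretely, each low-level demonstration $j$ contributes one linear constraint per alternative successor state, expressed through the feature difference $\phi(s^{(j)\ast}_\ell) - \phi(s'_\ell)$. A minimal sketch of how such constraint rows could be collected (the `demos`, `reachable`, and `phi` interfaces are hypothetical, not the authors' code):

```python
import numpy as np

def low_level_constraint_rows(demos, reachable, phi):
    """For each demonstration j = (s, s_star), emit one feature-difference row per
    alternative successor s_prime, encoding  w @ (phi(s_star) - phi(s_prime)) >= 1 - xi_j.

    Returns a list of (j, delta_phi) pairs; rows sharing the same j share the slack xi_j.
    """
    rows = []
    for j, (s, s_star) in enumerate(demos):
        for s_prime in reachable(s):
            if s_prime == s_star:
                continue
            rows.append((j, np.asarray(phi(s_star)) - np.asarray(phi(s_prime))))
    return rows
```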

52 3.4 The Unified HAL Algorithm. From (2) we see that the high-level and low-level rewards are linear combinations of the same set of reward weights w. [sent-100, score-0.423]

53 This allows us to combine both types of expert advice presented above to obtain the following unified optimization problem: $\min_{w,\eta,\xi} \; \tfrac{1}{2}\|w\|^2 + C_h \sum_{i=1}^{n} \eta^{(i)} + C_\ell \sum_{j=1}^{m} \xi^{(j)}$ (3), subject to both the high-level and low-level constraints above with their respective slack variables $\eta^{(i)}$ and $\xi^{(j)}$. [sent-101, score-0.522]
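
Problem (3) is a convex quadratic program, so it can be handed to an off-the-shelf solver. The sketch below assembles a HAL-style QP with cvxpy; it is a minimal illustration under the reconstructions above (the exact margins and scaling in the paper may differ), not the authors' implementation.

```python
import cvxpy as cp

def solve_hal_qp(mu_expert, mu_others, low_rows, n_features, C_h=1.0, C_l=1.0):
    """Unified HAL-style QP, in the spirit of problem (3).

    mu_expert: list over training MDPs i of expert feature expectations (1-D arrays).
    mu_others: list over i of lists of feature expectations of competing high-level policies.
    low_rows:  list of (j, delta_phi) low-level feature-difference rows.
    """
    n_high = len(mu_expert)
    n_low = 1 + max((j for j, _ in low_rows), default=0)
    w = cp.Variable(n_features)
    eta = cp.Variable(max(n_high, 1), nonneg=True)   # high-level slacks
    xi = cp.Variable(max(n_low, 1), nonneg=True)     # low-level slacks

    cons = []
    # High-level advice: the expert's policy outperforms each competing policy, up to slack.
    for i, mu_e in enumerate(mu_expert):
        for mu_pi in mu_others[i]:
            cons.append(w @ mu_e >= w @ mu_pi - eta[i])
    # Low-level advice: the demonstrated footstep beats each alternative by a unit margin.
    for j, d in low_rows:
        cons.append(w @ d >= 1 - xi[j])

    obj = cp.Minimize(0.5 * cp.sum_squares(w) + C_h * cp.sum(eta) + C_l * cp.sum(xi))
    cp.Problem(obj, cons).solve()
    return w.value
```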

54 However, in our full formulation with low-level advice also taken into account, this becomes less of an issue, and so we present the above formulation for simplicity. [sent-112, score-0.273]

55 Alternatively, one may interpret low-level advice at the level of actions, and interpret the expert picking action $a^\ast$ as the constraint that $\sum_{s'} P_{sa^\ast}(s')R(s') \ge \sum_{s'} P_{sa}(s')R(s') \;\; \forall a \ne a^\ast$. [sent-114, score-0.675]

56 (b) Performance versus number of training samples for HAL and flat apprenticeship learning. [sent-120, score-0.504]

57 Find the current reward weights by solving the current optimization problem. [sent-125, score-0.172]

58 Solve the reinforcement learning problem at the high level of the hierarchy to find the optimal (high-level) policies for the current reward for each MDP\R, i.e., compute $\arg\max_{\pi} w^T \mu_\phi(\pi)$ for each $M_h^{(i)}$. [sent-127, score-0.416]

59 Otherwise, no constraints are violated and the current reward weights are the solution of the optimization problem. [sent-129, score-0.212]
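
Put together, the training loop alternates between solving the current QP and re-planning at the high level under the learned weights, adding a constraint whenever some policy outranks the expert. A schematic sketch (the `solve_qp`, `plan_high_level`, and `feature_expectations` helpers stand in for the routines described above and are hypothetical):

```python
def hal_constraint_generation(mdps, expert_mu, solve_qp, plan_high_level,
                              feature_expectations, tol=1e-6, max_iters=50):
    """Outer loop: solve the QP, re-plan each high-level MDP\\R, and add any
    violated expert-optimality constraints; stop when none are violated."""
    active = [[] for _ in mdps]            # per-MDP competing feature expectations (numpy arrays)
    w = None
    for _ in range(max_iters):
        w = solve_qp(expert_mu, active)    # current reward weights
        violated = False
        for i, mdp in enumerate(mdps):     # optimal high-level policy per MDP\R
            pi = plan_high_level(mdp, w)
            mu_pi = feature_expectations(mdp, pi)
            if w @ mu_pi > w @ expert_mu[i] + tol:   # expert no longer near-optimal
                active[i].append(mu_pi)              # add the violated constraint
                violated = True
        if not violated:                   # current weights solve the full problem
            return w
    return w
```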

60 While this is not meant to be a challenging control task, it allows us to compare the performance of HAL to traditional “flat” (non-hierarchical) apprenticeship learning methods, as these algorithms are feasible in such domains. [sent-132, score-0.527]

61 The grid world domain has a very natural hierarchical decomposition: if we average the cost over each room, we can form a “high-level” approximation of the grid world. [sent-133, score-0.225]

62 Our hierarchical controller first plans in this domain to choose a path over the rooms. [sent-134, score-0.351]

63 Then for each room along this path we plan a low-level path to the desired exit. [sent-135, score-0.251]
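
A toy version of this two-level controller only needs a per-room average of the cell costs, a route over rooms, and a within-room planner for each leg. A minimal sketch (the `room_of`, `plan_rooms`, `plan_in_room`, and `exits` interfaces are hypothetical, not the authors' code):

```python
import numpy as np

def hierarchical_gridworld_plan(cost, room_of, plan_rooms, plan_in_room, exits,
                                start, goal):
    """Two-level plan for a multi-room grid world with per-cell costs `cost`."""
    # High level: approximate each room by the average cost of its cells.
    cells = list(np.ndindex(cost.shape))
    rooms = sorted({room_of(s) for s in cells})
    room_cost = {r: float(np.mean([cost[s] for s in cells if room_of(s) == r]))
                 for r in rooms}
    room_route = plan_rooms(room_cost, room_of(start), room_of(goal))

    # Low level: within each room on the route, plan to the exit toward the next room.
    path, s = [start], start
    for r, r_next in zip(room_route, room_route[1:]):
        target = exits[(r, r_next)]                  # doorway cell between r and r_next
        path += plan_in_room(cost, r, s, target)     # e.g. value iteration restricted to room r
        s = target
    path += plan_in_room(cost, room_of(goal), s, goal)
    return path
```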

64 As expected, the flat apprenticeship learning algorithm eventually converges to a superior policy, since it employs full value iteration to find the optimal policy, while HAL uses the (non-optimal) hierarchical controller. [sent-137, score-0.59]

65 However, for small amounts of training data, HAL outperforms the flat method, since it is able to leverage the small amount of data provided by the expert at both levels of the hierarchy. [sent-138, score-0.391]

66 Figure 2(c) shows performance versus number of MDPs in the training set for HAL as well as for algorithms which receive the same training data as HAL (that is, both high-level and low-level expert demonstrations), but which make use of only one or the other. [sent-139, score-0.611]

67 Each state has 40 binary features, sampled from a distribution particular to that room, and the reward function is chosen randomly to have 10 “small” [-0. [sent-144, score-0.222]

68 In all cases we generated multiple training MDPs, which differ in which features are active at each state, and we provided the algorithm with one expert demonstration for each sampled MDP. [sent-151, score-0.468]

69 In the comparison of HAL with flat apprenticeship learning in Figure 2(b), one training example corresponds to one expert action. [sent-158, score-0.785]

70 Concretely, for HAL the number of training examples for a given training MDP corresponds to the number of high-level actions in the high-level demonstration plus the (equal) number of low-level expert actions provided. [sent-159, score-0.819]

71 For flat apprenticeship learning the number of training examples for a given training MDP corresponds to the number of expert actions in the expert’s full trajectory demonstration. [sent-160, score-0.899]

72 Figure 3: (a) High-level (path) expert demonstration. [sent-161, score-0.307]

73 This will be especially important in domains such as the quadruped locomotion task, where we have access to very few training MDPs. [sent-164, score-0.597]

74 4.2 Quadruped Robot. In this section we present the primary experimental result of this paper, a successful application of hierarchical apprenticeship learning to the task of quadruped locomotion. [sent-168, score-0.97]

75 The robot consists of 12 independently actuated servo motors, three on each leg, with two at the hip and one at the knee. [sent-175, score-0.233]

76 We perform all computation on a desktop computer, and send commands to the robot via a wireless connection. [sent-178, score-0.205]

77 As mentioned in the introduction, we employ a hierarchical control scheme for navigating the quadruped over the terrain. [sent-179, score-0.588]

78 The high-level controller is a body path planner that plans an approximate trajectory for the robot’s center of mass over the terrain; the low-level controller is a footstep planner that, given a path for the robot’s center, plans a set of footsteps that follow this path. [sent-181, score-0.835]

79 To form the high-level cost, we aggregate features from the footstep planner. [sent-184, score-0.188]

80 After forming the cost function for both levels, we used value iteration to find the optimal policy for the body path planner, and a five-step lookahead receding horizon search to find a good set of footsteps for the footstep planner. [sent-187, score-0.482]
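
A minimal sketch of those two planners, a tabular value iteration for the body-path level and a depth-limited lookahead for footstep selection (the state/action interfaces are hypothetical and this is not the authors' planner):

```python
import numpy as np

def value_iteration(n_states, actions, next_state, reward, gamma=0.99, iters=500):
    """Tabular value iteration for a deterministic high-level (body-path) MDP.

    Assumed interfaces: actions(s) -> iterable of actions, next_state(s, a) -> int,
    reward(s) -> float, with states indexed 0..n_states-1.
    """
    V = np.zeros(n_states)
    for _ in range(iters):
        V = np.array([max(reward(s) + gamma * V[next_state(s, a)] for a in actions(s))
                      for s in range(n_states)])
    return V

def receding_horizon_footstep(state, actions, next_state, reward, depth=5):
    """Choose the next footstep by exhaustive lookahead `depth` steps deep,
    then return only the first action (receding horizon control)."""
    def best(s, d):
        if d == 0:
            return 0.0, None
        scored = [(reward(next_state(s, a)) + best(next_state(s, a), d - 1)[0], a)
                  for a in actions(s)]
        return max(scored, key=lambda t: t[0])
    return best(state, depth)[1]
```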

81 Hierarchical Apprenticeship Learning for Quadruped Locomotion. All experiments were carried out on two terrains: a relatively easy terrain for training, and a significantly more challenging terrain for testing. [sent-190, score-0.442]

82 To give advice at the high level, we specified complete body trajectories for the robot’s center of mass, as shown in Figure 3(a). [sent-191, score-0.369]

83 To give advice for the low level we looked for situations in which the robot stepped in a suboptimal location, and then indicated the correct greedy foot placement, as shown in Figure 3(b). [sent-192, score-0.62]

84 (Figure 4: Snapshots of quadruped while traversing the testing terrain.) [sent-193, score-0.423]

85 Figure 5: Body and footstep plans for different constraints on the training (left) and testing (right) terrains: (Red) No Learning, (Green) HAL, (Blue) Path Only, (Yellow) Footstep Only. [sent-194, score-0.32]

86 The entire training set consisted of a single high-level path demonstration across the training terrain, and 20 low-level footstep demonstrations on this terrain; it took about 10 minutes to collect the data. [sent-195, score-0.363]

87 Figure 4 shows snapshots of the quadruped crossing the testing board. [sent-197, score-0.449]

88 Figure 5 shows the resulting footsteps taken for each of the different types of constraints; there is a very large qualitative difference between the footsteps chosen before and after training. [sent-198, score-0.19]

89 Using only footstep constraints does quite well on the training board, but on the testing board the lack of high-level training leads the robot to take a very roundabout route, and it performs much worse. [sent-201, score-0.555]

90 The quadruped fails to cross the testing terrain when learning from the path-level demonstration only or when not learning at all. [sent-202, score-0.667]

91 Finally, prior to undertaking our work on hierarchical apprenticeship learning, we invested several weeks attempting to hand-tune a controller capable of picking good footsteps across challenging terrain. [sent-203, score-0.801]

92 5 Related Work and Discussion. The work presented in this paper relates to many areas of reinforcement learning, including apprenticeship learning and hierarchical reinforcement learning, and to a large body of past work in quadruped locomotion. [sent-205, score-1.208]

93 In the introduction and in the formulation of our algorithm we discussed the connection to the inverse reinforcement learning algorithm of [1] and the maximum margin planning algorithm of [13]. [sent-206, score-0.21]

94 However, all the work in this area that we are aware of deals with the more standard reinforcement learning formulation where known rewards are given to the agent as it acts in a (possibly unknown) environment. [sent-208, score-0.184]

95 In contrast, our work follows the apprenticeship learning paradigm where the model, but not the rewards, is known to the agent. [sent-209, score-0.463]

96 Prior work on legged locomotion has mostly focused on generating gaits for stably traversing fairly flat terrain. (Table: crossing times in seconds on the training and testing terrains for each condition.) [sent-210, score-0.18]

97 Dashes indicate that the robot fell over and did not reach the goal. [sent-216, score-0.205]

98 Gait planning of quadruped walking and climbing robot for locomotion in 3D environment. [sent-248, score-0.769]

99 A complete control architecture for quadruped locomotion over rough terrain. [sent-257, score-0.606]

100 Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. [sent-283, score-0.199]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('apprenticeship', 0.436), ('quadruped', 0.349), ('expert', 0.307), ('hal', 0.265), ('advice', 0.215), ('robot', 0.205), ('terrain', 0.202), ('locomotion', 0.18), ('reward', 0.172), ('sh', 0.165), ('footstep', 0.159), ('hierarchical', 0.154), ('mdp', 0.134), ('policy', 0.114), ('wt', 0.106), ('foot', 0.1), ('reinforcement', 0.098), ('footsteps', 0.095), ('terrains', 0.079), ('mdps', 0.079), ('feet', 0.076), ('level', 0.072), ('path', 0.071), ('psa', 0.063), ('room', 0.059), ('rewards', 0.057), ('slack', 0.054), ('controller', 0.054), ('control', 0.053), ('ch', 0.053), ('policies', 0.052), ('demonstrations', 0.051), ('planner', 0.051), ('state', 0.05), ('actions', 0.05), ('andrew', 0.049), ('kolter', 0.048), ('littledog', 0.048), ('margin', 0.048), ('ratliff', 0.047), ('plans', 0.047), ('decomposition', 0.045), ('crossing', 0.044), ('body', 0.043), ('training', 0.042), ('levels', 0.042), ('gridworld', 0.041), ('demonstration', 0.04), ('constraints', 0.04), ('trajectories', 0.038), ('challenging', 0.038), ('reachable', 0.038), ('specify', 0.037), ('board', 0.035), ('suboptimality', 0.035), ('robotics', 0.035), ('indexes', 0.035), ('planning', 0.035), ('testing', 0.032), ('joel', 0.032), ('navigating', 0.032), ('zico', 0.032), ('states', 0.031), ('task', 0.031), ('abbeel', 0.03), ('minw', 0.03), ('past', 0.03), ('constraint', 0.03), ('st', 0.03), ('formulation', 0.029), ('features', 0.029), ('low', 0.028), ('plan', 0.028), ('pieter', 0.028), ('actuated', 0.028), ('biped', 0.028), ('uni', 0.027), ('paradigm', 0.027), ('action', 0.027), ('automation', 0.027), ('center', 0.027), ('proceedings', 0.027), ('versus', 0.026), ('domains', 0.026), ('nathan', 0.025), ('mh', 0.025), ('domain', 0.025), ('easier', 0.025), ('complete', 0.024), ('picking', 0.024), ('boston', 0.024), ('snapshots', 0.024), ('international', 0.023), ('grid', 0.023), ('high', 0.022), ('bagnell', 0.022), ('abstraction', 0.022), ('desired', 0.022), ('trajectory', 0.022), ('expectations', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

2 0.33918735 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

3 0.1746095 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

4 0.12090827 102 nips-2007-Incremental Natural Actor-Critic Algorithms

Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton

Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1

5 0.11893582 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

6 0.1178023 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

7 0.11462118 215 nips-2007-What makes some POMDP problems easy to approximate?

8 0.10726747 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

9 0.10120187 30 nips-2007-Bayes-Adaptive POMDPs

10 0.098545805 125 nips-2007-Markov Chain Monte Carlo with People

11 0.097021222 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

12 0.090295002 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

13 0.088333048 163 nips-2007-Receding Horizon Differential Dynamic Programming

14 0.087918349 100 nips-2007-Hippocampal Contributions to Control: The Third Way

15 0.0867365 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

16 0.07953684 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

17 0.071141683 185 nips-2007-Stable Dual Dynamic Programming

18 0.062825404 86 nips-2007-Exponential Family Predictive Representations of State

19 0.052447837 52 nips-2007-Competition Adds Complexity

20 0.051210925 203 nips-2007-The rat as particle filter


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.171), (1, -0.269), (2, 0.054), (3, -0.074), (4, -0.164), (5, 0.123), (6, 0.041), (7, -0.029), (8, 0.042), (9, 0.111), (10, -0.011), (11, -0.109), (12, 0.045), (13, -0.038), (14, 0.003), (15, 0.032), (16, -0.058), (17, 0.029), (18, 0.072), (19, -0.049), (20, 0.011), (21, 0.043), (22, -0.086), (23, 0.028), (24, -0.001), (25, -0.084), (26, -0.057), (27, -0.044), (28, 0.009), (29, -0.043), (30, -0.08), (31, 0.021), (32, -0.141), (33, 0.05), (34, -0.014), (35, 0.031), (36, 0.108), (37, 0.081), (38, -0.077), (39, -0.004), (40, 0.117), (41, 0.059), (42, -0.146), (43, -0.03), (44, -0.17), (45, -0.069), (46, 0.008), (47, -0.166), (48, 0.014), (49, -0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95458317 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

2 0.84840393 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning

Author: Umar Syed, Robert E. Schapire

Abstract: We study the problem of an apprentice learning to behave in an environment with an unknown reward function by observing the behavior of an expert. We follow on the work of Abbeel and Ng [1] who considered a framework in which the true reward function is assumed to be a linear combination of a set of known and observable features. We give a new algorithm that, like theirs, is guaranteed to learn a policy that is nearly as good as the expert’s, given enough examples. However, unlike their algorithm, we show that ours may produce a policy that is substantially better than the expert’s. Moreover, our algorithm is computationally faster, is easier to implement, and can be applied even in the absence of an expert. The method is based on a game-theoretic view of the problem, which leads naturally to a direct application of the multiplicative-weights algorithm of Freund and Schapire [2] for playing repeated matrix games. In addition to our formal presentation and analysis of the new algorithm, we sketch how the method can be applied when the transition function itself is unknown, and we provide an experimental demonstration of the algorithm on a toy video-game environment. 1

3 0.55327785 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Author: Gerald Tesauro, Rajarshi Das, Hoi Chan, Jeffrey Kephart, David Levine, Freeman Rawson, Charles Lefurgy

Abstract: Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations. 1

4 0.54849958 162 nips-2007-Random Sampling of States in Dynamic Programming

Author: Chris Atkeson, Benjamin Stephens

Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1

5 0.54268539 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

6 0.51135653 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

7 0.47975889 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

8 0.47875527 52 nips-2007-Competition Adds Complexity

9 0.43227491 163 nips-2007-Receding Horizon Differential Dynamic Programming

10 0.4286184 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

11 0.40499097 185 nips-2007-Stable Dual Dynamic Programming

12 0.3794513 102 nips-2007-Incremental Natural Actor-Critic Algorithms

13 0.31299418 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

14 0.30954239 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

15 0.30597928 215 nips-2007-What makes some POMDP problems easy to approximate?

16 0.30343205 203 nips-2007-The rat as particle filter

17 0.2814213 125 nips-2007-Markov Chain Monte Carlo with People

18 0.27626601 30 nips-2007-Bayes-Adaptive POMDPs

19 0.24587683 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation

20 0.24516205 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.031), (13, 0.064), (16, 0.02), (18, 0.011), (21, 0.046), (31, 0.025), (34, 0.016), (35, 0.024), (46, 0.011), (47, 0.106), (49, 0.024), (53, 0.017), (56, 0.022), (72, 0.318), (83, 0.089), (85, 0.025), (87, 0.016), (90, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87400687 29 nips-2007-Automatic Generation of Social Tags for Music Recommendation

Author: Douglas Eck, Paul Lamere, Thierry Bertin-mahieux, Stephen Green

Abstract: Social tags are user-generated keywords associated with some resource on the Web. In the case of music, social tags have become an important component of “Web2.0” recommender systems, allowing users to generate playlists based on use-dependent terms such as chill or jogging that have been applied to particular songs. In this paper, we propose a method for predicting these social tags directly from MP3 files. Using a set of boosted classifiers, we map audio features onto social tags collected from the Web. The resulting automatic tags (or autotags) furnish information about music that is otherwise untagged or poorly tagged, allowing for insertion of previously unheard music into a social recommender. This avoids the ”cold-start problem” common in such systems. Autotags can also be used to smooth the tag space from which similarities and recommendations are made by providing a set of comparable baseline tags for all tracks in a recommender system. 1

same-paper 2 0.73027343 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Author: J. Z. Kolter, Pieter Abbeel, Andrew Y. Ng

Abstract: We consider apprenticeship learning—learning from expert demonstrations—in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain. However, in many problems even an expert has difficulty controlling the system, which makes this approach infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate complete trajectories. This allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work. 1

3 0.56083477 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

Author: Ozgur Sumer, Umut Acar, Alexander T. Ihler, Ramgopal R. Mettu

Abstract: Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of adaptive inference in tree-structured Bayesian networks. We describe an algorithm for adaptive inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods in all expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm. 1

4 0.45844617 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

Author: Alexander L. Strehl, Michael L. Littman

Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

5 0.45785236 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini

Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1

6 0.4541328 86 nips-2007-Exponential Family Predictive Representations of State

7 0.45084211 63 nips-2007-Convex Relaxations of Latent Variable Training

8 0.44862852 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.44811827 100 nips-2007-Hippocampal Contributions to Control: The Third Way

10 0.44662765 84 nips-2007-Expectation Maximization and Posterior Constraints

11 0.44592601 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

12 0.44253692 115 nips-2007-Learning the 2-D Topology of Images

13 0.44214585 163 nips-2007-Receding Horizon Differential Dynamic Programming

14 0.44111779 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

15 0.44103634 105 nips-2007-Infinite State Bayes-Nets for Structured Domains

16 0.44099686 24 nips-2007-An Analysis of Inference with the Universum

17 0.43972552 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

18 0.43946072 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

19 0.43868488 69 nips-2007-Discriminative Batch Mode Active Learning

20 0.43850607 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data