jmlr jmlr2007 jmlr2007-41 knowledge-graph by maker-knowledge-mining

41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning


Source: pdf

Author: Mohammad Ghavamzadeh, Sridhar Mahadevan

Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. [sent-12, score-1.278]

2 We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. [sent-15, score-1.377]

3 We compare the performance of the proposed algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a non-hierarchical average reward algorithm. [sent-19, score-1.86]

4 The results show that the proposed hierarchical average reward algorithms converge to the same performance as their discounted reward counterparts. [sent-20, score-1.216]

5 Keywords: semi-Markov decision processes, hierarchical reinforcement learning, average reward reinforcement learning, hierarchical and recursive optimality c 2007 Mohammad Ghavamzadeh and Sridhar Mahadevan. [sent-21, score-1.02]

6 The goal is to select actions to maximize (minimize) some measure of long-term reward (cost), such as the expected discounted sum of rewards (costs), or the expected average reward (cost). [sent-27, score-1.151]
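
For intuition, here is a minimal sketch (Python, not from the paper) contrasting the two criteria on a single sampled reward sequence; the function names and the discount factor are illustrative, not taken from the text.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    # Expected discounted sum of rewards along one sampled trajectory.
    return float(sum((gamma ** t) * r for t, r in enumerate(rewards)))

def average_reward(rewards):
    # Empirical average reward (gain) of the same trajectory.
    return float(np.mean(rewards))

rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
print(discounted_return(rewards), average_reward(rewards))
```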

7 In addition to being an appropriate optimality criterion for continuing tasks, average reward optimality allows for more efficient state abstraction in HRL than discounted reward optimality, as will be discussed in Section 5. [sent-48, score-1.326]

8 In this paper, we extend previous work on HRL to the average reward setting, and investigate two formulations of HRL based on average reward SMDPs. [sent-49, score-0.966]

9 Hierarchically optimal average reward RL (HAR) algorithms find a hierarchical policy within the space of policies defined by the hierarchical decomposition that maximizes the global gain. [sent-54, score-1.129]

10 Recursively optimal average reward RL (RAR) algorithms treat non-primitive subtasks as continuing average reward problems, where the goal at each subtask is to maximize its gain given the policies of its children. [sent-55, score-1.885]

11 We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. [sent-59, score-1.377]

12 We compare the performance of the proposed algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a non-hierarchical average reward algorithm. [sent-63, score-1.86]

13 The results show that the proposed hierarchical average reward algorithms converge to the same performance as their discounted reward counterparts. [sent-64, score-1.216]

14 In Section 5, we extend the previous work on HRL to the average reward setting, and study two formulations of HRL based on the average reward SMDP model. [sent-69, score-0.966]

15 We investigate different methods to formulate subtasks in a recursively optimal hierarchical average reward RL setting, and present discrete-time and continuous-time recursively optimal average reward RL (RAR) algorithms. [sent-73, score-1.431]

16 HRL methods generalize the subtask idea to closed-loop policies or, more precisely, closed-loop partial policies because they are generally defined for a subset of the state space. [sent-84, score-0.903]

17 Each action in the recursively optimal policy is optimal with respect to the subtask containing the action, all descendant subtasks, and the pseudo-reward chosen by the designer of the system. [sent-119, score-1.021]

18 This expected reward contains all necessary information about the reward to analyze the SMDP model. [sent-146, score-0.918]

19 The aim of average reward SMDP algorithms is to compute policies that yield the highest average reward or gain. [sent-153, score-1.055]

20 The average reward or gain of a policy µ at state s, gµ(s), can be defined as the ratio of the expected total reward to the expected total number of time steps of following policy µ starting at state s: gµ(s) = lim inf_{n→∞} (1/n) E[ ∑_{t=0}^{n−1} r(st, µ(st)) | s0 = s, µ ]. [sent-154, score-1.517]
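
A Monte-Carlo sketch of this quantity, assuming a generic simulator env_step(s, a) that returns (next_state, reward); both env_step and policy are hypothetical names, not the paper's interface.

```python
def estimate_gain(env_step, policy, s0, n_steps=100_000):
    # Empirical estimate of the gain g^mu(s0): the long-run average reward
    # obtained by following `policy` from state s0 for n_steps steps.
    s, total = s0, 0.0
    for _ in range(n_steps):
        s, r = env_step(s, policy(s))
        total += r
    return total / n_steps
```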

21 In the next section, we will extend this framework to the average reward model and present our hierarchical average reward reinforcement learning algorithms. [sent-175, score-1.195]

22 A policy for subtask Mi can only be executed if the current state s belongs to (Si − Ti ). [sent-214, score-0.944]

23 • Ri is the reward structure inside subtask Mi and could be different from the reward function of MDP M . [sent-219, score-1.56]

24 Besides the reward of the overall task MDP M , each subtask Mi can use additional rewards to guide its local learning. [sent-223, score-1.152]

25 If the reward structure inside a subtask is different from the reward function of the overall task, we need to define two types of value functions for each subtask: an internal value function and an external value function. [sent-225, score-1.56]

26 The internal value function is defined based on both the local reward structure of the subtask and the reward of the overall task, and is used only in learning the subtask. [sent-226, score-1.56]

27 This reward structure for each subtask in our framework is more general than the one in MAXQ and, of course, includes MAXQ's pseudo-reward. [sent-228, score-1.101]

28 Each primitive action a is a primitive subtask in this decomposition, such that a is always executable and it terminates immediately after execution. [sent-229, score-0.918]

29 If we have a policy for each subtask in a hierarchy, we can define a hierarchical policy for the model. [sent-232, score-1.183]

30 Definition 2: A hierarchical policy µ is a set of policies, one policy for each subtask in the hierarchy: µ = {µ0, . . . , µm−1}. [sent-233, score-1.183]

31 Each subtask policy takes a state and returns the name of a primitive action to execute or the name of a subtask to invoke. [sent-240, score-1.715]

32 If any subtask on the Task-Stack terminates, then all subtasks below it are immediately aborted, and control returns to the subtask that had invoked the terminated subtask. [sent-243, score-1.384]
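
A minimal sketch of this Task-Stack execution scheme. Here policies[i](s) returns either a primitive action or a child subtask of i, primitive(x) tests whether x is a primitive action, terminal(i, s) tests the termination set Ti, and env_step(s, a) returns (next_state, reward); all of these names are illustrative assumptions, not the paper's pseudocode.

```python
def run_hierarchical_policy(root, policies, primitive, terminal, env_step, s,
                            max_steps=1000):
    stack, t = [root], 0
    while stack and t < max_steps:
        # If any subtask on the stack has terminated, abort all of its
        # descendants and return control to the subtask that invoked it.
        for k, i in enumerate(stack):
            if terminal(i, s):
                del stack[k:]
                break
        if not stack:
            break
        i = stack[-1]
        a = policies[i](s)
        if primitive(a):
            s, _ = env_step(s, a)   # one primitive action = one time step
            t += 1
        else:
            stack.append(a)         # invoke the child subtask
    return s
```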

33 We also define a multi-step abstract transition probability Fiµ : Si × N × Si → [0, 1] for each subtask Mi under the hierarchical policy µ. [sent-246, score-1.054]

34 The term Fiµ(s′, N|s) denotes the N-step abstract transition probability from state s to state s′ under hierarchical policy µ at subtask Mi, where N is the number of actions taken by subtask Mi, not the number of primitive actions taken in this transition. [sent-247, score-2.079]

35 In this paper, we use the multi-step abstract transition probability Fiµ to model state transition at the subtask level, and the multi-step transition probability Piµ to model state transition at the level of primitive actions. [sent-248, score-1.131]
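
Under a fixed hierarchical policy, one simple way to picture Fiµ is as repeated application of a one-step abstract transition matrix over the subtask's states, where one "step" is one child-action invocation. The sketch below is a simplification under that assumption (termination would be handled with absorbing states); the function name and matrix layout are not from the paper.

```python
import numpy as np

def multi_step_abstract_transitions(P_one_step, max_N):
    # F[N][s, s'] approximates F_i^mu(s', N | s): the probability of moving
    # from s to s' in N child-action steps of the subtask.
    F = {1: np.asarray(P_one_step, dtype=float)}
    for N in range(2, max_N + 1):
        F[N] = F[N - 1] @ F[1]
    return F
```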

36 In this form of optimality, the policy for each individual subtask is not necessarily locally optimal, but the policy for the entire hierarchy is optimal. [sent-258, score-1.104]

37 Definition 4: Recursive optimality is a weaker but more flexible form of optimality which only guarantees that the policy of each subtask is optimal given the policies of its children. [sent-261, score-1.057]

38 It is an important and flexible form of optimality because it permits each subtask to learn a locally optimal policy while ignoring the behavior of its ancestors in the hierarchy. [sent-262, score-0.908]

39 . . . , µm−1} such that for each subtask Mi in the hierarchy, the expected cumulative reward of executing policy µi and the policies of all descendants of Mi is maximized. [sent-269, score-1.425]

40 In this case, the value function to be learned for subtask Mi under hierarchical policy µ must contain only the reward received during the execution of subtask Mi . [sent-270, score-2.124]

41 The expected cumulative reward outside a subtask is not a part of its projected value function. [sent-272, score-1.148]

42 It makes the projected value function of a subtask dependent only on the subtask and its descendants. [sent-273, score-1.331]

43 In this case, the value function to be learned for subtask Mi under hierarchical policy µ must contain the reward received during the execution of subtask Mi, and the reward after subtask Mi terminates. [sent-275, score-3.225]

44 The hierarchical value function of a subtask includes the expected reward outside the subtask and therefore depends on the subtask and all its ancestors up to the root of the hierarchy. [sent-277, score-2.59]

45 Theorem 1: Under a hierarchical policy µ, each subtask Mi can be modeled by an SMDP consisting of components (Si, Ai, Piµ, R̄i), where R̄i(s, a) = V̂µ(a, s) for all a ∈ Ai. [sent-288, score-0.995]

46 The projected value of subtask Mi at state s under hierarchical policy µ can be written as V̂µ(i, s) = E[ ∑_{k=0}^{∞} γ^k r(sk, ak) | s0 = s, µ ] (Equation 3). [sent-304, score-1.125]

47 The right-most term in this equation is the expected discounted cumulative reward of completing subtask Mi after executing action a in state s. [sent-312, score-1.396]

48 Since the expected reward after execution of subtask Mi is not a component of the projected action-value function, the two-part value function decomposition allows only for recursive optimality. [sent-326, score-1.216]

49 The projected value function V̂(i, s) is broken into two parts: Part 1) the projected value function of subtask Ma for state s, and Part 2) the completion function, the expected discounted cumulative reward of completing subtask Mi after executing action a in state s. [sent-330, score-2.267]
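
A tabular sketch of this two-part (MAXQ-style) decomposition, with V_hat and C standing in for learned value and completion tables; the names and calling conventions are illustrative, not the paper's.

```python
def projected_q(V_hat, C, i, s, a):
    # Two-part decomposition: Q(i, s, a) = V_hat(a, s) + C(i, s, a),
    # where C(i, s, a) is the completion function of subtask i.
    return V_hat(a, s) + C(i, s, a)

def projected_v(V_hat, C, policy, i, s):
    # Projected value of subtask i at s: evaluate the child chosen by mu_i.
    a = policy[i](s)
    return projected_q(V_hat, C, i, s, a)
```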

50 Figure 3: This figure shows the three-part decomposition for V(i, x), the hierarchical value function of subtask Mi for the shaded state x = (ω, s); the labeled quantities include V(a, s) and C(i, s, a). [sent-337, score-0.911]

51 Moreover, average reward optimality allows for more efficient state abstraction in HRL than the discounted reward formulation. [sent-343, score-1.229]

52 It means that for all policies executed by Ma and its descendants, and for all pairs of states s1 and s2 in Si (the state space of subtask Mi) that differ only in their values for the state variables in Ya, we have Piµ(s′, N|s1, a) = Piµ(s′, N|s2, a), ∀s′ ∈ Si, ∀N ∈ N. [sent-345, score-0.954]
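
A sketch of checking this condition numerically, assuming F[(s, a)] holds the multi-step abstract transition distribution as an array indexed by (s′, N); the data layout and function name are assumptions, not part of the paper.

```python
import numpy as np

def variable_is_irrelevant(F, state_pairs, child_actions, tol=1e-9):
    # For every pair (s1, s2) differing only in the candidate state variables
    # and every child action a, the abstract transition distributions of the
    # subtask must coincide for those variables to be irrelevant.
    for s1, s2 in state_pairs:
        for a in child_actions:
            if not np.allclose(F[(s1, a)], F[(s2, a)], atol=tol):
                return False
    return True
```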

53 However, if we were using discounted reward optimality, the robot's location would not be irrelevant, because the probability that the "collect trash at T1" subtask will terminate in N steps would depend on the location of the robot, which could differ in states s1 and s2. [sent-356, score-1.346]

54 In this section, we extend previous work on hierarchical reinforcement learning (HRL) to the average reward framework, and investigate two formulations of HRL based on the average reward SMDP model. [sent-359, score-1.195]

55 In the hierarchically optimal average reward RL (HAR) algorithms, the aim is to find a hierarchical policy within the space of policies defined by the hierarchical decomposition that maximizes the global gain (Ghavamzadeh and Mahadevan, 2002). [sent-364, score-1.285]

56 In the recursively optimal average reward RL (RAR) algorithms, we treat subtasks as continuing average reward problems, where the goal at each subtask is to maximize its gain given the policies of its children (Ghavamzadeh and Mahadevan, 2001). [sent-365, score-1.967]

57 We refer to a hierarchical policy µ∗ which satisfies Equation 10 as a hierarchically optimal average reward policy, and to gµ∗ as the hierarchically optimal average reward or the hierarchically optimal gain. [sent-383, score-1.742]

58 The term Cµ(i, x, a) is the expected average-adjusted reward of completing subtask Mi after executing action a in state x = (ω, s). [sent-399, score-1.287]

59 The term CEµ(i, x, a) is the expected average-adjusted reward received after subtask Mi terminates. [sent-401, score-1.101]
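
For contrast with the two-part sketch above, here is a sketch of the three-part form used for hierarchical optimality, with H_hat, C, and CE standing in for learned tables; the names are illustrative assumptions.

```python
def hierarchical_q(H_hat, C, CE, i, x, a):
    # Three-part decomposition: the child's own (average-adjusted) value,
    # plus the completion term inside subtask i, plus the external term CE
    # accounting for reward received after subtask i terminates.
    return H_hat(a, x) + C(i, x, a) + CE(i, x, a)
```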

60 In the update formula on Line 17, the projected average-adjusted value function Ĥ(a∗, (a∗ω, s′)) is the average-adjusted reward of executing action a∗ in state (a∗ω, s′) and is recursively calculated by subtask Ma∗ and its descendants using Equations 18 and 19. [sent-420, score-1.433]
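
The exact update rules are the paper's Equations 18 and 19; as a generic stand-in, the sketch below shows an R-learning-style average-adjusted tabular update for a primitive transition inside a subtask. It illustrates the general idea only and is not the paper's HAR/RAR update; all names and step sizes are assumptions.

```python
def average_adjusted_update(H, g, i, s, a, r, v_next, alpha=0.1, beta=0.01):
    # H: dict mapping (subtask, state, action) -> average-adjusted value
    # g: dict mapping subtask -> current gain estimate
    # v_next: greedy value of the next state under subtask i (from the caller)
    key = (i, s, a)
    old = H.get(key, 0.0)
    H[key] = old + alpha * (r - g[i] + v_next - old)
    g[i] += beta * (r - g[i])      # track the (local) gain of subtask i
    return H, g
```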

61 This leaves open the question of what local optimality criterion should be used for each subtask in a recursively optimal average reward RL setting. [sent-433, score-1.285]

62 In this method, the projected average-adjusted value functions with respect to the global gain satisfy the following equations: Ĥµ(i, s) = r(s, i) − gµ if Mi is a primitive action, Ĥµ(i, s) = 0 if s ∈ Ti (s is a terminal state of subtask Mi), and Ĥµ(i, s) is given by a recursive equation otherwise. [sent-437, score-0.919]

63 Since the expected average-adjusted reward after execution of subtask Mi is not a component of the average-adjusted value function of subtask Mi, this approach does not necessarily allow for hierarchical optimality, as we will show in the experiments of Section 6. [sent-439, score-1.216]

64 Moreover, the policy learned for each subtask using this approach is not context free, since each subtask maximizes its average-adjusted reward with respect to the global gain. [sent-440, score-1.931]

65 In other words, states reached after the execution of a subtask cannot be changed by altering the policies of the subtask and its descendants. [sent-443, score-1.427]

66 Another approach is to formulate subtasks as continuing average reward problems, where the goal at each subtask is to maximize its gain given the policies of its children (Ghavamzadeh and Mahadevan, 2001). [sent-445, score-1.384]

67 We use this method to find recursively optimal average reward policies, and present discrete-time and continuous-time recursively optimal average reward RL (RAR) algorithms. [sent-453, score-1.166]

68 r̄0µ(s, µ0(s)) and y0µ(s, µ0(s)) are the expected total reward and the expected number of time steps between two decision epochs at root, given that the system occupies state s at the first decision epoch and the agent chooses its actions according to the hierarchical policy µ. [sent-465, score-1.044]
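
A sketch of how such quantities combine into an SMDP gain, as the ratio of expected reward per decision epoch to expected sojourn time per decision epoch under some state-occupancy distribution; all inputs are plain dicts keyed by state and the names are illustrative assumptions.

```python
def smdp_gain(r_bar, y_bar, occupancy):
    num = sum(occupancy[s] * r_bar[s] for s in occupancy)  # expected reward per epoch
    den = sum(occupancy[s] * y_bar[s] for s in occupancy)  # expected time per epoch
    return num / den
```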

69 Under this assumption, each subtask can be considered an episodic problem and each instantiation of a subtask can be considered an episode. [sent-483, score-1.284]

70 Finally, the dummy state s∗i transits with reward zero to one of the initial states (∈ Ii) of subtask Mi upon the next instantiation of Mi. [sent-489, score-1.21]

71 These are dummy transitions and do not add any time-step to the cycle of subtask Mi and therefore are not taken into consideration when the average reward of subtask Mi is calculated. [sent-490, score-1.767]

72 Figure 4: This figure shows how each subtask in a hierarchical decomposition of a continuing problem can be modeled as a continuing task. [sent-501, score-0.902]
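
A sketch of the construction in Figure 4 on a tabular transition model P mapping state -> {next_state: probability}; the representation, the function name, and the dummy-state label are assumptions, and reward/time bookkeeping is omitted.

```python
def add_dummy_recurrent_state(P, terminal_states, initial_dist, dummy="s_star_i"):
    # Terminal states of the subtask transition (with probability 1, reward 0,
    # and no time step) to a dummy state; the dummy state then transitions to
    # the subtask's initial states according to the initial distribution,
    # so every instantiation becomes part of one continuing chain.
    P = {s: dict(nxt) for s, nxt in P.items()}
    for t in terminal_states:
        P[t] = {dummy: 1.0}
    P[dummy] = dict(initial_dist)
    return P
```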

73 Lemma 1 is equivalent to assuming that for every subtask Mi in the hierarchy, the underlying Markov chain for every hierarchical policy µ has a single recurrent class and state s∗i is its recurrent state. [sent-508, score-1.138]

74 Under this model, the gain of subtask Mi under the hierarchical policy µ, giµ, is well defined for every hierarchical policy µ and does not depend on the initial state. [sent-509, score-1.381]

75 When the state space of subtask Mi is finite or countable, the gain of subtask Mi can be written as giµ = (F̄Iiµ rIiµ) / (F̄Iiµ yIiµ), where rIiµ and yIiµ are vectors with elements rIiµ(s, µi(s)) and yIiµ(s, µi(s)), for all s ∈ Si. [sent-510, score-1.4]
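
In practice this ratio can be estimated from completed instantiations of each subtask. The sketch below assumes bookkeeping that records (total reward, number of time steps) per instantiation; that record format and the function name are assumptions, not the paper's data structures.

```python
def local_gain_estimates(instantiations):
    # instantiations: dict mapping subtask id -> list of
    # (total_reward, num_time_steps) pairs, one per completed instantiation.
    gains = {}
    for i, stats in instantiations.items():
        total_r = sum(r for r, _ in stats)
        total_n = sum(n for _, n in stats)
        gains[i] = total_r / total_n if total_n > 0 else 0.0
    return gains
```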

76 When the underlying Markov chain for every hierarchical policy µ at subtask Mi has a single recurrent class, F̄Iiµ has equal rows, and thus the right-hand side of the equation is a vector with elements all equal to giµ. [sent-523, score-1.035]

77 Every other non-primitive subtask in the hierarchy is modeled as an average reward problem using the model described in Section 5. [sent-526, score-1.211]

78 Under these assumptions, the average reward for every non-primitive subtask in the hierarchy including root is well defined for every hierarchical policy and does not vary with initial state. [sent-529, score-1.604]

79 The projected average-adjusted value function of hierarchical policy µ at subtask Mi is the average-adjusted (with respect to the local gain giµ) sum of rewards earned by following policy µi and the policies of all descendants of subtask Mi starting in state s until Mi terminates. [sent-537, score-1.941]

80 In this algorithm, a gain is defined for every non-primitive subtask in the hierarchy and this gain is updated every time a subtask is non-randomly chosen. [sent-548, score-1.436]

81 As described earlier, the expected average-adjusted sum of rewards after execution of subtask Mi is not a component of the average-adjusted value function of subtask Mi in the RAR algorithm. [sent-560, score-1.347]

82 To achieve recursive optimality, the policy learned for each subtask must be context free, that is, each subtask should maximize its local gain given the policies of its descendants. [sent-562, score-1.613]

83 In the RAR algorithm, although each subtask maximizes its local gain given the policies of its descendants, the policy learned for each subtask is not necessarily context free, and as a result, the algorithm does not find a recursively optimal average reward policy in general. [sent-563, score-2.365]

84 The initial state distribution Ii (s), the probability of being in state s at the next instantiation of subtask Mi , depends not only on the policies of subtask Mi and all its descendants, but also on the policies of its ancestors. [sent-566, score-1.628]

85 Definition 9 (Initial Distribution Invariance Condition): The initial state distribution for each non-primitive subtask in the hierarchy is independent of the policies of its ancestors. [sent-570, score-0.9]

86 Experimental Results The goal of this section is to show the type of optimality achieved by the hierarchically optimal average reward RL (HAR) and the recursively optimal average reward RL (RAR) algorithms proposed in Sections 5. [sent-577, score-1.267]

87 In the discrete-time model, we compare the performance of HAR and RAR algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a non-hierarchical (flat) average reward algorithm. [sent-587, score-1.86]

88 We also test MAXQ-Q, the recursively optimal discounted reward HRL algorithm proposed by Dietterich (2000), and a hierarchically optimal discounted reward RL algorithm (HDR) on this task. [sent-597, score-1.377]

89 Part 1 is more important than part 2; therefore, the AGV gets a reward of 20 when part 1 is delivered to destination G1 and a reward of 1 when part 2 is delivered to destination G2. [sent-603, score-0.918]

90 The discounted reward algorithm, continuous-time MAXQ-Q, learns faster than both average reward algorithms, HAR and RAR. [sent-723, score-1.051]

91 Moreover, the hierarchically optimal average reward algorithm (HAR) learns more slowly than the recursively optimal average reward algorithm (RAR) because it has more parameters to learn. [sent-724, score-1.207]

92 Moreover, average reward optimality allows for more efficient state abstraction in HRL than the discounted reward optimality, as discussed in Section 5. [sent-735, score-1.229]

93 In this paper, we extended previous work on HRL to the average reward setting, and presented new discrete-time and continuous-time hierarchically optimal average reward RL (HAR) and recursively optimal average reward RL (RAR) algorithms. [sent-743, score-1.69]

94 The HAR algorithm searches the space of policies defined by the hierarchical decomposition to find a hierarchical policy with maximum global gain (the gain of the Markov chain that results from flattening the hierarchy using a hierarchical policy). [sent-745, score-0.965]

95 In the recursively optimal average reward RL setting, the formulation of learning algorithms directly depends on the local optimality criterion used for each subtask in the hierarchy. [sent-746, score-1.285]

96 We demonstrated that the policy learned for each subtask by the RAR algorithm is not necessarily context free, and as a result the algorithms do not find a recursively optimal average reward policy in general. [sent-748, score-1.601]

97 We compared the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. [sent-753, score-1.377]

98 We compared the performance of the proposed algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a flat average reward algorithm in this problem. [sent-758, score-1.86]

99 The results showed that the proposed hierarchical average reward algorithms converge to the same performance as their discounted reward counterparts. [sent-759, score-1.216]

100 Policy invariance under reward transformations: Theory and application to reward shaping. [sent-885, score-0.918]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('subtask', 0.642), ('reward', 0.459), ('policy', 0.188), ('agv', 0.187), ('hierarchical', 0.165), ('hrl', 0.154), ('mi', 0.135), ('rar', 0.124), ('hierarchically', 0.123), ('smdp', 0.116), ('discounted', 0.109), ('subtasks', 0.1), ('har', 0.093), ('maxq', 0.09), ('rl', 0.089), ('policies', 0.089), ('primitive', 0.087), ('hierarchy', 0.086), ('state', 0.083), ('recursively', 0.082), ('action', 0.073), ('actions', 0.065), ('reinforcement', 0.064), ('eward', 0.063), ('scheduling', 0.062), ('einforcement', 0.06), ('havamzadeh', 0.06), ('trash', 0.06), ('optimality', 0.06), ('transition', 0.059), ('completion', 0.052), ('ahadevan', 0.051), ('ierarchical', 0.051), ('mahadevan', 0.051), ('projected', 0.047), ('si', 0.046), ('andre', 0.042), ('root', 0.04), ('mdp', 0.04), ('continuing', 0.037), ('hams', 0.036), ('epoch', 0.035), ('rewards', 0.035), ('abstraction', 0.035), ('smdps', 0.033), ('gain', 0.033), ('pi', 0.031), ('executed', 0.031), ('dump', 0.03), ('ghavamzadeh', 0.03), ('rvi', 0.03), ('executing', 0.03), ('agent', 0.029), ('terminates', 0.029), ('tadepalli', 0.028), ('execution', 0.028), ('status', 0.027), ('terminal', 0.027), ('childseq', 0.027), ('hdr', 0.027), ('warehouse', 0.027), ('states', 0.026), ('russell', 0.025), ('average', 0.024), ('attening', 0.024), ('navigate', 0.024), ('seq', 0.024), ('gt', 0.023), ('door', 0.023), ('fi', 0.023), ('ti', 0.021), ('assembly', 0.021), ('precup', 0.021), ('throughput', 0.021), ('decomposition', 0.021), ('markov', 0.021), ('robot', 0.021), ('recurrent', 0.02), ('occupies', 0.02), ('chain', 0.02), ('earning', 0.019), ('recursive', 0.019), ('phams', 0.018), ('seri', 0.018), ('unload', 0.018), ('dietterich', 0.018), ('collect', 0.018), ('optimal', 0.018), ('st', 0.017), ('ct', 0.017), ('gamma', 0.017), ('descendants', 0.017), ('location', 0.016), ('task', 0.016), ('navigation', 0.016), ('station', 0.016), ('puterman', 0.016), ('options', 0.015), ('programmable', 0.015), ('ri', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning

Author: Mohammad Ghavamzadeh, Sridhar Mahadevan

Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms

2 0.13033883 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory

3 0.11531969 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Author: Matthew E. Taylor, Peter Stone, Yaxin Liu

Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

4 0.054149918 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

Author: Avrim Blum, Yishay Mansour

Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions

5 0.043185785 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

6 0.030593116 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

7 0.028875571 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

8 0.028323567 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers

9 0.027167782 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

10 0.025709985 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

11 0.022194926 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

12 0.020587454 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

13 0.016434278 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

14 0.015367246 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

15 0.014476676 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

16 0.014314373 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

17 0.014251149 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

18 0.013864692 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

19 0.013626729 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners

20 0.013111942 11 jmlr-2007-Anytime Learning of Decision Trees


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.097), (1, 0.044), (2, -0.074), (3, -0.042), (4, -0.257), (5, 0.031), (6, 0.147), (7, 0.012), (8, -0.025), (9, 0.112), (10, 0.112), (11, -0.33), (12, 0.075), (13, 0.22), (14, -0.052), (15, 0.138), (16, -0.05), (17, -0.052), (18, 0.1), (19, 0.009), (20, 0.005), (21, -0.022), (22, 0.165), (23, -0.213), (24, -0.119), (25, 0.117), (26, -0.205), (27, 0.036), (28, -0.04), (29, -0.073), (30, 0.083), (31, -0.09), (32, 0.029), (33, -0.048), (34, -0.104), (35, 0.099), (36, 0.084), (37, -0.049), (38, 0.016), (39, -0.024), (40, -0.043), (41, 0.06), (42, 0.095), (43, 0.017), (44, 0.034), (45, -0.061), (46, -0.083), (47, 0.037), (48, 0.074), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98637277 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning

Author: Mohammad Ghavamzadeh, Sridhar Mahadevan

Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms

2 0.75763947 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Author: Matthew E. Taylor, Peter Stone, Yaxin Liu

Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

3 0.52989393 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory

4 0.20426393 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

Author: Avrim Blum, Yishay Mansour

Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions

5 0.13840909 14 jmlr-2007-Behavioral Shaping for Geometric Concepts

Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič

Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning

6 0.13145095 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers

7 0.1163001 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors

8 0.10827284 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data

9 0.10822236 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

10 0.1000688 6 jmlr-2007-A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians

11 0.097763948 2 jmlr-2007-A Complete Characterization of a Family of Solutions to a Generalized Fisher Criterion

12 0.096456431 35 jmlr-2007-General Polynomial Time Decomposition Algorithms     (Special Topic on the Conference on Learning Theory 2005)

13 0.096320875 11 jmlr-2007-Anytime Learning of Decision Trees

14 0.087803833 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

15 0.086601004 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

16 0.083163597 15 jmlr-2007-Bilinear Discriminant Component Analysis

17 0.078630626 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

18 0.077874586 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

19 0.077262551 1 jmlr-2007-"Ideal Parent" Structure Learning for Continuous Variable Bayesian Networks

20 0.067477152 90 jmlr-2007-Value Regularization and Fenchel Duality


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.024), (10, 0.019), (12, 0.013), (15, 0.015), (28, 0.022), (30, 0.029), (40, 0.029), (48, 0.017), (60, 0.025), (67, 0.51), (80, 0.019), (85, 0.057), (98, 0.103)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79821372 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning

Author: Mohammad Ghavamzadeh, Sridhar Mahadevan

Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms

2 0.24618518 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

Author: Sridhar Mahadevan, Mauro Maggioni

Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory

3 0.23726788 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

Author: Guy Lebanon, Yi Mao, Joshua Dillon

Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing

4 0.22940455 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

5 0.2290445 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

6 0.2289798 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

7 0.22895275 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

8 0.22852334 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

9 0.22734196 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

10 0.22628525 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

11 0.22622257 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

12 0.22607489 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

13 0.22594999 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

14 0.22543357 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

15 0.22496665 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

16 0.22473851 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

17 0.22332016 77 jmlr-2007-Stagewise Lasso

18 0.2229341 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

19 0.22225991 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

20 0.22199754 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling