jmlr jmlr2012 jmlr2012-116 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
Reference: text
sentIndex sentText sentNum sentScore
1 We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. [sent-11, score-1.398]
2 We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. [sent-12, score-1.19]
3 Although reinforcement learning researchers study algorithms for improving task performance with experience, we do not yet understand how to effectively transfer learned skills and knowledge from one problem setting to another. [sent-16, score-0.629]
4 In this paper we present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features and that transfer can take place through functions defined only over those shared features. [sent-20, score-0.92]
5 1 Reinforcement Learning Reinforcement learning (Sutton and Barto, 1998) is a machine learning paradigm where an agent attempts to learn how to maximize a numerical reward signal over time in a given environment. [sent-30, score-0.643]
6 As a reinforcement learning agent interacts with its environment, it receives a reward (or sometimes incurs a cost) for each action taken. [sent-31, score-0.805]
7 If the reward received by the agent at time k is denoted rk, we denote this cumulative reward (termed return) from time t as Rt = ∑_{i=0}^∞ γ^i r_{t+i+1}, where 0 < γ ≤ 1 is a discount factor that expresses the extent to which the agent prefers immediate reward over delayed reward. [sent-35, score-1.38]
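The return defined above is easiest to see with a small numeric sketch (not from the paper; the reward sequence and γ value here are made-up examples):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute Rt = sum_i gamma^i * r_{t+i+1} for a finite reward sequence.

    `rewards` holds r_{t+1}, r_{t+2}, ...; gamma in (0, 1] controls how strongly
    the agent prefers immediate reward over delayed reward.
    """
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example: -1 per step and +1000 at the goal, echoing the rod-positioning
# rewards described later in this listing.
print(discounted_return([-1, -1, -1, 1000]))
```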
8 Given a policy π mapping states to actions, a reinforcement learning agent may learn a value function, V , mapping states to expected return. [sent-36, score-0.741]
9 The options framework provides methods for learning and planning using options as temporally extended actions in the standard reinforcement learning framework (Sutton and Barto, 1998). [sent-60, score-0.826]
10 Policy learning is usually performed by an off-policy reinforcement learning algorithm so that the agent can update many options simultaneously after taking an action (Sutton et al. [sent-62, score-0.913]
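As a rough illustration of this kind of off-policy, simultaneous update, the sketch below applies a Q-learning-style update to every option whose policy agrees with the executed action. It assumes tabular Q-functions, deterministic option policies, and four primitive actions; option termination and pseudo-rewards are omitted, so this is a simplified sketch rather than the paper's implementation.

```python
def intra_option_update(option_qs, option_policies, s, a, r, s_next,
                        actions=range(4), alpha=0.1, gamma=0.99):
    """Update every option whose policy agrees with the executed action.

    option_qs[o] maps (state, action) -> value for option o;
    option_policies[o](state) is the action option o would choose.
    """
    for o, q in option_qs.items():
        if option_policies[o](s) != a:
            continue  # experience is off-policy for option o; skipped in this sketch
        best_next = max(q.get((s_next, b), 0.0) for b in actions)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + alpha * (r + gamma * best_next - old)
```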
11 Since transfer hinges on the tasks being related, the nature of that relationship will define how transfer can take place. [sent-82, score-0.7]
12 Related Tasks Share Common Features Successful transfer requires an agent that must solve a sequence of tasks that are related but distinct— different, but not so different that experience in one is irrelevant to experience in another. [sent-88, score-0.998]
13 A sequence of tasks must be (at least approximately) reward-linked if we aim to transfer information about the optimal value function: if the reward functions in two tasks use different sensors then there is no reason to hope that their value functions contain useful information about each other. [sent-130, score-0.804]
14 If learning to solve each individual task is possible and feasible in agent-space, then transfer is trivial: the tasks are effectively a single task and we can learn a single policy in agent-space for all tasks. [sent-137, score-0.663]
15 Section 4 shows that an agent can transfer value-functions learned in agent-space to significantly decrease the time taken to find an initial solution to a task, given experience in a sequence of related and reward-linked tasks. [sent-140, score-0.826]
16 In Section 5 we show that an agent can learn portable high-level skills directly in agent-space which can dramatically improve task performance, given experience in a sequence of related tasks. [sent-141, score-0.787]
17 We empirically demonstrate the effects of knowledge transfer using a relatively simple demonstration domain (a rod positioning task with an artificial agent space) and a more challenging domain (Keepaway). [sent-145, score-0.828]
18 1985), so that the agent can safely learn easier versions of the same task and use the resulting policy to speed learning as the task becomes more complex. [sent-152, score-0.629]
19 Unfortunately, this type of shaping does not generally transfer between tasks—it can only be used to gently introduce an agent to a single task, and is therefore not suited to a sequence of distinct tasks. [sent-153, score-0.95]
20 Alternatively, the agent’s reward function could be augmented through the use of intermediate shaping rewards or “progress indicators” (Matarić, 1997) that provide an augmented (and hopefully more informative) reinforcement signal to the agent. [sent-154, score-0.703]
21 This has the effect of shortening the reward horizon of the problem—the number of correct actions the agent must execute before receiving a useful reward signal (Laud and DeJong, 2003). [sent-155, score-0.845]
22 Ng et al. (1999) proved that an arbitrary externally specified reward function could be included as a potential-based shaping function in a reinforcement learning system without modifying its optimal policy. [sent-157, score-0.658]
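The potential-based form meant here is F(s, s') = γΦ(s') − Φ(s); a minimal sketch of adding such a term to the environment reward follows. The distance-based potential is a made-up example, not the learned function discussed in this paper.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Add the potential-based shaping term F = gamma * Phi(s') - Phi(s) to r.

    With this form the optimal policy of the underlying task is unchanged.
    """
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical potential: negative Manhattan distance to a known goal cell.
goal = (39, 39)
phi = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
print(shaped_reward(-1, (0, 0), (0, 1), phi))
```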
23 In the following sections we show that an agent can learn its own shaping function from experience across several related, reward-linked tasks without having it specified in advance. [sent-165, score-0.971]
24 Alternatively, once an agent has completed some task S_j and has learned a good approximation of the value of each state using V_j, it can use its (d_i^j, v_i^j) pairs as training examples for a supervised learning algorithm to learn L. [sent-181, score-0.639]
25 Thus, when facing a new task Mk , the agent can use L to provide a good initial estimate for Vk that can be refined using a standard reinforcement learning algorithm. [sent-184, score-0.654]
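A minimal sketch of this train-then-transfer loop: fit L from descriptor/value pairs gathered in earlier tasks, then seed the value estimates of a new task from its descriptors. The five-dimensional descriptor, the placeholder data, and the ordinary-least-squares choice for the supervised step are assumptions for illustration only.

```python
import numpy as np

# (d_i, v_i) pairs gathered in earlier tasks: agent-space descriptors paired with
# the values learned for the corresponding states (placeholder data).
D = np.random.rand(200, 5)                       # e.g., five beacon signal levels
v = D @ np.array([3.0, -1.0, 0.5, 2.0, 0.0]) + 0.1 * np.random.randn(200)

# Fit L by least squares so that L(d) approximates the value of a state with descriptor d.
X = np.c_[D, np.ones(len(D))]                    # add a bias column
w, *_ = np.linalg.lstsq(X, v, rcond=None)

def L(d):
    return float(np.append(d, 1.0) @ w)

# Facing a new task M_k, seed the value estimate of each state from its descriptor,
# then refine it with a standard reinforcement learning algorithm.
new_task_descriptors = {"s0": np.random.rand(5), "s1": np.random.rand(5)}
V_init = {s: L(d) for s, d in new_task_descriptors.items()}
```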
26 3 A Rod Positioning Experiment In this section we empirically evaluate the potential benefits of a learned shaping function in a rod positioning task (Moore and Atkeson, 1993), where we add a simple artificial agent space that can be easily manipulated for experimental purposes. [sent-187, score-0.885]
27 The agent receives a reward of −1 for each action, and a reward of 1000 when reaching the goal (whereupon the current episode ends). [sent-195, score-0.886]
28 Since these beacons are present in every task, the sensor readings are an agent-space and we include an element in the agent that learns L and uses it to predict reward for each adjacent state given the five signal levels present there. [sent-198, score-0.709]
29 The usefulness of L as a reward predictor will depend on the relationship between beacon placement and reward across a sequence of individual rod positioning tasks. [sent-199, score-0.628]
30 01) in problem-space and used training tasks that were either 10x10 (where it was given 100 episodes to converge in each training task) or 15x15 (where it was given 150 episodes to converge), and tested in a 40x40 task. [sent-212, score-0.718]
31 Figure 3(a) shows the number of steps (averaged over 50 runs) required to first reach the goal in the test task, against the number of training tasks completed by the agent for the four types of learned shaping elements (linear and quadratic L, and either 10x10 or 15x15 training tasks). [sent-223, score-1.022]
32 This illustrates the difference in overall learning performance on a single new task between agents that have had many training experiences and agents that have not. [sent-233, score-0.906]
33 It also shows that the number of episodes required for convergence is roughly the same as that of an agent using a uniformly optimistic value table initialization of 500, and slightly longer than that of an agent using a uniformly pessimistic value table initialization of 0. [sent-242, score-1.032]
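For reference, the Uni500 and Uni0 baselines referred to here amount to nothing more than a uniformly initialised value table; a trivial sketch follows (the state set and initial values follow the text, the helper itself is ours):

```python
def make_value_table(states, init):
    """Uniform initialisation: 500 is optimistic for this domain
    (-1 per step, +1000 at the goal), 0 is pessimistic."""
    return {s: float(init) for s in states}

states = [(x, y) for x in range(40) for y in range(40)]   # 40x40 test task
V_uni500 = make_value_table(states, 500)   # optimistic ("Uni500")
V_uni0 = make_value_table(states, 0)       # pessimistic ("Uni0")
```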
34 [Figure: legend Uni0, Uni500, 10Lin, 10Quad, 15Lin, 15Quad; y-axis Steps to Goal (×10^4); x-axis Episodes in the Test Task.] (b) Steps to reward against episodes in the homing test task for agents that have completed 20 training tasks. [sent-255, score-0.946]
35 This suggests that extra information in the agent-space more than makes up for a shaping function being difficult to accurately represent—in all cases the performance of agents learning using the triangle beacon arrangement is better than that of those learning using the homing beacon arrangement. [sent-260, score-0.847]
36 [Figure: legend Uni0, Uni500, 10Lin, 10Quad, 15Lin, 15Quad; y-axis Steps to Goal (×10^4); x-axis Episodes in the Test Task.] (b) Steps to reward against episodes in the triangle test task for agents that have completed 20 training tasks. [sent-276, score-0.919]
37 [Figure: x-axis Episodes in the Test Task; axis-tick residue removed.] (b) Steps to reward against episodes in the test task for agents that have completed 20 training task episodes using a noisy signal. [sent-297, score-1.242]
38 We see that, although these agents often do worse than learning from scratch in the first episode, they subsequently do better when η < 1, and again converge at roughly the same rate as agents that use an optimistic initial value function. [sent-304, score-0.726]
39 However, the possible performance penalty for high η is more severe—an agent using a learned shaping function that rewards it for following a beacon signal may take nearly four times as long to first solve the test problem when that feature becomes random (at η = 1). [sent-311, score-0.864]
40 Again, however, when η < 1 the agents recover after their first episode to outperform agents that learn from scratch within the first few episodes. [sent-312, score-0.816]
41 5 SUMMARY The first two experiments above show that an agent able to learn its own shaping rewards through training can use even a few training experiences to significantly improve its initial policy in a novel task. [sent-315, score-0.996]
42 They also show that such training results in agents with convergence characteristics similar to those of agents using uniformly optimistic initial value functions. [sent-316, score-0.74]
43 Thus, an agent that learns its own shaping rewards can improve its initial speed at solving a task when compared to an agent that cannot, but it will not converge to an approximately optimal policy in less time (as measured in episodes). [sent-317, score-1.243]
44 Beyond that, however, agents may experience negative transfer where either noisy features or an imperfect or approximate set of agent-space features result in poor learned shaping functions. [sent-321, score-1.083]
45 [Figure: x-axis Episodes in the Test Task; axis-tick residue removed.] (b) Steps to reward against episodes in the test task for agents that have completed 20 training task episodes using features with imperfectly preserved semantics. [sent-344, score-1.261]
46 We use Keepaway to illustrate the use of learned shaping rewards on a standard but challenging benchmark that has been used in other transfer studies (Taylor et al. [sent-350, score-0.635]
47 Figure 9 shows sample learning curves for agents learning from scratch and agents using transferred knowledge from 5000 episodes of 3v2 Keepaway, demonstrating that agents that transfer knowledge start with better policies and learn faster. [sent-439, score-1.715]
48 5 Discussion The results presented above suggest that agents that employ reinforcement learning methods can be augmented to use their experience to learn their own shaping rewards. [sent-447, score-0.928]
49 It also creates the possibility of training such agents on easy tasks as a way of equipping them with knowledge that will make harder tasks tractable, and is thus an instance of an autonomous developmental learning system (Weng et al. [sent-449, score-0.702]
50 In some situations, the learning algorithm chosen to learn the shaping function, or the sensory patterns given to it, might result in an agent that is completely unable to learn anything useful. [sent-451, score-0.742]
51 We do not expect such an agent to do much worse than one without any shaping rewards at all. [sent-452, score-0.7]
52 Given that an agent facing a sequence of tasks receives many example transitions between pairs of agent-space descriptors, it may prove efficient to instead learn an approximate transition model in agent-space and then use that model to obtain a shaping function via planning. [sent-458, score-0.91]
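A deliberately simplified, action-free sketch of that alternative: count observed agent-space transitions, then sweep value iteration over the learned model and use the resulting values as the potential. Everything here (discretised descriptors, the count-based model, the reward callback) is an assumption for illustration, not the paper's method.

```python
from collections import defaultdict

def fit_transition_model(transitions):
    """Count-based model over (discretised) agent-space descriptors.

    `transitions` is an iterable of (d, d_next) pairs observed across tasks.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for d, d_next in transitions:
        counts[d][d_next] += 1
    return {d: {dn: c / sum(nxt.values()) for dn, c in nxt.items()}
            for d, nxt in counts.items()}

def plan_potential(model, reward, gamma=0.99, sweeps=100):
    """Value iteration over the learned model; the result can play the role of a
    potential (shaping) function over agent-space descriptors."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for d, nxt in model.items():
            V[d] = reward(d) + gamma * sum(p * V[dn] for dn, p in nxt.items())
    return dict(V)
```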
53 In reinforcement learning the state space is searched by the agent itself, but its initial value function (either directly or via a shaping function) acts to order the selection of unvisited nodes by the agent. [sent-461, score-0.91]
54 Therefore, we argue that reinforcement learning agents using non-uniform initial value functions are using something very similar to a heuristic, and those that are able to learn their own portable shaping functions are in effect able to learn their own heuristics. [sent-462, score-1.03]
55 Although this can lead to faster learning on later tasks in the same state space, learned options would be more useful if they could be reused in later tasks that are related but have distinct state spaces. [sent-550, score-0.757]
56 In this section we demonstrate empirically that an agent that learns portable options directly in agent-space can reuse those options in future related tasks to significantly improve performance. [sent-551, score-1.295]
57 The agent is also either given, or learns, a set of higher-level options to reduce the time required to solve the task. [sent-563, score-0.698]
58 Although the agent will be learning task and option policies in different spaces, both types of policies can be updated simultaneously as the agent receives both agent-space and problem-space descriptors at each state. [sent-567, score-1.036]
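A rough sketch of such a simultaneous update from a single transition, assuming tabular Q-functions, four primitive actions, and dictionaries keyed on the respective descriptors. It is simplified: the options here reuse the environment reward, and option termination and eligibility traces are omitted.

```python
def step_update(task_q, option_qs, problem_state, agent_obs, action, reward,
                next_problem_state, next_agent_obs,
                actions=range(4), alpha=0.1, gamma=0.99):
    """One transition updates the task policy in problem-space and each
    agent-space option policy from the same experience."""
    # Task-level update, keyed on the problem-space descriptor.
    old = task_q.get((problem_state, action), 0.0)
    target = reward + gamma * max(task_q.get((next_problem_state, a), 0.0) for a in actions)
    task_q[(problem_state, action)] = old + alpha * (target - old)

    # Option-level updates, keyed on the agent-space descriptor (off-policy).
    for q in option_qs:
        old_o = q.get((agent_obs, action), 0.0)
        target_o = reward + gamma * max(q.get((next_agent_obs, a), 0.0) for a in actions)
        q[(agent_obs, action)] = old_o + alpha * (target_o - old_o)
```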
59 To support learning a portable shaping function, an agent space should contain some features that are correlated to return across tasks. [sent-568, score-0.857]
60 Five pieces of data form a problem-space descriptor for any lightworld instance: the current room number, the x and y coordinates of the agent in that room, whether or not the agent has the key, and whether or not the door is open. [sent-593, score-1.012]
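The five fields listed above map naturally onto a small record type; a sketch (the type name and field names are ours):

```python
from typing import NamedTuple

class LightworldDescriptor(NamedTuple):
    """Problem-space descriptor for a lightworld instance (the five fields above)."""
    room: int         # current room number
    x: int            # agent x coordinate within the room
    y: int            # agent y coordinate within the room
    has_key: bool     # whether the agent holds the key
    door_open: bool   # whether the door is open

state = LightworldDescriptor(room=0, x=3, y=5, has_key=False, door_open=False)
```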
61 1 TYPES OF AGENT We used five types of reinforcement learning agents: agents without options, agents with problem-space options, agents with perfect problem-space options, agents with agent-space options, and agents with both option types. [sent-598, score-1.996]
62 The agents without options used Sarsa(λ) with ε-greedy action selection (α = 0. [sent-599, score-0.693]
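For readers unfamiliar with it, a compact tabular sketch of Sarsa(λ) with ε-greedy selection follows. The parameter values are placeholders rather than the (truncated) values quoted above, and Q and e are assumed to be defaultdict(float) tables.

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.05):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One on-policy Sarsa(lambda) update with accumulating eligibility traces."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    e[(s, a)] += 1.0                       # mark the visited state-action pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]   # credit recently visited pairs
        e[key] *= gamma * lam              # decay all traces
```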
63 Agents with perfect problem-space options were given options with pre-learned policies for each salient event, though they still performed option updates and were otherwise identical to the standard agent with options. [sent-612, score-1.154]
64 Finally, agents with both types of options were included to represent agents that learn both general portable and specific non-portable skills simultaneously. [sent-624, score-1.23]
65 To evaluate the performance of agent-space options as the agents gained more experience, we similarly obtained 1000 lightworld samples and test tasks, but for each test task we ran the agents once without training and then with between 1 and 10 training experiences. [sent-632, score-1.235]
66 Each training experience for a test lightworld task consisted of 100 episodes in a training lightworld randomly selected from the remaining 99. [sent-633, score-0.685]
67 Although the agents updated their options during evaluation in the test lightworld, these updates were discarded before the next training experience so the agent-space options never received prior training in the test lightworld. [sent-634, score-1.118]
68 3 RESULTS Figure 11(a) shows average learning curves for agents employing problem-space options, and Figure 11(b) shows the same for agents employing agent-space options. [sent-637, score-0.71]
69 The first time an agent-space option agent encounters a lightworld, it performs similarly to an agent without options (as evidenced by the two topmost learning curves in each figure), but its performance rapidly improves with experience in other lightworlds. [sent-638, score-1.272]
70 (PO), agent-space options with 0-10 training experiences (dark bars), and both option types with 0-10 training experiences (light bars). [sent-644, score-0.672]
71 The first time such agents encounter a lightworld, they perform as well as agents using problem-space options. [sent-647, score-0.688]
72 This explains why agents using only agent-space options and no training experiences perform more like agents without options than like agents with problem-space options. [sent-652, score-1.795]
73 Second, options learned in our problem-space can represent exact solutions to specific subgoals, whereas options learned in our agent-space are general and must be approximated, and are therefore likely to be slightly less efficient for any specific subgoal. [sent-653, score-0.714]
74 This explains why agents using both types of options perform better in the long run than agents using only agent-space options. [sent-654, score-0.996]
75 Figure 11(d) shows the mean total number of steps required over 70 episodes for agents using no options, problem-space options, perfect options, agent-space options, and both option types. [sent-655, score-0.72]
76 It also clearly shows that agents using both types of options do consistently better than those using agent-space options alone. [sent-657, score-0.96]
77 3 The Conveyor Belt Domain In the previous section we showed that an agent can use experience in related tasks to learn portable options, and that those options can improve performance in later tasks, when the agent has a highdimensional agent-space. [sent-663, score-1.503]
78 , 2000), it did not occur during the same number of samples obtained for agents with agent-space options only. [sent-674, score-0.652]
79 We ran experiments where the agents learned three options: one to move the current object to the bin at the end of the belt it is currently on, one to move it to the belt above it, and one to move it to the belt below it. [sent-683, score-0.739]
80 We used the same agent types and experimental structure as before, except that the agent-space options did not use function approximation. [sent-684, score-0.698]
81 1 RESULTS Figures 13(a), 13(b) and 13(c) show learning curves for agents employing no options, problem-space options and perfect options; agents employing agent-space options; and agents employing both types of options, respectively. [sent-687, score-1.394]
82 Figure 13(b) shows that the agents with agent-space options and no prior experience initially improve quickly but eventually obtain lower quality solutions than agents with problem-space options (Figure 13(a)). [sent-688, score-1.396]
83 One or two training experiences result in roughly the same curve as agents using problem-space options, but by 5 training experiences the agent-space options are a significant improvement (although due to their limited range they are never as good as perfect options). [sent-689, score-0.978]
84 This initial dip relative to agents with no prior experience is probably due to the limited range of the agent-space options (due to the limited range of the camera) and the fact that they are only locally Markov, even for their own subgoals. [sent-690, score-0.763]
85 no options (NO), learned problem-space options (LO), perfect options (PO), agent-space options with 0-10 training experiences (dark bars), and both option types with 0-10 training experiences (light bars). [sent-694, score-1.369]
86 Figure 13(c) shows that agents with both option types do not experience this initial dip relative to agents with no prior experience and outperform problem-space options immediately, most likely because the agent-space options are able to generalise across belts. [sent-696, score-1.6]
87 4 Summary Our results show that options learned in agent-space can be successfully transferred between related tasks, and that this significantly improves performance in sequences of tasks where the agent space cannot be used for learning directly. [sent-702, score-0.929]
88 Our results suggest that when the agent space is large but can support global policies, experience in related tasks can eventually result in options that perform as well as perfect problem-specific options. [sent-703, score-0.97]
89 When the agent space is only locally Markov, learned portable options will improve performance but are unlikely to reach the performance of perfect problem-specific options due to their limited range. [sent-704, score-1.251]
90 In such situations, learning both problem-specific and agent space options simultaneously will likely obtain better performance than either individually. [sent-706, score-0.698]
91 Konidaris and Hayes (2004) describe a similar method to ours that uses training tasks to learn associations between reward and strong signals at reward states, resulting in a significant improvement in the total reward obtained by a simulated robot learning to find a puck in a novel maze. [sent-745, score-0.859]
92 These methods learn options in the same state space in which the agent is performing reinforcement learning, and thus the options can only be reused for the same problem or for a new problem in the same space. [sent-759, score-1.275]
93 Discussion The work in the preceding sections has shown that both knowledge and skill transfer can be effected across a sequence of tasks through the use of features common to all tasks in the sequence. [sent-778, score-0.662]
94 When learning portable shaping functions, if the action space differs across tasks then we can simply learn shaping functions defined over states only (as potential-based shaping functions were originally defined by Ng et al., 1999). [sent-791, score-1.259]
95 Although we expect that learning using portable state-only shaping functions will not perform as well as learning using portable state-action shaping functions, we nevertheless expect that they will result in substantial performance gains for reward-linked tasks. [sent-793, score-0.85]
96 More broadly, given a set of previously learned related tasks that are not reward-linked, which one should we use as the source of a portable shaping function for a new related task? [sent-823, score-0.65]
97 If the method used to compare the two reward functions returns a distance metric, then the agent could use it to cluster portable shaping functions and build libraries of them, drawing on an appropriate one for each new task it encounters. [sent-827, score-1.086]
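A small sketch of what such a library lookup could look like; the distance metric, the threshold, and the library layout are all assumptions rather than anything specified in the text.

```python
def select_shaping_function(library, new_reward, distance, threshold=1.0):
    """Return the stored shaping function whose source reward function is closest
    to the new task's estimated reward function, or None if nothing is close
    enough and a fresh shaping function should be learned instead.

    `library` is a list of (source_reward, shaping_fn) pairs; `distance` is any
    metric over reward functions.
    """
    if not library:
        return None
    src, shaping = min(library, key=lambda entry: distance(entry[0], new_reward))
    return shaping if distance(src, new_reward) <= threshold else None
```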
98 However, we do not believe the complete absence of prior information about a task is representative of applied reinforcement learning settings where the agent must solve multiple tasks sequentially. [sent-830, score-0.783]
99 Summary and Conclusions We have presented a framework for transfer in reinforcement learning based on the idea that related tasks share common features and that transfer can take place through functions defined over those related features. [sent-835, score-0.893]
100 Learning relational options for inductive transfer in relational reinforcement learning. [sent-880, score-0.758]
wordName wordTfidf (topN-words)
[('agent', 0.39), ('agents', 0.344), ('options', 0.308), ('shaping', 0.284), ('transfer', 0.276), ('episodes', 0.252), ('reward', 0.2), ('reinforcement', 0.174), ('keepaway', 0.166), ('tasks', 0.148), ('portable', 0.141), ('barto', 0.137), ('konidaris', 0.134), ('experiences', 0.114), ('cheidwasser', 0.102), ('einforcement', 0.102), ('hared', 0.102), ('lightworld', 0.102), ('ransfer', 0.102), ('beacon', 0.096), ('belt', 0.096), ('experience', 0.092), ('eatures', 0.087), ('episode', 0.075), ('keepers', 0.075), ('task', 0.071), ('option', 0.07), ('keeper', 0.07), ('rod', 0.064), ('policy', 0.063), ('descriptor', 0.06), ('conveyor', 0.059), ('skills', 0.059), ('learned', 0.049), ('possession', 0.048), ('skill', 0.048), ('policies', 0.046), ('robot', 0.044), ('sutton', 0.044), ('state', 0.043), ('takers', 0.043), ('action', 0.041), ('actions', 0.036), ('door', 0.035), ('room', 0.035), ('transferred', 0.034), ('learn', 0.034), ('earning', 0.033), ('training', 0.033), ('perfect', 0.032), ('sensors', 0.032), ('precup', 0.032), ('semantics', 0.032), ('environment', 0.03), ('sensor', 0.03), ('autonomous', 0.029), ('stone', 0.028), ('source', 0.028), ('positioning', 0.027), ('shared', 0.027), ('agentspace', 0.027), ('beacons', 0.027), ('belts', 0.027), ('homing', 0.027), ('simsek', 0.027), ('usefully', 0.027), ('rewards', 0.026), ('across', 0.023), ('reach', 0.023), ('descriptors', 0.023), ('ferguson', 0.023), ('ball', 0.023), ('bin', 0.022), ('curves', 0.022), ('steps', 0.022), ('jonsson', 0.021), ('ravindran', 0.021), ('wilson', 0.021), ('goal', 0.021), ('lock', 0.021), ('moore', 0.021), ('taylor', 0.021), ('mappings', 0.021), ('states', 0.02), ('mapping', 0.02), ('features', 0.019), ('completed', 0.019), ('initial', 0.019), ('scratch', 0.019), ('termination', 0.019), ('mdps', 0.019), ('light', 0.019), ('signal', 0.019), ('noise', 0.019), ('ave', 0.018), ('reused', 0.018), ('placement', 0.018), ('lo', 0.018), ('po', 0.018), ('moving', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
2 0.13834541 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
3 0.095492341 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
4 0.091963008 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
5 0.090976998 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
6 0.078056417 34 jmlr-2012-Dynamic Policy Programming
7 0.071786501 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
8 0.038318288 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
9 0.027683331 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
10 0.026843963 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
11 0.024695925 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
12 0.02466001 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
13 0.024538824 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
14 0.021811498 110 jmlr-2012-Static Prediction Games for Adversarial Learning Problems
15 0.021781418 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
16 0.021490218 97 jmlr-2012-Regularization Techniques for Learning with Matrices
17 0.020394634 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
18 0.020182407 106 jmlr-2012-Sign Language Recognition using Sub-Units
19 0.019659201 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
20 0.019258881 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
topicId topicWeight
[(0, -0.116), (1, -0.036), (2, 0.133), (3, -0.171), (4, 0.038), (5, -0.189), (6, -0.147), (7, -0.117), (8, 0.001), (9, 0.206), (10, -0.155), (11, 0.078), (12, 0.055), (13, -0.127), (14, -0.038), (15, -0.055), (16, -0.006), (17, 0.097), (18, 0.004), (19, 0.004), (20, 0.08), (21, 0.11), (22, 0.002), (23, -0.05), (24, 0.101), (25, 0.034), (26, -0.054), (27, 0.221), (28, -0.124), (29, 0.176), (30, -0.097), (31, 0.004), (32, -0.019), (33, 0.029), (34, 0.038), (35, 0.059), (36, 0.025), (37, -0.169), (38, 0.057), (39, 0.034), (40, -0.093), (41, -0.002), (42, -0.059), (43, 0.068), (44, -0.011), (45, -0.03), (46, -0.111), (47, -0.085), (48, -0.01), (49, 0.175)]
simIndex simValue paperId paperTitle
same-paper 1 0.96835262 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
2 0.77161998 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
3 0.55031997 1 jmlr-2012-A Case Study on Meta-Generalising: A Gaussian Processes Approach
Author: Grigorios Skolidis, Guido Sanguinetti
Abstract: We propose a novel model for meta-generalisation, that is, performing prediction on novel tasks based on information from multiple different but related tasks. The model is based on two coupled Gaussian processes with structured covariance function; one model performs predictions by learning a constrained covariance function encapsulating the relations between the various training tasks, while the second model determines the similarity of new tasks to previously seen tasks. We demonstrate empirically on several real and synthetic data sets both the strengths of the approach and its limitations due to the distributional assumptions underpinning it. Keywords: transfer learning, meta-generalising, multi-task learning, Gaussian processes, mixture of experts
4 0.34242746 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
5 0.33690834 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
Author: Daniel J. Lizotte, Michael Bowling, Susan A. Murphy
Abstract: We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a realworld decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support. Keywords: reinforcement learning, dynamic programming, decision making, linear regression, preference elicitation
6 0.30858469 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
7 0.28637663 34 jmlr-2012-Dynamic Policy Programming
8 0.18542191 114 jmlr-2012-Towards Integrative Causal Analysis of Heterogeneous Data Sets and Studies
9 0.18485555 30 jmlr-2012-DARWIN: A Framework for Machine Learning and Computer Vision Research and Development
10 0.18138191 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
11 0.15474105 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
12 0.15338269 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
13 0.15053411 3 jmlr-2012-A Geometric Approach to Sample Compression
14 0.14293846 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
15 0.13660881 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
16 0.13638383 57 jmlr-2012-Learning Symbolic Representations of Hybrid Dynamical Systems
17 0.13498865 50 jmlr-2012-Human Gesture Recognition on Product Manifolds
18 0.120718 73 jmlr-2012-Multi-task Regression using Minimal Penalties
19 0.1184897 78 jmlr-2012-Nonparametric Guidance of Autoencoder Representations using Label Information
20 0.11571088 32 jmlr-2012-Discriminative Hierarchical Part-based Models for Human Parsing and Action Recognition
topicId topicWeight
[(7, 0.015), (21, 0.029), (26, 0.037), (27, 0.015), (29, 0.039), (35, 0.021), (48, 0.402), (56, 0.02), (57, 0.056), (64, 0.013), (69, 0.02), (75, 0.048), (77, 0.015), (79, 0.014), (81, 0.01), (92, 0.081), (96, 0.084)]
simIndex simValue paperId paperTitle
same-paper 1 0.72607708 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
Author: George Konidaris, Ilya Scheidwasser, Andrew Barto
Abstract: We present a framework for transfer in reinforcement learning based on the idea that related tasks share some common features, and that transfer can be achieved via those shared features. The framework attempts to capture the notion of tasks that are related but distinct, and provides some insight into when transfer can be usefully applied to a problem sequence and when it cannot. We apply the framework to the knowledge transfer problem, and show that an agent can learn a portable shaping function from experience in a sequence of tasks to significantly improve performance in a later related task, even given a very brief training period. We also apply the framework to skill transfer, to show that agents can learn portable skills across a sequence of tasks that significantly improve performance on later related tasks, approaching the performance of agents given perfectly learned problem-specific skills. Keywords: reinforcement learning, transfer, shaping, skills
2 0.34758458 70 jmlr-2012-Multi-Assignment Clustering for Boolean Data
Author: Mario Frank, Andreas P. Streich, David Basin, Joachim M. Buhmann
Abstract: We propose a probabilistic model for clustering Boolean data where an object can be simultaneously assigned to multiple clusters. By explicitly modeling the underlying generative process that combines the individual source emissions, highly structured data are expressed with substantially fewer clusters compared to single-assignment clustering. As a consequence, such a model provides robust parameter estimators even when the number of samples is low. We extend the model with different noise processes and demonstrate that maximum-likelihood estimation with multiple assignments consistently infers source parameters more accurately than single-assignment clustering. Our model is primarily motivated by the task of role mining for role-based access control, where users of a system are assigned one or more roles. In experiments with real-world access-control data, our model exhibits better generalization performance than state-of-the-art approaches. Keywords: clustering, multi-assignments, overlapping clusters, Boolean data, role mining, latent feature models
3 0.33243111 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
Author: Tobias Lang, Marc Toussaint, Kristian Kersting
Abstract: A fundamental problem in reinforcement learning is balancing exploration and exploitation. We address this problem in the context of model-based reinforcement learning in large stochastic relational domains by developing relational extensions of the concepts of the E 3 and R- MAX algorithms. Efficient exploration in exponentially large state spaces needs to exploit the generalization of the learned model: what in a propositional setting would be considered a novel situation and worth exploration may in the relational setting be a well-known context in which exploitation is promising. To address this we introduce relational count functions which generalize the classical notion of state and action visitation counts. We provide guarantees on the exploration efficiency of our framework using count functions under the assumption that we had a relational KWIK learner and a near-optimal planner. We propose a concrete exploration algorithm which integrates a practically efficient probabilistic rule learner and a relational planner (for which there are no guarantees, however) and employs the contexts of learned relational rules as features to model the novelty of states and actions. Our results in noisy 3D simulated robot manipulation problems and in domains of the international planning competition demonstrate that our approach is more effective than existing propositional and factored exploration techniques. Keywords: reinforcement learning, statistical relational learning, exploration, relational transition models, robotics
4 0.32892817 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.32321921 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
6 0.32195252 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
7 0.32172105 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
8 0.31741518 80 jmlr-2012-On Ranking and Generalization Bounds
9 0.31688502 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
10 0.31555048 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
11 0.31504321 72 jmlr-2012-Multi-Target Regression with Rule Ensembles
12 0.31437135 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
13 0.31390238 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
14 0.31319869 103 jmlr-2012-Sampling Methods for the Nyström Method
15 0.31229499 13 jmlr-2012-Active Learning via Perfect Selective Classification
16 0.31209287 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
17 0.31203696 73 jmlr-2012-Multi-task Regression using Minimal Penalties
18 0.3114458 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
19 0.31070036 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
20 0.30999899 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms