nips nips2007 nips2007-100 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Máté Lengyel, Peter Dayan
Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. [sent-8, score-0.433]
2 Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. [sent-9, score-0.306]
3 We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. [sent-10, score-1.087]
4 We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. [sent-11, score-0.983]
5 We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. [sent-12, score-0.42]
6 1 Introduction What use is an episodic memory? [sent-13, score-0.631]
7 However, why should it be better to act on the basis of the recollection of single happenings, rather than the seemingly normative use of accumulated statistics from multiple events? [sent-15, score-0.166]
8 The task of building such a statistical model is normally the dominion of semantic memory [2], the other main form of declarative memory. [sent-16, score-0.257]
9 Issues of this kind are frequently discussed under the rubric of multiple memory systems [3, 4]; here we consider it from a normative viewpoint in which memories are directly used for control. [sent-17, score-0.261]
10 Our answer to the initial question is the computational challenge of using a semantic memory as a forward model in sequential decision tasks in which many actions must be taken before a goal is reached [5]. [sent-18, score-0.509]
11 Forward and backward search in the tree of actions and consequent states (i.e., model-based reinforcement learning [6]) in such domains impose crippling demands on working memory (to store partial evaluations), and it may not even be possible to expand out the tree in reasonable time. [sent-19, score-0.486]
12 If we think of the inevitable resulting errors in evaluation as a form of computational noise or uncertainty, then the use of the semantic memory for control will be expected to be subject to substantial error. [sent-20, score-0.59]
13 The main task for this paper is to explore and understand the circumstances under which episodic control, although seemingly less efficient in its use of experience, should be expected to be more accurate, and therefore be evident both psychologically and neurally. [sent-21, score-0.704]
14 This argument about episodic control exactly parallels one recently made for habitual or cached control [5]. [sent-22, score-1.244]
15 It is therefore optimal to employ cached control rather than model-based control only after sufficient experience, when the inaccuracy of the former, which decreases over the course of learning, is outweighed by the computational noise induced in using the latter. [sent-25, score-0.667]
16 We will show that in general, just as model-free control is better than model-based control after substantial experience, episodic control is better than model-based control after only very limited experience. [sent-27, score-1.487]
17 For some classes of environments, these two other controllers significantly squeeze the domain of optimal use of semantic control. [sent-28, score-0.224]
18 It was argued [5] that the transition from model-based to model-free control explains a wealth of psychological observations about the transition over the course of learning from goal-directed control (which is considered to be model-based) to habitual control (which is model-free). [sent-31, score-0.928]
19 In turn, this is associated with an apparent functional segregation between the dorsolateral prefrontal cortex and dorsomedial striatum, implementing model-based control, and the dorsolateral striatum (and its neuromodulatory inputs), implementing model-free control. [sent-32, score-0.362]
20 Exactly how the uncertainties associated with these two types of control are calculated is not clear, although it is known that the prelimbic and infralimbic cortices are somehow involved in arbitration. [sent-33, score-0.271]
21 The psychological construct for episodic control is obvious; its neural realization is likely to be the hippocampus and medial temporal cortical regions. [sent-34, score-1.034]
22 How arbitration might work for this third controller is also not clear, although there have been suggestions on how uncertainty may be represented neurally in the hippocampus [8]. [sent-35, score-0.548]
23 In this paper, we explore the nature and (f)utility of episodic control. [sent-37, score-0.631]
24 2 Paradigm for analysis We seek to analyse computational and statistical trade-offs that arise in choosing actions that maximize long-term rewards in sequential decision making problems. [sent-41, score-0.29]
25 We characterize these tasks as Markov decision processes (MDPs) [6] whose transition and reward structure are initially unknown by the subject, but are drawn from a parameterized prior that is known. [sent-43, score-0.18]
26 The key question is how well different possible control strategies can perform given this prior and a measured amount of experience. [sent-44, score-0.2]
27 Like [11], we simplify exploration using a form of parallel sampling model in order to focus on the ability of controllers to exploit knowledge extracted about an environment. [sent-45, score-0.207]
28 Performance is naturally measured using the average reward that would be collected in a trial; this average is then itself averaged over draws of the MDP and the stochasticity associated with the exploratory actions. [sent-46, score-0.181]
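As a concrete illustration of this paradigm, the following Python sketch draws a tree-structured MDP from a simple parameterized prior and scores a policy by the average leaf reward per trial, averaged over draws of the MDP. The Dirichlet/Gaussian prior and all parameter values here are illustrative assumptions, not the paper's exact generative model.

import numpy as np

rng = np.random.default_rng(0)
D, B, A = 2, 3, 4                                         # depth, branching factor, actions (cf. Figure 1A)

def draw_tmdp():
    """Sample one tree-structured MDP (tMDP) from an assumed prior."""
    n_internal = sum(B**d for d in range(D))              # number of non-terminal states
    P = rng.dirichlet(np.ones(B), size=(n_internal, A))   # per-(state, action) child distributions
    r = rng.normal(0.0, 1.0, size=B**D)                   # rewards at the terminal states
    return P, r

def rollout(P, r, policy):
    """Run one trial from the root; policy(state) -> action. Returns the leaf reward."""
    local, base = 0, 0                                    # index within current level, level offset
    for d in range(D):
        a = policy(base + local)
        child = rng.choice(B, p=P[base + local, a])
        base += B**d
        local = local * B + child
    return r[local]

def mean_performance(policy, n_mdps=100, n_trials=100):
    """Average reward per trial, averaged over draws of the MDP."""
    total = 0.0
    for _ in range(n_mdps):
        P, r = draw_tmdp()
        total += sum(rollout(P, r, policy) for _ in range(n_trials))
    return total / (n_mdps * n_trials)

print(mean_performance(lambda s: int(rng.integers(A))))   # uniform-random baseline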
29 We analyse three controllers: a model-based controller without computational noise, which provides a theoretical upper limit on performance, a realistic model-based controller with computational noise that we regard as the model of semantic memory-based control, and an ‘episodic controller’. [sent-47, score-0.507] [sent-64, score-0.603]
30 Figure 1: A, An example tree-structured MDP, with depth D = 2, branching factor B = 3, and A = 4 available actions in each non-terminal state. [sent-55, score-0.374]
31 The horizontal stacked bars in the boxes of the left and middle columns show the transition probabilities for different actions at non-terminal states, color-coded by the successor states to which they lead (matching the color of the corresponding arrows). [sent-56, score-0.252]
33 Actions lead to further states (and potentially rewards), from where further possible actions and thus states become available, and so on. [sent-67, score-0.194]
34 3 The model-based controller In our paradigm, the task for the model-based controllers is to use the data from the exploratory trials to work out posterior distributions over the unknown transition and reward structure of the tMDP, and then report the best action at each state. [sent-71, score-0.79]
35 First, we consider the model-based controller in the case that it has experienced so many samples that the parameters of the tMDP are known exactly. [sent-76, score-0.397]
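For concreteness, here is a minimal backward-induction sketch of this idealized computation, continuing the setup above; it is an assumed reconstruction, not the authors' implementation. Given the (here, exactly known) transition probabilities P and leaf rewards r, it sweeps the tree from the leaves to the root to obtain action values at every non-terminal state.

def backward_induction(P, r):
    """Exact action values for a tMDP by dynamic programming from the leaves up."""
    V = r.copy()                                   # values of the level below (leaves first)
    Q_levels = []
    base = sum(B**d for d in range(D))
    for d in reversed(range(D)):
        base -= B**d                               # first global index of level d
        Q = np.zeros((B**d, A))
        for i in range(B**d):
            Q[i] = P[base + i] @ V[i*B:(i + 1)*B]  # expected value of each action
        V = Q.max(axis=1)                          # act optimally at this level
        Q_levels.append(Q)
    return Q_levels[::-1]                          # per-level Q tables, root first

Q_root = backward_induction(*draw_tmdp())[0]       # shape (1, A): action values at the root
best_action_at_root = int(Q_root[0].argmax())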
36 Second, we approximate the impact of incomplete exploration by corrupting the controller by an aliquot of noise whose magnitude is determined by the parameters of the problem. [sent-78, score-0.594]
37 Equation 1 is actually an interesting result in and of itself – it indicates the extent to which the controller can take advantage of the variability (µ̄ − µ̄r ∝ σr) in boosting its expected return from the root node as a function of the depth of the tree. [sent-85, score-0.421]
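Equation 1 itself is not reproduced in this extract, but its qualitative content, that the expected advantage of the best of A options over an average one scales with the reward spread σr, is easy to check numerically; the snippet below assumes i.i.d. Gaussian rewards purely for illustration.

for sigma_r in (0.5, 1.0, 2.0):
    draws = rng.normal(0.0, sigma_r, size=(100_000, A))   # A options per simulated trial
    gain = draws.max(axis=1).mean()                       # E[best option] minus the mean reward (0 here)
    print(f"sigma_r = {sigma_r:.1f}  gain = {gain:.3f}  gain/sigma_r = {gain/sigma_r:.3f}")
# The last column is approximately constant: the gain is proportional to sigma_r.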
38 The second step is to observe that we expect the benefits of episodic control to be most apparent given very limited exploratory experience. [sent-86, score-1.02]
39 To make analytical progress, we are forced to make the significant assumption that the effects of this can be modeled by assuming that the controller does not have access to the true values of actions, but only to ‘noisy’ versions. [sent-87, score-0.504]
40 This ‘noise’ comes from the fact that computing the values of different actions is based on estimates of transition probability and reward distributions. [sent-88, score-0.274]
41 We have been able to show that the form of the resulting ‘noise’ in the action values can have the effect of scaling down the true values of actions at states by a factor φ1 and adding extra noise φ2 . [sent-90, score-0.316]
42 Figure 1B shows the learning curve for the model-based controller computed using our analytical predictions (blue line) and using exhaustive numerical simulations (red line, average performance in 100 sample tMDPs, with the learning process rerun 100 times in each). [sent-92, score-0.486]
43 The dark blue solid curve in figure 2A (labelled η2 = 0) shows the performance of model-based control as a function of the number of exploration samples (the equivalent of the dark blue curve in figure 1B, but for A = 4 rather than A = 3). [sent-94, score-0.339]
44 The final step is to model the effects of the computational complexity of the model-based controller on performance arising from the severe demands it places on such facets as working memory. [sent-97, score-0.537]
45 We treat the effects of all approximations by forcing the controller to have access to only noisy versions of the (exploration-limited) action values. [sent-99, score-0.559]
46 Note that whereas the terms φ1, φ2 characterizing the effects of learning are determined by the number of samples, η1 and η2 are set by hand to capture the assumed effects on inference of the computational complexity. [sent-101, score-0.197]
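A hedged sketch of the resulting two-stage corruption, continuing the code above: the controller picks the argmax of action values that have been scaled and jittered once for limited experience (φ) and once for computational noise (η). In the paper φ1, φ2 are derived from the number of samples and η1, η2 are set by hand; here all four are simply free parameters of the illustration.

def noisy_choice(Q_true, phi1=0.8, phi2=0.3, eta1=1.0, eta2=1.0):
    """Choose an action from doubly corrupted values: learning noise, then inferential noise."""
    Q_learned = phi1 * Q_true + phi2 * rng.normal(size=Q_true.shape)   # effect of limited exploration
    Q_used = eta1 * Q_learned + eta2 * rng.normal(size=Q_true.shape)   # effect of computational complexity
    return int(np.argmax(Q_used))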
47 The asymptotic values of the curves in figure 2A for various values of η2 (for all of them, η1 = 1) demonstrate the effects of inferential noise on performance. [sent-102, score-0.331]
48 Figure 2 (x-axis: learning time): A, Learning curves for the model-based controller at different levels of computational noise: η1 = 1, η2 is increased from 0 to 3. [sent-126, score-0.487]
49 The approximations used for computing these curves are less accurate in the low-noise limit, hence the paradoxical slight decrease in the performance of the perfect controller (without noise) at the end of learning. [sent-127, score-0.497]
50 B, Performance of noisy controllers normalized by that of the perfect controller in the same environment at the same amount of experience. [sent-129, score-0.647]
51 So far, we have separately considered the effects of computational noise and uncertainty due to limited experience. [sent-133, score-0.266]
52 The full plots in figure 2A, B show the interaction of these two factors (figure 2B shows the same data as figure 2A, but scaled to the performance of the noise-free controller for the given amount of experience). [sent-135, score-0.372]
53 Computational noise not only makes the asymptotic performance worse, by simply down-scaling average rewards, but it also makes learning effectively slower. [sent-136, score-0.163]
54 This is because the adverse effects of computational noise depend on the differences between the values of possible actions. [sent-137, score-0.206]
55 However, if action values appear roughly the same, then a little noise can easily change their ordering and make the controller choose a suboptimal one. [sent-139, score-0.523]
56 Little experience only licenses small apparent differences between values, and this boosts the corrupting effect of the inferential noise. [sent-140, score-0.224]
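A tiny numerical illustration of this point (the values and noise level are arbitrary choices): the same amount of noise is far more damaging when the true action values are close than when they are well separated.

def p_suboptimal(Q_true, noise_sd, n=100_000):
    """Probability that additive noise flips the choice to a suboptimal action."""
    Q_seen = Q_true + rng.normal(0.0, noise_sd, size=(n, len(Q_true)))
    return float(np.mean(Q_seen.argmax(axis=1) != np.argmax(Q_true)))

print(p_suboptimal(np.array([1.0, 0.9]), 0.2))   # close values: wrong roughly a third of the time
print(p_suboptimal(np.array([1.0, 0.0]), 0.2))   # well-separated values: wrong almost never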
57 Given more experience, the controller increasingly learns to make distinctions between different actions that looked the same a priori. [sent-141, score-0.508]
58 How much experience constitutes 'little' and how much noise counts as 'much' is of course relative to the complexity of the environment. [sent-143, score-0.239]
59 4 Episodic control If model-based control is indeed crippled by computational noise given limited exploration, could there be an effective alternative? [sent-144, score-0.574]
60 Although outside the scope of our formal analysis, this is particularly important given the ubiquity of non-stationary environments [13], for which the effects of continual change bound the effective number of exploratory samples. [sent-145, score-0.196]
61 That the cache-based or habitual controller is even worse in this limit (since it learns by bootstrapping) was a main rationale for the uncertainty-based account of the transfer from goal-directed to habitual control suggested by Daw et al. [5]. [sent-146, score-0.96]
62 Thus the habitual controller cannot step into the breach. [sent-147, score-0.514]
63 It is here that we expect episodic control to be most useful. [sent-148, score-0.857]
64 Intuitively, if a subject has experienced a complex environment just a few times, and found a sequence of actions that works reasonably well, then, provided that exploitation is at a premium over exploration, it seems obvious for the subject simply to replay that sequence. [sent-149, score-0.242]
65 [Figure 3 caption fragment: Solid red line shows the performance of noisy model-based control (η2 = 2), blue line shows that of episodic control. [sent-169, score-1.017]
66 Dashed red line shows the case of perfect model-based control, which constitutes the best performance that could possibly be achieved.] [sent-170, score-0.304]
67 This act of replaying a particular sequence of events from the past is exactly an instance of episodic control. [sent-173, score-0.659]
68 We expect such a strategy to be useful in the low data limit because, unlike in cache-based control, there is no issue of bootstrapping and temporal credit assignment, and unlike in model-based control, there is no exhaustive tree-search involved in action selection. [sent-176, score-0.208]
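A minimal sketch of such an episodic controller, following the description here and its restatement in the Discussion (an assumed reconstruction, reusing rng and A from the setup above, not the authors' simulation code): for each state, store the single action whose episode led to the best outcome observed so far and replay it, falling back on a random action in states never yet visited.

class EpisodicController:
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.best = {}                                # state -> (best return so far, action taken)

    def act(self, state):
        if state in self.best:
            return self.best[state][1]                # replay the best remembered action
        return int(rng.integers(self.n_actions))      # never visited: act at random

    def record(self, episode, final_return):
        """Store one trial; episode is a list of (state, action) pairs."""
        for state, action in episode:
            if state not in self.best or final_return > self.best[state][0]:
                self.best[state] = (final_return, action)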
69 Of course its advantages will be ultimately counteracted by the haphazardness of using single samples that are ‘adequate’, but by that time the other controllers can take over. [sent-177, score-0.174]
70 Although we expect our approximate analytical methods to provide some insight into its characteristics, we have so far only been able to use simulations to study the episodic controller in the usual class of tMDPs. [sent-178, score-1.143]
71 Comparing the blue (episodic) and red (model-based, but noisy; η2 = 2) curves, in figure 3A-C, it is apparent that episodic control indeed outperforms noisy model-based control in the low data limit. [sent-179, score-1.139]
72 The dashed curves show the performance of the idealized model-based controller that is noise-free. [sent-180, score-0.412]
73 This emphasizes the arbitrariness of our choice of noise level – the greater the noise, the longer the dominance of episodic control. [sent-181, score-0.733]
74 However, in complicated environments, even very small amounts of noise are near-catastrophic for model-based control (see brown line in Fig. [sent-182, score-0.333]
75 At the same level of computational noise, episodic control supplants model-based control for increasing volumes of exploratory samples. [sent-186, score-1.167]
76 We expect that the same is true if the complexity of the environment is increased by increasing the depth of the tree (D) instead, or as well. [sent-187, score-0.215]
77 Figure 3A-C also makes the point that the asymptotic performance of the episodic controller is rather poor, and is barely improved by extra learning. [sent-188, score-1.064]
78 A smarter episodic strategy, perhaps involving reconsolidation to eliminate unfortunate sample trajectories, might perform more competently. [sent-189, score-0.631]
79 5 Discussion An episodic controller operates by remembering for each state the single action that led to the best outcome so far observed. [sent-190, score-1.101]
80 Here, we studied the nature and benefits of episodic control. [sent-191, score-0.631]
81 This controller is statistically inefficient for solving Markov decision problems compared with the normative strategy of building a statistical forward model of the transitions and outcomes, and searching for the optimal action. [sent-192, score-0.579]
82 However, episodic control is computationally very straightforward, and therefore does not suffer from any excess uncertainty or noise arising from the severe calculational and search complexities of the forward model. [sent-193, score-1.091]
83 This implies that it can outperform forward-model control under various circumstances. [sent-194, score-0.268]
84 We then used theoretical and empirical methods to analyze the statistical structure of control based on a forward model in the face of limited data. [sent-196, score-0.299]
85 We showed that this control can readily be outperformed by an episodic controller which does not suffer from computational inaccuracy, at least in the particular limits of high task complexity and significant inferential noise in the modelbased controller. [sent-197, score-1.464]
86 We also showed how the noise in the latter has a particularly pernicious effect on the course of learning, corrupting the choice between actions whose values appear, because of limited experience, closer than they actually are. [sent-198, score-0.356]
87 Our analysis paralleled that of [5], who showed that the noisy forward-model controller is also beaten by a cached (actor-critic-like) controller in the opposite limit of substantial experience in an environment. [sent-203, score-1.001]
88 The cached controller is also computationally straightforward, but relies on a completely different structure of learning and inference. [sent-204, score-0.443]
89 In psychological terms, the episodic controller is best thought of as being goal-directed, since the ultimate outcome forms part of the episode that is recalled. [sent-205, score-1.076]
90 Unfortunately, this makes it difficult to distinguish behaviorally from goal-directed control resulting from the forward model. [sent-206, score-0.268]
91 In neural terms, the episodic controller is likely to rely on the very well investigated systems involved in episodic memory, namely the hippocampus and medial temporal cortices. [sent-207, score-1.832]
92 Importantly, there is direct evidence of the transfer of control from hippocampal to striatal structures over the course of learning [9, 10], and there is some evidence that episodic and habitual control can be simultaneously active. [sent-208, score-1.396]
93 Unfortunately, there are few data [14] on structures that might control the competition or transfer process, and no test as to whether there is an intermediate phase in which prefrontal mechanisms instantiating the forward model might be dominant. [sent-209, score-0.454]
94 This paper is an extended answer to the question of the computational benefit of episodic memory, which, crudely speaking, stores particular samples, over semantic memory, which stores probability distributions. [sent-211, score-0.846]
95 Equally, in game theoretic interactions between competitors, Nash equilibria are typically stochastic, and therefore seemingly excellent candidates for control based on a semantic memory. [sent-213, score-0.329]
96 However, taking advantage of the flaws in an opponent requires remembering exactly how its actions deviate from stationary statistics, for which an episodic memory is a most useful tool [16]. [sent-214, score-0.95]
97 This form of semantic memory can be seen as arising without any consolidation process whatsoever. [sent-216, score-0.252]
98 However, although this method has its computational attractions, it is psychologically implausible since phenomena such as priming make it extremely difficult to recall multiple closely related samples from an episodic memory, let alone to do so in a statistically unbiased way (but see [18]). [sent-217, score-0.704]
99 In sum, we have provided a normative justification from the perspective of appropriate control for the episodic component of a multiple memory system. [sent-218, score-1.062]
100 Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. [sent-252, score-0.275]
wordName wordTfidf (topN-words)
[('episodic', 0.631), ('controller', 0.372), ('control', 0.2), ('habitual', 0.142), ('actions', 0.136), ('controllers', 0.136), ('memory', 0.134), ('hippocampus', 0.106), ('noise', 0.102), ('prefrontal', 0.102), ('normative', 0.097), ('exploratory', 0.095), ('semantic', 0.088), ('reward', 0.086), ('dorsolateral', 0.081), ('tmdps', 0.081), ('experience', 0.073), ('cached', 0.071), ('hippocampal', 0.071), ('exploration', 0.071), ('analytical', 0.069), ('forward', 0.068), ('mdps', 0.068), ('inferential', 0.065), ('effects', 0.063), ('nois', 0.061), ('striatal', 0.061), ('striatum', 0.061), ('tmdp', 0.061), ('asymptotic', 0.061), ('dayan', 0.06), ('environment', 0.055), ('terminal', 0.054), ('cache', 0.053), ('inaccuracy', 0.053), ('medial', 0.053), ('modelbased', 0.053), ('transfer', 0.053), ('transition', 0.052), ('tree', 0.051), ('limit', 0.051), ('mdp', 0.05), ('depth', 0.049), ('action', 0.049), ('daw', 0.049), ('branching', 0.049), ('corrupting', 0.049), ('remembering', 0.049), ('perfect', 0.047), ('ie', 0.045), ('simulations', 0.045), ('psychological', 0.044), ('bootstrapping', 0.043), ('analyse', 0.043), ('stores', 0.043), ('decision', 0.042), ('computational', 0.041), ('arbitration', 0.041), ('clayton', 0.041), ('equicorrelated', 0.041), ('lengyel', 0.041), ('poldrack', 0.041), ('seemingly', 0.041), ('curves', 0.04), ('gure', 0.039), ('involved', 0.039), ('course', 0.038), ('approximations', 0.038), ('environments', 0.038), ('apparent', 0.037), ('noisy', 0.037), ('budapest', 0.035), ('successor', 0.035), ('declarative', 0.035), ('blue', 0.034), ('increased', 0.034), ('consequent', 0.032), ('uncertainties', 0.032), ('mem', 0.032), ('neurobiol', 0.032), ('psychologically', 0.032), ('competition', 0.031), ('severe', 0.031), ('line', 0.031), ('limited', 0.031), ('memories', 0.03), ('arising', 0.03), ('characterizing', 0.03), ('uncertainty', 0.029), ('states', 0.029), ('episode', 0.029), ('act', 0.028), ('rewards', 0.028), ('expect', 0.026), ('constitutes', 0.026), ('exploitation', 0.026), ('substantial', 0.025), ('material', 0.025), ('regular', 0.025), ('experienced', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 100 nips-2007-Hippocampal Contributions to Control: The Third Way
Author: Máté Lengyel, Peter Dayan
Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1
2 0.27771685 169 nips-2007-Retrieved context and the discovery of semantic structure
Author: Vinayak Rao, Marc Howard
Abstract: Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented. 1
3 0.13114023 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
4 0.12547769 163 nips-2007-Receding Horizon Differential Dynamic Programming
Author: Yuval Tassa, Tom Erez, William D. Smart
Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper,we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1
5 0.10241339 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
Author: Alexander L. Strehl, Michael L. Littman
Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
6 0.087918349 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
7 0.081793055 162 nips-2007-Random Sampling of States in Dynamic Programming
8 0.063823201 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
9 0.063624844 203 nips-2007-The rat as particle filter
10 0.060062923 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
11 0.059858177 30 nips-2007-Bayes-Adaptive POMDPs
12 0.058621049 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
13 0.058026981 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
14 0.054588892 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
15 0.05230983 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
16 0.05204178 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
17 0.048186466 135 nips-2007-Multi-task Gaussian Process Prediction
18 0.047014244 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
19 0.045480751 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
20 0.044078011 191 nips-2007-Temporal Difference Updating without a Learning Rate
topicId topicWeight
[(0, -0.167), (1, -0.111), (2, 0.058), (3, -0.067), (4, -0.056), (5, 0.093), (6, 0.018), (7, 0.003), (8, -0.014), (9, -0.021), (10, -0.123), (11, -0.044), (12, 0.06), (13, -0.07), (14, 0.025), (15, -0.026), (16, -0.06), (17, -0.053), (18, 0.104), (19, -0.124), (20, 0.003), (21, 0.117), (22, -0.069), (23, -0.14), (24, 0.025), (25, -0.147), (26, 0.057), (27, 0.062), (28, -0.02), (29, -0.064), (30, 0.325), (31, -0.028), (32, -0.095), (33, -0.094), (34, -0.153), (35, 0.202), (36, -0.034), (37, 0.149), (38, 0.148), (39, 0.186), (40, -0.012), (41, -0.164), (42, 0.075), (43, -0.032), (44, 0.157), (45, -0.012), (46, 0.136), (47, 0.039), (48, -0.053), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.95680594 100 nips-2007-Hippocampal Contributions to Control: The Third Way
Author: Máté Lengyel, Peter Dayan
Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1
2 0.87890059 169 nips-2007-Retrieved context and the discovery of semantic structure
Author: Vinayak Rao, Marc Howard
Abstract: Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented. 1
3 0.52519667 163 nips-2007-Receding Horizon Differential Dynamic Programming
Author: Yuval Tassa, Tom Erez, William D. Smart
Abstract: The control of high-dimensional, continuous, non-linear dynamical systems is a key problem in reinforcement learning and control. Local, trajectory-based methods, using techniques such as Differential Dynamic Programming (DDP), are not directly subject to the curse of dimensionality, but generate only local controllers. In this paper,we introduce Receding Horizon DDP (RH-DDP), an extension to the classic DDP algorithm, which allows us to construct stable and robust controllers based on a library of local-control trajectories. We demonstrate the effectiveness of our approach on a series of high-dimensional problems using a simulated multi-link swimming robot. These experiments show that our approach effectively circumvents dimensionality issues, and is capable of dealing with problems of (at least) 24 state and 9 action dimensions. 1
4 0.38942373 162 nips-2007-Random Sampling of States in Dynamic Programming
Author: Chris Atkeson, Benjamin Stephens
Abstract: We combine three threads of research on approximate dynamic programming: sparse random sampling of states, value function and policy approximation using local models, and using local trajectory optimizers to globally optimize a policy and associated value function. Our focus is on finding steady state policies for deterministic time invariant discrete time control problems with continuous states and actions often found in robotics. In this paper we show that we can now solve problems we couldn’t solve previously. 1
5 0.36032712 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
6 0.29536524 171 nips-2007-Scan Strategies for Meteorological Radars
7 0.28719589 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
8 0.2764996 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
9 0.273469 191 nips-2007-Temporal Difference Updating without a Learning Rate
10 0.24895599 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes
11 0.24561478 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
12 0.23934956 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
13 0.23859164 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
14 0.21969527 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control
15 0.21631205 27 nips-2007-Anytime Induction of Cost-sensitive Trees
16 0.21499826 59 nips-2007-Continuous Time Particle Filtering for fMRI
17 0.21420561 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
18 0.21401976 203 nips-2007-The rat as particle filter
19 0.21306895 142 nips-2007-Non-parametric Modeling of Partially Ranked Data
20 0.21221299 129 nips-2007-Mining Internet-Scale Software Repositories
topicId topicWeight
[(5, 0.046), (13, 0.043), (16, 0.033), (18, 0.037), (19, 0.017), (21, 0.087), (27, 0.274), (31, 0.029), (34, 0.024), (35, 0.022), (47, 0.116), (49, 0.017), (56, 0.011), (83, 0.087), (85, 0.022), (87, 0.013), (90, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.77601165 100 nips-2007-Hippocampal Contributions to Control: The Third Way
Author: Máté Lengyel, Peter Dayan
Abstract: Recent experimental studies have focused on the specialization of different neural structures for different types of instrumental behavior. Recent theoretical work has provided normative accounts for why there should be more than one control system, and how the output of different controllers can be integrated. Two particular controllers have been identified, one associated with a forward model and the prefrontal cortex and a second associated with computationally simpler, habitual, actor-critic methods and part of the striatum. We argue here for the normative appropriateness of an additional, but so far marginalized control system, associated with episodic memory, and involving the hippocampus and medial temporal cortices. We analyze in depth a class of simple environments to show that episodic control should be useful in a range of cases characterized by complexity and inferential noise, and most particularly at the very early stages of learning, long before habitization has set in. We interpret data on the transfer of control from the hippocampus to the striatum in the light of this hypothesis. 1
2 0.65921283 196 nips-2007-The Infinite Gamma-Poisson Feature Model
Author: Michalis K. Titsias
Abstract: We present a probability distribution over non-negative integer valued matrices with possibly an infinite number of columns. We also derive a stochastic process that reproduces this distribution over equivalence classes. This model can play the role of the prior in nonparametric Bayesian learning scenarios where multiple latent features are associated with the observed data and each feature can have multiple appearances or occurrences within each data point. Such data arise naturally when learning visual object recognition systems from unlabelled images. Together with the nonparametric prior we consider a likelihood model that explains the visual appearance and location of local image patches. Inference with this model is carried out using a Markov chain Monte Carlo algorithm. 1
3 0.62253028 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
Author: Zhaohui Zheng, Hongyuan Zha, Tong Zhang, Olivier Chapelle, Keke Chen, Gordon Sun
Abstract: We present a general boosting method extending functional gradient boosting to optimize complex loss functions that are encountered in many machine learning problems. Our approach is based on optimization of quadratic upper bounds of the loss functions which allows us to present a rigorous convergence analysis of the algorithm. More importantly, this general framework enables us to use a standard regression base learner such as single regression tree for £tting any loss function. We illustrate an application of the proposed method in learning ranking functions for Web search by combining both preference data and labeled data for training. We present experimental results for Web search using data from a commercial search engine that show signi£cant improvements of our proposed methods over some existing methods. 1
4 0.54937303 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
Author: Michael Ross, Andrew Cohen
Abstract: This paper describes a new model for human visual classification that enables the recovery of image features that explain human subjects’ performance on different visual classification tasks. Unlike previous methods, this algorithm does not model their performance with a single linear classifier operating on raw image pixels. Instead, it represents classification as the combination of multiple feature detectors. This approach extracts more information about human visual classification than previous methods and provides a foundation for further exploration. 1
5 0.54500276 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
Author: Matthew Hoffman, Arnaud Doucet, Nando D. Freitas, Ajay Jasra
Abstract: A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new interpretation, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to implement full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulations methods in situations where the reward is a rare event.
6 0.54117656 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
7 0.53845906 86 nips-2007-Exponential Family Predictive Representations of State
8 0.53843939 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
9 0.53503942 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
10 0.53205562 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
11 0.52931774 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
12 0.52911603 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
13 0.52845156 69 nips-2007-Discriminative Batch Mode Active Learning
14 0.52838916 158 nips-2007-Probabilistic Matrix Factorization
15 0.52674234 164 nips-2007-Receptive Fields without Spike-Triggering
16 0.52547777 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
17 0.52377135 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
18 0.52250016 24 nips-2007-An Analysis of Inference with the Universum
19 0.52246869 195 nips-2007-The Generalized FITC Approximation
20 0.5212028 174 nips-2007-Selecting Observations against Adversarial Objectives