nips nips2001 nips2001-126 knowledge-graph by maker-knowledge-mining

126 nips-2001-Motivated Reinforcement Learning


Source: pdf

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. [sent-5, score-0.906]

2 Competition between actions is based on the motivating characteristics of their consequent states in this sense. [sent-6, score-0.167]

3 Substantial, careful experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation show that this view is incomplete. [sent-7, score-0.239]

4 In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. [sent-8, score-0.343]

5 Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. [sent-9, score-0.46]

6 We describe and model these motivational systems, and consider the way they interact. [sent-10, score-0.256]

7 1 Introduction Reinforcement learning (RL 28) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning. [sent-11, score-0.429]

8 Although RL sheds important light in some murky areas, it has paid less attention to research concerning the motivation of stimulus-response (SR) links. [sent-12, score-0.146]

9 RL methods are mainly concerned with preparatory Pavlovian (eg secondary) conditioning, and, in instrumental conditioning, the competition between multiple possible actions given a particular stimulus or state, based on the future rewarding or punishing consequences of those actions. [sent-13, score-0.8]

10 These have been used to build successful and predictive models of the activity of monkey dopamine cells in conditioning. [sent-14, score-0.154]

11 SR research goes on to study how these habits, and also the motivation associated with them, are 'attached' in an appropriately preparatory sense to conditioned stimuli (CSs) that are predictive of the USs. [sent-17, score-0.294]

12 The difference between RL's competition between multiple actions and SR's motivation of a single action might seem trivial, particularly if an extra, null, action is included in the action competition in RL, so the subject can actively choose to do nothing. [sent-18, score-0.897]

13 However, there is substantial evidence from experiments in which drive states (eg hunger, thirst) are manipulated, that motivation in the SR sense works in a sophisticated, intrinsically goal-sensitive, way and can exert unexpected effects on instrumental conditioning. [sent-19, score-0.608]

14 By comparison with RL, psychological study of multiple goals within single environments is quite advanced, particularly in experiments in which one goal or set of goals is effective during learning, and another during performance. [sent-20, score-0.112]

15 Neither the Pavlovian nor the instrumental system maps cleanly onto the standard view of RL, and the suggestion about dopamine would clearly significantly damage the RL interpretation of the involvement of this neuromodulatory system in conditioning. [sent-24, score-0.661]

16 2 Theoretical and Experimental Background Figure 1 shows a standard view of the involvement of the dopamine system in RL. [sent-27, score-0.195]

17 In the actor-critic 6 version of the dopamine theory, this TD error signal is put to two uses. [sent-30, score-0.108]

18 For this, δ(t) > 0 if the prediction from the state at time t, V^π(x(t)), is overly pessimistic with respect to the sum of the actual reward, r(t), and the estimated future reward, V^π(x(t + 1)), from the subsequent state. [sent-32, score-0.129]

19 The other use for δ(t) is criticizing the action a adopted at time t. [sent-33, score-0.158]

20 For this, δ(t) > 0 implies that the action chosen is worth more than the average worth of x(t), and that the overall policy π of the subject can therefore be improved by choosing it more often. [sent-34, score-0.252]

21 In a Q-learning 31 version of the theory, Q^π(x, a) values are learned using an analogous quantity to δ(t), for each pair of states x and actions a, and can directly be used to choose between the actions to improve the policy. [sent-35, score-0.379]
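
As a minimal illustrative sketch of the quantities just described (our own example, not code from the paper; the tabular setting, learning rate alpha and discount gamma are assumptions), the TD error δ(t), its two uses, and a Q-learning-style update can be written as:

```python
import numpy as np

# Illustrative tabular sketch (not the paper's code). State/action counts,
# alpha and gamma are assumptions for the example.
n_states, n_actions, alpha, gamma = 5, 2, 0.1, 0.95
V = np.zeros(n_states)               # V^pi(x): predicted sum of future reward
Q = np.zeros((n_states, n_actions))  # Q^pi(x, a): state-action values

def td_error(x, r, x_next):
    # delta(t) = r(t) + gamma * V^pi(x(t+1)) - V^pi(x(t));
    # delta(t) > 0 means V^pi(x(t)) was overly pessimistic.
    return r + gamma * V[x_next] - V[x]

def critic_and_actor_step(x, r, x_next):
    delta = td_error(x, r, x_next)
    V[x] += alpha * delta            # first use: improve the prediction
    return delta                     # second use: criticize the action taken at x

def q_learning_step(x, a, r, x_next):
    # Analogous prediction error for Q^pi(x, a); the Q-values can then be used
    # directly to choose between actions (standard Q-learning max backup).
    delta_q = r + gamma * Q[x_next].max() - Q[x, a]
    Q[x, a] += alpha * delta_q
```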

22 5 The SR view of conditioning places its emphasis on motivational control of a prepotent action. [sent-37, score-0.407]

23 A) Evaluator: A TD error signal δ to learn V^π(x) to match the sum of future rewards r, putatively via the basolateral nuclei of the amygdala, the orbitofrontal cortex and the nucleus accumbens. [sent-39, score-0.287]

24 B) Instrumental controller: The TD error δ is used to choose, and teach the choice of, appropriate actions a to maximize future reward based on the current state and stimuli, putatively via the dorsolateral prefrontal cortex and the dorsal striatum. [sent-40, score-0.607]

25 it is motivationally appropriate, according to the current goals of the animal. [sent-41, score-0.088]

26 The suggestion is that this is mediated by a separate motivational system. [sent-42, score-0.256]

27 A conclusion used to test this structure for the control of actions is that this motivational system could be able to energize any action being executed by the animal. [sent-44, score-0.623]

28 Animals executing an action for an outcome under instrumental control will perform more quickly when a CS predictive of reward is presented, even if the CS predicts a completely different reward from the instrumental outcome. [sent-46, score-1.318]

29 This effect is abolished by lesions of the shell of the nucleus accumbens,10 one of the main targets of DA from the VTA. [sent-47, score-0.111]

30 The standard RL model offers no account of the speed or force of action (though one could certainly imagine various possible extensions), and has no natural way to accommodate this finding. [sent-48, score-0.158]

31 For instance, Berridge & Schulkin8 first gave rats sucrose and saline solutions with one of a bitter (quinine) and a sour (citric) taste. [sent-50, score-0.088]

32 Furthermore, the flavor paired with the salt was awarded positive hedonic reactions, whereas before pairing (and if it had been paired with sucrose instead) it was treated as being aversive. [sent-53, score-0.129]

33 Whereas the RL system could certainly take the physiological lack of salt as helping determine part of the state x(t), this could only exert an effect on behavior through learning, contrary to the evidence. [sent-55, score-0.133]

34 The final complexity for standard RL comes from incentive learning. [sent-56, score-0.121]

35 One paradigm involves a sequential chain of two actions (a1 and a2) that rats had to execute in order to get a reward. [sent-57, score-0.218]

36 5 The subjects were made hungry, and were first trained to perform action a2 to get a particular reward (a Noyes pellet), and then to perform the chain of actions a1 - a2 to get the reward. [sent-58, score-0.43]

37 In a final test phase, the animals were offered the chance of executing a1 and a2 in extinction, for half of them when they were still hungry; for the other half when they were sated on their normal diet. [sent-59, score-0.312]

38 Sated animals perform a1 at the same rate as hungry animals, but perform a2 significantly less.* *Note that aversive Pavlovian instrumental transfer, in the form of the suppression of appetitive instrumental responding, is the conventional method for testing aversive Pavlovian conditioning. [sent-61, score-1.29]

39 There is an obvious motivational explanation for this as well as the conventional view of competition between appetitive and protective actions. [sent-62, score-0.453]

40 Figure 2: Incentive learning. [sent-64, score-0.545]

41 A) Mean total actions a1 and a2 for an animal trained on the chain schedule a1 - a2 - Noyes pellets. [sent-65, score-0.437]

42 Hungry and sated rats perform a1 at the same rate, but sated animals fail to perform a2. [sent-66, score-0.595]

43 B) Mean total actions when sated following prior re-exposure to the Noyes pellets when hungry ('hungry-sated') or when sated ('sated-sated'). [sent-67, score-0.753]

44 Animals re-exposed when sated are significantly less willing to perform a2. [sent-68, score-0.183]

45 Here, before the test, animals were given a limited number of the Noyes pellets (without the availability of the manipulanda) either when hungry or when sated. [sent-73, score-0.303]

46 Those who experienced them hungry ('hungry-sated') show the same results as the 'sated' group of figure 2A; whereas those who experienced them sated ('sated-sated') now declined to perform action a1 either. [sent-74, score-0.601]

47 First, the action nearest to the reward (a2) is affected by the deprivation state without additional learning. [sent-76, score-0.363]

48 Second is that a change in the willingness to execute a1 happens after re-exposure to the Noyes pellets whilst sated; this learning is believed to involve insular cortex (part of gustatory neocortex 4). [sent-78, score-0.201]

49 That re-exposure directly affects the choice of a1 suggests that the instrumental act is partly determined by an evaluation of its ultimate consequence, a conclusion that relates to a long-standing psychological debate about the 'cognitive' evaluation of actions. [sent-79, score-0.57]

50 Dickinson & Balleine 13 suggest that the execution of a2 is mainly controlled by Pavlovian contingencies, and that Pavlovian motivation is instantly sensitive to goal devaluation via satiation. [sent-80, score-0.357]

51 At this stage in the experiment, however, a1 is controlled by instrumental contingencies. [sent-81, score-0.524]

52 By comparison with Pavlovian motivation, instrumental motivation is powerful (since it can depend on response-outcome expectancies), but dumb (since, without re-exposure, the animal works hard doing a1 when it wouldn't be interested in the food in any case). [sent-82, score-0.793]

53 Ultimately, after extended training,14 in the birth of a new habit, a1 becomes controlled by Pavlovian contingencies too, and so becomes directly sensitive to devaluation. [sent-83, score-0.218]

54 Figure 3 shows a sketch of the new model, whose key principles include • Pavlovian motivation (figure 3A) is associated with prediction error δ(t) = r(t) + V^π(x(t + 1)) − V^π(x(t)), for long-term expected future rewards V^π(x), given a policy π. [sent-85, score-0.23]

55 † It is not empirically clear whether actions that have become habits are completely automatic 1 or are subject to Pavlovian motivational influences. [sent-87, score-0.533]

56 [Figure 3 schematic: stimuli x, CSs and USs feed a prior-bias (hard-wired) route and a plastic amygdala route; δ acts via the accumbens shell on the habit system.] [sent-89, score-0.294]

57 USs can also be evaluated via a plastic route, as in figure 1, but which nevertheless has prior biases. [sent-92, score-0.106]

58 CSs undergo Pavlovian stimulus substitution with the USs they predict, and can also be directly evaluated through the learned route. [sent-93, score-0.089]

59 The two sources of information for V^π(x) compete, forcing the plastic route to adjust to the hard-wired route. [sent-94, score-0.199]

60 B) Habit system: The SR mapping suggests an appropriate action based on the state x; the vigor of its execution is controlled by dopaminergic δ, putatively acting via the shell of the accumbens. [sent-95, score-0.383]

61 C) Instrumental controller: Action choice is based on advantages, which are learned, putatively via the core of the accumbens. [sent-96, score-0.107]

62 • r(t) is determined by a devaluation-sensitive, 'hard-wired', US evaluator that provides direct value information about motivationally inappropriate USs. [sent-98, score-0.11]

63 • δ(t), possibly acting through the shell of the accumbens, provides Pavlovian motivation for pre-wired and new habits (figure 3B), as in Pavlovian instrumental transfer. [sent-99, score-0.758]

64 The plastic route via the amygdala takes responsibility for the remainder of the prediction; and the sum prediction is always correct (solid line). [sent-104, score-0.411]
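
One schematic way to read this principle (our own sketch, with hypothetical names such as us_value; the paper specifies the architecture, not this code) is to write the Pavlovian value as the sum of a hard-wired, drive-modulated component and a plastic component trained on the residual TD error:

```python
import numpy as np

# Schematic two-route evaluator (illustrative assumptions, not the paper's
# implementation). us_value, n_states, alpha and gamma are hypothetical.
n_states, alpha, gamma = 5, 0.1, 0.95
us_value = np.zeros(n_states)   # hard-wired worth of any US predicted from each state
V_plastic = np.zeros(n_states)  # learned (amygdala/OFC-like) component

def value(x, drive):
    # The hard-wired route is scaled by the current motivational state (e.g.
    # drive = 0 for food value when sated), so it is devaluation-sensitive
    # without any new learning.
    return drive * us_value[x] + V_plastic[x]

def update(x, r, x_next, drive):
    delta = r + gamma * value(x_next, drive) - value(x, drive)
    V_plastic[x] += alpha * delta   # only the plastic route adjusts, absorbing
    return delta                    # whatever the hard-wired route does not predict
```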

65 • Short-term storage of predictive stimuli in prefrontal working memory is gated 9 by δ(t), so can also be devaluation dependent. [sent-105, score-0.303]

66 • Instrumental motivation depends on policy-based advantages (3C; Baird 2), A^π(x, a) = Q^π(x, a) − V^π(x), trained by the error signal δ_A(t) = δ(t) − A^π(x, a). Over the course of policy improvement, the advantage of a sub-optimal action becomes negative, and that of an optimal action tends to 0. [sent-106, score-0.51]
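
A minimal sketch of this advantage-learning scheme (our illustration; the softmax action choice and its temperature beta are assumptions, not the paper's specification):

```python
import numpy as np

# Illustrative advantage learning (assumed tabular setting and parameters).
n_states, n_actions, alpha, gamma, beta = 5, 2, 0.1, 0.95, 3.0
V = np.zeros(n_states)               # V^pi(x)
A = np.zeros((n_states, n_actions))  # A^pi(x, a) = Q^pi(x, a) - V^pi(x)
rng = np.random.default_rng(0)

def update(x, a, r, x_next):
    delta = r + gamma * V[x_next] - V[x]   # TD error delta(t)
    V[x] += alpha * delta
    delta_A = delta - A[x, a]              # delta_A(t) = delta(t) - A^pi(x, a)
    A[x, a] += alpha * delta_A
    return delta, delta_A

def choose_action(x):
    # As the policy improves, suboptimal advantages go negative and the
    # consistently chosen (optimal) action's advantage tends to 0.
    p = np.exp(beta * A[x])
    return rng.choice(n_actions, p=p / p.sum())
```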

67 The latter offers a natural model of the transition from instrumental action selection to an SR habit. [sent-107, score-0.587]

68 [Figure 4 plots: A^π(x, b) and related curves; axes include CS-US interval and iteration; legend shows successor and sum components.] [sent-120, score-0.089]

69 Figure 4: A) Role of the hard-wired route (dashed line), via stimulus-substitution, in predicting future reward (r = 1) as a function of the CS-US interval. [sent-123, score-0.302]

70 B) Advantages of useful (a) and worthless (b) actions at the start state x0. [sent-125, score-0.212]

71 The solid line shows the mean value; the dashed line the hard-wired component, providing immediate devaluation sensitivity. [sent-127, score-0.16]

72 D) Construction of A^π(x0, a) via a successor representation component 11 (dashed) and a conventionally learned component (dotted). [sent-128, score-0.234]

73 B-D) Action a produces reward r = 1 with probability 0. [sent-130, score-0.105]

74 Figure 4B;C show two aspects of instrumental conditioning. [sent-132, score-0.429]

75 Two actions compete at state x0, one, a, with a small cost and a large future payoff; the other, b, with no cost and no payoff. [sent-133, score-0.248]

76 Figure 4B shows the development of the advantages of these actions over learning. [sent-134, score-0.215]

77 Action a starts looking worse, because it has a greater immediate cost; its advantage increases as the worth of a grows greater than the mean value of x0, and then goes to 0 (the birth of the habit) as the subject learns to choose it every time. [sent-135, score-0.116]

78 • On-line action choice is dependent on δ_A(t) as in learned klinokinesis. [sent-139, score-0.203]

79 21 Incentive learning in chains suggests that the representation underlying the advantage of an action includes information about its future consequences, either through an explicit model,27,29 a successor representation,11 or perhaps a form of β-model. [sent-140, score-0.283]

80 26 One way of arranging this would use a VTE-like 30 mechanism for proposing actions (perhaps using working memory in prefrontal cortex), in order to test their advantages. [sent-141, score-0.231]

81 Figure 4D shows the consequence of using a learned successor representation underlying the advantage A^π(x0, a) shown in figure 4B. [sent-142, score-0.134]

82 Re-exposure sensitivity (ie incentive learning) will exist over roughly iterations 25-75. [sent-144, score-0.162]
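
To make the successor-representation component concrete, here is a small sketch (our assumptions; the state indices and the re-exposure step are hypothetical): the value component of the advantage is recomposed as occupancy times outcome worth, so re-evaluating the outcome (as in incentive learning) changes the value of early actions in the chain immediately, without relearning the occupancies.

```python
import numpy as np

# Sketch of a successor-representation value component (illustrative assumptions).
n_states, alpha, gamma = 5, 0.1, 0.95
M = np.eye(n_states)        # M[x, y]: expected discounted future occupancy of y from x
r_hat = np.zeros(n_states)  # evaluated worth of the outcome reached in each state

def sr_step(x, x_next):
    # TD-style update of the occupancy predictions under the current policy.
    target = np.eye(n_states)[x] + gamma * M[x_next]
    M[x] += alpha * (target - M[x])

def sr_value(x):
    # Value recomposed from occupancy and outcome worth; this is the
    # re-exposure-sensitive component of the advantage.
    return M[x] @ r_hat

# Incentive learning: re-exposure while sated re-evaluates the outcome state
# (index 4 here is hypothetical), and sr_value at the chain's start drops at once:
# r_hat[4] = 0.0
```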

83 • SR models also force consideration of the repertoire of possible actions or responses available at a given state (figure 3B;C). [sent-145, score-0.212]

84 4 Discussion Experiments pose a critical challenge to our understanding of the psychological and neural implementation of reinforcement learning, 12,13 suggesting the importance of two different sorts of motivation in controlling behavior. [sent-148, score-0.238]

85 The most critical addition is a hard-wired, stimulus-substitution sensitive, route for the evaluation of stimuli and states, which competes with a plastic route through the amygdala and the OFC. [sent-150, score-0.521]

86 This hard-wired route has the property of intrinsic sensitivity to various sorts of devaluation, and this leads to motivationally appropriate behavior. [sent-151, score-0.264]

87 The computational basis of the new aspects of the model focuses on motivational control of SR links (via V^π), to add to motivational control of instrumental actions (via A^π). [sent-152, score-1.192]

88 We also showed the potential decomposition of the advantages into a component based on the successor representation and therefore sensitive to re-exposure as in incentive learning, and a standard, learned, component. [sent-153, score-0.34]

89 In particular, we have yet to explore how habits get created from actions as the maximal advantage goes to 0. [sent-155, score-0.277]

90 References [1] Adams, CD (1982) Variations in the sensitivity of instrumental responding to reinforcer devaluation. [sent-158, score-0.47]

91 [4] Balleine, BW & Dickinson, A (1998) Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. [sent-164, score-0.55]

92 [5] Balleine, BW, Garner, C, Gonzalez, F & Dickinson, A (1995) Motivational control of heterogeneous instrumental chains. [sent-166, score-0.471]

93 [8] Berridge, KC & Schulkin, J (1989) Palatability shift of a salt-associated incentive during sodium depletion. [sent-172, score-0.121]

94 [9] Braver, TS, Barch, DM & Cohen, JD (1999) Cognition and control in schizophrenia: A computational model of dopamine and prefrontal function. [sent-174, score-0.214]

95 [10] Corbit, LH, Muir, JL & Balleine, BW (2001) The role of the nucleus accumbens in instrumental conditioning: Evidence of a functional dissociation between accumbens core and shell. [sent-176, score-0.613]

96 [11] Dayan, P (1993) Improving generalisation for temporal difference learning: The successor representation. [sent-178, score-0.089]

97 [14] Dickinson, A, Balleine, B, Watt, A, Gonzalez, F & Boakes, RA (1995) Motivational control after extended instrumental training. [sent-185, score-0.471]

98 [17] Holland, PC & Rescorla, RA (1975) The effect of two ways of devaluing the unconditioned stimulus after first- and second-order appetitive conditioning. [sent-194, score-0.154]

99 [22] Montague, PR, Dayan, P & Sejnowski, TJ (1996) A framework for mesencephalic dopamine systems based on predictive Hebbian learning. [sent-204, score-0.154]

100 [23] Schoenbaum, G, Chiba, AA & Gallagher, M (1999) Neural encoding in orbitofrontal cortex and basolateral amygdala during olfactory discrimination learning. [sent-206, score-0.236]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('instrumental', 0.429), ('motivational', 0.256), ('pavlovian', 0.229), ('balleine', 0.183), ('sated', 0.183), ('actions', 0.167), ('dickinson', 0.165), ('hungry', 0.165), ('action', 0.158), ('rl', 0.156), ('motivation', 0.146), ('sr', 0.133), ('amygdala', 0.13), ('devaluation', 0.128), ('route', 0.127), ('incentive', 0.121), ('xo', 0.114), ('appetitive', 0.11), ('habits', 0.11), ('dopamine', 0.108), ('reward', 0.105), ('al', 0.095), ('habit', 0.091), ('uss', 0.091), ('vtt', 0.091), ('successor', 0.089), ('animals', 0.083), ('animal', 0.08), ('conditioning', 0.077), ('accumbens', 0.073), ('berridge', 0.073), ('css', 0.073), ('noyes', 0.073), ('putatively', 0.073), ('shell', 0.073), ('plastic', 0.072), ('stimuli', 0.065), ('prefrontal', 0.064), ('dayan', 0.062), ('psychology', 0.061), ('rs', 0.055), ('td', 0.055), ('competition', 0.055), ('balkenius', 0.055), ('basolateral', 0.055), ('deprivation', 0.055), ('evaluator', 0.055), ('involvement', 0.055), ('konorski', 0.055), ('motivationally', 0.055), ('ofc', 0.055), ('pellets', 0.055), ('salt', 0.055), ('spier', 0.055), ('cortex', 0.051), ('rats', 0.051), ('sensitive', 0.049), ('prediction', 0.048), ('advantages', 0.048), ('montague', 0.048), ('worth', 0.047), ('reinforcement', 0.046), ('executing', 0.046), ('psychological', 0.046), ('cs', 0.046), ('predictive', 0.046), ('state', 0.045), ('learned', 0.045), ('stimulus', 0.044), ('food', 0.043), ('bw', 0.043), ('control', 0.042), ('sensitivity', 0.041), ('intrinsic', 0.041), ('sutton', 0.041), ('nucleus', 0.038), ('aversive', 0.037), ('birth', 0.037), ('contingencies', 0.037), ('flavor', 0.037), ('gonzalez', 0.037), ('neuromodulatory', 0.037), ('pellet', 0.037), ('preparatory', 0.037), ('sucrose', 0.037), ('future', 0.036), ('ag', 0.035), ('via', 0.034), ('physiological', 0.033), ('goals', 0.033), ('evidence', 0.033), ('component', 0.033), ('immediate', 0.032), ('view', 0.032), ('consequences', 0.032), ('gallagher', 0.032), ('holland', 0.032), ('dorsal', 0.032), ('extinction', 0.032), ('kc', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 126 nips-2001-Motivated Reinforcement Learning

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

2 0.14646392 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

Author: Aaron C. Courville, David S. Touretzky

Abstract: The Temporal Coding Hypothesis of Miller and colleagues [7] suggests that animals integrate related temporal patterns of stimuli into single memory representations. We formalize this concept using quasi-Bayes estimation to update the parameters of a constrained hidden Markov model. This approach allows us to account for some surprising temporal effects in the second order conditioning experiments of Miller et al. [1 , 2, 3], which other models are unable to explain. 1

3 0.10822537 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

Author: Gregory Z. Grudic, Lyle H. Ungar

Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function. Theory is presented showing that linear function approximation representations can degrade the rate of convergence of performance gradient estimates relative to when no function approximation is used, depending on the number of possible actions and the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.

4 0.10343158 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

Author: Bram Bakker

Abstract: This paper presents reinforcement learning with a Long ShortTerm Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(,x) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1

5 0.10041298 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

Author: Eyal Even-dar, Yishay Mansour

Abstract: We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of ε-greedy Q-learning. 1

6 0.09792196 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

7 0.097701475 40 nips-2001-Batch Value Function Approximation via Support Vectors

8 0.097421885 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

9 0.093911193 3 nips-2001-ACh, Uncertainty, and Cortical Inference

10 0.09192688 128 nips-2001-Multiagent Planning with Factored MDPs

11 0.091767386 13 nips-2001-A Natural Policy Gradient

12 0.087779 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

13 0.085780844 148 nips-2001-Predictive Representations of State

14 0.073093511 121 nips-2001-Model-Free Least-Squares Policy Iteration

15 0.072529472 59 nips-2001-Direct value-approximation for factored MDPs

16 0.066001505 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

17 0.062005274 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

18 0.048651613 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding

19 0.047067929 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes

20 0.046219628 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.147), (1, -0.155), (2, 0.157), (3, 0.007), (4, -0.015), (5, 0.06), (6, -0.025), (7, -0.001), (8, -0.008), (9, 0.005), (10, -0.019), (11, 0.058), (12, -0.106), (13, 0.02), (14, -0.03), (15, -0.01), (16, 0.082), (17, 0.076), (18, 0.087), (19, 0.007), (20, -0.016), (21, -0.097), (22, -0.075), (23, 0.019), (24, -0.042), (25, 0.007), (26, 0.116), (27, 0.044), (28, 0.001), (29, -0.153), (30, 0.076), (31, 0.1), (32, -0.075), (33, -0.155), (34, 0.051), (35, -0.165), (36, 0.101), (37, 0.078), (38, 0.07), (39, -0.027), (40, 0.004), (41, 0.042), (42, 0.021), (43, 0.041), (44, 0.003), (45, 0.07), (46, -0.15), (47, -0.074), (48, -0.005), (49, -0.105)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95975572 126 nips-2001-Motivated Reinforcement Learning

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

2 0.65735883 148 nips-2001-Predictive Representations of State

Author: Michael L. Littman, Richard S. Sutton

Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model.

3 0.63526475 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

Author: Bram Bakker

Abstract: This paper presents reinforcement learning with a Long ShortTerm Memory recurrent neural network: RL-LSTM. Model-free RL-LSTM using Advantage(,x) learning and directed exploration can solve non-Markovian tasks with long-term dependencies between relevant events. This is demonstrated in a T-maze task, as well as in a difficult variation of the pole balancing task. 1

4 0.62063193 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

Author: Aaron C. Courville, David S. Touretzky

Abstract: The Temporal Coding Hypothesis of Miller and colleagues [7] suggests that animals integrate related temporal patterns of stimuli into single memory representations. We formalize this concept using quasi-Bayes estimation to update the parameters of a constrained hidden Markov model. This approach allows us to account for some surprising temporal effects in the second order conditioning experiments of Miller et al. [1 , 2, 3], which other models are unable to explain. 1

5 0.59623015 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

Author: Jonathan L. Shapiro, J. Wearden

Abstract: Animal data on delayed-reward conditioning experiments shows a striking property - the data for different time intervals collapses into a single curve when the data is scaled by the time interval. This is called the scalar property of interval timing. Here a simple model of a neural clock is presented and shown to give rise to the scalar property. The model is an accumulator consisting of noisy, linear spiking neurons. It is analytically tractable and contains only three parameters. When coupled with reinforcement learning it simulates peak procedure experiments, producing both the scalar property and the pattern of single trial covariances. 1

6 0.56121022 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

7 0.54916549 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

8 0.48922154 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

9 0.47940814 3 nips-2001-ACh, Uncertainty, and Cortical Inference

10 0.46321028 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

11 0.39425963 40 nips-2001-Batch Value Function Approximation via Support Vectors

12 0.39393988 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

13 0.38211083 177 nips-2001-Switch Packet Arbitration via Queue-Learning

14 0.35768837 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

15 0.31394759 128 nips-2001-Multiagent Planning with Factored MDPs

16 0.29997781 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

17 0.2899152 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

18 0.28908995 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

19 0.28592378 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

20 0.28455848 124 nips-2001-Modeling the Modulatory Effect of Attention on Human Spatial Vision


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (14, 0.018), (17, 0.014), (19, 0.031), (27, 0.087), (30, 0.06), (38, 0.016), (59, 0.025), (63, 0.383), (72, 0.045), (79, 0.031), (83, 0.04), (91, 0.143)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81210232 126 nips-2001-Motivated Reinforcement Learning

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments reviewed in Dickinson & Balleine, 12,13 into the neurobiology and psychology of motivation show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

2 0.72375029 62 nips-2001-Duality, Geometry, and Support Vector Regression

Author: J. Bi, Kristin P. Bennett

Abstract: We develop an intuitive geometric framework for support vector regression (SVR). By examining when -tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft -tubes are constructed by separating the convex or reduced convex hulls respectively of the training data with the response variable shifted up and down by . A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective -tube. In the proposed approach the effects of the choices of all parameters become clear geometrically. 1

3 0.4734534 134 nips-2001-On Kernel-Target Alignment

Author: Nello Cristianini, John Shawe-Taylor, André Elisseeff, Jaz S. Kandola

Abstract: We introduce the notion of kernel-alignment, a measure of similarity between two kernel functions or between a kernel and a target function. This quantity captures the degree of agreement between a kernel and a given learning task, and has very natural interpretations in machine learning, leading also to simple algorithms for model selection and learning. We analyse its theoretical properties, proving that it is sharply concentrated around its expected value, and we discuss its relation with other standard measures of performance. Finally we describe some of the algorithms that can be obtained within this framework, giving experimental results showing that adapting the kernel to improve alignment on the labelled data significantly increases the alignment on the test set, giving improved classification accuracy. Hence, the approach provides a principled method of performing transduction. Keywords: Kernels, alignment, eigenvectors, eigenvalues, transduction 1

4 0.44281355 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

Author: Patrick Haffner

Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1

5 0.4426713 3 nips-2001-ACh, Uncertainty, and Cortical Inference

Author: Peter Dayan, Angela J. Yu

Abstract: Acetylcholine (ACh) has been implicated in a wide variety of tasks involving attentional processes and plasticity. Following extensive animal studies, it has previously been suggested that ACh reports on uncertainty and controls hippocampal, cortical and cortico-amygdalar plasticity. We extend this view and consider its effects on cortical representational inference, arguing that ACh controls the balance between bottom-up inference, influenced by input stimuli, and top-down inference, influenced by contextual information. We illustrate our proposal using a hierarchical hidden Markov model.

6 0.43812764 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

7 0.43768418 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

8 0.43763101 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

9 0.43749237 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

10 0.43577754 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

11 0.43570596 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

12 0.43408805 183 nips-2001-The Infinite Hidden Markov Model

13 0.43275085 13 nips-2001-A Natural Policy Gradient

14 0.43268472 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

15 0.43266529 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

16 0.43225205 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

17 0.43145281 89 nips-2001-Grouping with Bias

18 0.43144423 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's

19 0.43108821 169 nips-2001-Small-World Phenomena and the Dynamics of Information

20 0.43104276 182 nips-2001-The Fidelity of Local Ordinal Encoding