nips nips2001 nips2001-51 knowledge-graph by maker-knowledge-mining

51 nips-2001-Cobot: A Social Reinforcement Learning Agent


Source: pdf

Author: Charles Lee Isbell Jr., Christian R. Shelton

Abstract: We report on the use of reinforcement learning with Cobot, a software agent residing in the well-known online community LambdaMOO. Our initial work on Cobot (Isbell et al., 2000) provided him with the ability to collect social statistics and report them to users. Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifying his behavior based on his current state. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. [sent-6, score-0.317]

2 After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifying his behavior based on his current state. [sent-7, score-0.39]

3 These previous studies focus on systems that encounter human users one at a time, such as spoken dialogue systems (Singh et al. [sent-11, score-0.324]

4 In this paper, we report on an RL-based agent for LambdaMOO, a complex, open-ended, multi-user chat environment, populated by a community of human users with rich and often enduring social relationships. [sent-13, score-0.525]

5 Our long-term goal is to build an agent who can learn to perform useful, interesting and entertaining actions in LambdaMOO on the basis of user feedback. [sent-14, score-0.484]

6 how frequently and in what ways users interacted with one another), and provided summaries of these statistics as a service. [sent-21, score-0.322]

7 Cobot’s description to users indicates that he is male. [sent-29, score-0.286]

8 when each action is appropriate (rules that would be inaccurate and quickly become stale), we wanted Cobot to learn the individual and communal preferences of users. [sent-30, score-0.209]

9 Thus, we provided a mechanism for users to reward or punish Cobot, and programmed Cobot to use RL algorithms to alter his behavior on the basis of this feedback. [sent-31, score-0.483]

10 These should include social information such as which users are present, how experienced they are in LambdaMOO, how frequently they interact with one another, and so on. [sent-34, score-0.397]

11 Cobot lives in an environment with multiple, often conflicting sources of reward from different human users. [sent-36, score-0.201]

12 Inconsistency and drift of user rewards and desires. [sent-38, score-0.345]

13 Individual users may be inconsistent in the rewards they provide (even when they implicitly have a fixed set of preferences), and their preferences may change over time (for example, due to becoming bored or irritated with an action). [sent-39, score-0.498]

14 Even when their rewards are consistent, there can be great temporal variation in their reward pattern. [sent-40, score-0.191]

15 Training data is scarce for many reasons, including user fickleness, and the need to prevent Cobot from generating too much spam in the environment. [sent-44, score-0.28]

16 Reasons include the fact that user preferences are not stationary, but drift as users become habituated or bored with Cobot’s behavior, and the tendency of satisfied users to stop providing Cobot with any feedback, positive or negative. [sent-50, score-0.999]

17 While many users provided only a moderate or small amount of RL training (rewards and punishments) to Cobot, a handful of users did invest significant time in training him. [sent-53, score-0.61]

18 While many of the users that trained Cobot did not exhibit clear preferences for any of his actions over the others, some users clearly and consistently rewarded and punished particular actions over the others. [sent-55, score-0.959]

19 For those users who exhibited clear preferences through their rewards and punishments, Cobot successfully learned corresponding policies of behavior. [sent-57, score-0.561]

20 For those users who invested the most training time in Cobot, the observed distribution of his actions is significantly altered by their presence. [sent-59, score-0.413]

21 Although some users for whom we have sufficient data seem to have preferences that do not depend upon the social state features we constructed for the RL, others do in fact appear to change their preferences depending upon prevailing social conditions. [sent-61, score-0.866]

22 In Sections 5, 6 and 7 we describe our implementation of Cobot’s RL action space, reward mechanisms and state features, respectively. [sent-66, score-0.224]

23 LambdaMOO appears as a series of interconnected rooms, populated by users and objects who may move between them. [sent-70, score-0.314]

24 In addition to speech, users express themselves via a large collection of verbs, allowing a rich set of simulated actions, and the expression of emotional states; for example: (1) Buster is overwhelmed by all these deadlines. [sent-72, score-0.3]

25 Lines (1) and (2) are initiated by verb commands by user Buster, expressing his emotional state, while lines (3) and (4) are examples of verbs and speech acts, respectively, by HFh. [sent-79, score-0.36]

26 The rooms and objects in LambdaMOO are created by users themselves, who devise descriptions, and control access by other users. [sent-82, score-0.316]

27 At last count, the database contains 118,154 objects, including 4836 active user accounts. [sent-84, score-0.266]

28 Many users have interacted extensively with each other over many years, and users are widely acknowledged for their contribution of interesting objects. [sent-86, score-0.608]

29 The population is generally curious and technically savvy, and users are interested in automated objects meant to display some form of intelligence. [sent-88, score-0.314]

30 Like a human user, he connects via telnet, and from the point of view of the LambdaMOO server, is a user with all the rights and responsibilities implied. [sent-90, score-0.29]

31 The Living Room is a central public place, frequented both by many regulars, and by users new to LambdaMOO. [sent-92, score-0.286]

32 5 million separate events (about one event every eleven seconds). Previously, we implemented a variety of functionality on Cobot centering around gathering and reporting social statistics. [sent-96, score-0.192]

33 It is impossible to program rules anticipating when any given action is appropriate in such a complex and dynamic environment, so we applied reinforcement learning to adapt directly from user feedback. [sent-106, score-0.367]

34 Introduce two users who have not yet interacted in front of Cobot. [sent-119, score-0.31]

35 In the MDP framework, at each time step the agent senses the state of the environment, and chooses and executes an action from the set of actions available to it in that state. [sent-122, score-0.289]
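
As a purely illustrative sketch of this sense-choose-execute loop (the env and agent interfaces below are hypothetical placeholders, not Cobot's actual code), in Python:

    def run_episode(env, agent, steps=100):
        """Generic MDP interaction loop: sense the state, pick a legal action,
        execute it, and hand the resulting reward back to the agent."""
        state = env.reset()                          # sense the initial state
        for _ in range(steps):
            actions = env.available_actions(state)   # actions legal in this state
            action = agent.choose(state, actions)    # agent picks one
            state, reward = env.step(action)         # environment responds
            agent.observe(state, reward)             # agent updates from feedback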

36 The agent’s goal is to choose actions so as to maximize the expected sum of rewards over some time horizon. [sent-125, score-0.192]

37 An optimal policy is a mapping from states to actions that achieves the agent’s goal. [sent-126, score-0.189]

38 The learned value function is used to choose actions stochastically, so that in each state, actions with higher value are chosen with higher probability. [sent-129, score-0.288]
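
A minimal Python sketch of this kind of value-weighted stochastic selection; the softmax form and the temperature are illustrative assumptions, since the summary does not specify the exact scheme:

    import math
    import random

    def choose_action(action_values, temperature=1.0):
        """Pick an action at random, favoring higher-valued actions (softmax)."""
        actions = list(action_values)
        peak = max(action_values.values())  # subtract the max for numerical stability
        weights = [math.exp((action_values[a] - peak) / temperature) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    # Higher-valued actions are chosen more often, but every action keeps some chance.
    print(choose_action({"roll_call": 0.9, "social_commentary": 0.4, "introduce": 0.1}))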

39 5 Cobot’s RL Actions. To have any hope of learning to behave in a way interesting to LambdaMOO users, Cobot’s actions must “make sense” to them, fit in with the social chat-based environment, and minimize the risk of causing irritation. [sent-142, score-0.25]

40 Any rewards or punishments received before the next RL action are attributed to the current action, and used to update Cobot’s value functions. [sent-147, score-0.207]

41 It is worth remembering that Cobot has two different categories of action: those actions taken proactively as a result of the RL, and those actions taken in response to a user’s action towards Cobot. [sent-148, score-0.317]

42 Some users are certainly aware of the distinction and can easily determine which actions fall into which category, but other users may occasionally reward or punish Cobot in response to a reactive action. [sent-149, score-0.898]

43 6 The RL Reward Function. Cobot learns to behave directly from the feedback of LambdaMOO users, any of whom can reward or punish him. [sent-151, score-0.225]

44 We implemented explicit reward and punish verbs on Cobot that LambdaMOO users can invoke at any time. [sent-153, score-0.505]

45 The signal is attributed as immediate feedback for the current state and RL action, and “backed up” to previous states and actions in accordance with the standard RL algorithms. [sent-155, score-0.207]
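
As a hedged illustration of what “backing up” user feedback can look like, here is a generic eligibility-trace style credit assignment; the learning rate, decay, and the update rule itself are placeholders rather than the algorithm actually used by Cobot:

    def mark_visit(traces, state, action, decay=0.7):
        """Decay the credit of earlier state-action pairs, then give full credit
        to the pair just visited (state must be hashable, e.g. a tuple)."""
        for key in traces:
            traces[key] *= decay
        traces[(state, action)] = 1.0

    def attribute_feedback(values, traces, reward, learning_rate=0.1):
        """Spread an immediate reward or punishment over the current pair and,
        with geometrically smaller weight, over recently visited pairs."""
        for key, trace in traces.items():
            values[key] = values.get(key, 0.0) + learning_rate * reward * trace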

46 One fundamental design choice is whether to learn a single value function for the entire community, or to learn separate value functions for each user based on individual feedback, combining the value functions of those present to determine how to act at each moment. [sent-159, score-0.306]

47 First, it was clear that for learning to have any hope of success, the system must represent who is present at any given moment—different users simply have different personalities and preferences. [sent-161, score-0.286]

48 We felt that representing which users are present as additional state features would throw away valuable domain information, as the RL would have to discover on its own the primacy of user identity. [sent-162, score-0.635]

49 Having separate reward functions for each user is thus a way of asserting the importance of identity to the learning process. [sent-163, score-0.406]

50 Without a clear sense that their training has some impact on Cobot’s behavior, users will quickly lose interest in providing feedback. [sent-165, score-0.297]

51 Third, we (correctly) anticipated the fact that certain users would provide an inordinate amount of training to Cobot, and we did not want the overall policy followed by Cobot to be dominated by the preferences of these individuals. [sent-170, score-0.481]

52 By learning separate policies for each user, and then combining these policies among those users present, we can limit the impact any single user can have on Cobot’s actions. [sent-171, score-0.639]
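
A minimal sketch of one way to combine per-user policies over the users currently present; the uniform averaging below is an assumption, not the combination rule stated in the paper:

    def combined_policy(per_user_policies, present_users, actions):
        """Average the action distributions of the trained users who are present,
        so that no single trainer dominates the agent's overall behavior."""
        known = [u for u in present_users if u in per_user_policies]
        if not known:
            return {a: 1.0 / len(actions) for a in actions}  # fall back to uniform
        mix = {a: 0.0 for a in actions}
        for user in known:
            for a in actions:
                mix[a] += per_user_policies[user].get(a, 0.0) / len(known)
        return mix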

53 7 Cobot’s RL State Features. The decision to maintain and learn separate value functions for each user means that we can maintain separate state spaces as well, in the hopes of simplifying states and speeding learning. [sent-172, score-0.372]

54 The state space for a user contains a number of features containing statistics about that particular user. [sent-174, score-0.332]

55 LambdaMOO is a social environment, and Cobot is learning to take social actions, so we felt that his state features should contain information allowing him to gauge social activity and relationships. [sent-175, score-0.416]

56 Even though we have simplified the state space by partitioning by user, the state space for a single user remains sufficiently complex to preclude standard table-based representation of value functions (also, each user’s state space is effectively infinite, as there are real-valued state features). [sent-177, score-0.417]
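
One standard way to handle real-valued features without a value table is a linear approximator over a feature vector, sketched below; the feature names are hypothetical placeholders and the linear form is an assumption about the kind of approximation involved:

    def feature_vector(social_state):
        """Map a user's social state to real-valued features; the leading 1.0 is
        the always-on bias feature described in the text."""
        return [
            1.0,
            social_state.get("events_per_minute", 0.0),
            social_state.get("users_present", 0) / 10.0,
            social_state.get("minutes_since_roll_call", 0.0) / 60.0,
        ]

    def linear_value(action_weights, social_state):
        """Value estimate as a dot product of per-action weights with the features."""
        phi = feature_vector(social_state)
        return sum(w * x for w, x in zip(action_weights, phi))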

57 Cobot’s RL actions are then chosen according to a mixture of the policies of the users present. [sent-179, score-0.444]

58 Indicates if Cobot’s currently saved roll call text has been used before, if someone has done a roll call since the last time Cobot did, and if there has been a roll call since the last time Cobot grabbed new text. [sent-188, score-0.414]

59 Each user has one feature that is always “on”; that is, this bias is always set to a value of 1. [sent-189, score-0.299]

60 Each user has his own state space and value function; the table thus describes the state space maintained for a generic user. [sent-192, score-0.348]

61 Upon launching the RL functionality publicly in the Living Room, Cobot logged all RL-related data (states visited, actions taken, rewards received from each user, parameters of the value functions, etc. [sent-197, score-0.256]

62 During this time, 63123 RL actions were taken (in addition, of course, to many more reactive non-RL actions), and 3171 reward and punishment events were received from 254 different users. [sent-199, score-0.377]

63 Instead, as shown in Figure 1a, the average cumulative reward received by Cobot actually goes down. [sent-202, score-0.195]

64 However, rather than indicating that users are becoming more dissatisfied as Cobot learns, the decay in reward reveals some peculiarities of human feedback in such an open-ended environment. [sent-203, score-0.481]

65 There are at least two difficulties with average cumulative reward in an environment of human users. [sent-204, score-0.227]

66 Thus a feature that is popular and exciting to users when it is introduced may eventually become an irritant (there are many examples of this phenomenon). [sent-207, score-0.301]

67 While difficult to quantify in such a complex environment, this phenomenon is sufficiently prevalent in LambdaMOO to cast serious doubts on the use of average cumulative reward as the primary measure of performance. [sent-209, score-0.189]

68 The second and related difficulty is that even when users do maintain relatively fixed preferences, they tend to give Cobot less feedback of either type (reward or punishment) as he manages to learn their preferences accurately. [sent-210, score-0.492]

69 Simply put, once Cobot seems to be behaving as they wish, users feel no need to continually provide reward for his “correct” actions or to punish him for the occasional “mistake.” [sent-211, score-0.581]

70 This reward pattern is in contrast to typical RL applications, where there is an automated and indefatigable reward source. [sent-212, score-0.264]

71 These two users were among Cobot’s most dedicated trainers, each had strong preferences for certain actions, and Cobot learned to strongly modify his behavior in their presence to match their preferences. [sent-214, score-0.529]

72 Nevertheless, both users tended to provide less frequent feedback to Cobot as the experiment progressed, as shown in Figure 1a. [sent-215, score-0.331]

73 Among the 254 users who gave at least one reward or punishment event to Cobot, 218 gave 20 or fewer, while 15 gave 50 or more. [sent-218, score-0.442]

74 Thus, we found that while many users exhibited a passing interest in training Cobot, there was a small group that was willing to invest nontrivial time and effort in teaching Cobot their preferences. [sent-219, score-0.329]

75 User O appears to especially dislike roll call actions when there have been repeated roll calls and/or Cobot is repeating the same roll calls. [sent-223, score-0.512]

76 User C appears to have strong preferences about Cobot’s behavior when a “roll call party” is in progress. [sent-231, score-0.181]

77 Table 3: Relevant features for users with non-uniform policies. [sent-237, score-0.317]

78 Several of our top users had some features that deviated from their bias feature. [sent-238, score-0.349]

79 For the users above the double line, we have included only features whose weights had a length greater than 0. [sent-241, score-0.317]

80 Each of these users had bias weights of length greater than 1. [sent-243, score-0.304]

81 For the vast majority of users who participated in the RL training of Cobot, the policy learned was quite close to the uniform distribution. [sent-247, score-0.382]

82 However, we observed that for most users the learned policy’s dependence on state was weak, and the resulting distribution near uniform (though there are interesting and notable exceptions, as we shall see below). [sent-249, score-0.367]

83 This result is perhaps to be expected: most users provided too little feedback for Cobot to detect strong preferences, and may not have been exhibiting strong and consistent preferences in the feedback they did provide. [sent-250, score-0.553]

84 However, there was again a small group of users for whom a highly non-uniform policy was learned. [sent-251, score-0.348]

85 Several other users also exhibited less dramatic but still non-uniform distributions. [sent-257, score-0.298]

86 User M seemed to have a strong preference for roll call actions, with the learned policy selecting these with probability 0.99. [sent-258, score-0.246]

87 User S, by contrast, preferred social commentary actions, and the learned policy favored these accordingly. [sent-259, score-0.219]

88 In Figure 1b, we demonstrate that the policy learned by Cobot for User M does in fact reflect the empirical pattern of rewards received over time. [sent-264, score-0.206]

89 The policies learned by Cobot for users can have strong impact on the empirical distribution of actions he actually ends up taking in LambdaMOO. [sent-268, score-0.52]

90 First, we note that by construction, the RL weights learned for the bias feature described in Table 2 represent the user’s preferences independent of state (since this feature is always on whenever the user is present). [sent-274, score-0.516]

91 Thus, we can determine that a feature is relevant for a user if that feature’s weight vector is far from that user’s bias feature weight vector, and from the all-zero vector. [sent-276, score-0.314]
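
A small sketch of that relevance test; the use of Euclidean distance and the particular threshold are assumptions for illustration:

    import math

    def feature_is_relevant(feature_weights, bias_weights, threshold=0.15):
        """Flag a feature as relevant for a user when its weight vector lies far
        from both the user's bias-feature weights and the all-zero vector."""
        far_from_bias = math.dist(feature_weights, bias_weights) > threshold
        far_from_zero = math.dist(feature_weights, [0.0] * len(feature_weights)) > threshold
        return far_from_bias and far_from_zero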

92 User M prefers roll calls; however, as we can see in Table 3, Cobot has learned a policy for some users that depends upon state. [sent-280, score-0.514]

93 The numerical reward received may be larger or smaller than 1 at any time, as implicit rewards provide fractional reward, and the user may repeatedly reward or punish an action, with the feedback being summed. [sent-281, score-0.714]

94 [Figure 1a legend: cumulative reward and absolute (abs) reward curves for all users, for User M, and for User S.] [sent-285, score-2.455]

95 (Figure 1b) Rewards received, policy learned, and effect on actions for User M. [sent-301, score-0.189]

96 The blue bars (left) show the average reward given by User M for each action (the average reward given by User M across all actions has been subtracted off to indicate relative preferences). [sent-304, score-0.474]

97 We see that the policy learned by Cobot for User M aligns nicely with the preferences expressed by M and that Cobot’s behavior shifts strongly towards the learned policy for User M whenever M is present. [sent-308, score-0.342]

98 To go beyond a qualitative visual analysis, we have defined a metric that measures the extent to which two rankings of actions agree, while taking into account that some actions are extremely close in each ranking. [sent-309, score-0.254]
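
The paper's exact agreement metric is not reproduced in this summary; as an illustrative stand-in only, the sketch below counts pairwise order agreements between two action rankings while skipping pairs whose scores are nearly tied in either ranking:

    from itertools import combinations

    def ranking_agreement(scores_a, scores_b, tolerance=1e-3):
        """Fraction of action pairs ordered the same way under both score maps
        (assumed to cover the same actions), ignoring effectively tied pairs."""
        agreements, comparisons = 0, 0
        for x, y in combinations(sorted(scores_a), 2):
            diff_a = scores_a[x] - scores_a[y]
            diff_b = scores_b[x] - scores_b[y]
            if abs(diff_a) < tolerance or abs(diff_b) < tolerance:
                continue  # too close to call in one of the rankings
            comparisons += 1
            agreements += (diff_a > 0) == (diff_b > 0)
        return agreements / comparisons if comparisons else 1.0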

99 9 Conclusions. We have reported on our efforts to apply reinforcement learning in a complex human online social environment where many of the standard assumptions (stationary rewards, Markovian behavior, appropriateness of average reward) are clearly violated. [sent-312, score-0.24]

100 Cobot continues to take RL actions and receive rewards and punishments from LambdaMOO users, and we plan to continue and embellish this work as part of our overall efforts on Cobot. [sent-314, score-0.241]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cobot', 0.788), ('users', 0.286), ('user', 0.266), ('lambdamoo', 0.216), ('rl', 0.171), ('preferences', 0.133), ('actions', 0.127), ('reward', 0.126), ('roll', 0.119), ('social', 0.111), ('rewards', 0.065), ('action', 0.063), ('policy', 0.062), ('agent', 0.052), ('verbs', 0.051), ('punishments', 0.049), ('feedback', 0.045), ('room', 0.042), ('buster', 0.042), ('punish', 0.042), ('environment', 0.038), ('isbell', 0.035), ('state', 0.035), ('learned', 0.034), ('functionality', 0.034), ('living', 0.034), ('events', 0.033), ('features', 0.031), ('reactive', 0.031), ('policies', 0.031), ('punishment', 0.03), ('received', 0.03), ('verb', 0.029), ('dedicated', 0.029), ('reinforcement', 0.027), ('interacted', 0.024), ('human', 0.024), ('cumulative', 0.023), ('community', 0.022), ('abs', 0.021), ('hfh', 0.021), ('singh', 0.02), ('chat', 0.018), ('bias', 0.018), ('behavior', 0.017), ('nontrivial', 0.017), ('felt', 0.017), ('average', 0.016), ('objects', 0.016), ('strong', 0.016), ('sutton', 0.016), ('feature', 0.015), ('call', 0.015), ('maintain', 0.015), ('empirical', 0.015), ('experiences', 0.015), ('implicit', 0.014), ('presence', 0.014), ('bored', 0.014), ('chatting', 0.014), ('deviated', 0.014), ('dialogue', 0.014), ('emotional', 0.014), ('entertaining', 0.014), ('inappropriateness', 0.014), ('invest', 0.014), ('lewinsky', 0.014), ('proactive', 0.014), ('rooms', 0.014), ('rudimentary', 0.014), ('shelton', 0.014), ('spam', 0.014), ('tired', 0.014), ('trainers', 0.014), ('drift', 0.014), ('separate', 0.014), ('sources', 0.013), ('upon', 0.013), ('parents', 0.013), ('calls', 0.013), ('primary', 0.013), ('online', 0.013), ('learn', 0.013), ('ndings', 0.012), ('exhibited', 0.012), ('provided', 0.012), ('commentary', 0.012), ('senses', 0.012), ('conversation', 0.012), ('handful', 0.012), ('automated', 0.012), ('populated', 0.012), ('residing', 0.012), ('routines', 0.012), ('someone', 0.012), ('interesting', 0.012), ('table', 0.012), ('learns', 0.012), ('complex', 0.011), ('impact', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999899 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

Author: Charles Lee Isbell Jr., Christian R. Shelton

Abstract: We report on the use of reinforcement learning with Cobot, a software agent residing in the well-known online community LambdaMOO. Our initial work on Cobot (Isbell et al., 2000) provided him with the ability to collect social statistics and report them to users. Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifying his behavior based on his current state. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment. 1

2 0.15422989 24 nips-2001-Active Information Retrieval

Author: Tommi Jaakkola, Hava T. Siegelmann

Abstract: In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance. The users may further refine their query as needed. This process may result in a lengthy correspondence without conclusion. We propose an alternative active learning approach, where the system responds to the initial user's query by successively probing the user for distinctions at multiple levels of abstraction. The system's initiated queries are optimized for speedy recovery and the user is permitted to respond with multiple selections or may reject the query. The information is in each case unambiguously incorporated by the system and the subsequent queries are adjusted to minimize the need for further exchange. The system's initiated queries are subject to resource constraints pertaining to the amount of information that can be presented to the user per iteration. 1

3 0.1138887 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

4 0.11255993 140 nips-2001-Optimising Synchronisation Times for Mobile Devices

Author: Neil D. Lawrence, Antony I. T. Rowstron, Christopher M. Bishop, Michael J. Taylor

Abstract: With the increasing number of users of mobile computing devices (e.g. personal digital assistants) and the advent of third generation mobile phones, wireless communications are becoming increasingly important. Many applications rely on the device maintaining a replica of a data-structure which is stored on a server, for example news databases, calendars and e-mail. In this paper we explore the question of the optimal strategy for synchronising such replicas. We utilise probabilistic models to represent how the data-structures evolve and to model user behaviour. We then formulate objective functions which can be minimised with respect to the synchronisation timings. We demonstrate, using two real world data-sets, that a user can obtain more up-to-date information using our approach. 1

5 0.10433849 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists

Author: John C. Platt, Christopher J. C. Burges, Steven Swenson, Christopher Weare, Alice Zheng

Abstract: This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learning a Gaussian Process kernel from a distribution of functions that generates the learned function. For playlist generation, AutoDJ learns a kernel from a large set of albums. This learned kernel is shown to be more effective at predicting users’ playlists than a reasonable hand-designed kernel.

6 0.10276945 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.087779 126 nips-2001-Motivated Reinforcement Learning

8 0.087718934 128 nips-2001-Multiagent Planning with Factored MDPs

9 0.086894318 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

10 0.085224971 13 nips-2001-A Natural Policy Gradient

11 0.082009479 40 nips-2001-Batch Value Function Approximation via Support Vectors

12 0.070798665 121 nips-2001-Model-Free Least-Squares Policy Iteration

13 0.068668872 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

14 0.062065803 59 nips-2001-Direct value-approximation for factored MDPs

15 0.048146442 67 nips-2001-Efficient Resources Allocation for Markov Decision Processes

16 0.046433564 148 nips-2001-Predictive Representations of State

17 0.04550859 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

18 0.045487341 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

19 0.043901317 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

20 0.03948899 31 nips-2001-Algorithmic Luckiness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.102), (1, -0.077), (2, 0.176), (3, -0.018), (4, -0.01), (5, 0.048), (6, -0.056), (7, -0.004), (8, -0.048), (9, 0.011), (10, 0.073), (11, -0.073), (12, -0.141), (13, -0.007), (14, -0.021), (15, 0.119), (16, 0.152), (17, 0.102), (18, -0.137), (19, 0.04), (20, 0.097), (21, 0.013), (22, -0.071), (23, 0.047), (24, -0.233), (25, 0.047), (26, 0.042), (27, -0.176), (28, -0.06), (29, -0.087), (30, -0.002), (31, 0.023), (32, -0.009), (33, -0.146), (34, -0.009), (35, -0.019), (36, 0.082), (37, 0.132), (38, 0.155), (39, 0.03), (40, -0.06), (41, 0.048), (42, 0.135), (43, 0.042), (44, -0.122), (45, 0.01), (46, -0.039), (47, 0.015), (48, -0.056), (49, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96480864 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

Author: Charles Lee Isbell Jr., Christian R. Shelton

Abstract: We report on the use of reinforcement learning with Cobot, a software agent residing in the well-known online community LambdaMOO. Our initial work on Cobot (Isbell et al., 2000) provided him with the ability to collect social statistics and report them to users. Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifying his behavior based on his current state. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment. 1

2 0.72419244 140 nips-2001-Optimising Synchronisation Times for Mobile Devices

Author: Neil D. Lawrence, Antony I. T. Rowstron, Christopher M. Bishop, Michael J. Taylor

Abstract: With the increasing number of users of mobile computing devices (e.g. personal digital assistants) and the advent of third generation mobile phones, wireless communications are becoming increasingly important. Many applications rely on the device maintaining a replica of a data-structure which is stored on a server, for example news databases, calendars and e-mail. In this paper we explore the question of the optimal strategy for synchronising such replicas. We utilise probabilistic models to represent how the data-structures evolve and to model user behaviour. We then formulate objective functions which can be minimised with respect to the synchronisation timings. We demonstrate, using two real world data-sets, that a user can obtain more up-to-date information using our approach. 1

3 0.62660378 113 nips-2001-Learning a Gaussian Process Prior for Automatically Generating Music Playlists

Author: John C. Platt, Christopher J. C. Burges, Steven Swenson, Christopher Weare, Alice Zheng

Abstract: This paper presents AutoDJ: a system for automatically generating music playlists based on one or more seed songs selected by a user. AutoDJ uses Gaussian Process Regression to learn a user preference function over songs. This function takes music metadata as inputs. This paper further introduces Kernel Meta-Training, which is a method of learning a Gaussian Process kernel from a distribution of functions that generates the learned function. For playlist generation, AutoDJ learns a kernel from a large set of albums. This learned kernel is shown to be more effective at predicting users’ playlists than a reasonable hand-designed kernel.

4 0.54179835 24 nips-2001-Active Information Retrieval

Author: Tommi Jaakkola, Hava T. Siegelmann

Abstract: In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance. The users may further refine their query as needed. This process may result in a lengthy correspondence without conclusion. We propose an alternative active learning approach, where the system responds to the initial user's query by successively probing the user for distinctions at multiple levels of abstraction. The system's initiated queries are optimized for speedy recovery and the user is permitted to respond with multiple selections or may reject the query. The information is in each case unambiguously incorporated by the system and the subsequent queries are adjusted to minimize the need for further exchange. The system's initiated queries are subject to resource constraints pertaining to the amount of information that can be presented to the user per iteration. 1

5 0.47265464 126 nips-2001-Motivated Reinforcement Learning

Author: Peter Dayan

Abstract: The standard reinforcement learning view of the involvement of neuromodulatory systems in instrumental conditioning includes a rather straightforward conception of motivation as prediction of sum future reward. Competition between actions is based on the motivating characteristics of their consequent states in this sense. Substantial, careful experiments into the neurobiology and psychology of motivation, reviewed in Dickinson & Balleine [12, 13], show that this view is incomplete. In many cases, animals are faced with the choice not between many different actions at a given state, but rather whether a single response is worth executing at all. Evidence suggests that the motivational process underlying this choice has different psychological and neural properties from that underlying action choice. We describe and model these motivational systems, and consider the way they interact.

6 0.36624372 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

7 0.34962043 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

8 0.32664227 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

9 0.31604519 148 nips-2001-Predictive Representations of State

10 0.30362421 177 nips-2001-Switch Packet Arbitration via Queue-Learning

11 0.28416264 128 nips-2001-Multiagent Planning with Factored MDPs

12 0.26318911 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

13 0.24109322 121 nips-2001-Model-Free Least-Squares Policy Iteration

14 0.23758666 40 nips-2001-Batch Value Function Approximation via Support Vectors

15 0.23453055 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

16 0.23427102 175 nips-2001-Stabilizing Value Function Approximation with the BFBP Algorithm

17 0.21913899 13 nips-2001-A Natural Policy Gradient

18 0.19822417 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing

19 0.19100861 181 nips-2001-The Emergence of Multiple Movement Units in the Presence of Noise and Feedback Delay

20 0.18419939 91 nips-2001-Improvisation and Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.023), (17, 0.027), (19, 0.021), (27, 0.071), (30, 0.055), (33, 0.325), (38, 0.026), (59, 0.025), (67, 0.016), (72, 0.046), (79, 0.043), (83, 0.06), (91, 0.137)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80519873 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

Author: Charles Lee Isbell Jr., Christian R. Shelton

Abstract: We report on the use of reinforcement learning with Cobot, a software agent residing in the well-known online community LambdaMOO. Our initial work on Cobot (Isbell et al., 2000) provided him with the ability to collect social statistics and report them to users. Here we describe an application of RL allowing Cobot to take proactive actions in this complex social environment, and adapt behavior from multiple sources of human reward. After 5 months of training, and 3171 reward and punishment events from 254 different LambdaMOO users, Cobot learned nontrivial preferences for a number of users, modifying his behavior based on his current state. Here we describe LambdaMOO and the state and action spaces of Cobot, and report the statistical results of the learning experiment. 1

2 0.78689182 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

Author: H. Colonius, A. Diederich

Abstract: Multisensory response enhancement (MRE) is the augmentation of the response of a neuron to sensory input of one modality by simultaneous input from another modality. The maximum likelihood (ML) model presented here modifies the Bayesian model for MRE (Anastasio et al.) by incorporating a decision strategy to maximize the number of correct decisions. Thus the ML model can also deal with the important tasks of stimulus discrimination and identification in the presence of incongruent visual and auditory cues. It accounts for the inverse effectiveness observed in neurophysiological recording data, and it predicts a functional relation between uni- and bimodal levels of discriminability that is testable both in neurophysiological and behavioral experiments. 1

3 0.48583466 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty. The classifier is implemented as hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

4 0.4801062 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

Author: Aaron C. Courville, David S. Touretzky

Abstract: The Temporal Coding Hypothesis of Miller and colleagues [7] suggests that animals integrate related temporal patterns of stimuli into single memory representations. We formalize this concept using quasi-Bayes estimation to update the parameters of a constrained hidden Markov model. This approach allows us to account for some surprising temporal effects in the second order conditioning experiments of Miller et al. [1 , 2, 3], which other models are unable to explain. 1

5 0.47932607 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

6 0.47897443 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

7 0.4786934 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

8 0.47515005 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

9 0.4746176 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

10 0.47229329 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

11 0.47138292 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

12 0.47118866 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

13 0.47113049 13 nips-2001-A Natural Policy Gradient

14 0.47086447 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

15 0.4705264 169 nips-2001-Small-World Phenomena and the Dynamics of Information

16 0.46972394 56 nips-2001-Convolution Kernels for Natural Language

17 0.46884125 182 nips-2001-The Fidelity of Local Ordinal Encoding

18 0.46827707 3 nips-2001-ACh, Uncertainty, and Cortical Inference

19 0.4678514 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

20 0.46767572 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's