nips nips2004 nips2004-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. [sent-9, score-0.23]
2 But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. [sent-10, score-0.337]
3 In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. [sent-11, score-0.53]
4 1 Introduction Psychologists distinguish between extrinsic motivation, which means being moved to do something because of some specific rewarding outcome, and intrinsic motivation, which refers to being moved to do something because it is inherently enjoyable. [sent-12, score-0.483]
5 The skills making up general competence act as the “building blocks” out of which an agent can form solutions to new problems as they arise. [sent-20, score-0.639]
6 This paper presents an elaboration of the reinforcement learning (RL) framework [11] that encompasses the autonomous development of skill hierarchies through intrinsically motivated reinforcement learning. [sent-23, score-0.455]
7 We illustrate its ability to allow an agent to learn broad competence in a simple “playroom” environment. [sent-24, score-0.41]
8 Many researchers have argued for this kind of developmental approach in which an agent undergoes an extended developmental period during which collections of reusable skills are autonomously learned that will be useful for a wide range of later challenges (e. [sent-27, score-0.708]
9 , [8]) on confidence-based curiosity and the ideas of exploration and shaping bonuses [6, 10], although our definition of intrinsic reward differs from these. [sent-32, score-0.577]
10 The neuromodulator dopamine has long been associated with reward learning [9]. [sent-34, score-0.386]
11 Recent studies [2, 3] have focused on the idea that dopamine not only plays a critical role in the extrinsic motivational control of behaviors aimed at harvesting explicit rewards, but also in the intrinsic motivational control of behaviors associated with novelty and exploration. [sent-35, score-0.673]
12 The agent learns to improve its skill in controlling the environment in the sense of learning how to increase the total amount of reward it receives over time from the critic. [sent-44, score-0.793]
13 "Organism" Sutton and Barto [11] point out that one should not identify this RL agent with an entire animal or robot. [sent-48, score-0.393]
14 An animal’s reward signals are determined by processes within its brain that monitor not only external state but also the animal’s internal state. [sent-49, score-0.402]
15 1A into an external environment and an internal environment, the latter of which contains the critic which determines primary reward. [sent-53, score-0.214]
16 This scheme still includes cases in which reward is essentially an external stimulus (e. [sent-54, score-0.337]
17 The usual practice in applying RL algorithms is to formulate the problem one wants the agent to learn how to solve (e. [sent-58, score-0.377]
18 , win at backgammon) and define a reward function specially tailored for this problem (e. [sent-60, score-0.31]
19 Sometimes considerable ingenuity is required to craft an appropriate reward function. [sent-63, score-0.31]
20 The point of departure for our approach is to note that the internal environment contains, among other things, the organism’s motivational system, which needs to be a sophisticated system that should not have to be redesigned for different problems. [sent-64, score-0.213]
21 Our approach to skills builds on the theory of options [12]. [sent-68, score-0.453]
22 It consists of 1) an option policy that directs the agent’s behavior for a subset of the environment states, 2) an initiation set consisting of all the states in which the option can be initiated, and 3) a termination condition, which specifies the conditions under which the option terminates. [sent-70, score-1.47]
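As a rough sketch of how these three components can be represented in code (the class, field, and method names below are illustrative choices, not the paper's implementation):

```python
# Minimal sketch of an option as a data structure: 1) option policy,
# 2) initiation set, 3) termination condition. eq=False keeps identity-based
# hashing so options can later be used as keys in value tables.
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Set

State = Hashable
Action = Hashable

@dataclass(eq=False)
class Option:
    policy: Dict[State, Action] = field(default_factory=dict)    # 1) option policy
    initiation_set: Set[State] = field(default_factory=set)      # 2) where it may start
    termination_prob: Callable[[State], float] = lambda s: 0.0   # 3) beta(s)

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set

    def act(self, s: State) -> Action:
        # Closed-loop control: the action depends on the current state,
        # not on a pre-committed action sequence.
        return self.policy[s]
```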
23 It is important to note that an option is not a sequence of actions; it is a closed-loop control rule, meaning that it is responsive to on-going state changes. [sent-71, score-0.418]
24 Furthermore, because options can invoke other options as actions, hierarchical skills and algorithms for learning them naturally emerge from the conception of skills as options. [sent-72, score-0.931]
25 Theoretically, when options are added to the set of admissible agent actions, the usual Markov decision process (MDP) formulation of RL extends to semi-Markov decision processes (SMDPs), with the one-step actions now becoming the “primitive actions.” [sent-73, score-0.678]
26 All of the theory and algorithms applicable to SMDPs can be appropriated for decision making and learning with options [12]. [sent-74, score-0.224]
27 Two components of the options framework are especially important for our approach: 1. [sent-75, score-0.224]
28 Option Models: An option model is a probabilistic description of the effects of executing an option. [sent-76, score-0.384]
29 As a function of an environment state where the option is initiated, it gives the probability with which the option will terminate at any other state, and it gives the total amount of reward expected over the option’s execution. [sent-77, score-1.198]
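A purely empirical sketch of such a model is given below: for each state where the option was started, it tallies where the option tended to terminate and how much reward it collected on the way. The real options framework uses discounted quantities [12]; the class and method names here are assumptions for illustration.

```python
# Empirical option-model sketch: termination statistics and average total
# reward per starting state (discounting omitted for brevity).
from collections import defaultdict

class OptionModel:
    def __init__(self):
        self.term_counts = defaultdict(lambda: defaultdict(int))  # start -> terminal -> count
        self.reward_sum = defaultdict(float)                      # start -> summed reward
        self.starts = defaultdict(int)                            # start -> executions

    def update(self, s_start, s_terminal, total_reward):
        self.term_counts[s_start][s_terminal] += 1
        self.reward_sum[s_start] += total_reward
        self.starts[s_start] += 1

    def termination_prob(self, s_start, s_terminal):
        n = self.starts[s_start]
        return self.term_counts[s_start][s_terminal] / n if n else 0.0

    def expected_reward(self, s_start):
        n = self.starts[s_start]
        return self.reward_sum[s_start] / n if n else 0.0
```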
30 Intra-option Learning Methods: These methods allow the policies of many options to be updated simultaneously during an agent’s interaction with the environment. [sent-81, score-0.276]
31 If an option could have produced a primitive action in a given state, its policy can be updated on the basis of the observed consequences even though it was not directing the agent’s behavior at the time. [sent-82, score-0.653]
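The sketch below shows one such update applied to a single primitive transition. It is a standard intra-option Q-learning backup written against hypothetical helpers (pi, beta, terminal_value); it is not the paper's code.

```python
# One intra-option Q-learning backup for a single option, applied to the
# primitive transition (s, a, r, s2). The option's table Qo is updated only
# if its own policy would have chosen `a` in `s`.
def intra_option_update(Qo, s, a, r, s2, pi, beta, terminal_value,
                        alpha=0.1, gamma=0.99):
    if pi(s) != a:
        return
    # Either the option continues from s2 (value of following pi there),
    # or it terminates there (value supplied by terminal_value).
    cont = (1.0 - beta(s2)) * Qo.get((s2, pi(s2)), 0.0)
    term = beta(s2) * terminal_value(s2)
    target = r + gamma * (cont + term)
    Qo[(s, a)] = Qo.get((s, a), 0.0) + alpha * (target - Qo.get((s, a), 0.0))
```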
32 In most of the work with options, the set of options must be provided by the system designer. [sent-83, score-0.224]
33 While an option’s policy can be improved through learning, each option has to be predefined by providing its initiation set, termination condition, and the reward function that evaluates its performance. [sent-84, score-0.856]
34 For the most part, these methods extract options from the learning system’s attempts to solve a particular problem, whereas our approach creates options outside of the context of solving any particular problem. [sent-88, score-0.448]
35 Developing Hierarchical Collections of Skills—Children accumulate skills while they engage in intrinsically motivated behavior, e. [sent-89, score-0.443]
36 We claim that the concepts of an option and an option model are exactly appropriate to model this type of behavior. [sent-94, score-0.768]
37 3 Intrinsically Motivated RL Our main departure from the usual application of RL is that our agent maintains a knowledge base of skills that it learns using intrinsic rewards. [sent-96, score-0.828]
38 to QB. // — Choose next action. // — Determine next extrinsic reward: set r^e_{t+1} to the extrinsic reward for the transition s_t, a_t → s_{t+1}; set s_t ← s_{t+1}, a_t ← a_{t+1}, r^e_t ← r^e_{t+1}, r^i_t ← r^i_{t+1}. Figure 2: Learning Algorithm. [sent-99, score-1.622]
39 Extrinsic reward is denoted r^e while intrinsic reward is denoted r^i. [sent-100, score-0.819]
40 The option action value functions Qo are updated using intra-option Q-learning. [sent-104, score-0.479]
41 Note that the intrinsic reward is only used in updating QB and not any of the Qo . [sent-105, score-0.509]
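A compact sketch of this bookkeeping is shown below: the behavior value function QB is backed up with extrinsic plus intrinsic reward, while each Qo would be updated by the intra-option rule above without the intrinsic term. The one-step form here is a simplification that omits the SMDP-planning mix described next; variable names are illustrative.

```python
# Q-learning backup for the behavior value function QB over both primitive
# actions and options; note that r_int enters only here, never the Qo updates.
def update_QB(QB, s, a, r_ext, r_int, s2, admissible_actions,
              alpha=0.1, gamma=0.99):
    best_next = max((QB.get((s2, b), 0.0) for b in admissible_actions), default=0.0)
    target = (r_ext + r_int) + gamma * best_next
    QB[(s, a)] = QB.get((s, a), 0.0) + alpha * (target - QB.get((s, a), 0.0))
```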
42 Behavior The agent behaves in its environment according to an -greedy policy with respect to an action-value function QB that is learned using a mix of Q-learning and SMDP planning as described in Fig. [sent-107, score-0.564]
43 Over time, skills represented internally as options and their models also become available to the agent as action choices. [sent-110, score-0.89]
44 Thus, QB maps states s and actions a (both primitive and options) to the expected long-term utility of taking that action a in state s. [sent-111, score-0.284]
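A minimal sketch of ε-greedy selection over this mixed action set, reusing the Option sketch above; treating an option as selectable only where its initiation set allows is the assumption being illustrated.

```python
import random

def epsilon_greedy(QB, s, primitive_actions, options, epsilon=0.1):
    # Options are admissible only in states belonging to their initiation set.
    admissible = list(primitive_actions) + [o for o in options if o.can_initiate(s)]
    if random.random() < epsilon:
        return random.choice(admissible)
    return max(admissible, key=lambda a: QB.get((s, a), 0.0))
```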
45 Salient Events In our current implementation we assume that the agent has intrinsic or hardwired notions of interesting or “salient” events in its environment. [sent-112, score-0.603]
46 For example, in the playroom environment we present shortly, the agent finds changes in light and sound intensity to be salient. [sent-113, score-0.765]
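As a toy illustration of this hardwired salience test (the observation field names below are assumptions made for illustration):

```python
# Return the salient events triggered by a change in light or sound intensity
# between consecutive observations.
def salient_events(prev_obs, obs):
    events = []
    if obs["light"] != prev_obs["light"]:
        events.append("light-on" if obs["light"] else "light-off")
    if obs["sound"] != prev_obs["sound"]:
        events.append("sound-on" if obs["sound"] else "sound-off")
    return events
```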
47 Reward In addition to the usual extrinsic rewards there are occasional intrinsic rewards generated by the agent’s critic (see Fig. [sent-115, score-0.621]
48 In this implementation, the agent’s intrinsic reward is generated in a way suggested by the novelty response of dopamine neurons. [sent-117, score-0.616]
49 The intrinsic reward for each salient event is proportional to the error in the prediction of the salient event according to the learned option model for that event (see Fig. [sent-118, score-1.725]
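A minimal sketch of that rule, assuming the option model exposes a predicted probability that the event occurs from the current state; the scale factor is an illustrative free parameter, not a value from the paper.

```python
# Intrinsic reward proportional to the option model's prediction error for the
# salient event: large when the event is surprising, shrinking toward zero as
# the model learns to predict it.
def intrinsic_reward(predicted_prob, event_occurred, scale=1.0):
    return scale * abs(float(event_occurred) - predicted_prob)
```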
50 Skill-KB The agent maintains a knowledge base of skills that it has learned in its environment. [sent-120, score-0.62]
51 The first time a salient event occurs, say the light turning on, structures to learn an option that achieves that salient event (turn-light-on option) are created in the skill-KB. [sent-122, score-1.217]
52 In addition, structures to learn an option model are also created. [sent-123, score-0.384]
53 So for option o, Qo maps states s and actions a (again, both primitive and options) to the long-term utility of taking action a in state s. [sent-124, score-0.668]
54 The option for a salient event terminates with probability one in any state that achieves that event and never terminates in any other state. [sent-125, score-0.859]
55 The initiation set, I^o, for an option o is incrementally expanded to include states that lead to states in the current initiation set. [sent-126, score-0.594]
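Putting these pieces together, a hedged sketch of the skill-KB logic (reusing the Option and OptionModel sketches above; class and method names are illustrative):

```python
# Skill knowledge base: the first occurrence of a salient event creates an
# option (and its model) whose subgoal is that event; the initiation set then
# grows to cover states observed to lead into it.
class SkillKB:
    def __init__(self):
        self.options = {}   # event name -> Option
        self.models = {}    # event name -> OptionModel

    def on_salient_event(self, event, state):
        if event not in self.options:
            opt = Option()
            opt.initiation_set.add(state)
            self.options[event] = opt
            self.models[event] = OptionModel()

    def grow_initiation_set(self, event, predecessor_state):
        # Incrementally add states seen to lead into the current initiation set.
        self.options[event].initiation_set.add(predecessor_state)
```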
56 In the playroom are a number of objects: a light switch, a ball, a bell, two movable blocks that are also buttons for turning music on and off, as well as a toy monkey that can make sounds. [sent-132, score-0.574]
57 The agent has an eye, a hand, and a visual marker (seen as a cross hair in the figure). [sent-133, score-0.397]
58 At any time step, the agent has the following actions available to it: 1) move eye to hand, 2) move eye to marker, 3) move eye one step north, south, east or west, 4) move eye to random object, 5) move hand to eye, and 6) move marker to eye. [sent-135, score-1.456]
59 In addition, if both the eye and hand are on some object, then natural operations suggested by the object become available, e. [sent-136, score-0.198]
60 , if both the hand and the eye are on the light switch, then the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, then the action of kicking the ball becomes available (the ball, when kicked, moves in a straight line to the marker). [sent-138, score-0.919]
61 The light switch controls the lighting in the room. [sent-141, score-0.204]
62 The blue block, if pressed, turns music on, while the red block, if pressed, turns music off. [sent-143, score-0.232]
63 The toy monkey makes frightened sounds if, simultaneously, the room is dark, the music is on, and the bell is rung. [sent-145, score-0.262]
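The object-contingent actions described above can be sketched as a simple availability rule; the state encoding and action names below are assumptions made for illustration.

```python
# Object-specific actions become available only when both the eye and the
# hand are on the same object.
def available_object_actions(state):
    actions = []
    if state["eye_on"] == state["hand_on"] == "light_switch":
        actions.append("flick_light_switch")
    if state["eye_on"] == state["hand_on"] == "ball":
        actions.append("kick_ball")          # the ball then travels to the marker
    if state["eye_on"] == state["hand_on"] == "blue_block":
        actions.append("press_blue_block")   # turns music on
    if state["eye_on"] == state["hand_on"] == "red_block":
        actions.append("press_red_block")    # turns music off
    return actions
```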
64 Notice that if the agent has already learned how to turn the light on and off, how to turn music on, and how to make the bell ring, then those learned skills would be of obvious use in simplifying this process of engaging the toy monkey. [sent-148, score-1.038]
65 The effect of intrinsically motivated learning when extrinsic reward is present. [sent-157, score-0.674]
66 See text for details. For this simple example, changes in light and sound intensity are considered salient by the playroom agent. [sent-158, score-0.584]
67 Because the initial action value function, QB , is uninformative, the agent starts by exploring its environment randomly. [sent-159, score-0.5]
68 Each first encounter with a salient event initiates the learning of an option and an option model for that salient event. [sent-160, score-1.39]
69 For example, the first time the agent happens to turn the light on, it initiates the data structures necessary for learning and storing the light-on option. [sent-161, score-0.538]
70 As the agent moves around the environment, all the options (initiated so far) and their models are simultaneously updated using intra-option learning. [sent-162, score-0.632]
71 As a result, when the agent encounters an unpredicted salient event a few times, its updated action value function drives it to repeatedly attempt to achieve that salient event. [sent-165, score-1.102]
72 Of course, the option policy and model become accurate in states the agent encounters frequently. [sent-167, score-0.874]
73 Occasionally, the agent encounters the salient event in a state (set of sensor readings) that it has not encountered before, and it generates intrinsic reward again (it is “surprised”). [sent-168, score-1.27]
74 Each panel of the figure is for a distinct salient event. [sent-171, score-0.251]
75 The graph in each panel shows both the time steps at which the event occurs and the intrinsic reward the agent associates with each occurrence. [sent-172, score-0.95]
76 Each occurrence is denoted by a vertical bar whose height denotes the amount of associated intrinsic reward. [sent-173, score-0.199]
77 Note that as one goes from top to bottom in this figure, the salient events become harder to achieve and, in fact, become more hierarchical. [sent-174, score-0.355]
78 Indeed, the lowest one for turning on the monkey noise (Non) needs light on, music on, light off, sound on in sequence. [sent-175, score-0.545]
79 First note that the salient events that are simpler to achieve occur earlier in time. [sent-177, score-0.309]
80 For example, Lon (light turning on) and Loff (light turning off) are the simplest salient events, and the agent makes these happen quite early. [sent-178, score-0.713]
81 The agent tries them a large number of times before getting bored and moving on to other salient events. [sent-179, score-0.641]
82 The reward obtained for each of these events diminishes after repeated exposure to the event. [sent-180, score-0.391]
83 Each panel depicts the occurrences of salient events as well as the associated intrinsic rewards. [sent-183, score-0.508]
84 Of course, the events keep happening despite their diminished capacity to reward because they are needed to achieve the more complex events. [sent-185, score-0.368]
85 Consequently, the agent continues to turn the light on and off even after it has learned this skill because this is a step along the way toward turning on the music, as well as along the way toward turning on the monkey noise. [sent-186, score-0.807]
86 Finally note that the more complex skills are learned relatively quickly once the required sub-skills are in place, as one can see by the few rewards the agent receives for them. [sent-187, score-0.693]
87 The agent is able to bootstrap and build upon the options it has already learned for the simpler events. [sent-188, score-0.615]
88 We confirmed the hierarchical nature of the learned options by inspecting the greedy policies for the more complex options like Non and Noff. [sent-189, score-0.518]
89 The fact that all the options are successfully learned is also seen in Fig. [sent-190, score-0.269]
90 This figure also shows that the simpler skills are learned earlier than the more complex ones. [sent-192, score-0.274]
91 An agent having a collection of skills learned through intrinsic reward can learn a wide variety of extrinsically rewarded tasks more easily than an agent lacking these skills. [sent-193, score-1.475]
92 To illustrate, we looked at a playroom task in which extrinsic reward was available only if the agent succeeded in making the monkey cry out. [sent-194, score-1.103]
93 This is difficult for an agent to learn if only the extrinsic reward is available, but much easier if the agent can use intrinsic reward to learn a collection of skills, some of which are relevant to the overall task. [sent-196, score-1.686]
94 Each starts out with no knowledge of the task, but one employs the intrinsic reward mechanism we have discussed above. [sent-199, score-0.509]
95 The extrinsic reward is present throughout, but it is delivered only when the monkey cries out. [sent-200, score-0.567]
96 The figure, which shows the average of 100 repetitions of the experiment, clearly demonstrates the advantage of learning with intrinsic reward. [sent-201, score-0.199]
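The comparison just described can be sketched as the following experiment loop; make_agent and run_task are hypothetical helpers standing in for the playroom setup, and the averaging over 100 runs mirrors the description above.

```python
# Compare an agent that uses the intrinsic-reward mechanism with one that
# learns from extrinsic reward alone, averaging extrinsic reward per step
# over repeated runs.
def compare(make_agent, run_task, n_runs=100, n_steps=10000):
    curves = {}
    for label, use_intrinsic in [("intrinsic + extrinsic", True),
                                 ("extrinsic only", False)]:
        totals = [0.0] * n_steps
        for _ in range(n_runs):
            agent = make_agent(use_intrinsic=use_intrinsic)
            rewards = run_task(agent, n_steps)   # extrinsic reward at each step
            totals = [t + r for t, r in zip(totals, rewards)]
        curves[label] = [t / n_runs for t in totals]
    return curves
```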
97 Discussion One of the key aspects of the Playroom example was that intrinsic reward was generated only by unexpected salient events. [sent-202, score-0.76]
98 In the future, we intend to implement computational analogs of other forms of intrinsic motivation as suggested in the psychological, statistical, and neuroscience literatures. [sent-205, score-0.239]
99 Moreover, they were achieved quite directly by combining a collection of existing RL algorithms for learning options and option-models with a simple notion of intrinsic reward. [sent-207, score-0.423]
100 Policy invariance under reward transformations: Theory and application to reward shaping. [sent-251, score-0.62]
wordName wordTfidf (topN-words)
[('option', 0.384), ('agent', 0.346), ('reward', 0.31), ('salient', 0.251), ('skills', 0.229), ('options', 0.224), ('st', 0.209), ('intrinsic', 0.199), ('extrinsic', 0.175), ('playroom', 0.161), ('eye', 0.153), ('light', 0.141), ('intrinsically', 0.126), ('rl', 0.126), ('qb', 0.119), ('event', 0.095), ('music', 0.092), ('qo', 0.089), ('oe', 0.088), ('environment', 0.086), ('rt', 0.083), ('monkey', 0.082), ('actions', 0.077), ('initiation', 0.076), ('primitive', 0.076), ('dopamine', 0.076), ('motivational', 0.073), ('rewards', 0.073), ('critic', 0.07), ('action', 0.068), ('competence', 0.064), ('motivated', 0.063), ('switch', 0.063), ('turning', 0.058), ('events', 0.058), ('move', 0.058), ('policy', 0.057), ('ball', 0.053), ('reinforcement', 0.052), ('autonomous', 0.052), ('skill', 0.051), ('marker', 0.051), ('bell', 0.048), ('barto', 0.047), ('animal', 0.047), ('learned', 0.045), ('bored', 0.044), ('curiosity', 0.044), ('maxa', 0.043), ('behavior', 0.041), ('something', 0.04), ('toy', 0.04), ('motivation', 0.04), ('ro', 0.036), ('moves', 0.035), ('encounters', 0.035), ('initiated', 0.035), ('reusable', 0.035), ('state', 0.034), ('usual', 0.031), ('sound', 0.031), ('internal', 0.031), ('novelty', 0.031), ('development', 0.03), ('planning', 0.03), ('states', 0.029), ('chentanez', 0.029), ('cry', 0.029), ('elaboration', 0.029), ('nuttapong', 0.029), ('rewarding', 0.029), ('unpredicted', 0.029), ('termination', 0.029), ('collections', 0.028), ('external', 0.027), ('stimuli', 0.027), ('updated', 0.027), ('sutton', 0.026), ('turn', 0.026), ('ccf', 0.025), ('ipto', 0.025), ('smdp', 0.025), ('smdps', 0.025), ('engage', 0.025), ('psychologists', 0.025), ('developmental', 0.025), ('competent', 0.025), ('initiates', 0.025), ('hierarchical', 0.025), ('singh', 0.025), ('interaction', 0.025), ('block', 0.024), ('bring', 0.024), ('exploration', 0.024), ('become', 0.023), ('departure', 0.023), ('behaviors', 0.023), ('diminishes', 0.023), ('pushed', 0.023), ('hand', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
2 0.30861008 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
3 0.2376049 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
4 0.18269548 24 nips-2004-Approximately Efficient Online Mechanism Design
Author: David C. Parkes, Dimah Yanovsky, Satinder P. Singh
Abstract: Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement efficient policies, and retain truth-revelation as an approximate BayesianNash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. 1
5 0.1564883 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
6 0.15160221 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
7 0.11871778 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
8 0.11391422 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
9 0.099534281 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
10 0.099185929 48 nips-2004-Convergence and No-Regret in Multiagent Learning
11 0.094443254 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
12 0.090040632 155 nips-2004-Responding to Modalities with Different Latencies
13 0.084275037 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
14 0.08275298 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
15 0.077628583 33 nips-2004-Brain Inspired Reinforcement Learning
16 0.073309094 21 nips-2004-An Information Maximization Model of Eye Movements
17 0.071138166 202 nips-2004-VDCBPI: an Approximate Scalable Algorithm for Large POMDPs
18 0.063550867 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
19 0.058288157 183 nips-2004-Temporal-Difference Networks
20 0.049547307 148 nips-2004-Probabilistic Computation in Spiking Populations
topicId topicWeight
[(0, -0.137), (1, -0.085), (2, 0.341), (3, -0.12), (4, -0.2), (5, 0.216), (6, -0.031), (7, -0.163), (8, -0.067), (9, 0.136), (10, 0.045), (11, 0.034), (12, 0.086), (13, -0.019), (14, 0.013), (15, 0.04), (16, -0.065), (17, -0.055), (18, 0.019), (19, 0.084), (20, 0.017), (21, -0.018), (22, 0.081), (23, -0.058), (24, -0.096), (25, -0.037), (26, 0.023), (27, 0.0), (28, -0.05), (29, -0.069), (30, -0.109), (31, 0.048), (32, 0.093), (33, -0.019), (34, 0.036), (35, -0.216), (36, 0.006), (37, -0.077), (38, 0.091), (39, -0.007), (40, -0.045), (41, -0.021), (42, -0.046), (43, -0.141), (44, 0.022), (45, -0.058), (46, 0.024), (47, -0.102), (48, -0.078), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.98861969 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
2 0.71534729 64 nips-2004-Experts in a Markov Decision Process
Author: Eyal Even-dar, Sham M. Kakade, Yishay Mansour
Abstract: We consider an MDP setting in which the reward function is allowed to change during each time step of play (possibly in an adversarial manner), yet the dynamics remain fixed. Similar to the experts setting, we address the question of how well can an agent do when compared to the reward achieved under the best stationary policy over time. We provide efficient algorithms, which have regret bounds with no dependence on the size of state space. Instead, these bounds depend only on a certain horizon time of the process and logarithmically on the number of actions. We also show that in the case that the dynamics change over time, the problem becomes computationally hard. 1
3 0.68976802 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
4 0.55519634 39 nips-2004-Coarticulation in Markov Decision Processes
Author: Khashayar Rohanimanesh, Robert Platt, Sridhar Mahadevan, Roderic Grupen
Abstract: We investigate an approach for simultaneously committing to multiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we define a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal statevalue function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and empirically evaluate our approach in a simulated domain. 1
5 0.53487259 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
Author: Daniela D. Farias, Nimrod Megiddo
Abstract: A reactive environment is one that responds to the actions of an agent rather than evolving obliviously. In reactive environments, experts algorithms must balance exploration and exploitation of experts more carefully than in oblivious ones. In addition, a more subtle definition of a learnable value of an expert is required. A general exploration-exploitation experts method is presented along with a proper definition of value. The method is shown to asymptotically perform as well as the best available expert. Several variants are analyzed from the viewpoint of the exploration-exploitation tradeoff, including explore-then-exploit, polynomially vanishing exploration, constant-frequency exploration, and constant-size exploration phases. Complexity and performance bounds are proven. 1
6 0.50395644 24 nips-2004-Approximately Efficient Online Mechanism Design
7 0.49819598 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
8 0.385822 155 nips-2004-Responding to Modalities with Different Latencies
9 0.37859163 123 nips-2004-Multi-agent Cooperation in Diverse Population Games
10 0.36585838 129 nips-2004-New Criteria and a New Algorithm for Learning in Multi-Agent Systems
11 0.32089946 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
12 0.31721139 21 nips-2004-An Information Maximization Model of Eye Movements
13 0.30522871 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
14 0.30375627 183 nips-2004-Temporal-Difference Networks
15 0.29562965 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
16 0.28771305 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
17 0.27844098 33 nips-2004-Brain Inspired Reinforcement Learning
18 0.27160656 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
19 0.26217312 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
20 0.25163904 138 nips-2004-Online Bounds for Bayesian Algorithms
topicId topicWeight
[(13, 0.094), (15, 0.097), (26, 0.045), (31, 0.022), (33, 0.171), (35, 0.023), (39, 0.012), (50, 0.029), (52, 0.01), (58, 0.362), (71, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.84959996 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy. 1
2 0.75876522 80 nips-2004-Identifying Protein-Protein Interaction Sites on a Genome-Wide Scale
Author: Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas L. Brutlag
Abstract: Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that attempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate inference for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites. 1
3 0.70758653 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data
Author: Mario Marchand, Mohak Shah
Abstract: We propose a “soft greedy” learning algorithm for building small conjunctions of simple threshold functions, called rays, defined on single real-valued attributes. We also propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial tradeoff between sparsity (the number of rays used) and the magnitude of the separating margin of each ray. Finally, we test the soft greedy algorithm on four DNA micro-array data sets. 1
4 0.52649283 102 nips-2004-Learning first-order Markov models for control
Author: Pieter Abbeel, Andrew Y. Ng
Abstract: First-order Markov models have been successfully applied to many problems, for example in modeling sequential data using Markov chains, and modeling control problems using the Markov decision processes (MDP) formalism. If a first-order Markov model’s parameters are estimated from data, the standard maximum likelihood estimator considers only the first-order (single-step) transitions. But for many problems, the firstorder conditional independence assumptions are not satisfied, and as a result the higher order transition probabilities may be poorly approximated. Motivated by the problem of learning an MDP’s parameters for control, we propose an algorithm for learning a first-order Markov model that explicitly takes into account higher order interactions during training. Our algorithm uses an optimization criterion different from maximum likelihood, and allows us to learn models that capture longer range effects, but without giving up the benefits of using first-order Markov models. Our experimental results also show the new algorithm outperforming conventional maximum likelihood estimation in a number of control problems where the MDP’s parameters are estimated from data. 1
5 0.52509075 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
Author: Liam Paninski
Abstract: We develop a family of upper and lower bounds on the worst-case expected KL loss for estimating a discrete distribution on a finite number m of points, given N i.i.d. samples. Our upper bounds are approximationtheoretic, similar to recent bounds for estimating discrete entropy; the lower bounds are Bayesian, based on averages of the KL loss under Dirichlet distributions. The upper bounds are convex in their parameters and thus can be minimized by descent methods to provide estimators with low worst-case error; the lower bounds are indexed by a one-dimensional parameter and are thus easily maximized. Asymptotic analysis of the bounds demonstrates the uniform KL-consistency of a wide class of estimators as c = N/m → ∞ (no matter how slowly), and shows that no estimator is consistent for c bounded (in contrast to entropy estimation). Moreover, the bounds are asymptotically tight as c → 0 or ∞, and are shown numerically to be tight within a factor of two for all c. Finally, in the sparse-data limit c → 0, we find that the Dirichlet-Bayes (add-constant) estimator with parameter scaling like −c log(c) optimizes both the upper and lower bounds, suggesting an optimal choice of the “add-constant” parameter in this regime.
6 0.52458769 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
7 0.52354729 131 nips-2004-Non-Local Manifold Tangent Learning
8 0.52292889 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
9 0.52171254 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.52107102 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
11 0.52020437 163 nips-2004-Semi-parametric Exponential Family PCA
12 0.519853 127 nips-2004-Neighbourhood Components Analysis
13 0.51984608 116 nips-2004-Message Errors in Belief Propagation
14 0.51899707 64 nips-2004-Experts in a Markov Decision Process
15 0.5188818 124 nips-2004-Multiple Alignment of Continuous Time Series
16 0.51875824 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification
17 0.5186857 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
18 0.51809502 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
19 0.51790196 70 nips-2004-Following Curved Regularized Optimization Solution Paths
20 0.51761293 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications