nips nips2008 nips2008-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The essence of exploration is acting to try to decrease uncertainty. [sent-6, score-0.222]
2 Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. [sent-8, score-0.355]
3 Empirical results show that MRE improves upon state-of-the-art exploration approaches. [sent-10, score-0.222]
4 1 Introduction Exploration, in reinforcement learning, refers to the strategy an agent uses to discover new information about the environment. [sent-11, score-0.198]
5 A rich set of exploration techniques, some ad hoc and some not, has been developed in the RL literature for finite MDPs (Kaelbling et al., 1996). [sent-12, score-0.222]
6 But, because they treat each state independently, these techniques are not directly applicable to continuous-space problems, where some form of generalization must be used. [sent-15, score-0.099]
7 Some attempts have been made to improve the exploration effectiveness of algorithms in continuous-state spaces. [sent-16, score-0.222]
8 As a motivating example for the work we present here, consider how a discrete state-space algorithm might be adapted to work for a continuous state-space problem. [sent-22, score-0.068]
9 The practitioner must decide how to discretize the state space. [sent-23, score-0.063]
10 The dilemma of picking fine or coarse resolution has to be resolved in advance using estimates of the available resources, the dynamics and reward structure of the environment, and a desired level of optimality. [sent-25, score-0.104]
11 We propose using multi-resolution exploration (MRE) to create algorithms that explore continuous state spaces in an anytime manner without the need for a priori discretization. [sent-27, score-0.484]
12 The key to this ideal is to be able to dynamically adjust the level of generalization the agent uses during the learning process. [sent-28, score-0.189]
13 MRE sports a knownness criterion for states that allows the agent to reliably apply function approximation with different degrees of generalization to different regions of the state space. [sent-29, score-0.857]
14 One of the main contributions of this work is to provide a general exploration framework that can be used in both model-based and value-based algorithms. [sent-30, score-0.222]
15 While model-based techniques are known for their small sample complexities, thanks to their smart exploration, they haven’t been as successful as value-based methods in continuous spaces because of their expensive planning step. [sent-31, score-0.184]
16 Value-based methods, on the other hand, have been less fortunate in terms of intelligent exploration, and some of the very powerful RL techniques in continuous spaces, such as LSPI (Lagoudakis & Parr, 2003) and fitted Q-iteration (Ernst et al. [sent-32, score-0.068]
17 , 2005) are in the form of offline batch learning and completely ignore the problem of exploration. [sent-33, score-0.087]
18 In practice, an exploration strategy is usually incorporated with these algorithms to create online versions. [sent-34, score-0.222]
19 Here, we examine fitted Q-iteration and show how MRE can be used to improve its performance over conventional exploration schemes by systematically collecting better samples. [sent-35, score-0.222]
20 2 Background We consider environments that are modeled as Markov decision processes (MDPs) with continuous state spaces (Puterman, 1994). [sent-36, score-0.192]
21 T is the transition function that determines the next state given the current state and action. [sent-42, score-0.189]
22 It can be written in the form st+1 = T(xt, at) + ωt, where xt and at are the state and action at time t and ωt is white noise drawn i.i.d. from a known distribution ω. [sent-43, score-0.105]
23 In particular, a policy π is a mapping from states to actions that prescribes what action to take from each state. [sent-49, score-0.169]
24 Given a policy π and a starting state s, the value of s under π, denoted by Vπ(s), is the expected discounted sum of rewards the agent will collect by starting from s and following policy π. [sent-50, score-0.417]
25 Under mild conditions (Puterman, 1994), at least one policy exists that maximizes this value function over all states, which we refer to as the optimal policy, π∗. [sent-51, score-0.202]
26 The value of states under this policy is called the optimal value function, V∗(·) = Vπ∗(·). [sent-52, score-0.127]
27 The learning agent has prior knowledge of S, γ, ω and Rmax , but not T and R, and has to find a near-optimal policy solely through direct interaction with the environment. [sent-53, score-0.179]
28 Conceptually, knownness refers to the portion of the state space in which the agent can reliably predict the behavior of the environment. [sent-57, score-0.141]
29 Imagine how the agent would decide whether a state is known or unknown as described in (Kakade et al., 2003). [sent-58, score-0.141]
30 Based on the prior information about the smoothness of the environment and the level of desired optimality, we can form a hypersphere around each query point and check whether enough data points exist inside it to support the prediction. [sent-60, score-0.224]
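As a rough illustration of this fixed-radius test (the radius and count threshold below are arbitrary illustrative values, not parameters from the paper), a Boolean knownness check might look like the following sketch in Python:

```python
import numpy as np

def is_known(query, samples, radius=0.1, min_count=5):
    """Boolean knownness: does a hypersphere of the given radius around
    the query point contain at least min_count previously seen samples?"""
    if len(samples) == 0:
        return False
    dists = np.linalg.norm(np.asarray(samples) - np.asarray(query), axis=1)
    return int(np.sum(dists <= radius)) >= min_count
```

The drawback, as noted above, is that the radius (the degree of generalization) must be fixed in advance and cannot adapt as samples accumulate.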
31 MRE partitions the state space into a variable resolution discretization that dynamically forms smaller cells for regions with denser sample sets. [sent-64, score-0.37]
32 Generalization happens inside the cells (similar to the hypersphere example); this allows for wider but less accurate generalization in parts of the state space that have fewer sample points, and narrower but more accurate generalization in denser parts. [sent-65, score-0.305]
33 Let’s define a new knownness criterion that maps S into [0, 1] and quantifies how much we should trust the function approximation. [sent-67, score-0.626]
34 In the remainder of this section, we first show how to form the variable-resolution structure and compute the knownness, and then we demonstrate how to use this structure in a prototypical model-based and value-based algorithm. [sent-69, score-0.095]
35 3.1 Regression Trees and Knownness. Regression trees are function approximators that partition the input space into non-overlapping regions and use the training samples of each region for prediction of query points inside it. [sent-71, score-0.255]
36 Their ability to maintain a non-uniform discretization of high-dimensional spaces with relatively fast query time has proven to be very useful in various RL algorithms (Ernst et al., 2005). [sent-72, score-0.199]
37 For the purpose of our discussion, we use a variation of the kd-tree structure (Preparata & Shamos, 1985) to maintain our variable-resolution partitioning and produce knownness values. [sent-74, score-0.626]
38 A knownness-tree τ with dimension k accepts points s ∈ R^k satisfying ||s||∞ ≤ 1, and answers queries of the form 0 ≤ knownness(s) ≤ 1. [sent-77, score-0.067]
39 Each node ς of the tree covers a bounded region and keeps track of the points inside that region, with the root covering the whole space. [sent-78, score-0.195]
40 Given n points, the normalizing size of the resulting tree, denoted by µ, is the region size of a hypothetical uniform discretization of the space that puts ν/k points inside each cell, if the points were uniformly distributed in the space; that is, µ = 1 / (nk/ν)^(1/k). [sent-85, score-0.298]
41 Once inside a leaf l, the insertion adds the point to l's list of points; if l then holds more points than its capacity allows, [sent-87, score-0.065]
42 l is split by choosing a dimension j ∈ [1 . . . k] and splitting its region into two equal half-regions along the j-th dimension. [sent-91, score-0.082]
43 The points in the list are then added to the two children according to which half-region they fall into. [sent-92, score-0.102]
44 To answer a query knownness(s), the lookup algorithm first finds the corresponding leaf that contains s, denoted l(s), and then computes knownness from the number of points in l(s) and the size of its region, relative to the normalizing size µ (Equation 1). [sent-95, score-0.658]
45 The normalizing size of the tree is bigger when the total number of data points is small. [sent-100, score-0.077]
46 This creates higher knownness values for a fixed cell at the beginning of learning. [sent-101, score-0.724]
47 This process creates a variable degree of generalization over time. [sent-103, score-0.065]
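The following is a minimal Python sketch of such a knownness-tree. It is not the paper's implementation: the split trigger, the choice of split dimension, and the exact knownness formula of Equation 1 are simplified assumptions here; only the overall shape follows the description above, namely a kd-tree-like partition of [-1, 1]^k whose leaves report a knownness value in [0, 1] that grows with their point count and with the ratio of the normalizing size µ to their own size.

```python
import numpy as np

class KnownnessTree:
    """Minimal sketch of a knownness-tree: a kd-tree-like partition of
    [-1, 1]^k whose leaves split as points accumulate and which reports a
    knownness value in [0, 1] for any query point.  The split rule and the
    knownness formula below are simplifications, not the paper's Equation 1."""

    def __init__(self, k, nu=10, lo=None, hi=None):
        self.k, self.nu = k, nu
        self.lo = np.full(k, -1.0) if lo is None else lo
        self.hi = np.full(k, 1.0) if hi is None else hi
        self.points = []        # points stored in this node (leaves only)
        self.children = None    # (left, right) after a split
        self.total = 0          # total points ever added through the root

    def add(self, s):
        self.total += 1
        p = np.asarray(s, dtype=float)
        leaf = self._leaf(p)
        leaf.points.append(p)
        if len(leaf.points) > leaf.nu:      # over-full leaf: refine locally
            leaf._split()

    def knownness(self, s):
        leaf = self._leaf(np.asarray(s, dtype=float))
        cell_size = float(np.max(leaf.hi - leaf.lo))
        density = len(leaf.points) / float(self.nu)
        # A larger normalizing size (few samples overall) gives higher
        # knownness for a cell of fixed size, i.e. coarse generalization early on.
        return float(min(1.0, density * self._normalizing_size() / cell_size))

    def _normalizing_size(self):
        # Cell side of a hypothetical uniform grid over [-1, 1]^k that would
        # hold roughly nu/k points per cell if the data were uniform.
        cells = max(1.0, self.total * self.k / float(self.nu))
        return 2.0 / cells ** (1.0 / self.k)

    def _leaf(self, s):
        node = self
        while node.children is not None:
            left, right = node.children
            node = left if s[node.split_dim] <= node.split_val else right
        return node

    def _split(self):
        j = int(np.argmax(self.hi - self.lo))            # longest dimension
        mid = 0.5 * (self.lo[j] + self.hi[j])
        left = KnownnessTree(self.k, self.nu, self.lo.copy(), self.hi.copy())
        right = KnownnessTree(self.k, self.nu, self.lo.copy(), self.hi.copy())
        left.hi[j], right.lo[j] = mid, mid
        for p in self.points:                            # redistribute points
            (left if p[j] <= mid else right).points.append(p)
        self.split_dim, self.split_val = j, mid
        self.children, self.points = (left, right), []
```

In MRE, several such trees are maintained (the τij referenced later); points are added as transitions are observed, and knownness(s) is queried whenever the agent needs to decide how much to trust the learned model around s.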
48 Construct the augmented MDP M̂ = ⟨S ∪ {sf}, A, T̂, φ, γ⟩ by adding a new state, sf, with a reward of Rmax and only self-loop transitions. [sent-114, score-0.171]
49 The augmented transition function T̂ is a stochastic function defined as follows: T̂(s, a) = sf with probability 1 − knownness(s, a), and T̂(s, a) = T(s, a) + ω otherwise (Equation 2). Algorithm 1 constructs and solves M̂ and always acts greedily with respect to this internal model. [sent-115, score-0.126]
50 DPlan is a continuous MDP planner that supports two operations: solveModel, which solves a given MDP and getBestAction, which returns the greedy action for a given state. [sent-116, score-0.146]
Algorithm 1: A model-based algorithm using MRE for exploration. 1: Variables: DPlan, Θ, φ, and solving period planFreq. 2: Observe a transition of the form (st, at, rt, st+1). 3: Add (st, rt) as a training sample to φ. [sent-117, score-0.285]
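Below is a small sketch of how the optimism of Equation 2 can be realized inside such a loop. The names T_hat, noise, knownness, and s_fantasy are placeholders for the learned transition approximator, the known noise distribution, the knownness query of the relevant tree, and the added max-reward state sf; none of them come from the paper's code.

```python
import random

def sample_augmented_transition(s, a, T_hat, noise, knownness, s_fantasy):
    """Sample the augmented transition of Equation 2: with probability
    1 - knownness(s, a) jump to the fantasy state sf (which self-loops
    and pays Rmax); otherwise follow the learned model plus noise."""
    if random.random() < 1.0 - knownness(s, a):
        return s_fantasy
    return T_hat(s, a) + noise()
```

A planner that repeatedly samples transitions this way is automatically drawn toward poorly known regions, because from the model's point of view they lead to the high-reward fantasy state.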
52 The core of the algorithm is the way knownness is computed and how it’s used to make the estimated transition function optimistic. [sent-124, score-0.689]
53 That is, like MBIE, the value of a state becomes gradually less optimistic as more data is available. [sent-126, score-0.105]
54 Let’s assume the transition and reward functions are Lipschitz smooth with Lipschitz constants CT and CR respectively. [sent-132, score-0.108]
55 Let ρt be the maximum size of the cells at time t, and consider also the minimum knownness over all of the trees τij at that time. [sent-133, score-0.7]
56 We can use the smoothness assumptions to compute the closeness of T̂ to the original transition function, based on the shape of the trees and the knownness values they output. [sent-136, score-0.729]
57 For example, the incremental refinement of the model estimate ensures a certain global accuracy before forcing the algorithm to collect denser samples locally. [sent-138, score-0.121]
58 In fact, we hypothesize that with some caveats concerning the computation of µ, it can be proved that Algorithm 1 converges to the optimal policy in the limit, given that an oracle planner is available. [sent-141, score-0.164]
59 The bound in Theorem 1 is loose because it involves only the biggest cell size, as opposed to individual cell sizes. [sent-142, score-0.074]
60 Alternatively, one might be able to achieve better bounds, similar to those in the work of Munos and Moore (2000), by taking the variable resolution of the tree into account. [sent-143, score-0.099]
61 3.3 Application to Value-based RL. Here, we show how to use MRE in fitted Q-iteration, which is a value-based batch learner for continuous spaces. [sent-145, score-0.125]
62 The fitted Q-iteration algorithm accepts a set of four-tuple samples S = {(sl, al, rl, s′l), l = 1 . . . n} [sent-147, score-0.185]
63 and uses regression trees to iteratively compute more accurate Q-functions. [sent-150, score-0.082]
64 The training samples for Q̂j_0 are S0 = {(sl, rl) | (sl, al, rl, s′l) ∈ Sj}. [sent-153, score-0.274]
65 Q̂j_{i+1} is constructed based on Q̂i in the following way: xl = sl and yl = rl + γ max_{a∈A} Q̂a_i(s′l) for each (sl, al, rl, s′l) ∈ Sj (Equations 3-4), and Sj_{i+1} = {(xl, yl)} (Equation 5). [sent-154, score-0.337]
66 Random sampling is usually used to collect S for fitted Q-iteration when used as an offline algorithm. [sent-155, score-0.074]
67 In online settings, ε-greedy can be used as the exploration scheme to collect samples. [sent-156, score-0.296]
68 MRE instead weights each regression target by the knownness of the corresponding sample, computed from the tree τj: the target becomes knownness(sl) · (rl + γ max_{a∈A} Q̂a_i(s′l)) plus a (1 − knownness(sl)) share of an optimistic value. [sent-162, score-0.119]
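A sketch of the resulting target computation is given below. The blend of the bootstrapped target with an optimistic value v_max (for instance Rmax / (1 - γ)) mirrors the weighting described above, but the exact optimistic fill-in value and all function names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def mre_fqi_targets(samples, Q_prev, knownness, actions, gamma, v_max):
    """Regression targets for one sweep of knownness-weighted fitted
    Q-iteration: each bootstrapped target is mixed with the optimistic
    value v_max according to how well known the sample's region is."""
    X, y = [], []
    for (s, a, r, s_next) in samples:
        bootstrap = r + gamma * max(Q_prev(s_next, b) for b in actions)
        w = knownness(s, a)              # in [0, 1], from the knownness-tree
        X.append(s)
        y.append(w * bootstrap + (1.0 - w) * v_max)
    return np.asarray(X), np.asarray(y)
```

The pairs (X, y) are then fed to the same regressors that ordinary fitted Q-iteration would use; samples from well-known regions contribute their usual bootstrapped value, while samples from poorly known regions pull the estimate toward the optimistic value and therefore encourage the agent to visit them.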
69 In this domain, an underpowered car tries to climb out of a valley to the right, but has to build up velocity via several back-and-forth trips across the valley. [sent-165, score-0.111]
70 The state space is 2-dimensional and consists of the horizontal position of the car, in the range of [−1.2, 0.6], and its velocity. [sent-166, score-0.139]
71 The agent receives a −1 penalty at each timestep, except when it escapes the valley, at which point it receives a reward of 0 and the episode ends. [sent-172, score-0.149]
72 This set of parameters makes this environment especially interesting for the purpose of comparing exploration strategies, because it is unlikely for random exploration to guide the car to the top of the hill. [sent-177, score-0.55]
73 Three versions of Algorithm 1 are compared in Figure 1(a): the first two implementations use fixed discretizations instead of the knownness-tree, with normalized resolutions of 0.05 and 0.3. [sent-179, score-0.091]
74 The third one uses a variable discretization based on the knownness-tree, as defined in Section 3.1. [sent-182, score-0.148]
75 The learning curve in Figure 1(a) is averaged over 20 runs with different random seeds and smoothed over a window of size 5 to avoid a cluttered graph. [sent-190, score-0.141]
76 The coarse discretization, on the other hand, converges very fast, but not to a very good policy; it constructs rough estimates and does not refine them as more samples are gathered. [sent-192, score-0.204]
77 MRE refines the notion of knownness over time, making use of rough estimates at the beginning and more accurate ones later, and therefore converges to a good policy quickly. [sent-193, score-0.892]
78 Figure 1: (a) The learning curve of Algorithm 1 in Mountain Car with three different exploration strategies. [sent-196, score-0.645]
79 (b) Average performance of Algorithm 1 in Mountain Car with three exploration strategies. [sent-197, score-0.222]
80 A more detailed comparison of this result is shown in Figure 1(b), where the average time per episode is provided for three different phases: the early stages of learning (episodes 1-100), the middle of learning (episodes 100-200), and the late stages (episodes 200-300). [sent-199, score-0.114]
81 To see why the variable discretization outperforms the fixed resolution of 0.05 at the early stages of learning (note that both of them achieve the same performance level at the end), the value functions of the two algorithms at timestep 1500 are shown in Figure 2. [sent-202, score-0.125]
82 Most of the samples at this stage have very small knownness in the fixed version, due to the very fine discretization, and therefore have very little effect on the estimation of the transition function. [sent-203, score-0.689]
83 The variable discretization, however, achieves a more realistic and smoother value function by allowing coarser generalization in parts of the state space with fewer samples. [sent-205, score-0.169]
84 Here, we compare ε-greedy to two versions of variable-resolution MRE; in the first version, although a knownness-tree is chosen for partitioning the state space, knownness is collapsed to a Boolean known/unknown value rather than a continuous one. [sent-207, score-0.747]
85 As expected, ε-greedy performs poorly, because it cannot collect good samples to feed the batch learner. [sent-213, score-0.131]
86 Both of the versions of MRE converge to the same policy, although the one that uses continuous knownness does so faster. [sent-214, score-0.794]
87 Figure 2: Snapshot of the value function at timestep 1500 in Algorithm 1 with two configurations: (a) fixed discretization with resolution 0.05, and (b) variable discretization.
88 Figure 3: The learning curve for fitted Q-iteration in Mountain Car. [sent-231, score-1.252]
89 ε-greedy is compared to two versions of MRE: one that uses Boolean knownness, and one that uses continuous knownness. [sent-232, score-0.21]
90 To better understand why continuous knownness helps fitted Q-iteration during the early stages of learning, snapshots of knownness from the two versions are depicted in Figure 4, along with the set of visited states at timestep 1500. [sent-233, score-1.529]
91 The continuous notion of knownness helps fitted Q-iteration in this case to collect better-covering samples at the beginning of learning. [sent-235, score-0.835]
92 5 Conclusion In this paper, we introduced multi-resolution exploration for reinforcement learning in continuous spaces and demonstrated how to use it in two algorithms from the model-based and value-based paradigms. [sent-236, score-0.429]
93 The combination of two key features distinguishes MRE from previous smart exploration schemes in continuous spaces: the first is that MRE uses a variable-resolution structure to identify known vs. [sent-237, score-0.387]
94 unknown regions, and the second is that it successively refines the notion of knownness during learning, which allows it to assign continuous, instead of Boolean, knownness. [sent-238, score-0.661]
95 The applicability of MRE to value-based methods allows us to benefit from smart exploration ideas from the model-based setting in powerful value-based batch learners that usually use naive approaches like ε-greedy.
96 Figure 4: Knownness computed in two versions of MRE for fitted Q-iteration: one that uses Boolean values, and one that uses continuous ones. [sent-256, score-0.168]
97 Collected samples are also shown for the same two versions at timestep 1500. [sent-258, score-0.126]
98 Experimental results confirm that MRE holds a significant advantage over some other exploration techniques widely used in practice. [sent-260, score-0.222]
99 R-max, a general polynomial-time algorithm for near-optimal reinforcement learning. [sent-270, score-0.105]
100 Online linear regression and its application to model-based reinforcement learning. [sent-345, score-0.078]
wordName wordTfidf (topN-words)
[('knownness', 0.626), ('mre', 0.438), ('exploration', 0.222), ('episode', 0.141), ('rl', 0.119), ('tted', 0.112), ('sl', 0.108), ('discretization', 0.106), ('policy', 0.101), ('littman', 0.078), ('agent', 0.078), ('reinforcement', 0.078), ('car', 0.076), ('collect', 0.074), ('puterman', 0.073), ('anytime', 0.07), ('mdp', 0.069), ('continuous', 0.068), ('timestep', 0.068), ('qj', 0.067), ('inside', 0.065), ('kakade', 0.065), ('kearns', 0.065), ('transition', 0.063), ('state', 0.063), ('moore', 0.063), ('dplan', 0.063), ('estimations', 0.063), ('planfreq', 0.063), ('sf', 0.063), ('strehl', 0.062), ('ernst', 0.062), ('rmax', 0.062), ('spaces', 0.061), ('resolution', 0.059), ('st', 0.059), ('versions', 0.058), ('boolean', 0.057), ('stages', 0.057), ('batch', 0.057), ('smart', 0.055), ('munos', 0.054), ('region', 0.053), ('mountain', 0.052), ('fixed', 0.049), ('denser', 0.047), ('reward', 0.045), ('singh', 0.044), ('optimistic', 0.042), ('jong', 0.042), ('nouri', 0.042), ('preparata', 0.042), ('shamos', 0.042), ('action', 0.042), ('uses', 0.042), ('ner', 0.041), ('tree', 0.04), ('trees', 0.04), ('cell', 0.037), ('points', 0.037), ('prototypical', 0.036), ('planner', 0.036), ('valley', 0.036), ('brafman', 0.036), ('mbie', 0.036), ('piscataway', 0.036), ('tennenholtz', 0.036), ('al', 0.036), ('generalization', 0.036), ('rough', 0.035), ('velocity', 0.035), ('notion', 0.035), ('cells', 0.034), ('qa', 0.033), ('discretizations', 0.033), ('dynamically', 0.033), ('query', 0.032), ('beginning', 0.032), ('lagoudakis', 0.031), ('lspi', 0.031), ('ortner', 0.031), ('stone', 0.031), ('environment', 0.03), ('completely', 0.03), ('parr', 0.03), ('accepts', 0.03), ('sphere', 0.03), ('hyper', 0.03), ('splitting', 0.029), ('creates', 0.029), ('rutgers', 0.028), ('kaelbling', 0.028), ('regions', 0.028), ('cr', 0.027), ('hill', 0.027), ('hypothesize', 0.027), ('xl', 0.027), ('polynomial', 0.027), ('states', 0.026), ('auer', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1
2 0.20657167 248 nips-2008-Using matrices to model symbolic relationship
Author: Ilya Sutskever, Geoffrey E. Hinton
Abstract: We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3, +3) ∈ plus and (+3, −3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations. 1
3 0.16047658 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm. 1
4 0.13205357 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
5 0.13139099 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
6 0.12271752 195 nips-2008-Regularized Policy Iteration
8 0.10368221 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
9 0.10052736 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
10 0.079469249 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
11 0.071591243 223 nips-2008-Structure Learning in Human Sequential Decision-Making
12 0.0714885 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
13 0.068711042 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
14 0.067785785 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
15 0.064835861 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
16 0.063234463 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
17 0.062359136 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
18 0.06210161 170 nips-2008-Online Optimization in X-Armed Bandits
19 0.060879104 17 nips-2008-Algorithms for Infinitely Many-Armed Bandits
20 0.057309709 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
topicId topicWeight
[(0, -0.156), (1, 0.233), (2, -0.014), (3, -0.091), (4, 0.062), (5, 0.007), (6, 0.086), (7, -0.07), (8, -0.078), (9, -0.012), (10, -0.008), (11, 0.025), (12, 0.012), (13, -0.027), (14, 0.003), (15, -0.038), (16, 0.017), (17, -0.024), (18, 0.084), (19, 0.024), (20, -0.028), (21, 0.021), (22, 0.069), (23, -0.047), (24, -0.046), (25, -0.042), (26, 0.098), (27, 0.052), (28, 0.027), (29, 0.104), (30, 0.046), (31, 0.1), (32, -0.026), (33, 0.009), (34, 0.027), (35, 0.073), (36, 0.125), (37, 0.263), (38, -0.022), (39, -0.096), (40, -0.027), (41, -0.024), (42, 0.047), (43, -0.098), (44, 0.056), (45, 0.024), (46, -0.025), (47, -0.197), (48, -0.041), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.92107213 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1
2 0.69456249 248 nips-2008-Using matrices to model symbolic relationship
Author: Ilya Sutskever, Geoffrey E. Hinton
Abstract: We describe a way of learning matrix representations of objects and relationships. The goal of learning is to allow multiplication of matrices to represent symbolic relationships between objects and symbolic relationships between relationships, which is the main novelty of the method. We demonstrate that this leads to excellent generalization in two different domains: modular arithmetic and family relationships. We show that the same system can learn first-order propositions such as (2, 5) ∈ +3 or (Christopher, Penelope) ∈ has wife, and higher-order propositions such as (3, +3) ∈ plus and (+3, −3) ∈ inverse or (has husband, has wife) ∈ higher oppsex. We further demonstrate that the system understands how higher-order propositions are related to first-order ones by showing that it can correctly answer questions about first-order propositions involving the relations +3 or has wife even though it has not been trained on any first-order examples involving these relations. 1
3 0.59508383 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
Author: Jonathan Taylor, Doina Precup, Prakash Panagaden
Abstract: We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.
4 0.5944913 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
Author: Peter Auer, Thomas Jaksch, Ronald Ortner
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS√(AT)) after T steps for any unknown MDP with S states, A actions per state, and diameter D. This bound holds with high probability. We also present a corresponding lower bound of Ω(√(DSAT)) on the total regret of any learning algorithm. 1
5 0.55430883 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
6 0.51446182 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
8 0.49054751 181 nips-2008-Policy Search for Motor Primitives in Robotics
9 0.47596645 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
10 0.45005465 61 nips-2008-Diffeomorphic Dimensionality Reduction
11 0.42885992 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
12 0.42623308 195 nips-2008-Regularized Policy Iteration
13 0.42596364 212 nips-2008-Skill Characterization Based on Betweenness
14 0.38954213 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
15 0.38493156 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
16 0.34261242 236 nips-2008-The Mondrian Process
17 0.33526316 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
18 0.33479926 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
19 0.31849757 211 nips-2008-Simple Local Models for Complex Dynamical Systems
20 0.31698707 22 nips-2008-An Online Algorithm for Maximizing Submodular Functions
topicId topicWeight
[(4, 0.039), (6, 0.049), (7, 0.039), (12, 0.038), (28, 0.111), (57, 0.047), (59, 0.015), (63, 0.025), (71, 0.022), (77, 0.491), (78, 0.015), (83, 0.025)]
simIndex simValue paperId paperTitle
Author: Christoph Kolodziejski, Bernd Porr, Minija Tamosiunaite, Florentin Wörgötter
Abstract: In this theoretical contribution we provide mathematical proof that two of the most important classes of network learning - correlation-based differential Hebbian learning and reward-based temporal difference learning - are asymptotically equivalent when timing the learning with a local modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation based perspective that is more closely related to the biophysics of neurons. 1
2 0.87467659 61 nips-2008-Diffeomorphic Dimensionality Reduction
Author: Christian Walder, Bernhard Schölkopf
Abstract: This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets. 1
3 0.85395479 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1
same-paper 4 0.85099751 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
Author: Ali Nouri, Michael L. Littman
Abstract: The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE’s broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches. 1
5 0.69324541 216 nips-2008-Sparse probabilistic projections
Author: Cédric Archambeau, Francis R. Bach
Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a preprocessing tool for the construction of template attacks. 1
6 0.5886392 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
7 0.57589024 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
8 0.51184851 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
9 0.5096119 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
10 0.50278014 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
11 0.48182541 200 nips-2008-Robust Kernel Principal Component Analysis
12 0.48169094 44 nips-2008-Characteristic Kernels on Groups and Semigroups
13 0.47985658 195 nips-2008-Regularized Policy Iteration
14 0.47472441 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
15 0.47179219 215 nips-2008-Sparse Signal Recovery Using Markov Random Fields
16 0.45633185 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
17 0.4551608 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
18 0.4527114 238 nips-2008-Theory of matching pursuit
19 0.44985282 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach
20 0.44964305 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks