nips nips2006 nips2006-171 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
Reference: text
sentIndex sentText sentNum sentScore
1 Sample complexity of policy search with known dynamics. Peter L. Bartlett, Ambuj Tewari. [sent-1, score-0.582]
2 Abstract We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. [sent-6, score-0.509]
3 The policy is chosen based on its empirical performance in simulations. [sent-7, score-0.551]
4 We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. [sent-8, score-1.144]
5 We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. [sent-9, score-0.242]
6 Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. [sent-10, score-0.261]
7 These assumptions and ours are both stronger than the usual combinatorial complexity measures. [sent-11, score-0.16]
8 We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient. [sent-12, score-0.328]
9 the future is independent of the past given the current state of the environment. [sent-15, score-0.073]
10 Except for toy problems with a few states, computing an optimal policy for an MDP is usually out of the question. [sent-16, score-0.509]
11 One possibility is to avoid considering all possible policies by restricting oneself to a smaller class Π of policies. [sent-18, score-0.197]
12 Given a simulator for the environment, we try to pick the best policy from Π. [sent-19, score-0.538]
13 The hope is that if the policy class is appropriately chosen, the best policy in Π would not be too much worse than the true optimal policy. [sent-20, score-1.081]
14 Use of simulators introduces an additional issue: how is one to be sure that performance of policies in the class Π on a few simulations is indicative of their true performance? [sent-21, score-0.17]
15 There the aim is to learn a concept and one restricts attention to a hypothesis class which may or may not contain the “true” concept. [sent-23, score-0.063]
16 The answer turns out to depend on “complexity” of the hypothesis class as measured by combinatorial quantities associated with the class such as the VC dimension, the pseudodimension and the fat-shattering dimension. [sent-25, score-0.422]
17 Some progress [6,7] has already been made to obtain uniform bounds on the difference between value functions and their empirical estimates, where the value function of a policy is the expected long term reward starting from a certain state and following the policy thereafter. [sent-26, score-1.333]
18 We continue this line of work by further investigating what properties of the policy class determine the rate of uniform convergence of value function estimates. [sent-27, score-0.62]
19 The key difference between the usual statistical learning setting and ours is that we not only have to consider the complexity of the class Π but also of the classes derived from Π by composing the functions in Π with themselves and with the state evolution process implied by the simulator. [sent-28, score-0.274]
20 Ng and Jordan [7] used a finite pseudodimension condition along with Lipschitz continuity to derive uniform bounds. [sent-29, score-0.253]
21 The number of samples required grows linearly with the dimension of the parameter space but is independent of the dimension of the state space. [sent-32, score-0.137]
22 Ng and Jordan’s and our assumptions are both stronger than just assuming finiteness of some combinatorial dimension. [sent-33, score-0.13]
23 We show that this is unavoidable by constructing two examples where the fat-shattering dimension and the pseudodimension respectively are bounded, yet no simulation based method succeeds in estimating the true values of policies well. [sent-34, score-0.377]
24 This happens because iteratively composing a function class with itself can quickly destroy finiteness of combinatorial dimensions. [sent-35, score-0.187]
25 Additional assumptions are therefore needed to ensure that these iterates continue to have bounded combinatorial dimensions. [sent-36, score-0.175]
26 Although we restrict ourselves to MDPs for ease of exposition, the analysis in this paper carries over easily to the case of partially observable MDPs (POMDPs), provided the simulator also simulates the conditional distribution of observations given state using a bounded amount of computation. [sent-37, score-0.186]
27 Section 3 proves Theorem 1, which gives a sample complexity bound for achieving a desired level of performance within the policy class. [sent-41, score-0.569]
28 In Section 4, we give two examples of policy classes whose combinatorial dimensions are bounded. [sent-42, score-0.647]
29 Nevertheless, we can prove strong minimax lower bounds implying that no method of choosing a policy based on empirical estimates can do well for these examples. [sent-43, score-0.709]
30 In this paper, we assume that the state space S and the action space A are finite dimensional Euclidean spaces of dimensionality d_S and d_A respectively. [sent-45, score-0.102]
31 A (randomized) policy π is a mapping from S to distributions over A. [sent-46, score-0.509]
32 Each policy π induces a natural Markov chain on the state space of the MDP, namely the one obtained by starting in a start state s_0 sampled from D and s_{t+1} sampled according to P(·|s_t, a_t) with a_t drawn from π(s_t) for t ≥ 0. [sent-47, score-0.655]
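As a rough illustration of this induced chain (not code from the paper), the sketch below rolls out s_0 ~ D, a_t ~ π(s_t), s_{t+1} ~ P(·|s_t, a_t); the three sampler callables are placeholders for whatever simulator is available.

```python
def simulate_chain(sample_s0, sample_action, sample_next_state, T, rng):
    """Roll out the Markov chain induced by a policy pi:
    s_0 ~ D, a_t ~ pi(s_t), s_{t+1} ~ P(. | s_t, a_t)."""
    s = sample_s0(rng)                     # s_0 drawn from the start-state distribution D
    states, actions = [s], []
    for _ in range(T):
        a = sample_action(s, rng)          # a_t drawn from pi(s_t)
        s = sample_next_state(s, a, rng)   # s_{t+1} drawn from P(. | s_t, a_t)
        actions.append(a)
        states.append(s)
    return states, actions
```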
33 Let r_t(π) be the expected reward at time step t in this Markov chain, i.e. [sent-48, score-0.131]
34 Note that the expectation is over the randomness in the choice of the initial state, the state transitions, and the randomized policy and reward outcomes. [sent-51, score-0.672]
35 Define the value V_M(π) of the policy by V_M(π) = Σ_{t=0}^∞ γ^t r_t(π). [sent-52, score-0.55]
36 For a class Π of policies, define opt(M, Π) = sup_{π∈Π} V_M(π). [sent-54, score-0.199]
37 The regret of a policy π′ relative to an MDP M and a policy class Π is defined as Reg_{M,Π}(π′) = opt(M, Π) − V_M(π′). [sent-55, score-1.17]
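A minimal sketch of these two definitions, with the sup over Π replaced, purely for illustration, by a max over a finite list of candidate values:

```python
def discounted_value(rewards, gamma):
    """V = sum_t gamma^t r_t over a finite reward sequence (a truncation of the infinite sum)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def regret(chosen_value, candidate_values):
    """Reg_{M,Pi}(pi') = opt(M, Pi) - V_M(pi'), with opt approximated here by a max over the
    finite set of candidate values supplied."""
    return max(candidate_values) - chosen_value
```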
38 We use a degree bounded version of the Blum-Shub-Smale [3] model of computation over reals. [sent-56, score-0.084]
39 At each time step, we can perform one of the four arithmetic operations +, −, ×, / or can branch based on a comparison (say <). [sent-57, score-0.138]
40 The function f is (Ξ, τ)-computable if there exists a degree bounded finite dimensional machine M over R with input space R^{k+m} and output space R^l such that the following hold. [sent-62, score-0.124]
41 Informally, the definition states that given access to an oracle which generates samples from Ξ, we can generate samples from f (x) by doing a bounded amount of computation. [sent-67, score-0.116]
42 In Section 3, we assume that the policy class Π is parameterized by a finite dimensional parameter θ ∈ R^d. [sent-70, score-0.572]
43 This assumption will be satisfied if we have three “programs” that make a call to a random number generator for distribution Ξ, do a fixed number of floating-point operations and simulate the policies in our class, the state-transition dynamics and the rewards respectively. [sent-76, score-0.347]
44 • Linear Dynamical System with Additive Noise¹ Suppose P and Q are d_S × d_S and d_S × d_A matrices and the system dynamics is given by s_{t+1} = P s_t + Q a_t + ξ_t, (1) where the ξ_t are i.i.d. (see the code sketch below). [sent-78, score-0.123]
45 Thus, if ξ has uniform distribution on (0, 1], then f (ξ) = k with probability pk . [sent-96, score-0.108]
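For concreteness, here is a hedged sketch of the two examples above: one step of the linear system (1), with the law of the additive noise left abstract, and the inverse-CDF map that turns a uniform draw ξ on (0, 1] into a sample taking value k with probability p_k. All names are illustrative, not from the paper.

```python
def linear_step(P, Q, s, a, xi):
    """One step of the linear system (1): s_{t+1} = P s_t + Q a_t + xi_t.
    P, Q are numpy matrices; s, a, xi are numpy vectors; the noise law is left abstract."""
    return P @ s + Q @ a + xi

def discrete_from_uniform(xi, probs):
    """Inverse-CDF map: for xi uniform on (0, 1], return k with probability probs[k]."""
    cdf = 0.0
    for k, p in enumerate(probs):
        cdf += p
        if xi <= cdf:
            return k
    return len(probs) - 1  # guard against floating-point round-off when the probs sum to 1
```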
46 ignoring rewards beyond time H does not affect the value of any policy by more than ε. [sent-103, score-0.643]
47 To obtain sample rewards, given initial state s_0 and policy π_θ = π(·; θ), we first compute the trajectory s_0, …, s_H. [sent-104, score-0.582]
48 A further H + 1 calls to M_r are then required to generate the rewards ρ_0 through ρ_H. [sent-109, score-0.179]
49 Define the empirical estimate of the value of the policy π by V̂^H_M(π_θ) = (1/n) Σ_{i=1}^n Σ_{t=0}^H γ^t ρ_t(s_0^{(i)}, θ, ξ_i). [sent-116, score-0.551]
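A sketch of this Monte Carlo estimate; rollout_rewards stands in for the simulator built from the three programs M_π, M_P and M_r and is an assumption of this sketch, not an interface from the paper.

```python
def empirical_value(theta, rollout_rewards, start_states, noise_draws, gamma, H):
    """Monte Carlo estimate: (1/n) * sum_i sum_{t=0..H} gamma^t * rho_t(s_0^(i), theta, xi_i).
    rollout_rewards(s0, theta, xi, H) is a placeholder returning rho_0, ..., rho_H for one draw."""
    total = 0.0
    for s0, xi in zip(start_states, noise_draws):
        rhos = rollout_rewards(s0, theta, xi, H)
        total += sum(gamma ** t * r for t, r in enumerate(rhos))
    return total / len(start_states)
```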
50 Define an ε-approximate maximizer of V̂ to be a policy π′ such that V̂^H_M(π′) ≥ sup_{π∈Π} V̂^H_M(π) − ε. [sent-118, score-0.738]
51 ¹ In this case, the realizable dynamics (mapping from state to next state for a given policy class) is not uniformly Lipschitz if policies allow unbounded actions. [sent-119, score-0.805]
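A sketch of an ε-approximate maximizer of the empirical values, restricted for illustration to a finite grid of candidate parameters (the definition above takes a sup over all of Π, so the grid is only a stand-in):

```python
def eps_approx_maximizer(candidate_thetas, v_hat, eps):
    """Return a candidate whose empirical value is within eps of the best candidate found.
    v_hat(theta) is assumed to compute the empirical estimate for the policy pi_theta."""
    scored = [(v_hat(theta), theta) for theta in candidate_thetas]
    best = max(v for v, _ in scored)
    for v, theta in scored:
        if v >= best - eps:
            return theta
```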
52 Finally, we mention the definitions of three standard combinatorial dimensions. [sent-121, score-0.091]
53 Let X be some space and consider classes G and F of {−1, +1}-valued and real-valued functions on X, respectively. [sent-122, score-0.075]
54 We say that G shatters X if for all bit vectors b ∈ {0, 1}^n there exists g ∈ G such that for all i, b_i = 0 ⇒ g(x_i) = −1, b_i = 1 ⇒ g(x_i) = +1. [sent-127, score-0.331]
55 We say that F shatters X if there exists r ∈ R^n such that, for all bit vectors b ∈ {0, 1}^n, there exists f ∈ F such that for all i, b_i = 0 ⇒ f(x_i) < r_i, b_i = 1 ⇒ f(x_i) ≥ r_i. [sent-128, score-0.443]
56 We say that F ε-shatters X if there exists r ∈ R^n such that, for all bit vectors b ∈ {0, 1}^n, there exists f ∈ F such that for all i, b_i = 0 ⇒ f(x_i) ≤ r_i − ε, b_i = 1 ⇒ f(x_i) ≥ r_i + ε. [sent-129, score-0.361]
57 We then have the following definitions: VCdim(G) = max{|X| : G shatters X}, Pdim(F) = max{|X| : F shatters X}, fat_F(ε) = max{|X| : F ε-shatters X}. [sent-130, score-0.164]
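For finite classes these definitions can be checked directly; the sketch below tests shattering for a {−1, +1}-valued class and pseudo-shattering against a fixed witness vector r. Computing Pdim(F) or fat_F(ε) would additionally require searching over the witnesses, which this sketch does not attempt.

```python
def shatters(points, G):
    """G (a finite class of {-1,+1}-valued functions) shatters the points iff every one of the
    2^n sign patterns is realised by some g in G."""
    patterns = {tuple(g(x) for x in points) for g in G}
    return len(patterns) == 2 ** len(points)

def pseudo_shatters(points, F, witnesses):
    """Pseudo-shattering of the points by a real-valued class F, relative to a fixed witness
    vector r = witnesses: every above/below-threshold pattern must be realised."""
    patterns = {tuple(f(x) >= r for x, r in zip(points, witnesses)) for f in F}
    return len(patterns) == 2 ** len(points)
```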
58 Fix an MDP M, a policy class Π = {s → π(s; θ) : θ ∈ R^d}, and an ε > 0. [sent-132, score-0.572]
59 Then n > O( (R² H d τ / ((1 − γ)² ε²)) · log( R/(ε(1 − γ)) ) ) ensures that E[Reg_{M,Π}(π_n)] ≤ 3ε + ε′, where π_n is an ε′-approximate maximizer of V̂ and H = log_{1/γ}(2R/(ε(1 − γ))) is the ε/2 horizon time. [sent-134, score-0.122]
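The horizon appearing in the theorem can be computed directly; this sketch follows the displayed formula H = log_{1/γ}(2R/(ε(1 − γ))) and rounds up, so that the discarded tail of rewards (each bounded by R) sums to at most ε/2.

```python
import math

def half_eps_horizon(R, gamma, eps):
    """H = log_{1/gamma}(2R / (eps * (1 - gamma))), rounded up.  With rewards bounded by R,
    the tail beyond H sums to at most gamma^(H+1) * R / (1 - gamma) <= eps / 2."""
    return math.ceil(math.log(2.0 * R / (eps * (1.0 - gamma))) / math.log(1.0 / gamma))
```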
60 Given initial state s_0, parameter θ and random numbers ξ_1 through ξ_{3H+1}, we first compute the trajectory as follows. [sent-138, score-0.073]
61 s_t = Φ_{M_P}(s_{t−1}, Φ_{M_π}(θ, s_{t−1}, ξ_{2t−1}), ξ_{2t}), 1 ≤ t ≤ H. (2) [sent-140, score-0.08]
62 The rewards are then computed by ρ_t = Φ_{M_r}(s_t, ξ_{2H+t+1}), 0 ≤ t ≤ H. (3) [sent-141, score-0.134]
63 The H-step discounted reward sum is computed as Σ_{t=0}^H γ^t ρ_t = ρ_0 + γ(ρ_1 + γ(ρ_2 + ⋯)). (4) [sent-142, score-0.116]
64 Define the function class R = {(s_0, ξ) → Σ_{t=0}^H γ^t ρ_t(s_0, θ, ξ) : θ ∈ R^d}, where we have explicitly shown the dependence of ρ_t on s_0, θ and ξ. [sent-149, score-0.063]
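Equation (4) is a Horner-style evaluation of the truncated discounted sum, which keeps the cost at one multiplication and one addition per reward; a sketch:

```python
def discounted_sum_horner(rhos, gamma):
    """Evaluate rho_0 + gamma*(rho_1 + gamma*(rho_2 + ...)) as in (4), using one multiplication
    and one addition per reward."""
    total = 0.0
    for r in reversed(rhos):
        total = r + gamma * total
    return total
```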
65 Let us count the number of arithmetic operations needed to compute a function in this class. [sent-150, score-0.138]
66 Using Assumption A, we see that steps (2) and (3) require no more than 2τ H and τ (H + 1) operations respectively. [sent-151, score-0.063]
67 Goldberg and Jerrum [4] showed that the VC dimension of a function class can be bounded in terms of an upper bound on the number of arithmetic operations it takes to compute the functions in the class. [sent-154, score-0.375]
68 Since the pseudodimension of R can be written as Pdim(R) = VCdim{(s0 , ξ, c) → sign(f (s0 , ξ) − c) : f ∈ R, c ∈ R} , we get the following bound by [2, Thm. [sent-155, score-0.266]
69 (6) Functions in R are positive and bounded above by R̄ = R/(1 − γ). [sent-162, score-0.084]
70 There are well-known bounds for deviations of empirical estimates from true expectations for bounded function classes in terms of the pseudodimension of the class (see, for example, Theorems 3 and 5 in [5]; also see Pollard’s book [8]). [sent-163, score-0.522]
71 In order to ensure that P^n(∃π ∈ Π : |V̂^H(π) − V^H(π)| > ε/2) < δ, we need 8 (64e R̄/ε)^{2 Pdim(R)} e^{−ε² n/(256 R̄²)} < δ. Using the bound (5) on Pdim(R), we get that P^n( sup_{π∈Π} |V̂^H(π) − V(π)| > ε ) < δ, (7) provided n > (256 R²/((1 − γ)² ε²)) · ( log(8/δ) + 8d(6τH + 3) log(64eR/(ε(1 − γ))) ). [sent-165, score-0.197]
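A sketch that evaluates the right-hand side of this sample-size condition; the constants are read off the display above and should be treated as order-of-magnitude guidance rather than tight values.

```python
import math

def sample_size_bound(R, gamma, eps, delta, d, tau, H):
    """Right-hand side of the displayed sample-size condition; constants indicative only."""
    return (256.0 * R ** 2) / ((1.0 - gamma) ** 2 * eps ** 2) * (
        math.log(8.0 / delta)
        + 8.0 * d * (6.0 * tau * H + 3.0) * math.log(64.0 * math.e * R / (eps * (1.0 - gamma)))
    )
```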
72 Since π_n is an ε′-approximate maximizer of V̂^H, we have V̂^H(π) ≤ V̂^H(π_n) + ε′ for all π ∈ Π. Thus, for all π ∈ Π, V(π) ≤ V̂^H(π_n) + ε + ε′. [sent-172, score-0.093]
73 Taking the supremum over π ∈ Π and using the fact that V̂^H(π_n) ≤ V(π_n) + ε, we get sup_{π∈Π} V(π) ≤ V(π_n) + 2ε + ε′, which is equivalent to Reg_{M,Π}(π_n) ≤ 2ε + ε′. [sent-173, score-0.167]
74 where we used the fact that regret is bounded above by R/(1 − γ). [sent-176, score-0.173]
75 4 Two Policy Classes Having Bounded Combinatorial Dimensions We will describe two policy classes for which we can prove that there are strong limitations on the performance of any method (of choosing a policy out of a policy class) that has access only to empirically observed rewards. [sent-177, score-1.616]
76 Somewhat surprisingly, one can show this for policy classes which are “simple” in the sense that standard combinatorial dimensions of these classes are bounded. [sent-178, score-0.694]
77 This shows that sufficient conditions for the success of simulation based policy search (such as the assumptions in [7] and in our Theorem 1) have to be necessarily stronger than boundedness of standard combinatorial dimensions. [sent-179, score-0.728]
78 The first example is a policy class F_1 for which fat_{F_1}(ε) < ∞ for all ε > 0. [sent-180, score-0.572]
79 The second example is a class F2 for which Pdim(F2 ) = 1. [sent-181, score-0.063]
80 Since finiteness of pseudodimension is a stronger condition, the second example makes our point more forcefully than the first one. [sent-182, score-0.244]
81 r = deterministic reward that maps s to s, and γ = some fixed discount factor in (0, 1). [sent-198, score-0.09]
82 For a function f : [−1, +1] → [−1, +1], let πf denote the (deterministic) policy which takes action f (s) − s in state s. [sent-199, score-0.642]
83 Given a class F of functions, we define an associated policy class ΠF = {πf : f ∈ F}. [sent-200, score-0.635]
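A sketch of this construction, assuming (as the rollout below does) that the next state equals the current state plus the action, so that π_f moves the state to f(s), and that the reward is r(s) = s as stated above; the additive-dynamics assumption is not spelled out in the excerpt.

```python
def pi_f(f):
    """The deterministic policy pi_f that takes action f(s) - s in state s."""
    return lambda s: f(s) - s

def truncated_value(f, s0, gamma, H):
    """Truncated discounted reward of pi_f from s0 when r(s) = s, assuming additive dynamics
    s_{t+1} = s_t + a_t, so that the action f(s) - s moves the state to f(s)."""
    policy = pi_f(f)
    s, total = s0, 0.0
    for t in range(H + 1):
        total += gamma ** t * s
        s = s + policy(s)   # equals f(s) under the additive-dynamics assumption
    return total
```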
84 By construction, functions in F_1 have bounded total variation and so fat_{F_1}(ε) is O(1/ε) (see, for example, [2, Chap. [sent-209, score-0.112]
85 …, s_n to a policy in Π_{F_1}, there is an initial state distribution such that the expected regret of the selected policy is at least γ²/(1 − γ) − 2ε_1. [sent-226, score-1.263]
86 This is in sharp contrast to Theorem 1 where we could reduce, by using sufficiently many samples, the expected regret down to any positive number given the ability to maximize the empirical estimates V̂. [sent-227, score-0.178]
87 Let us see how maximization of empirical estimates behaves in this case. [sent-228, score-0.115]
88 The transitions, policies and rewards here are all deterministic. [sent-232, score-0.241]
89 This means that the 1-step reward function family is just F1 . [sent-234, score-0.09]
90 So the estimates of 1-step rewards are still uniformly concentrated around their expected values. [sent-235, score-0.181]
91 Since the contribution of rewards from time step 2 onwards can be no more than γ² + γ³ + ⋯ [sent-236, score-0.134]
92 = γ²/(1 − γ), we can claim that the expected regret of the V̂-maximizer π_n behaves like E[Reg_{M,Π_{F_1}}(π_n)] ≤ γ²/(1 − γ) + e_n, where e_n → 0. [sent-239, score-0.264]
93 opt(M_D, Π_{F_1}) − E_{(s_1,…,s_n)}[V_{M_D}(π_{g_n(s_1,…,s_n)})] = sup_{f∈F_1} E_{s∼D}[ s + γ f(s) + (γ²/(1−γ)) f²(s) ] − E_{(s_1,…,s_n)} E_{s∼D}[ s + γ g_n(s_1,…,s_n)(s) + (γ²/(1−γ)) g_n(s_1,…,s_n)²(s) ] [sent-261, score-0.136]
94 = sup_{f∈F_1} E_{s∼D}[ γ f(s) + (γ²/(1−γ)) f²(s) ] − E_{(s_1,…,s_n)} E_{s∼D}[ γ g_n(s_1,…,s_n)(s) + (γ²/(1−γ)) g_n(s_1,…,s_n)²(s) ]. [sent-267, score-0.505]
95 For all f_1, f_2, |E f_1 − E f_2| ≤ E|f_1 − f_2| ≤ ε_1. [sent-273, score-0.083]
96 Hence the quantity above is at least (γ²/(1−γ)) ( sup_{f∈F_1} E_{s∼D}[ f²(s) + 1 ] − E_{(s_1,…,s_n)} E_{s∼D}[ g_n(s_1,…,s_n)²(s) + 1 ] ) − 2γε_1. [sent-287, score-0.219]
97 From (8), we know that f_T(x) + 1 restricted to x ∈ (0, 1) is the same as 1(x∈T). [sent-293, score-0.083]
98 Therefore, restricting D to probability measures on (0, 1) and applying Lemma 1, we get inf_{g_n} sup_D ( opt(M_D, Π_{F_1}) − E_{(s_1,…,s_n)}[V_{M_D}(π_{g_n(s_1,…,s_n)})] ) ≥ γ²/(1−γ) − 2γε_1. [sent-294, score-0.238]
99 To finish the proof, we note that γ < 1 and, by definition, Reg_{M_D,Π_{F_1}}(π_{g_n(s_1,…,s_n)}) = opt(M_D, Π_{F_1}) − V_{M_D}(π_{g_n(s_1,…,s_n)}). [sent-301, score-0.286]
100 Example 2 We use the MDP of the previous section with a different policy class which we now describe. [sent-308, score-0.572]
wordName wordTfidf (topN-words)
[('policy', 0.509), ('gn', 0.286), ('pdim', 0.282), ('ft', 0.247), ('regm', 0.235), ('pseudodimension', 0.205), ('stretch', 0.164), ('sup', 0.136), ('rewards', 0.134), ('mix', 0.125), ('mdp', 0.116), ('dn', 0.113), ('lipschitz', 0.109), ('policies', 0.107), ('vm', 0.104), ('opt', 0.102), ('ds', 0.095), ('maximizer', 0.093), ('combinatorial', 0.091), ('reward', 0.09), ('regret', 0.089), ('bounded', 0.084), ('sn', 0.083), ('shatters', 0.082), ('st', 0.08), ('arithmetic', 0.075), ('state', 0.073), ('bi', 0.073), ('regmd', 0.071), ('tep', 0.071), ('vmd', 0.071), ('class', 0.063), ('operations', 0.063), ('md', 0.062), ('pk', 0.06), ('fix', 0.06), ('boundedness', 0.056), ('ns', 0.055), ('theorem', 0.053), ('odd', 0.052), ('mp', 0.049), ('niteness', 0.049), ('da', 0.049), ('uniform', 0.048), ('ambuj', 0.047), ('vcdim', 0.047), ('estimates', 0.047), ('classes', 0.047), ('rk', 0.045), ('calls', 0.045), ('inf', 0.044), ('dynamics', 0.043), ('prove', 0.042), ('empirical', 0.042), ('mr', 0.041), ('rt', 0.041), ('jerrum', 0.041), ('exists', 0.04), ('mdps', 0.04), ('stronger', 0.039), ('berkeley', 0.039), ('es', 0.038), ('goldberg', 0.037), ('halting', 0.037), ('ri', 0.036), ('pollard', 0.035), ('minimax', 0.035), ('bit', 0.034), ('bounds', 0.034), ('simulation', 0.033), ('composing', 0.033), ('blum', 0.033), ('states', 0.032), ('na', 0.032), ('rm', 0.032), ('dimension', 0.032), ('let', 0.031), ('get', 0.031), ('complexity', 0.03), ('suppose', 0.03), ('bartlett', 0.03), ('nitions', 0.03), ('subscript', 0.03), ('vc', 0.03), ('bound', 0.03), ('action', 0.029), ('nite', 0.029), ('say', 0.029), ('horizon', 0.029), ('simulator', 0.029), ('haussler', 0.029), ('functions', 0.028), ('en', 0.028), ('computable', 0.028), ('nition', 0.027), ('restricting', 0.027), ('everywhere', 0.027), ('rd', 0.026), ('behaves', 0.026), ('discounted', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
2 0.36612207 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
3 0.25606176 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.21175699 154 nips-2006-Optimal Change-Detection and Spiking Neurons
Author: Angela J. Yu
Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1
5 0.1858681 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
Author: Laurent Charlin, Pascal Poupart, Romy Shioda
Abstract: Planning in partially observable domains is a notoriously difficult problem. However, in many real-world scenarios, planning can be simplified by decomposing the task into a hierarchy of smaller planning problems. Several approaches have been proposed to optimize a policy that decomposes according to a hierarchy specified a priori. In this paper, we investigate the problem of automatically discovering the hierarchy. More precisely, we frame the optimization of a hierarchical policy as a non-convex optimization problem that can be solved with general non-linear solvers, a mixed-integer non-linear approximation or a form of bounded hierarchical policy iteration. By encoding the hierarchical structure as variables of the optimization problem, we can automatically discover a hierarchy. Our method is flexible enough to allow any parts of the hierarchy to be specified based on prior knowledge while letting the optimization discover the unknown parts. It can also discover hierarchical policies, including recursive policies, that are more compact (potentially infinitely fewer parameters) and often easier to understand given the decomposition induced by the hierarchy. 1
6 0.14687039 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
7 0.14282867 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
8 0.13456988 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
10 0.12492105 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
11 0.10984364 203 nips-2006-implicit Online Learning with Kernels
12 0.105534 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
13 0.09468393 21 nips-2006-AdaBoost is Consistent
14 0.082559109 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
15 0.082050554 146 nips-2006-No-regret Algorithms for Online Convex Programs
16 0.076881818 116 nips-2006-Learning from Multiple Sources
17 0.074564956 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
18 0.066301934 33 nips-2006-Analysis of Representations for Domain Adaptation
19 0.066113561 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
20 0.062181771 124 nips-2006-Linearly-solvable Markov decision problems
topicId topicWeight
[(0, -0.186), (1, -0.034), (2, -0.396), (3, -0.205), (4, 0.039), (5, -0.049), (6, 0.035), (7, -0.06), (8, 0.36), (9, 0.02), (10, 0.093), (11, -0.072), (12, 0.028), (13, -0.059), (14, 0.047), (15, 0.081), (16, 0.033), (17, -0.031), (18, 0.047), (19, -0.109), (20, -0.054), (21, 0.165), (22, 0.041), (23, 0.049), (24, -0.016), (25, 0.059), (26, -0.102), (27, -0.058), (28, -0.047), (29, 0.056), (30, 0.069), (31, 0.054), (32, -0.131), (33, 0.021), (34, 0.04), (35, -0.062), (36, -0.037), (37, -0.079), (38, -0.035), (39, 0.024), (40, -0.058), (41, -0.046), (42, -0.082), (43, -0.106), (44, 0.025), (45, 0.018), (46, 0.023), (47, 0.013), (48, 0.007), (49, -0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.96390426 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
2 0.89720231 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
3 0.71050477 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.63440579 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
Author: Sandeep Pandey, Christopher Olston
Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1
5 0.55356795 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
Author: Alborz Geramifard, Michael Bowling, Martin Zinkevich, Richard S. Sutton
Abstract: We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the dataefficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O(n2 ), where n is the number of parameters in the linear function approximator, while iLSTD is O(n). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough (n = 10, 000) to show substantial computational advantages over LSTD. 1
6 0.5451169 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
7 0.51451534 154 nips-2006-Optimal Change-Detection and Spiking Neurons
8 0.47900799 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
9 0.46209663 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
10 0.43627143 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
11 0.4077659 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
12 0.39071572 124 nips-2006-Linearly-solvable Markov decision problems
13 0.36144203 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
14 0.31790242 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
15 0.30394459 116 nips-2006-Learning from Multiple Sources
16 0.29773071 21 nips-2006-AdaBoost is Consistent
17 0.29487756 146 nips-2006-No-regret Algorithms for Online Convex Programs
18 0.28121346 109 nips-2006-Learnability and the doubling dimension
19 0.26289403 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
20 0.22613256 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
topicId topicWeight
[(1, 0.088), (2, 0.047), (3, 0.019), (7, 0.072), (9, 0.069), (17, 0.296), (22, 0.062), (44, 0.083), (57, 0.05), (65, 0.049), (69, 0.038), (71, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.74221331 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
2 0.51565498 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
3 0.51447845 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
4 0.50948364 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
5 0.50480914 167 nips-2006-Recursive ICA
Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell
Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA maybe applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers. 1
6 0.50382876 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
7 0.50212485 175 nips-2006-Simplifying Mixture Models through Function Approximation
8 0.50085181 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
9 0.49973825 117 nips-2006-Learning on Graph with Laplacian Regularization
10 0.4995763 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
11 0.4989543 158 nips-2006-PG-means: learning the number of clusters in data
12 0.49830538 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
13 0.49801034 174 nips-2006-Similarity by Composition
14 0.49770004 20 nips-2006-Active learning for misspecified generalized linear models
15 0.49704722 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
16 0.49586278 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
17 0.4949064 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron
18 0.49435955 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
19 0.4943102 187 nips-2006-Temporal Coding using the Response Properties of Spiking Neurons
20 0.49272603 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments