nips nips2006 nips2006-125 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present a learning algorithm for undiscounted reinforcement learning. [sent-3, score-0.216]
2 Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. [sent-4, score-0.189]
3 In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. [sent-5, score-0.887]
4 We will mainly consider unichain MDPs, in which under any policy any state can be reached (after a finite number of transitions) from any state. [sent-10, score-0.593]
5 For a policy π let µπ be the stationary distribution induced by π on M . [sent-11, score-0.433]
6 The average reward of π then is defined as ρ(M, π) := Σ_{s∈S} µπ(s) r(s, π(s)).  (1) [sent-12, score-0.165]
7 A policy π* is called optimal on M if ρ(M, π) ≤ ρ(M, π*) =: ρ*(M) =: ρ* for all policies π. [sent-13, score-0.529]
8 Our measure for the quality of a learning algorithm is the total regret after some finite number of steps. [sent-14, score-0.355]
9 When a learning algorithm A executes action a_t in state s_t at step t, obtaining reward r_t, then R_T := Tρ* − Σ_{t=0}^{T−1} r_t denotes the total regret of A after T steps. [sent-15, score-1.096]
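As a small illustration of this regret definition (using the reconstruction R_T = Tρ* − Σ_t r_t above), the following Python sketch computes the total regret of a logged reward sequence; the inputs are hypothetical.

```python
def total_regret(rewards, rho_star):
    """Total regret R_T = T * rho_star - sum of obtained rewards r_0..r_{T-1}.
    `rewards` is the logged reward sequence, `rho_star` the optimal average reward."""
    T = len(rewards)
    return T * rho_star - sum(rewards)

# Hypothetical example: an agent averaging reward 0.4 in an MDP with rho* = 0.5
print(total_regret([0.4] * 1000, 0.5))   # -> 100.0 (about 0.1 regret per step)
```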
10 The total regret R_T^ε with respect to an ε-optimal policy (i.e. a policy whose return differs from ρ* by at most ε) is defined accordingly. [sent-16, score-0.767] [sent-18, score-0.450]
12 Both take as inputs (among others) a confidence parameter δ and an accuracy parameter ε. (Footnote 1: Every policy π induces a Markov chain Cπ on M.) [sent-22, score-0.435]
13 If Cπ is ergodic with transition matrix P, then there exists a unique invariant and strictly positive distribution µπ such that, independent of µ_0, one has µ̄_n = µ_0 P̄_n → µπ, where P̄_n = (1/n) Σ_{j=1}^{n} P^j. [sent-23, score-0.291]
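The footnote’s characterization of µπ as the Cesàro limit of µ_0 P̄_n suggests a direct numerical computation; the sketch below (a minimal illustration with hypothetical numbers, not code from the paper) approximates µπ for a policy-induced ergodic chain and then evaluates the average reward of equation (1).

```python
import numpy as np

def stationary_distribution(P, n=10000):
    """Approximate the invariant distribution of an ergodic chain P via the
    Cesaro average (1/n) * sum_{j=1..n} mu_0 P^j, as in the footnote."""
    S = P.shape[0]
    mu = np.full(S, 1.0 / S)      # arbitrary initial distribution mu_0
    avg = np.zeros(S)
    for _ in range(n):
        mu = mu @ P
        avg += mu
    return avg / n

def average_reward(P_pi, r_pi):
    """rho(M, pi) = sum_s mu_pi(s) * r(s, pi(s)), i.e. equation (1)."""
    return float(stationary_distribution(P_pi) @ r_pi)

# Two-state example: P_pi and r_pi are the transition matrix and reward vector
# induced by some fixed policy pi (numbers are made up).
P_pi = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
r_pi = np.array([0.0, 1.0])
print(average_reward(P_pi, r_pi))  # approximately 1/3, since mu_pi = (2/3, 1/3)
```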
14 In contrast, our algorithm has no such input parameters and converges to an optimal policy with expected logarithmic online regret in the number of steps taken. [sent-27, score-1.084]
15 Obviously, by using a decreasing sequence εt , online regret bounds for E3 and R-Max can be achieved. [sent-28, score-0.544]
16 However, it is not clear whether such a procedure can give logarithmic online regret bounds. [sent-29, score-0.58]
17 The knowledge of this parameter then is eliminated by calculating the ε-optimal policy for T_ε = 1, 2, … [sent-33, score-0.412]
18 Moreover, as noted in [2], at some time step the assumed Tε may be exponential in the true Tε , which makes policy computation exponential in Tε . [sent-38, score-0.412]
19 However, average loss is defined with respect to the actually visited states, so that a small average loss does not guarantee small total regret, which is defined with respect to the states visited by an optimal policy. [sent-43, score-0.241]
20 For this average loss, polylogarithmic online bounds were shown for the MBIE algorithm [4], while more recently logarithmic bounds for delayed Q-learning were given in [5]. [sent-44, score-0.461]
21 However, discounted reinforcement learning is a bit simpler than undiscounted reinforcement learning, as depending on the discount factor only a finite number of steps is relevant. [sent-45, score-0.461]
22 This makes discounted reinforcement learning similar to the setting with trials of constant length from a fixed initial state [6]. [sent-46, score-0.239]
23 For this case logarithmic online regret bounds in the number of trials have already been given in [7]. [sent-47, score-0.676]
24 In the multi-armed bandit problem, similar exploration-exploitation tradeoffs were handled with upper confidence bounds for the expected immediate returns [8, 9]. [sent-50, score-0.238]
25 Our UCRL algorithm takes into account the state structure of the MDP, but is still based on upper confidence bounds for the expected return of a policy. [sent-52, score-0.253]
26 Upper confidence bounds have been applied to reinforcement learning in various places and different contexts, e. [sent-53, score-0.243]
27 Our UCRL algorithm is similar to Strehl and Littman’s MBIE algorithm [10, 4], but our confidence bounds are different, and we are interested in the undiscounted case. [sent-56, score-0.165]
28 The basic idea of their rather complex index policies is to choose the action with maximal return in some specified confidence region of the MDP’s probability distributions. [sent-58, score-0.205]
29 The online regret of their algorithm is asymptotically logarithmic in the number of steps, which is best possible. [sent-59, score-0.132]
30 Our UCRL algorithm is simpler and achieves logarithmic regret not only asymptotically but uniformly over time. [sent-60, score-0.487]
31 More recently, online reinforcement learning with changing rewards chosen by an adversary was considered under the presumption that the learner has full knowledge of the transition probabilities [14]. [sent-62, score-0.55]
32 The given algorithm achieves the best possible regret of O(√T) after T steps. [sent-63, score-0.355]
33 In the subsequent Sections 2 and 3 we introduce our UCRL algorithm and show that its expected online regret in unichain MDPs is O(log T ) after T steps. [sent-64, score-0.57]
34 2 The UCRL Algorithm To select good policies, we keep track of estimates for the average rewards and the transition probabilities. [sent-66, score-0.277]
35 From these numbers we immediately get estimates for the average rewards and transition probabilities, r̂_t(s, a) := R_t(s, a)/N_t(s, a) and p̂_t(s, a, s') := P_t(s, a, s')/N_t(s, a), provided that the number of visits in (s, a), N_t(s, a), is greater than 0. [sent-68, score-0.646]
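The bookkeeping behind these estimates is simple; the sketch below (hypothetical class and variable names) keeps the counts N_t(s, a), accumulated rewards R_t(s, a) and transition counts P_t(s, a, s') and turns them into r̂_t and p̂_t, with the optimistic defaults for unvisited pairs used later in the algorithm description.

```python
import numpy as np

class Counts:
    """Running statistics of a tabular MDP with S states and A actions (a sketch)."""
    def __init__(self, S, A):
        self.N = np.zeros((S, A))        # N_t(s, a): number of visits to (s, a)
        self.R = np.zeros((S, A))        # R_t(s, a): accumulated reward in (s, a)
        self.P = np.zeros((S, A, S))     # P_t(s, a, s'): observed transitions

    def update(self, s, a, r, s_next):
        self.N[s, a] += 1
        self.R[s, a] += r
        self.P[s, a, s_next] += 1

    def estimates(self):
        """r_hat = R / N and p_hat = P / N where N > 0; optimistic defaults otherwise."""
        S = self.P.shape[2]
        N_safe = np.maximum(self.N, 1)                 # avoid division by zero
        r_hat = self.R / N_safe
        p_hat = self.P / N_safe[:, :, None]
        unvisited = self.N == 0
        r_hat[unvisited] = 1.0                         # as in the "Otherwise set ..." step
        p_hat[unvisited] = 1.0 / S
        return r_hat, p_hat
```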
36 However, together with appropriate confidence intervals they may be used to define a set Mt of plausible MDPs. [sent-70, score-0.118]
37 Our algorithm then chooses an optimal policy π̃_t for M̃_t with maximal average reward ρ̃*_t := ρ*(M̃_t) among the MDPs in M_t. [sent-71, score-0.625]
38 Actually, M_t is defined to contain exactly those unichain MDPs M' whose transition probabilities p'(·,·,·) and rewards r'(·,·) satisfy, for all states s, s' and actions a, r'(s, a) ≤ r̂_t(s, a) + √(log(2t^α|S||A|) / (2N_t(s, a)))  (3) and |p'(s, a, s') − p̂_t(s, a, s')| ≤ √(log(4t^α|S|²|A|) / (2N_t(s, a))).  (4) [sent-75, score-0.767]
39 Conditions (3) and (4) describe confidence bounds on the rewards and transition probabilities of the true MDP M such that (2) is implied (cf. [sent-76, score-0.385]
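Read directly, conditions (3) and (4) are a membership test for M_t; the following sketch checks whether a candidate MDP (given by reward and transition arrays) is plausible with respect to the current estimates. The array layout and the parameter α are assumptions of this sketch.

```python
import numpy as np

def in_plausible_set(r_cand, p_cand, r_hat, p_hat, N, t, alpha):
    """Check (3) and (4) for all (s, a): candidate rewards must not exceed the
    optimistic reward bound, and candidate transition probabilities must lie in
    the entrywise confidence box around p_hat."""
    S, A = r_hat.shape
    N_safe = np.maximum(N, 1)                        # guard against division by zero
    rad_r = np.sqrt(np.log(2 * t**alpha * S * A) / (2 * N_safe))
    rad_p = np.sqrt(np.log(4 * t**alpha * S**2 * A) / (2 * N_safe))
    ok_r = np.all(r_cand <= r_hat + rad_r)
    ok_p = np.all(np.abs(p_cand - p_hat) <= rad_p[:, :, None])
    return bool(ok_r and ok_p)
```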
40 The intuition behind the algorithm is that if a non-optimal policy is followed, then this is eventually observed and something about the MDP is learned. [sent-79, score-0.412]
41 In the proofs we show that this learning happens sufficiently fast to approach an optimal policy with only logarithmic regret. [sent-80, score-0.571]
42 As switching policies too often may be harmful, and estimates don’t change very much after few steps, our algorithm discards the policy π̃_t only if there was considerable progress concerning the estimates p̂(s, π̃_t(s), s') or r̂(s, π̃_t(s)). [sent-81, score-0.526]
43 That is, UCRL sticks to a policy until the length of some of the confidence intervals given by conditions (3) and (4) is halved. [sent-82, score-0.497]
44 We show (cf. Section 3) that this condition limits the number of policy changes without paying too much for not changing to an optimal policy earlier. [sent-86, score-0.872]
45 The optimal policy π̃ in the algorithm can be efficiently calculated by a modified version of value iteration (cf. [sent-89, score-0.439]
46 Notation: Set conf_p(t, s, a) := min{1, √(log(4t^α|S|²|A|) / (2N_t(s, a)))} and conf_r(t, s, a) := min{1, √(log(2t^α|S||A|) / (2N_t(s, a)))}. [sent-93, score-0.16]
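In code, the two radii of this Notation step look as follows (a sketch using the square-root form reconstructed above; α is the algorithm’s confidence exponent):

```python
import numpy as np

def conf_r(t, N_sa, S, A, alpha):
    """conf_r(t, s, a) = min{1, sqrt(log(2 t^alpha |S||A|) / (2 N_t(s, a)))}."""
    return min(1.0, np.sqrt(np.log(2 * t**alpha * S * A) / (2 * max(N_sa, 1))))

def conf_p(t, N_sa, S, A, alpha):
    """conf_p(t, s, a) = min{1, sqrt(log(4 t^alpha |S|^2 |A|) / (2 N_t(s, a)))}."""
    return min(1.0, np.sqrt(np.log(4 * t**alpha * S**2 * A) / (2 * max(N_sa, 1))))

# Example: 50 visits to (s, a) at step t = 1000 in a 10-state, 4-action MDP
print(conf_r(1000, 50, S=10, A=4, alpha=2), conf_p(1000, 50, S=10, A=4, alpha=2))
```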
47 Recalculate estimates r̂_t(s, a) and p̂_t(s, a, s') according to r̂_t(s, a) := R_t(s, a)/N_t(s, a) and p̂_t(s, a, s') := P_t(s, a, s')/N_t(s, a), provided that N_t(s, a) > 0. [sent-103, score-0.644]
48 Otherwise set r̂_t(s, a) := 1 and p̂_t(s, a, s') := 1/|S|. [sent-104, score-0.31]
49 Calculate a new policy π̃_{t_k} := arg max_π max{ρ(M', π) : M' ∈ M_t}, where M_t consists of plausible unichain MDPs M' with rewards r'(s, a) − r̂_t(s, a) ≤ conf_r(t, s, a) and transition probabilities |p'(s, a, s') − p̂_t(s, a, s')| ≤ conf_p(t, s, a). [sent-106, score-1.326]
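The paper computes π̃_{t_k} exactly with a modified value iteration (referred to above but not reproduced in this excerpt). As an illustration of the optimism step alone, the sketch below picks, for a given vector u of utilities over successor states, the most favorable transition vector inside the entrywise confidence box of conf_p; the function name and the utility vector are assumptions of this sketch, not the paper’s procedure.

```python
import numpy as np

def optimistic_transition(p_hat, conf, u):
    """Among probability vectors p' with |p'(s') - p_hat(s')| <= conf for every s',
    pick the one maximizing sum_s' p'(s') * u(s').  Greedy: transfer as much mass
    as allowed from low-utility to high-utility successor states."""
    p = np.array(p_hat, dtype=float)
    order = np.argsort(u)                     # successor states, worst utility first
    gain = np.minimum(conf, 1.0 - p)          # how much each entry may still grow
    lose = np.minimum(conf, p)                # how much each entry may still shrink
    i, j = len(order) - 1, 0                  # i walks down from best, j up from worst
    while i > j:
        a, b = order[i], order[j]
        delta = min(gain[a], lose[b])         # transferring delta keeps sum(p) = 1
        p[a] += delta; gain[a] -= delta
        p[b] -= delta; lose[b] -= delta
        if gain[a] <= 0: i -= 1
        if lose[b] <= 0: j -= 1
    return p

# Hypothetical example: shift probability toward the high-utility successor state 2.
print(optimistic_transition(np.array([0.5, 0.3, 0.2]), conf=0.1, u=np.array([0.0, 1.0, 2.0])))
# -> [0.4  0.3  0.3]
```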
50 While conf_r(t, S, A) > conf_r(t_k, S, A)/2 and conf_p(t, S, A) > conf_p(t_k, S, A)/2 do: (a) Choose action a_t := π̃_{t_k}(s_t). [sent-108, score-0.376]
51 (b) Observe the obtained reward r_t and next state s_{t+1}. [sent-109, score-0.394]
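Putting the steps together, the control flow of a UCRL-style agent looks roughly as below. This is a structural sketch only: `env`, `compute_optimistic_policy`, `conf_r` and `conf_p` are assumed helpers, and the halving test is applied here to the currently visited state-action pair, which is one reading of the stopping condition above.

```python
import numpy as np

def ucrl(env, S, A, alpha, T, compute_optimistic_policy, conf_r, conf_p):
    """Round structure of UCRL (a sketch, not the paper's exact pseudo-code)."""
    N = np.zeros((S, A))                 # visit counts N_t(s, a)
    R = np.zeros((S, A))                 # accumulated rewards R_t(s, a)
    P = np.zeros((S, A, S))              # transition counts P_t(s, a, s')
    s, t = env.reset(), 1
    policy = np.zeros(S, dtype=int)
    while t <= T:
        t_k, N_k = t, N.copy()           # snapshot at the start of round k
        policy = compute_optimistic_policy(N, R, P, t, alpha)   # e.g. a modified value iteration
        while t <= T:
            a = policy[s]
            # the round ends once a confidence interval of the visited pair has
            # been (at least) halved since step t_k
            if (conf_r(t, N[s, a], S, A, alpha) <= conf_r(t_k, N_k[s, a], S, A, alpha) / 2 or
                    conf_p(t, N[s, a], S, A, alpha) <= conf_p(t_k, N_k[s, a], S, A, alpha) / 2):
                break
            r, s_next = env.step(a)      # observe reward r_t and next state s_{t+1}
            N[s, a] += 1; R[s, a] += r; P[s, a, s_next] += 1
            s, t = s_next, t + 1
    return policy
```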
52 For any t, any reward r(s, a) and any transition probability p(s, a, s') of the true MDP M, the probabilities P(r̂_t(s, a) < r(s, a) − conf_r(t, s, a)) and P(|p̂_t(s, a, s') − p(s, a, s')| > conf_p(t, s, a)) are small. Proof. [sent-117, score-0.452]
53 This implies that the maximal average reward ρ̃*_t assumed by our algorithm when calculating a new policy at step t is an upper bound on ρ*(M) with high probability. [sent-121, score-0.658]
54 Sufficient Precision and Mixing Times: In order to upper bound the loss, we consider the precision needed to guarantee that the policy calculated by UCRL is (ε-)optimal. [sent-125, score-0.526]
55 This sufficient precision will of course depend on ε or – in case one wants to compete with an optimal policy – the minimal difference between ρ* and the average reward of some suboptimal policy, ∆ := min_{π: ρ(M,π) < ρ*} (ρ* − ρ(M, π)). [sent-126, score-0.729]
56 For if |ρ(M̃_t, π̃_t) − ρ(M, π̃_t)| < ε, then with high probability ε > |ρ(M̃_t, π̃_t) − ρ(M, π̃_t)| ≥ |ρ*(M) − ρ(M, π̃_t)|,  (7) so that π̃_t is already an ε-optimal policy on M. [sent-128, score-0.412]
57 Thus, we consider bounds on the deviation of the transition probabilities and rewards for the assumed MDP M̃_t from the true values, such that (7) is implied. [sent-130, score-0.385]
58 Given an ergodic Markov chain C, let T_{s,s'} be the first passage time for two states s, s', that is, the time needed to reach s' when starting in s. [sent-133, score-0.312]
59 Furthermore let T_{s,s} be the return time to state s, and set T_C := max_s max{ max_{s'≠s} E(T_{s',s}), 2E(T_{s,s}) }. [sent-134, score-0.124]
60 Then the mixing time of a unichain MDP M is T_M := max_π T_{C_π}, where C_π is the Markov chain induced by π on M. [sent-136, score-0.216]
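For a known chain, the expected first-passage and return times appearing in this definition can be computed exactly by solving a small linear system; the sketch below does this and combines them into T_C following the reconstructed definition above (the exact combination in the paper’s Definition 2 may differ).

```python
import numpy as np

def expected_hitting_times(P, target):
    """E[T_{s, target}] for every start state s of an ergodic chain P, from the
    linear system h(s) = 1 + sum_{x != target} P[s, x] h(x), with h(target) = 0."""
    S = P.shape[0]
    idx = [s for s in range(S) if s != target]
    Q = P[np.ix_(idx, idx)]                      # chain restricted to non-target states
    h = np.zeros(S)
    h[idx] = np.linalg.solve(np.eye(S - 1) - Q, np.ones(S - 1))
    return h

def expected_return_time(P, s):
    """E[T_{s,s}] = 1 + sum_x P[s, x] * E[T_{x, s}]."""
    return 1.0 + float(P[s] @ expected_hitting_times(P, s))

# Hypothetical two-state chain induced by some policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
T_C = max(
    max(max(expected_hitting_times(P, s)[sp] for sp in range(2) if sp != s),
        2 * expected_return_time(P, s))
    for s in range(2)
)
print(T_C)   # 10.0 here: the expected first passage time from state 0 to state 1
```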
61 Our notion of mixing time is different from the notion of ε-return mixing time given in [1, 2], which depends on an additional parameter ε. [sent-138, score-0.142]
62 Let p(·,·), p̃(·,·) and r(·), r̃(·) be the transition probabilities and rewards of the MDPs M and M̃ under the policy π̃, respectively. [sent-141, score-0.701]
63 The proposition is an easy consequence of the following result about the difference in the stationary distributions of ergodic Markov chains. [sent-143, score-0.266]
64 Let C, C̃ be two ergodic Markov chains on the same state space S with transition probabilities p(·,·), p̃(·,·) and stationary distributions µ, µ̃. [sent-145, score-0.429]
65 Then the difference in the distributions µ, µ̃ can be upper bounded by the difference in the transition probabilities as follows: max_{s∈S} |µ(s) − µ̃(s)| ≤ κ_C · max_{s∈S} Σ_{s'∈S} |p(s, s') − p̃(s, s')|, where κ_C is as given in Definition 2. [sent-146, score-0.289]
66 In particular, Σ_{s∈S} |µ(s) − µ̃(s)| ≤ |S| κ_M max_{s∈S} Σ_{s'∈S} |p(s, s') − p̃(s, s')|.  (8) As the rewards are in [0, 1] and Σ_s µ̃(s) = 1, we have by (1) that |ρ(M̃, π̃) − ρ(M, π̃)| ≤ Σ_{s∈S} |µ̃(s) − µ(s)| r̃(s) + Σ_{s∈S} |r̃(s) − r(s)| µ(s) < κ_M |S|² ε_p + ε_r = ε. [sent-149, score-0.141]
67 Since ε_r > ε_p and the confidence intervals for rewards are smaller than those for transition probabilities (cf. Lemma 1), in the following we only consider the precision needed for the transition probabilities. [sent-150, score-0.374] [sent-151, score-0.147]
69 3 Bounding the Regret: As can be seen from the description of the algorithm, we split the sequence of steps into rounds, where a new round starts whenever the algorithm recalculates its policy. [sent-153, score-0.141]
70 For halving a confidence interval of a reward or transition probability for some (s, a) ∈ S × A, the number Nt (s, a) of visits in (s, a) has to be at least doubled. [sent-156, score-0.346]
71 The number of rounds after T steps cannot exceed |S||A| log₂(T / (|S||A|)). Proposition 3. [sent-158, score-0.173]
72 3.1 Regret due to Suboptimal Rounds: Proposition 3 provides an upper bound on the number of visits needed in each (s, a) in order to guarantee that a newly calculated policy is optimal. [sent-164, score-0.555]
73 This can be used to upper bound the total number of steps in suboptimal rounds. [sent-165, score-0.194]
74 Consider all suboptimal rounds with |p̂_{t_k}(s, a, s') − p(s, a, s')| ≥ ε_p for some s', where a policy π_{t_k} with π_{t_k}(s) = a is played. [sent-166, score-0.566]
75 The mean passage time between any states s and s' is upper bounded by T_M. [sent-171, score-0.163]
76 Thus we may separate each round i into ⌊τ_i(s, a)/(2T_M)⌋ intervals of length ≥ 2T_M, in each of which the probability of visiting state s is at least 1/2. [sent-173, score-0.22]
77 Thus we may lower bound the number of visits N_{s,a}(n) in (s, a) within the n := Σ_{i=1}^{m(s,a)} ⌊τ_i(s, a)/(2T_M)⌋ such intervals by an application of the Chernoff-Hoeffding inequality: with probability 1 − 1/T, N_{s,a}(n) ≥ n/2 − √((n log T)/2); and by Proposition 3, N_t(s, a) < 2 log(4T^α|S|²|A|)/ε_p² in these rounds. [sent-174, score-0.229]
78 This gives for the expected regret in these rounds a bound of order E[Σ_{i=1}^{m(s,a)} τ_i(s, a)] < 2c · T_M log(4T^α|S|²|A|)/ε_p² + 2 m(s, a) T_M, plus a lower-order term coming from the 1/T failure probability. [sent-176, score-0.385]
79 Applying Corollary 2 and summing up over all (s, a), one sees that the expected regret due to suboptimal rounds cannot exceed 2c |S||A| T_M log(4T^α|S|²|A|)/ε_p² + 2T_M |S|²|A|² log₂(T / (|S||A|)) + |S||A|. [sent-177, score-0.571]
80 3.2 Loss by Policy Changes: For any policy π̃_t there may be some states from which the expected average reward for the next τ steps is larger than when starting in some other state. [sent-180, score-0.688]
81 However, as we are playing our policies only for a finite number of steps before considering a change, we have to take into account that every time we switch policies, we may need a start-up phase to get into such a favorable state. [sent-182, score-0.175]
82 For all policies π, all starting states s_0 and all T ≥ 0: E[Σ_{t=0}^{T−1} r(s_t, π(s_t))] ≥ T ρ(π, M) − T_M. [sent-187, score-0.136]
83 By Corollary 2, the corresponding expected regret after T steps is ≤ |S||A| T_M log₂(T / (|S||A|)). [sent-188, score-0.42]
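In code, this bookkeeping term is just (number of rounds) × T_M, with the number of rounds bounded as reconstructed above; all inputs below are hypothetical.

```python
import math

def rounds_bound(S, A, T):
    """Reconstructed bound |S||A| * log2(T / (|S||A|)) on the number of rounds."""
    return S * A * math.log2(T / (S * A))

def policy_change_regret_bound(S, A, T, T_M):
    """Each round costs at most T_M extra regret for its start-up phase."""
    return T_M * rounds_bound(S, A, T)

print(policy_change_regret_bound(S=10, A=4, T=10**6, T_M=50))   # about 2.9e4
```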
84 3.3 Regret if Confidence Intervals Fail: Finally, we have to take into account the error probabilities with which, in each round, a transition probability or a reward, respectively, is not contained in its confidence interval. [sent-191, score-0.193]
85 Let t_1, …, t_N ≤ T be the steps in which a new round starts. [sent-196, score-0.141]
86 3.4 Putting Everything Together: Summing up over all the sources of regret and substituting for ε_p yields the following theorem, which is a generalization of similar results that were achieved for the multi-armed bandit problem in [8]. [sent-200, score-0.414]
87 On unichain MDPs, the expected total regret of the UCRL algorithm with respect to an (ε-)optimal policy after T > 1 steps can be upper bounded by E(R_T^ε) < const · (|A| T_M κ_M² |S|⁵ / ε²) log T + 3 T_M |S|²|A|² log₂(T / (|S||A|)), and E(R_T) < const · (|A| T_M κ_M² |S|⁵ / ∆²) log T + 3 T_M |S|²|A|² log₂(T / (|S||A|)). [sent-202, score-1.1]
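For a feel of how the bound scales, one can plug problem parameters into the reconstructed expression; the unspecified constant and all numbers below are placeholders.

```python
import math

def ucrl_regret_bound(S, A, T, T_M, kappa_M, gap, const=1.0):
    """Reconstructed bound: const * |A| T_M kappa_M^2 |S|^5 / gap^2 * log T
       + 3 T_M |S|^2 |A|^2 log2(T / (|S||A|)); `gap` plays the role of Delta (or epsilon)."""
    return (const * A * T_M * kappa_M**2 * S**5 / gap**2 * math.log(T)
            + 3 * T_M * S**2 * A**2 * math.log2(T / (S * A)))

print(ucrl_regret_bound(S=5, A=2, T=10**6, T_M=20, kappa_M=2.0, gap=0.1))
```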
88 4 Remarks and Open Questions on Multichain MDPs: In a multichain MDP a policy π may split up the MDP into ergodic subchains S_i^π. [sent-203, score-0.712]
89 Thus it may happen during the learning phase that one goes wrong and ends up in a part of the MDP that gives suboptimal return but cannot be left under any policy whatsoever. [sent-204, score-0.56]
90 Obviously, if we knew for each policy which subchains it induces on M (the MDP’s ergodic structure), UCRL could choose an MDP M̃_t and a policy π̃_t that maximizes the reward among all plausible MDPs with the given ergodic structure. [sent-208, score-1.355]
91 However, only the empiric ergodic structure (based on the observations so far) is known. [sent-209, score-0.214]
92 As the empiric ergodic structure may not be reliable, one may additionally explore the ergodic structures of all policies. [sent-210, score-0.388]
93 Alas, the number of additional exploration steps will depend on the smallest positive transition probability. [sent-211, score-0.232]
94 If the latter is not known, it seems that logarithmic online regret bounds can no longer be guaranteed. [sent-212, score-0.676]
95 However, we conjecture that for a slightly modified algorithm the logarithmic online regret bounds still hold for communicating MDPs, in which for any two states s, s' there is a suitable policy π such that s' is reachable from s under π (i. [sent-213, score-1.166]
96 5 Conclusion and Outlook: Besides the open problems on multichain MDPs, it is an interesting question whether our results also hold when the mixing time is defined via the fastest rather than the slowest policy for reaching any state. [sent-217, score-0.602]
97 Another research direction is to consider value function approximation and continuous reinforcement learning problems. [sent-218, score-0.147]
98 For practical purposes, using the variance of the estimates will reduce the width of the upper confidence bounds and will make the exploration even more focused, improving learning speed and regret bounds. [sent-219, score-0.585]
99 R-max – a general polynomial time algorithm for near-optimal reinforcement learning. [sent-233, score-0.184]
100 Online regret bounds for a new reinforcement learning algorithm. [sent-261, score-0.598]
wordName wordTfidf (topN-words)
[('policy', 0.412), ('regret', 0.355), ('ucrl', 0.26), ('mdp', 0.23), ('mt', 0.211), ('rt', 0.192), ('nt', 0.177), ('tm', 0.174), ('ergodic', 0.174), ('mdps', 0.158), ('dence', 0.149), ('reinforcement', 0.147), ('reward', 0.143), ('logarithmic', 0.132), ('unichain', 0.122), ('pt', 0.118), ('transition', 0.117), ('rewards', 0.114), ('tk', 0.102), ('st', 0.099), ('bounds', 0.096), ('online', 0.093), ('policies', 0.09), ('strehl', 0.087), ('intervals', 0.085), ('rounds', 0.085), ('confp', 0.08), ('confr', 0.08), ('round', 0.076), ('proposition', 0.071), ('mixing', 0.071), ('undiscounted', 0.069), ('suboptimal', 0.069), ('con', 0.065), ('steps', 0.065), ('upper', 0.06), ('mbie', 0.06), ('multichain', 0.06), ('bandit', 0.059), ('visits', 0.059), ('state', 0.059), ('probabilities', 0.058), ('action', 0.056), ('exploration', 0.05), ('markov', 0.048), ('corollary', 0.048), ('states', 0.046), ('auer', 0.044), ('passage', 0.044), ('burnetas', 0.04), ('cho', 0.04), ('empiric', 0.04), ('leoben', 0.04), ('optimism', 0.04), ('ronald', 0.04), ('sham', 0.04), ('subchains', 0.04), ('tc', 0.04), ('yishay', 0.04), ('summing', 0.039), ('michael', 0.038), ('return', 0.038), ('polynomial', 0.037), ('peter', 0.035), ('cti', 0.035), ('eyal', 0.035), ('lemma', 0.034), ('alexander', 0.033), ('discounted', 0.033), ('plausible', 0.033), ('communicating', 0.032), ('brafman', 0.032), ('ti', 0.03), ('kearns', 0.03), ('log', 0.03), ('precision', 0.03), ('littman', 0.03), ('visited', 0.028), ('max', 0.027), ('interval', 0.027), ('optimal', 0.027), ('elimination', 0.026), ('singh', 0.026), ('compete', 0.026), ('const', 0.026), ('reach', 0.025), ('guarantee', 0.024), ('pn', 0.024), ('decision', 0.024), ('estimates', 0.024), ('chain', 0.023), ('handled', 0.023), ('exceed', 0.023), ('loss', 0.022), ('average', 0.022), ('wrong', 0.021), ('maximal', 0.021), ('stationary', 0.021), ('changing', 0.021), ('phase', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
2 0.36612207 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
3 0.24088782 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.22981919 154 nips-2006-Optimal Change-Detection and Spiking Neurons
Author: Angela J. Yu
Abstract: Survival in a non-stationary, potentially adversarial environment requires animals to detect sensory changes rapidly yet accurately, two oft competing desiderata. Neurons subserving such detections are faced with the corresponding challenge to discern “real” changes in inputs as quickly as possible, while ignoring noisy fluctuations. Mathematically, this is an example of a change-detection problem that is actively researched in the controlled stochastic processes community. In this paper, we utilize sophisticated tools developed in that community to formalize an instantiation of the problem faced by the nervous system, and characterize the Bayes-optimal decision policy under certain assumptions. We will derive from this optimal strategy an information accumulation and decision process that remarkably resembles the dynamics of a leaky integrate-and-fire neuron. This correspondence suggests that neurons are optimized for tracking input changes, and sheds new light on the computational import of intracellular properties such as resting membrane potential, voltage-dependent conductance, and post-spike reset voltage. We also explore the influence that factors such as timing, uncertainty, neuromodulation, and reward should and do have on neuronal dynamics and sensitivity, as the optimal decision strategy depends critically on these factors. 1
5 0.19781451 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
6 0.17519514 146 nips-2006-No-regret Algorithms for Online Convex Programs
7 0.1726782 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
8 0.1401052 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
9 0.13564143 124 nips-2006-Linearly-solvable Markov decision problems
10 0.12463009 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
11 0.11730297 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
12 0.11269514 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
13 0.087966211 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
14 0.086543553 61 nips-2006-Convex Repeated Games and Fenchel Duality
15 0.085207805 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
16 0.075307116 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
17 0.07208211 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
18 0.072004065 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
19 0.071194194 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons
20 0.058853235 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
topicId topicWeight
[(0, -0.173), (1, -0.071), (2, -0.413), (3, -0.216), (4, 0.027), (5, -0.115), (6, 0.041), (7, -0.096), (8, 0.355), (9, 0.002), (10, 0.045), (11, -0.049), (12, -0.006), (13, -0.008), (14, 0.068), (15, 0.053), (16, 0.035), (17, 0.047), (18, 0.018), (19, -0.058), (20, 0.023), (21, 0.111), (22, 0.052), (23, 0.01), (24, 0.015), (25, 0.116), (26, -0.086), (27, -0.089), (28, -0.012), (29, 0.043), (30, 0.03), (31, 0.073), (32, -0.153), (33, -0.078), (34, 0.047), (35, 0.066), (36, -0.029), (37, -0.016), (38, -0.056), (39, 0.053), (40, 0.009), (41, -0.055), (42, -0.047), (43, -0.042), (44, 0.074), (45, -0.044), (46, 0.012), (47, 0.031), (48, 0.059), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.97849482 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
2 0.86908937 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
3 0.68542165 44 nips-2006-Bayesian Policy Gradient Algorithms
Author: Mohammad Ghavamzadeh, Yaakov Engel
Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1
4 0.64039236 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
Author: Huan Xu, Shie Mannor
Abstract: Computation of a satisfactory control policy for a Markov decision process when the parameters of the model are not exactly known is a problem encountered in many practical applications. The traditional robust approach is based on a worstcase analysis and may lead to an overly conservative policy. In this paper we consider the tradeoff between nominal performance and the worst case performance over all possible models. Based on parametric linear programming, we propose a method that computes the whole set of Pareto efficient policies in the performancerobustness plane when only the reward parameters are subject to uncertainty. In the more general case when the transition probabilities are also subject to error, we show that the strategy with the “optimal” tradeoff might be non-Markovian and hence is in general not tractable. 1
5 0.61209309 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
Author: Sandeep Pandey, Christopher Olston
Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1
6 0.56241369 154 nips-2006-Optimal Change-Detection and Spiking Neurons
7 0.55438352 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
8 0.47528934 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
9 0.46721575 124 nips-2006-Linearly-solvable Markov decision problems
10 0.41773376 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
11 0.4175722 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
12 0.41419429 146 nips-2006-No-regret Algorithms for Online Convex Programs
13 0.37796476 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning
14 0.36345184 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games
15 0.27746722 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
16 0.25832254 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems
17 0.25585049 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
18 0.24028713 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
19 0.20882276 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
20 0.19910437 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space
topicId topicWeight
[(1, 0.071), (2, 0.325), (3, 0.013), (7, 0.052), (9, 0.037), (22, 0.082), (25, 0.014), (44, 0.106), (57, 0.069), (65, 0.098), (71, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.8345111 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Author: Peter Auer, Ronald Ortner
Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1
2 0.74918151 27 nips-2006-An EM Algorithm for Localizing Multiple Sound Sources in Reverberant Environments
Author: Michael I. Mandel, Daniel P. Ellis, Tony Jebara
Abstract: We present a method for localizing and separating sound sources in stereo recordings that is robust to reverberation and does not make any assumptions about the source statistics. The method consists of a probabilistic model of binaural multisource recordings and an expectation maximization algorithm for finding the maximum likelihood parameters of that model. These parameters include distributions over delays and assignments of time-frequency regions to sources. We evaluate this method against two comparable algorithms on simulations of simultaneous speech from two or three sources. Our method outperforms the others in anechoic conditions and performs as well as the better of the two in the presence of reverberation. 1
3 0.69746292 85 nips-2006-Geometric entropy minimization (GEM) for anomaly detection and localization
Author: Alfred O. Hero
Abstract: We introduce a novel adaptive non-parametric anomaly detection approach, called GEM, that is based on the minimal covering properties of K-point entropic graphs when constructed on N training samples from a nominal probability distribution. Such graphs have the property that as N → ∞ their span recovers the entropy minimizing set that supports at least ρ = K/N (100)% of the mass of the Lebesgue part of the distribution. When a test sample falls outside of the entropy minimizing set an anomaly can be declared at a statistical level of significance α = 1 − ρ. A method for implementing this non-parametric anomaly detector is proposed that approximates this minimum entropy set by the influence region of a K-point entropic graph built on the training data. By implementing an incremental leave-one-out k-nearest neighbor graph on resampled subsets of the training data GEM can efficiently detect outliers at a given level of significance and compute their empirical p-values. We illustrate GEM for several simulated and real data sets in high dimensional feature spaces. 1
4 0.65550876 174 nips-2006-Similarity by Composition
Author: Oren Boiman, Michal Irani
Abstract: We propose a new approach for measuring similarity between two signals, which is applicable to many machine learning tasks, and to many signal types. We say that a signal S1 is “similar” to a signal S2 if it is “easy” to compose S1 from few large contiguous chunks of S2 . Obviously, if we use small enough pieces, then any signal can be composed of any other. Therefore, the larger those pieces are, the more similar S1 is to S2 . This induces a local similarity score at every point in the signal, based on the size of its supported surrounding region. These local scores can in turn be accumulated in a principled information-theoretic way into a global similarity score of the entire S1 to S2 . “Similarity by Composition” can be applied between pairs of signals, between groups of signals, and also between different portions of the same signal. It can therefore be employed in a wide variety of machine learning problems (clustering, classification, retrieval, segmentation, attention, saliency, labelling, etc.), and can be applied to a wide range of signal types (images, video, audio, biological data, etc.) We show a few such examples. 1
5 0.55967069 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
Author: Peter L. Bartlett, Ambuj Tewari
Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.
6 0.50714868 44 nips-2006-Bayesian Policy Gradient Algorithms
7 0.50060368 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments
8 0.49651596 124 nips-2006-Linearly-solvable Markov decision problems
9 0.48680705 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation
10 0.48560199 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
11 0.48142749 61 nips-2006-Convex Repeated Games and Fenchel Duality
12 0.48011413 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
13 0.47921899 65 nips-2006-Denoising and Dimension Reduction in Feature Space
14 0.4788405 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.47678414 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
16 0.47672027 20 nips-2006-Active learning for misspecified generalized linear models
17 0.47517797 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
18 0.47489583 79 nips-2006-Fast Iterative Kernel PCA
19 0.47418159 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
20 0.47295257 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation