nips nips2007 nips2007-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). [sent-6, score-0.185]
2 It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. [sent-8, score-0.294]
3 We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. [sent-9, score-0.266]
4 But OLP is simpler and its regret bound has a better dependence on the size of the MDP. [sent-12, score-0.347]
5 Assuming that the decision maker or agent is able to perfectly observe its own state, uncertain systems are often modeled as Markov decision processes (MDPs). [sent-14, score-0.099]
6 Given a policy π, how do we measure its ability to handle this trade-off? [sent-24, score-0.147]
7 Suppose the agent gets a numerical reward at each time step and we measure performance by the accumulated reward over time. [sent-25, score-0.253]
8 Then, a meaningful quantity to evaluate the policy π is its regret over time. [sent-26, score-0.455]
9 To understand what regret means, consider an omniscient agent who knows all parameters of the MDP accurately and behaves optimally. [sent-27, score-0.323]
10 Let VT be the expected reward obtained by this agent up to time T . [sent-28, score-0.137]
11 Let $V_T^\pi$ denote the corresponding quantity for π. [sent-29, score-0.062]
12 Then the regret $R_T^\pi = V_T - V_T^\pi$ measures how much π is hurt due to its incomplete knowledge of the MDP up to time T. [sent-30, score-0.288]
13 If we can show that the regret RT grows slowly with time T , for all MDPs in a sufficiently big class, then we can safely conclude that π is making a judicious trade-off between exploration and exploitation. [sent-31, score-0.346]
14 It is rather remarkable that for this notion of regret, logarithmic bounds have been proved in the literature [1,2]. [sent-32, score-0.064]
15 This means that there are policies π with $R_T^\pi = O(\log T)$. [sent-33, score-0.092]
16 Thus the per-step regret RT /T goes to zero very quickly. [sent-34, score-0.265]
17 Burnetas and Katehakis [1] proved that for any policy π (satisfying certain reasonable assumptions) $R_T^\pi \ge C_B(P)\log T$, where they identified the constant $C_B(P)$. [sent-35, score-0.24]
18 This constant depends on the transition function P of the MDP. [sent-36, score-0.092]
19 However, besides assuming that the MDP is irreducible (see Assumption 1 below) they assumed that the support sets of the transition distributions pi (a) are known for all state-action pairs. [sent-38, score-0.407]
20 In this paper, we not only get rid of this assumption but our optimistic linear programming (OLP) algorithm is also computationally simpler. [sent-39, score-0.214]
21 The price we pay for these advantages is that the regret of OLP is C(P ) log T asymptotically, for a constant C(P ) ≥ CB (P ). [sent-45, score-0.358]
22 The algorithm of Auer and Ortner (we refer to it as AOA) is another logarithmic regret algorithm for irreducible MDPs. [sent-47, score-0.311]
23 But then the optimization problem they solve is more complicated because they find a policy to use in the next few time steps by optimizing over a set of MDPs. [sent-49, score-0.17]
24 The regret of AOA is $C_A(P)\log T$ where $C_A(P) = \frac{c\,|S|^5 |A|\, T_w(P)\,\kappa(P)^2}{\Delta^*(P)^2}$ (1), for some universal constant c. [sent-50, score-0.358]
25 Here |S|, |A| denote the state and action space size, Tw (P ) is the worst case hitting time over deterministic policies (see Eqn. [sent-51, score-0.459]
26 (12)) and ∆∗ (P ) is the difference between the long term average return of the best policy and that of the next best policy. [sent-52, score-0.147]
27 The constant κ(P ) is also defined in terms of hitting times. [sent-53, score-0.175]
28 Under Auer and Ortner’s assumption of bounded rewards, we can show that the constant for OLP satisfies $C(P) \le \frac{2|S||A|\,T(P)^2}{\Phi^*(P)}$ (2). [sent-54, score-0.06]
29 Here $T(P)$ is the hitting time of an optimal policy and is therefore necessarily smaller than $T_w(P)$. [sent-55, score-0.345]
30 The constant Φ∗ (P ) can roughly be thought of as the minimum (over states) difference between the quality of the best and the second best action (see Eqn. [sent-58, score-0.128]
31 2 Preliminaries: Consider an MDP $(S, A, R, P)$ where S is the set of states, $A = \cup_{i\in S} A(i)$ is the set of actions ($A(i)$ being the actions available in state i), $R = \{r(i,a)\}_{i\in S, a\in A(i)}$ are the rewards and $P = \{p_{i,j}(a)\}_{i,j\in S, a\in A(i)}$ are the transition probabilities. [sent-62, score-0.422]
32 For simplicity of analysis, we assume that the rewards are known to us beforehand. [sent-63, score-0.068]
33 We do not assume that we know the support sets of the distributions pi (a). [sent-64, score-0.266]
34 A policy π is a sequence $\{\pi_t\}$ of probability distributions on A given $\sigma_t$ such that $\pi_t(A(s_t)\,|\,\sigma_t) = 1$, where $s_t$ denotes the random variable representing the state at time t. [sent-69, score-0.404]
35 A deterministic policy is simply a function µ : S → A such that µ(i) ∈ A(i). [sent-71, score-0.168]
36 The MDPs they call unichain are called irreducible in standard textbooks (for example, see [9, p. [sent-75, score-0.166]
37 Probability and expectation under a policy π, transition function P and starting state $i_0$ will be denoted by $\mathbb{P}^{\pi,P}_{i_0}$ and $\mathbb{E}^{\pi,P}_{i_0}$ respectively. [sent-77, score-0.245]
38 Given history $\sigma_t$, let $N_t(i)$, $N_t(i,a)$ and $N_t(i,a,j)$ denote the number of occurrences of the state i, the pair (i, a) and the triplet (i, a, j) respectively in $\sigma_t$. [sent-78, score-0.075]
39 For all $\mu \in \Pi_D$, the transition matrix $P^\mu = (p_{i,j}(\mu(i)))_{i,j\in S}$ is irreducible (i.e., every state is reachable from every other state). [sent-81, score-0.141]
40 Consider the rewards accumulated by the policy π before time T, $V_T^\pi(i_0,P) := \mathbb{E}^{\pi,P}_{i_0}\left[\sum_{t=0}^{T-1} r(s_t,a_t)\right]$, where $a_t$ is the random variable representing the action taken by π at time t. [sent-84, score-0.399]
41 Let $V_T(i_0,P)$ be the maximum possible sum of expected rewards before time T, $V_T(i_0,P) := \sup_{\pi\in\Pi} V_T^\pi(i_0,P)$. [sent-85, score-0.151]
42 The regret of a policy π at time T is a measure of how well the expected rewards of π compare with the above quantity, $R_T^\pi(i_0,P) := V_T(i_0,P) - V_T^\pi(i_0,P)$. [sent-86, score-0.503]
43 Define the long-term average reward of a policy π as $\lambda^\pi(i_0,P) := \liminf_{T\to\infty} \frac{V_T^\pi(i_0,P)}{T}$. [sent-87, score-0.222]
44 Under Assumption 1, the above limit exists and is independent of the starting state $i_0$. [sent-88, score-0.17]
45 Given a restricted set $D \subseteq A$ of actions, the gain or the best long-term average performance is $\lambda(P,D) := \sup_\pi \lambda^\pi(i_0,P)$, the supremum being over policies that only use actions in D. [sent-89, score-0.083]
46 The transition and reward functions of the restricted problems are simply the restrictions of P and r to D. [sent-93, score-0.159]
47 Assumption 1 implies that there is a bias vector $h(P,D) = \{h(i;P,D)\}_{i\in S}$ such that the gain $\lambda(P,D)$ and bias $h(P,D)$ are the unique solutions to the average reward optimality equations: $\forall i \in S,\ \lambda(P,D) + h(i;P,D) = \max_{a\in D(i)}\,[\,r(i,a) + \langle p_i(a), h(P,D)\rangle\,]$. [sent-94, score-0.457]
48 Note that if h∗ (P ) is a solution to the optimality equations and e is the vector of ones, then h∗ (P ) + ce is also a solution for any scalar c. [sent-97, score-0.077]
49 It will be convenient to have a way to denote the quantity inside the ‘max’ that appears in the optimality equations. [sent-99, score-0.11]
50 Accordingly, define $L(i,a,p,h) := r(i,a) + \langle p, h\rangle$ and $L^*(i;P,D) := \max_{a\in D(i)} L(i,a,p_i(a),h(P,D))$. [sent-100, score-0.266]
51 To measure the degree of suboptimality of actions available at a state, define $\phi^*(i,a;P) = L^*(i;P,A) - L(i,a,p_i(a),h^*(P))$. [sent-101, score-0.426]
52 Note that the optimal actions are precisely those for which the above quantity is zero. [sent-102, score-0.202]
53 For a suboptimal action $a \notin O(i;P,A)$, the following set contains probability distributions q such that if $p_i(a)$ is changed to q, the quality of action a comes within $\epsilon$ of an optimal action. [sent-109, score-0.53]
54 To make sense of this definition (Eqn. (6)), consider $p = p_i(a)$. [sent-114, score-0.266]
55 The above infimum is then the least distance (in the $L_1$ sense) one has to move away from $p_i(a)$ to make the suboptimal action a look $\epsilon$-optimal. [sent-115, score-0.402]
56 Taking the limit of this as $\epsilon$ decreases gives us a quantity that also plays a crucial role in determining the regret, $K(i,a;P) := \lim_{\epsilon\to 0} J_{i,a}(p_i(a);P,\epsilon)$. [sent-116, score-0.127]
57 (7) Intuitively, if $K(i,a;P)$ is small, it is easy to confuse a suboptimal action with an optimal one and so it should be difficult to achieve small regret. [sent-117, score-0.167]
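A minimal sketch of how a quantity like $J_{i,a}(p;P,\epsilon)$ could be computed as a small linear program is given below. It is not taken from the paper: the exact form of Eqn. (6) (including whether the $L_1$ distance or its square is used) is not reproduced here, so the inputs r_ia, h_star, L_star and the plain-$L_1$ objective are illustrative assumptions; $K(i,a;P)$ would then be approximated by evaluating this at small $\epsilon$.

```python
# Sketch (not the paper's code): least L1 movement from p that makes action a look eps-optimal,
# i.e. min ||q - p||_1 over distributions q with r(i,a) + <q, h*> >= L*(i;P,A) - eps.
import numpy as np
from scipy.optimize import linprog

def J(p, r_ia, h_star, L_star, eps):
    n = len(p)
    p = np.asarray(p, dtype=float)
    # Decision variables x = [q_1..q_n, u_1..u_n] with u_k >= |q_k - p_k|.
    c = np.concatenate([np.zeros(n), np.ones(n)])             # minimize sum_k u_k
    A_ub = np.vstack([
        np.hstack([ np.eye(n), -np.eye(n)]),                  #  q - u <= p
        np.hstack([-np.eye(n), -np.eye(n)]),                  # -q - u <= -p
        np.hstack([-np.asarray(h_star), np.zeros(n)])[None],  # -<q, h*> <= r(i,a) - L* + eps
    ])
    b_ub = np.concatenate([p, -p, [r_ia - L_star + eps]])
    A_eq = np.hstack([np.ones(n), np.zeros(n)])[None]         # q must be a probability vector
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (2 * n))
    return res.fun if res.success else np.inf                 # infeasible: no q makes a look eps-optimal
```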
58 The constant that multiplies log T in the regret bound of our algorithm OLP (see Algorithm 1 and Theorem 4 below) is the following: $C(P) := \sum_{(i,a)\in \mathrm{Crit}(P)} \frac{2\,\phi^*(i,a;P)}{K(i,a;P)}$ (8). [sent-118, score-0.393]
59 This definition might look a bit hard to interpret, so we give an upper bound on C(P) just in terms of the infinity norm $H^*(P)$ of the bias and $\Phi^*(P)$. [sent-119, score-0.057]
60 This latter quantity is defined below to be the minimum degree of suboptimality of a critical action. [sent-120, score-0.098]
61 Hitting times: It turns out that we can bound the infinity norm of the bias in terms of the hitting time of an optimal policy. [sent-127, score-0.255]
62 For any policy µ define its hitting time to be the worst-case expected time to reach one state from another: $T_\mu(P) := \max_{i \ne j} \mathbb{E}^{\mu,P}_{j}[\min\{t > 0 : s_t = i\}]$ (10). [sent-128, score-0.597]
63 The following constant is the minimum hitting time among optimal policies: $T(P) := \min_{\mu} T_\mu(P)$. [sent-129, score-0.229]
64 The quantity $T_w(P)$ is the worst-case hitting time over all deterministic policies: $T_w(P) := \max_{\mu\in\Pi_D} T_\mu(P)$ (12). [sent-131, score-0.193]
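As a concrete illustration of Eqn. (10), the sketch below computes $T_\mu(P)$ for a deterministic policy µ by solving one linear system per target state; representing the policy's transition matrix as a NumPy array indexed by states 0..n-1 is an assumption of this sketch, not the paper's notation.

```python
# Sketch: worst-case expected hitting time T_mu(P) = max_{i != j} E_j[min{t > 0 : s_t = i}]
# for the Markov chain P_mu induced by a deterministic policy mu.
import numpy as np

def hitting_time(P_mu):
    n = P_mu.shape[0]
    worst = 0.0
    for i in range(n):                                  # target state i
        keep = [j for j in range(n) if j != i]
        Q = P_mu[np.ix_(keep, keep)]                    # chain restricted to the other states
        # For j != i, x_j = E_j[time to hit i] solves (I - Q) x = 1;
        # irreducibility (Assumption 1) makes I - Q invertible.
        x = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        worst = max(worst, float(x.max()))
    return worst
```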
65 We can now bound C(P) just in terms of the hitting time T(P) and $\Phi^*(P)$. [sent-132, score-0.202]
66 3 The optimistic LP algorithm and its regret bound. Algorithm 1 (Optimistic Linear Programming): 1: for t = 0, 1, 2, . . . [sent-137, score-0.421]
67 At each time step t, the algorithm computes the empirical estimates for transition probabilities. [sent-142, score-0.084]
68 It then forms a restricted problem ignoring relatively undersampled actions. [sent-143, score-0.098]
69 An action $a \in A(i)$ is considered “undersampled” if $N_t(i,a) < \log^2 N_t(i)$. [sent-144, score-0.097]
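The sketch below illustrates the two bookkeeping steps just described: empirical transition estimates built from the counts $N_t(i,a)$, $N_t(i,a,j)$, and the restricted action sets of adequately sampled actions. It is only a reading of the text above, not the paper's pseudocode; the uniform fallback for never-tried pairs and the fallback when every action is undersampled are assumptions.

```python
# Sketch: empirical estimates and restricted action sets built from the counts N_t.
import math
from collections import defaultdict

N_s   = defaultdict(int)    # N_t(i)
N_sa  = defaultdict(int)    # N_t(i, a)
N_saj = defaultdict(int)    # N_t(i, a, j)

def p_hat(i, a, n_states):
    """Empirical estimate of p_i(a); uniform if (i, a) has never been tried (an assumption)."""
    if N_sa[(i, a)] == 0:
        return [1.0 / n_states] * n_states
    return [N_saj[(i, a, j)] / N_sa[(i, a)] for j in range(n_states)]

def D_t(i, actions):
    """Actions at state i that are not undersampled, i.e. N_t(i,a) >= log^2 N_t(i)."""
    thresh = math.log(max(N_s[i], 1)) ** 2
    kept = [a for a in actions if N_sa[(i, a)] >= thresh]
    return kept if kept else list(actions)   # assumption: keep all actions if none qualify
```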
70 The solutions $\hat h_t$, $\hat\lambda_t$ might be misleading due to estimation errors. [sent-145, score-0.158]
71 To avoid being misled by empirical samples we compute optimistic “indices” $U_t(s_t,a)$ for all legal actions $a \in A(s_t)$ where $s_t$ is the current state. [sent-146, score-0.484]
72 The index for action a is computed by looking at an $L_1$-ball around the empirical estimate $\hat p^{\,t}_{s_t}(a)$ and choosing a probability distribution q that maximizes $L(i,a,q,\hat h_t)$. [sent-147, score-0.386]
73 Note that if the estimates were perfect, we would take an action maximizing $L(i,a,\hat p^{\,t}_{s_t}(a),\hat h_t)$. [sent-148, score-0.386]
74 Instead, we take an action that maximizes the index. [sent-149, score-0.097]
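For a linear function of q, the maximization over the $L_1$-ball described above has a simple greedy solution (equivalent to solving the small linear program): pour as much probability mass as the radius allows onto the next state with the largest bias value and take it away from the least valuable states. The sketch below illustrates that reading; the radius $\sqrt{2\log t / N_t(i,a)}$ is taken from the confidence region that appears in the proof sketch later in the text and is an assumption about the exact constants used in Algorithm 1.

```python
# Sketch: optimistic index U_t(i,a) = max { r(i,a) + <q, h> : q a distribution, ||q - p_hat||_1 <= radius }.
import numpy as np

def optimistic_index(r_ia, p_hat_ia, h, radius):
    q = np.array(p_hat_ia, dtype=float)
    best = int(np.argmax(h))                      # most valuable next state
    q[best] = min(1.0, q[best] + radius / 2.0)    # moving mass delta costs 2*delta in L1
    for j in np.argsort(h):                       # remove the excess from the least valuable states
        if j == best:
            continue
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[j] -= min(q[j], excess)
    return r_ia + float(np.dot(q, h))

# e.g. (hypothetical names): U = optimistic_index(r[i][a], p_hat(i, a, n), h_hat,
#                                                 math.sqrt(2.0 * math.log(t) / max(N_sa[(i, a)], 1)))
```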
75 The exception is when all the optimal actions of the current problem are about to become undersampled at the next time step. [sent-151, score-0.257]
76 In that case, we take one of these actions (steps 18–22). [sent-152, score-0.128]
77 The LP for solving optimality equations can be found in several textbooks (see, for example, [9, p. [sent-154, score-0.138]
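One textbook linear program of the kind referred to above is sketched below: minimize the gain λ subject to $\lambda + h(i) \ge r(i,a) + \langle p_i(a), h\rangle$ for every state i and every allowed action $a \in D(i)$. This is a standard formulation for irreducible average-reward MDPs, not necessarily the exact LP used inside Algorithm 1; pinning $h(0) = 0$ to make the bias unique is an assumption of this sketch.

```python
# Sketch: gain/bias of the restricted problem (P, r, D) via a standard average-reward LP.
import numpy as np
from scipy.optimize import linprog

def solve_gain_bias(r, p, D):
    """r[i][a]: reward; p[i][a]: next-state distribution over 0..n-1; D[i]: allowed actions at i."""
    n = len(D)
    A_ub, b_ub = [], []
    for i in range(n):
        for a in D[i]:
            row = np.zeros(n + 1)                 # variables x = [lambda, h_0, ..., h_{n-1}]
            row[0] = -1.0                         # -lambda
            row[1 + i] -= 1.0                     # -h(i)
            row[1:] += np.asarray(p[i][a])        # +<p_i(a), h>
            A_ub.append(row)                      # r(i,a) + <p_i(a), h> - lambda - h(i) <= 0
            b_ub.append(-r[i][a])
    c = np.zeros(n + 1)
    c[0] = 1.0                                    # minimize lambda
    A_eq = np.zeros((1, n + 1))
    A_eq[0, 1] = 1.0                              # h(0) = 0 (normalization)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), A_eq=A_eq, b_eq=[0.0],
                  bounds=[(None, None)] * (n + 1))
    return res.x[0], res.x[1:]                    # gain lambda(P, D), bias vector h(P, D)
```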
78 Like the original Burnetas-Katehakis algorithm, the modified one also satisfies a logarithmic regret bound as stated in the following theorem. [sent-157, score-0.346]
79 Unlike the original algorithm, OLP does not need to know the support sets of the transition distributions. [sent-158, score-0.061]
80 Let β denote the policy implemented by Algorithm 1. [sent-160, score-0.166]
81 Then we have, for all $i_0 \in S$ and for all P satisfying Assumption 1, $\limsup_{T\to\infty} \frac{R_T^\beta(i_0,P)}{\log T} \le C(P)$, where C(P) is the MDP-dependent constant defined in (8). [sent-161, score-0.119]
82 Equation (13) controls the regret through the counts of suboptimal pairs $i \in S$, $a \notin O(i;P,A)$. Define the event $\hat A_t := \{\,\|\hat h_t - h^*(P)\|_\infty \le \epsilon \ \wedge\ O(\hat P^t, D_t) \subseteq O(P)\,\}$ (14). [sent-164, score-0.189]
83 The result then follows by combining (13) and (15) with the following three propositions and then letting $\epsilon \to 0$ sufficiently slowly. [sent-167, score-0.064]
84 For all P and $i_0 \in S$, we have $\lim_{\epsilon\to 0}\ \limsup_{T\to\infty}\ \sum_{i\in S}\sum_{a\notin O(i;P,A)} \frac{\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)]}{\log T}\;\phi^*(i,a;P) \le C(P)$. [sent-169, score-0.228]
85 On the event $\hat A_t$ (recall the definition given in (14)), we have $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$ for any $q \in \Delta^+$. [sent-180, score-0.189]
86 Therefore, $U_t(i,a) \le \sup_{q\in\Delta^+}\{r(i,a) + \langle q,\hat h_t\rangle\} \le \sup_{q\in\Delta^+}\{r(i,a) + \langle q, h^*(P)\rangle\} + \epsilon < L^*(i;P,A) - \epsilon_0 + \epsilon < L^*(i;P,A) - 2\epsilon$, provided that $3\epsilon < \epsilon_0$. Therefore, for $\epsilon < \epsilon_0/3$, $N_T(i,a;\epsilon_0) = 0$ whenever $\mathrm{MakeOpt}(i,a;P,\epsilon_0) = \emptyset$. [sent-181, score-0.259]
87 $\|\hat p^t_i(a) - q\|_1^2 \le \frac{2\log t}{N_t(i,a)} \ \wedge\ r(i,a) + \langle q, \hat h_t\rangle \ge L^*(i;P,A) - 2\epsilon$. [sent-185, score-0.351]
88 On the event $\hat A_t$, we have $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$ and thus the above implies $\exists q \in \Delta^+$ s.t. [sent-186, score-0.213]
89 $\|\hat p^t_i(a) - q\|_1^2 \le \frac{2\log t}{N_t(i,a)} \ \wedge\ \left(r(i,a) + \langle q, h^*(P)\rangle \ge L^*(i;P,A) - 3\epsilon\right)$. [sent-188, score-0.212]
90 Recalling the definition (6) of $J_{i,a}(p;P,\epsilon)$, we see that this implies $J_{i,a}(\hat p^t_i(a);P,3\epsilon) \le \frac{2\log t}{N_t(i,a)}$. [sent-189, score-0.352]
91 Each time the pair (i, a) occurs Nt (i, a) increases by 1, so the first count is no more than 2 log T . [sent-191, score-0.085]
92 By a Chernoff-type bound, we have, for some constant $C_1$, $\mathbb{P}^{\beta,P}_{i_0}[\,\|p_i(a) - \hat p^t_i(a)\|_1 > f(\delta) \mid N_t(i,a) = m\,] \le C_1\exp(-m f(\delta)^2)$. [sent-193, score-0.713]
93 Summing this over m gives a term of $\frac{C_1}{1-\exp(-f(\delta)^2)}$ (18). Combining the bounds (17) and (18) and plugging them into (16), we get $\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)] \le \frac{2\log T}{J_{i,a}(p_i(a);P,3\epsilon)-\delta} + \frac{C_1}{1-\exp(-f(\delta)^2)}$. [sent-195, score-0.08]
94 Letting $\delta \to 0$ sufficiently slowly, we get that for all $\epsilon > 0$, $\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)] \le \frac{2\log T}{J_{i,a}(p_i(a);P,3\epsilon)} + o(\log T)$. [sent-196, score-0.062]
95 Therefore, $\lim_{\epsilon\to 0}\ \limsup_{T\to\infty}\ \frac{\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)]}{\log T} \le \lim_{\epsilon\to 0} \frac{2}{J_{i,a}(p_i(a);P,3\epsilon)} = \frac{2}{K(i,a;P)}$, where the last equality follows from the definition (7) of $K(i,a;P)$. [sent-197, score-0.374]
96 Therefore, for all optimal actions $a^* \in O(i;P,A)$, we have, on the event $A_t(i,a;\epsilon)$, $U_t(i,a^*) \le U_t(i,a) < L^*(i;P,A) - 2\epsilon$. [sent-204, score-0.209]
97 Since $L^*(i;P,A) = r(i,a^*) + \langle p_i(a^*), h^*(P)\rangle$, this implies: $\forall q \in \Delta^+$, $\|q - \hat p^t_i(a^*)\|_1^2 \le \frac{2\log t}{N_t(i,a^*)} \Rightarrow \langle q, \hat h_t\rangle < \langle p_i(a^*), h^*(P)\rangle - 2\epsilon$. [sent-205, score-0.907]
98 Moreover, on the event $\hat A_t$, $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$. [sent-206, score-0.189]
99 Using a union bound, we therefore have $\mathbb{P}^{\beta,P}_{i_0}[A_t(i,a;\epsilon)] \le \sum_{m=1}^{t} C_1\exp(\,\cdot\,) \le C_1\sum_{m=1}^{\infty}\exp(\,\cdot\,) = o(1/t)$, where the exponents involve $m$, $\epsilon^2/\|h^*(P)\|_\infty^2$, and $\sqrt{2m\log t}\,/\,\|h^*(P)\|_\infty$. [sent-208, score-0.184]
100 P. Auer and R. Ortner (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. [sent-217, score-0.327]
wordName wordTfidf (topN-words)
[('nt', 0.562), ('olp', 0.303), ('pi', 0.266), ('regret', 0.265), ('st', 0.197), ('crit', 0.151), ('pt', 0.15), ('policy', 0.147), ('hitting', 0.144), ('ht', 0.139), ('makeopt', 0.13), ('actions', 0.128), ('optimistic', 0.121), ('vt', 0.12), ('tw', 0.108), ('mdp', 0.107), ('proposition', 0.104), ('ut', 0.1), ('action', 0.097), ('auer', 0.093), ('policies', 0.092), ('burnetas', 0.086), ('katehakis', 0.086), ('ortner', 0.086), ('lim', 0.084), ('irreducible', 0.08), ('undersampled', 0.075), ('reward', 0.075), ('rt', 0.075), ('rewards', 0.068), ('aoa', 0.065), ('bka', 0.065), ('log', 0.062), ('transition', 0.061), ('mdps', 0.06), ('sup', 0.06), ('cb', 0.052), ('event', 0.05), ('dt', 0.049), ('optimality', 0.048), ('logarithmic', 0.046), ('propositions', 0.045), ('ambuj', 0.043), ('tewari', 0.043), ('textbooks', 0.043), ('unichain', 0.043), ('quantity', 0.043), ('accumulated', 0.041), ('suboptimal', 0.039), ('agent', 0.039), ('rid', 0.038), ('state', 0.037), ('bound', 0.035), ('berkeley', 0.033), ('suboptimality', 0.032), ('constant', 0.031), ('optimal', 0.031), ('lp', 0.031), ('exp', 0.03), ('decision', 0.03), ('assumption', 0.029), ('mf', 0.029), ('equations', 0.029), ('simpler', 0.028), ('nity', 0.028), ('worst', 0.026), ('satisfying', 0.026), ('programming', 0.026), ('reinforcement', 0.025), ('implies', 0.024), ('division', 0.024), ('ca', 0.023), ('restricted', 0.023), ('time', 0.023), ('california', 0.023), ('critical', 0.023), ('bias', 0.022), ('programs', 0.022), ('deterministic', 0.021), ('bartlett', 0.021), ('slowly', 0.02), ('proof', 0.02), ('inf', 0.02), ('denote', 0.019), ('letting', 0.019), ('misleading', 0.019), ('misled', 0.019), ('optimistically', 0.019), ('legal', 0.019), ('omniscient', 0.019), ('judicious', 0.019), ('triplet', 0.019), ('brafman', 0.019), ('tennenholtz', 0.019), ('undiscounted', 0.019), ('exploration', 0.019), ('inspired', 0.019), ('dependence', 0.019), ('bounds', 0.018), ('solving', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.2708852 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.21673214 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
4 0.19820496 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
5 0.18785161 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
6 0.17374414 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
7 0.17235518 86 nips-2007-Exponential Family Predictive Representations of State
8 0.16503642 21 nips-2007-Adaptive Online Gradient Descent
9 0.15128137 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
10 0.14711873 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
11 0.14384533 197 nips-2007-The Infinite Markov Model
12 0.13352956 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
13 0.13350014 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
14 0.1325777 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
15 0.13227229 165 nips-2007-Regret Minimization in Games with Incomplete Information
16 0.13040024 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
17 0.10500887 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
18 0.10375308 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.096541218 162 nips-2007-Random Sampling of States in Dynamic Programming
20 0.0867365 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
topicId topicWeight
[(0, -0.204), (1, -0.418), (2, 0.1), (3, -0.012), (4, 0.172), (5, -0.029), (6, -0.037), (7, -0.008), (8, 0.06), (9, 0.009), (10, -0.007), (11, 0.069), (12, -0.067), (13, 0.044), (14, -0.021), (15, -0.01), (16, 0.056), (17, -0.03), (18, -0.038), (19, 0.088), (20, -0.177), (21, 0.043), (22, 0.044), (23, -0.142), (24, 0.127), (25, -0.051), (26, -0.073), (27, -0.1), (28, -0.025), (29, 0.149), (30, 0.06), (31, -0.037), (32, 0.158), (33, 0.035), (34, 0.112), (35, 0.031), (36, 0.011), (37, -0.032), (38, -0.052), (39, 0.063), (40, -0.003), (41, 0.045), (42, -0.027), (43, 0.025), (44, -0.026), (45, -0.058), (46, -0.06), (47, -0.048), (48, 0.103), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96571952 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.7139281 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.61884952 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T 2/3 S 1/3 ) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
4 0.60239983 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
5 0.53913522 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
6 0.52263325 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
7 0.51174599 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
8 0.50893176 191 nips-2007-Temporal Difference Updating without a Learning Rate
9 0.42300299 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
10 0.41308236 21 nips-2007-Adaptive Online Gradient Descent
11 0.41222653 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
12 0.40844443 86 nips-2007-Exponential Family Predictive Representations of State
13 0.3946237 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
14 0.37725413 197 nips-2007-The Infinite Markov Model
15 0.36386988 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
16 0.36090812 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
17 0.34802711 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
18 0.34458336 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
19 0.30795926 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
20 0.30732399 162 nips-2007-Random Sampling of States in Dynamic Programming
topicId topicWeight
[(5, 0.02), (13, 0.086), (21, 0.047), (24, 0.315), (31, 0.037), (34, 0.024), (35, 0.026), (47, 0.096), (49, 0.012), (56, 0.038), (83, 0.142), (85, 0.015), (90, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.73924774 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.65152037 175 nips-2007-Semi-Supervised Multitask Learning
Author: Qiuhua Liu, Xuejun Liao, Lawrence Carin
Abstract: A semi-supervised multitask learning (MTL) framework is presented, in which M parameterized semi-supervised classifiers, each associated with one of M partially labeled data manifolds, are learned jointly under the constraint of a softsharing prior imposed over the parameters of the classifiers. The unlabeled data are utilized by basing classifier learning on neighborhoods, induced by a Markov random walk over a graph representation of each manifold. Experimental results on real data sets demonstrate that semi-supervised MTL yields significant improvements in generalization performance over either semi-supervised single-task learning (STL) or supervised MTL. 1
3 0.53650439 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
Author: Alexander L. Strehl, Michael L. Littman
Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
4 0.52568132 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
5 0.52088165 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
6 0.51867533 102 nips-2007-Incremental Natural Actor-Critic Algorithms
7 0.51667559 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
8 0.51529127 86 nips-2007-Exponential Family Predictive Representations of State
9 0.51475507 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
10 0.51420361 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
11 0.51160806 62 nips-2007-Convex Learning with Invariances
12 0.51127595 134 nips-2007-Multi-Task Learning via Conic Programming
13 0.51080894 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
14 0.51030058 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
15 0.51027232 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
16 0.50959241 69 nips-2007-Discriminative Batch Mode Active Learning
17 0.50853968 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
18 0.50775623 115 nips-2007-Learning the 2-D Topology of Images
19 0.50707644 24 nips-2007-An Analysis of Inference with the Universum
20 0.50681168 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning