nips nips2007 nips2007-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). [sent-6, score-0.185]
2 It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. [sent-8, score-0.294]
3 We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. [sent-9, score-0.266]
4 But OLP is simpler and its regret bound has a better dependence on the size of the MDP. [sent-12, score-0.347]
5 Assuming that the decision maker or agent is able to perfectly observe its own state, uncertain systems are often modeled as Markov decision processes (MDPs). [sent-14, score-0.099]
6 Given a policy π, how do we measure its ability to handle this trade-off? [sent-24, score-0.147]
7 Suppose the agent gets a numerical reward at each time step and we measure performance by the accumulated reward over time. [sent-25, score-0.253]
8 Then, a meaningful quantity to evaluate the policy π is its regret over time. [sent-26, score-0.455]
9 To understand what regret means, consider an omniscient agent who knows all parameters of the MDP accurately and behaves optimally. [sent-27, score-0.323]
10 Let VT be the expected reward obtained by this agent up to time T . [sent-28, score-0.137]
11 Let $V_T^\pi$ denote the corresponding quantity for π. [sent-29, score-0.062]
12 Then the regret $R_T^\pi = V_T - V_T^\pi$ measures how much π is hurt due to its incomplete knowledge of the MDP up to time T. [sent-30, score-0.288]
13 If we can show that the regret RT grows slowly with time T , for all MDPs in a sufficiently big class, then we can safely conclude that π is making a judicious trade-off between exploration and exploitation. [sent-31, score-0.346]
14 It is rather remarkable that for this notion of regret, logarithmic bounds have been proved in the literature [1,2]. [sent-32, score-0.064]
15 This means that there are policies π with $R_T^\pi = O(\log T)$. [sent-33, score-0.092]
16 Thus the per-step regret RT /T goes to zero very quickly. [sent-34, score-0.265]
17 Burnetas and Katehakis [1] proved that for any policy π (satisfying certain reasonable assumptions) $R_T^\pi \ge C_B(P)\log T$, where they identified the constant $C_B(P)$. [sent-35, score-0.24]
18 This constant depends on the transition function P of the MDP. [sent-36, score-0.092]
19 However, besides assuming that the MDP is irreducible (see Assumption 1 below) they assumed that the support sets of the transition distributions pi (a) are known for all state-action pairs. [sent-38, score-0.407]
20 In this paper, we not only get rid of this assumption but our optimistic linear programming (OLP) algorithm is also computationally simpler. [sent-39, score-0.214]
21 The price we pay for these advantages is that the regret of OLP is C(P ) log T asymptotically, for a constant C(P ) ≥ CB (P ). [sent-45, score-0.358]
22 The algorithm of Auer and Ortner (we refer to it as AOA) is another logarithmic regret algorithm for irreducible MDPs. [sent-47, score-0.311]
23 But then the optimization problem they solve is more complicated because they find a policy to use in the next few time steps by optimizing over a set of MDPs. [sent-49, score-0.17]
24 The regret of AOA is $C_A(P)\log T$ where $C_A(P) = \frac{c\,|S|^5 |A|\, T_w(P)\,\kappa(P)^2}{\Delta^*(P)^2}$ (1), for some universal constant c. [sent-50, score-0.358]
25 Here |S|, |A| denote the state and action space size, Tw (P ) is the worst case hitting time over deterministic policies (see Eqn. [sent-51, score-0.459]
26 (12)) and ∆∗ (P ) is the difference between the long term average return of the best policy and that of the next best policy. [sent-52, score-0.147]
27 The constant κ(P ) is also defined in terms of hitting times. [sent-53, score-0.175]
28 Under Auer and Ortner’s assumption of bounded rewards, we can show that the constant for OLP satisfies $C(P) \le \frac{2|S||A|\,T(P)^2}{\Phi^*(P)}$ (2). [sent-54, score-0.06]
29 Here $T(P)$ is the hitting time of an optimal policy and is therefore necessarily smaller than $T_w(P)$. [sent-55, score-0.345]
30 The constant Φ∗ (P ) can roughly be thought of as the minimum (over states) difference between the quality of the best and the second best action (see Eqn. [sent-58, score-0.128]
31 2 Preliminaries: Consider an MDP $(S, A, R, P)$ where S is the set of states, $A = \cup_{i\in S} A(i)$ is the set of actions ($A(i)$ being the actions available in state i), $R = \{r(i,a)\}_{i\in S, a\in A(i)}$ are the rewards and $P = \{p_{i,j}(a)\}_{i,j\in S, a\in A(i)}$ are the transition probabilities. [sent-62, score-0.422]
32 For simplicity of analysis, we assume that the rewards are known to us beforehand. [sent-63, score-0.068]
33 We do not assume that we know the support sets of the distributions pi (a). [sent-64, score-0.266]
34 A policy π is a sequence $\{\pi_t\}$ of probability distributions on A given $\sigma_t$ such that $\pi_t(A(s_t)\,|\,\sigma_t) = 1$, where $s_t$ denotes the random variable representing the state at time t. [sent-69, score-0.404]
35 A deterministic policy is simply a function µ : S → A such that µ(i) ∈ A(i). [sent-71, score-0.168]
36 The MDPs they call unichain are called irreducible in standard textbooks (for example, see [9, p. [sent-75, score-0.166]
37 Probability and expectation under a policy π, transition function P and starting state $i_0$ will be denoted by $\mathbb{P}^{\pi,P}_{i_0}$ and $\mathbb{E}^{\pi,P}_{i_0}$ respectively. [sent-77, score-0.245]
38 Given history $\sigma_t$, let $N_t(i)$, $N_t(i,a)$ and $N_t(i,a,j)$ denote the number of occurrences of the state i, the pair (i, a) and the triplet (i, a, j) respectively in $\sigma_t$. [sent-78, score-0.075]
39 For all $\mu \in \Pi_D$, the transition matrix $P^\mu = (p_{i,j}(\mu(i)))_{i,j\in S}$ is irreducible (i.e., every state is reachable from every other state). [sent-81, score-0.141]
40 Consider the rewards accumulated by the policy π before time T, $V_T^\pi(i_0,P) := \mathbb{E}^{\pi,P}_{i_0}\left[\sum_{t=0}^{T-1} r(s_t,a_t)\right]$, where $a_t$ is the random variable representing the action taken by π at time t. [sent-84, score-0.399]
41 Let $V_T(i_0,P)$ be the maximum possible sum of expected rewards before time T, $V_T(i_0,P) := \sup_{\pi\in\Pi} V_T^\pi(i_0,P)$. [sent-85, score-0.151]
42 The regret of a policy π at time T is a measure of how well the expected rewards of π compare with the above quantity, $R_T^\pi(i_0,P) := V_T(i_0,P) - V_T^\pi(i_0,P)$. [sent-86, score-0.503]
43 Define the long-term average reward of a policy π as $\lambda^\pi(i_0,P) := \liminf_{T\to\infty} \frac{V_T^\pi(i_0,P)}{T}$. [sent-87, score-0.222]
44 Under Assumption 1, the above limit exists and is independent of the starting state $i_0$. [sent-88, score-0.17]
45 Given a restricted set $D \subseteq A$ of actions, the gain or the best long-term average performance is $\lambda(P,D) := \sup_\pi \lambda^\pi(i_0,P)$, the supremum being over policies that only use actions in D. [sent-89, score-0.083]
46 The transition and reward functions of the restricted problems are simply the restrictions of P and r to D. [sent-93, score-0.159]
47 Assumption 1 implies that there is a bias vector $h(P,D) = \{h(i;P,D)\}_{i\in S}$ such that the gain $\lambda(P,D)$ and bias $h(P,D)$ are the unique solutions to the average reward optimality equations: $\forall i \in S,\ \lambda(P,D) + h(i;P,D) = \max_{a\in D(i)}\,[\,r(i,a) + \langle p_i(a), h(P,D)\rangle\,]$. [sent-94, score-0.457]
48 Note that if h∗ (P ) is a solution to the optimality equations and e is the vector of ones, then h∗ (P ) + ce is also a solution for any scalar c. [sent-97, score-0.077]
49 It will be convenient to have a way to denote the quantity inside the ‘max’ that appears in the optimality equations. [sent-99, score-0.11]
50 Accordingly, define $L(i,a,p,h) := r(i,a) + \langle p, h\rangle$ and $L^*(i;P,D) := \max_{a\in D(i)} L(i,a,p_i(a),h(P,D))$. [sent-100, score-0.266]
51 To measure the degree of suboptimality of actions available at a state, define $\phi^*(i,a;P) = L^*(i;P,A) - L(i,a,p_i(a),h^*(P))$. [sent-101, score-0.426]
52 Note that the optimal actions are precisely those for which the above quantity is zero. [sent-102, score-0.202]
53 For a suboptimal action $a \notin O(i;P,A)$, the following set contains probability distributions q such that if $p_i(a)$ is changed to q, the quality of action a comes within $\epsilon$ of an optimal action. [sent-109, score-0.53]
54 To make sense of this definition (Eqn. (6)), consider $p = p_i(a)$. [sent-114, score-0.266]
55 The above infimum is then the least distance (in the $L_1$ sense) one has to move away from $p_i(a)$ to make the suboptimal action a look $\epsilon$-optimal. [sent-115, score-0.402]
56 Taking the limit of this as $\epsilon$ decreases gives us a quantity that also plays a crucial role in determining the regret, $K(i,a;P) := \lim_{\epsilon\to 0} J_{i,a}(p_i(a);P,\epsilon)$. [sent-116, score-0.127]
57 (7) Intuitively, if $K(i,a;P)$ is small, it is easy to confuse a suboptimal action with an optimal one and so it should be difficult to achieve small regret. [sent-117, score-0.167]
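A minimal sketch of how a quantity like $J_{i,a}(p;P,\epsilon)$ could be computed as a small linear program is given below. It is not taken from the paper: the exact form of Eqn. (6) (including whether the $L_1$ distance or its square is used) is not reproduced here, so the inputs r_ia, h_star, L_star and the plain-$L_1$ objective are illustrative assumptions; $K(i,a;P)$ would then be approximated by evaluating this at small $\epsilon$.

```python
# Sketch (not the paper's code): least L1 movement from p that makes action a look eps-optimal,
# i.e. min ||q - p||_1 over distributions q with r(i,a) + <q, h*> >= L*(i;P,A) - eps.
import numpy as np
from scipy.optimize import linprog

def J(p, r_ia, h_star, L_star, eps):
    n = len(p)
    p = np.asarray(p, dtype=float)
    # Decision variables x = [q_1..q_n, u_1..u_n] with u_k >= |q_k - p_k|.
    c = np.concatenate([np.zeros(n), np.ones(n)])             # minimize sum_k u_k
    A_ub = np.vstack([
        np.hstack([ np.eye(n), -np.eye(n)]),                  #  q - u <= p
        np.hstack([-np.eye(n), -np.eye(n)]),                  # -q - u <= -p
        np.hstack([-np.asarray(h_star), np.zeros(n)])[None],  # -<q, h*> <= r(i,a) - L* + eps
    ])
    b_ub = np.concatenate([p, -p, [r_ia - L_star + eps]])
    A_eq = np.hstack([np.ones(n), np.zeros(n)])[None]         # q must be a probability vector
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (2 * n))
    return res.fun if res.success else np.inf                 # infeasible: no q makes a look eps-optimal
```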
58 The constant that multiplies log T in the regret bound of our algorithm OLP (see Algorithm 1 and Theorem 4 below) is the following: $C(P) := \sum_{(i,a)\in \mathrm{Crit}(P)} \frac{2\,\phi^*(i,a;P)}{K(i,a;P)}$ (8). [sent-118, score-0.393]
59 This definition might look a bit hard to interpret, so we give an upper bound on C(P) just in terms of the infinity norm $H^*(P)$ of the bias and $\Phi^*(P)$. [sent-119, score-0.057]
60 This latter quantity is defined below to be the minimum degree of suboptimality of a critical action. [sent-120, score-0.098]
61 Hitting times: It turns out that we can bound the infinity norm of the bias in terms of the hitting time of an optimal policy. [sent-127, score-0.255]
62 For any policy µ define its hitting time to be the worst-case expected time to reach one state from another: $T_\mu(P) := \max_{i \ne j} \mathbb{E}^{\mu,P}_{j}[\min\{t > 0 : s_t = i\}]$ (10). [sent-128, score-0.597]
63 The following constant is the minimum hitting time among optimal policies: $T(P) := \min_{\mu} T_\mu(P)$. [sent-129, score-0.229]
64 The quantity $T_w(P)$ is the worst-case hitting time over all deterministic policies: $T_w(P) := \max_{\mu\in\Pi_D} T_\mu(P)$ (12). [sent-131, score-0.193]
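As a concrete illustration of Eqn. (10), the sketch below computes $T_\mu(P)$ for a deterministic policy µ by solving one linear system per target state; representing the policy's transition matrix as a NumPy array indexed by states 0..n-1 is an assumption of this sketch, not the paper's notation.

```python
# Sketch: worst-case expected hitting time T_mu(P) = max_{i != j} E_j[min{t > 0 : s_t = i}]
# for the Markov chain P_mu induced by a deterministic policy mu.
import numpy as np

def hitting_time(P_mu):
    n = P_mu.shape[0]
    worst = 0.0
    for i in range(n):                                  # target state i
        keep = [j for j in range(n) if j != i]
        Q = P_mu[np.ix_(keep, keep)]                    # chain restricted to the other states
        # For j != i, x_j = E_j[time to hit i] solves (I - Q) x = 1;
        # irreducibility (Assumption 1) makes I - Q invertible.
        x = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        worst = max(worst, float(x.max()))
    return worst
```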
65 We can now bound C(P) just in terms of the hitting time T(P) and $\Phi^*(P)$. [sent-132, score-0.202]
66 3 The optimistic LP algorithm and its regret bound. Algorithm 1 (Optimistic Linear Programming): 1: for t = 0, 1, 2, . . . [sent-137, score-0.421]
67 At each time step t, the algorithm computes the empirical estimates for transition probabilities. [sent-142, score-0.084]
68 It then forms a restricted problem ignoring relatively undersampled actions. [sent-143, score-0.098]
69 An action $a \in A(i)$ is considered “undersampled” if $N_t(i,a) < \log^2 N_t(i)$. [sent-144, score-0.097]
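The sketch below illustrates the two bookkeeping steps just described: empirical transition estimates built from the counts $N_t(i,a)$, $N_t(i,a,j)$, and the restricted action sets of adequately sampled actions. It is only a reading of the text above, not the paper's pseudocode; the uniform fallback for never-tried pairs and the fallback when every action is undersampled are assumptions.

```python
# Sketch: empirical estimates and restricted action sets built from the counts N_t.
import math
from collections import defaultdict

N_s   = defaultdict(int)    # N_t(i)
N_sa  = defaultdict(int)    # N_t(i, a)
N_saj = defaultdict(int)    # N_t(i, a, j)

def p_hat(i, a, n_states):
    """Empirical estimate of p_i(a); uniform if (i, a) has never been tried (an assumption)."""
    if N_sa[(i, a)] == 0:
        return [1.0 / n_states] * n_states
    return [N_saj[(i, a, j)] / N_sa[(i, a)] for j in range(n_states)]

def D_t(i, actions):
    """Actions at state i that are not undersampled, i.e. N_t(i,a) >= log^2 N_t(i)."""
    thresh = math.log(max(N_s[i], 1)) ** 2
    kept = [a for a in actions if N_sa[(i, a)] >= thresh]
    return kept if kept else list(actions)   # assumption: keep all actions if none qualify
```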
70 The solutions $\hat h_t$, $\hat\lambda_t$ might be misleading due to estimation errors. [sent-145, score-0.158]
71 To avoid being misled by empirical samples we compute optimistic “indices” $U_t(s_t,a)$ for all legal actions $a \in A(s_t)$ where $s_t$ is the current state. [sent-146, score-0.484]
72 The index for action a is computed by looking at an $L_1$-ball around the empirical estimate $\hat p^{\,t}_{s_t}(a)$ and choosing a probability distribution q that maximizes $L(i,a,q,\hat h_t)$. [sent-147, score-0.386]
73 Note that if the estimates were perfect, we would take an action maximizing $L(i,a,\hat p^{\,t}_{s_t}(a),\hat h_t)$. [sent-148, score-0.386]
74 Instead, we take an action that maximizes the index. [sent-149, score-0.097]
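For a linear function of q, the maximization over the $L_1$-ball described above has a simple greedy solution (equivalent to solving the small linear program): pour as much probability mass as the radius allows onto the next state with the largest bias value and take it away from the least valuable states. The sketch below illustrates that reading; the radius $\sqrt{2\log t / N_t(i,a)}$ is taken from the confidence region that appears in the proof sketch later in the text and is an assumption about the exact constants used in Algorithm 1.

```python
# Sketch: optimistic index U_t(i,a) = max { r(i,a) + <q, h> : q a distribution, ||q - p_hat||_1 <= radius }.
import numpy as np

def optimistic_index(r_ia, p_hat_ia, h, radius):
    q = np.array(p_hat_ia, dtype=float)
    best = int(np.argmax(h))                      # most valuable next state
    q[best] = min(1.0, q[best] + radius / 2.0)    # moving mass delta costs 2*delta in L1
    for j in np.argsort(h):                       # remove the excess from the least valuable states
        if j == best:
            continue
        excess = q.sum() - 1.0
        if excess <= 1e-12:
            break
        q[j] -= min(q[j], excess)
    return r_ia + float(np.dot(q, h))

# e.g. (hypothetical names): U = optimistic_index(r[i][a], p_hat(i, a, n), h_hat,
#                                                 math.sqrt(2.0 * math.log(t) / max(N_sa[(i, a)], 1)))
```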
75 The exception is when all the optimal actions of the current problem are about to become undersampled at the next time step. [sent-151, score-0.257]
76 In that case, we take one of these actions (steps 18–22). [sent-152, score-0.128]
77 The LP for solving optimality equations can be found in several textbooks (see, for example, [9, p. [sent-154, score-0.138]
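One textbook linear program of the kind referred to above is sketched below: minimize the gain λ subject to $\lambda + h(i) \ge r(i,a) + \langle p_i(a), h\rangle$ for every state i and every allowed action $a \in D(i)$. This is a standard formulation for irreducible average-reward MDPs, not necessarily the exact LP used inside Algorithm 1; pinning $h(0) = 0$ to make the bias unique is an assumption of this sketch.

```python
# Sketch: gain/bias of the restricted problem (P, r, D) via a standard average-reward LP.
import numpy as np
from scipy.optimize import linprog

def solve_gain_bias(r, p, D):
    """r[i][a]: reward; p[i][a]: next-state distribution over 0..n-1; D[i]: allowed actions at i."""
    n = len(D)
    A_ub, b_ub = [], []
    for i in range(n):
        for a in D[i]:
            row = np.zeros(n + 1)                 # variables x = [lambda, h_0, ..., h_{n-1}]
            row[0] = -1.0                         # -lambda
            row[1 + i] -= 1.0                     # -h(i)
            row[1:] += np.asarray(p[i][a])        # +<p_i(a), h>
            A_ub.append(row)                      # r(i,a) + <p_i(a), h> - lambda - h(i) <= 0
            b_ub.append(-r[i][a])
    c = np.zeros(n + 1)
    c[0] = 1.0                                    # minimize lambda
    A_eq = np.zeros((1, n + 1))
    A_eq[0, 1] = 1.0                              # h(0) = 0 (normalization)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), A_eq=A_eq, b_eq=[0.0],
                  bounds=[(None, None)] * (n + 1))
    return res.x[0], res.x[1:]                    # gain lambda(P, D), bias vector h(P, D)
```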
78 Like the original Burnetas-Katehakis algorithm, the modified one also satisfies a logarithmic regret bound as stated in the following theorem. [sent-157, score-0.346]
79 Unlike the original algorithm, OLP does not need to know the support sets of the transition distributions. [sent-158, score-0.061]
80 Let β denote the policy implemented by Algorithm 1. [sent-160, score-0.166]
81 Then we have, for all $i_0 \in S$ and for all P satisfying Assumption 1, $\limsup_{T\to\infty} \frac{R_T^\beta(i_0,P)}{\log T} \le C(P)$, where C(P) is the MDP-dependent constant defined in (8). [sent-161, score-0.119]
82 Equation (13) controls the regret through the counts of suboptimal pairs $i \in S$, $a \notin O(i;P,A)$. Define the event $\hat A_t := \{\,\|\hat h_t - h^*(P)\|_\infty \le \epsilon \ \wedge\ O(\hat P^t, D_t) \subseteq O(P)\,\}$ (14). [sent-164, score-0.189]
83 The result then follows by combining (13) and (15) with the following three propositions and then letting $\epsilon \to 0$ sufficiently slowly. [sent-167, score-0.064]
84 For all P and $i_0 \in S$, we have $\lim_{\epsilon\to 0}\ \limsup_{T\to\infty}\ \sum_{i\in S}\sum_{a\notin O(i;P,A)} \frac{\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)]}{\log T}\;\phi^*(i,a;P) \le C(P)$. [sent-169, score-0.228]
85 On the event $\hat A_t$ (recall the definition given in (14)), we have $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$ for any $q \in \Delta^+$. [sent-180, score-0.189]
86 Therefore, $U_t(i,a) \le \sup_{q\in\Delta^+}\{r(i,a) + \langle q,\hat h_t\rangle\} \le \sup_{q\in\Delta^+}\{r(i,a) + \langle q, h^*(P)\rangle\} + \epsilon < L^*(i;P,A) - \epsilon_0 + \epsilon < L^*(i;P,A) - 2\epsilon$, provided that $3\epsilon < \epsilon_0$. Therefore, for $\epsilon < \epsilon_0/3$, $N_T(i,a;\epsilon_0) = 0$ whenever $\mathrm{MakeOpt}(i,a;P,\epsilon_0) = \emptyset$. [sent-181, score-0.259]
87 $\|\hat p^t_i(a) - q\|_1^2 \le \frac{2\log t}{N_t(i,a)} \ \wedge\ r(i,a) + \langle q, \hat h_t\rangle \ge L^*(i;P,A) - 2\epsilon$. [sent-185, score-0.351]
88 On the event $\hat A_t$, we have $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$ and thus the above implies $\exists q \in \Delta^+$ s.t. [sent-186, score-0.213]
89 $\|\hat p^t_i(a) - q\|_1^2 \le \frac{2\log t}{N_t(i,a)} \ \wedge\ \left(r(i,a) + \langle q, h^*(P)\rangle \ge L^*(i;P,A) - 3\epsilon\right)$. [sent-188, score-0.212]
90 Recalling the definition (6) of $J_{i,a}(p;P,\epsilon)$, we see that this implies $J_{i,a}(\hat p^t_i(a);P,3\epsilon) \le \frac{2\log t}{N_t(i,a)}$. [sent-189, score-0.352]
91 Each time the pair (i, a) occurs Nt (i, a) increases by 1, so the first count is no more than 2 log T . [sent-191, score-0.085]
92 By a Chernoff-type bound, we have, for some constant $C_1$, $\mathbb{P}^{\beta,P}_{i_0}[\,\|p_i(a) - \hat p^t_i(a)\|_1 > f(\delta) \mid N_t(i,a) = m\,] \le C_1\exp(-m f(\delta)^2)$. [sent-193, score-0.713]
93 Summing this over m gives a term of $\frac{C_1}{1-\exp(-f(\delta)^2)}$ (18). Combining the bounds (17) and (18) and plugging them into (16), we get $\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)] \le \frac{2\log T}{J_{i,a}(p_i(a);P,3\epsilon)-\delta} + \frac{C_1}{1-\exp(-f(\delta)^2)}$. [sent-195, score-0.08]
94 Letting $\delta \to 0$ sufficiently slowly, we get that for all $\epsilon > 0$, $\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)] \le \frac{2\log T}{J_{i,a}(p_i(a);P,3\epsilon)} + o(\log T)$. [sent-196, score-0.062]
95 Therefore, $\lim_{\epsilon\to 0}\ \limsup_{T\to\infty}\ \frac{\mathbb{E}^{\beta,P}_{i_0}[N_T(i,a;\epsilon)]}{\log T} \le \lim_{\epsilon\to 0} \frac{2}{J_{i,a}(p_i(a);P,3\epsilon)} = \frac{2}{K(i,a;P)}$, where the last equality follows from the definition (7) of $K(i,a;P)$. [sent-197, score-0.374]
96 Therefore, for all optimal actions $a^* \in O(i;P,A)$, we have, on the event $A_t(i,a;\epsilon)$, $U_t(i,a^*) \le U_t(i,a) < L^*(i;P,A) - 2\epsilon$. [sent-204, score-0.209]
97 Since $L^*(i;P,A) = r(i,a^*) + \langle p_i(a^*), h^*(P)\rangle$, this implies: $\forall q \in \Delta^+$, $\|q - \hat p^t_i(a^*)\|_1^2 \le \frac{2\log t}{N_t(i,a^*)} \Rightarrow \langle q, \hat h_t\rangle < \langle p_i(a^*), h^*(P)\rangle - 2\epsilon$. [sent-205, score-0.907]
98 Moreover, on the event $\hat A_t$, $|\langle q, \hat h_t\rangle - \langle q, h^*(P)\rangle| \le \epsilon$. [sent-206, score-0.189]
99 Using a union bound, we therefore have $\mathbb{P}^{\beta,P}_{i_0}[A_t(i,a;\epsilon)] \le \sum_{m=1}^{t} C_1\exp(\,\cdot\,) \le C_1\sum_{m=1}^{\infty}\exp(\,\cdot\,) = o(1/t)$, where the exponents involve $m$, $\epsilon^2/\|h^*(P)\|_\infty^2$, and $\sqrt{2m\log t}\,/\,\|h^*(P)\|_\infty$. [sent-208, score-0.184]
100 P. Auer and R. Ortner (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. [sent-217, score-0.327]
wordName wordTfidf (topN-words)
[('nt', 0.562), ('olp', 0.303), ('pi', 0.266), ('regret', 0.265), ('st', 0.197), ('crit', 0.151), ('pt', 0.15), ('policy', 0.147), ('hitting', 0.144), ('ht', 0.139), ('makeopt', 0.13), ('actions', 0.128), ('optimistic', 0.121), ('vt', 0.12), ('tw', 0.108), ('mdp', 0.107), ('proposition', 0.104), ('ut', 0.1), ('action', 0.097), ('auer', 0.093), ('policies', 0.092), ('burnetas', 0.086), ('katehakis', 0.086), ('ortner', 0.086), ('lim', 0.084), ('irreducible', 0.08), ('undersampled', 0.075), ('reward', 0.075), ('rt', 0.075), ('rewards', 0.068), ('aoa', 0.065), ('bka', 0.065), ('log', 0.062), ('transition', 0.061), ('mdps', 0.06), ('sup', 0.06), ('cb', 0.052), ('event', 0.05), ('dt', 0.049), ('optimality', 0.048), ('logarithmic', 0.046), ('propositions', 0.045), ('ambuj', 0.043), ('tewari', 0.043), ('textbooks', 0.043), ('unichain', 0.043), ('quantity', 0.043), ('accumulated', 0.041), ('suboptimal', 0.039), ('agent', 0.039), ('rid', 0.038), ('state', 0.037), ('bound', 0.035), ('berkeley', 0.033), ('suboptimality', 0.032), ('constant', 0.031), ('optimal', 0.031), ('lp', 0.031), ('exp', 0.03), ('decision', 0.03), ('assumption', 0.029), ('mf', 0.029), ('equations', 0.029), ('simpler', 0.028), ('nity', 0.028), ('worst', 0.026), ('satisfying', 0.026), ('programming', 0.026), ('reinforcement', 0.025), ('implies', 0.024), ('division', 0.024), ('ca', 0.023), ('restricted', 0.023), ('time', 0.023), ('california', 0.023), ('critical', 0.023), ('bias', 0.022), ('programs', 0.022), ('deterministic', 0.021), ('bartlett', 0.021), ('slowly', 0.02), ('proof', 0.02), ('inf', 0.02), ('denote', 0.019), ('letting', 0.019), ('misleading', 0.019), ('misled', 0.019), ('optimistically', 0.019), ('legal', 0.019), ('omniscient', 0.019), ('judicious', 0.019), ('triplet', 0.019), ('brafman', 0.019), ('tennenholtz', 0.019), ('undiscounted', 0.019), ('exploration', 0.019), ('inspired', 0.019), ('dependence', 0.019), ('bounds', 0.018), ('solving', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.2708852 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.21673214 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
4 0.19820496 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
Author: Elad Hazan, Satyen Kale
Abstract: We study the relation between notions of game-theoretic equilibria which are based on stability under a set of deviations, and empirical equilibria which are reached by rational players. Rational players are modeled by players using no regret algorithms, which guarantee that their payoff in the long run is close to the maximum they could hope to achieve by consistently deviating from the algorithm’s suggested action. We show that for a given set of deviations over the strategy set of a player, it is possible to efficiently approximate fixed points of a given deviation if and only if there exist efficient no regret algorithms resistant to the deviations. Further, we show that if all players use a no regret algorithm, then the empirical distribution of their plays converges to an equilibrium. 1
5 0.18785161 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
Author: Alessandro Lazaric, Marcello Restelli, Andrea Bonarini
Abstract: Learning in real-world domains often requires to deal with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can be hardly extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor’s policy. The proposed approach has been empirically compared to other learning algorithms into several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river. 1
6 0.17374414 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
7 0.17235518 86 nips-2007-Exponential Family Predictive Representations of State
8 0.16503642 21 nips-2007-Adaptive Online Gradient Descent
9 0.15128137 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
10 0.14711873 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
11 0.14384533 197 nips-2007-The Infinite Markov Model
12 0.13352956 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
13 0.13350014 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
14 0.1325777 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
15 0.13227229 165 nips-2007-Regret Minimization in Games with Incomplete Information
16 0.13040024 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
17 0.10500887 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
18 0.10375308 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.096541218 162 nips-2007-Random Sampling of States in Dynamic Programming
20 0.0867365 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
topicId topicWeight
[(0, -0.204), (1, -0.418), (2, 0.1), (3, -0.012), (4, 0.172), (5, -0.029), (6, -0.037), (7, -0.008), (8, 0.06), (9, 0.009), (10, -0.007), (11, 0.069), (12, -0.067), (13, 0.044), (14, -0.021), (15, -0.01), (16, 0.056), (17, -0.03), (18, -0.038), (19, 0.088), (20, -0.177), (21, 0.043), (22, 0.044), (23, -0.142), (24, 0.127), (25, -0.051), (26, -0.073), (27, -0.1), (28, -0.025), (29, 0.149), (30, 0.06), (31, -0.037), (32, 0.158), (33, 0.035), (34, 0.112), (35, 0.031), (36, 0.011), (37, -0.032), (38, -0.052), (39, 0.063), (40, -0.003), (41, 0.045), (42, -0.027), (43, 0.025), (44, -0.026), (45, -0.058), (46, -0.06), (47, -0.048), (48, 0.103), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96571952 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.7139281 199 nips-2007-The Price of Bandit Information for Online Optimization
Author: Varsha Dani, Sham M. Kakade, Thomas P. Hayes
Abstract: In the online linear optimization problem, a learner must choose, in each round, a decision from a set D ⊂ Rn in order to minimize an (unknown and changing) linear cost function. We present sharp rates of convergence (with respect to additive regret) for both the full information setting (where the cost function is revealed at the end of each round) and the bandit setting (where only the scalar cost incurred is revealed). In particular, this paper is concerned with the price of bandit information, by which we mean the ratio of the best achievable regret in the bandit setting to that in the full-information setting. For the full information case, the upper bound on the regret is O∗(√(nT)), where n is the ambient dimension and T is the time horizon. For the bandit case, we present an algorithm which achieves O∗(n^{3/2}√T) regret — all previous (nontrivial) bounds here were O(poly(n)T^{2/3}) or worse. It is striking that the convergence rate for the bandit setting is only a factor of n worse than in the full information case — in stark contrast to the K-arm bandit setting, where the gap in the dependence on K is exponential (√(TK) vs. √(T log K)). We also present lower bounds showing that this gap is at least √n, which we conjecture to be the correct order. The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems. 1
3 0.61884952 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information
Author: John Langford, Tong Zhang
Abstract: We present Epoch-Greedy, an algorithm for contextual multi-armed bandits (also known as bandits with side information). Epoch-Greedy has the following properties: 1. No knowledge of a time horizon T is necessary. 2. The regret incurred by Epoch-Greedy is controlled by a sample complexity bound for a hypothesis class. 3. The regret scales as O(T 2/3 S 1/3 ) or better (sometimes, much better). Here S is the complexity term in a sample complexity bound for standard supervised learning. 1
4 0.60239983 176 nips-2007-Sequential Hypothesis Testing under Stochastic Deadlines
Author: Peter Frazier, Angela J. Yu
Abstract: Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold; however, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
5 0.53913522 102 nips-2007-Incremental Natural Actor-Critic Algorithms
Author: Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, Richard S. Sutton
Abstract: We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients, and they extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms. 1
6 0.52263325 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
7 0.51174599 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
8 0.50893176 191 nips-2007-Temporal Difference Updating without a Learning Rate
9 0.42300299 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
10 0.41308236 21 nips-2007-Adaptive Online Gradient Descent
11 0.41222653 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
12 0.40844443 86 nips-2007-Exponential Family Predictive Representations of State
13 0.3946237 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
14 0.37725413 197 nips-2007-The Infinite Markov Model
15 0.36386988 208 nips-2007-TrueSkill Through Time: Revisiting the History of Chess
16 0.36090812 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
17 0.34802711 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
18 0.34458336 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
19 0.30795926 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding
20 0.30732399 162 nips-2007-Random Sampling of States in Dynamic Programming
topicId topicWeight
[(5, 0.02), (13, 0.086), (21, 0.047), (24, 0.315), (31, 0.037), (34, 0.024), (35, 0.026), (47, 0.096), (49, 0.012), (56, 0.038), (83, 0.142), (85, 0.015), (90, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.73924774 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
Author: Ambuj Tewari, Peter L. Bartlett
Abstract: We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P ) log T of the reward obtained by the optimal policy, where C(P ) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP. 1
2 0.65152037 175 nips-2007-Semi-Supervised Multitask Learning
Author: Qiuhua Liu, Xuejun Liao, Lawrence Carin
Abstract: A semi-supervised multitask learning (MTL) framework is presented, in which M parameterized semi-supervised classifiers, each associated with one of M partially labeled data manifolds, are learned jointly under the constraint of a softsharing prior imposed over the parameters of the classifiers. The unlabeled data are utilized by basing classifier learning on neighborhoods, induced by a Markov random walk over a graph representation of each manifold. Experimental results on real data sets demonstrate that semi-supervised MTL yields significant improvements in generalization performance over either semi-supervised single-task learning (STL) or supervised MTL. 1
3 0.53650439 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
Author: Alexander L. Strehl, Michael L. Littman
Abstract: We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernalized) linearly parameterized dynamics. This result builds on Kearns and Singh’s work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.
4 0.52568132 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
5 0.52088165 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
6 0.51867533 102 nips-2007-Incremental Natural Actor-Critic Algorithms
7 0.51667559 168 nips-2007-Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods
8 0.51529127 86 nips-2007-Exponential Family Predictive Representations of State
9 0.51475507 5 nips-2007-A Game-Theoretic Approach to Apprenticeship Learning
10 0.51420361 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
11 0.51160806 62 nips-2007-Convex Learning with Invariances
12 0.51127595 134 nips-2007-Multi-Task Learning via Conic Programming
13 0.51080894 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs
14 0.51030058 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
15 0.51027232 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
16 0.50959241 69 nips-2007-Discriminative Batch Mode Active Learning
17 0.50853968 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
18 0.50775623 115 nips-2007-Learning the 2-D Topology of Images
19 0.50707644 24 nips-2007-An Analysis of Inference with the Universum
20 0.50681168 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning