nips nips2011 nips2011-18 knowledge-graph by maker-knowledge-mining

18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning


Source: pdf

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. [sent-2, score-0.084]

2 As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. [sent-3, score-0.665]

3 For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. [sent-4, score-0.125]

4 Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. [sent-5, score-0.333]

5 1 Introduction This paper introduces a new type of regularity in the reinforcement learning (RL) and planning problems with finite-action spaces that suggests that the convergence rate of the performance loss to zero is faster than what previous analyses had indicated. [sent-6, score-0.556]

6 The effect of this regularity, which we call the action-gap regularity, is that oftentimes the performance of the RL agent reaches very close to the optimal performance (e.g. [sent-7, score-0.215]

7 it always solves the mountain-car problem with the optimal number of steps) even though the estimated action-value function is still far from the optimal one. [sent-9, score-0.142]

8 Figure 1 illustrates the effect of this regularity in a simple problem. [sent-10, score-0.272]

9 We use value iteration to solve a stochastic 1D chain walk problem (slight modification of the example in Section 9. [sent-11, score-0.177]

10 The behavior of the supremum of the difference between the estimate after k iterations and the optimal action-value function is O(γ^k), in which 0 ≤ γ < 1 is the discount factor (notations shall be introduced in Section 2). [sent-13, score-0.141]

11 The current theoretical results suggest that the convergence of the performance loss, which is defined as the average difference between the value of the optimal policy and the value of the greedy policy w.r.t. [sent-14, score-0.918]

12 (with respect to) the estimated action-value function, should have the same O(γ^k) behavior (cf. [sent-17, score-0.076]

13 However, the behavior of the performance loss is often considerably faster. [sent-20, score-0.125]
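
As a rough illustration of the experiment described above, the sketch below runs value iteration on a small stochastic chain and records both the sup-norm error ‖Q_k − Q∗‖_∞ and the performance loss of the greedy policy. The chain size, transition noise, reward placement, and discount factor are illustrative assumptions, not the paper's exact 500-state setup.

```python
import numpy as np

# Toy stochastic 1D chain walk (assumed setup, not the paper's exact instance):
# action 0 tries to move left, action 1 tries to move right; moves succeed with
# probability 0.9 and the agent stays put otherwise; reward 1 at the right end.
n, gamma = 50, 0.95
P = np.zeros((2, n, n))
for s in range(n):
    left, right = max(s - 1, 0), min(s + 1, n - 1)
    P[0, s, left] += 0.9
    P[0, s, s] += 0.1
    P[1, s, right] += 0.9
    P[1, s, s] += 0.1
r = np.zeros((2, n))
r[:, n - 1] = 1.0

def bellman_opt(Q):
    """(T*Q)(x, a) = r(x, a) + gamma * sum_y P(y|x, a) max_a' Q(y, a')."""
    return r + gamma * P @ Q.max(axis=0)

def value_of_greedy(Q):
    """Exact V^pi of the greedy policy pi w.r.t. Q, via (I - gamma P^pi) V = r^pi."""
    pi = Q.argmax(axis=0)
    P_pi = P[pi, np.arange(n), :]
    r_pi = r[pi, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Numerical stand-in for Q*: run value iteration far past convergence.
Q_star = np.zeros((2, n))
for _ in range(2000):
    Q_star = bellman_opt(Q_star)
V_star = value_of_greedy(Q_star)

Q = np.zeros((2, n))
for k in range(1, 61):
    Q = bellman_opt(Q)
    sup_err = np.abs(Q - Q_star).max()           # ||Q_k - Q*||_inf, decays like gamma^k
    loss = (V_star - value_of_greedy(Q)).mean()  # performance loss under a uniform rho
    if k % 10 == 0:
        print(f"k={k:3d}  sup-norm error={sup_err:.2e}  performance loss={loss:.2e}")
```

Running this, the printed loss typically reaches (numerical) zero many iterations before the sup-norm error does, which is the action-gap phenomenon in miniature.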

14 To gain a better understanding of the action-gap regularity, focus on a single state and suppose that there are only two actions available. [sent-24, score-0.068]

15 When the estimated action-value function has a large error, the greedy policy w.r.t. it may select a suboptimal action. [sent-25, score-0.498]

16 However, when the error becomes smaller than (half of) the gap between the value of the optimal action and that of the other action, the selected greedy action is the optimal action. [sent-29, score-0.421]

17 After passing this threshold, the size of the error in the estimate of the action-value function in that state does not have any effect on the quality of the selected action. [sent-30, score-0.11]

18 The larger the gap is, the more inaccurate the estimate can be while the selected greedy action is the optimal one. [sent-31, score-0.284]
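
The half-gap threshold mentioned above follows from a one-line comparison; writing a∗ for the optimal action at x, a′ for the other action, and g = Q∗(x, a∗) − Q∗(x, a′) for the gap (this spelled-out step is not in the extracted text, but is implicit in the argument):

```latex
\max_{a}\,\bigl|\hat{Q}(x,a)-Q^{*}(x,a)\bigr| \le \varepsilon < \tfrac{g}{2}
\;\Longrightarrow\;
\hat{Q}(x,a^{*}) \;\ge\; Q^{*}(x,a^{*})-\varepsilon
              \;=\; Q^{*}(x,a')+g-\varepsilon
              \;>\; Q^{*}(x,a')+\varepsilon
              \;\ge\; \hat{Q}(x,a'),
```

so the greedy action w.r.t. Q̂ at x is the optimal action a∗.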

19 On the other hand, if the estimated action-value function does not suggest a correct ordering of actions but the gap is negligibly small, the performance loss of not selecting the optimal action is negligible as well. [sent-32, score-0.268]

20 [Figure 1: estimation error of the value function, decaying as O(γ^k), plotted against the performance loss; y-axis: Error/Loss.] [sent-37, score-0.123]

21 The rate of decrease for the performance loss is considerably faster than that of the estimation error. [sent-39, score-0.119]

22 The problem is a 1D stochastic chain walk with 500 states and γ = 0. [sent-40, score-0.092]

23 The presence of this gap in the optimal action-value function is what we call the action-gap regularity of the problem and the described behavior is called the action-gap phenomenon. [sent-43, score-0.378]

24 Action-gap regularity is similar to the low-noise (or margin) condition in the classification literature. [sent-44, score-0.248]

25 It is notable that there have been some works that used classification algorithms to solve reinforcement learning problems. [sent-50, score-0.1]

26 In the rest of this paper, we formalize the action-gap phenomenon and prove that whenever the MDP has a favorable action-gap regularity, a fast convergence rate is achievable. [sent-58, score-0.142]

27 Theorem 1 upper bounds the performance loss of the greedy policy w.r.t. [sent-59, score-0.629]

28 the estimated action-value function by a function of the Lp-norm of the difference between the estimated action-value function and the optimal one. [sent-62, score-0.134]

29 Our result complements previous theoretical analyses of RL/Planning problems such as those by Antos et al. [sent-63, score-0.066]

30 [10], Munos and Szepesvári [11], Farahmand et al. [sent-64, score-0.107]

31 Finally, as an example of Theorem 1, we address the question of how the errors caused at each iteration of the Approximate Value Iteration (AVI) algorithm affect the quality of the resulting policy, and show that the AVI procedure benefits from the action-gap regularity of the problem (Theorem 2). [sent-68, score-0.673]

32 For more information, the reader is referred to Bertsekas and Tsitsiklis [2], Sutton and Barto [15], and Szepesvári [16]. [sent-70, score-0.075]

33 B(Ω) denotes the space of bounded measurable functions defined on Ω. [sent-72, score-0.089]

34 A finite-action discounted MDP is a 5-tuple (X, A, P, R, γ), where X is a measurable state space, A is a finite set of actions, P : X × A → M(X) is the transition probability kernel, R : X × A → M(R) is the reward distribution, and 0 ≤ γ < 1 is a discount factor. [sent-76, score-0.243]

35 A measurable mapping π : X → A is called a deterministic Markov stationary policy, or just a policy for short. [sent-78, score-0.437]

36 An agent follows a policy π in an MDP if at each time step A_t = π(X_t). [sent-79, score-0.348]

37 A policy π induces two transition probability kernels P^π : X → M(X) and P^π : X × A → M(X × A). [sent-80, score-0.393]

38 For a measurable subset A of X and a measurable subset B of X × A, we define (P^π)(A|x) ≜ ∫ P(dy|x, π(x)) I{y∈A} and (P^π)(B|x, a) ≜ ∫ P(dy|x, a) I{(y,π(y))∈B}. [sent-81, score-0.178]

39 The m-step transition probability kernels (P^π)^m : X × A → M(X × A), for m = 2, 3, · · ·, are inductively defined as (P^π)^m(B|x, a) ≜ ∫ P(dy|x, a) (P^π)^{m−1}(B|y, π(y)) (similarly for (P^π)^m : X → M(X)). [sent-82, score-0.094]

40 Given a transition probability kernel P : X → M(X), define the right-linear operator P· : B(X) → B(X) by (PV)(x) ≜ ∫_X P(dy|x) V(y). [sent-83, score-0.101]

41 For a probability measure ρ ∈ M(X) and a measurable subset A of X, define the left-linear operator ·P : M(X) → M(X) by (ρP)(A) = ∫∫ ρ(dx) P(dy|x) I{y∈A}. [sent-84, score-0.159]

42 These operators for P : X × A → M(X × A) are defined similarly. [sent-86, score-0.07]

43 For a discounted MDP, we define the optimal value and optimal action-value functions by V∗(x) = sup_π V^π(x) for all states x ∈ X and Q∗(x, a) = sup_π Q^π(x, a) for all state-action pairs (x, a) ∈ X × A. [sent-89, score-0.181]

44 We say that a policy π∗ is optimal if it achieves the best values in every state, i.e., V^π∗(x) = V∗(x) for all x ∈ X. [sent-90, score-0.398]

45 Greedy policies are important because a greedy policy w.r.t. [sent-97, score-0.456]

46 the optimal action-value function Q∗ is an optimal policy. [sent-100, score-0.1]

47 For a fixed policy π, the Bellman operators T^π : B(X) → B(X) and T^π : B(X × A) → B(X × A) (for the action-value functions) are defined as (T^π V)(x) ≜ r(x, π(x)) + γ ∫_X V(y) P(dy|x, π(x)) and (T^π Q)(x, a) ≜ r(x, a) + γ ∫_X Q(y, π(y)) P(dy|x, a). [sent-101, score-0.418]

48 The fixed point of the Bellman operator is the (action-)value function of the policy π, i.e., T^π V^π = V^π and T^π Q^π = Q^π. [sent-102, score-0.379]

49 Similarly, the Bellman optimality operators T∗ : B(X) → B(X) and T∗ : B(X × A) → B(X × A) (for the action-value functions) are defined as (T∗V)(x) ≜ max_a [ r(x, a) + γ ∫_{R×X} V(y) P(dr, dy|x, a) ] and (T∗Q)(x, a) ≜ r(x, a) + γ ∫_{R×X} max_{a′} Q(y, a′) P(dr, dy|x, a). [sent-105, score-0.166]

50 Again, these operators enjoy a fixed-point property similar to that of the Bellman operators: T ∗ Q∗ = Q∗ and T ∗ V ∗ = V ∗ . [sent-106, score-0.07]
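
A compact sketch of these operators for a tabular MDP, using the expected reward r(x, a) as in the definitions above; the random transition matrix and rewards are placeholders chosen only for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
# Placeholder tabular MDP: P[a, x, y] = P(y | x, a), r[a, x] = r(x, a).
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n_states))

def T_pi(Q, pi):
    """(T^pi Q)(x, a) = r(x, a) + gamma * sum_y P(y|x, a) Q(y, pi(y))."""
    Q_next = Q[pi, np.arange(n_states)]        # the vector y -> Q(y, pi(y))
    return r + gamma * P @ Q_next

def T_opt(Q):
    """(T* Q)(x, a) = r(x, a) + gamma * sum_y P(y|x, a) max_a' Q(y, a')."""
    return r + gamma * P @ Q.max(axis=0)

# Iterating either gamma-contraction converges to its fixed point.
pi = np.zeros(n_states, dtype=int)             # an arbitrary fixed deterministic policy
Q_pi = np.zeros((n_actions, n_states))
Q_star = np.zeros((n_actions, n_states))
for _ in range(500):
    Q_pi, Q_star = T_pi(Q_pi, pi), T_opt(Q_star)
print(np.abs(T_pi(Q_pi, pi) - Q_pi).max())     # ~0: T^pi Q^pi = Q^pi
print(np.abs(T_opt(Q_star) - Q_star).max())    # ~0: T* Q* = Q*
```

Both printed residuals come out at (or near) machine precision, matching the fixed-point properties stated above.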

51 For a probability measure ρ ∈ M(X) and a measurable function V ∈ B(X), we define the Lp(ρ)-norm (1 ≤ p < ∞) of V as ‖V‖_{p,ρ} ≜ ( ∫ |V(x)|^p dρ(x) )^{1/p}. [sent-107, score-0.089]

52 We denote ρ∗ ∈ M(X) as the stationary distribution. [Figure 2: The action-gap function gQ∗(x) and the relative ordering of the optimal and the estimated action-value functions for a single state x.] [sent-112, score-0.183]

53 Depending on the ordering of the estimates, the greedy action is the same as (✓) or different from (✗) the optimal action. [sent-113, score-0.268]

54 This distribution indicates the relative importance of regions of the state space to the user. [sent-116, score-0.069]

55 We quantify the quality of the approximation by the Lp-norm ‖Q̂ − Q∗‖_{p,ρ∗}. The performance loss (or regret) of a policy π is the expected difference between the value of the optimal policy π∗ and the value of π when the initial state distribution is selected according to ρ, i.e., [sent-126, score-0.96]

56 Loss(π; ρ) ≜ ∫_X (V∗(x) − V^π(x)) dρ(x). (1) The value of Loss(π̂; ρ), in which π̂ is the greedy policy w.r.t. [sent-129, score-0.488]

57 Q̂, is the main quantity of interest and indicates how much worse, on average, the agent following policy π̂ would perform compared to the optimal one. [sent-132, score-0.502]
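
A direct translation of definition (1) for a tabular MDP, stated as standalone functions; the array layout (P[a, x, y], r[a, x]) and the use of exact policy evaluation are assumptions made here for illustration.

```python
import numpy as np

def value_of_policy(P, r, gamma, pi):
    """Exact V^pi for a deterministic policy pi: solve (I - gamma P^pi) V = r^pi.
    Assumed layout: P[a, x, y] = P(y|x, a), r[a, x] = r(x, a), pi[x] an action index."""
    n = P.shape[1]
    P_pi = P[pi, np.arange(n), :]
    r_pi = r[pi, np.arange(n)]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def performance_loss(P, r, gamma, pi, rho):
    """Loss(pi; rho) = integral of (V*(x) - V^pi(x)) d rho(x), cf. Eq. (1)."""
    # V* via plain value iteration on Q (adequate for a small tabular MDP).
    Q = np.zeros_like(r)
    for _ in range(2000):
        Q = r + gamma * P @ Q.max(axis=0)
    V_star = value_of_policy(P, r, gamma, Q.argmax(axis=0))
    return float(rho @ (V_star - value_of_policy(P, r, gamma, pi)))
```

For the quantity of interest in the text, one would call it with the greedy policy, e.g. performance_loss(P, r, gamma, Q_hat.argmax(axis=0), rho) for an estimate Q_hat.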

58 For a fixed MDP (X, A, P, R, γ) with |A| = 2, there exist constants cg > 0 and ζ ≥ 0 such that for all t > 0, we have ∫ I{0 < gQ∗(x) ≤ t} dρ∗(x) ≤ cg t^ζ. [sent-138, score-0.356]

59 A large value of ζ indicates that the probability that Q∗(X, 1) is very close to Q∗(X, 2) is small, and vice versa. [sent-140, score-0.066]

60 The smallness of this probability implies that the estimated action-value function Q̂ might be rather inaccurate in a [sent-141, score-0.068]

61 Figure 3: The probability distribution Pρ∗(0 < gQ∗(X) ≤ t) for a 1D stochastic chain walk with 500 states and γ = 0. [sent-157, score-0.092]

62 large subset of the state space (measured according to ρ∗ ) but its corresponding greedy policy would still be the same as the optimal one. [sent-160, score-0.541]

63 The case of ζ = 0 and cg = 1 is equivalent to not having any assumption on the action-gap. [sent-161, score-0.227]

64 As an example of a typical behavior of an action-gap function, Figure 3 depicts Pρ∗ (0 < gQ∗ (X) ≤ t) for the same 1D stochastic chain walk problem as mentioned in the Introduction. [sent-163, score-0.126]
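
One way to inspect the action-gap regularity of a given problem empirically is to tabulate Pρ∗(0 < gQ∗(X) ≤ t) over a grid of thresholds and fit the polynomial form of Assumption A1. The sketch below does this with a synthetic Q∗, a uniform ρ∗, and a crude log-log least-squares fit, all of which are assumptions for illustration rather than the paper's procedure.

```python
import numpy as np

def action_gap_profile(Q_star, rho_star, ts):
    """Empirical map t -> P_{rho*}(0 < g_{Q*}(X) <= t), with g_{Q*}(x) = |Q*(x,1) - Q*(x,2)|."""
    gap = np.abs(Q_star[0] - Q_star[1])            # two-action gap at every state
    return np.array([rho_star[(gap > 0) & (gap <= t)].sum() for t in ts])

def fit_zeta(ts, probs):
    """Crude fit of P(0 < g <= t) ~ c_g * t^zeta by least squares in log-log space."""
    mask = probs > 0
    slope, intercept = np.polyfit(np.log(ts[mask]), np.log(probs[mask]), 1)
    return np.exp(intercept), slope                 # (c_g, zeta)

# Example with a synthetic Q* (placeholder; in practice Q* comes from solving the MDP).
rng = np.random.default_rng(0)
Q_star = rng.random((2, 500))
rho_star = np.full(500, 1.0 / 500)                  # uniform weighting, an assumption
ts = np.logspace(-3, 0, 20)
c_g, zeta = fit_zeta(ts, action_gap_profile(Q_star, rho_star, ts))
print(f"fitted c_g ~ {c_g:.2f}, zeta ~ {zeta:.2f}")
```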

65 Note that the specific polynomial form of the upper bound in Assumption A1 is only a modeling assumption that captures the essence of the action-gap regularity without being so general as to lead to unnecessarily complicated analyses. [sent-165, score-0.345]

66 As a result of the dynamical nature of the MDP, the performance loss depends not only on the choice of ρ and ρ∗, but also on the transition probability kernel P. [sent-166, score-0.161]

67 To analyze this dependence, we define a concentrability coefficient and use a change of measure argument similar to the work of Munos [23, 24], Antos et al. [sent-167, score-0.151]

68 Given ρ, ρ∗ ∈ M(X), a policy π, and an integer m ≥ 0, let ρ(P^π)^m ∈ M(X) denote the future-state distribution obtained when the first state is distributed according to ρ and we then follow the policy π for m steps. [sent-170, score-0.731]

69 The concentrability coefficient of the future-state distribution is defined as C(ρ, ρ∗) ≜ sup_π Σ_{m≥0} γ^m c(m; π). [sent-181, score-0.119]

70 The following theorem upper bounds the performance loss Loss(π̂; ρ) as a function of ‖Q∗ − Q̂‖_{p,ρ∗}, the action-gap distribution, and the concentrability coefficient. [sent-182, score-0.386]
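
The extracted summary does not carry the definition of c(m; π). In the change-of-measure arguments of Munos [23, 24] and Antos et al. [10] that this passage cites, it is typically the worst-case density of the m-step future-state distribution with respect to ρ∗; assuming that is the intended definition here, the pair of definitions reads:

```latex
c(m;\pi) \;\triangleq\; \Bigl\| \frac{\mathrm{d}\bigl(\rho (P^{\pi})^{m}\bigr)}{\mathrm{d}\rho^{*}} \Bigr\|_{\infty},
\qquad
C(\rho,\rho^{*}) \;\triangleq\; \sup_{\pi} \sum_{m \ge 0} \gamma^{m}\, c(m;\pi).
```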

71 We then have Loss(π̂; ρ) ≤ 2^{1+ζ} cg C(ρ, ρ∗) ‖Q̂ − Q∗‖_∞^{1+ζ} and Loss(π̂; ρ) ≤ 2^{1 + p(1+ζ)/(p+ζ)} cg^{(p−1)/(p+ζ)} C(ρ, ρ∗) ‖Q̂ − Q∗‖_{p,ρ∗}^{p(1+ζ)/(p+ζ)}. [sent-190, score-0.178]

72 Loss(π̂; ρ) ≤ Σ_{m≥0} γ^m c(m; π̂) ρ∗F1 ≤ C(ρ, ρ∗) ρ∗F1, (2) in which we used the Radon-Nikodym theorem and the definition of the concentrability coefficient. [sent-198, score-0.177]

73 Then because of the assumption |Q∗(x, a) − Q̂(x, a)| ≤ ε (a = 1, 2), the ordering of Q̂(x, 1) and Q̂(x, 2) is the same as the ordering of Q∗(x, 1) and Q∗(x, 2), which contradicts the assumption that π̂(x) ≠ π∗(x) (see Figure 2). [sent-203, score-0.21]

74 This result together with Assumption A1 shows that ρ∗F1 ≤ 2ε0 Pρ∗(0 < gQ∗(X) ≤ 2ε0) ≤ 2ε0 cg (2ε0)^ζ. [sent-207, score-0.178]

75 and use Hölder's inequality to get ρ∗F1 ≤ ‖D‖_{p,ρ∗} [Pρ∗(0 < gQ∗(X) ≤ t)]^{(p−1)/p} + ‖D‖_{p,ρ∗}^p / t^{p−1} ≤ ‖D‖_{p,ρ∗} (cg t^ζ)^{(p−1)/p} + ‖D‖_{p,ρ∗}^p / t^{p−1}. [sent-214, score-0.383]

76 Minimizing the upper bound in t, we get ρ∗F1 ≤ 2 cg^{(p−1)/(p+ζ)} ‖D‖_{p,ρ∗}^{p(1+ζ)/(p+ζ)}, which in turn, alongside inequality (2) and the definition of D, proves the second part of this result. [sent-216, score-0.253]

77 One might compare Theorem 1 with classical upper bounds such as ‖V^π̂ − V^π∗‖_∞ ≤ (2γ/(1−γ)) ‖V̂ − V∗‖_∞ (Proposition 6. [sent-219, score-0.082]

78 In order to make these two bounds comparable, we slightly modify the proof of our theorem to get the L∞ -norm in the left hand side. [sent-221, score-0.092]

79 If there is no action-gap assumption (ζ = 0 and cg = 1), the results are similar (except for a factor of γ and that we measure the error by the difference in the action-value function as opposed to the value function), but when ζ > 0 the error bound significantly improves. [sent-223, score-0.313]

80 4 Application of the Action-Gap Theorem in Approximate Value Iteration The goal of this section is to show how the analysis based on the action-gap phenomenon might lead to a tighter upper bound on the performance loss for the family of the AVI algorithms. [sent-224, score-0.198]

81 [20], Munos and Szepesvári [11], Farahmand et al. [sent-226, score-0.107]

82 [12]), that work by generating a sequence of action-value function estimates (Q̂_k)_{k=0}^K, in which Q̂_{k+1} is the result of approximately applying the Bellman optimality operator to the previous estimate Q̂_k, i.e., Q̂_{k+1} ≈ T∗Q̂_k. [sent-227, score-0.085]

83 Let us denote the error caused at each iteration by ε_k ≜ T∗Q̂_k − Q̂_{k+1}. (3) [sent-230, score-0.08]
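
A minimal AVI sketch for a tabular MDP in which the "approximate" part of each iteration is a projection onto a coarse state aggregation, standing in for whatever function approximator an actual implementation would use; the MDP, the aggregation scheme, and the bin count are assumptions for illustration, while ε_k is recorded exactly as in (3).

```python
import numpy as np

n, n_actions, gamma, n_bins = 60, 2, 0.95, 6
rng = np.random.default_rng(1)
P = rng.random((n_actions, n, n))               # placeholder MDP: P[a, x, y] = P(y|x, a)
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_actions, n))                  # r[a, x] = r(x, a)
bins = np.repeat(np.arange(n_bins), n // n_bins)  # coarse state aggregation

def T_opt(Q):
    """Exact Bellman optimality backup for the tabular MDP."""
    return r + gamma * P @ Q.max(axis=0)

def approximate(Q):
    """Project onto piecewise-constant functions over the state bins (the 'approximation')."""
    out = np.empty_like(Q)
    for b in range(n_bins):
        out[:, bins == b] = Q[:, bins == b].mean(axis=1, keepdims=True)
    return out

Q_hat = np.zeros((n_actions, n))
for k in range(30):
    target = T_opt(Q_hat)              # exact backup T* Q_hat_k
    Q_next = approximate(target)       # approximation step of AVI
    eps_k = target - Q_next            # eps_k = T* Q_hat_k - Q_hat_{k+1}, cf. (3)
    Q_hat = Q_next
    if k % 10 == 0:
        print(f"k={k:2d}  ||eps_k||_inf = {np.abs(eps_k).max():.3f}")
```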

84 [25], relates the performance loss ‖Q^{π̂(·;Q̂_K)} − Q∗‖_{1,ρ} of the obtained greedy policy π̂(·; Q̂_K) to the error sequence (ε_k)_{k=0}^{K−1} and the action-gap assumption on the MDP. [sent-232, score-0.647]

85 Then for any sequence (Q̂_k)_{k=0}^K ⊂ B(X × A, Qmax) and the corresponding sequence (ε_k)_{k=0}^{K−1} defined in (3), we have Loss(π̂(·; Q̂_K); ρ) ≤ 2 (2/(1−γ))^{p(1+ζ)/(p+ζ)} cg^{(p−1)/(p+ζ)} C(ρ, ρ∗) [ Σ_{k=0}^{K−1} α_k ‖ε_k‖_{p,ρ∗}^p + α_K (2Qmax)^p ]^{(1+ζ)/(p+ζ)}. [sent-238, score-0.226]

86 1 of Munos [24], one may derive Q∗ − Q̂_{k+1} = T^π∗ Q∗ − T^π∗ Q̂_k + T^π∗ Q̂_k − T∗Q̂_k + ε_k ≤ γP^π∗ (Q∗ − Q̂_k) + ε_k, where we used the property of the Bellman optimality operator T∗Q̂_k ≥ T^π∗ Q̂_k and the definition of ε_k (3). [sent-241, score-0.061]

87 Comparing this theorem with Theorem 3 of Farahmand et al. [sent-245, score-0.09]

88 Denoting E ≜ Σ_{k=0}^{K−1} α_k ‖ε_k‖_{2,ρ∗}^2, this paper's result indicates that the effect of the size of ε_k on Loss(π̂(·; Q̂_K); ρ) depends on E^{(1+ζ)/(2+ζ)}, while [25], which does not consider the action-gap regularity, suggests that the effect depends on E^{1/2}. [sent-247, score-0.082]

89 For ζ > 0, this indicates a faster convergence rate for the performance loss while for ζ = 0, they are the same. [sent-248, score-0.153]

90 5 Conclusion This work introduced the action-gap regularity in reinforcement learning and planning problems and analyzed the action-gap phenomenon for two-action discounted MDPs. [sent-249, score-0.511]

91 We showed that when the problem has a favorable action-gap regularity, quantified by the parameter ζ, the performance loss is much smaller than the error of the estimated optimal action-value function. [sent-250, score-0.237]

92 Also, it is important to study the relation between the parameter ζ of the action-gap regularity assumption and the properties of the MDP, such as the transition probability kernel and the reward distribution. [sent-254, score-0.367]

93 [8] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. [sent-285, score-0.073]

94 [10] András Antos, Csaba Szepesvári, and Rémi Munos. [sent-301, score-0.073]

95 Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. [sent-302, score-0.401]

96 [14] Odalric Maillard, Rémi Munos, Alessandro Lazaric, and Mohammad Ghavamzadeh. [sent-318, score-0.073]

97 Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. [sent-341, score-0.153]

98 Performance bounds in Lp norm for approximate value iteration. [sent-357, score-0.066]

99 [25] Amir-massoud Farahmand, Rémi Munos, and Csaba Szepesvári. [sent-359, score-0.073]

100 Error propagation for approximate policy and value iteration. [sent-360, score-0.38]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gq', 0.502), ('policy', 0.348), ('qk', 0.307), ('regularity', 0.248), ('dy', 0.194), ('farahmand', 0.19), ('cg', 0.178), ('mdp', 0.139), ('szepesv', 0.137), ('csaba', 0.13), ('concentrability', 0.119), ('bellman', 0.11), ('avi', 0.109), ('greedy', 0.108), ('reinforcement', 0.1), ('munos', 0.098), ('loss', 0.091), ('measurable', 0.089), ('mohammad', 0.078), ('ri', 0.075), ('tsybakov', 0.074), ('mi', 0.073), ('agent', 0.07), ('operators', 0.07), ('bertsekas', 0.067), ('phenomenon', 0.059), ('antos', 0.058), ('lazaric', 0.058), ('theorem', 0.058), ('lp', 0.057), ('lagoudakis', 0.056), ('tsitsiklis', 0.056), ('ordering', 0.056), ('planning', 0.055), ('michail', 0.054), ('action', 0.054), ('ghavamzadeh', 0.054), ('alessandro', 0.054), ('walk', 0.054), ('iteration', 0.053), ('optimal', 0.05), ('tted', 0.05), ('assumption', 0.049), ('discounted', 0.049), ('upper', 0.048), ('mammen', 0.048), ('rinaldo', 0.048), ('gap', 0.046), ('transition', 0.045), ('audibert', 0.045), ('maillard', 0.044), ('syed', 0.044), ('estimated', 0.042), ('apprenticeship', 0.041), ('oftentimes', 0.041), ('classi', 0.04), ('shie', 0.039), ('chain', 0.038), ('alexander', 0.037), ('dr', 0.037), ('ernst', 0.037), ('sup', 0.036), ('ronald', 0.036), ('state', 0.035), ('behavior', 0.034), ('analyses', 0.034), ('bounds', 0.034), ('indicates', 0.034), ('actions', 0.033), ('maxa', 0.033), ('supremum', 0.032), ('rt', 0.032), ('value', 0.032), ('et', 0.032), ('operator', 0.031), ('whenever', 0.031), ('optimality', 0.03), ('reaches', 0.03), ('xt', 0.03), ('faster', 0.028), ('editors', 0.027), ('error', 0.027), ('favorable', 0.027), ('inequality', 0.027), ('sutton', 0.026), ('inaccurate', 0.026), ('dn', 0.026), ('formalize', 0.025), ('zemel', 0.025), ('annals', 0.025), ('rl', 0.025), ('richard', 0.025), ('discount', 0.025), ('notations', 0.025), ('kernel', 0.025), ('sequence', 0.024), ('effect', 0.024), ('quality', 0.024), ('inductively', 0.024), ('farias', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

2 0.33889592 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

3 0.28347251 49 nips-2011-Boosting with Maximum Adaptive Sampling

Author: Charles Dubout, Francois Fleuret

Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1

4 0.2327199 268 nips-2011-Speedy Q-Learning

Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos

Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ, only T = O(log(n)/(ǫ^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. 1

5 0.18318164 215 nips-2011-Policy Gradient Coagent Networks

Author: Philip S. Thomas

Abstract: We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module’s input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods. 1

6 0.17351264 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

7 0.17136082 291 nips-2011-Transfer from Multiple MDPs

8 0.16045538 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

9 0.13259995 283 nips-2011-The Fixed Points of Off-Policy TD

10 0.13189618 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

11 0.1270155 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.10941378 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

13 0.10374233 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

14 0.095679112 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

15 0.094939493 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

16 0.088562235 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans

17 0.083431445 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

18 0.081120066 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.078018621 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

20 0.07800018 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, -0.23), (2, 0.065), (3, 0.2), (4, -0.269), (5, 0.111), (6, 0.004), (7, -0.0), (8, -0.105), (9, -0.016), (10, -0.054), (11, 0.087), (12, -0.054), (13, -0.042), (14, -0.15), (15, -0.142), (16, 0.058), (17, -0.007), (18, 0.052), (19, -0.004), (20, -0.183), (21, 0.037), (22, 0.026), (23, 0.003), (24, -0.204), (25, -0.107), (26, 0.08), (27, 0.013), (28, -0.061), (29, 0.046), (30, -0.017), (31, 0.11), (32, 0.103), (33, 0.107), (34, -0.027), (35, -0.085), (36, -0.024), (37, -0.114), (38, 0.034), (39, 0.012), (40, -0.117), (41, 0.108), (42, 0.092), (43, 0.102), (44, -0.006), (45, 0.009), (46, 0.037), (47, -0.054), (48, -0.049), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94396853 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

2 0.77438426 268 nips-2011-Speedy Q-Learning

Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos

Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ, only T = O(log(n)/(ǫ^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. 1

3 0.75385845 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

Author: Paul Wagner

Abstract: A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results. 1

4 0.62159693 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

5 0.59323704 49 nips-2011-Boosting with Maximum Adaptive Sampling

Author: Charles Dubout, Francois Fleuret

Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1

6 0.59074152 10 nips-2011-A Non-Parametric Approach to Dynamic Programming

7 0.55735272 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation

8 0.54952234 174 nips-2011-Monte Carlo Value Iteration with Macro-Actions

9 0.544043 215 nips-2011-Policy Gradient Coagent Networks

10 0.52035761 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

11 0.51637411 291 nips-2011-Transfer from Multiple MDPs

12 0.51148617 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization

13 0.48062012 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

14 0.448311 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning

15 0.41713572 300 nips-2011-Variance Reduction in Monte-Carlo Tree Search

16 0.39459884 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

17 0.36978883 109 nips-2011-Greedy Model Averaging

18 0.34262875 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

19 0.34009475 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes

20 0.33790198 283 nips-2011-The Fixed Points of Off-Policy TD


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.024), (4, 0.031), (11, 0.344), (20, 0.029), (26, 0.037), (31, 0.104), (43, 0.082), (45, 0.125), (57, 0.017), (74, 0.043), (83, 0.024), (99, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75184572 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning

Author: Amir-massoud Farahmand

Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an agent following the greedy policy π̂ with respect to an action-value function Q̂, the performance loss E[V∗(X) − V^π̂(X)] is upper bounded by O(‖Q̂ − Q∗‖_∞^{1+ζ}), in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1

2 0.74603122 73 nips-2011-Divide-and-Conquer Matrix Factorization

Author: Lester W. Mackey, Michael I. Jordan, Ameet Talwalkar

Abstract: This work introduces Divide-Factor-Combine (DFC), a parallel divide-and-conquer framework for noisy matrix factorization. DFC divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative filtering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that DFC enjoys high-probability recovery guarantees comparable to those of its base algorithm.

3 0.69321704 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

Author: Tingting Zhao, Hirotaka Hachiya, Gang Niu, Masashi Sugiyama

Abstract: Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is more preferable than REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments. 1

4 0.66459203 66 nips-2011-Crowdclustering

Author: Ryan G. Gomes, Peter Welinder, Andreas Krause, Pietro Perona

Abstract: Is it possible to crowdsource categorization? Amongst the challenges: (a) each worker has only a partial view of the data, (b) different workers may have different clustering criteria and may produce different numbers of categories, (c) the underlying category structure may be hierarchical. We propose a Bayesian model of how workers may approach clustering and show how one may infer clusters / categories, as well as worker parameters, using this model. Our experiments, carried out on large collections of images, suggest that Bayesian crowdclustering works well and may be superior to single-expert annotations. 1

5 0.63978058 150 nips-2011-Learning a Distance Metric from a Network

Author: Blake Shaw, Bert Huang, Tony Jebara

Abstract: Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges. 1

6 0.50892079 291 nips-2011-Transfer from Multiple MDPs

7 0.49986124 258 nips-2011-Sparse Bayesian Multi-Task Learning

8 0.49781507 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

9 0.49753273 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines

10 0.49716738 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

11 0.49647054 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.49612495 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

13 0.49607217 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

14 0.49537843 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

15 0.49401161 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

16 0.49281451 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

17 0.49275732 283 nips-2011-The Fixed Points of Off-Policy TD

18 0.49269027 25 nips-2011-Adaptive Hedge

19 0.49132276 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

20 0.49128073 180 nips-2011-Multiple Instance Filtering