jmlr jmlr2011 jmlr2011-33 knowledge-graph by maker-knowledge-mining

33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning


Source: pdf

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 These include eligibility traces (Sutton, 1988; Watkins, 1989), which update recently visited states in proportion to a trace parameter; experience replay (Lin, 1992), which stores experience sequences and uses them for repeated TD updates; and delayed Q-learning (Strehl et al. [sent-47, score-0.752]

2 We prove that, although just-in-time Q-learning performs the same number of updates as regular Q-learning, the Q-values used in its update targets generally have received more updates. [sent-52, score-0.375]

3 Section 4 extends the idea of using improved update targets to best-match learning with an LVM, in which updates are continually revised such that the update targets constructed from them are more accurate. [sent-54, score-0.464]

4 Section 5 proposes a best-match learning algorithm that uses an n-transition model (NTM), which maintains an estimate of the transition probability for n transition states per state-action pair. [sent-62, score-0.488]

5 In particular, we combine best-match learning with gradient-descent function approximation and show empirically that it can outperform Sarsa(λ) and experience replay with linear function approximation while using similar computation. [sent-67, score-0.314]

6 and r_{t+1} is defined as the reward received after taking action a_t in state s_t at timestep t. [sent-74, score-0.588]

7 In the control case, TD methods seek to learn the optimal action-value function Q∗(s, a), which is the solution to the Bellman optimality equations (Bellman, 1957): Q∗(s, a) = R_sa + γ ∑_{s′} P^{s′}_sa max_{a′} Q∗(s′, a′). [sent-80, score-0.51]

8 Many update targets are possible, such as the Q-learning (Watkins and Dayan, 1992) update target υ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a). [sent-83, score-0.334]
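
A minimal tabular sketch of this update target and the resulting Q-learning update (illustrative only; the variable names and array layout are my own, not the paper's):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update.  Q is a (num_states, num_actions) array."""
    target = r + gamma * np.max(Q[s_next])          # Q-learning update target
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```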

9 The agent can either exploit its current knowledge by taking the action that predicts the highest expected return given current estimates, or it can explore by taking a different action in order to improve the accuracy of the Q-value of that action. [sent-89, score-0.431]

10 An example of an update target for policy evaluation is the TD(0) update target υ_t = r_{t+1} + γ V_t(s_{t+1}). [sent-93, score-0.402]

11 We prove that by postponing Q-learning updates until a state is revisited, the update targets involved receive in general more updates, while the total number of updates of the current state stays the same. [sent-99, score-0.59]

12 However, postponing the update of a value for too long can negatively affect performance, since a value that has not been updated might be used for action selection or for bootstrapping other values. [sent-103, score-0.349]

13 Figure 1: A state transition sequence in which the initial state sA is revisited at timestep 4. [sent-105, score-0.625]

14 State sA is visited at timestep 0 and revisited at timestep 4. [sent-108, score-0.682]

15 With the regular Q-learning update, the Q-value of state-action pair (sA, a0) gets updated at timestep 1: Q1(sA, a0) = (1 − α) Q0(sA, a0) + α [r1 + γ max_a Q0(sB, a)], while at timesteps 2-4 no update of (sA, a0) occurs, and therefore Q4(sA, a0) = Q1(sA, a0). [sent-109, score-0.593]

16 The update of the Q-value of (sA , a0 ) at timestep 1 can be considered premature, since the earliest use of its value is in the update target for (sD , a3 ), which uses Q3 (sA , a0 ). [sent-110, score-0.572]

17 Therefore, the update of the Q-value of (sA , a0 ) can be postponed until at least timestep 3 without negatively affecting the update target for (sD , a3 ). [sent-111, score-0.647]

18 When the update of (sD , a3 ) is also postponed, the earliest use of the Q-value of (sA , a0 ) occurs at timestep 4, where it is used for action selection. [sent-112, score-0.537]

19 Thus, if we postpone the update of all state-action pairs, the update of the Q-value of (sA , a0 ) can be postponed until the timestep of its revisit, without causing dependent state values or the action selection procedure to use a value of (sA , a0 ) that has not been updated. [sent-113, score-0.798]

20 We call this type of update a just-in-time update, since the update is postponed until just before the updated value is needed. [sent-114, score-0.354]

21 At timestep 4, under both update schemes, the Q-value of (sA , a0 ) has received one update based on the same experience sample. [sent-120, score-0.714]

22 However, a just-in-time update uses the most recent value of the Q-values of sB , while a regular update uses the value at the timestep of the initial visit of sA . [sent-121, score-0.638]
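
A rough sketch of the just-in-time idea, assuming a tabular Q array and a `pending` dictionary that holds the postponed sample per state (hypothetical names, not the paper's code):

```python
import numpy as np

def flush_pending(Q, pending, s, alpha, gamma):
    """Perform the postponed (just-in-time) update for state s, if any.
    Called when s is revisited, just before its Q-values are needed for
    action selection or for bootstrapping another update."""
    if s in pending:
        a, r, s_next = pending.pop(s)
        target = r + gamma * np.max(Q[s_next])   # uses the *current* values of s_next
        Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target

def store_sample(pending, s, a, r, s_next):
    """Postpone the update for (s, a): only remember the last-visit experience."""
    pending[s] = (a, r, s_next)
```

Each timestep the agent would first call flush_pending for the state it has just entered, select an action using the now up-to-date Q-values, and then record the observed transition with store_sample.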

23 In this case, neither update type makes use of an updated Q-value for sB in the update target for sA . [sent-129, score-0.331]

24 The regular update does not since it uses the values of sB at timestep t ∗ , and the just-in-time update does not since sB is not revisited and therefore no update has occurred yet at timestep t − 1. [sent-130, score-1.093]

25 The regular update still uses the value of sB from timestep t ∗ and therefore does not use an updated value. [sent-132, score-0.51]

26 The just-in-time update on the other hand does use an updated value, since this update occurred at the revisit of sB . [sent-133, score-0.365]

27 Theorem 1 Given the same experience sequence, each Q-value from the current state has received the same number of updates using JIT updates (Equation 3) as using regular updates (Equation 2). [sent-136, score-0.647]

28 However, each Q-value in the update target of a JIT update has received an equal or greater number of updates than in the update target of the corresponding regular update. [sent-137, score-0.649]

29 In the first case, neither a regular update nor a just-in-time update make use of an updated value for sB in the update target of sA , while in the second case a just-in-time update does. [sent-139, score-0.649]

30 In the deterministic case, we use a constant learning rate of 1, while in the stochastic case we use an initial learning rate α0 of 1 that is decayed in the following manner: α_sa = α0 / (d · [n(s, a) − 1] + 1), (4) where n(s, a) is the total number of times action a has been selected in state s. [sent-157, score-0.724]

31 Note that for d = 0, α sa = α0 , while for d = 1, α sa = α0 /n(s, a). [sent-158, score-0.892]
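
A direct reading of Equation 4 in code (the function name is mine):

```python
def learning_rate(alpha0, d, n_sa):
    """Decayed learning rate of Equation 4: alpha_sa = alpha0 / (d * (n_sa - 1) + 1),
    where n_sa is the number of times action a has been selected in state s.
    d = 0 gives a constant rate alpha0; d = 1 gives alpha0 / n_sa."""
    return alpha0 / (d * (n_sa - 1) + 1)
```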

32 Figure 4: Comparison of the performance of JIT Q-learning and regular Q-learning on the deterministic (left) and stochastic (right) Dyna Maze task for two different initialization schemes. [sent-184, score-0.321]

33 Best-match updates are updates that can correct previous updates when more recent information becomes available. [sent-215, score-0.342]

34 Since we will consider multiple updates per timestep in this section, we denote the Q-value function using two iteration indices: t and i. [sent-221, score-0.435]

35 Now consider the example shown in Figure 5, which extends Figure 1 to include a second revisit of s0 at timestep t = 7. [sent-226, score-0.344]

36 If the state is revisited and a new action is taken, the previous experience is overwritten and lost. [sent-232, score-0.459]

37 However, if the experience is stored per state-action pair, then the previous experience is not overwritten until the same action is selected again. [sent-233, score-0.528]

38 In the example from Figure 5, the update of (sA, a0) can be redone at timestep 7: Q7,1(sA, a0) = (1 − α) Q1,0(sA, a0) + α [r1 + γ max_a Q7,0(sB, a)]. (6) [sent-239, score-0.464]

39 Since state sB is revisited at timestep 6, (sB, a1) has received an extra update and therefore Q7,0(sB, a1) is likely to be more accurate than Q4,0(sB, a1). [sent-240, score-0.602]

40 Equation 6 is not equivalent to a (postponed) Q-learning update, in contrast to Equation 5, since Q1,0 (sA , a0 ) is not equal to Q7,0 (sA , a0 ) due to the update at timestep 4. [sent-241, score-0.413]

41 Equation 6 corrects the update from timestep 4, by redoing it using the most recent Q-values for the update target. [sent-242, score-0.544]

42 Definition 2 The last-visit experience of state-action pair (s, a) denotes the last-visit reward, R′t (s, a), that is, the reward received upon the last visit of (s, a), and the last-visit transition state, S′t (s, a), that is, the state transitioned to upon the last visit of (s, a). [sent-245, score-0.528]

43 Definition 3 The model-free Q-value of a state-action pair (s, a), Q^mf_t(s, a), is a Q-value that has received updates from all observed samples except those stored in the LVM, that is, R′_t(s, a) and S′_t(s, a). [sent-248, score-0.648]

44 We define a best-match update as: Definition 4 A best-match update combines the model-free Q-value of a state-action pair with its last-visit experience from the same timestep according to Q_{t,i+1}(s, a) = (1 − α) Q^mf_t(s, a) + α [R′_t(s, a) + γ max_{a′} Q_{t,i}(S′_t(s, a), a′)]. [sent-252, score-0.709]
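
A sketch of this update for the tabular case, with a single constant α and a plain dictionary as the LVM (the data structures are assumptions for illustration):

```python
import numpy as np

def best_match_update(Q, Q_mf, lvm, s, a, alpha, gamma):
    """One best-match update in the spirit of Definition 4 (rough sketch).

    Q_mf[s, a] is the model-free Q-value (built from all samples except the one
    held in the LVM) and lvm[(s, a)] = (r_last, s_last) is the last-visit
    experience.  Because Q_mf is left untouched, the same update can be redone
    later, when the Q-values of the last-visit transition state have changed."""
    if (s, a) not in lvm:
        return                                   # no last-visit experience stored
    r_last, s_last = lvm[(s, a)]
    target = r_last + gamma * np.max(Q[s_last])
    Q[s, a] = (1.0 - alpha) * Q_mf[s, a] + alpha * target
```

In a full agent, Q_mf would later absorb the best-match value and the LVM entry would be overwritten when a new sample for (s, a) arrives.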

45 Using best-match updates to extend the postponing period of a sample update requires additional computation, as the agent typically performs multiple best-match updates per timestep. [sent-253, score-0.752]

46 In the example, at timestep 7 the agent redoes the update of Q(sA , a0 ), but also performs an update of Q(sA , a4 ). [sent-254, score-0.672]

47 Specifically, at timestep t + 1, Q^mf is updated according to Q^mf_{t+1}(s_t, a_t) = Q_{t+1,0}(s_t, a_t). [sent-256, score-0.52]

48 After Q^mf has been updated, the last-visit experience for (s_t, a_t) is overwritten with the new experience: R′_{t+1}(s_t, a_t) = r_{t+1}, S′_{t+1}(s_t, a_t) = s_{t+1}. [sent-258, score-0.316]

49 In the approach described above, best-match updates are used to postpone the update from a sample without negatively affecting other updates or the action selection process. [sent-259, score-0.507]

50 With the update strategy described above, best-match updates occur only when a state is revisited. [sent-263, score-0.324]

51 Thus, if at timestep 3 the agent performs a best-match update of Q(sB, a1), before updating Q(sA, a0), the latter update will exploit more recent Q-values for sB, just as if sB had been revisited. [sent-267, score-0.672]

52 In other words, the agent uses the Q-values of sA to perform a best-match update of sC , then performs a best-match update of sB and finally updates sA . [sent-271, score-0.504]

53 Definition 5 The best-match LVM equations at timestep t are defined as Q^B_t(s, a) = (1 − α^sa_t) Q^mf_t(s, a) + α^sa_t [R′_t(s, a) + γ max_c Q^B_t(S′_t(s, a), c)] if S′_t(s, a) ≠ ∅, and Q^B_t(s, a) = Q^mf_t(s, a) if S′_t(s, a) = ∅. [sent-276, score-0.769]
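
Since Q^B is defined as a fixed point, one illustrative way to approximate it is to sweep best-match updates until the values stop changing. The sketch below uses a single constant α instead of the per-pair learning rates α^sa_t, and is not the paper's algorithm:

```python
import numpy as np

def solve_best_match_lvm(Q_mf, lvm, alpha, gamma, tol=1e-6, max_sweeps=1000):
    """Illustrative fixed-point iteration for the best-match LVM equations
    of Definition 5, assuming tabular Q_mf and an LVM dictionary."""
    Q = Q_mf.copy()
    n_states, n_actions = Q.shape
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                if (s, a) in lvm:                      # last-visit experience available
                    r_last, s_last = lvm[(s, a)]
                    new = (1 - alpha) * Q_mf[s, a] + alpha * (
                        r_last + gamma * np.max(Q[s_last]))
                else:                                  # no experience stored: Q^B = Q^mf
                    new = Q_mf[s, a]
                delta = max(delta, abs(new - Q[s, a]))
                Q[s, a] = new
        if delta < tol:
            return Q
    return Q
```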

54 For state-action pair (s, a) this induced model can be described as a transition with probability α to state S′(s, a) with a reward of R′(s, a) and a transition with probability 1 − α to a terminal state sT (with a value of 0) and a reward of Q^mf(s, a) (see Figure 7). [sent-280, score-0.592]

55 Figure 7: Illustration of the induced model for state-action pair (s, a) corresponding with the best-match LVM equations. [sent-281, score-1.179]

56 Definition 6 The best-match LVM equations for state values at time t are V^B_t(s) = (1 − α^s_t) V^mf_t(s) + α^s_t [R′_t(s) + γ V^B_t(S′_t(s))] if S′_t(s) ≠ ∅, and V^B_t(s) = V^mf_t(s) if S′_t(s) = ∅. [sent-291, score-0.712]

57 Besides comparing against TD(λ), we also compare against experience replay (Lin, 1992), which stores the n last experience samples and uses them for repeated TD updates. [sent-335, score-0.49]

58 Task A has deterministic state transitions and a deterministic reward of +1, while task B has stochastic transitions and a reward drawn from a normal distribution with mean +1 and standard deviation 0. [sent-350, score-0.346]

59 In task A, at timestep 4 the start state is revisited and the RMS error for best-match LVM drops to 0. [sent-353, score-0.441]

60 The total computation time for the 5000 runs was marginally higher for experience replay with N=4, which has to maintain a queue of recent samples, than for best-match LVM and TD(λ): on task A, around 90 ms. [sent-362, score-0.402]

61 Figure 9: Comparison of the performance of best-match LVM, TD(λ) and experience replay on tasks A (left) and task B (right) of Figure 8. [sent-366, score-0.43]

62 By varying the way state-action pairs are selected for updating (line 10) and changing the stopping criterion (line 12), a whole range of algorithms can be constructed that trade off computation cost per timestep for better approximations of the best-match Q-values. [sent-383, score-0.321]

63 If the state-action pair selection criterion is the state-action pair visited at the previous timestep and the stopping criterion allows only a single update, the algorithm reduces to the regular Q-learning algorithm. [sent-385, score-0.426]

64 A priority queue of state-action pairs is maintained whose corresponding bestmatch updates have the largest expected effect on the best-match Q-value estimates. [sent-414, score-0.328]

65 When a state-action pair has received an update, all state-action pairs whose last-visit transition state equals the state from the updated state-action pair are placed into the priority queue with a priority equal to the absolute change an update would cause in its Q-value. [sent-415, score-0.725]

66 By always putting the state-action pair from the previous timestep on top of the priority queue (line 10), the requirement that each visited state-action pair receives at least one best-match update is fulfilled, guaranteeing convergence in the limit. [sent-418, score-0.65]
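
A rough sketch of such a prioritized, best-match planning loop, using Python's heapq and hypothetical data structures (`lvm`, `predecessors`, `pqueue`); the detail of forcing the previously visited pair to the top of the queue is omitted:

```python
import heapq
import numpy as np

def planning_sweep(Q, Q_mf, lvm, predecessors, pqueue, alpha, gamma, max_updates=20):
    """pqueue holds (-priority, s, a) entries; predecessors[s] lists the
    state-action pairs whose last-visit transition state is s."""
    updates = 0
    while pqueue and updates < max_updates:
        _, s, a = heapq.heappop(pqueue)
        if (s, a) not in lvm:
            continue
        r_last, s_last = lvm[(s, a)]
        v_before = np.max(Q[s])
        # Best-match update: correct the previous update rather than repeat it.
        Q[s, a] = (1 - alpha) * Q_mf[s, a] + alpha * (r_last + gamma * np.max(Q[s_last]))
        updates += 1
        delta_v = abs(np.max(Q[s]) - v_before)
        if delta_v > 0:
            # Queue all pairs whose update target depends on the value of s.  The
            # change in the value of s stands in for the priority described in the
            # text (the change an update would cause in the predecessor's Q-value).
            for (sp, ap) in predecessors.get(s, []):
                heapq.heappush(pqueue, (-delta_v, sp, ap))
    return Q
```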

67 On the surface, this algorithm resembles deterministic prioritized sweeping (DPS) (Sutton and Barto, 1998), a simpler variation that learns only a deterministic model, uses a slightly different priority heuristic, and performs Q-learning updates to its Q-values. [sent-419, score-0.44]

68 By performing updates with respect to Q^mf instead of Q, BM-LVM corrects previous updates instead of performing multiple updates based on the same sample. [sent-422, score-0.342]

69 For all prioritized-sweeping algorithms we performed a maximum of 20 updates per timestep (i. [sent-448, score-0.435]

70 For experience replay we used the last 20 samples, which also results in 20 updates per timestep. [sent-451, score-0.467]

71 In contrast, the combination of a greedy behavior policy with optimistic initialization enables the prioritized sweeping methods to converge to the optimal policy in a deterministic environment. [sent-472, score-0.407]

72 Table 2: Average return and optimal parameters (d = α decay rate) of best-match prioritized sweeping and several competitors on the deterministic Dyna Maze task. [sent-503, score-0.343]

73 Table 3: Average return and optimal parameters (d = α decay rate) of best-match prioritized sweeping and several competitors on the stochastic Dyna Maze task. [sent-530, score-0.326]

74 We tested whether doubling the size of the stored experience sequence improves the performance of experience replay, but this led to no significant performance increase. [sent-535, score-0.329]

75 The reason for this is that while in both cases the maximum number of updates per timestep is 20, in the deterministic case the priority queue often has fewer than 20 samples, so fewer updates occur. [sent-538, score-0.744]

76 The computation time of Q(λ) is slightly better than that of BM-LVM, while experience replay is about twice as fast as BM-LVM. [sent-539, score-0.314]

77 Generalized Best-Match Equations: Best-match LVM learning takes the idea of using more accurate update targets to the extreme by continuously revising update targets with best-match updates. [sent-558, score-0.35]

78 Therefore, when using the LVM, at timestep 5 the old experience sample is overwritten with the new experience. [sent-566, score-0.458]

79 Let υ^y_x indicate the update target from the sample collected at timestep x based on the best-match Q-value of timestep y: υ^y_x = r_x + γ max_a Q^B_y(s_x, a). [sent-568, score-0.743]

80 Using this convention, the update of Q^mf at timestep 5 becomes Q^mf_5(sA, a0) = (1 − α) Q0(sA, a0) + α υ^4_1. [sent-569, score-0.413]

81 Thus, the best-match Q-value of (sA, a0) at timestep 7 is equal to a weighted average of Q^mf_0, υ^4_1 and υ^7_5. [sent-571, score-0.479]

82 On the other hand, if both the old and the new sample are stored, Q-values from timestep 7 could also be used for the update target of the old sample, yielding Q^B_7(sA, a0) = (1 − α)² Q^mf_0(sA, a0) + α(1 − α) υ^7_1 + α υ^7_5. (10) [sent-572, score-0.638]

83 For the state-sequence from Figure 11 this means that the experience resulting from (sB, a6) is also taken into account in the update target for (sA, a0). [sent-573, score-0.496]

84 We define X_sa as the subset of X containing all samples belonging to state-action pair (s, a) and N^x_sa as the size of X_sa. [sent-578, score-0.507]

85 Without loss of generality, we index the samples from X_sa as x^sa_k for 1 ≤ k ≤ N^x_sa, and we define W_sa as a set consisting of N^x_sa + 1 weights w^sa_k ∈ ℝ such that 0 ≤ w^sa_k ≤ 1 for 0 ≤ k ≤ N^x_sa and ∑_{k=0}^{N^x_sa} w^sa_k = 1. [sent-580, score-0.892]

86 Definition 11 The generalized best-match equations with respect to Q^mf_t, X and W are Q^B_t(s, a) = w^sa_0 Q^mf_t(s, a) + w^sa_1 υ^sa_1 + w^sa_2 υ^sa_2 + ... [sent-582, score-2.915]

87 + w^sa_{N^x_sa} υ^sa_{N^x_sa}, for all s, a, (11) where υ^sa_k = r + γ max_c Q^B_t(s′, c) | r, s′ ∈ x^sa_k. [sent-585, score-1.14]

88 When the samples are combined by incremental Q-learning updates, like in Equation 10, the weights have the values w^sa_0 = ∏_{i=1}^{N^x_sa} (1 − α^sa_i), (12) and w^sa_k = α^sa_k ∏_{i=k+1}^{N^x_sa} (1 − α^sa_i) for 1 ≤ k ≤ N^x_sa. [sent-590, score-1.394]

89 A better weight distribution gives all samples the same weight: w^sa_k = (1 − w^sa_0)/N^x_sa for 1 ≤ k ≤ N^x_sa, for some value of w^sa_0. [sent-594, score-1.394]
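
A small sketch of the two weighting schemes, assuming the weights of Equation 12 take the incremental-update form w^sa_k = α^sa_k ∏_{i>k}(1 − α^sa_i) (function names are mine):

```python
import numpy as np

def incremental_weights(alphas):
    """Weights induced by redone incremental Q-learning updates (Equation 12):
    w0 = prod_i (1 - alpha_i) and wk = alpha_k * prod_{i>k} (1 - alpha_i).
    The resulting weights sum to one."""
    n = len(alphas)
    w = np.empty(n + 1)
    w[0] = np.prod([1.0 - a for a in alphas])
    for k in range(1, n + 1):
        w[k] = alphas[k - 1] * np.prod([1.0 - a for a in alphas[k:]])
    return w

def uniform_weights(n, w0):
    """The alternative distribution that gives every sample the same weight."""
    return np.concatenate(([w0], np.full(n, (1.0 - w0) / n)))
```

For a constant α and two samples, incremental_weights([0.5, 0.5]) reproduces the (1 − α)², α(1 − α), α pattern of Equation 10.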

90 Using w^sa_0 = 1 − N^x_sa/N_sa, P̂^s′_sa = N^{x,s′}_sa/N^x_sa and R̂_sa = ∑_{X_sa} r_sa/N^x_sa, the generalized best-match equations can now be rewritten as Q^B(s, a) = w^sa_0 Q^mf(s, a) + (1 − w^sa_0) [R̂_sa + γ ∑_{s′} P̂^s′_sa max_{a′} Q^B(s′, a′)], for all s, a. [sent-607, score-2.272]
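
One backup of this rewritten form, sketched with hypothetical dictionaries for the sparse model (P_hat, R_hat, w0 are assumptions for illustration):

```python
import numpy as np

def best_match_sparse_model_update(Q, Q_mf, w0, R_hat, P_hat, s, a, gamma):
    """P_hat[(s, a)] maps each stored next state to its estimated probability
    (only the n states kept by the sparse model), R_hat[(s, a)] is the estimated
    mean reward, and w0[(s, a)] weights the model-free value against the model."""
    expected_next = sum(p * np.max(Q[s_next]) for s_next, p in P_hat[(s, a)].items())
    Q[s, a] = w0[(s, a)] * Q_mf[s, a] + (1.0 - w0[(s, a)]) * (
        R_hat[(s, a)] + gamma * expected_next)
    return Q
```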

91 In these equations, P̂ and R̂ constitute a sparse, approximate model, whose size can be controlled by limiting the number of next states per state-action pair for which P̂_sa is estimated. [sent-608, score-0.546]

92 We define an n-transition model (NTM) to be one that estimates the transition probability P̂ for n next states per state-action pair. [sent-610, score-0.383]

93 The priority of a state-action pair (s, a) for BM-NTM is defined as p = (1 − w^sa_0) P̂^{s1}_sa · |∆V(s1)|, where ∆V(s1) is the difference in the state value of s1 before and after the best-match update of one of the Q-values of s1. [sent-636, score-0.742]

94 , by estimating the n most likely transition states), this model is sufficient for our experimental setting since most transition states have similar transition probabilities. [sent-641, score-0.351]

95 When a transition to a square is blocked by a wall, the transition probability of that square is added to the transition probability of the square in front of the wall. [sent-653, score-0.315]

96 Right: transition probabilities (·1/15) of a ‘north’ action for different positions of the agent (indicated by the circle) with respect to the walls (black squares). [sent-672, score-0.357]

97 The reason is that the priority queue of PS-NTM is often close to empty in this case and thus the 5 updates per timestep are often not reached. [sent-682, score-0.584]

98 We show that the resulting algorithm, which combines the N most recent samples with the model-free Q-value function, outperforms both linear Sarsa(λ) and linear experience replay on the mountain car task. [sent-690, score-0.35]

99 If there are N^x_sa samples belonging to state-action pair (s, a), then the best-match update based on these samples is Q_{t,i+1}(s, a) = w^sa_0 Q^mf_t(s, a) + w^sa_1 υ^sa_1 + w^sa_2 υ^sa_2 + ... [sent-745, score-2.904]

100 + w^sa_{N^x_sa} υ^sa_{N^x_sa}, (14) where υ^sa_k = r + γ max_c Q_{t,i}(s′, c) | r, s′ ∈ x^sa_k. [sent-748, score-1.14]
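
A tabular sketch of Equation 14 (the paper combines this with linear function approximation over the N most recent samples; here the Q-function is kept as an array and the data structures are assumptions):

```python
import numpy as np

def best_match_update_n_samples(Q, Q_mf, samples_sa, weights, s, a, gamma):
    """Best-match update from several stored samples in the spirit of Equation 14.
    samples_sa is the list of (r, s_next) samples stored for (s, a); weights =
    [w0, w1, ..., wN] follows the convention of Definition 11 (w0 multiplies the
    model-free value and all weights sum to one)."""
    targets = [r + gamma * np.max(Q[s_next]) for (r, s_next) in samples_sa]
    Q[s, a] = weights[0] * Q_mf[s, a] + sum(w * u for w, u in zip(weights[1:], targets))
    return Q
```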


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sa', 0.446), ('lvm', 0.412), ('timestep', 0.282), ('mf', 0.197), ('replay', 0.174), ('sb', 0.171), ('nsa', 0.166), ('experience', 0.14), ('jit', 0.135), ('update', 0.131), ('agent', 0.128), ('action', 0.124), ('updates', 0.114), ('td', 0.111), ('transition', 0.105), ('asselt', 0.101), ('atch', 0.101), ('eijen', 0.101), ('einforcement', 0.101), ('hiteson', 0.101), ('iering', 0.101), ('prioritized', 0.101), ('quations', 0.101), ('xploiting', 0.101), ('tb', 0.094), ('queue', 0.088), ('episodes', 0.085), ('policy', 0.084), ('revisited', 0.08), ('maze', 0.08), ('ntm', 0.08), ('state', 0.079), ('ps', 0.076), ('reward', 0.073), ('sweeping', 0.072), ('delayed', 0.068), ('fficient', 0.066), ('dps', 0.066), ('bestmatch', 0.065), ('revisit', 0.062), ('priority', 0.061), ('van', 0.059), ('dyna', 0.058), ('timesteps', 0.058), ('regular', 0.056), ('return', 0.055), ('terminal', 0.053), ('postponed', 0.051), ('maxc', 0.051), ('psa', 0.051), ('stored', 0.049), ('deterministic', 0.046), ('episode', 0.046), ('targets', 0.044), ('nsas', 0.043), ('tabular', 0.043), ('bellman', 0.042), ('equations', 0.042), ('updated', 0.041), ('decay', 0.039), ('per', 0.039), ('cb', 0.038), ('visited', 0.038), ('sutton', 0.038), ('visit', 0.038), ('atkeson', 0.036), ('overwritten', 0.036), ('sarsa', 0.036), ('initialize', 0.036), ('states', 0.036), ('samples', 0.036), ('vt', 0.033), ('loop', 0.031), ('competitors', 0.03), ('received', 0.03), ('stochastic', 0.029), ('postponing', 0.029), ('pqueue', 0.029), ('rsum', 0.029), ('traded', 0.029), ('xsa', 0.029), ('strehl', 0.028), ('watkins', 0.028), ('target', 0.028), ('vr', 0.027), ('earning', 0.025), ('eligibility', 0.025), ('pair', 0.025), ('actions', 0.025), ('rms', 0.024), ('negatively', 0.024), ('control', 0.022), ('moore', 0.022), ('hasselt', 0.022), ('redo', 0.022), ('whiteson', 0.022), ('maxa', 0.02), ('wk', 0.02), ('singh', 0.02), ('initialization', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

2 0.29820392 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann

Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision

3 0.12667267 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

4 0.077370085 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

Author: Marek Petrik, Shlomo Zilberstein

Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes

5 0.05240592 28 jmlr-2011-Double Updating Online Learning

Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin

Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification

6 0.040274374 36 jmlr-2011-Generalized TD Learning

7 0.032533571 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

8 0.029892739 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

9 0.029650601 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

10 0.028111985 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

11 0.02710484 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

12 0.02497557 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

13 0.024649432 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

14 0.024164815 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

15 0.023858802 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

16 0.022211453 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

17 0.020347882 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

18 0.020237343 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

19 0.019169848 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough

20 0.019039985 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.123), (1, 0.097), (2, -0.225), (3, 0.039), (4, -0.093), (5, 0.369), (6, -0.294), (7, -0.082), (8, 0.073), (9, -0.056), (10, 0.021), (11, 0.004), (12, 0.035), (13, -0.124), (14, 0.053), (15, -0.041), (16, 0.023), (17, -0.235), (18, 0.12), (19, -0.055), (20, 0.051), (21, 0.012), (22, 0.013), (23, -0.157), (24, 0.142), (25, 0.039), (26, -0.095), (27, -0.125), (28, 0.06), (29, -0.021), (30, 0.082), (31, -0.018), (32, 0.019), (33, -0.143), (34, 0.075), (35, 0.107), (36, 0.016), (37, 0.045), (38, 0.074), (39, 0.083), (40, 0.085), (41, 0.016), (42, -0.056), (43, -0.029), (44, -0.066), (45, -0.004), (46, 0.056), (47, -0.014), (48, 0.038), (49, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97732687 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

2 0.92730564 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

Author: Stéphane Ross, Joelle Pineau, Brahim Chaib-draa, Pierre Kreitmann

Abstract: Bayesian learning methods have recently been shown to provide an elegant solution to the explorationexploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). The primary focus of this paper is to extend these ideas to the case of partially observable domains, by introducing the Bayes-Adaptive Partially Observable Markov Decision Processes. This new framework can be used to simultaneously (1) learn a model of the POMDP domain through interaction with the environment, (2) track the state of the system under partial observability, and (3) plan (near-)optimal sequences of actions. An important contribution of this paper is to provide theoretical results showing how the model can be finitely approximated while preserving good learning performance. We present approximate algorithms for belief tracking and planning in this model, as well as empirical results that illustrate how the model estimate and agent’s return improve as a function of experience. Keywords: processes reinforcement learning, Bayesian inference, partially observable Markov decision

3 0.44062307 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments

Author: Jaedeug Choi, Kee-Eung Kim

Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming

4 0.22143145 36 jmlr-2011-Generalized TD Learning

Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii

Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function

5 0.20791893 28 jmlr-2011-Double Updating Online Learning

Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin

Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification

6 0.13958581 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

7 0.13829938 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

8 0.1375237 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

9 0.12397337 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin

10 0.10829942 94 jmlr-2011-Theoretical Analysis of Bayesian Matrix Factorization

11 0.10620248 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.10389552 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

13 0.10269051 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

14 0.10027792 16 jmlr-2011-Clustering Algorithms for Chains

15 0.09755642 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

16 0.096330501 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

17 0.096018821 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

18 0.092946298 91 jmlr-2011-The Sample Complexity of Dictionary Learning

19 0.091370977 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking

20 0.089327566 42 jmlr-2011-In All Likelihood, Deep Belief Is Not Enough


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.019), (9, 0.011), (10, 0.027), (24, 0.022), (31, 0.038), (32, 0.016), (41, 0.014), (60, 0.011), (65, 0.014), (73, 0.016), (78, 0.712)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.980039 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning

Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering

Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse

2 0.97432321 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann

Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding c 2011 Shuheng Zhou, Philipp R¨ timann, Min Xu and Peter B¨ hlmann. u u ¨ ¨ Z HOU , R UTIMANN , X U AND B UHLMANN

3 0.96010351 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

Author: Shai Shalev-Shwartz, Ambuj Tewari

Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity

4 0.92050618 22 jmlr-2011-Differentially Private Empirical Risk Minimization

Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate

Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression

5 0.82449257 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

Author: Şeyda Ertekin, Cynthia Rudin

Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression

6 0.75739455 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation

7 0.7508502 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

8 0.7404753 40 jmlr-2011-Hyper-Sparse Optimal Aggregation

9 0.72615367 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

10 0.7239902 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

11 0.6966365 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.68977404 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

13 0.67002547 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

14 0.66714483 36 jmlr-2011-Generalized TD Learning

15 0.66123587 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm

16 0.66093528 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

17 0.65549433 104 jmlr-2011-X-Armed Bandits

18 0.65480089 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

19 0.64523607 91 jmlr-2011-The Sample Complexity of Dictionary Learning

20 0.64479959 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning