nips nips2011 nips2011-268 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n)/(ǫ^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than incremental methods like Q-learning.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. [sent-12, score-0.166]
2 We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O(log(n)/(ǫ^2 (1 − γ)^4)) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. [sent-13, score-0.073]
3 This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. [sent-14, score-0.053]
4 Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. [sent-15, score-0.114]
5 1 Introduction Q-learning [20] is a well-known model-free reinforcement learning (RL) algorithm that finds an estimate of the optimal action-value function. [sent-16, score-0.038]
6 However, it suffers from slow convergence, especially when the discount factor γ is close to one [8, 17]. [sent-19, score-0.037]
7 The main reason for the slow convergence of Q-learning is the combination of the sample-based stochastic approximation (that makes use of a decaying learning rate) and the fact that the Bellman operator propagates information throughout the whole space (especially when γ is close to 1). [sent-20, score-0.087]
8 In this paper, we focus on RL problems that are formulated as finite state-action discounted infinite horizon Markov decision processes (MDPs), and propose an algorithm, called speedy Q-learning (SQL), that addresses the problem of slow convergence of Q-learning. [sent-21, score-0.197]
9 However, this allows SQL to use a more aggressive learning rate for one of the terms in its update rule and eventually achieve a faster convergence rate than standard Q-learning (see Section 3. [sent-23, score-0.147]
10 We prove a PAC bound on the performance of SQL, which shows that only T = O(log(n)/((1 − γ)^4 ǫ^2)) samples are required for SQL in order to guarantee an ǫ-optimal action-value function with high probability. [sent-25, score-0.052]
11 The rate for SQL is even better than that for the Phased Q-learning algorithm, a model-free batch Q-value iteration algorithm proposed and analyzed by [12]. [sent-27, score-0.066]
12 In addition, SQL’s rate is slightly better than the rate of the model-based batch Q-value iteration algorithm in [12] and has better computational and memory requirements (computational and space complexity); see Section 3. [sent-28, score-0.094]
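To get a feel for how a bound of the form T = O(log(n)/(ǫ^2 (1 − γ)^4)) scales, the short sketch below evaluates its dominant term for a few problem sizes. The O(·) hides constants, so only the relative scaling is meaningful, and the chosen values of n, ǫ, and γ are hypothetical.

```python
import math

def sql_steps_order(n, eps, gamma):
    """Dominant term of the SQL bound T = O(log(n) / (eps^2 * (1 - gamma)^4)).

    Constants hidden by the O(.) are ignored, so the value only indicates
    how T scales with n, eps, and gamma."""
    beta = 1.0 / (1.0 - gamma)  # beta = 1/(1 - gamma)
    return math.log(n) * beta ** 4 / eps ** 2

# Hypothetical problem sizes, chosen only to show the scaling.
for n, eps, gamma in [(10_000, 0.1, 0.9), (10_000, 0.1, 0.99), (10_000, 0.01, 0.9)]:
    print(f"n={n}, eps={eps}, gamma={gamma}: T ~ {sql_steps_order(n, eps, gamma):.3g}")
```

Moving γ from 0.9 to 0.99 multiplies the dominant term by 10^4, while shrinking ǫ from 0.1 to 0.01 multiplies it by 10^2, which is the sense in which the dependency on 1/(1 − γ) and 1/ǫ matters.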
13 Similar to Q-learning, SQL may be implemented in synchronous and asynchronous fashions. [sent-31, score-0.11]
14 For the sake of simplicity in the analysis, we only report and analyze its synchronous version in this paper. [sent-32, score-0.067]
15 However, it can easily be implemented in an asynchronous fashion and our theoretical results can also be extended to this setting by following the same path as [8]. [sent-33, score-0.071]
16 This over-estimation is caused by a positive bias introduced by using the maximum action value as an approximation for the expected action value [19]. [sent-39, score-0.032]
17 , a high-probability bound on the performance of SQL, in Section 3. [sent-45, score-0.038]
18 2, and finally compare our bound with the previous results on Q-learning in Section 3. [sent-46, score-0.038]
19 Section 4 contains the detailed proof of the performance bound of the SQL algorithm. [sent-48, score-0.038]
20 We consider the standard reinforcement learning (RL) framework [5, 16] in which a learning agent interacts with a stochastic environment and this interaction is modeled as a discrete-time discounted MDP. [sent-53, score-0.069]
21 A discounted MDP is a quintuple (X, A, P, R, γ), where X and A are the set of states and actions, P is the state transition distribution, R is the reward function, and γ ∈ (0, 1) is a discount factor. [sent-54, score-0.072]
22 We denote by P (·|x, a) and r(x, a) the probability distribution over the next state and the immediate reward of taking action a at state x, respectively. [sent-55, score-0.069]
23 A stationary Markov policy π(·|x) is the distribution over the control actions given the current state x. [sent-60, score-0.042]
24 The value and the action-value functions of a policy π, denoted respectively by V π : X → R and Qπ : Z → R, are defined as the expected sum of discounted rewards that are encountered when the policy π is executed. [sent-62, score-0.075]
25 Given an MDP, the goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), ∀x ∈ X. [sent-63, score-0.022]
26 The optimal action-value function Q∗ is the unique fixed-point of the Bellman optimality operator T defined as (TQ)(x, a) = r(x, a) + γ Σ_{y∈X} P(y|x, a) max_{b∈A} Q(y, b), ∀(x, a) ∈ Z. [sent-66, score-0.049]
27 Finally for the sake of readability, we define the max operator M over action-value functions as (MQ)(x) = maxa∈A Q(x, a), ∀x ∈ X. [sent-71, score-0.07]
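To make the operators concrete, here is a minimal tabular sketch of the max operator M and the Bellman optimality operator T, with P stored as an array of shape (n_states, n_actions, n_states) and Q as (n_states, n_actions). The array layout and variable names are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def M(Q):
    """Max operator: (MQ)(x) = max_a Q(x, a)."""
    return Q.max(axis=1)

def bellman_T(Q, P, r, gamma):
    """Bellman optimality operator:
    (TQ)(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_b Q(y, b)."""
    return r + gamma * P @ M(Q)  # (S, A, S) @ (S,) -> (S, A)

# Tiny random MDP, purely for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)  # make each P(.|x, a) a proper distribution
r = rng.random((n_states, n_actions))
Q = np.zeros((n_states, n_actions))
print(bellman_T(Q, P, r, gamma))  # one application of T to the zero action-value function
```

Iterating bellman_T is exactly Q-value iteration; Q-learning and SQL replace the exact expectation over P with sampled next states.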
28 3 Speedy Q-Learning In this section, we introduce our RL algorithm, called speedy Q-Learning (SQL), derive a performance bound for this algorithm, and compare this bound with similar results on standard Q-learning. [sent-72, score-0.191]
29 The derived performance bound shows that SQL has a rate of convergence of order O(√(1/T)), which is better than all the existing results for Q-learning. [sent-73, score-0.101]
30 As can be seen, this is the synchronous version of the algorithm, which will be analyzed in the paper. [sent-76, score-0.052]
31 In the asynchronous version, at each time step, the action-value of the observed state-action pair is updated, while the rest of the state-action pairs remain unchanged. [sent-78, score-0.058]
32 For the convergence of this instance of the algorithm, it is required that all the states and actions are visited infinitely many times, which makes the analysis slightly more complicated. [sent-79, score-0.035]
33 On the other hand, given a generative model, the algorithm may also be formulated in a synchronous fashion, in which we first generate a next state y ∼ P (·|x, a) for each state-action pair (x, a), and then update the action-values of all the state-action pairs using these samples. [sent-80, score-0.093]
34 We chose to include only the synchronous version of SQL in the paper just for the sake of simplicity in the analysis. [sent-81, score-0.067]
35 However, the algorithm can be implemented in an asynchronous fashion (similar to the more familiar instance of Q-learning) and our theoretical results can also be extended to the asynchronous case under some mild assumptions. [sent-82, score-0.129]
36 Algorithm 1: Synchronous Speedy Q-Learning (SQL). Input: initial action-value function Q0, discount factor γ, and number of iterations T. Q−1 := Q0; // Initialization for k := 0, 1, 2, 3, ... [sent-83, score-0.04]
37 Let us consider the update rule of Q-learning: Qk+1(x, a) = Qk(x, a) + αk (Tk Qk(x, a) − Qk(x, a)). (Footnote 1: See [2] for the convergence analysis of the asynchronous variant of SQL.) [sent-95, score-0.052]
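As a concrete reading of this update rule, the sketch below performs one synchronous Q-learning sweep with the empirical operator Tk Q(x, a) = r(x, a) + γ max_b Q(yk, b) built from one sampled next state per state-action pair; the array layout and helper names are assumptions for illustration, and a matching sketch of the SQL update follows the comparison below.

```python
import numpy as np

def empirical_T(Q, r, next_states, gamma):
    """Empirical Bellman operator: T_k Q(x, a) = r(x, a) + gamma * max_b Q(y_k(x, a), b),
    where next_states[x, a] is the sampled next state y_k ~ P(.|x, a)."""
    return r + gamma * Q[next_states].max(axis=-1)

def q_learning_step(Q, r, next_states, gamma, alpha):
    """One synchronous Q-learning sweep: Q_{k+1} = Q_k + alpha_k * (T_k Q_k - Q_k)."""
    return Q + alpha * (empirical_T(Q, r, next_states, gamma) - Q)
```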
38 Note that other (polynomial) learning steps can also be used with speedy Q-learning. [sent-96, score-0.115]
39 However, one can show that the rate of convergence of SQL is optimized for αk = 1/(k + 1). [sent-97, score-0.075]
40 This is in contrast to the standard Q-learning algorithm for which the rate of convergence is optimized for a polynomial learning step [8]. [sent-98, score-0.075]
41 1, we first notice that the same terms: Tk Qk−1 − Qk and Tk Qk − Tk Qk−1 appear on the RHS of the update rules of both algorithms. [sent-102, score-0.021]
42 However, while Q-learning uses the same conservative learning rate αk for both these terms, SQL uses αk for the first term and a bigger learning step 1 − αk = k/(k + 1) for the second one. [sent-103, score-0.04]
43 Since the term Tk Qk − Tk Qk−1 goes to zero as Qk approaches its optimal value Q∗ , it is not necessary that its learning rate approaches zero. [sent-104, score-0.028]
44 As a result, using the learning rate αk , which goes to zero with k, is too conservative for this term. [sent-105, score-0.04]
45 This might be a reason why SQL that uses a more aggressive learning rate 1 − αk for this term has a faster convergence rate than Q-learning. [sent-106, score-0.106]
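Putting the two learning rates together, here is a minimal sketch of synchronous SQL as described by Algorithm 1, with αk = 1/(k + 1), the conservative rate αk on the term Tk Qk−1 − Qk, and the aggressive rate 1 − αk on Tk Qk − Tk Qk−1. It assumes access to a generative model returning one sampled next state per state-action pair, and all variable names are illustrative.

```python
import numpy as np

def empirical_T(Q, r, next_states, gamma):
    """T_k Q(x, a) = r(x, a) + gamma * max_b Q(y_k(x, a), b) for sampled next states y_k."""
    return r + gamma * Q[next_states].max(axis=-1)

def sql_step(Q_prev, Q, r, next_states, gamma, k):
    """One synchronous speedy Q-learning sweep:
    Q_{k+1} = Q_k + alpha_k (T_k Q_{k-1} - Q_k) + (1 - alpha_k)(T_k Q_k - T_k Q_{k-1})."""
    alpha = 1.0 / (k + 1)
    T_prev = empirical_T(Q_prev, r, next_states, gamma)  # T_k applied to the previous iterate
    T_curr = empirical_T(Q, r, next_states, gamma)       # T_k applied to the current iterate
    return Q + alpha * (T_prev - Q) + (1.0 - alpha) * (T_curr - T_prev)

# Illustrative run against a small random generative model (all sizes hypothetical).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, T = 5, 2, 0.9, 500
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))
Q_prev = Q = np.zeros((n_states, n_actions))
for k in range(T):
    # One sampled next state y_k ~ P(.|x, a) for every state-action pair,
    # used for both T_k Q_{k-1} and T_k Q_k as in Algorithm 1.
    next_states = np.array([[rng.choice(n_states, p=P[x, a]) for a in range(n_actions)]
                            for x in range(n_states)])
    Q_prev, Q = Q, sql_step(Q_prev, Q, r, next_states, gamma, k)
```

The only change relative to the Q-learning sweep sketched earlier is the second difference term Tk Qk − Tk Qk−1, which is weighted by 1 − αk = k/(k + 1) rather than by the decaying rate αk.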
46 2 Main Theoretical Result The main theoretical result of the paper is expressed as a high-probability bound over the performance of the SQL algorithm. [sent-108, score-0.038]
47 Let Assumption 1 hold and T be a positive integer. [sent-110, score-0.024]
48 Then, at iteration T of SQL, with probability at least 1 − δ, we have ||Q∗ − QT|| ≤ 2β^2 Rmax (γ/T + √(2 log(2n/δ)/T)). [sent-111, score-0.037]
49 This result, combined with the Borel-Cantelli lemma [9], guarantees that QT converges almost surely to Q∗ with the rate √(1/T). [sent-113, score-0.064]
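To see what the Theorem 1 bound gives numerically, the sketch below evaluates 2β^2 Rmax (γ/T + √(2 log(2n/δ)/T)) for a hypothetical problem size and searches for a T that pushes it below a target ǫ; the problem parameters are made up for illustration.

```python
import math

def sql_theorem1_bound(T, n, gamma, R_max, delta):
    """Right-hand side of Theorem 1: 2 beta^2 R_max (gamma / T + sqrt(2 log(2n/delta) / T))."""
    beta = 1.0 / (1.0 - gamma)
    return 2 * beta ** 2 * R_max * (gamma / T + math.sqrt(2 * math.log(2 * n / delta) / T))

# Hypothetical setting: n state-action pairs, target accuracy eps, confidence 1 - delta.
n, gamma, R_max, delta, eps = 10_000, 0.9, 1.0, 0.05, 0.5
T = 1
while sql_theorem1_bound(T, n, gamma, R_max, delta) > eps:
    T *= 2
print(f"the bound drops below eps={eps} by T={T}")
```

Since the √(2 log(2n/δ)/T) term dominates for large T, the required T grows like β^4 Rmax^2 log(n/δ)/ǫ^2, matching the PAC statement quoted in the introduction up to constants.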
50 Corollary 1 (Finite-time PAC (“probably approximately correct”) performance bound for SQL). [sent-117, score-0.038]
51 3 Relation to Existing Results In this section, we first compare our results for SQL with the existing results on the convergence of standard Q-learning. [sent-121, score-0.035]
52 This comparison indicates that SQL accelerates the convergence of Q-learning, especially for γ close to 1 and small ǫ. [sent-122, score-0.035]
53 We then compare SQL with batch Q-value iteration (QI) in terms of sample and computational complexities, i. [sent-123, score-0.051]
54 1 A Comparison with the Convergence Rate of Standard Q-Learning There are not many studies in the literature concerning the convergence rate of incremental modelfree RL algorithms such as Q-learning. [sent-132, score-0.101]
55 [17] has provided the asymptotic convergence rate for Q-learning under the assumption that all the states have the same next state distribution. [sent-133, score-0.097]
56 This result shows that the asymptotic convergence rate of Q-learning has exponential dependency on 1 − γ, i.e. [sent-134, score-0.077]
57 the rate of convergence is of order Õ(1/t^(1−γ)) for γ ≥ 1/2. [sent-136, score-0.063]
58 at least 1 − δ after T = O( (β^4 Rmax^2 log(nβRmax/(δǫ)) / ǫ^2)^(1/ω) + (β log(βRmax/ǫ))^(1/(1−ω)) ) (4) steps. [sent-141, score-0.036]
59 When γ ≈ 1, one can argue that β = 1/(1 − γ) becomes the dominant term in the bound of Eq. [sent-142, score-0.038]
60 They show that to quantify ǫ-optimal action-value functions with high probability, we need O( (nβ^5/ǫ^2) log(1/ǫ) (log(nβ) + log log(1/ǫ)) ) and O( (nβ^4/ǫ^2) (log(nβ) + log log(1/ǫ)) ) samples in model-free and model-based QI, respectively. [sent-157, score-0.072]
61 A comparison between their results and the main result of this paper suggests that the sample complexity of SQL, which is of order O((nβ^4/ǫ^2) log n),3 is better than model-free QI in terms of β and log(1/ǫ). [sent-158, score-0.05]
62 Table 1: Comparison between SQL, Q-learning, model-based and model-free Q-value iteration in terms of sample complexity (SC), computational complexity (CC), and space complexity (SPC). [sent-162, score-0.089]
63 Model-based QI: SC Õ(nβ^4/ǫ^2), CC Õ(nβ^5/ǫ^2), SPC Õ(nβ^4/ǫ^2); Model-free QI: SC Õ(nβ^5/ǫ^2), CC Õ(nβ^5/ǫ^2), SPC Θ(n); SQL and Q-learning: SPC Θ(n). 4 Analysis In this section, we give some intuition about the convergence of SQL and provide the full proof of the finite-time analysis reported in Theorem 1. [sent-165, score-0.035]
64 , yk } drawn from the distribution P (·|x, a), for all state action (x, a) up to round k. [sent-170, score-0.1]
65 We define the operator D[Qk, Qk−1] as the expected value of the empirical operator Dk conditioned on Fk−1: D[Qk, Qk−1](x, a) = E(Dk[Qk, Qk−1](x, a) | Fk−1) = kTQk(x, a) − (k − 1)TQk−1(x, a). [sent-171, score-0.072]
66 Thus the update rule of SQL can be written as Qk+1(x, a) = (1 − αk)Qk(x, a) + αk (D[Qk, Qk−1](x, a) − ǫk(x, a)), (5). Footnote 3: Note that at each round of SQL n new samples are generated. [sent-172, score-0.054]
67 This, combined with the result of Corollary 1, yields a sample complexity of order O(nβ^4/ǫ^2 log(n/δ)). [sent-173, score-0.032]
68 where the estimation error ǫk is defined as the difference between the operator D[Qk, Qk−1] and its sample estimate Dk[Qk, Qk−1] for all (x, a) ∈ Z: ǫk(x, a) = D[Qk, Qk−1](x, a) − Dk[Qk, Qk−1](x, a). [sent-175, score-0.049]
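As a sanity check of Eq. (5), the short derivation below (spelled out here; it follows directly from the definitions above with αk = 1/(k + 1)) rewrites the Algorithm 1 update in the form (1 − αk)Qk + αk Dk[Qk, Qk−1], from which (5) follows since Dk = D − ǫk.

```latex
\begin{align*}
Q_{k+1} &= Q_k + \alpha_k\bigl(\mathcal{T}_k Q_{k-1} - Q_k\bigr)
              + (1-\alpha_k)\bigl(\mathcal{T}_k Q_k - \mathcal{T}_k Q_{k-1}\bigr) \\
        &= (1-\alpha_k) Q_k
           + \bigl(\alpha_k - (1-\alpha_k)\bigr)\mathcal{T}_k Q_{k-1}
           + (1-\alpha_k)\mathcal{T}_k Q_k \\
        &= (1-\alpha_k) Q_k
           + \alpha_k\bigl(k\,\mathcal{T}_k Q_k - (k-1)\,\mathcal{T}_k Q_{k-1}\bigr)
           \qquad\text{using } 1-\alpha_k = k\alpha_k \\
        &= (1-\alpha_k) Q_k + \alpha_k\, D_k[Q_k, Q_{k-1}]
         = (1-\alpha_k) Q_k + \alpha_k\bigl(D[Q_k, Q_{k-1}] - \epsilon_k\bigr).
\end{align*}
```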
69 , ǫk (x, a)} is a martingale difference sequence w. [sent-179, score-0.039]
70 Let us define the martingale Ek(x, a) to be the sum of the estimation errors: Ek(x, a) = Σ_{j=0}^{k} ǫj(x, a), ∀(x, a) ∈ Z. [sent-183, score-0.027]
71 (ii) Lemma 2 states the key property that the SQL iterate Qk+1 is very close to the Bellman operator T applied to the previous iterate Qk plus an estimation error term of order Ek /k. [sent-187, score-0.068]
72 (iii) By induction, Lemma 3 provides a performance bound Q∗ − Qk in terms of a discounted sum of the cumulative estimation errors {Ej }j=0:k−1 . [sent-188, score-0.082]
73 Finally (iv) we use a maximal Azuma’s inequality (see Lemma 4) to bound Ek and deduce the finite time performance for SQL. [sent-189, score-0.059]
74 Now for any k ≥ 0, let us assume that the bound Dk [Qk , Qk−1 ] ≤ Vmax holds. [sent-198, score-0.038]
75 Now the bound on ǫk follows from ||ǫk|| = ||E(Dk[Qk, Qk−1] | Fk−1) − Dk[Qk, Qk−1]|| ≤ 2Vmax, and the bound ||Qk|| ≤ Vmax is deduced by noticing that Qk = (1/k) Σ_{j=0}^{k−1} Dj[Qj, Qj−1]. [sent-200, score-0.076]
76 The next lemma shows that Qk is close to TQk−1, up to an O(1/k) term plus the average cumulative estimation error (1/k)Ek−1. [sent-201, score-0.049]
77 The result holds for k = 1, where (7) reduces to (5). [sent-206, score-0.024]
78 We now show that if the property (7) holds for k then it also holds for k + 1. [sent-207, score-0.048]
79 Thus (7) holds for k + 1, and is thus true for all k ≥ 1. [sent-210, score-0.024]
80 Now we bound the difference between Q∗ and Qk in terms of the discounted sum of cumulative estimation errors {E0, E1, ... [sent-211, score-0.082]
81 The result holds for k = 1 as: ||Q∗ − Q1|| = ||TQ∗ − T0 Q0|| = ||TQ∗ − TQ0 + ǫ0|| ≤ ||TQ∗ − TQ0|| + ||ǫ0|| ≤ 2γVmax + ||ǫ0|| ≤ 2γβVmax + ||E0||. We now show that if the bound holds for k, then it also holds for k + 1. [sent-219, score-0.11]
82 Thus (8) holds for k + 1, and thus for all k ≥ 1 by induction. [sent-222, score-0.024]
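For concreteness, the inductive bound (8) can be written in the following form, which is consistent with the k = 1 base case above; evaluating it at k = T and combining it with (9) (established next) and with Vmax = βRmax (the standard bound Rmax/(1 − γ) on value magnitudes) recovers the bound of Theorem 1.

```latex
\|Q^* - Q_k\| \le \frac{2\gamma\beta V_{\max}}{k}
  + \frac{1}{k}\sum_{j=1}^{k}\gamma^{\,k-j}\,\|E_{j-1}\|,
\qquad\text{so}\qquad
\|Q^* - Q_T\| \le \frac{2\gamma\beta V_{\max}}{T}
  + 2\beta V_{\max}\sqrt{\frac{2\log\frac{2n}{\delta}}{T}}
  = 2\beta^2 R_{\max}\!\left(\frac{\gamma}{T}+\sqrt{\frac{2\log\frac{2n}{\delta}}{T}}\right).
```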
83 Note that the difference between this bound and the result of Theorem 1 is just in the second term. [sent-226, score-0.038]
84 So, we only need to show that the following inequality holds with probability at least 1 − δ: (1/T) Σ_{k=1}^{T} γ^(T−k) ||Ek−1|| ≤ 2βVmax √(2 log(2n/δ)/T). (9) We first notice that (1/T) Σ_{k=1}^{T} γ^(T−k) ||Ek−1|| ≤ (1/T) (Σ_{k=1}^{T} γ^(T−k)) max_{1≤k≤T} ||Ek−1|| ≤ (β/T) max_{1≤k≤T} ||Ek−1||. (10) [sent-227, score-0.037]
85 Therefore, in order to prove (9) it is sufficient to bound max_{1≤k≤T} ||Ek−1|| = max_{(x,a)∈Z} max_{1≤k≤T} |Ek−1(x, a)| with high probability. [sent-229, score-0.052]
86 We start by providing a high probability bound for max1≤k≤T |Ek−1 (x, a)| for a given (x, a). [sent-230, score-0.038]
87 As mentioned earlier, the sequence of random variables {ǫ0(x, a), ǫ1(x, a), · · · , ǫk(x, a)} is a martingale difference sequence w. [sent-251, score-0.051]
88 P( max_{1≤k≤T} (−Ek−1(x, a)) > ǫ ) ≤ exp(−ǫ^2/(8T Vmax^2)). (12) By combining (12) with (11) we deduce that P( max_{1≤k≤T} |Ek−1(x, a)| > ǫ ) ≤ 2 exp(−ǫ^2/(8T Vmax^2)), and by a union bound over the state-action space, we deduce that P( max_{1≤k≤T} ||Ek−1|| > ǫ ) ≤ 2n exp(−ǫ^2/(8T Vmax^2)). (13) [sent-261, score-0.118]
89 This bound can be rewritten as: for any δ > 0, P( max_{1≤k≤T} ||Ek−1|| ≤ Vmax √(8T log(2n/δ)) ) ≥ 1 − δ, (14) which by using (10) proves (9) and Theorem 1. [sent-262, score-0.091]
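As a quick arithmetic check of how (13), (14), and (10) fit together: setting the right-hand side of (13) equal to δ and solving for ǫ gives (14), and plugging (14) into (10) yields exactly the right-hand side of (9).

```latex
2n\exp\!\left(-\frac{\epsilon^2}{8T V_{\max}^2}\right)=\delta
\;\Longleftrightarrow\;
\epsilon = V_{\max}\sqrt{8T\log\tfrac{2n}{\delta}},
\qquad
\frac{\beta}{T}\,V_{\max}\sqrt{8T\log\tfrac{2n}{\delta}}
 = 2\beta V_{\max}\sqrt{\frac{2\log\frac{2n}{\delta}}{T}}.
```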
90 5 Conclusions and Future Work In this paper, we introduced a new Q-learning algorithm, called speedy Q-learning (SQL). [sent-263, score-0.115]
91 We analyzed the finite-time behavior of this algorithm as well as its asymptotic convergence to the optimal action-value function. [sent-264, score-0.049]
92 Our result is in the form of a high-probability bound on the performance loss of SQL, which suggests that the algorithm converges to the optimal action-value function at a faster rate than standard Q-learning. [sent-265, score-0.066]
93 Overall, SQL is a simple, efficient and theoretically well-founded reinforcement learning algorithm, which improves on existing RL algorithms such as Q-learning and model-based value iteration. [sent-266, score-0.038]
94 Therefore, we did not compare our result to the PAC-MDP methods [15,18] and the upper-confidence bound based algorithms [3, 11], in which the choice of the exploration policy impacts the behavior of the learning algorithms. [sent-268, score-0.073]
95 the state of the art in PAC-MDPs, by combining the asynchronous version of SQL with a smart exploration strategy. [sent-272, score-0.091]
96 This is mainly due to the fact that the bound for SQL has been proved to be tighter than those of the RL algorithms that have been used for estimating the value function in PAC-MDP methods, especially in the model-free case. [sent-273, score-0.053]
97 Another possible direction for future work is to scale up SQL to large (possibly continuous) state and action spaces where function approximation is needed. [sent-275, score-0.036]
98 Reinforcement learning with a near optimal rate of convergence. [sent-292, score-0.028]
99 REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. [sent-298, score-0.038]
100 Model-based reinforcement learning with nearly tight exploration complexity bounds. [sent-384, score-0.07]
wordName wordTfidf (topN-words)
[('sql', 0.666), ('qk', 0.58), ('ek', 0.234), ('vmax', 0.205), ('speedy', 0.115), ('tk', 0.113), ('dk', 0.091), ('tqk', 0.08), ('ktqk', 0.069), ('tq', 0.066), ('rmax', 0.064), ('asynchronous', 0.058), ('synchronous', 0.052), ('yk', 0.051), ('fk', 0.048), ('bellman', 0.047), ('qi', 0.043), ('rl', 0.042), ('bound', 0.038), ('reinforcement', 0.038), ('nijmegen', 0.037), ('operator', 0.036), ('lemma', 0.036), ('convergence', 0.035), ('mqk', 0.034), ('pac', 0.032), ('discounted', 0.031), ('ltration', 0.028), ('rate', 0.028), ('martingale', 0.027), ('qt', 0.027), ('mdp', 0.027), ('mq', 0.026), ('ej', 0.025), ('holds', 0.024), ('munos', 0.024), ('szepesv', 0.023), ('ascq', 0.023), ('geert', 0.023), ('gheshlaghi', 0.023), ('grooteplein', 0.023), ('halley', 0.023), ('kmqk', 0.023), ('kqk', 0.023), ('modelfree', 0.023), ('qlearning', 0.023), ('spc', 0.023), ('villeneuve', 0.023), ('policy', 0.022), ('discount', 0.021), ('deduce', 0.021), ('update', 0.021), ('rule', 0.02), ('azuma', 0.02), ('azar', 0.02), ('plexity', 0.02), ('state', 0.02), ('batch', 0.019), ('max', 0.019), ('mdps', 0.019), ('complexity', 0.019), ('iteration', 0.019), ('radboud', 0.019), ('phased', 0.019), ('log', 0.018), ('inria', 0.018), ('com', 0.017), ('mohammad', 0.017), ('iterate', 0.016), ('rewritten', 0.016), ('action', 0.016), ('massachusetts', 0.016), ('slow', 0.016), ('tighter', 0.015), ('ghavamzadeh', 0.015), ('lille', 0.015), ('incremental', 0.015), ('sake', 0.015), ('netherlands', 0.015), ('avenue', 0.015), ('aggressive', 0.015), ('prove', 0.014), ('asymptotic', 0.014), ('ez', 0.014), ('cc', 0.014), ('athena', 0.014), ('supremum', 0.014), ('fashion', 0.013), ('exploration', 0.013), ('round', 0.013), ('cumulative', 0.013), ('immediate', 0.013), ('sequel', 0.013), ('optimality', 0.013), ('qj', 0.013), ('sample', 0.013), ('conservative', 0.012), ('optimized', 0.012), ('sequence', 0.012), ('double', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O log(n)/(ǫ2 (1 − γ)4 ) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both modelfree and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. 1
2 0.26549459 291 nips-2011-Transfer from Multiple MDPs
Author: Alessandro Lazaric, Marcello Restelli
Abstract: Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
3 0.2327199 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1
4 0.108407 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
5 0.084863789 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
6 0.077459559 49 nips-2011-Boosting with Maximum Adaptive Sampling
7 0.076837406 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.061808772 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
9 0.044069923 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
10 0.039754331 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
11 0.037470289 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
12 0.035860308 199 nips-2011-On fast approximate submodular minimization
13 0.03326454 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
14 0.032783434 260 nips-2011-Sparse Features for PCA-Like Linear Regression
15 0.03274275 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
16 0.032526717 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
17 0.03237382 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits
18 0.031504739 219 nips-2011-Predicting response time and error rates in visual search
19 0.029455433 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
20 0.028616859 72 nips-2011-Distributed Delayed Stochastic Optimization
topicId topicWeight
[(0, 0.088), (1, -0.087), (2, 0.013), (3, 0.032), (4, -0.095), (5, 0.056), (6, -0.023), (7, 0.006), (8, -0.087), (9, 0.001), (10, -0.066), (11, 0.022), (12, -0.031), (13, -0.0), (14, -0.073), (15, -0.114), (16, -0.024), (17, -0.057), (18, 0.084), (19, 0.047), (20, -0.2), (21, 0.145), (22, -0.028), (23, -0.037), (24, -0.227), (25, -0.11), (26, 0.111), (27, 0.031), (28, -0.135), (29, -0.008), (30, -0.214), (31, 0.185), (32, 0.024), (33, 0.117), (34, -0.026), (35, -0.096), (36, -0.058), (37, -0.158), (38, 0.051), (39, 0.015), (40, -0.045), (41, 0.107), (42, 0.081), (43, 0.023), (44, 0.006), (45, -0.081), (46, 0.148), (47, 0.02), (48, -0.009), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.95922923 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O log(n)/(ǫ2 (1 − γ)4 ) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both modelfree and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. 1
2 0.67867202 291 nips-2011-Transfer from Multiple MDPs
Author: Alessandro Lazaric, Marcello Restelli
Abstract: Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
3 0.6428979 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
Author: Amir-massoud Farahmand
Abstract: Many practitioners of reinforcement learning problems have observed that oftentimes the performance of the agent reaches very close to the optimal performance even though the estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of the action-gap regularity. As a typical result, we prove that for an ˆ agent following the greedy policy π with respect to an action-value function Q, the ˆ ˆ ˆ performance loss E V ∗ (X) − V π (X) is upper bounded by O( Q − Q∗ 1+ζ ), ∞ in which ζ ≥ 0 is the parameter quantifying the action-gap regularity. For ζ > 0, our results indicate smaller performance loss compared to what previous analyses had suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms. 1
4 0.4838945 49 nips-2011-Boosting with Maximum Adaptive Sampling
Author: Charles Dubout, Francois Fleuret
Abstract: Classical Boosting algorithms, such as AdaBoost, build a strong classifier without concern about the computational cost. Some applications, in particular in computer vision, may involve up to millions of training examples and features. In such contexts, the training time may become prohibitive. Several methods exist to accelerate training, typically either by sampling the features, or the examples, used to train the weak learners. Even if those methods can precisely quantify the speed improvement they deliver, they offer no guarantee of being more efficient than any other, given the same amount of time. This paper aims at shading some light on this problem, i.e. given a fixed amount of time, for a particular problem, which strategy is optimal in order to reduce the training loss the most. We apply this analysis to the design of new algorithms which estimate on the fly at every iteration the optimal trade-off between the number of samples and the number of features to look at in order to maximize the expected loss reduction. Experiments in object recognition with two standard computer vision data-sets show that the adaptive methods we propose outperform basic sampling and state-of-the-art bandit methods. 1
5 0.46737072 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
6 0.34059435 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
7 0.26749831 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
8 0.26578933 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
9 0.26568332 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
10 0.24283899 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies
11 0.23502341 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.21988855 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
13 0.21542238 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
14 0.20537426 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
15 0.19422689 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
16 0.19284505 208 nips-2011-Optimistic Optimization of a Deterministic Function without the Knowledge of its Smoothness
17 0.19193128 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
18 0.18577996 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
19 0.18351535 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
20 0.18165202 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
topicId topicWeight
[(0, 0.029), (4, 0.032), (11, 0.028), (20, 0.018), (26, 0.02), (31, 0.08), (33, 0.023), (43, 0.079), (45, 0.132), (57, 0.013), (74, 0.028), (76, 0.323), (83, 0.021), (86, 0.015), (99, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.75643045 268 nips-2011-Speedy Q-Learning
Author: Mohammad Ghavamzadeh, Hilbert J. Kappen, Mohammad G. Azar, Rémi Munos
Abstract: We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and the discount factor γ only T = O log(n)/(ǫ2 (1 − γ)4 ) steps are required for the SQL algorithm to converge to an ǫ-optimal action-value function with high probability. This bound has a better dependency on 1/ǫ and 1/(1 − γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both modelfree and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning. 1
2 0.64405727 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
Author: Adler J. Perotte, Frank Wood, Noemie Elhadad, Nicholas Bartlett
Abstract: We introduce hierarchically supervised latent Dirichlet allocation (HSLDA), a model for hierarchically and multiply labeled bag-of-word data. Examples of such data include web pages and their placement in directories, product descriptions and associated categories from product hierarchies, and free-text clinical records and their assigned diagnosis codes. Out-of-sample label prediction is the primary goal of this work, but improved lower-dimensional representations of the bagof-word data are also of interest. We demonstrate HSLDA on large-scale data from clinical document labeling and retail product categorization tasks. We show that leveraging the structure from hierarchical labels improves out-of-sample label prediction substantially when compared to models that do not. 1
3 0.61866784 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
Author: Jacquelyn A. Shelton, Abdul S. Sheikh, Pietro Berkes, Joerg Bornschein, Joerg Luecke
Abstract: An increasing number of experimental studies indicate that perception encodes a posterior probability distribution over possible causes of sensory stimuli, which is used to act close to optimally in the environment. One outstanding difficulty with this hypothesis is that the exact posterior will in general be too complex to be represented directly, and thus neurons will have to represent an approximation of this distribution. Two influential proposals of efficient posterior representation by neural populations are: 1) neural activity represents samples of the underlying distribution, or 2) they represent a parametric representation of a variational approximation of the posterior. We show that these approaches can be combined for an inference scheme that retains the advantages of both: it is able to represent multiple modes and arbitrary correlations, a feature of sampling methods, and it reduces the represented space to regions of high probability mass, a strength of variational approximations. Neurally, the combined method can be interpreted as a feed-forward preselection of the relevant state space, followed by a neural dynamics implementation of Markov Chain Monte Carlo (MCMC) to approximate the posterior over the relevant states. We demonstrate the effectiveness and efficiency of this approach on a sparse coding model. In numerical experiments on artificial data and image patches, we compare the performance of the algorithms to that of exact EM, variational state space selection alone, MCMC alone, and the combined select and sample approach. The select and sample approach integrates the advantages of the sampling and variational approximations, and forms a robust, neurally plausible, and very efficient model of processing and learning in cortical networks. For sparse coding we show applications easily exceeding a thousand observed and a thousand hidden dimensions. 1
4 0.48952204 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
Author: Nasser M. Nasrabadi, Trac D. Tran, Nam Nguyen
Abstract: This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e . Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis is relied on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrix X with i.i.d rows N (0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even the fraction of corruption is arbitrarily close to one. Our analysis also shows that this amount of observations required to achieve exact signed support is optimal. 1
5 0.48915538 4 nips-2011-A Convergence Analysis of Log-Linear Training
Author: Simon Wiesler, Hermann Ney
Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1
6 0.48874983 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
7 0.48796201 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
8 0.48779815 245 nips-2011-Selecting the State-Representation in Reinforcement Learning
9 0.48768711 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
10 0.48740083 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
11 0.48683438 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
12 0.4866209 49 nips-2011-Boosting with Maximum Adaptive Sampling
13 0.48637259 291 nips-2011-Transfer from Multiple MDPs
14 0.48627564 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
15 0.48603186 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
16 0.48559499 67 nips-2011-Data Skeletonization via Reeb Graphs
17 0.48524559 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample
18 0.48478907 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
19 0.4841789 73 nips-2011-Divide-and-Conquer Matrix Factorization
20 0.48408765 80 nips-2011-Efficient Online Learning via Randomized Rounding